kioxia nvme / num_err_log_entries 0xc004 / smartctl

  • Written by Walter Doekes

So, these new Kioxia NVMe drives were incrementing the num_err_log_entries as soon as they were inserted into the machine. But the error said INVALID_FIELD. What gives?

In contrast to the other (mostly Intel) drives, these drives started incrementing the num_err_log_entries as soon as they were plugged in:

# nvme smart-log /dev/nvme21n1
Smart Log for NVME device:nvme21n1 namespace-id:ffffffff
...
num_err_log_entries                 : 932

The relevant errors should be readable in the error-log. All 64 errors in the log looked the same:

error_count  : 932
sqid         : 0
cmdid        : 0xc
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba          : 0xffffffffffffffff
nsid         : 0x1
vs           : 0

INVALID_FIELD, what is this?

The error count kept increasing regularly, like clockwork actually. And the internet gave us no clues as to what this might be.

It turned out to be our monitoring. The Zabbix scripts we employ fetch drive health status values from various sources, and one of the things they do is run smartctl -a on all drives. Every such call incremented the error count.
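A suspicion like this can be confirmed by reading the counter, running the suspected command once, and reading the counter again. The following is only a minimal sketch: it assumes nvme-cli and smartmontools are installed, that it runs as root, and that the counter is printed as a plain integer like in the output above. The function names are made up for illustration.

```shell
#!/bin/sh
# Extract the num_err_log_entries value from `nvme smart-log` output on stdin.
num_err_entries() {
    awk -F: '/^num_err_log_entries/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical helper: show how far the counter moves across one smartctl -a call.
smartctl_err_delta() {
    dev=$1
    before=$(nvme smart-log "$dev" | num_err_entries)
    # The exit status of smartctl is ignored on purpose; only the side effect matters.
    smartctl -a "$dev" >/dev/null 2>&1
    after=$(nvme smart-log "$dev" | num_err_entries)
    echo "$dev: $before -> $after (delta $((after - before)))"
}

# Usage (as root): smartctl_err_delta /dev/nvme21n1
```

A delta of 0 on one drive and 1 on another, as with the Intel and Kioxia drives here, points at the command rather than at failing hardware.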

# nvme list
Node           SN            Model                FW Rev
-------------  ------------  -------------------  --------
...
/dev/nvme20n1  PHLJ9110xxxx  INTEL SSDPE2KX010T8  VDV10131
/dev/nvme21n1  X0U0A02Dxxxx  KCD6DLUL3T84         0102
/dev/nvme22n1  X0U0A02Jxxxx  KCD6DLUL3T84         0102

If we run smartctl -a on the Intel drive, we get this:

# smartctl -a /dev/nvme20n1
...
Model Number:                       INTEL SSDPE2KX010T8
...

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002
# nvme smart-log /dev/nvme20n1 | grep ^num_err
num_err_log_entries                 : 0
# nvme error-log /dev/nvme20n1 | head -n12
Error Log Entries for device:nvme20n1 entries:64
.................
 Entry[ 0]
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0

But on the Kioxias, we get this:

# smartctl -a /dev/nvme21n1
...
Model Number:                       KCD6DLUL3T84
...

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x6002
# nvme smart-log /dev/nvme21n1 | grep ^num_err
num_err_log_entries                 : 933
# nvme error-log /dev/nvme21n1 | head -n12
Error Log Entries for device:nvme21n1 entries:64
.................
 Entry[ 0]
.................
error_count  : 933
sqid         : 0
cmdid        : 0x6
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba          : 0xffffffffffffffff
nsid         : 0x1
vs           : 0

Apparently the Kioxia drive does not like what smartctl is sending.
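The 0x6002 that smartctl reports and the 0xc004 in the error log appear to be the same status seen from two sides. As far as I understand the NVMe completion status layout, the error log value carries the phase tag in bit 0, so shifting it right by one bit gives the value smartctl prints. A small sketch of that decoding follows; the bit positions are stated as an assumption in the comments and the function name is made up.

```shell
#!/bin/sh
# Decode a status_field value as printed by `nvme error-log`.
# Assumed layout: bit 0 phase tag, bits 1-8 status code (SC),
# bits 9-11 status code type (SCT), bit 14 More, bit 15 Do Not Retry (DNR).
decode_status_field() {
    raw=$(( $1 ))
    printf 'status=0x%04x sc=0x%02x sct=%d more=%d dnr=%d\n' \
        $(( raw >> 1 )) \
        $(( (raw >> 1) & 0xff )) \
        $(( (raw >> 9) & 0x7 )) \
        $(( (raw >> 14) & 1 )) \
        $(( (raw >> 15) & 1 ))
}

decode_status_field 0xc004
# prints: status=0x6002 sc=0x02 sct=0 more=1 dnr=1
```

Status code 0x02 in the generic status type is "Invalid Field in Command". Under the same assumption, the Intel drive's 0x4002 is the same error without the More bit, the bit that signals additional detail in the Error Information log. That would fit the Intel drive rejecting the command without logging anything while the Kioxia adds a log entry every time.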

Luckily this turned out to be an issue that smartctl claims responsibility for. And it had already been fixed.

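The problem is that this drive requires the broadcast namespace (ffffffff) to be specified when the SMART/Health and Error Information logs are requested. Early revisions of the NVMe standard left this unspecified. That matches what we saw: nvme smart-log used namespace-id ffffffff without causing errors, while the failing commands were logged with nsid 0x1, and parm_err_loc 0x4 points at byte 4 of the command, which is where the namespace ID field starts.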

In our case, applying this fix was easy on this Ubuntu/Bionic machine:

# apt-cache policy smartmontools
smartmontools:
  Installed: 6.5+svn4324-1ubuntu0.1
  Candidate: 6.5+svn4324-1ubuntu0.1
  Version table:
     7.0-0ubuntu1~ubuntu18.04.1 100
        100 http://MIRROR/ubuntu bionic-backports/main amd64 Packages
 *** 6.5+svn4324-1ubuntu0.1 500
        500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
# apt-get install smartmontools=7.0-0ubuntu1~ubuntu18.04.1

This smartmontools update from 6.5 to 7.0 not only got rid of the new errors, it also showed more relevant health output.
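Since the same monitoring usually runs on more than one machine, it can help to flag hosts that still have an older smartctl. The sketch below assumes that the first line of `smartctl --version` starts with "smartctl" followed by the version number, and it treats 7.0 as the threshold only because that is the version that worked here. It makes no claim about which release first contained the fix. Both function names are hypothetical.

```shell
#!/bin/sh
# Read `smartctl --version` output on stdin and print the major version.
# Assumes the first line starts with "smartctl <major>.<minor> ...".
smartctl_major() {
    awk 'NR == 1 && $1 == "smartctl" { split($2, v, "."); print v[1] }'
}

# Hypothetical check: succeed when the major version read from stdin is at least 7.
smartctl_is_7_or_newer() {
    major=$(smartctl_major)
    [ -n "$major" ] && [ "$major" -ge 7 ]
}

# Usage:
#   smartctl --version | smartctl_is_7_or_newer || echo "smartctl older than 7.0"
```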

Now if we could just reset the error-log count on the drives, then this would be even better...

