So, these new Kioxia NVMe drives started incrementing their num_err_log_entries counter as soon as they were inserted into the machine. But the logged error said INVALID_FIELD. What gives?
In contrast to the other (mostly Intel) drives, these drives started incrementing num_err_log_entries as soon as they were plugged in:
    # nvme smart-log /dev/nvme21n1
    Smart Log for NVME device:nvme21n1 namespace-id:ffffffff
    ...
    num_err_log_entries : 932
The relevant errors should be readable in the error-log. All 64 errors in the log looked the same:
    error_count : 932
    sqid : 0
    cmdid : 0xc
    status_field : 0xc004(INVALID_FIELD)
    parm_err_loc : 0x4
    lba : 0xffffffffffffffff
    nsid : 0x1
    vs : 0
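That status_field packs several NVMe status bits into one value. Here is a quick decode by hand; the arithmetic is ours, following the layout in the NVMe base specification (bit 0 of the logged value is the phase tag, the rest is the status field proper):

```shell
# Decode status_field 0xc004 from the error-log entry.
# Per the NVMe spec, bit 0 of the logged value is the phase tag;
# shifting it off leaves the status field: status code (SC) in
# bits 0-7, status code type (SCT) in bits 8-10, More in bit 13,
# Do Not Retry (DNR) in bit 14.
sf=$((0xc004 >> 1))   # 0x6002
printf 'SC=0x%02x SCT=%d MORE=%d DNR=%d\n' \
    "$((sf & 0xff))" "$(((sf >> 8) & 0x7))" \
    "$(((sf >> 13) & 1))" "$(((sf >> 14) & 1))"
# prints: SC=0x02 SCT=0 MORE=1 DNR=1
# SC 0x02 in the generic status set is "Invalid Field in Command".
```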
INVALID_FIELD, what is this?
The error count kept increasing regularly — like clockwork actually. And the internet gave us no clues what this might be.
It turns out it was our monitoring. The Zabbix scripts we employ fetch drive health status values from various sources. One of the things they do is run smartctl -a on all drives, and every such call incremented the error count.
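Fetching that counter is plain text scraping. A minimal sketch of how a monitoring item might extract it (the awk one-liner is our illustration, not the actual Zabbix script; it is fed the captured smart-log output so it runs without a drive):

```shell
# Illustrative parser: pull num_err_log_entries out of smart-log output.
# The sample text below is captured output from the post; on a live
# system you would pipe in `nvme smart-log /dev/nvmeXnY` instead.
smart_log='Smart Log for NVME device:nvme21n1 namespace-id:ffffffff
num_err_log_entries : 932'
count=$(printf '%s\n' "$smart_log" |
    awk -F' *: *' '/^num_err_log_entries/ {print $2}')
echo "$count"   # prints 932
```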
    # nvme list
    Node           SN            Model                FW Rev
    -------------  ------------  -------------------  --------
    ...
    /dev/nvme20n1  PHLJ9110xxxx  INTEL SSDPE2KX010T8  VDV10131
    /dev/nvme21n1  X0U0A02Dxxxx  KCD6DLUL3T84         0102
    /dev/nvme22n1  X0U0A02Jxxxx  KCD6DLUL3T84         0102
If we run smartctl on the Intel drive, we get this:
    # smartctl -a /dev/nvme20n1
    ...
    Model Number: INTEL SSDPE2KX010T8
    ...
    === START OF SMART DATA SECTION ===
    Read NVMe SMART/Health Information failed: NVMe Status 0x4002
    # nvme smart-log /dev/nvme20n1 | grep ^num_err
    num_err_log_entries : 0
    # nvme error-log /dev/nvme20n1 | head -n12
    Error Log Entries for device:nvme20n1 entries:64
    .................
    Entry[ 0]
    .................
    error_count : 0
    sqid : 0
    cmdid : 0
    status_field : 0(SUCCESS)
    parm_err_loc : 0
    lba : 0
    nsid : 0
    vs : 0
But on the Kioxias, we get this:
    # smartctl -a /dev/nvme21n1
    ...
    Model Number: KCD6DLUL3T84
    ...
    === START OF SMART DATA SECTION ===
    Read NVMe SMART/Health Information failed: NVMe Status 0x6002
    # nvme smart-log /dev/nvme21n1 | grep ^num_err
    num_err_log_entries : 933
    # nvme error-log /dev/nvme21n1 | head -n12
    Error Log Entries for device:nvme21n1 entries:64
    .................
    Entry[ 0]
    .................
    error_count : 933
    sqid : 0
    cmdid : 0x6
    status_field : 0xc004(INVALID_FIELD)
    parm_err_loc : 0x4
    lba : 0xffffffffffffffff
    nsid : 0x1
    vs : 0
Apparently the Kioxia drive does not like what smartctl is sending.
Luckily, this turned out to be an issue that smartctl claims responsibility for, and it had already been fixed. The accompanying explanation reads:

    If this works, the problem is that this drive requires that the
    broadcast namespace is specified if SMART/Health and Error
    Information logs are requested. This issue was unspecified in
    early revisions of the NVMe standard.
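For illustration, the SMART/Health log page (log ID 0x02) can also be requested by hand with nvme-cli's get-log, passing the broadcast namespace ID 0xffffffff that these drives insist on. This is a sketch of the kind of request involved, not what smartctl literally executes; we only build and print the command here, since running it needs the actual drive:

```shell
# Hypothetical by-hand request of the SMART/Health log page (0x02)
# with the broadcast namespace ID 0xffffffff.
# The command is printed, not executed, since it needs real hardware.
dev=/dev/nvme21
cmd="nvme get-log $dev --log-id=0x02 --log-len=512 --namespace-id=0xffffffff"
echo "$cmd"
```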
In our case, applying this fix was easy on this Ubuntu/Bionic machine:
    # apt-cache policy smartmontools
    smartmontools:
      Installed: 6.5+svn4324-1ubuntu0.1
      Candidate: 6.5+svn4324-1ubuntu0.1
      Version table:
         7.0-0ubuntu1~ubuntu18.04.1 100
            100 http://MIRROR/ubuntu bionic-backports/main amd64 Packages
     *** 6.5+svn4324-1ubuntu0.1 500
            500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages
            100 /var/lib/dpkg/status
    # apt-get install smartmontools=7.0-0ubuntu1~ubuntu18.04.1
This smartmontools update from 6.5 to 7.0 not only got rid of the new errors, it also showed more relevant health output.
Now if we could just reset the error-log count on the drives, then this would be even better...