I needed to replace the 512 GB SSD in my Lenovo T14s Gen 2, so I bought a 2 TB Seagate FireCuda 530 SSD.
Lenovo T14s Gen 2 not usable with FireCuda 530
Lenovo decided that the T14s Gen 2 only accepts single-sided SSDs. The FireCuda 530 is a double-sided SSD. Thus, the laptop is unusable with the FireCuda 530: I can’t put the back cover on with the drive sitting at 45°!
This is my fault for not researching the limits of Lenovo’s product design and only going by specs. Thankfully I bought the laptop second hand at a very good price; otherwise I would have been very annoyed with Lenovo for asking so much money while offering so little room to upgrade.
AsRock X470D4U still AsRockRocks
Returning the drive was an option, but it was not the seller’s fault.
In the other corner of the room, I saw my server. Built 4 years ago, dead quiet, with an AsRock X470D4U motherboard. It has two M.2 slots, one of which was still free. Of course the drive fits just fine there.
Why is num_err_log_entries increasing?
While checking that my monitoring system had picked up the new drive, I spotted a very alarming trend: num_err_log_entries was increasing.
I found a very helpful page here https://www.osso.nl/blog/kioxia-nvme-num-err-log-entries-0xc004-smartctl/ which describes something similar. Their monitoring system was causing num_err_log_entries to increase on the drive. But that was a long time ago, and the bug in smartctl was fixed a long time ago too.
collectd does my metrics collection. It has its own plugin to inspect the SMART data of the drives and does not rely on smartctl.
I stopped collectd and the FireCuda 530 did not report any additional errors. I started collectd with the smart plugin disabled, and the drive was still not reporting any additional errors. Hmmm.
All the errors were 0x2002 (Invalid Field in Command: A reserved coded value or an unsupported value in a defined field):
# nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:63
.................
Entry[ 0]
.................
error_count : 52
sqid : 0
cmdid : 0x9010
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x4
lba : 0
nsid : 0xffffffff
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 1]
.................
error_count : 51
sqid : 0
cmdid : 0xa014
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0x4
lba : 0
nsid : 0xffffffff
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
Entry[ 2]
Investigations
I checked out collectd version 5.12.0 and built it with the following commands:
export CFLAGS="-fPIC"
export LDFLAGS="-shared -pie -Wl,-z,now"
./configure --prefix=/usr --disable-static --disable-werror --enable-smart
make -j4
I had trouble replacing the smart.so plugin in /usr/lib64/collectd: it was crashing the whole collectd with weird stack traces (inside strcmp, for example). I figured it might be related to Fedora 37: the default collectd was compiled with PIE, and I wasn’t compiling mine with that at first.
I needed a way to find out what exactly in smart.c was triggering the increase in the number of errors, without relying on the main collectd process. Thus, I devised a small configuration file:
# cat smart_collectd.conf
PluginDir ".libs"
LoadPlugin logfile
<Plugin "logfile">
LogLevel debug
File STDOUT
</Plugin>
LoadPlugin smart
LoadPlugin csv
<Plugin "smart">
Disk "nvme0n1"
IgnoreSelected false
</Plugin>
<Plugin "csv">
DataDir "/tmp"
StoreRates true
</Plugin>
and ran the freshly built collectd with:
./collectd -C smart_collectd.conf -f
To make sure I was running the compiled smart.so, I configured PluginDir to point to the output files in .libs.
The debugging technique was a crude one: add a return before various function calls, one at a time, while keeping an eye on num_err_log_entries.
# nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 27°C (300 Kelvin)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 57002
data_units_written : 0
host_read_commands : 708714
host_write_commands : 0
controller_busy_time : 0
power_cycles : 1
power_on_hours : 131
unsafe_shutdowns : 1
media_errors : 0
num_err_log_entries : 133
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Culprit
Starting at the plugin’s main entry point and following the calls with strategically chosen returns, I identified the culprit as the ioctl in the get_vendor_id function:
static int get_vendor_id(const char *dev, char const *name) {
  int fd, err;
  __le16 vid;
  [...]
  err = ioctl(fd, NVME_IOCTL_ADMIN_CMD,
              &(struct nvme_admin_cmd){.opcode = NVME_ADMIN_IDENTIFY,
                                       .nsid = NVME_NSID_ALL,
                                       .addr = (unsigned long)&vid,
                                       .data_len = sizeof(vid),
                                       .cdw10 = 1,
                                       .cdw11 = 0});
  if (err < 0) {
    ERROR(PLUGIN_NAME ": ioctl for NVME_IOCTL_ADMIN_CMD failed with %s\n",
          strerror(errno));
    close(fd);
    return err;
  }
  [...]
}
Thus, with a return -1 just before the ioctl: no errors. With a return -1 just after the ioctl: a new error logged by the drive.
Reading the code more closely, it seems it wants to get the VID (Vendor ID) of the drive by using the NVMe Identify Controller command. I was hopeful I would get it working, since the command-line equivalent was returning correct results without triggering an error from the drive:
# nvme id-ctrl /dev/nvme0n1
NVME Identify Controller:
vid : 0x1bb1
ssvid : 0x1bb1
...
So where was the problem?!
I started reading about the arguments and checked how others were using this command (I mean, I would eventually have to RTM, so let’s try the easy way first).
Two important pieces:
- When smartmontools issues the same command, the NSID argument is sent as 0.
- The second code base I looked at was also sending 0x0 for NSID, with a very telling comment: for Identify Controller, NSID must be 0.
I haven’t cross-checked this against the specification. However, using 0 for NSID retrieves the proper VID and, most importantly, no longer increases the error count of the drive.
Wrapping up
I submitted a pull request to collectd on GitHub: https://github.com/collectd/collectd/pull/4128
If you bought a FireCuda 530 to replace the SSD in a laptop that is limited to single-sided drives, could not install it, put it in a server monitored with collectd instead, and then noticed the error log count increasing, you have arrived at the right place.
I am not sure how fast the PR will be accepted, so you may have to build smart.so yourself.
The drive is fine and reports the error correctly. It is not an error with the drive, but an error caused by the monitoring system not sending the right NSID when communicating with the drive.