So even after replacing bad memory and a bad processor, my system is still crashing. Looks like it’s the SSDs this time.
The problems started yesterday: I connected a new external drive over USB and all of my plotting jobs crashed, because somehow that broke the SSD mountpoint. Any attempt to access it just returned this:
OSError: [Errno 5] Input/output error: '/mnt/tmp/00'
It worked again after remounting and restarting the plots, but then the system went into a kernel panic a day later. I remembered that I had created the md array under a different Ubuntu Server version and thought that might be causing problems, so I stopped it and created a new one. When I tried to format the new array as XFS, though, I got a kernel panic again.
There are 63 entries of this cryptic message in the nvme error log for both 1.6TB Intel P4610s:
Entry[63]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
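A quick way to check whether any of those entries is an actual error, as a minimal sketch: as far as I understand the NVMe spec, an entry with error_count 0 is just an unused slot, so only non-zero counts matter. The function name count_real_errors is made up for illustration, and the sample text is a shortened copy of the entry above; the parsing assumes the log text keeps this "name : value" layout.

```python
import re


def count_real_errors(log_text: str) -> int:
    """Count error-log entries whose error_count is non-zero.

    Assumption: entries with error_count 0 are unused slots, not errors.
    """
    counts = re.findall(r"^\s*error_count\s*:\s*(\d+)", log_text, flags=re.MULTILINE)
    return sum(1 for c in counts if int(c) != 0)


# Shortened copy of the entry shown above (hypothetical test input).
sample = """Entry[63]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
.................
"""

print(count_real_errors(sample))
```

If every entry looks like the one above, this prints 0, which would match num_err_log_entries : 0 in the smart logs.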
And these are the smart logs for both:
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning : 0
temperature : 40 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
data_units_read : 596,988,276
data_units_written : 564,366,836
host_read_commands : 3,792,553,717
host_write_commands : 2,961,035,683
controller_busy_time : 2,725
power_cycles : 15
power_on_hours : 304
unsafe_shutdowns : 5
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning : 0
temperature : 39 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
data_units_read : 575,183,605
data_units_written : 542,139,408
host_read_commands : 3,735,260,606
host_write_commands : 2,597,280,823
controller_busy_time : 2,747
power_cycles : 12
power_on_hours : 281
unsafe_shutdowns : 5
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Other than the unsafe shutdowns, that looks fine to me, but what do I know. I don’t think they got too hot or anything; I’ve been obsessively monitoring the temperature. It stayed under 50 °C the whole time, and they’re rated for operation up to 55 °C.
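To put the data_units_written numbers into perspective, here is a rough conversion to terabytes. It relies on the NVMe definition of a data unit as 1000 blocks of 512 bytes (512,000 bytes per unit); the helper name data_units_to_tb is just something I made up for this sketch, and the values are copied from the smart logs above.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
DATA_UNIT_BYTES = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes (10**12 bytes)."""
    return units * DATA_UNIT_BYTES / 1e12


# data_units_written values taken from the smart logs above.
written = {"nvme0": 564_366_836, "nvme1": 542_139_408}

for dev, units in written.items():
    print(f"{dev}: {data_units_to_tb(units):.1f} TB written")
```

That works out to roughly 289 TB and 278 TB written, which lines up with percentage_used still sitting at 1%.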
I don’t think it’s bad firmware either; I updated it just a week ago.
Node          SN                  Model                Namespace  Usage              Format       FW Rev
------------  ------------------  -------------------  ---------  -----------------  -----------  --------
/dev/nvme0n1  PHLN026001YK1P6AGN  INTEL SSDPE2KE016T8  1          1.60 TB / 1.60 TB  4 KiB + 0 B  VDV10170
/dev/nvme1n1  PHLN035300CV1P6AGN  INTEL SSDPE2KE016T8  1          1.60 TB / 1.60 TB  4 KiB + 0 B  VDV10170
Specs:
OS : Ubuntu Server 21.04 / kernel 5.11.0
CPU : AMD Ryzen 9 5900X
RAM : 2x 32GB Mushkin Redline Ridgeback G2 3600
MBD : ASUS Prime B550-Plus
SSD : 2x 1.6TB Intel P4610
HDD : 12TB WD My Book over USB3.2
PSU : Corsair HX750