Kernel panic journey continues, looks like it’s an md driver bug

zictes · May 11, 2021, 2:56pm

So even after replacing bad memory and a bad processor, my system is still crashing. Looks like it’s the SSDs this time.

The problems started yesterday when I was just connecting a new external drive over USB and all of my plotting jobs crashed because that somehow broke the SSD mountpoint. Any attempt to access it only returned this:

OSError: [Errno 5] Input/output error: '/mnt/tmp/00'

It worked again after I remounted and restarted plotting, but then the system went into a kernel panic a day later. I remembered that I had created the md array under a different Ubuntu Server version and thought that might be causing problems, so I stopped it and created a new one. When I tried to format it as XFS, though, I got a kernel panic again.

There are 63 entries of this cryptic message in the nvme error log for both 1.6TB Intel P4610s:

 Entry[63]
.................
error_count     : 0
sqid            : 0
cmdid           : 0
status_field    : 0(SUCCESS: The command completed successfully)
parm_err_loc    : 0
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

And these are the smart logs for both:

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 40 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 1%
endurance group critical warning summary: 0
data_units_read                         : 596,988,276
data_units_written                      : 564,366,836
host_read_commands                      : 3,792,553,717
host_write_commands                     : 2,961,035,683
controller_busy_time                    : 2,725
power_cycles                            : 15
power_on_hours                          : 304
unsafe_shutdowns                        : 5
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 39 C
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 1%
endurance group critical warning summary: 0
data_units_read                         : 575,183,605
data_units_written                      : 542,139,408
host_read_commands                      : 3,735,260,606
host_write_commands                     : 2,597,280,823
controller_busy_time                    : 2,747
power_cycles                            : 12
power_on_hours                          : 281
unsafe_shutdowns                        : 5
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

Other than the unsafe shutdowns, that looks fine to me, but what do I know. I don’t think they got too hot or anything; I’ve been obsessively monitoring the temperature. It stayed under 50 °C the whole time, and they’re rated up to 55 °C for operating temperature.
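For anyone who wants to check their own drives the same way, here is a minimal sketch that parses the plain-text smart log shown above and reports which health counters are non-zero. The function names and the choice of watched fields are my own assumptions, not part of nvme-cli; the sample is an excerpt of the nvme0 output from this post.

```python
import re

# Fields worth a second look when non-zero (an assumed, illustrative selection).
WATCH = ("critical_warning", "media_errors", "num_err_log_entries", "unsafe_shutdowns")

SAMPLE = """
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 40 C
unsafe_shutdowns                        : 5
media_errors                            : 0
num_err_log_entries                     : 0
"""

def parse_smart_log(text):
    """Turn 'key : value' lines into a dict, skipping the header line."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("Smart Log for"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def as_int(value):
    """Read the leading number of a value such as '40 C' or '596,988,276'."""
    m = re.match(r"[\d,]+", value)
    return int(m.group().replace(",", "")) if m else None

def nonzero_fields(text, watch=WATCH):
    """Return the watched counters that are present and non-zero."""
    fields = parse_smart_log(text)
    return {k: as_int(fields[k]) for k in watch if k in fields and as_int(fields[k])}

print(nonzero_fields(SAMPLE))
```

On the excerpt above, only the unsafe shutdown counter is reported, which matches what I read from the logs by eye.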

I don’t think it’s bad firmware either, I updated it just a week ago.

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     PHLN026001YK1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB      4 KiB +  0 B   VDV10170
/dev/nvme1n1     PHLN035300CV1P6AGN   INTEL SSDPE2KE016T8                      1           1.60  TB /   1.60  TB      4 KiB +  0 B   VDV10170

Specs:

OS  : Ubuntu Server 21.04 / kernel 5.11.0
CPU : AMD Ryzen 9 5900X
RAM : 2x 32GB Mushkin Redline Ridgeback G2 3600
MBD : ASUS Prime B550-Plus
SSD : 2x 1.6TB Intel P4610
HDD : 12TB WD My Book over USB3.2
PSU : Corsair HX750

zictes · May 18, 2021, 2:28pm

So after days and nights trying to find the cause of this nightmare, I have now painstakingly replaced every component in my system one by one except for the power supply. The memory passes memtest86 and mprime runs with no issues.

Still can’t plot for even a single day without a kernel panic.

I have finally at least managed to find another poor soul experiencing the exact same problem, who has also received no help.

I’m running RAID 0 instead of RAID 6, but apart from that every single detail he describes matches up with my experience. Just like me he was unable to dump the core and even figured out why that’s not possible. He also had the same idea that it could be the chipset and went from X570 to B550, while I went the other direction with equally little success.

It still baffles me that nobody else here is experiencing this issue, since a lot of people here have very similar specs and are running the same exact software and OS in the same exact configuration.

chiahodler · May 18, 2021, 7:42pm

I am also experiencing this issue on X570 + 5950X. After a few hours in plotman, the computer becomes unresponsive.
Can’t do anything about it. Tried kernel 5.11, no change.

anonymousjohndoe999 · May 24, 2021, 4:19am

I am also experiencing this issue.
Two Intel P4610 3.2TB drives in RAID 0 as /dev/md0.
Is it because of this raid 0?

anonymousjohndoe999 · May 24, 2021, 4:21am

Did you also put two NVMe drives in RAID 0?

zictes · May 24, 2021, 4:38am

Most likely yes. Check the serial console log after the crash, and see if the call trace says something about md_end_io:

[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? kfree+0xb8/0x220
[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? mempool_free+0x2f/0x80
[406005.583319]  ? md_end_io+0x4b/0x70
[406005.583319]  ? bio_endio+0xe6/0x150
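If the captured console log is long, a small script can do the searching. This is only a sketch under my own assumptions: the function name is made up, and the regular expression is written to match trace lines in the format shown above, so other kernel log formats may need adjusting.

```python
import re

# Matches call-trace lines such as:
#   [406005.583319]  ? md_end_io+0x4b/0x70
TRACE_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

SAMPLE_TRACE = """
[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? kfree+0xb8/0x220
[406005.583319]  ? mempool_kfree+0xe/0x10
[406005.583319]  ? mempool_free+0x2f/0x80
[406005.583319]  ? md_end_io+0x4b/0x70
[406005.583319]  ? bio_endio+0xe6/0x150
"""

def find_symbol(log_text, symbol="md_end_io"):
    """Return (timestamp, symbol) for each call-trace line naming `symbol`."""
    hits = []
    for line in log_text.splitlines():
        m = TRACE_LINE.match(line.strip())
        if m and m.group("sym") == symbol:
            hits.append((float(m.group("ts")), m.group("sym")))
    return hits

print(find_symbol(SAMPLE_TRACE))
```

To use it on a real capture, read the saved console log into a string and pass it to `find_symbol`. An empty result means no trace line in that log names the symbol.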

anonymousjohndoe999 · May 24, 2021, 4:42am

What’s your plan now? Stop using mdadm RAID 0?

zictes · May 24, 2021, 4:42am

Have you tried disabling C-states in your BIOS/UEFI and adding processor.max_cstate=5 rcu_nocbs=0-31 to the kernel args? The guy from the Stack Overflow thread said he had success with that on a 5950X system.
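On Ubuntu these arguments usually go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub and a reboot. To confirm they actually took effect, you can compare them against /proc/cmdline. The sketch below does that; the helper names are my own, and the range 0-31 matches the 32 threads of a 5950X, so a CPU with a different thread count would presumably need a different range.

```python
# Kernel arguments suggested in this thread.
WANTED = ["processor.max_cstate=5", "rcu_nocbs=0-31"]

def missing_kernel_args(cmdline, wanted=WANTED):
    """Return the wanted arguments that do not appear in the kernel command line."""
    present = set(cmdline.split())
    return [arg for arg in wanted if arg not in present]

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()

# Example with a made-up command line that sets only one of the two arguments.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro processor.max_cstate=5"
print(missing_kernel_args(example))
```

On a live system, `missing_kernel_args(read_cmdline())` returns an empty list once both arguments are active.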

anonymousjohndoe999 · May 24, 2021, 4:44am

I’m using a Dell R730 server and, if I remember correctly, I don’t see any C-state settings in the BIOS. I’ll check later. Have you tried these?

zictes · May 24, 2021, 4:45am

Yes, I’ve been running stable for 5 days since abandoning RAID 0. It adds about 20-30 minutes per plot on my system, but that’s much more bearable than crashing once a day.

I have not tried disabling C-states yet. I will do that and report back once I have to take the system down for maintenance again.

anonymousjohndoe999 · May 24, 2021, 4:47am

Thank you, sir. Keep in touch.

anonymousjohndoe999 · May 24, 2021, 4:48am

One more thought: might it be a problem with the U.2 to PCIe adapter? I’m using two adapters from two different brands.

zictes · May 24, 2021, 4:51am

I thought about that too, might have something to do with the adapter. I’m using the same one for both SSDs though, so the different brands are not the issue.

I’m using this cable and adapter combo for both of them:

anonymousjohndoe999 · May 24, 2021, 4:54am

Then I guess it’s not a PCIe slot problem either. I thought it might be the slot and considered switching slots. Now I can skip that.
Saw your comment on the Chia Decentral YouTube video.

anonymousjohndoe999 · May 24, 2021, 4:57am

Could it also be a problem with the Intel P4610, I guess?

zictes · May 24, 2021, 5:00am

Yeah, I don’t think it has anything to do with the PCIe connection. I just remembered that the Stack Overflow guy was having the issue on a bunch of standard 4TB WD Red SATA HDDs.

That’s also why I’m still skeptical about the P4610s being the problem, but that thought has also crossed my mind. Haven’t heard from anyone with different SSDs having periodic crashes.

brianw · June 4, 2021, 5:15am

I’m now running into this exact problem too. I’m using 5 PNY XLR8s. I was successfully plotting on Debian Sid with an mdadm RAID 0 with XFS. On Ubuntu 21.04 or 21.10, it kernel panics after a while. This user on GitHub also seems to be running into the same issue on a variety of Ubuntu versions.

github.com/Chia-Network/chia-blockchain

[BUG] Persistent Hard Crashing with XFS Temp Drive in MDADM RAID0

opened 10:22AM - 02 Jun 21 UTC

closed 10:32PM - 21 Jul 21 UTC

andrewseid

bug

**Describe the bug**
XFS is known to be the fastest format for plotting, and testing proves this out. However, it also seems to result in reliable hard crashes on Ubuntu, usually within the first day of beginning plotting. I have experienced this issue about 15 times, on a mix of Ubuntu GUI 20.04, Ubuntu GUI 21.04, and Ubuntu Server 21.04. I've experienced it on three different systems, two AMD builds (3960X and 3990X), and one Intel build (i7-11700K). All systems have been using between two and four Samsung 980 Pro NVMe drives in MDADM RAID0. The issue seems to go away when I format the temp drive RAID0 array with ext4.

**To Reproduce**
1. Create an XFS MDADM RAID0 array on Ubuntu 20.04 or 21.04 (GUI or server, doesn't matter), using 2-4 NVMe drives (in my case, Samsung 980 Pro 2TB, running on PCIe Gen 4).
2. Start 10+ plotting queues with -n 5 -r, depending on system specs.
3. Let system plot for 24-48 hours.

**Expected behavior**
Observe eventual hard crash.

**Screenshots**
On Ubuntu GUI, the desktop just completely freezes wherever it is. On Ubuntu Server, I got this:
![IMG_8562](https://user-images.githubusercontent.com/1294249/120462761-13af7d00-c350-11eb-8a04-657a1097f583.jpeg)

**Desktop:**
- OS: Ubuntu GUI 20.04, Ubuntu GUI 21.04, Ubuntu Server 21.04
- CPU: AMD 3990X, AMD 3960X, Intel i7-11700K
- NVMe: 2-4 Samsung 980 Pro 2TB in MDADM RAID0

**Additional context**
Random theory that you can feel free to ignore: since this is an extremely high performance setup, maybe it's hitting some kind of performance threshold or race condition during plotting? Or maybe it's *something else entirely* XD Thank you!