System crashing frequently midway through plotting

Ubuntu recently started popping up a “System program problem detected” window, which is usually the harbinger of a crash. The message does not say which program is affected or what the problem is. The kernel and crash logs are at the bottom of the post.

I checked the plotter logs too, but none of them report any errors before the last line written ahead of the crash. Could this be related to XMP/DOCP? DOCP is disabled in my UEFI, though. I have noticed that with automatic settings the RAM runs at slightly above 2000MHz rather than the rated 3600MHz. Could that be causing problems?

I was plotting 12 k32 plots in parallel, with 2 threads and 3390MiB per plot and a 30min stagger. I think it only got about 6 plots in before the crash, though. Average time for a single plot is slightly above 4 hours on my system.
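
As a rough sanity check on those settings, the sketch below adds up the configured buffer memory and thread count. It assumes the 3390MiB and 2 threads apply per plot, treats the 2x 32GB kit as 64 GiB, and uses the 5950X's 32 hardware threads; the names are made up for illustration, and real memory use by the plotter may differ from the configured buffer size.

```python
# Figures taken from the post; all names here are illustrative only.
PLOTS = 12
THREADS_PER_PLOT = 2
BUFFER_MIB_PER_PLOT = 3390
INSTALLED_RAM_GIB = 2 * 32   # assumption: treat the 2x 32GB kit as 64 GiB
CPU_THREADS = 32             # Ryzen 9 5950X: 16 cores, 32 threads


def budget(active_plots):
    """Return (buffer memory in GiB, plotter threads) for N active plots."""
    mem_gib = active_plots * BUFFER_MIB_PER_PLOT / 1024
    threads = active_plots * THREADS_PER_PLOT
    return mem_gib, threads


# 6 plots is roughly where the crash happened; 12 is the configured maximum.
for n in (6, PLOTS):
    mem_gib, threads = budget(n)
    print(f"{n:2d} plots: {mem_gib:5.1f} GiB of {INSTALLED_RAM_GIB} GiB in buffers, "
          f"{threads} of {CPU_THREADS} threads")
```

Even with all 12 plots active, the configured buffers add up to about 39.7 GiB and 24 threads, so on paper the job fits within the installed RAM and the CPU's thread count.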

OS : Ubuntu Budgie 20.04, packages up to date
CPU : AMD Ryzen 9 5950X
RAM : 2x 32GB Corsair Vengeance LPX DDR4-3600 CL18-22-22-42
MBD : Asus Prime B550-Plus
SSD : 2x 2TB Corsair MP600
HDD : 12TB WD My Book over USB
PSU : Corsair HX750
GPU : Nvidia GeForce GT240

Log in /var/crash:
 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
 page dumped because: non-NULL mapping
 Modules linked in: ses enclosure scsi_transport_sas xfs nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep edac_mce_amd nouveau snd_pcm snd_seq_midi kvm snd_seq_midi_event mxm_wmi snd_rawmidi ttm snd_seq drm_kms_helper snd_seq_device crct10dif_pclmul cec ghash_clmulni_intel snd_timer rc_core aesni_intel joydev i2c_algo_bit eeepc_wmi snd input_leds asus_wmi crypto_simd fb_sys_fops syscopyarea sparse_keymap cryptd sysfillrect sysimgblt video glue_helper wmi_bmof soundcore efi_pstore k10temp ccp mac_hid sch_fq_codel parport_pc ppdev lp parport drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 multipath linear raid0 hid_corsair hid_generic usbhid hid uas usb_storage crc32_pclmul r8169 nvme ahci i2c_piix4 realtek xhci_pci nvme_core libahci xhci_pci_renesas wmi gpio_amdpt gpio_generic
 CPU: 25 PID: 283 Comm: kswapd0 Not tainted 5.8.0-50-generic #56~20.04.1-Ubuntu
 Hardware name: ASUS System Product Name/PRIME B550-PLUS, BIOS 1401 12/03/2020
 Call Trace:
Package: linux-image-5.8.0-50-generic 5.8.0-50.56~20.04.1
SourcePackage: linux
Tags: kernel-oops
Uname: Linux 5.8.0-50-generic x86_64


Excerpt from kernel log:
    Apr 30 00:00:44 fred kernel: [ 6278.255580] Disabling lock debugging due to kernel taint
    Apr 30 00:52:19 fred kernel: [ 9373.240482] ------------[ cut here ]------------
    Apr 30 00:52:19 fred kernel: [ 9373.240484] nouveau 0000:04:00.0: timeout
    Apr 30 00:52:19 fred kernel: [ 9373.240530] WARNING: CPU: 12 PID: 10356 at drivers/gpu/drm/nouveau/nvkm/engine/gr/g84.c:168 g84_gr_tlb_flush+0x30b/0x320 [nouveau]
    Apr 30 00:52:19 fred kernel: [ 9373.240531] Modules linked in: ses enclosure scsi_transport_sas xfs nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep edac_mce_amd nouveau snd_pcm snd_seq_midi kvm snd_seq_midi_event mxm_wmi snd_rawmidi ttm snd_seq drm_kms_helper snd_seq_device crct10dif_pclmul cec ghash_clmulni_intel snd_timer rc_core aesni_intel joydev i2c_algo_bit eeepc_wmi snd input_leds asus_wmi crypto_simd fb_sys_fops syscopyarea sparse_keymap cryptd sysfillrect sysimgblt video glue_helper wmi_bmof soundcore efi_pstore k10temp ccp mac_hid sch_fq_codel parport_pc ppdev lp parport drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 multipath linear raid0 hid_corsair hid_generic usbhid hid uas usb_storage crc32_pclmul r8169 nvme ahci i2c_piix4 realtek xhci_pci nvme_core libahci xhci_pci_renesas wmi gpio_amdpt gpio_generic
    Apr 30 00:52:19 fred kernel: [ 9373.240558] CPU: 12 PID: 10356 Comm: kworker/12:0 Tainted: G    B             5.8.0-50-generic #56~20.04.1-Ubuntu
    Apr 30 00:52:19 fred kernel: [ 9373.240559] Hardware name: ASUS System Product Name/PRIME B550-PLUS, BIOS 1401 12/03/2020
    Apr 30 00:52:19 fred kernel: [ 9373.240582] Workqueue: events nouveau_cli_work [nouveau]
    Apr 30 00:52:19 fred kernel: [ 9373.240600] RIP: 0010:g84_gr_tlb_flush+0x30b/0x320 [nouveau]
    Apr 30 00:52:19 fred kernel: [ 9373.240601] Code: 8b 40 10 48 8b 78 10 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 87 e3 76 e7 4c 89 ea 48 c7 c7 7c 79 cc c0 48 89 c6 e8 4b bb b3 e7 <0f> 0b e9 49 ff ff ff e8 79 34 b9 e7 66 0f 1f 84 00 00 00 00 00 0f
    Apr 30 00:52:19 fred kernel: [ 9373.240602] RSP: 0018:ffffac2ac678f888 EFLAGS: 00010082
    Apr 30 00:52:19 fred kernel: [ 9373.240602] RAX: 0000000000000000 RBX: ffff9d515e7db400 RCX: 0000000000000027
    Apr 30 00:52:19 fred kernel: [ 9373.240603] RDX: 0000000000000027 RSI: 0000000000000082 RDI: ffff9d516eb18cd8
    Apr 30 00:52:19 fred kernel: [ 9373.240603] RBP: ffffac2ac678f978 R08: ffff9d516eb18cd0 R09: 0000000000000004
    Apr 30 00:52:19 fred kernel: [ 9373.240603] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
    Apr 30 00:52:19 fred kernel: [ 9373.240604] R13: ffff9d5166b63060 R14: ffff9d51624a39c0 R15: 0000000000000001
    Apr 30 00:52:19 fred kernel: [ 9373.240604] FS:  0000000000000000(0000) GS:ffff9d516eb00000(0000) knlGS:0000000000000000
    Apr 30 00:52:19 fred kernel: [ 9373.240604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 30 00:52:19 fred kernel: [ 9373.240605] CR2: 00007fc0998c1000 CR3: 0000000fdf898000 CR4: 0000000000740ee0
    Apr 30 00:52:19 fred kernel: [ 9373.240605] PKRU: 55555554
    Apr 30 00:52:19 fred kernel: [ 9373.240606] Call Trace:
    Apr 30 00:52:19 fred kernel: [ 9373.240628]  ? nv04_timer_read+0x47/0x60 [nouveau]
    Apr 30 00:52:19 fred kernel: [ 9373.240644]  ? nvkm_timer_wait_test+0x22/0x80 [nouveau]
    Apr 30 00:52:19 fred kernel: [ 9373.240657]  ? g84_bar_flush+0x8b/0xe0 [nouveau]
    ...
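
For anyone digging through a longer kernel log, here is a minimal sketch that pulls out only the lines mentioning warnings, timeouts, or taint. The regular expression assumes the syslog-style layout shown in the excerpt above, and the keyword list and function name are arbitrary choices for illustration, not part of any tool.

```python
import re

# Matches lines such as:
# "Apr 30 00:52:19 fred kernel: [ 9373.240484] nouveau 0000:04:00.0: timeout"
LINE_RE = re.compile(
    r"^\s*(?P<stamp>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\skernel:\s"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s(?P<msg>.*)$"
)
KEYWORDS = ("WARNING", "timeout", "taint", "Tainted")  # arbitrary selection


def interesting(lines):
    """Yield (uptime_seconds, message) for kernel lines containing a keyword."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and any(k in m["msg"] for k in KEYWORDS):
            yield float(m["uptime"]), m["msg"]


# A few lines copied from the excerpt above.
SAMPLE = [
    "Apr 30 00:00:44 fred kernel: [ 6278.255580] Disabling lock debugging due to kernel taint",
    "Apr 30 00:52:19 fred kernel: [ 9373.240482] ------------[ cut here ]------------",
    "Apr 30 00:52:19 fred kernel: [ 9373.240484] nouveau 0000:04:00.0: timeout",
    "Apr 30 00:52:19 fred kernel: [ 9373.240606] Call Trace:",
]

for uptime, msg in interesting(SAMPLE):
    print(f"{uptime / 60:6.1f} min uptime  {msg}")
```

Run against the excerpt, this shows that the kernel was already marked as tainted roughly 50 minutes before the nouveau timeout warning, so the timeout is not the first problem in the log.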

Update your BIOS: PRIME B550-PLUS | Motherboards | ASUS USA

I don’t know if that will actually fix it. I just noticed it’s out of date.
