Plots randomly disappearing while plotting

Hi,

I've had a strange problem for a few days now. When I start plotting with plotman, everything works fine at first, but at some random point plots start to disappear. Looking at the launch times of the plots (24m stagger) I noticed gaps between them.

Things I did / checked:

  • All system temperatures look fine (CPU, NVMe, PCIe…)
  • The logs of the affected plots just stop without any error; the process simply ended

System / Specs:

CPU: 5950X
RAM: 2x 32GB 3200MHz
MB: Asus Pro WS x570 ACE
PSU: 650W
PCIe: 1st Slot (x16 PCIe Gen4) → Asus Hyper M.2 Card (2x 2TB 980 Pro, 2x 1TB 980 Pro)
MD0 → RAID0 XFS 1x 980 Pro 1TB + 1x 980 Pro 2TB
MD1 → RAID0 XFS 1x 980 Pro 1TB + 1x 980 Pro 2TB
Destination: 2TB M.2 (Plots archived to Main_Node via Plotman rsync)

Last lines of plot logs:

    Bucket 58 uniform sort. Ram: 3.840GiB, u_sort min: 3.250GiB, qs min: 0.813GiB.
    Bucket 59 uniform sort. Ram: 3.840GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
    Bucket 60 uniform sort. Ram: 3.840GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
    Bucket 61 uniform sort. Ram: 3.840GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.

Computing Table 4 (P1)

    Bucket 67 uniform sort. Ram: 3.840GiB, u_sort min: 3.250GiB, qs min: 0.813GiB.
    Bucket 68 uniform sort. Ram: 3.840GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
    Bucket 69 uniform sort. Ram: 3.840GiB, u_sort min: 1.625GiB, qs min: 0.812GiB.
    Bucket 70 uniform sort. Ram: 3.840GiB, u_sort min: 3.250GiB, qs min: 0.813GiB.

Computing Table 7 (P1)

    Bucket 125 uniform sort. Ram: 3.840GiB, u_sort min: 1.375GiB, qs min: 0.687GiB.
    Bucket 126 uniform sort. Ram: 3.840GiB, u_sort min: 1.375GiB, qs min: 0.687GiB.
    Bucket 127 uniform sort. Ram: 3.840GiB, u_sort min: 1.375GiB, qs min: 0.687GiB.
    Total matches: 4288300635

Compressing tables 3 and 4 (P3)

    Bucket 67 uniform sort. Ram: 1.920GiB, u_sort min: 0.500GiB, qs min: 0.250GiB.
    Bucket 68 uniform sort. Ram: 1.920GiB, u_sort min: 1.250GiB, qs min: 0.315GiB.
    Bucket 68 uniform sort. Ram: 1.920GiB, u_sort min: 0.500GiB, qs min: 0.250GiB.

Has anyone experienced issues like this?


Are you on Linux? Are you running lots of parallel plots (near the limit of your RAM)? If so, run this:

journalctl | grep "Out of memory"

Processes killed by the OOM killer don't get much of a chance to log anything; it looks exactly like what you describe.


Yes, I'm on the newest version of Clear Linux with all updates…
I think RAM is fine; it's only about 8–10 plots in parallel.
With 4000 MiB per plot, usage rarely reaches 60% of RAM.

Running the command journalctl | grep "Out of memory" did not give me any output :frowning:

If you know the PIDs of the plots that ended, try grepping journalctl for them; if the kernel kills a process, it usually says why.
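A rough sketch of that search (the keyword list is an assumption about what to look for, and 46661 is just a placeholder; substitute the PID of a plot that vanished):

```shell
#!/bin/sh
# Sketch: filter journal output for kill/crash lines mentioning a PID.
# match_pid PID -- reads journal text on stdin, keeps OOM/segfault/core lines.
match_pid() {
    grep -iE 'oom|segfault|killed|dumped core' | grep -w "$1"
}

# 46661 is a placeholder; use the PID of a plot process that disappeared.
journalctl --no-pager 2>/dev/null | match_pid 46661 \
    || echo "no kill/crash entries found for that PID"
```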


I tried that with the PID of a plot that disappeared just now; sadly, nothing in journalctl… something is really odd…

Yes, that is odd. If you run out of tmp space for whatever reason, or dst space, the plot just hangs; if you OOM, you get a message about it (and the setup you described, 10 × 4 GiB on 64 GB, has more than enough headroom).

How do you run plotman?

  • plotman interactive in a shell
  • plotman plot in a shell
  • something else? e.g. systemd service, or under anything else that might try to kill plotman’s child processes when the plotman process ends (plotman usually doesn’t do this)

Yeah, the logs all look fine, no errors at all; it's just as if the process gets killed.
To find the stuck plots I have to look through the created logs for small files that are no longer growing; those are the stalled/stopped ones. Then I have to identify the plot IDs and delete their files on the temp drives… really annoying and time-consuming.
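That check can be automated. A sketch; the log directory location is an assumption (check the log setting in your plotman.yaml), and the demo at the end just exercises the function on a throwaway directory:

```shell
#!/bin/sh
# Sketch: list plot logs that haven't been written to for N minutes --
# a log that stopped growing usually means the plot process died or hung.
find_stalled() {  # usage: find_stalled LOGDIR MINUTES
    find "$1" -name '*.log' -mmin +"$2" -print
}

# Real use would be something like: find_stalled ~/.config/plotman/log 60
# Demo on a throwaway directory:
demo=$(mktemp -d)
touch -d '2 hours ago' "$demo/stuck.log"   # simulate a stalled plot's log
touch "$demo/active.log"                   # simulate a live plot's log
find_stalled "$demo" 30                    # prints only .../stuck.log
rm -rf "$demo"
```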

I run plotman just as you said, plotman interactive in a shell; besides that, only glances and dstat are running for monitoring. This machine is only plotting, not farming. I didn't have this issue on Ubuntu 21.04 a few days ago; it worked fine there. I switched to Clear Linux for the higher performance and it was going well, but now it isn't working properly anymore…

If you want to get to the bottom of it, I would suggest isolating the plotting from plotman to investigate: manually run some plots with a similar pattern using chia plots create, and see what return code the failing ones have (if indeed there are failing ones when running without plotman). And/or hack about in plotman to see if there is a way to log return codes.
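For the return-code part, a small wrapper along these lines might help (a sketch; the chia invocation in the comment is illustrative, so match the flags and paths to your plotman config). A segfault shows up as exit status 139 (128 + SIGSEGV) and an OOM kill as 137 (128 + SIGKILL):

```shell
#!/bin/sh
# Sketch: run a command, keep its output in a log, and record its exit status.
run_logged() {
    "$@" >> plot-manual.log 2>&1
    status=$?
    echo "'$1' exited with status $status" | tee -a plot-manual.log
}

# Real use would be something like (paths/flags are placeholders):
#   run_logged chia plots create -k 32 -b 4000 -t /mnt/md0/tmp -d /mnt/dst
run_logged true    # demo: prints "'true' exited with status 0"
```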

Running plotman plot also seems to log a bit more than plotman interactive, you can run plotman status in another shell to keep track of what’s happening (and plotman archive in another shell if you’re using that).

I also use plotman, on Ubuntu 20.04; it has been nothing but reliable for me so far…


Thanks for your help! I'm going to try the separate shells with separate commands; if that still fails, I'll try creating plots manually with chia and see how that goes.

Will reply back as soon as I have results.

Just now another one disappeared, and I immediately used your command to see if there was any error, and there is! This is what came out:

Jun 09 19:45:06 farmer systemd-coredump[50678]: Process 46661 (chia) of user 1000 dumped core.

Oooh ok, are you using the original plotting library or one of the newer ones?

Good question :smiley:
I have no idea; how can I check that?

You would know if you were using one of the newer ones.

It's not particularly easy to explain how to make use of a core dump file over a forum… you need to find the underlying executable that chia plots create invokes, then pass that plus the core dump to a program called gdb, and making sense of what gdb tells you often requires some understanding of the code where the error originated.
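That said, since the dumps are being handled by systemd-coredump (as your journal lines show), coredumpctl can do most of the legwork of locating the executable and dump for you. A sketch, assuming gdb is installed; inside gdb, the bt command prints the backtrace of the crashing thread:

```shell
coredumpctl list chia    # list recorded core dumps for the chia executable
coredumpctl info chia    # show metadata for the newest one
coredumpctl gdb chia     # open the newest chia dump in gdb, then type: bt
```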

If we assume that the chia plots program is relatively bug-free and that the problem originates in your setup, it could honestly be anything, but I'd suggest checking RAM first because it's easy to test and bad RAM can cause all sorts of errors. memtest86 is a good way to test; you might already have it as a boot option in GRUB, and if not, it's relatively easy to set up by following guides online.

I left it running overnight; the temp dirs are almost full because the old files aren't being deleted.
I tried the separate plotman commands, but they didn't make any difference. Maybe I'll reinstall the entire system? It could be because I resized the former Ubuntu partition to half its size in order to install Clear Linux alongside it. I'll do a memtest as well.

This is what the errors currently look like; nothing has really changed:

Jun 10 02:02:42 farmer systemd-coredump[81986]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.46661.1623260697000000.zst.
Jun 10 02:02:49 farmer systemd-coredump[81986]: Process 77900 (chia) of user 1000 dumped core.
Jun 10 03:29:26 farmer kernel: chia[87878]: segfault at 7f0edaee2000 ip 00007f0edebdd1e3 sp 00007f0edaed6af0 error 6 cpu 29 in chiapos.cpython-39-x86_64-linux-gnu.so[7f0edeb91000+b8000]
Jun 10 03:29:26 farmer systemd-coredump[88390]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.54521.1623266243000000.zst.
Jun 10 03:29:34 farmer systemd-coredump[88390]: Process 85172 (chia) of user 1000 dumped core.
Jun 10 05:00:48 farmer kernel: chia[95012]: segfault at 7f8c9139ef50 ip 00007f8df68ff2dc sp 00007f8df0f37a88 error 6 cpu 13 in libc-2.33.so[7f8df6867000+179000]
Jun 10 05:00:48 farmer systemd-coredump[95194]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.61886.1623271601000000.zst.
Jun 10 05:00:58 farmer systemd-coredump[95194]: Process 92376 (chia) of user 1000 dumped core.
Jun 10 05:42:07 farmer kernel: chia[96700]: segfault at 7f16c7ce4a23 ip 00007f16611a7591 sp 00007f165b801a70 error 4 cpu 13 in libc-2.33.so[7f1661131000+179000]
Jun 10 05:42:07 farmer systemd-coredump[98245]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.71055.1623276029000000.zst.
Jun 10 05:42:17 farmer systemd-coredump[98245]: Process 94225 (chia) of user 1000 dumped core.
Jun 10 07:55:22 farmer kernel: chia[106954]: segfault at 7f5631d1a000 ip 00007f563601b1e3 sp 00007f5631d0eaf0 error 6 cpu 13 in chiapos.cpython-39-x86_64-linux-gnu.so[7f5635fcf000+b8000]
Jun 10 07:55:22 farmer systemd-coredump[107911]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.77900.1623283362000000.zst.
Jun 10 07:55:32 farmer systemd-coredump[107911]: Process 101258 (chia) of user 1000 dumped core.
Jun 10 08:30:47 farmer kernel: chia[109701]: segfault at 7fb099e10000 ip 00007fb0a11011e3 sp 00007fb099e04af0 error 6 cpu 26 in chiapos.cpython-39-x86_64-linux-gnu.so[7fb0a10b5000+b8000]
Jun 10 08:30:47 farmer systemd-coredump[110787]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.85172.1623288566000000.zst.
Jun 10 08:30:47 farmer systemd-coredump[110787]: Removed old coredump core.chia.1000.6f42368ad14d4e219c18f13335fb027a.92376.1623294048000000.zst.
Jun 10 08:30:57 farmer systemd-coredump[110787]: Process 108455 (chia) of user 1000 dumped core.
Jun 10 08:59:50 farmer kernel: chia[112012]: segfault at 7f1bfc7a4a5b ip 00007f1b25a5e591 sp 00007f1b200b8a60 error 4 cpu 13 in libc-2.33.so[7f1b259e8000+179000]
Jun 10 08:59:59 farmer systemd-coredump[112949]: Process 106584 (chia) of user 1000 dumped core.

Yep, I definitely recommend a memtest, especially since you're planning to reinstall anyway; many Linux live CDs have memtest86 as one of the boot options before install. It can take quite a long time to run, but it will either confirm or rule out memory as the issue.

If it still happens on a fresh install with confirmed-good memory, you would probably be best off raising a ticket against the chiapos GitHub (Issues · Chia-Network/chiapos · GitHub). People there have a deep familiarity with the chiapos code (I don't) and should be able to make sense of the core dumps, and if it's a real bug they'll want to know.


Hi,
many thanks for your reply! I will definitely try the memtest. It's just strange, because all the parts are brand new… maybe it's a BIOS setting? I read that someone on this forum had similar issues and set the RAM frequency to 2666 MHz or something like that. Mine is strangely rated at 2400 MHz when I look in the BIOS; I had to enable DOCP (or something like that) so it runs at 3200 MHz. Maybe the RAM is just not right? It's HyperX, 2x 32GB 3200MHz DIMMs (at least that's what the package says).

Thanks again!

Memtest86 just passed all four phases without any error and showed up as "PASSED". Now I'll try a fresh download and install of the newest Clear Linux… will keep you updated.

OK, so: memtest passed, I installed Clear Linux fresh and set everything up with current versions of chia and plotman, and it's the same issue; after about 3–4 hours, plots start to randomly disappear. I went a step further and replaced my 2x 32GB RAM modules with 4x 16GB, and it's still the same issue… so with different RAM, a fresh OS install, and the newest versions of Chia and Plotman, plots are still crashing / getting killed with a core dump… any other suggestions?

Thanks in advance!