Stuck plot tasks in plotman

Sandbo · June 3, 2021, 10:05am

From time to time, a plotting task freezes or slows to a crawl. This has happened on both of my arrays (3x1TB NVMe and 5x400GB SATA), seemingly at random. For example:

Jobs (10): [1 . . : 2 : 3 . .: 4 ]
Prefixes: tmp=/mnt dst=/mnt/CSST1 (remote)
   plot id   k   tmp    dst  wall  phase  tmp   pid    stat  mem   user  sys   io
0  91de6c9d  32  CSCP2  .    0:31  1:2    86G   29270  SLP   4.0G  0:38  0:02  1s
1  6987fe4b  32  CSCP1  .    1:31  1:4    166G  26713  SLP   4.0G  2:06  0:06  0s
2  97980fce  32  CSCP1  .    2:31  1:6    178G  24025  SLP   4.0G  3:27  0:10  1s
3  0a0ee338  32  CSCP2  .    3:31  1:6    180G  21463  SLP   4.0G  3:48  0:12  18s
4  0acba555  32  CSCP1  .    4:32  2:4    239G  18764  RUN   1.5G  5:43  0:16  0:01
5  e23f3499  32  CSCP2  .    5:32  2:4    230G  16198  RUN   1.5G  5:42  0:17  0:09
6  7daaee34  32  CSCP1  .    6:32  3:5    162G  13466  RUN   3.9G  7:32  0:24  0:02
7  63e3f00c  32  CSCP2  .    7:32  3:4    185G  10764  RUN   3.9G  7:08  0:24  0:40
8  8759d015  32  CSCP1  .    8:32  3:5    149G  8158   RUN   3.9G  9:31  0:23  0:02
9  9bb8526f  32  CSCP2  .    9:33  3:1    226G  5565   SLP   3.9G  5:48  0:17  0:13

Here you can see that task #9, at 9:33 hours, is stuck in phase 3:1, while task #7, which started later and is at 7:32 hours, has already gone past it. Some tasks reach 4:0 and finish, but quite a few get stuck forever.
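The "a younger task has overtaken an older one" check can be automated. The following is only a rough sketch: it assumes the whitespace-separated column order of the status rows pasted above (index, plot id, k, tmp, dst, wall, phase, ...), and the function names are made up for illustration, not part of plotman. It only compares jobs that share the same tmp directory, since the two arrays may run at different speeds.

```python
# Hypothetical helper, not part of plotman: flag jobs that a younger job
# on the same tmp directory has already overtaken.
SAMPLE_ROWS = [
    "5 e23f3499 32 CSCP2 . 5:32 2:4 230G 16198 RUN 1.5G 5:42 0:17 0:09",
    "6 7daaee34 32 CSCP1 . 6:32 3:5 162G 13466 RUN 3.9G 7:32 0:24 0:02",
    "7 63e3f00c 32 CSCP2 . 7:32 3:4 185G 10764 RUN 3.9G 7:08 0:24 0:40",
    "8 8759d015 32 CSCP1 . 8:32 3:5 149G 8158 RUN 3.9G 9:31 0:23 0:02",
    "9 9bb8526f 32 CSCP2 . 9:33 3:1 226G 5565 SLP 3.9G 5:48 0:17 0:13",
]


def parse_hmm(text):
    """Convert an 'H:MM' wall-clock string into minutes."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def parse_row(line):
    """Pick the fields we need out of one status row (assumed layout)."""
    fields = line.split()
    return {
        "plot_id": fields[1],
        "tmp": fields[3],
        "wall_min": parse_hmm(fields[5]),
        "phase": tuple(int(part) for part in fields[6].split(":")),
    }


def find_overtaken(rows):
    """Return plot ids that are behind a younger job on the same tmp dir."""
    jobs = [parse_row(row) for row in rows if row.strip()]
    suspects = []
    for job in jobs:
        for other in jobs:
            if (other["tmp"] == job["tmp"]
                    and other["wall_min"] < job["wall_min"]
                    and other["phase"] > job["phase"]):
                suspects.append(job["plot_id"])
                break
    return suspects


print(find_overtaken(SAMPLE_ROWS))
```

This is a heuristic only: a job can be slow without being overtaken, so a clean result does not prove that nothing is stuck.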

I am running 12 tasks in parallel in total, with 16 physical cores and 54 GB of RAM. The tasks are split in half so that the 3x1TB NVMe and 5x400GB SATA arrays share the load, and both should have enough space for 6 tasks each. Each plot is assigned 2 threads and 3400 MiB. I think I have also seen that this happens less often if I run fewer parallel, staggered tasks, say only 6 in total.
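As a rough sanity check on those numbers, the arithmetic can be written out. This is only a sketch: it treats the 54 GB as GiB, counts only the configured per-plot buffer rather than the actual memory use of each process, and the function name is made up for illustration.

```python
MIB_PER_GIB = 1024


def plot_budget(jobs, threads_per_job, buffer_mib, cores, ram_gib):
    """Sum the configured threads and buffers across all parallel jobs."""
    total_threads = jobs * threads_per_job
    total_buffer_gib = jobs * buffer_mib / MIB_PER_GIB
    return {
        "total_threads": total_threads,
        "threads_per_core": total_threads / cores,
        "total_buffer_gib": round(total_buffer_gib, 1),
        "ram_headroom_gib": round(ram_gib - total_buffer_gib, 1),
    }


# Values taken from the setup described above.
budget = plot_budget(jobs=12, threads_per_job=2, buffer_mib=3400,
                     cores=16, ram_gib=54)
print(budget)
```

With these inputs the configured buffers add up to about 39.8 GiB, leaving roughly 14 GiB of headroom, and the jobs request 24 threads against 16 physical cores. The mem column in the status output shows up to 4.0G per job, so the real headroom may be smaller than this estimate.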

Any help will be appreciated.

Voodoo · June 3, 2021, 10:16am

I have exactly the same thing, but running Swar under Windows, and I’ve seen other people with the same issue.

It is random, but every so often a plot gets stuck in phase 3. Task Manager still shows CPU and RAM usage, but no disk writes. No error, nothing in the logs, it just stops.
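On the Linux side, the same symptom (process alive, no disk writes) can be checked per process through the write_bytes counter in /proc/<pid>/io. The following is a minimal sketch, assuming that file is readable for the plot process (this usually means running as the same user or as root); the function names and the 60 second interval are made up for illustration.

```python
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def read_write_bytes(pid):
    """Return the cumulative bytes this process has caused to be written."""
    with open(f"/proc/{pid}/io") as handle:
        return parse_proc_io(handle.read())["write_bytes"]


def writes_stalled(pid, interval_s=60):
    """True if write_bytes did not change over the sampling interval."""
    before = read_write_bytes(pid)
    time.sleep(interval_s)
    return read_write_bytes(pid) == before
```

For example, `writes_stalled(5565)` would sample the process behind plot 9bb8526f from the status output above. A single quiet interval is not proof of a hang, so it is worth sampling a few times before killing a job.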

RAM is more than enough, disk space is more than enough. At first I thought it was because I was assigning more threads than the CPU has in total, but this doesn’t seem to matter, because I’ve also had it while running within the thread limit like you are.

I also ran the Windows memory test and no issues showed up.

You are running on Linux and I am on Windows, so the OS doesn’t necessarily seem to be the issue either.
Are you also running an AMD CPU / X570 mainboard?

If anyone has identified this problem, I would love to hear it.

Sandbo · June 3, 2021, 11:44am

Thanks for the info. I am running it on Ubuntu 20.04.2, as a VM guest on Proxmox 6.
The system is an AMD Threadripper 1950X with 64 GB of non-ECC RAM; the chipset is X399.

I have seen people suggest it can be a memory issue. I could run a test later, but at the moment the server is remote and I can’t reboot it for a memtest. But if you have found no issues with RAM either, then it may be some other issue.

Voodoo · June 3, 2021, 11:50am

Yes, but next time I shut down the system I will run another memtest, because I don’t trust the Windows tools too much. I am also seeing some BSOD errors sometimes, and from what I gather from the debugger those are memory related. That’s why I ran memtest in the first place. Will let you know the results of a more thorough memtest over the weekend.

Sandbo · June 3, 2021, 1:00pm

Thanks, that is appreciated.

Otherwise, I have been using my current workstation for some pretty heavy numerical calculations and haven’t spotted any issues, and it has been running 24/7 for years, so I didn’t suspect the RAM in the first place. But I will do a test once I have a chance.

Voodoo · June 6, 2021, 12:59pm

Well, I didn’t get round to another memtest. What I did do is update the X570 chipset drivers last Thursday. So far, knock on wood, it seems to have solved both this problem and my persistent BSOD issues.
Haven’t had a plot stuck since.