From time to time, one of my plotting tasks freezes, or at best progresses extremely slowly. This happens on both of my arrays (3x1TB NVMe and 5x400GB SATA), seemingly at random. For example:
Jobs (10): [1 . . : 2 : 3 . .: 4 ]
Prefixes: tmp=/mnt dst=/mnt/CSST1 (remote)
   plot id   k   tmp    dst  wall  phase  tmp   pid    stat  mem   user  sys   io
0  91de6c9d  32  CSCP2  .    0:31  1:2    86G   29270  SLP   4.0G  0:38  0:02  1s
1  6987fe4b  32  CSCP1  .    1:31  1:4    166G  26713  SLP   4.0G  2:06  0:06  0s
2  97980fce  32  CSCP1  .    2:31  1:6    178G  24025  SLP   4.0G  3:27  0:10  1s
3  0a0ee338  32  CSCP2  .    3:31  1:6    180G  21463  SLP   4.0G  3:48  0:12  18s
4  0acba555  32  CSCP1  .    4:32  2:4    239G  18764  RUN   1.5G  5:43  0:16  0:01
5  e23f3499  32  CSCP2  .    5:32  2:4    230G  16198  RUN   1.5G  5:42  0:17  0:09
6  7daaee34  32  CSCP1  .    6:32  3:5    162G  13466  RUN   3.9G  7:32  0:24  0:02
7  63e3f00c  32  CSCP2  .    7:32  3:4    185G  10764  RUN   3.9G  7:08  0:24  0:40
8  8759d015  32  CSCP1  .    8:32  3:5    149G  8158   RUN   3.9G  9:31  0:23  0:02
9  9bb8526f  32  CSCP2  .    9:33  3:1    226G  5565   SLP   3.9G  5:48  0:17  0:13
where you can see that task #9, at 9:33 hours of wall time, is stuck in phase 3:1, while task #7, which started later and has only run for 7:32 hours, has already gone past it. Some tasks do reach phase 4:0 and finish, but quite a few never do and stay stuck indefinitely.
In total I am running 12 tasks in parallel, on 16 physical cores with 54 GB of RAM. The tasks are split in half so that the 3x1TB NVMe and the 5x400GB SATA arrays share the burden, and each array should have enough space for its 6 tasks. Each plot is assigned 2 threads and 3400 MiB of RAM, so all 12 together need roughly 40 GiB of buffers, which fits comfortably in the 54 GB, although the 24 plotter threads do oversubscribe the 16 physical cores. I have also noticed that this seems to happen less often when I run fewer parallel, staggered tasks, say only 6 in total.
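For reference, this is roughly how those settings map onto plotman's plotman.yaml (a sketch rather than my literal file: the key names come from plotman's sample config, the stagger value is only illustrative, and the tmp/dst paths are the mount points visible in the status output above):

directories:
  tmp:
    - /mnt/CSCP1        # the 3x1TB NVMe array
    - /mnt/CSCP2        # the 5x400GB SATA array
  dst:
    - /mnt/CSST1        # remote destination
scheduling:
  tmpdir_max_jobs: 6    # 6 tasks per tmp array
  global_max_jobs: 12   # 12 tasks in total
  global_stagger_m: 30  # illustrative; not my exact stagger
plotting:
  k: 32
  n_threads: 2          # 2 threads per plot
  job_buffer: 3400      # 3400 MiB per plot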
Any help would be appreciated.