Gigahorse Compressed GPU Plot Client Released! Start Plotting Now!

A question… at levels 0-4 the chart shows the same farmable capacity for a given GPU, e.g. 15.06 for a 3060. Hypothetically, if I have 30.12 of capacity, how many plots could I farm?

And why wouldn’t I be able to farm more at level 0 than at level 4? Could I still farm even more uncompressed plots with that single GPU, or do I need additional GPU compute for anything above 15.06, either with multiple GPUs or a more powerful GPU? Just curious… if this has been asked anywhere already, please forgive me and point me there. I have read so much in the last few weeks my head might explode, but I don’t think I’ve heard anyone ask or answer this. Thanks for humoring me if anyone replies.

That’s crazy… there’s nothing in the code there that would make it wait on the copy…

I’m gonna have to blame Windows on that.

There is no level 0. The lower levels are inefficient on GPU, that’s why it stops getting better below C3 / C4. Basically the GPU is just idle, and all you’re doing is giving it tiny work items. This is where CPUs are better. For the same reason you cannot get 100000 FPS with a 4090 on an old 90s game.

I was measuring the first plot. Just tried the second plot on the 2080 Ti and it took 8 minutes. Here’s the output:

Crafting plot 2 out of 2 (2023/02/13 08:38:18)
Process ID: 18361
Pool Public Key:
Farmer Public Key:
Working Directory: /home/tigana/p5510/
Working Directory 2: @RAM
Compression Level: C7 (xbits = 9, final table = 4)
Plot Name: plot-k32-c7-2023-02-13-08-38-9440388ca68c58187b3260e46943a4a33b6c17446155ebff3a62fcd9fb36c1e0
[P1] Setup took 0.581 sec
Flushing to disk took 6.659 sec
[P1] Table 1 took 11.107 sec, 4294967296 entries, 16787502 max, 66627 tmp, 0 GB/s up, 3.06115 GB/s down
[P1] Table 2 took 18.679 sec, 4294832394 entries, 16787426 max, 66732 tmp, 1.71315 GB/s up, 2.73035 GB/s down
[P1] Table 3 took 30.886 sec, 4294627931 entries, 16784987 max, 66952 tmp, 1.55405 GB/s up, 2.75206 GB/s down
[P1] Table 4 took 45.322 sec, 4294178387 entries, 16784170 max, 66719 tmp, 1.76501 GB/s up, 2.71944 GB/s down
[P1] Table 5 took 38.194 sec, 4293236723 entries, 16780917 max, 66584 tmp, 2.09419 GB/s up, 2.67058 GB/s down
[P1] Table 6 took 31.735 sec, 4291495532 entries, 16773692 max, 66596 tmp, 2.01589 GB/s up, 2.67844 GB/s down
[P1] Table 7 took 21.015 sec, 4287864140 entries, 16759319 max, 66501 tmp, 2.28224 GB/s up, 2.22461 GB/s down
Phase 1 took 197.709 sec
[P2] Setup took 0.132 sec
[P2] Table 7 took 10.209 sec, 3.12931 GB/s up, 0.0520374 GB/s down
[P2] Table 6 took 10.25 sec, 3.11943 GB/s up, 0.0518293 GB/s down
[P2] Table 5 took 10.255 sec, 3.11917 GB/s up, 0.051804 GB/s down
Phase 2 took 31.106 sec
[P3] Setup took 0.301 sec
[P3] Table 4 LPSK took 18.872 sec, 3464844649 entries, 14620146 max, 57843 tmp, 1.93539 GB/s up, 2.70243 GB/s down
[P3] Table 4 NSK took 21.875 sec, 3464844649 entries, 13547196 max, 57843 tmp, 1.77018 GB/s up, 2.72161 GB/s down
[P3] Table 5 PDSK took 17.245 sec, 3530709049 entries, 13811960 max, 54948 tmp, 1.88567 GB/s up, 2.71094 GB/s down
[P3] Table 5 LPSK took 24.529 sec, 3530709049 entries, 14247287 max, 57208 tmp, 2.52703 GB/s up, 2.07918 GB/s down
[P3] Table 5 NSK took 21.888 sec, 3530709049 entries, 13804181 max, 56770 tmp, 1.80276 GB/s up, 2.71999 GB/s down
[P3] Table 6 PDSK took 17.244 sec, 3709711144 entries, 14513182 max, 57692 tmp, 1.88503 GB/s up, 2.7111 GB/s down
[P3] Table 6 LPSK took 25.512 sec, 3709711144 entries, 15087758 max, 60390 tmp, 2.52078 GB/s up, 1.99907 GB/s down
[P3] Table 6 NSK took 21.992 sec, 3709711144 entries, 14505850 max, 59910 tmp, 1.8852 GB/s up, 2.70713 GB/s down
[P3] Table 7 PDSK took 17.829 sec, 4287864140 entries, 16778419 max, 66501 tmp, 2.46381 GB/s up, 2.62215 GB/s down
[P3] Table 7 LPSK took 28.325 sec, 4287864140 entries, 17200424 max, 68895 tmp, 2.52663 GB/s up, 1.80054 GB/s down
[P3] Table 7 NSK took 22.362 sec, 4287864140 entries, 16759319 max, 68272 tmp, 2.14295 GB/s up, 2.66234 GB/s down
Phase 3 took 238.125 sec
[P4] Setup took 0.079 sec
[P4] total_p7_parks = 2093684
[P4] total_c3_parks = 428786, 2385 / 2453 ANS bytes
Phase 4 took 14.533 sec, 2.19824 GB/s up, 1.25806 GB/s down
Total plot creation time was 481.517 sec (8.02529 min)

It was on Clear Linux, and -t was an 8 TB P5510. I ran it on the first CPU using numactl -N 0-7 -m 0-7. Also tried it on the second CPU, but it was a bit slower.

edit: I just remembered that I set the first PCIe slot to x4x4x4x4 some time ago. I assume this is the reason. Will try x16 later today. BTW, what plot time can I expect on this machine with a 4090? Will its performance be limited?
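
For reference, the full invocation with that NUMA pinning looked roughly like this (the plotter flags are only illustrative, borrowed from the command quoted later in the thread, and the final destination path is a placeholder):

# pin the plotter to the first CPU's cores and memory (NUMA nodes 0-7 on this dual-socket board)
# add your own -f / -c (or -p) keys as usual
numactl -N 0-7 -m 0-7 ./cuda_plot_k32 -C 7 -n 2 -t /home/tigana/p5510/ -d /path/to/final/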

Thanks, that helped me a lot. It’s running now:

./cuda_plot_k32 -C 7 -n 3 -t ~/Downloads/Temp/ -d ~/Downloads/Test/ -f -c -r 2 -M 512
Chia k32 next-gen CUDA plotter - e161e4b
Plot Format: mmx-v2.4
Network Port: 8444 [chia]
No. GPUs: 2
No. Streams: 4
Final Destination: /home/ed/Downloads/Test/
Shared Memory limit: 484.5 GiB
Number of Plots: 3
GPU[0] cudaDevAttrConcurrentManagedAccess = 0
GPU[1] cudaDevAttrConcurrentManagedAccess = 0
Initialization took 0.293 sec
Crafting plot 1 out of 3 (2023/02/13 16:06:51)
Process ID: 21787
Farmer Public Key:
Working Directory: /home/ed/Downloads/Temp/
Working Directory 2: @RAM
Compression Level: C7 (xbits = 9, final table = 4)
Plot Name: plot-k32-c7-2023-02-13-16-06-f79f3fac970547b32fcd21ac00689549f90dbb806505a9db501d97a3e6931949
[P1] Setup took 1.863 sec
[P1] Table 1 took 1.624 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 20.9361 GB/s down
[P1] Table 2 took 2.399 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.259 GB/s down
[P1] Table 3 took 3.999 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.2554 GB/s down
[P1] Table 4 took 5.786 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.3015 GB/s down
[P1] Table 5 took 4.782 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.33 GB/s down
[P1] Table 6 took 4.007 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.2129 GB/s down
[P1] Table 7 took 2.204 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.2115 GB/s down
Phase 1 took 28.05 sec
[P2] Setup took 1.046 sec
[P2] Table 7 took 0.108 sec, 0 GB/s up, 4.91898 GB/s down
[P2] Table 6 took 0.425 sec, 0 GB/s up, 1.25 GB/s down
[P2] Table 5 took 0.431 sec, 0 GB/s up, 1.2326 GB/s down
Phase 2 took 2.516 sec
[P3] Setup took 1.559 sec
[P3] Table 4 LPSK took 2.461 sec, 0 entries, 0 max, 0 tmp, 0.215868 GB/s up, 20.7234 GB/s down
[P3] Table 4 NSK took 2.875 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 20.7079 GB/s down
[P3] Table 5 PDSK took 2.276 sec, 0 entries, 0 max, 0 tmp, 0.233414 GB/s up, 20.5405 GB/s down
[P3] Table 5 LPSK took 2.391 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.3301 GB/s down
[P3] Table 5 NSK took 2.82 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.1118 GB/s down
[P3] Table 6 PDSK took 2.279 sec, 0 entries, 0 max, 0 tmp, 0.233107 GB/s up, 20.5135 GB/s down
[P3] Table 6 LPSK took 2.407 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.1883 GB/s down
[P3] Table 6 NSK took 2.811 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.1794 GB/s down
[P3] Table 7 PDSK took 2.197 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.2791 GB/s down
[P3] Table 7 LPSK took 2.392 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.3212 GB/s down
[P3] Table 7 NSK took 2.807 sec, 0 entries, 0 max, 0 tmp, 0 GB/s up, 21.2095 GB/s down
Phase 3 took 30.333 sec
[P4] Setup took 0.8 sec
[P4] total_p7_parks = 0
Floating point exception (core dumped)

Does anyone know what this error is?

PS: I removed the -f and -c values before sharing.

I have added the CLI command to my post, but I don’t think it is really necessary as I don’t differ from the default values.

Regarding the performance degradation that increases with time, I have observed the same effect. After 8 hours I have seen a 50% degradation in plot performance (from 4 min to 6 min, even with the new version of cuda_plot_k32).
About 75% of the degradation can be fixed by a simple SSD trim (the plot process does not have to be stopped for this).
After the trim the plot times drop immediately to 4.5 min, which means there is still a performance loss of 25% that probably has other causes.
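
For reference, the manual trim is nothing more than running fstrim against the mount point of the plot temp SSD (the path below is just a placeholder), and it can be run while the plotter keeps going:

# discard all unused blocks on the filesystem holding the plot temp directory
sudo fstrim -v /mnt/plot-temp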

Yes exactly, whenever your P2 upload is that low, it’s PCIe lanes for sure (in the case of full RAM mode).

How fast the 4090 is depends on how many channels of RAM you have and on the NUMA config. Also, are you sure it’s PCIe 4.0?

Your GPU is not supported. What model is it?

Just saw the GPU[0] cudaDevAttrConcurrentManagedAccess = 0. I’ve never tested this case; does it work with a single GPU?

Yeah, I’ve always suspected this: on-demand trimming during writes (which the SSD does automatically) is slower than manual batch trimming.

Yeah, I’m sure it’s PCIe 4.0. It’s a dual Epyc Milan 128-core system, 16 channels of 512G 2933 RAM, 16 NUMA nodes. Is it enough for a 4090?

Just tried the x16 slot with the 2080 Ti and it took about 4 minutes. Is that good now? I was expecting 150 seconds, as I thought the 2080 Ti had the same hashrate as a 3060 Ti.

Crafting plot 2 out of 2 (2023/02/13 21:32:00)
Process ID: 3267
Pool Public Key:
Farmer Public Key:
Working Directory: /home/tigana/p5510/
Working Directory 2: @RAM
Compression Level: C7 (xbits = 9, final table = 4)
Plot Name: plot-k32-c7-2023-02-13-21-32-fd7a57110b20960106826fa2aff1c7f8024fd6d42512ad2c5d3d54541911bd28
[P1] Setup took 0.537 sec
[P1] Table 1 took 5.554 sec, 4294967296 entries, 16787717 max, 66616 tmp, 0 GB/s up, 6.12176 GB/s down
Flushing to disk took 6.472 sec
[P1] Table 2 took 9.303 sec, 4294839752 entries, 16787831 max, 66653 tmp, 3.43975 GB/s up, 5.48213 GB/s down
[P1] Table 3 took 15.37 sec, 4294580475 entries, 16787081 max, 66660 tmp, 3.12287 GB/s up, 5.53027 GB/s down
[P1] Table 4 took 22.542 sec, 4294146503 entries, 16785916 max, 66479 tmp, 3.54861 GB/s up, 5.46758 GB/s down
[P1] Table 5 took 18.998 sec, 4293224718 entries, 16781020 max, 66642 tmp, 4.21016 GB/s up, 5.369 GB/s down
[P1] Table 6 took 15.772 sec, 4291315534 entries, 16774852 max, 66567 tmp, 4.05618 GB/s up, 5.38931 GB/s down
[P1] Table 7 took 9.6 sec, 4287573554 entries, 16762698 max, 66484 tmp, 4.99575 GB/s up, 4.86982 GB/s down
Phase 1 took 97.866 sec
[P2] Setup took 0.133 sec
[P2] Table 7 took 5.267 sec, 6.06511 GB/s up, 0.100864 GB/s down
[P2] Table 6 took 5.284 sec, 6.05087 GB/s up, 0.100539 GB/s down
[P2] Table 5 took 5.287 sec, 6.05013 GB/s up, 0.100482 GB/s down
Phase 2 took 16.14 sec
[P3] Setup took 0.317 sec
[P3] Table 4 LPSK took 9.376 sec, 3464804676 entries, 14606685 max, 57829 tmp, 3.89552 GB/s up, 5.43945 GB/s down
[P3] Table 4 NSK took 10.883 sec, 3464804676 entries, 13551115 max, 57829 tmp, 3.55805 GB/s up, 5.47047 GB/s down
[P3] Table 5 PDSK took 8.566 sec, 3530618033 entries, 13814226 max, 54872 tmp, 3.7962 GB/s up, 5.45765 GB/s down
[P3] Table 5 LPSK took 11.987 sec, 3530618033 entries, 14247961 max, 57286 tmp, 5.17097 GB/s up, 4.25463 GB/s down
[P3] Table 5 NSK took 10.87 sec, 3530618033 entries, 13805758 max, 56539 tmp, 3.62997 GB/s up, 5.47702 GB/s down
[P3] Table 6 PDSK took 8.566 sec, 3709491311 entries, 14510603 max, 57544 tmp, 3.79454 GB/s up, 5.45765 GB/s down
[P3] Table 6 LPSK took 12.462 sec, 3709491311 entries, 15086676 max, 60706 tmp, 5.16026 GB/s up, 4.09246 GB/s down
[P3] Table 6 NSK took 10.917 sec, 3709491311 entries, 14504931 max, 59936 tmp, 3.79745 GB/s up, 5.45344 GB/s down
[P3] Table 7 PDSK took 8.884 sec, 4287573554 entries, 16771604 max, 66484 tmp, 4.9442 GB/s up, 5.2623 GB/s down
[P3] Table 7 LPSK took 13.543 sec, 4287573554 entries, 17198332 max, 68897 tmp, 5.28407 GB/s up, 3.7658 GB/s down
[P3] Table 7 NSK took 11.086 sec, 4287573554 entries, 16762698 max, 68300 tmp, 4.32233 GB/s up, 5.3703 GB/s down
Phase 3 took 117.676 sec
[P4] Setup took 0.155 sec
[P4] total_p7_parks = 2093542
[P4] total_c3_parks = 428757, 2385 / 2468 ANS bytes
Phase 4 took 7.411 sec, 4.31047 GB/s up, 2.46707 GB/s down
Total plot creation time was 239.136 sec (3.98561 min)
Flushing to disk took 7.418 sec

I have seen the “0 entries” issue when running on an unsupported/old card (e.g. a K2200).
It even needed a cold power cycle to come back to normal after that.

Isn’t the K2200 just supported for farming compressed plots, but not for plotting them?

Need to change NUMA config to NPS0.

Are you sure it’s electrically x16? A P2 upload of 6 GB/s smells like 4 lanes, or 8 lanes of PCIe 3.0.
Check in nvtop.
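
If nvtop isn’t handy, the negotiated link can also be read out like this (just a sketch; the 01:00.0 bus ID is only an example, and it’s best to check while a plot is in flight since the link may downshift when the GPU is idle):

# current PCIe generation and lane width per GPU
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv
# or via lspci, looking at the LnkSta line of the GPU
sudo lspci -vv -s 01:00.0 | grep LnkSta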

Yes, exactly.

Yeah, the new binaries improved the first hour or two of plotting for me (they kept HD writes over 200 MBps while using just 2 drives at a time), but at the end, after 6 hours, the NVMe was full and the reported HD writes were around 80 MBps, same as before. So basically the problem is still there.

I have added fstrim to crontab to run every 30 minutes, so I will be watching the next run.
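
The crontab entry is nothing fancy, something along these lines in root’s crontab (the mount point is a placeholder):

# trim the NVMe temp filesystem every 30 minutes
*/30 * * * * /usr/sbin/fstrim /mnt/plot-temp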

Although I rather don’t expect much from it. I have to say that my understanding of fstrim is rather limited. The way I understand it, it only helps in cases where writes come in small chunks (e.g., normal OS operation), but not really where there are sequential writes of big chunks (which, in my understanding, accomplishes the same thing as fstrim). The reason being that in order to fully remove 1 byte from the SSD, the whole block or page needs to be erased, so single-byte erase operations are expensive; sequential writes using bigger chunks avoid this problem. Also, from what I have read, the penalty for not trimming is in the range of a few single-digit percent, not 100-200% or so.

Also, when my NVMe is full (after those ~6 hours), the plotter pauses and waits for one plot to be removed before it can continue. During this time there are only read operations on the NVMe, so the trim status should be irrelevant and read speeds should jump to ~160 MBps per drive (for all 4 drives in my case). And I don’t see that. Reads are as slow as when the plotter is still writing during that last stretch.

On the other hand, if I kill the plotter and right after run 4 shells with mv commands, all is immediately good again.

I would also assume that if trim were the issue, we would have seen the same or similar problems with the old CPU-based plotter, and I don’t think anyone has reported that.

So, I really don’t know how to tie this behavior to the trim status. Well, I will be watching this new batch of plots with trim kicking in every 30 minutes.

Update
Actually, it looks like trim is not working for me.

I killed the previous job, moved all finished plots off of the NVMe, ran trim, and after that started the plotter.

Before the first plot finished its xfr to HD, there were already 5 plots on the NVMe and 4 pending HD xfrs. This first plot was reported as having only ~90 MBps xfr speed (instead of the ~250+ MBps reported when all is good).

And again, once I killed the plotter and started to xfr those plots from the NVMe, the speeds were back in the ~160 MBps range for all 4 HDs. So the only common factor is the plotter running.

Restarted the box, started the plotter, and HD write speeds are over 200 MBps with only 2 drives engaged at a time. Just in case, I left fstrim in crontab and let it run once per hour.

So, in my case fstrim didn’t really change anything. However, all is good again after I rebooted the box.

Unluckily, I can’t use Ubuntu because on my system it randomly freezes for unclear reasons (not related to plotting).

An update on the issue: I tried the plot sink locally with @localhost as -d. The problem looks solved most of the time and everything works as expected, until a certain point when it excludes one of the destination drives and keeps distributing plots only to the others, with strange, non-repetitive behaviours and timings. I’ll keep investigating with more parallel drives, because I was testing with only 3 drives, and when it loses one, everything becomes clogged and it’s not easy to follow what is bottlenecking what.
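
For anyone wanting to try the same thing, the only change on the plotter side is pointing -d at the local sink, roughly like this (the other flags are just copied from earlier in the thread, the temp path and keys are placeholders, and chia_plot_sink has to be started separately with the destination drives):

# send finished plots to a chia_plot_sink running on this machine instead of writing them directly
./cuda_plot_k32 -C 7 -n 3 -t /mnt/nvme/ -d @localhost -f <farmer_key> -c <pool_contract>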

PS: sometimes during P3 I get messages like:

WritePark(): ans_length (859) > max_ans_length (858) (y = 1, i = 4789)

What do they indicate?

Thank you!

That’s normal to see sometimes.

That would indicate RAM fragmentation, but I’m not seeing this on my Ubuntu 20.04 (maybe because of the remote plot sink?).

That is what I said before: your remote plot sink is acting like extra RAM. Also, I think the Ethernet driver is the most stable driver on Linux, so if the problem is at the driver level, using a remote plot sink reduces overall traffic on various drivers. That is not to say that the RAM driver is deficient; rather, maybe a different driver is at play.

Although I would assume that even if RAM were fragmented, once all the allocated RAM is returned to the system it would all be consolidated, but I am not sure about that.

The fact that I can

  1. kill the plotter when it starts suffering,
  2. xfr plots from NVMe at full HD speed afterwards,
  3. trim the NVMe

and right after that run the plotter again, only to hit exactly the same problems starting from the first plot xfr, would suggest to me that it may not be RAM fragmentation. However, I don’t have much experience at the drive level, and it may be that something went south there.

Another option is that maybe something like the number of available file handles got depleted or degraded. This would kind of explain why moving those plots after the plotter is stopped is not affected at all, as only a handful of those resources are needed during the move operation. On the other hand, when the next plotter run starts, it tries to grab everything it needs to work with and starts suffering right from the beginning.
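
One quick way to test that theory the next time it starts suffering would be to watch the plotter’s open file descriptors and the system-wide handle count (a rough sketch, assuming the process is still named cuda_plot_k32):

# file descriptors currently open by the plotter
ls /proc/$(pgrep -f cuda_plot_k32 | head -n1)/fd | wc -l
# per-process limit, and system-wide allocated vs. maximum handles
ulimit -n
cat /proc/sys/fs/file-nr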

I have clean logs, so really no hints about what to look for. I am completely lost here.

Actually, there is one difference between our setups. You run Ubuntu 22.04, I am on 22.10. Maybe some experimental driver slipped in that is causing it.

Maybe I missed it above… but what SSD(s) do you have that cause this difficulty?

You can try this after doing all that:

echo 1 | sudo tee -a /proc/sys/vm/compact_memory

I’m running Ubuntu 20.04

Sorry, just catching up here.

I’m having a similar issue on Linux because I’m “flying too close to the sun” so to speak by trying to plot two k32 in parallel, full RAM mode, with 352GB RAM. My operating theory is that CUDA pins too many pages and the sending side of the plot sync needs a pretty substantial amount of free system memory to function. Those two things together cause the VMM to start thrashing.

I’ve worked around it by running each of my two cuda_plot processes with -M 140 and adding some fast swap space (Optane) to both NUMA nodes. This appears to give the VMM enough flexibility and breathing room to move things around and keep the plot sync moving. I have another 32GB RAM coming today or tomorrow and plan to retest without the swap space.
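
In case anyone wants to reproduce the workaround, adding the swap was nothing special, roughly the following per node (the Optane mount path and size are just placeholders):

# create and enable a swap file on the fast device
sudo fallocate -l 64G /mnt/optane0/swapfile
sudo chmod 600 /mnt/optane0/swapfile
sudo mkswap /mnt/optane0/swapfile
sudo swapon /mnt/optane0/swapfile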

Hope that helps!

I have tried 3 different sticks: 2 were WD Black 750, the third one is a Samsung 970 EVO Plus. None of those NVMes were acting up until I started to use compressed plots. The box is a Dell T7610 dual E5-2600 v2, but with just 1 CPU, so I am running out of PCIe slots; thus at the moment I have a PCIe-to-NVMe adapter with 1 slot only.

Super, will do that. I think it could also be run while the plotter is busy. It will not do much, though, because I assume it only works on available RAM, and that may be next to nothing.

How do you run 2x full RAM on 352 GB RAM? Don’t you need 256 GB per full RAM plotter?

I missed that. That means that our drivers are even further apart.

Do you know whether cuda_plotter can run on Fedora / Rocky Linux? I have tried to install it on Fedora, but couldn’t get nvidia drivers to load, so I gave up.

Actually, one more thing worth mentioning is that even as the HD write speeds degrade, I don’t see much, if any, degradation in plot creation. The plots keep coming 175 secs apart. The only slowdowns are when the NVMe gets filled up and the plotter waits for the drives to remove plots; at that point those “waiting” plots go well above 500 sec. To me, this suggests that at least the NVMe writes are running at full speed (which is also reflected in bpytop).

Also, to not focus too much on the SATA controller side, I have tried splitting the drives across 2 internal SATA controllers (both LSI), added a PCIe SATA controller, and used USB drives. None of that makes any difference.

Lastly, if the NVMe were the problem, we most likely would not see write speeds in the 250 MBps range for the first couple of hours, as the suffering would most likely have started early. The degradation grows with elapsed time, so either some driver-level resources are being gobbled up / orphaned, or, as Max suggested, something is going on with RAM.

Sorry, one more thing. When I stop the plotter (Ctrl+C), it doesn’t really capture that signal (which it does in Max’s case). Instead it has a hard crash at the end. That may also imply that some resources were not properly released by the plotter. Kind of strange, as I would think they all should be. This is maybe where some orphaned file handles are lurking (e.g. some handle leak).