Slow GPU plotting using Bladebit on Dell R720 with RTX 3060

New to GPU plotting. Comparing my results with posts from other farmers, my plot times seem a bit high. I know that Windows plotting will be slower, but this seems significantly slow. Aside from moving over to Linux, any tips or pointers to speed up plot times?

If you’re running a similar setup, how do your plot times compare?


Dell R720
Win11 Pro on internal 256GB SSD (C:)
2 x Intel E5-2680 v2 (2 sockets, 16 cores, 32 logical processors)
256GB DDR3 at 1333MHz
1 x WD 1TB NVMe (D:)
RTX 3060 (12GB)
Robocopy moving plots over to a 12TB SAS 3.5"
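For reference, a sketch of the sort of Robocopy invocation used for the plot-moving step. The source and destination paths are placeholders (F:\plots is an assumed drive letter); adjust to your own setup:

```shell
:: Move finished plots from the NVMe temp drive to the 12TB SAS disk.
:: D:\cudaplots and F:\plots are placeholder paths.
:: /MOV    deletes the source file after a successful copy
:: /J      uses unbuffered I/O (good for large sequential files)
:: /MOT:1  keeps monitoring the source and re-runs when new plots appear
robocopy D:\cudaplots F:\plots *.plot /MOV /J /MOT:1
```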


powershell ".\bladebit_cuda.exe -n 1 -f xxx -p yyy -z 7 cudaplot d:\cudaplots\"


Bladebit Chia Plotter
Version : 3.1.0
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: msvc 19.29.30152

[Global Plotting Config]
Will create 1 plots.
Thread count : 32
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false
Farmer public key : xxx
Pool public key : yyy
Compression Level : 7
Benchmark mode : disabled

[Bladebit CUDA Plotter]
Host RAM : 255 GiB
Plot checks : disabled

Selected cuda device 0 : NVIDIA GeForce RTX 3060
CUDA Compute Capability : 8.6
SM count : 28
Max blocks per SM : 16
Max threads per SM : 1536
Async Engine Count : 1
L2 cache size : 2.25 MB
L2 persist cache max size : 1.69 MB
Stack Size : 1.00 KB
Memory:
Total : 12.00 GB
Free : 10.99 GB

Allocating buffers (this may take a few seconds)…
Kernel RAM required : 91955994624 bytes ( 87696.07 MiB or 85.64 GiB )
Intermediate RAM required : 4378927104 bytes ( 4176.07 MiB or 4.08 GiB )
Host RAM required : 142270791680 bytes ( 135680.00 MiB or 132.50 GiB )
Total Host RAM required : 234226786304 bytes ( 223376.07 MiB or 218.14 GiB )
GPU RAM required : 6163857408 bytes ( 5878.31 MiB or 5.74 GiB )
Allocating buffers…
Done.

Generating plot 1 / 1: yyy
Plot temporary file: d:\cudaplots\plot-k32-c07-2023-12-31-09-17-yyy.plot.tmp

Generating F1
Finished F1 in 18.53 seconds.
Table 2 completed in 47.85 seconds with 4294936135 entries.
Table 3 completed in 98.94 seconds with 4294824645 entries.
Table 4 completed in 127.47 seconds with 4294551792 entries.
Table 5 completed in 121.95 seconds with 4293980992 entries.
Table 6 completed in 97.21 seconds with 4292980201 entries.
Table 7 completed in 77.56 seconds with 4290922333 entries.
Finalizing Table 7
Finalized Table 7 in 32.33 seconds.
Completed Phase 1 in 623.55 seconds
Marked Table 6 in 9.55 seconds.
Marked Table 5 in 10.95 seconds.
Marked Table 4 in 11.76 seconds.
Marked Table 3 in 11.43 seconds.
Completed Phase 2 in 43.69 seconds
Compressing Table 2 and 3…
Step 1 completed step in 36.18 seconds.
Step 2 completed step in 31.27 seconds.
Completed table 2 in 67.45 seconds with 3439594236 / 4294824645 entries ( 80.09% ).
Compressing tables 3 and 4…
Step 1 completed step in 39.01 seconds.
Step 2 completed step in 46.10 seconds.
Step 3 completed step in 33.60 seconds.
Completed table 3 in 118.72 seconds with 3465425890 / 4294551792 entries ( 80.69% ).
Compressing tables 4 and 5…
Step 1 completed step in 35.60 seconds.
Step 2 completed step in 43.55 seconds.
Step 3 completed step in 35.02 seconds.
Completed table 4 in 114.17 seconds with 3531693584 / 4293980992 entries ( 82.25% ).
Compressing tables 5 and 6…
Step 1 completed step in 37.55 seconds.
Step 2 completed step in 44.72 seconds.
Step 3 completed step in 34.95 seconds.
Completed table 5 in 117.22 seconds with 3711397509 / 4292980201 entries ( 86.45% ).
Compressing tables 6 and 7…
Step 1 completed step in 37.52 seconds.
Step 2 completed step in 47.86 seconds.
Step 3 completed step in 39.81 seconds.
Completed table 6 in 125.19 seconds with 4290922333 / 4290922333 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 22.47 seconds.
Completed Phase 3 in 565.22 seconds
Completed Plot 1 in 1232.46 seconds ( 20.54 minutes )

d:\cudaplots\plot-k32-c07-2023-12-31-09-17-yyy.plot.tmp → d:\cudaplots\plot-k32-c07-2023-12-31-09-17-yyy.plot
Completed writing plot in 0.15 seconds
Final plot table pointers:
Table 1: 0 ( 0x0000000000000000 )
Table 2: 1288999648 ( 0x000000004cd492e0 )
Table 3: 5067852148 ( 0x000000012e114974 )
Table 4: 19154609623 ( 0x0000000475b425d7 )
Table 5: 33510739123 ( 0x00000007cd654cb3 )
Table 6: 48597354073 ( 0x0000000b50a0c659 )
Table 7: 66039702598 ( 0x0000000f6045e446 )
C 1 : 4096 ( 0x0000000000001000 )
C 2 : 1720472 ( 0x00000000001a4098 )
C 3 : 1720648 ( 0x00000000001a4148 )

Final plot table sizes:
Table 1: 0.00 MiB
Table 2: 3603.79 MiB
Table 3: 13434.18 MiB
Table 4: 13691.07 MiB
Table 5: 14387.72 MiB
Table 6: 16634.32 MiB
Table 7: 16880.09 MiB
C 1 : 1.64 MiB
C 2 : 0.00 MiB
C 3 : 1227.64 MiB

The problem is primarily Windows. Try Linux Mint; it will probably halve your plot time.

The memory speed is probably slowing things down a bit as well, but Windows is the biggest problem.


I use a 5950X + GTX 1080 + 128GB@3600 RAM + PCIe 4 2TB temp NVMe on Debian 12, and BB c3 plots get created in about 5-6 min plus transfer time to the NAS. Win is definitely part of the problem: even with Madmax, back when NFT plots came out, my system was much slower (20-30%) on Win vs Linux.

Note that my GPU is running at PCIe 3 with only 8 lanes, since it sits in the secondary PCIe slot of a standard desktop system; the primary slot is already occupied by an AMD GPU used for all other graphics.
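If you want to check what link your own card actually negotiated, nvidia-smi can report it (a diagnostic one-liner; requires the NVIDIA driver to be installed):

```shell
# Report the PCIe generation and lane width the GPU is currently using,
# plus its maximums. Link speed mostly matters for the host<->GPU copies
# the plotter does, not for raw compute.
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv
```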


New Linux convert here. Yes, it’s Windows.
I’d also possibly avoid Bladebit. It may or may not happen to you, but I kept getting a “GRResult out of memory” error while farming Bladebit plots with the Chia GPU harvester. Switched to Gigahorse (with a fee, mind you) and have not had one single solitary error since. It will farm your BB plots too.

No issues here with my system (Debian 12) which I have generating plots 24/7 for weeks now.

If I read correctly, I noticed your plotting thread count. Shouldn’t it be (CPU threads) - 1, with one thread left free for I/O?
I’m running with -2, which leaves a thread or two free for brief Task Manager peek-a-boos/sanity checks and covers I/O. That’s on a 12-thread 5600G with diskplot; CUDA may differ.
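As a minimal sketch of that rule of thumb on Linux (the spare count of 2 is just this poster's preference, not a Bladebit recommendation; the result would be passed via Bladebit's -t/--threads option):

```shell
#!/bin/sh
# Leave a couple of logical processors free for I/O and the desktop.
spare=2
total=$(nproc)                                   # logical processors
threads=$(( total > spare ? total - spare : 1 )) # never go below 1
echo "plotting with $threads of $total threads"
```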

Hat-tip to @Ronski for the Linux Mint suggestion. Same hardware specs as in the initial post; only the OS changed, from Win10 to Linux Mint 21.2.

C7 plot creation time reduced from 21 minutes (Win10 Pro) to around 3 minutes (Mint 21.2) :star_struck:


Bladebit Chia Plotter
Version : 3.1.0
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: gcc 9.4.0

[Global Plotting Config]
Will create 1 plots.
Thread count : 32
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false
Farmer public key : xxx
Pool public key : yyy
Compression Level : 7
Benchmark mode : disabled

[Bladebit CUDA Plotter]
Host RAM : 251 GiB
Plot checks : disabled

Selected cuda device 0 : NVIDIA GeForce RTX 3060
CUDA Compute Capability : 8.6
SM count : 28
Max blocks per SM : 16
Max threads per SM : 1536
Async Engine Count : 2
L2 cache size : 2.25 MB
L2 persist cache max size : 1.69 MB
Stack Size : 1.00 KB
Memory:
Total : 11.56 GB
Free : 11.43 GB

Allocating buffers (this may take a few seconds)…
Kernel RAM required : 88026990288 bytes ( 83949.08 MiB or 81.98 GiB )
Intermediate RAM required : 73728 bytes ( 0.07 MiB or 0.00 GiB )
Host RAM required : 142270791680 bytes ( 135680.00 MiB or 132.50 GiB )
Total Host RAM required : 230297781968 bytes ( 219629.08 MiB or 214.48 GiB )
GPU RAM required : 6163857408 bytes ( 5878.31 MiB or 5.74 GiB )
Allocating buffers…
Done.

Generating plot 1 / 1: abc
Plot temporary file: /media/nvme01/tempplots/tempcudaplots/plot-k32-c07-2024-01-05-13-33-abc.plot.tmp

Generating F1
Finished F1 in 2.94 seconds.
Table 2 completed in 6.42 seconds with 4294967296 entries.
Table 3 completed in 11.60 seconds with 4294967296 entries.
Table 4 completed in 14.71 seconds with 4294967296 entries.
Table 5 completed in 13.87 seconds with 4294926388 entries.
Table 6 completed in 11.40 seconds with 4294886752 entries.
Table 7 completed in 9.68 seconds with 4294765490 entries.
Finalizing Table 7
Finalized Table 7 in 4.97 seconds.
Completed Phase 1 in 75.60 seconds
Marked Table 6 in 3.23 seconds.
Marked Table 5 in 2.83 seconds.
Marked Table 4 in 2.69 seconds.
Marked Table 3 in 2.64 seconds.
Completed Phase 2 in 11.40 seconds
Compressing Table 2 and 3…
Step 1 completed step in 5.26 seconds.
Step 2 completed step in 4.84 seconds.
Completed table 2 in 10.10 seconds with 3439918962 / 4294967296 entries ( 80.09% ).
Compressing tables 3 and 4…
Step 1 completed step in 4.99 seconds.
Step 2 completed step in 6.74 seconds.
Step 3 completed step in 5.27 seconds.
Completed table 3 in 17.01 seconds with 3466110344 / 4294967296 entries ( 80.70% ).
Compressing tables 4 and 5…
Step 1 completed step in 4.91 seconds.
Step 2 completed step in 6.70 seconds.
Step 3 completed step in 5.41 seconds.
Completed table 4 in 17.02 seconds with 3532954429 / 4294926388 entries ( 82.26% ).
Compressing tables 5 and 6…
Step 1 completed step in 4.93 seconds.
Step 2 completed step in 6.94 seconds.
Step 3 completed step in 5.76 seconds.
Completed table 5 in 17.63 seconds with 3713632628 / 4294886752 entries ( 86.47% ).
Compressing tables 6 and 7…
Step 1 completed step in 4.85 seconds.
Step 2 completed step in 7.65 seconds.
Step 3 completed step in 7.10 seconds.
Completed table 6 in 19.60 seconds with 4294765490 / 4294765490 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 3.23 seconds.
Completed Phase 3 in 84.58 seconds
Completed Plot 1 in 171.59 seconds ( 2.86 minutes )

/media/nvme01/tempplots/tempcudaplots/plot-k32-c07-2024-01-05-13-33-abc.plot.tmp → /media/nvme01/tempplots/tempcudaplots/plot-k32-c07-2024-01-05-13-33-abc.plot
Completed writing plot in 1.15 seconds
Final plot table pointers:
Table 1: 0 ( 0x0000000000000000 )
Table 2: 1290153184 ( 0x000000004ce62ce0 )
Table 3: 5069361184 ( 0x000000012e285020 )
Table 4: 19158899209 ( 0x0000000475f59a09 )
Table 5: 33520156909 ( 0x00000007cdf500ed )
Table 6: 48615862759 ( 0x0000000b51bb31e7 )
Table 7: 66073837309 ( 0x0000000f624ebefd )
C 1 : 4096 ( 0x0000000000001000 )
C 2 : 1722008 ( 0x00000000001a4698 )
C 3 : 1722184 ( 0x00000000001a4748 )

Final plot table sizes:
Table 1: 0.00 MiB
Table 2: 3604.13 MiB
Table 3: 13436.83 MiB
Table 4: 13695.96 MiB
Table 5: 14396.39 MiB
Table 6: 16649.22 MiB
Table 7: 16895.21 MiB
C 1 : 1.64 MiB
C 2 : 0.00 MiB
C 3 : 1228.74 MiB

That’s an incredible improvement. I knew Bladebit was bad under Windows, but hadn’t realised it had got that bad.


Most likely that was a NUMA problem on multi-socket boxes running Windows; it looks like it is challenging for all plotters (not sure whether it can be fixed, though).

OTOH, if such a box has just one GPU, removing the second CPU most likely fixes the issue. (For GPU plotting, the CPU is not used that much, at least on GH.) I fought this with MM 1.0, where one CPU was a few times faster than two CPUs (I also gave up and switched to Linux).
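On Linux you can at least inspect and work around the topology without pulling a CPU; a sketch, assuming numactl is installed (the bladebit arguments shown are placeholders):

```shell
# Show how many NUMA nodes the OS sees; a dual-socket R720 reports two.
lscpu | grep -i '^NUMA'

# Hypothetical pinning: keep the plotter's CPUs and memory on node 0 so
# allocations stay local to the socket wired to the GPU's PCIe slot.
# numactl --cpunodebind=0 --membind=0 ./bladebit_cuda ... cudaplot /mnt/tmp/
```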

IIRC my 3080 on Windows took about 6 minutes, and less than half that on Linux. They made Bladebit do extra work in places so it would run on machines with less than 256GB of RAM.

I think those smaller differences come from things like thread management, etc., not really the computational part (the hardware is the same regardless of OS). Both the Microsoft and Linux compiler teams surely read each other’s generated assembly for the computational kernels, so that part should perform about the same.

On the other hand, the original code (GH/BB) depends on Linux-to-Windows porting libs for virtually everything other than the computational part, and the quality of those libs may not be the best (most likely they are written to work, not really to perform). Basically, to have performant Windows code, it has to be written for Windows from scratch, using the Win API.

Also, at least on the Windows side, you need to experiment a lot with compiler switches to get the best results. This may introduce small differences too.