>30min GPU plotting Bladebit

lord_icon · June 30, 2023, 12:52pm

SuperMicro X9DRi-F
1 x Xeon E5-2697 v2 (12C, 24T; 2.70 GHz)
2x Seagate Firecuda 520 512GB
256GB (8x32GB) Samsung PC3-L-12800L 4Rx4 ECC
1 x Nvidia Tesla M40 (24GB, GDDR5)

That is the system. On the CPU side, I have installed only one CPU for now. I also monitored the process… there is never much load on the CPU side. Mostly only one core is busy. At the very beginning I have all 24 threads at 100% for 2-3 sec.

I THINK that CPU is not the problem here.

NVME.
There could be a problem here. Only CPU 1 is installed,
so I have only 1x PCIe x16 and 2x PCIe x8 available.
The MB has no separate NVMe slot. I use a PCIe-to-NVMe adapter.
But this should be unproblematic, because PCIe 3.0 x8 can do up to 7.8 GByte/s,
while the NVMe can only do 4.4 GByte/s.
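
As a rough check of the PCIe figure above, here is a minimal sketch that derives the theoretical one-direction bandwidth from the transfer rate and line encoding of each PCIe generation. The function name is illustrative, and the x4 line rests on the assumption that a single M.2 NVMe drive on a passive adapter links at four lanes regardless of the slot width. Real throughput will be somewhat lower because of protocol overhead.

```python
# Theoretical one-direction PCIe bandwidth, before protocol overhead.
# (transfer rate in GT/s, line-encoding efficiency) per generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gbytes(gen, lanes):
    """Return theoretical bandwidth in GByte/s (10^9 bytes/s), one direction."""
    gt_per_s, encoding = PCIE_GENERATIONS[gen]
    return gt_per_s * encoding * lanes / 8  # 8 bits per byte


x16 = pcie_bandwidth_gbytes(3, 16)  # the GPU slot
x8 = pcie_bandwidth_gbytes(3, 8)    # the slot holding the NVMe adapter
# Assumption: the NVMe drive itself only negotiates four lanes.
x4 = pcie_bandwidth_gbytes(3, 4)

print(f"PCIe 3.0 x16: {x16:.2f} GByte/s")
print(f"PCIe 3.0 x8 : {x8:.2f} GByte/s")
print(f"PCIe 3.0 x4 : {x4:.2f} GByte/s")
```

If the four-lane assumption holds, the drive's ceiling on this board would be closer to the x4 figure than to the x8 figure.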

Thus, only the Tesla GPU remains.
Should it really be that bad?

Attached is the data. Plot 2 was almost identical.
Completed Plot 1 in 1961.25 seconds ( 32.69 minutes )


Bladebit Chia Plotter
Version      : 0.0.0-dev
Git Commit   : unknown
Compiled With: gcc 11.3.0

[Global Plotting Config]
 Will create 10 plots.
 Thread count          : 24
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : 12345
 Pool public key       : 678910
 Compression Level     : 7
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : Tesla M40 24GB
 CUDA Compute Capability   : 5.2
 SM count                  : 24
 Max blocks per SM         : 32
 Max threads per SM        : 2048
 Async Engine Count        : 2
 L2 cache size             : 3.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 22.40 GB
  Free                     : 22.30 GB

Allocating buffers (this may take a few seconds)...
Kernel RAM required       : 90240524288  bytes ( 86060.07  MiB or 84.04  GiB )
Intermediate RAM required : 2999001088   bytes ( 2860.07   MiB or 2.79   GiB )
Host RAM required         : 141733920768 bytes ( 135168.00 MiB or 132.00 GiB )
Total Host RAM required   : 231974445056 bytes ( 221228.07 MiB or 216.04 GiB )
GPU RAM required          : 6135058432   bytes ( 5850.85   MiB or 5.71   GiB )
Allocating buffers

Generating plot 1 / 10: 2118da2930ac20c42bdc76a839ecdb0325ae021bd88c52a5ff9eae433fd69b04
Plot temporary file: /mnt/NVME/plot-k32-c07-2023-06-30-13-38-2118da2930ac20c42bdc76a839ecdb0325ae021bd88c52a5ff9eae433fd69b04.plot.tmp

Generating F1
Finished F1 in 11.88 seconds.
Table 2 completed in 78.34 seconds with 4294902713 entries.
Table 3 completed in 120.91 seconds with 4294899439 entries.
Table 4 completed in 122.55 seconds with 4294851516 entries.
Table 5 completed in 132.00 seconds with 4294707013 entries.
Table 6 completed in 123.78 seconds with 4294461867 entries.
Table 7 completed in 107.23 seconds with 4293843262 entries.
Finalizing Table 7
Finalized Table 7 in 52.83 seconds.
Completed Phase 1 in 749.52 seconds
Marked Table 6 in 137.69 seconds.
Marked Table 5 in 112.64 seconds.
Marked Table 4 in 106.85 seconds.
Marked Table 3 in 104.69 seconds.
Completed Phase 2 in 461.87 seconds
Compressing Table 2 and 3...
 Step 1 completed step in 17.28 seconds.
 Step 2 completed step in 70.39 seconds.
Completed table 2 in 87.67 seconds with 3439821562 / 4294899439 entries ( 80.09% ).
Compressing tables 3 and 4...
 Step 1 completed step in 15.83 seconds.
 Step 2 completed step in 43.54 seconds.
 Step 3 completed step in 84.99 seconds.
Completed table 3 in 144.36 seconds with 3465962938 / 4294851516 entries ( 80.70% ).
Compressing tables 4 and 5...
 Step 1 completed step in 17.22 seconds.
 Step 2 completed step in 44.14 seconds.
 Step 3 completed step in 86.84 seconds.
Completed table 4 in 148.20 seconds with 3532684493 / 4294707013 entries ( 82.26% ).
Compressing tables 5 and 6...
 Step 1 completed step in 17.48 seconds.
 Step 2 completed step in 45.79 seconds.
 Step 3 completed step in 91.54 seconds.
Completed table 5 in 154.80 seconds with 3713096457 / 4294461867 entries ( 86.46% ).
Compressing tables 6 and 7...
 Step 1 completed step in 17.38 seconds.
 Step 2 completed step in 50.90 seconds.
 Step 3 completed step in 107.51 seconds.
Completed table 6 in 175.78 seconds with 4293843262 / 4293843262 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 37.38 seconds.
Completed Phase 3 in 748.21 seconds
Completed Plot 1 in 1959.61 seconds ( 32.66 minutes )

/mnt/NVME/plot-k32-c07-2023-06-30-13-38-2118da2930ac20c42bdc76a839ecdb0325ae021bd88c52a5ff9eae433fd69b04.plot.tmp -> /mnt/NVME/plot-k32-c07-2023-06-30-13-38-2118da2930ac20c42bdc76a839ecdb0325ae021bd88c52a5ff9eae433fd69b04.plot
Completed writing plot in 0.10 seconds
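
To see where the time goes, the phase summary lines of the log above can be pulled out with a couple of regular expressions. This is only a sketch: the regexes are written against the lines exactly as they appear in this log, and other Bladebit versions may word them differently.

```python
import re

# Phase summary lines copied from the log above.
LOG = """\
Completed Phase 1 in 749.52 seconds
Completed Phase 2 in 461.87 seconds
Completed Phase 3 in 748.21 seconds
Completed Plot 1 in 1959.61 seconds ( 32.66 minutes )
"""

PHASE_RE = re.compile(r"Completed Phase (\d+) in ([\d.]+) seconds")
PLOT_RE = re.compile(r"Completed Plot \d+ in ([\d.]+) seconds")


def phase_breakdown(log_text):
    """Map phase number -> seconds for every phase summary line found."""
    return {int(n): float(s) for n, s in PHASE_RE.findall(log_text)}


phases = phase_breakdown(LOG)
total = float(PLOT_RE.search(LOG).group(1))

# Print each phase with its share of the total plot time.
for number, seconds in sorted(phases.items()):
    print(f"Phase {number}: {seconds:8.2f} s ({seconds / total:6.1%})")
```

Saving the plotter output to a file and passing its contents to `phase_breakdown` makes it easier to compare runs after changing a BIOS setting or a driver.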

lord_icon · June 30, 2023, 12:56pm

Edit: The Specs.

Memory bandwidth 288.4 GB/s

Voodoo · June 30, 2023, 2:10pm

which version of bladebit are you using? it shows 0.0.0.dev? quite sure it should be faster than 30 minutes

NVME is not the problem, actually it’s stupendously fast

Completed writing plot in 0.10 seconds

That’s the only time the nvme is used.

GPU driver, bladebit version, PCIE link speeds in bios?

Edit: OS windows or linux. for me windows is 12 minutes or so last time i checked. (and win10 doesn’t work at all)

lord_icon · June 30, 2023, 2:58pm

In the following section I was able to detect an NVME write.

Compressing tables 4 and 5…
Step 1 completed in 17.21 seconds.
Step 2 completed step in 44.13 seconds.
Step 3 completed the step in 86.84 seconds.

Install: NVIDIA dev kit (CUDA toolkit)

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
PATH=/usr/local/cuda/bin/:$PATH
sudo apt install cmake libnuma-dev
git clone https://github.com/Chia-Network/bladebit.git -b cuda-compression
cd bladebit
mkdir -p build-release
cd build-release
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target clean --config Release
cmake --build . --target bladebit_cuda --config Release -j$(nproc --all)

Bladebit installed according to instructions
Ubuntu 22.04

sudo apt install -y build-essential cmake libgmp-dev libnuma-dev

no make or build (?)

./bladebit_cuda --help => no Version number

Voodoo · June 30, 2023, 3:06pm

hmm I think it has been a bit unclear which branch exactly to use to build by yourself. not sure if the instructions are up to date. But you’d have to ask in discord, not my cup of tea.

Unless you object on principle, maybe just try the latest binary version and see if that makes a difference
https://downloads.chia.net/bladebit/alpha4.3/bladebit-cuda-plotter/

lord_icon · June 30, 2023, 3:15pm

ahhh… my hero

It already looks different. I’ll let it run 3 plots.
Let’s see… I’ll report back

lord_icon · June 30, 2023, 3:26pm

Runs only marginally better:

OLD:
Completed Phase 1 in 749.52 seconds

NEW:
Completed Phase 1 in 567.01 seconds

-182 seconds (about 3 min)
But let’s wait for the other phases.

Voodoo · June 30, 2023, 3:41pm

That’s still weirdly slow though. It’s hard to say, as always; the only thing I can think of is something with the PCIe lanes/speed, or maybe the RAM config, some mobo thing with the single CPU.
I have the same mobo, but with dual cpu and 16x16gb ram.

You could try plotting with the --no-numa flag added. Maybe it’s not handling memory allocation correctly or something like that

bag_of_chia · June 30, 2023, 5:02pm

Here are my results for reference.
Plotter specs are here: Show off your rigs! - #381 by bag_of_chia

Generating F1
Finished F1 in 4.04 seconds.
Table 2 completed in 8.68 seconds with 4294938284 entries.
Table 3 completed in 36.69 seconds with 4294910274 entries.
Table 4 completed in 18.60 seconds with 4294838698 entries.
Table 5 completed in 18.64 seconds with 4294592915 entries.
Table 6 completed in 15.90 seconds with 4294339126 entries.
Table 7 completed in 11.82 seconds with 4293669401 entries.
Finalizing Table 7
Finalized Table 7 in 6.53 seconds.
Completed Phase 1 in 120.93 seconds
Marked Table 6 in 4.71 seconds.
Marked Table 5 in 4.09 seconds.
Marked Table 4 in 3.91 seconds.
Marked Table 3 in 3.85 seconds.
Completed Phase 2 in 16.57 seconds
Compressing Table 2 and 3...
 Step 1 completed step in 6.21 seconds.
 Step 2 completed step in 4.92 seconds.
Completed table 2 in 11.13 seconds with 3439802247 / 4294910274 entries ( 80.09% ).
Compressing tables 3 and 4...
 Step 1 completed step in 6.43 seconds.
 Step 2 completed step in 7.94 seconds.
 Step 3 completed step in 5.83 seconds.
Completed table 3 in 20.20 seconds with 3465898677 / 4294838698 entries ( 80.70% ).
Compressing tables 4 and 5...
 Step 1 completed step in 6.77 seconds.
 Step 2 completed step in 8.12 seconds.
 Step 3 completed step in 6.45 seconds.
Completed table 4 in 21.34 seconds with 3532578519 / 4294592915 entries ( 82.26% ).
Compressing tables 5 and 6...
 Step 1 completed step in 6.76 seconds.
 Step 2 completed step in 8.43 seconds.
 Step 3 completed step in 6.38 seconds.
Completed table 5 in 21.57 seconds with 3712952071 / 4294339126 entries ( 86.46% ).
Compressing tables 6 and 7...
 Step 1 completed step in 6.40 seconds.
 Step 2 completed step in 9.27 seconds.
 Step 3 completed step in 6.76 seconds.
Completed table 6 in 22.43 seconds with 4293669401 / 4293669401 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 3.96 seconds.
Completed Phase 3 in 100.63 seconds
Completed Plot 1 in 238.12 seconds ( 3.97 minutes )

/mnt/ramdisk/plot-k32-c07-2023-06-25-12-20-9d570807b00deb336f9df9deffbec12e262ff9608d2779ce714d6a0e08841a36.plot.tmp -> /mnt/ramdisk/plot-k32-c07-2023-06-25-12-20-9d570807b00deb336f9df9deffbec12e262ff9608d2779ce714d6a0e08841a36.plot
Completed writing plot in 79.58 seconds

Compared with your timings, I feel like the GPU is the bottleneck. It could be a driver issue, but…
According to this website (NVIDIA Tesla M40 Specs | TechPowerUp GPU Database),
the RTX 2080 performs at 180% of the Tesla M40. But even with that performance difference, you should see timings of around 6-8 min per plot.
For some reason the GPU is clearly underperforming. Hope this helps you get on the right track with troubleshooting
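
The 6-8 min estimate can be reproduced with a few lines of arithmetic. This sketch assumes that plot time scales linearly with the quoted relative GPU performance, which is a rough assumption rather than something measured in this thread; the variable names are only illustrative.

```python
# Figures taken from the logs in this thread.
reference_plot_seconds = 238.12  # RTX 2080 SUPER run above
observed_m40_seconds = 1959.61   # Tesla M40 run from the first post

# Relative performance quoted above (RTX 2080 = 180% of a Tesla M40).
relative_performance = 1.80

# Assumption: plot time scales linearly with relative GPU performance.
expected_m40_seconds = reference_plot_seconds * relative_performance
slowdown_vs_expected = observed_m40_seconds / expected_m40_seconds

print(f"Expected M40 plot time: {expected_m40_seconds / 60:.1f} min")
print(f"Observed M40 plot time: {observed_m40_seconds / 60:.1f} min")
print(f"Observed / expected   : {slowdown_vs_expected:.1f}x")
```

Under that assumption the M40 run is several times slower than the raw performance gap alone would explain.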

Bladebit Chia Plotter
Version      : 0.0.0-dev
Git Commit   : unknown
Compiled With: gcc 11.3.0

[Global Plotting Config]
 Will create 1000 plots.
 Thread count          : 40
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : xxx
 Pool public key       : xxx
 Compression Level     : 7
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA GeForce RTX 2080 SUPER
 CUDA Compute Capability   : 7.5
 SM count                  : 48
 Max blocks per SM         : 16
 Max threads per SM        : 1024
 Async Engine Count        : 3
 L2 cache size             : 4.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.79 GB
  Free                     : 6.70 GB

drhicom · June 30, 2023, 6:30pm

32 Minutes, I can run to the store and get a hamburger in that time…

lord_icon · June 30, 2023, 6:38pm

--no-numa
Completed Plot 1 in 1957.13 seconds ( 32.62 minutes )

no better.
I’ll have a look at the BIOS settings tomorrow. Maybe there is something to find

lord_icon · July 2, 2023, 12:44pm

@Voodoo
Can you take a look at yours and see what you have?

lord_icon · July 10, 2023, 9:33am

Nope… first with 1 CPU, I changed everything in the BIOS back and forth and tested. Always between 28-30 min.

BIOS default setting + 2 CPU = same problem

NVMe benchmark test with fio

Result:
Read: (3050MB/s)
Write: (3388MB/s)

lord_icon@chia-plotter:~/cuda$ sudo fio --filename=/mnt/NVME/test.bin --direct=1 --rw=read --ioengine=libaio --bs=2m --iodepth=64 --size=10G --numjobs=1 --runtime=60 --time_base=1 --group_reporting --name=test-seq-read
test-seq-read: (g=0): rw=read, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=64
fio-3.28
Starting 1 process
test-seq-read: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [R(1)][100.0%][r=3311MiB/s][r=1655 IOPS][eta 00m:00s]
test-seq-read: (groupid=0, jobs=1): err= 0: pid=2661: Mon Jul 10 11:24:05 2023
  read: IOPS=1454, BW=2909MiB/s (3050MB/s)(171GiB/60038msec)
    slat (usec): min=48, max=1126, avg=139.26, stdev=33.52
    clat (usec): min=6065, max=93177, avg=43835.09, stdev=8948.10
     lat (usec): min=6223, max=93299, avg=43975.40, stdev=8944.62
    clat percentiles (usec):
     |  1.00th=[38011],  5.00th=[38011], 10.00th=[38536], 20.00th=[38536],
     | 30.00th=[38536], 40.00th=[38536], 50.00th=[38536], 60.00th=[39060],
     | 70.00th=[46924], 80.00th=[49546], 90.00th=[56886], 95.00th=[64750],
     | 99.00th=[73925], 99.50th=[78119], 99.90th=[83362], 99.95th=[85459],
     | 99.99th=[90702]
   bw (  MiB/s): min= 1812, max= 3316, per=99.98%, avg=2908.03, stdev=474.83, samples=119
   iops        : min=  906, max= 1658, avg=1454.02, stdev=237.42, samples=119
  lat (msec)   : 10=0.02%, 20=0.05%, 50=81.28%, 100=18.66%
  cpu          : usr=2.74%, sys=23.33%, ctx=86358, majf=0, minf=32787
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=87311,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=2909MiB/s (3050MB/s), 2909MiB/s-2909MiB/s (3050MB/s-3050MB/s), io=171GiB (183GB), run=60038-60038msec

Disk stats (read/write):
  nvme0n1: ios=183589/4, merge=1361/1, ticks=8005733/43, in_queue=8005788, util=99.93%
lord_icon@chia-plotter:~/cuda$
lord_icon@chia-plotter:~/cuda$ sudo fio --filename=/mnt/NVME/test.bin --direct=1 --rw=write --ioengine=libaio --bs=2m --iodepth=64 --size=10G --numjobs=1 --runtime=60 --time_base=1 --group_reporting --name=test-seq-write
test-seq-write: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=64
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=3230MiB/s][w=1615 IOPS][eta 00m:00s]
test-seq-write: (groupid=0, jobs=1): err= 0: pid=2708: Mon Jul 10 11:25:22 2023
  write: IOPS=1615, BW=3231MiB/s (3388MB/s)(189GiB/60042msec); 0 zone resets
    slat (usec): min=81, max=31327, avg=253.17, stdev=192.74
    clat (usec): min=10284, max=78493, avg=39351.56, stdev=2298.06
     lat (usec): min=10473, max=78745, avg=39605.56, stdev=2291.15
    clat percentiles (usec):
     |  1.00th=[38011],  5.00th=[38011], 10.00th=[38011], 20.00th=[38011],
     | 30.00th=[38011], 40.00th=[38011], 50.00th=[38011], 60.00th=[38011],
     | 70.00th=[41157], 80.00th=[41157], 90.00th=[41681], 95.00th=[41681],
     | 99.00th=[44303], 99.50th=[44303], 99.90th=[58983], 99.95th=[62653],
     | 99.99th=[72877]
   bw (  MiB/s): min= 3086, max= 3304, per=100.00%, avg=3233.97, stdev=22.48, samples=120
   iops        : min= 1543, max= 1652, avg=1616.88, stdev=11.24, samples=120
  lat (msec)   : 20=0.12%, 50=99.64%, 100=0.25%
  cpu          : usr=28.43%, sys=15.91%, ctx=96682, majf=0, minf=18
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,96998,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=3231MiB/s (3388MB/s), 3231MiB/s-3231MiB/s (3388MB/s-3388MB/s), io=189GiB (203GB), run=60042-60042msec

Disk stats (read/write):
  nvme0n1: ios=0/196995, merge=0/1526, ticks=0/7712009, in_queue=7712070, util=99.77%
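
fio prints each bandwidth twice, once in MiB/s (binary) and once in MB/s (decimal), which is why the summary at the top quotes 3050 and 3388 while the progress lines show lower numbers. A small sketch of the conversion, using a helper name chosen for illustration:

```python
MIB = 1024 ** 2  # bytes in one mebibyte
MB = 1000 ** 2   # bytes in one megabyte


def mib_to_mb(mib_per_s):
    """Convert a rate in MiB/s (binary) to MB/s (decimal)."""
    return mib_per_s * MIB / MB


# Average bandwidths reported by the two fio runs above, in MiB/s.
read_mb = mib_to_mb(2909)
write_mb = mib_to_mb(3231)

print(f"Read : {read_mb:.0f} MB/s")
print(f"Write: {write_mb:.0f} MB/s")
```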

The NVMe is ruled out as the culprit. I have now ordered a 1050 Ti on Amazon.
ASUS Cerberus-GTX1050TI-O4G gaming graphics card (Nvidia, PCIe 3.0, 4GB GDDR5 memory, DVI, HDMI, Display Port)

It can only be the Tesla graphics card now. I think there was something in the news about drivers that are supposed to prevent cryptocurrency mining.
Maybe the card detects this and throttles down to a minimum.

I’ll see if I can find a GPU benchmark that runs from the Bash shell (without a GUI).

whosrdaddy · July 10, 2023, 10:06am

M40 = basically a 980Ti with 24Gig ram.
Maxwell & Pascal are too slow for plotting, you really need Turing (20 series) or higher.
I have a 2680v2 with a 3070 that spits out plots in 165 seconds

LLPP · October 14, 2023, 1:07am

When I use Bladebit version 3.1 for plotting,
it takes 26 min!! (Completed Plot 1 in 1585.80 seconds ( 26.43 minutes ))
But when I switch back to plotting with Bladebit version 3.0,
the plotting time goes back to 9-12 min.

windows 10
Intel(R) Xeon(R) CPU E5-2689 0 @ 2.60GHz 2.60 GHz (2 CPU)
NVMe SSD (Corsair MP600)
256GB (8x32GB) Samsung PC3-L-12800L 4Rx4 ECC
I already tried 1x 3060 Ti and 1x 3070

Bladebit Chia Plotter
Version : 3.1.0
Git Commit : e9836f8bd963321457bc86eb5d61344bfb76dcf0
Compiled With: msvc 19.29.30152

[Global Plotting Config]
Will create 10 plots.
Thread count : 32
Warm start enabled : false
NUMA disabled : false
CPU affinity disabled : false

Compression Level : 5
Benchmark mode : disabled

[Bladebit CUDA Plotter]
Host RAM : 255 GiB
Plot checks : disabled

Selected cuda device 0 : NVIDIA GeForce RTX 3060 Ti
CUDA Compute Capability : 8.6
SM count : 38
Max blocks per SM : 16
Max threads per SM : 1536
Async Engine Count : 5
L2 cache size : 3.00 MB
L2 persist cache max size : 2.25 MB
Stack Size : 1.00 KB
Memory:
Total : 8.00 GB
Free : 6.96 GB
Generating F1
Finished F1 in 22.24 seconds.
Table 2 completed in 58.30 seconds with 4294967296 entries.
Table 3 completed in 103.38 seconds with 4294967296 entries.
Table 4 completed in 153.68 seconds with 4294967296 entries.
Table 5 completed in 143.92 seconds with 4294894843 entries.
Table 6 completed in 115.81 seconds with 4294835028 entries.
Table 7 completed in 90.73 seconds with 4294580012 entries.
Finalizing Table 7
Finalized Table 7 in 43.31 seconds.
Completed Phase 1 in 733.49 seconds
Marked Table 6 in 14.78 seconds.
Marked Table 5 in 16.07 seconds.
Marked Table 4 in 17.05 seconds.
Marked Table 3 in 16.95 seconds.
Completed Phase 2 in 64.86 seconds
Compressing Table 2 and 3…
Step 1 completed step in 56.44 seconds.
Step 2 completed step in 42.83 seconds.
Completed table 2 in 99.27 seconds with 3439914865 / 4294967296 entries ( 80.09% ).

hajes29a · October 14, 2023, 5:38am

I had a similar issue on the same setup, except for the SSD

If I use a riser cable for the GPU, only PCIe 3.0 x8 is used: 5 min plotting time

GPU attached directly to PCIe slot…2.5min

The E5-2697 v2 has 40 PCIe lanes… if you use all of them on SSDs, there are very few left for the GPU. Try nvtop to see what it says about link bandwidth while plotting
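
Besides nvtop, the negotiated link can also be read with nvidia-smi's `--query-gpu` option. Below is a minimal sketch that parses such a report and flags a GPU running below its maximum link width or generation. The helper names and the sample line are hypothetical (the sample is not a measurement from this thread), and the parser assumes every field comes back as a plain number. The link generation can drop while the card is idle, so the query is more meaningful while a plot is running.

```python
import csv
import io
import subprocess

QUERY_FIELDS = (
    "name,pcie.link.gen.current,pcie.link.gen.max,"
    "pcie.link.width.current,pcie.link.width.max"
)


def parse_link_report(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or unexpected lines
        name, gen_cur, gen_max, width_cur, width_max = (c.strip() for c in row)
        gpus.append({
            "name": name,
            "gen_current": int(gen_cur),
            "gen_max": int(gen_max),
            "width_current": int(width_cur),
            "width_max": int(width_max),
        })
    return gpus


def query_gpu_links():
    """Run nvidia-smi on this machine and return the parsed link report."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_link_report(result.stdout)


# Hypothetical sample line, NOT taken from any system in this thread.
SAMPLE = "Tesla M40 24GB, 3, 3, 8, 16\n"

for gpu in parse_link_report(SAMPLE):
    degraded = (gpu["width_current"] < gpu["width_max"]
                or gpu["gen_current"] < gpu["gen_max"])
    status = "below maximum" if degraded else "at maximum"
    print(f"{gpu['name']}: gen {gpu['gen_current']} x{gpu['width_current']} ({status})")
```

On a real system, replace `parse_link_report(SAMPLE)` with `query_gpu_links()`.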

drhicom · October 14, 2023, 11:47am

Don’t forget to do a chia plots check. Under Win 10 there was a problem with Bladebit_cuda, run the check

BadgerStork · October 14, 2023, 3:57pm

Excellent point! Always worth checking that