>30min GPU plotting Bladebit

Here are my results for reference.
My plotter specs are here: Show off your rigs! - #381 by bag_of_chia

Generating F1
Finished F1 in 4.04 seconds.
Table 2 completed in 8.68 seconds with 4294938284 entries.
Table 3 completed in 36.69 seconds with 4294910274 entries.
Table 4 completed in 18.60 seconds with 4294838698 entries.
Table 5 completed in 18.64 seconds with 4294592915 entries.
Table 6 completed in 15.90 seconds with 4294339126 entries.
Table 7 completed in 11.82 seconds with 4293669401 entries.
Finalizing Table 7
Finalized Table 7 in 6.53 seconds.
Completed Phase 1 in 120.93 seconds
Marked Table 6 in 4.71 seconds.
Marked Table 5 in 4.09 seconds.
Marked Table 4 in 3.91 seconds.
Marked Table 3 in 3.85 seconds.
Completed Phase 2 in 16.57 seconds
Compressing Table 2 and 3...
 Step 1 completed step in 6.21 seconds.
 Step 2 completed step in 4.92 seconds.
Completed table 2 in 11.13 seconds with 3439802247 / 4294910274 entries ( 80.09% ).
Compressing tables 3 and 4...
 Step 1 completed step in 6.43 seconds.
 Step 2 completed step in 7.94 seconds.
 Step 3 completed step in 5.83 seconds.
Completed table 3 in 20.20 seconds with 3465898677 / 4294838698 entries ( 80.70% ).
Compressing tables 4 and 5...
 Step 1 completed step in 6.77 seconds.
 Step 2 completed step in 8.12 seconds.
 Step 3 completed step in 6.45 seconds.
Completed table 4 in 21.34 seconds with 3532578519 / 4294592915 entries ( 82.26% ).
Compressing tables 5 and 6...
 Step 1 completed step in 6.76 seconds.
 Step 2 completed step in 8.43 seconds.
 Step 3 completed step in 6.38 seconds.
Completed table 5 in 21.57 seconds with 3712952071 / 4294339126 entries ( 86.46% ).
Compressing tables 6 and 7...
 Step 1 completed step in 6.40 seconds.
 Step 2 completed step in 9.27 seconds.
 Step 3 completed step in 6.76 seconds.
Completed table 6 in 22.43 seconds with 4293669401 / 4293669401 entries ( 100.00% ).
Serializing P7 entries
Completed serializing P7 entries in 3.96 seconds.
Completed Phase 3 in 100.63 seconds
Completed Plot 1 in 238.12 seconds ( 3.97 minutes )

/mnt/ramdisk/plot-k32-c07-2023-06-25-12-20-9d570807b00deb336f9df9deffbec12e262ff9608d2779ce714d6a0e08841a36.plot.tmp -> /mnt/ramdisk/plot-k32-c07-2023-06-25-12-20-9d570807b00deb336f9df9deffbec12e262ff9608d2779ce714d6a0e08841a36.plot
Completed writing plot in 79.58 seconds

Compared with your timings, I feel like the GPU is the bottleneck. It could be a driver issue, but…
According to this website: NVIDIA Tesla M40 Specs | TechPowerUp GPU Database
the RTX 2080 performs at about 180% of the Tesla M40. Even with that performance difference, you should still see timings of around 6-8 minutes per plot.
For some reason your GPU is clearly underperforming. I hope this helps you get on the right track with troubleshooting.
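
For anyone who wants to check the 6-8 minute figure, here is a rough sketch of the arithmetic. It takes the three phase totals from my log above and multiplies them by 1.8. The assumptions are mine: that plot time scales inversely with the TechPowerUp relative performance number, and that the number applies to my 2080 SUPER as well. Real plotting speed also depends on PCIe bandwidth, RAM and drivers, so treat it as a ballpark only. The function names are just for illustration.

```python
import re

# Phase summary lines copied from the log above (RTX 2080 SUPER, k32, C7).
LOG = """\
Completed Phase 1 in 120.93 seconds
Completed Phase 2 in 16.57 seconds
Completed Phase 3 in 100.63 seconds
"""

# Assumption: the RTX 2080 is ~1.8x a Tesla M40 (TechPowerUp relative
# performance), and plot time scales inversely with that figure.
RELATIVE_PERF = 1.8


def phase_seconds(log):
    """Return {phase number: seconds} for each 'Completed Phase N' line."""
    pattern = re.compile(r"Completed Phase (\d+) in ([\d.]+) seconds")
    return {int(n): float(s) for n, s in pattern.findall(log)}


def scaled_estimate(seconds, relative_perf=RELATIVE_PERF):
    """Naive estimate for the slower card: reference time * performance ratio."""
    return seconds * relative_perf


phases = phase_seconds(LOG)
total = sum(phases.values())
estimate = scaled_estimate(total)

print(f"RTX 2080 SUPER, sum of phases: {total:.2f} s")
print(f"Naive M40 estimate: {estimate:.0f} s ({estimate / 60:.1f} min)")
```

If you paste the phase lines from your own log into LOG, you can see which phase is furthest from the scaled estimate.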

Bladebit Chia Plotter
Version      : 0.0.0-dev
Git Commit   : unknown
Compiled With: gcc 11.3.0

[Global Plotting Config]
 Will create 1000 plots.
 Thread count          : 40
 Warm start enabled    : false
 NUMA disabled         : false
 CPU affinity disabled : false
 Farmer public key     : xxx
 Pool public key       : xxx
 Compression Level     : 7
 Benchmark mode        : disabled

[Bladebit CUDA Plotter]
Selected cuda device 0 : NVIDIA GeForce RTX 2080 SUPER
 CUDA Compute Capability   : 7.5
 SM count                  : 48
 Max blocks per SM         : 16
 Max threads per SM        : 1024
 Async Engine Count        : 3
 L2 cache size             : 4.00 MB
 L2 persist cache max size : 0.00 MB
 Stack Size                : 1.00 KB
 Memory:
  Total                    : 7.79 GB
  Free                     : 6.70 GB