There are a lot of knobs we can turn when trying to optimize plotting performance. So far, only a few have been tested in head-to-head benchmarks, and that information is scattered all over the place. I figured we should have a list of possible plotting performance tweaks we have benchmarked or should benchmark.
This list focuses on software/configuration matters since those are things you can adjust without buying anything.
- Buffer size
  - @codinghorror finds no clear difference between 4 GB and 8 GB.
  - Buckets are sorted with quicksort when not enough memory is available for uniform sort. Somewhere above the default (I used 4608 MB) uniform sort is always used. What the performance implications are, what the exact threshold is, and whether adding memory past it has any effect are unclear to me.
  - @markgibbons25 - Per the ChiaFarmer blog I use a buffer size of 3408 and 4 threads, and it always uses uniform sort.
- Thread count
  - @codinghorror finds huge improvement with 4 threads over 2.
  - @Blueoxx finds huge improvement with 4 over 2, minor improvement with 6 over 4, and a slowdown with 8 over 6, attributing the latter to his CPU’s core structure (5900X).
  - Whether going beyond 4 helps is unclear. Worth comparing on systems with a lot of excess hardware threads at their maximum sensible number of parallel plots. Note the thread count currently only applies to phase 1, so in all other phases each parallel plot uses just one thread.
- Sort buckets
  - Fewer buckets need more RAM (2x the RAM for half the buckets, I believe). The usual recommendation is to keep it at 128, but 64 seems worth trying for people who happen to have a lot of excess RAM anyway. Definitely do not sacrifice parallel plot count for fewer buckets.
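For anyone benchmarking these, all three knobs above are plain CLI flags, so runs are easy to script. A sketch assuming the chia 1.x `chia plots create` interface; the values and paths here are placeholders, not recommendations:

```shell
# Where each knob lives on the chia CLI (1.x-era flags):
#   -b  sort/buffer memory in MiB   (buffer size)
#   -r  thread count                (phase 1 threads)
#   -u  number of sort buckets      (sort buckets)
chia plots create -k 32 -n 1 \
  -b 3408 -r 4 -u 128 \
  -t /mnt/plot-tmp -d /mnt/plot-dst
```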
- Windows vs. Linux vs. macOS
  - Test in an equal environment (NTFS, no software RAID) to compare the plotting code itself; test in practical environments for practical purposes.
- Linux filesystems
  - Impact of disabling journaling in Ext4 (no performance guide recommends this, but that’s because it’s too risky for any use case except ours)
  - Ext4 vs. XFS vs. Btrfs
  - RAID0 vs. separate filesystems
  - RAID stripe width / filesystem block size
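If someone wants to benchmark these, here is roughly what the setup looks like on Linux. The mdadm/mke2fs flags are standard, but the device names and the 512 KiB chunk size are assumptions on my part. And to be clear: a journal-less Ext4 can lose the whole filesystem on a crash, which is only acceptable for throwaway plot temp space:

```shell
# RAID0 across two NVMe drives with an explicit chunk size
# (512 KiB is an assumption; this is one of the knobs to sweep):
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512K \
  /dev/nvme0n1 /dev/nvme1n1

# Ext4 without a journal, with stride/stripe-width aligned to the chunk:
# 512 KiB chunk / 4 KiB block = 128 blocks; two data drives -> 256.
mkfs.ext4 -O ^has_journal -E stride=128,stripe-width=256 /dev/md0

# XFS detects md stripe geometry on its own:
# mkfs.xfs /dev/md0
```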
- Continuous TRIM vs. frequent periodic TRIM
  - The usual Linux distribution default (TRIM once a week) is obviously bad for plotting, but these two strategies both sound sensible.
  - May vary by filesystem, disk and firmware version.
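The two strategies translate into a mount option vs. a recurring fstrim job. A sketch; the 30-minute interval is a pure guess and is itself something to benchmark:

```shell
# Continuous TRIM: discard requests issued inline by the filesystem.
mount -o discard /dev/md0 /mnt/plot-tmp

# Frequent periodic TRIM instead: mount WITHOUT 'discard' and run fstrim
# on a timer, e.g. every 30 minutes from cron:
#   */30 * * * * root /sbin/fstrim /mnt/plot-tmp
# One-off manual run, with a report of how much was trimmed:
fstrim -v /mnt/plot-tmp
```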
- poll_queues NVMe driver parameter on Linux
  - Intel recommends tweaking this in an Optane performance guide. I’ve heard the suggestion to set it to the number of CPU cores. I believe this feature is off by default.
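For the record, setting it looks something like this (the value 8 is a placeholder; the suggestion I heard was the number of CPU cores). One caveat a benchmark should control for: as far as I know, polled queues are only used by I/O submitted in polling mode, so whether the plotter's I/O path touches them at all is an open question:

```shell
# Persist the module parameter; the nvme module must be reloaded
# (or the initramfs regenerated and the box rebooted) for it to apply:
echo "options nvme poll_queues=8" > /etc/modprobe.d/nvme-poll.conf

# Alternatively, on the kernel command line: nvme.poll_queues=8

# Verify the live value:
cat /sys/module/nvme/parameters/poll_queues
```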
- Impact of CPU side channel mitigations
  - Defaults vs. the mitigations=off kernel parameter on Linux. Results are only valid for one CPU microarchitecture.
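Setting this on a GRUB-based distro looks like the following; note `update-grub` is Debian/Ubuntu-specific and other distros regenerate the config differently:

```shell
# In /etc/default/grub, append the parameter to the kernel command line:
#   GRUB_CMDLINE_LINUX_DEFAULT="... mitigations=off"
# then regenerate the GRUB config and reboot:
update-grub

# After reboot, confirm which mitigations are actually disabled:
grep -r . /sys/devices/system/cpu/vulnerabilities/
```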
- Core pinning with parallel plots
  - No pinning vs. allow one hyperthread of each core vs. allow one hyperthread of a subset of cores (on many-core systems)
  - Note you should definitely pin each process to one CPU on multi-CPU systems and make sure each process only uses memory from one NUMA node on NUMA systems (multi-CPU and Threadripper)
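A sketch of what pinning looks like with the standard tools; the core and node numbers are assumptions about your topology, so check `lscpu` and `numactl --hardware` first:

```shell
# Pin one plotting process to cores 0-3 (placeholder core list):
taskset -c 0-3 chia plots create -k 32 -t /mnt/tmp0 -d /mnt/dst

# On NUMA systems (multi-CPU, Threadripper), bind CPU and memory
# to the same node so the process never crosses nodes:
numactl --cpunodebind=0 --membind=0 \
  chia plots create -k 32 -t /mnt/tmp0 -d /mnt/dst
```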
Most of these most likely have little, if any, effect in practice. But you never know.
Unfortunately this will have to be an Ideas Guy post, as I did my plotting on rented hardware I’ve already returned and own no useful hardware myself. But hey, someone had to write it…
I’d like this to be a community resource so I’m happy to update this post with any additions or corrections you guys have and I would not mind moderators making content edits either.