There are a lot of knobs we can turn when trying to optimize plotting performance. So far, only a few have been tested in head-to-head benchmarks, and that information is scattered all over the place. I figured we should have a list of possible plotting performance tweaks we have benchmarked or should benchmark.
This list focuses on software/configuration matters since those are things you can adjust without buying anything.
- Buffer size
  - @codinghorror finds no clear difference between 4 GB and 8 GB.
  - Buckets are sorted with quicksort when not enough memory is available for uniform sort. Somewhere above the default (I used 4608 MB) uniform sort is always used. What the performance implications are, what the exact threshold is, and whether adding memory past it has any effect are unclear to me.
  - @markgibbons25 - Per the ChiaFarmer blog I use a buffer size of 3408 and 4 threads, and it always uses uniform sort.
- Thread count
  - @codinghorror finds huge improvement with 4 threads over 2.
  - @Blueoxx finds huge improvement with 4 over 2, minor improvement with 6 over 4, and a slowdown with 8 over 6, attributing the latter to his CPU’s core structure (5900X).
  - Whether going beyond 4 helps is unclear. Worth comparing on systems with a lot of excess hardware threads at their maximum sensible number of parallel plots. Note the thread count currently only applies to phase 1, so in all other phases each parallel plot uses just one thread.
- Sort buckets
  - Fewer buckets need more RAM (2x the RAM for half the buckets, I believe). The usual recommendation is to keep it at 128, but 64 seems worth trying for people who happen to have a lot of excess RAM anyway. Definitely do not sacrifice parallel plot count for fewer buckets.
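For anyone benchmarking these, all three knobs above are plain CLI flags, so runs are easy to script. A sketch assuming the chia 1.x `chia plots create` interface; the values and paths here are placeholders, not recommendations:

```shell
# Where each knob lives on the chia CLI (1.x-era flags):
#   -b  sort/buffer memory in MiB   (buffer size)
#   -r  thread count                (phase 1 threads)
#   -u  number of sort buckets      (sort buckets)
chia plots create -k 32 -n 1 \
  -b 3408 -r 4 -u 128 \
  -t /mnt/plot-tmp -d /mnt/plot-dst
```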
- Windows vs. Linux vs. macOS
  - Test in an equal environment (NTFS, no software RAID) to compare the plotting code itself; test in practical environments for practical purposes.
- Linux filesystems
  - Impact of disabling journaling in Ext4 (no performance guide recommends this, but that’s because it’s too risky for any use case except ours)
  - Ext4 vs. XFS vs. Btrfs
  - RAID0 vs. separate filesystems
  - RAID stripe width / filesystem block size
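If someone wants to benchmark these, here is roughly what the setup looks like on Linux. The mdadm/mke2fs flags are standard, but the device names and the 512 KiB chunk size are assumptions on my part. And to be clear: a journal-less Ext4 can lose the whole filesystem on a crash, which is only acceptable for throwaway plot temp space:

```shell
# RAID0 across two NVMe drives with an explicit chunk size
# (512 KiB is an assumption; this is one of the knobs to sweep):
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512K \
  /dev/nvme0n1 /dev/nvme1n1

# Ext4 without a journal, with stride/stripe-width aligned to the chunk:
# 512 KiB chunk / 4 KiB block = 128 blocks; two data drives -> 256.
mkfs.ext4 -O ^has_journal -E stride=128,stripe-width=256 /dev/md0

# XFS detects md stripe geometry on its own:
# mkfs.xfs /dev/md0
```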
- Continuous TRIM vs. frequent periodic TRIM
  - The usual Linux distribution default (TRIM once a week) is obviously bad for plotting, but these two strategies both sound sensible.
  - May vary by filesystem, disk and firmware version.
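The two strategies translate into a mount option vs. a recurring fstrim job. A sketch; the 30-minute interval is a pure guess and is itself something to benchmark:

```shell
# Continuous TRIM: discard requests issued inline by the filesystem.
mount -o discard /dev/md0 /mnt/plot-tmp

# Frequent periodic TRIM instead: mount WITHOUT 'discard' and run fstrim
# on a timer, e.g. every 30 minutes from cron:
#   */30 * * * * root /sbin/fstrim /mnt/plot-tmp
# One-off manual run, with a report of how much was trimmed:
fstrim -v /mnt/plot-tmp
```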
- poll_queues NVMe driver parameter on Linux
  - Intel recommends tweaking this in an Optane performance guide. I’ve heard the suggestion to set it to the number of CPU cores. I believe this feature is off by default.
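For the record, setting it looks something like this (the value 8 is a placeholder; the suggestion I heard was the number of CPU cores). One caveat a benchmark should control for: as far as I know, polled queues are only used by I/O submitted in polling mode, so whether the plotter's I/O path touches them at all is an open question:

```shell
# Persist the module parameter; the nvme module must be reloaded
# (or the initramfs regenerated and the box rebooted) for it to apply:
echo "options nvme poll_queues=8" > /etc/modprobe.d/nvme-poll.conf

# Alternatively, on the kernel command line: nvme.poll_queues=8

# Verify the live value:
cat /sys/module/nvme/parameters/poll_queues
```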
- Impact of CPU side channel mitigations
  - Defaults vs. the mitigations=off kernel parameter on Linux. Results are only valid for one CPU microarchitecture.
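Setting this on a GRUB-based distro looks like the following; note `update-grub` is Debian/Ubuntu-specific and other distros regenerate the config differently:

```shell
# In /etc/default/grub, append the parameter to the kernel command line:
#   GRUB_CMDLINE_LINUX_DEFAULT="... mitigations=off"
# then regenerate the GRUB config and reboot:
update-grub

# After reboot, confirm which mitigations are actually disabled:
grep -r . /sys/devices/system/cpu/vulnerabilities/
```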
- Core pinning with parallel plots
  - No pinning vs. allow one hyperthread of each core vs. allow one hyperthread of a subset of cores (on many-core systems)
  - Note you should definitely pin each process to one CPU on multi-CPU systems and make sure each process only uses memory from one NUMA node on NUMA systems (multi-CPU and Threadripper)
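A sketch of what pinning looks like with the standard tools; the core and node numbers are assumptions about your topology, so check `lscpu` and `numactl --hardware` first:

```shell
# Pin one plotting process to cores 0-3 (placeholder core list):
taskset -c 0-3 chia plots create -k 32 -t /mnt/tmp0 -d /mnt/dst

# On NUMA systems (multi-CPU, Threadripper), bind CPU and memory
# to the same node so the process never crosses nodes:
numactl --cpunodebind=0 --membind=0 \
  chia plots create -k 32 -t /mnt/tmp0 -d /mnt/dst
```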
Most of these most likely have little, if any, effect in practice. But you never know.
Unfortunately this will have to be an Ideas Guy post, as I did my plotting on rented hardware I’ve already returned and own no useful hardware myself. But hey, someone had to write it…
I’d like this to be a community resource so I’m happy to update this post with any additions or corrections you guys have and I would not mind moderators making content edits either.