- NVMe performance.
Most of those benchmarks compare drive performance on large-chunk reads/writes. That is where the headline ~3k MB/s numbers come from, and it is also where we see the big drop-off once the cache (on drives that employ one) is exhausted.
However, when reads/writes are done in smaller chunks, performance drops to 1/5th or 1/10th of that 3k value right off the bat, and the differences between good and bad drives are no longer that pronounced.
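A quick way to see that small-chunk penalty is a microbenchmark like the sketch below (file size, chunk sizes, and temp-file location are arbitrary choices of mine). It mostly exercises the page cache rather than the drive, so treat it as an illustration of per-request overhead; for real device numbers, a purpose-built tool such as fio with direct I/O is the better choice:

```python
# Hedged sketch: time reads of the same file with large sequential chunks vs.
# small random chunks. Sizes are arbitrary; this mostly hits the page cache,
# so treat it as an illustration of per-request overhead, not a drive benchmark.
import os
import random
import tempfile
import time

def read_throughput(path, block_size, random_access=False, total=16 * 1024 * 1024):
    """Read `total` bytes from `path` in `block_size` chunks; return MB/s."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        done = 0
        start = time.perf_counter()
        while done < total:
            if random_access:
                os.lseek(fd, random.randrange(size - block_size), os.SEEK_SET)
            chunk = os.read(fd, block_size)
            if not chunk:                       # hit EOF on a sequential pass
                os.lseek(fd, 0, os.SEEK_SET)
                continue
            done += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return done / elapsed / 1e6

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(32 * 1024 * 1024))   # 32 MiB scratch file
        path = f.name
    try:
        big = read_throughput(path, 1024 * 1024)                 # 1 MiB chunks
        small = read_throughput(path, 4096, random_access=True)  # 4 KiB chunks
        print(f"1 MiB sequential: {big:.0f} MB/s, 4 KiB random: {small:.0f} MB/s")
    finally:
        os.unlink(path)
```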
As also mentioned, BOM cost reductions, even by major manufacturers (WDC / Samsung), are basically crippling those “good” drives once the dust settles after the initial rush to test them. So it is really hard to get reliable stats for the current models.
Of course, the TBW rating is usually better on better drives, so that is something to consider as well.
That said, I am not suggesting using those inferior products at all. All my NVMes are Samsung 970 Evo Plus, plus a couple of WDC Blacks. Those were purchased early, so they basically match the benchmarks that are out there.
As @Fuzeguy stated, I also gave up on using RAID for NVMes, as I didn’t see much gain, if any. Sure, RAID spreads the TBW, but my take is that it is better to wear out one drive and buy another than end up with two half-dead ones.
Actually, I tried using NVMe 1 for T1, NVMe 2 for Stage / Dest, and RAM for T2, but having that extra NVMe for Stage / Dest really didn’t change plotting speeds at all. So, I gave up on that route.
Also, I didn’t say that NVMe performance is not an issue, but rather that you shouldn’t mix all the options at once: the performance factors are interdependent, and gains seen in isolation may not translate to the final setup. Therefore, I would start with a basic setup, get it nailed down, and only then add another component (e.g., play with NVMe at that point). Work on one problem at a time.
- CPU
Affinity, etc. with dual processors is not that simple. Each processor has its own directly connected RAM and PCIe slots (i.e., NVMe), so the best performance comes when a processor uses only those. When a processor has to reach across to the other processor to grab extra RAM or PCIe lanes, there is a penalty for that (the NUMA penalty).
This also means that overprovisioning the number of threads, so the extra ones can reach across processors for a given MM instance (while the other instance is busy with different resources), is also penalized, as at some point those resources need to be shared with the first instance again.
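A minimal sketch of keeping a process on one processor's cores, assuming a hypothetical layout where node 0 owns CPUs 0-15 (the real node-to-CPU mapping comes from `numactl --hardware`). This is Linux-only and only pins the CPUs; it does not bind memory:

```python
# Hedged sketch: restrict the current process to one processor's cores so it
# stays on that node's local RAM/NVMe path. The CPU range below is a made-up
# example; get the real node-to-CPU mapping from `numactl --hardware`.
import os

def pin_to_cpus(cpus):
    """Pin this process to the given CPUs (Linux); return the resulting set."""
    target = set(cpus) & os.sched_getaffinity(0)  # skip CPUs this box lacks
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    node0_cpus = range(0, 16)  # assumption: NUMA node 0 owns CPUs 0-15
    print("running on CPUs:", sorted(pin_to_cpus(node0_cpus)))
```

Note that `sched_setaffinity` handles CPU placement only; launching the process under `numactl --cpunodebind=0 --membind=0` would keep the RAM allocations on the local node as well.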
Shifting MM instances has a big impact if there is just one processor in the box. The whole concept of shifting came about because that crap original Chia plotter could barely use one thread, so multiple instances were really needed, and staggering them was a game changer. Sure, MM is busy with different resources during different phases, but again, that is fine-tuning that should be done once the basics are worked out.
On a dual-processor box, shifting MM instances most likely doesn’t buy that much, as the assumption is that each instance runs on its own (directly connected) RAM and NVMe.
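To illustrate that “one instance per node” idea, here is a hedged sketch that starts one worker per NUMA node, each pinned to its own node’s cores. The node-to-CPU map is a made-up example (the real one comes from `numactl --hardware`), and the placeholder work would be one MM plotter instance using that node’s local NVMe:

```python
# Hedged sketch: one worker per NUMA node, each pinned to that node's cores,
# instead of shifting instances around. The node-to-CPU map is hypothetical.
import multiprocessing as mp
import os

NODE_CPUS = {0: set(range(0, 16)), 1: set(range(16, 32))}  # assumed layout

def node_worker(node, out_q):
    local = NODE_CPUS[node] & os.sched_getaffinity(0)  # CPUs this box has
    if local:
        os.sched_setaffinity(0, local)  # stay on this node's cores
    # ... the real work (one plotter instance) would run here ...
    out_q.put((node, sorted(os.sched_getaffinity(0))))

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=node_worker, args=(n, q)) for n in NODE_CPUS]
    for p in procs:
        p.start()
    for _ in procs:
        node, cpus = q.get(timeout=30)
        print(f"node {node} worker pinned to CPUs {cpus}")
    for p in procs:
        p.join()
```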