- NVMe performance.
Most of those benchmarks compare drive performance on large-chunk reads/writes. That is where the headline ~3k MB/s numbers come from, and it is also where we see the big drop-off once the cache (on drives that employ one) is exhausted.
However, when reads/writes are done in smaller chunks, performance drops to 1/5th or 1/10th of that 3k value right off the bat, and the differences between good and bad drives are no longer that pronounced.
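A quick way to see that small-chunk penalty is a microbenchmark like the sketch below (file size, chunk sizes, and temp-file location are arbitrary choices of mine). It mostly exercises the page cache rather than the drive, so treat it as an illustration of per-request overhead; for real device numbers, a purpose-built tool such as fio with direct I/O is the better choice:

```python
# Hedged sketch: time reads of the same file with large sequential chunks vs.
# small random chunks. Sizes are arbitrary; this mostly hits the page cache,
# so treat it as an illustration of per-request overhead, not a drive benchmark.
import os
import random
import tempfile
import time

def read_throughput(path, block_size, random_access=False, total=16 * 1024 * 1024):
    """Read `total` bytes from `path` in `block_size` chunks; return MB/s."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        done = 0
        start = time.perf_counter()
        while done < total:
            if random_access:
                os.lseek(fd, random.randrange(size - block_size), os.SEEK_SET)
            chunk = os.read(fd, block_size)
            if not chunk:                       # hit EOF on a sequential pass
                os.lseek(fd, 0, os.SEEK_SET)
                continue
            done += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return done / elapsed / 1e6

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(32 * 1024 * 1024))   # 32 MiB scratch file
        path = f.name
    try:
        big = read_throughput(path, 1024 * 1024)                 # 1 MiB chunks
        small = read_throughput(path, 4096, random_access=True)  # 4 KiB chunks
        print(f"1 MiB sequential: {big:.0f} MB/s, 4 KiB random: {small:.0f} MB/s")
    finally:
        os.unlink(path)
```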
As also mentioned, BOM cost reductions, even by major manufacturers (WDC / Samsung), are basically crippling those “good” drives once the dust settles after the initial rush to test them. So it is really hard to get reliable stats for the current models.
Of course, the TBW rating is usually better on better drives, so that is something to consider as well.
That said, I am not suggesting using those inferior products at all. All my NVMes are Samsung 970 Evo Plus, plus a couple of WDC Blacks. Those were purchased early, so they basically match the benchmarks that are out there.
As @Fuzeguy stated, I also gave up on using RAID for NVMes, as I didn’t see much gain, if any. Sure, RAID spreads the TBW, but my take is that it is better to wear out one drive and buy another than end up with two half-dead ones.
Actually, I tried using NVMe 1 for T1, NVMe 2 for Stage / Dest, and RAM for T2, but having that extra NVMe for Stage / Dest really didn’t change plotting speeds at all. So, I gave up on that route.
Also, I didn’t say that NVMe performance is not an issue, but rather that you shouldn’t mix all the options at once: the performance factors are interdependent, and gains seen in isolation may not translate to the final setup. Therefore, I would start with a basic setup, get it nailed down, and only then add another component (e.g., play with NVMe at that point). Work on one problem at a time.
- CPU
Affinity, etc. with dual processors is not that simple. Each processor has its own directly connected RAM and PCIe slots (i.e., NVMe), so the best performance comes when a processor uses only those. When a processor has to reach across to the other processor to grab extra RAM or PCIe lanes, there is a penalty for that (the NUMA penalty).
This also means that overprovisioning the number of threads, so the extra ones can reach across processors for a given MM instance (while the other instance is busy with different resources), is also penalized, as at some point those resources need to be shared with the first instance again.
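A minimal sketch of keeping a process on one processor's cores, assuming a hypothetical layout where node 0 owns CPUs 0-15 (the real node-to-CPU mapping comes from `numactl --hardware`). This is Linux-only and only pins the CPUs; it does not bind memory:

```python
# Hedged sketch: restrict the current process to one processor's cores so it
# stays on that node's local RAM/NVMe path. The CPU range below is a made-up
# example; get the real node-to-CPU mapping from `numactl --hardware`.
import os

def pin_to_cpus(cpus):
    """Pin this process to the given CPUs (Linux); return the resulting set."""
    target = set(cpus) & os.sched_getaffinity(0)  # skip CPUs this box lacks
    if target:
        os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    node0_cpus = range(0, 16)  # assumption: NUMA node 0 owns CPUs 0-15
    print("running on CPUs:", sorted(pin_to_cpus(node0_cpus)))
```

Note that `sched_setaffinity` handles CPU placement only; launching the process under `numactl --cpunodebind=0 --membind=0` would keep the RAM allocations on the local node as well.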
Shifting MM instances has a big impact if there is just one processor in the box. The whole concept of shifting came about because that crap original Chia plotter could barely use one thread, so multiple instances were really needed, and staggering them was a game changer. Sure, MM is busy with different resources during different phases, but again, that is fine-tuning that should be done once the basics are worked out.
On a dual-processor box, shifting MM instances most likely doesn’t buy that much, as the assumption is that each instance runs on its own (directly connected) RAM and NVMe.
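To illustrate that “one instance per node” idea, here is a hedged sketch that starts one worker per NUMA node, each pinned to its own node’s cores. The node-to-CPU map is a made-up example (the real one comes from `numactl --hardware`), and the placeholder work would be one MM plotter instance using that node’s local NVMe:

```python
# Hedged sketch: one worker per NUMA node, each pinned to that node's cores,
# instead of shifting instances around. The node-to-CPU map is hypothetical.
import multiprocessing as mp
import os

NODE_CPUS = {0: set(range(0, 16)), 1: set(range(16, 32))}  # assumed layout

def node_worker(node, out_q):
    local = NODE_CPUS[node] & os.sched_getaffinity(0)  # CPUs this box has
    if local:
        os.sched_setaffinity(0, local)  # stay on this node's cores
    # ... the real work (one plotter instance) would run here ...
    out_q.put((node, sorted(os.sched_getaffinity(0))))

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=node_worker, args=(n, q)) for n in NODE_CPUS]
    for p in procs:
        p.start()
    for _ in procs:
        node, cpus = q.get(timeout=30)
        print(f"node {node} worker pinned to CPUs {cpus}")
    for p in procs:
        p.join()
```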