Threadripper Lackluster Performance

Hi friends, new to the forum but not to Chia…

My mind is spinning with different avenues for investigation and I am hoping some smarter people can give me a course of action for tuning my system. I have two plotter systems (one Intel and one AMD). Currently the Intel is greatly outperforming the AMD system despite having fewer cores/threads, equal RAM, and one less 2 TB NVMe. I will list out the specs:

Intel Plotter (25 plots per day w/o additional tuning)
CPU: Intel i9-10900 (10 cores/20 threads) @ 2.8 GHz
Mobo: MSI Z490-A Pro
RAM: 64GB (32x2) DDR4 3600 Corsair Ripjaw
NVME(s): 2x 2TB Samsung 970 EVO Plus
PSU: 750w Thermaltake gold rated
OS: Ubuntu 21.04

Threadripper Plotter (25 per day…looking to improve)
CPU: AMD 2970WX (24 cores/48 threads) @ 3.0 GHz
Mobo: ASRock X399M Taichi
RAM: 64GB (32x2) DDR4 3600 Corsair Ripjaw
NVME(s): 3x 2TB Samsung 970 EVO Plus
PSU: Corsair RM850x
OS: Ubuntu 21.04

With the AMD at 4 threads per phase 1 job (max 2 phase 1 jobs per NVMe) and 15 plots max in parallel, I am running into completion times of 12-14 hours. In comparison, my Intel runs 8 jobs in parallel with the same phase 1 restriction and finishes in around 8-9 hours. I don’t see how that is plausible given the obvious disparity in power.

What I’ve gathered so far from reading other threads and stack overflow:

  • Check that the RAM frequency is set correctly in the BIOS
  • Check that my CPU clock speed is actually as advertised (I am in the process of determining this)
  • Check that the NVMe speeds aren’t being throttled down to SATA speeds somehow (a rough sanity check for the last two items is sketched right below this list)
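
For the last two items, here’s a rough sketch of the kind of check I’m planning to run on the Ubuntu box; it just reads /proc/cpuinfo for per-core clocks and the NVMe PCIe link info from sysfs (standard Linux paths assumed, adjust as needed):

    #!/usr/bin/env python3
    """Rough sanity check: per-core clock speeds and NVMe PCIe link status on Linux.

    Standard /proc and /sys layouts assumed (Ubuntu). Run it while the plotter is
    busy so the all-core clocks under load are visible.
    """
    import glob
    import re
    from pathlib import Path

    def core_clocks_mhz():
        """Return the current 'cpu MHz' value of every logical core from /proc/cpuinfo."""
        text = Path("/proc/cpuinfo").read_text()
        return [float(m) for m in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", text, re.MULTILINE)]

    def nvme_links():
        """Yield (device, link_speed, link_width) for each NVMe controller, if exposed."""
        for dev in sorted(glob.glob("/sys/class/nvme/nvme*")):
            pci = Path(dev) / "device"
            try:
                speed = (pci / "current_link_speed").read_text().strip()
                width = (pci / "current_link_width").read_text().strip()
            except OSError:
                speed, width = "unknown", "unknown"
            yield Path(dev).name, speed, width

    if __name__ == "__main__":
        clocks = core_clocks_mhz()
        if clocks:
            print(f"cores: {len(clocks)}  min/avg/max MHz: "
                  f"{min(clocks):.0f}/{sum(clocks) / len(clocks):.0f}/{max(clocks):.0f}")
        for name, speed, width in nvme_links():
            # A healthy Gen3 x4 drive should report 8.0 GT/s at width 4; anything
            # lower suggests a throttled link (shared lanes, wrong M.2 slot, etc.).
            print(f"{name}: link speed {speed}, width x{width}")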

Things I’ve done so far to try to troubleshoot / improve performance:

  • Increased thread count for phase 1 from 4 to 6
  • Raised the per-plot RAM buffer from 6000 to 8000 MB
  • Lowered the max number of phase 1 plots per NVMe from 3 to 2
  • Increased the stagger between plots on the same NVMe from 90 minutes to 120 minutes, and offset the job start times (1 job per NVMe) from 0 minutes to 30 minutes.
  • Upgraded CPU fans to a Noctua air cooling setup
  • Installed heat sinks for each NVME

I am struggling with next steps. I don’t want to halt plot production on my threadripper for a bunk test. Based on the provided information, is there a direction you’d begin investigating? Do I have a hardware bottleneck based on my mobo and/or CPU? I greatly appreciate any insight this great group has to offer!

TL;DR: Intel faster than Threadripper, what do?

That Intel chip is way faster than that Threadripper clock for clock, since the 2970WX is based on the older Zen+ architecture, if I am not mistaken.

Watch this, where he does a comparison between the Intel 9900K and the Ryzen 3900X, etc.


I don’t know, but those results seem fairly reasonable. I have a 3955WX (15/32), 80 GB RAM, 3x 1 TB and 1x 2 TB quality NVMe SSDs (2x PCIe 4.0, 2x PCIe 3.0; it doesn’t matter, the times are nearly identical) and have tried many combinations, with results seemingly scaling similarly. It seems like something should cause results to improve: more/faster SSDs, more memory, fewer parallel plots, etc. The reality is not that; I have done those things. I seem to be at a limit of about 28 plots/day. When I change one component, the others seemingly morph to compensate, limiting output. The only possible reasoning I can come up with is:

  1. More cores with more plots and threads means more ‘friction’ in managing the overall scheduling of the system, slowing overall speed.
  2. For whatever reason, as additional load is introduced, it appears to slow other workloads, perhaps due to #1 or some other factor such as overall memory bandwidth. I’ve cooled everything, so thermals are not an issue.
  3. I don’t think it is SSD speed, as I have tried 2, 3, and 4 SSDs at a time, staggering many ways to level out SSD loading; again, no improvement.
  4. If I do only a few plots at a time, I can get really decent plot times; the results simply do not scale well. But they should scale better, one would imagine. Again, fighting reality.

My experience with my Chia plotting (R9 3900X and i7-5820K):

Chia only uses more than 1 thread in phase 1. So for roughly 60% of the time it is limited entirely by single-thread performance (including memory latency) or disk I/O. And even in the multithreaded phase 1, it doesn’t scale much past 3-4 threads. Neither is great for a very "wide" beast like a Threadripper.

  1. All-core load frequency matters. My clocks dip the more I load up the CPU on AMD.
    1b. Per-core IPC matters too. The 10900 has way higher IPC and all-core speeds than any Threadripper at the moment (the new ones will crush it once released).
  2. Intel 10xxx and older have low memory latency (11th gen actually regressed on memory latency). Not sure if this is an issue, but a Ryzen benefits from faster RAM (and therefore a faster Infinity Fabric).
  3. The 10900 has ample cores for the multithreaded phase 1… and they are fast.

So all in all, per-plot speed on a TR sucks compared to a 10900. But if you throw enough plotting NVMe and RAM at a TR, it will produce more plots per day through more parallel plotting jobs, despite the per-plot speed difference.
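
As a toy illustration of that trade-off (just back-of-the-envelope arithmetic using the rough numbers quoted in this thread as assumptions, not benchmarks):

    # Toy throughput model: plots/day = parallel jobs * 24 / hours per plot.
    # The inputs below are the rough figures quoted earlier in this thread, not benchmarks.
    def plots_per_day(parallel_jobs: int, hours_per_plot: float) -> float:
        return parallel_jobs * 24 / hours_per_plot

    # i9-10900: 8 parallel jobs finishing in ~8.5 h each
    print(f"Intel:        {plots_per_day(8, 8.5):.1f} plots/day")
    # 2970WX: 15 parallel jobs finishing in ~13 h each
    print(f"Threadripper: {plots_per_day(15, 13):.1f} plots/day")
    # Despite slower per-plot times, enough extra parallel jobs can pull the TR ahead.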


Thank you all.

My naivete about chips is clear. I misunderstood some fundamental things and I have some reading to do! Chia is great for working out some sorely needed knowledge (in this case hardware and Linux as a whole)

Thank you very much for the illuminating insights!


Well, one thing is apparent: your 2970WX is a quad-channel CPU and you are only running 2 sticks of RAM. This will be a big performance hit on your Threadripper. I haven’t looked deeply into your RAM spec, but if it matches your Intel machine as shown above, you can quickly borrow that RAM to test the results.
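
If you want to double-check how many slots are actually populated without opening the case, something along these lines should work on Linux; it just parses dmidecode output, so it needs root and the dmidecode package, and the parsing is only a rough sketch:

    #!/usr/bin/env python3
    """Count populated DIMM slots by parsing `dmidecode --type memory` (needs root)."""
    import subprocess

    out = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout

    populated, empty = [], []
    slot, size = None, None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("Locator:"):
            slot = line.split(":", 1)[1].strip()
        elif line.startswith("Size:"):
            size = line.split(":", 1)[1].strip()
        elif not line and slot is not None:
            (empty if size == "No Module Installed" else populated).append(slot)
            slot, size = None, None
    if slot is not None:
        (empty if size == "No Module Installed" else populated).append(slot)

    print(f"populated DIMM slots ({len(populated)}): {populated}")
    print(f"empty DIMM slots     ({len(empty)}): {empty}")
    # On a quad-channel 2970WX you want at least 4 sticks, spread across the channels.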


Oh snap, I can’t believe I didn’t notice this! Super lucky my wife supports my hobby; she is running out to get 4x 16 GB sticks of RAM.

I read up on Threadripper performance in dual-channel vs quad-channel, and I agree with you.

Just shows that I really need to sit and research better when building a system.


My TRipper is a 3960X. I can confirm this. More NVME and memory. Don’t think fast, think Mississippi river. Long, wide, deep.


This makes a lot of sense after all the tinkering and information others have provided. I’ll be able to do many more in parallel, provided I get a few more nvmes to handle the temp storage space.

Love the analogy!

+1 for Mississippi

Still tuning my purpose-built system, with a secondary fallback use case as a gaming/render rig:

AMD Ryzen Threadripper 3960X
PCIe 4.0 NVMe and 128 GB RAM.
Cinebench R23: 34k
Cost in early May: somewhat below EUR 3,800, without the migrated GPU.

So for a 24-core Threadripper at about 4 GHz with DDR4-3600:
Max was 70 plots/day with a fully striped dynamic disk configuration.
Anything between 20 and 36 staggered concurrent plots of 2 threads seems to yield around 60-70.
Under 20 concurrent, or with anything more than 2 threads or fewer than 128 buckets, overall daily output only went down.

Keeping total CPU load around 80% seemed to yield the best overall results.
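
For anyone wanting to verify that on their own box, a tiny monitoring loop like this is what I’d use to see whether total load really hovers near that 80% mark while the plotters run (assumes the psutil package; works on Windows and Linux):

    # Minimal sketch: log total CPU utilisation once a minute so you can see
    # whether the plotting schedule keeps the machine near that ~80% sweet spot.
    # Requires: pip install psutil
    import time
    import psutil

    while True:
        load = psutil.cpu_percent(interval=60)  # blocks and averages over the interval
        print(f"{time.strftime('%H:%M:%S')}  total CPU: {load:.0f}%")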

Two things I still want to investigate:

  • Whether running plots single-threaded might make a difference by evening out CPU load for maximum throughput across all cores, which seems to be the primary bottleneck in this configuration. (If anyone has a spare octa-channel Threadripper Pro 3995WX, just let me know.)

  • Whether plotting to 4 individually addressed NVMes is overall faster than the Windows dynamic disk software stripe, since I noticed relative disk I/O per plot seemed to drop significantly well before the CPU cores were saturated; disk load was only showing around 30% even at 30+ concurrency.

But I’m happy with anything 60+, and any configuration change requiring a reboot is quite an 8 h hit in the netspace race. The hard limit seems to be around 80 plots/day, given the observed best single/low-concurrency runs clock in somewhere above 7 hours (24 h / ~7 h per plot × 24 cores ≈ 80). So I wouldn’t change anything if I were getting consistently above 70.

Hello,

Could you tell us how many NVMe SSDs you have? The RAM used for each plot? How do you manage to get around 60-70 plots/day?

thks

I did some tests with my Threadripper. While my NVMes are currently working great, I still had extra cores and RAM to utilize, so I started plotting directly to the HDDs. Interestingly, I am getting OK speeds and the added benefit of more plots in parallel.

Oh, apparently I edited this out accidentally:
it’s 4x 2 TB,
with two 980 Pros and two equivalently performing Samsung OEM drives.

3 of them are on the motherboard, one is in a PCIe adapter card; all perform to spec.

PCIe 4.0 vs 3.0 speeds seem to make quite a difference.
Striped vs 4 individual drives seems to make no huge difference in practice, as this still looks largely bottlenecked by the CPU. I will probably switch it back to striped, since 10 GB+/s single-file read speeds are something to behold.

8x 16 GB RAM

So make sure you’re not throttling your Gen4 NVMe with a PCIe 3.0 motherboard, or starving your CPU with empty memory channels.

With the config below, plots clock in at about 12 hours each, so reliably in the high 60s/day for me when going for 36 concurrency.

16 phase 1 plots filling up the 8 cores not blocked by the 16 phase 2+ plots seems to be the best balance to me. Whether you run phase 1 single-threaded or with 2 threads doesn’t seem to make much of a difference, but I think I notice less overhead in single-threaded runs now.

My current working theory for CPU-limited, highly parallel plotting, given that phase 1 is not maxing out the cores:
Keep the concurrent thread count at about 1.5x the physical core count at any time, with evenly staggered plotting.
Make sure half of them are in phase 1, so later-phase threads don’t exceed 2/3 of the physical core count.
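
To put that rule of thumb into numbers (just arithmetic on the core count, nothing measured):

    # Rough sizing from the "1.5x physical cores, half of them in phase 1" rule of thumb.
    def suggest_concurrency(physical_cores: int, phase1_threads: int = 1) -> dict:
        total_threads = int(physical_cores * 1.5)              # ~1.5x physical cores in flight
        phase1_plots = (total_threads // 2) // phase1_threads  # half the threads go to phase 1
        later_plots = total_threads // 2                       # phases 2-4 are single-threaded
        return {
            "total_concurrent_plots": phase1_plots + later_plots,
            "phase1_plots": phase1_plots,
            "later_phase_plots": later_plots,
        }

    # 24-core 3960X with single-threaded phase 1, as described above:
    print(suggest_concurrency(24, phase1_threads=1))
    # -> {'total_concurrent_plots': 36, 'phase1_plots': 18, 'later_phase_plots': 18}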

Here is one of the 4 current swar-chia-plot-manager job configs for reference. Set minimum_minutes_between_jobs to something like 6 minutes.

     - name: main1
       max_plots: 9999
       farmer_public_key:
       pool_public_key:
       temporary_directory: P:\plot
       temporary2_directory:
       destination_directory: F:\
       size: 32
       bitfield: true
       threads: 1                             # threads per plot (only phase 1 is multithreaded)
       buckets: 128
       memory_buffer: 3800                    # MiB of RAM per plot
       max_concurrent: 9
       max_concurrent_with_start_early: 12
       stagger_minutes: 72                    # minutes between plot starts within this job
       max_for_phase_1: 8                     # cap on plots in phase 1 at once for this job
       concurrency_start_early_phase: 5
       concurrency_start_early_phase_delay: 1
       temporary2_destination_sync: false

To speed up the launch phase:

  • set stagger_minutes to 0 for each job
  • run for about two hours, or until you hit 24 concurrency
  • set it back to 72 minutes (slightly below total plot time / concurrency per job)
  • then restart the plot manager

Or just have a second config file with this boost config and swap/rename them accordingly.
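
If you don’t want to juggle two config files, here is a rough sketch of scripting the toggle instead. It assumes the usual swar config.yaml layout with a top-level jobs: list (as in the snippet above), requires PyYAML, and the script name is just an example; keep a backup of your config first:

    #!/usr/bin/env python3
    """Toggle stagger_minutes in a swar plot manager config.yaml.

    Usage:
        python toggle_stagger.py config.yaml 0    # boost: no stagger
        python toggle_stagger.py config.yaml 72   # back to the normal stagger
    Assumes the config has a top-level 'jobs:' list like the snippet above.
    Note that rewriting the file drops any YAML comments, so keep a backup.
    Restart the plot manager afterwards for the change to take effect.
    """
    import sys
    import yaml  # pip install pyyaml

    def set_stagger(path: str, minutes: int) -> None:
        with open(path) as fh:
            config = yaml.safe_load(fh)
        for job in config.get("jobs", []):
            job["stagger_minutes"] = minutes
        with open(path, "w") as fh:
            yaml.safe_dump(config, fh, sort_keys=False)

    if __name__ == "__main__":
        set_stagger(sys.argv[1], int(sys.argv[2]))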


Couldn’t he have 2 jobs: one with 24 total plots and zero stagger, and then job 2 would be the constant-flow job with a 72 min stagger?

I haven’t tried this method myself, as once it’s “flowing” I only tweak things and restart management.py.

OK, one lesson: don’t do the AMD motherboard/hardware RAID. No matter what cache or block size, it’s slower for a massively parallel load like plotting. It saturates at around 10 parallel plots already.

much like the 4K-64Thread performance chart here:

Haven’t tried RAID 0 yet myself, but there’s a Reddit thread where someone is advocating it and getting great results. The linked article only used 3 Gen 3 SSDs.

The single-threaded numbers look nice; I get 20 GiB/s plus in sequential reads with CrystalDiskMark on a 4x RAID 0 of 980 Pros. But the queue clogs up for me at around 16 plots doing concurrent random-access ops already, which doesn’t happen with direct access or software striping, which still manages well above 10 GiB/s read speeds.

This is interesting and very timely info for me, because I was going to start trying hardware RAID 0 tonight (with an ASUS motherboard and AMD CPU).

I wonder if the guy in the Reddit thread was using different hardware. Might have even been software RAID (I don’t remember).

So with my hardware, would you suggest software RAID or individual disks? Do you have enough RAM to try a RAM disk?