Threadripper IO bottleneck?

Hey guys, I could really use some help identifying the bottleneck here. I’m pretty sure I’m pushing my NVMEs too hard.

Threadripper 3970x
Asus Prime TRX40-Pro S
128 GB 3200
4 Samsung 970 EVO Plus 2TB (temp drives)
4 WD 18TB NAS drives (dest drives)

I have a max of 3 P1 jobs at a time across all 4 NVMe drives (each going to a separate dest drive). Max jobs per NVMe across all phases is 8, for a total of 31 jobs at a time (I’ve limited global jobs to 31).

Right now my times are around 12 hours for completion of a job. The first 4 jobs that start have a 2h 50min P1 time, and as more and more jobs get added to the drives that slows to 4h 33min. The first 4 jobs start simultaneously, one per NVMe, with a 30-minute stagger for subsequent jobs. I’ve tried stagger times as low as 12 minutes with pretty much the same results.

I’m assuming it’s the NVMEs themselves that are being overtaxed and not the PCIe lanes since the threadripper and chipset have so many.

If I replaced these 4 970s with 4 980 Pros, would that resolve the bottleneck and let me plot 31 jobs at full speed, since those drives are PCIe 4.0 rather than 3.0?

1 Like

Threadripper 3970x, 256 GB 3600, 8x Samsung 980 Pro 2TB, and 8x HDDs.

We have done something similar to you, attempting 32 jobs across all SSDs, staggered and so on. Same issues; your times are like ours.

We are testing some stuff today to try and solve the bottleneck. I can tell you that replacing the drives will NOT help.

Do you use Swar or Plotman?

1 Like

I’m using SWAR. The jobs use 4 threads, and I’ve tried 4 GB and 5 GB memory buffers with negligible difference.

I’m pretty surprised you get the same results across 8x 980s. How many jobs per drive? 4? What motherboard?

1 Like

Individual plot completion times are not a useful way to measure what the system is capable of. How many plots are you achieving per day? (count the file modified timestamps for a 24 hour period).
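If you’d rather script that count than eyeball file timestamps, here’s a minimal Python sketch; the destination paths are placeholders, so point them at your own drives:

import time
from pathlib import Path

# Count .plot files whose modified time falls within the last 24 hours.
# PLOT_DIRS is a placeholder -- list your actual destination folders here.
PLOT_DIRS = [Path(r"I:\ChiaPlot"), Path(r"J:\ChiaPlot")]
cutoff = time.time() - 24 * 60 * 60

recent = [p for d in PLOT_DIRS for p in d.glob("*.plot") if p.stat().st_mtime >= cutoff]
print(f"Plots finished in the last 24 h: {len(recent)}")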

8 parallel plots on a 2TB 970 EVO Plus seems too high for these consumer TLC drives. Try 5-6. Check the iowait to determine the optimum number before it starts to struggle (see link below).
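For the iowait check, something along these lines works if the plotter is on Linux with psutil installed (this is just a sketch; iowait isn’t exposed on Windows, so there you’d watch disk active time and queue length in Resource Monitor instead):

import psutil

# Sample CPU iowait every 5 seconds for about a minute (Linux only).
# Sustained high iowait while plotting means the temp SSDs, not the CPU, are the choke point.
for _ in range(12):
    cpu = psutil.cpu_times_percent(interval=5)
    iowait = getattr(cpu, "iowait", None)   # the field is only present on Linux
    if iowait is None:
        print("iowait isn't reported on this OS")
        break
    print(f"iowait: {iowait:5.1f} %")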

Lack of PCIe bandwidth is not the problem in your case, but the 980 Pros may perform better depending on their sustained write performance (check the reviews).

You didn’t mention the OS. Linux performs better if you aren’t already using it. Proper software configuration plays a big part too.

Read this thread for lots of tips:

1 Like

4 Jobs per SSD, yes. Asus Zenith II Extreme.

We are now testing using 4x SSDs as temp 1 and 4x SSDs as temp 2 (the secondary temp directory), paired 1-to-1, and then dumping to HDD. We feel that having P1/P2 handled by one SSD and P3/P4/P5 handled by a second SSD will ease the bottleneck within the SSDs.

We would then run 8 jobs per pair of SSDs, staggered by a 60-minute delay, with a max of 32 total jobs on the latest Swar build, P1 per drive limited to 3, and total jobs per pair of SSDs set to 8. We’ll also use a 15-minute offset, so at the start one job kicks off every 15 minutes sequentially until it ramps up to full load, trying to balance everything evenly. We’re basing this on hoping to achieve 8-hour plot times; depending on results, we would adjust the numbers.
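As a rough sanity check on that plan, here is a back-of-the-envelope sketch using the assumed numbers above (not measurements):

# Back-of-the-envelope for the plan above: assumed numbers, not measurements.
TOTAL_JOBS = 32      # global max jobs
OFFSET_MIN = 15      # one new job every 15 minutes at startup
PLOT_TIME_H = 8      # hoped-for plot time

ramp_up_h = (TOTAL_JOBS - 1) * OFFSET_MIN / 60
print(f"Time until all {TOTAL_JOBS} jobs are running: {ramp_up_h:.2f} h")   # ~7.75 h

# If plots really finish in ~8 h, the first job frees its slot just as the
# last one starts, so the box should sit near 32 concurrent jobs at steady state.
plots_per_day = TOTAL_JOBS * 24 / PLOT_TIME_H
print(f"Theoretical steady-state output: {plots_per_day:.0f} plots/day")    # 96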

In theory it should help a bit. We’ve been testing it on another machine with just 2 SSDs to see how it works out, and we’re seeing some positive results.

You could try this method since you have 4 SSDs: just change max jobs to 16 and the global offset to 30 minutes, and leave the per-pair settings as I described above.

Feel free to test, but at this point, I make no promises of results.

1 Like

You only have four NVMes; I would look into adding the 980s into the mix rather than replacing the 970s with them. You’d just have to source adapter cards (like the ASUS Hyper M.2 Gen 4) to add more drives to the Threadripper system, since you have plenty of PCIe lanes.

1TB NVME is where it’s at.

4 Likes

Yes, generally for consumer drives, more 1TB drives = more controllers = more plotting power compared to a lower quantity of 2TB drives.

It’s a common misconception - the extra space does not translate to more concurrent plots because the drive gets overloaded well before the space is fully utilized.

2 Likes

Something else definitely seems to be going on here. I added a 5th NVMe drive and dropped jobs down to 6 on each, and times went through the roof. I’m sitting at about a 20-hour completion time per job now with only 25 running in parallel.

Preliminary testing I did before adding the 5th drive (noting GB completed at the 12-minute mark):

3 jobs across 1 drive, staggered - 66 GB
3 jobs across 1 drive, same time - 66 GB
4 jobs across 1 drive, same time - 64 GB
4 jobs across drives 1-4, same time - 58-60 GB
4 jobs across drives 1-4, staggered - 59 GB
3 jobs across drives 1-3, same time - 59-61 GB
3 jobs across drives 1, 2, 4, same time - 62-66 GB
2 jobs each on drives 1, 2, 4, same time - 58-59 GB
2 jobs each on drives 1, 2, same time - 60-63 GB

It seems that adding another drive into the mix drops speeds by a few GB, and each additional job per drive drops them further.
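(If those GB figures are being read off the temp folders, a quick way to snapshot them per job is something like the Python below. It’s only a sketch: the drive letter is an assumption based on the config posted later in the thread, and it just groups chia’s *.tmp files by their plot-id prefix.)

from collections import defaultdict
from pathlib import Path

# Sum temp-file sizes per plot by grouping chia temp files (*.tmp) on their plot-id prefix.
# TEMP_DIR is an assumption -- point it at one of your temp folders.
TEMP_DIR = Path(r"E:\ChiaTemp")

per_plot = defaultdict(int)
for f in TEMP_DIR.glob("plot-*.tmp"):
    plot_id = f.name.split(".plot")[0]       # e.g. plot-k32-2021-05-27-...-<id>
    per_plot[plot_id] += f.stat().st_size

for plot_id, size in sorted(per_plot.items()):
    print(f"{plot_id[-8:]}: {size / 1e9:.0f} GB of temp data")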

However, extra space does equate to higher TBW before the drive cells are used up. That’s another, different reason to choose larger drives.

2 Likes

I believe you are limited by the TLC NAND. I get a 3-hour difference between a 1TB TLC drive (Kingston A2000) and a 1TB MLC drive (970 Pro) doing 4 parallel plots on each.

While that may be, I can’t see why adding a 5th NVMe while keeping the same total number of jobs (by reducing the jobs running on the other 4) would slow the system down even more.

3955WX here, plotting from the GUI with (currently) 2x 2TB temp drives, and I’ve been testing different scenarios for the best daily output. With one or two plots at a time it’s flying, but striving for more plots at similar speeds is like musical chairs: add more chairs and it all slows down; take some away and plotting speeds back up. I’ve tried 3 temp drives and 2 temp drives with different loads. You think you’re going to run more plots for better output, but everything slows to compensate. It seems that ~24 plots, or slightly more, is the max for my TR config.

I’m thinking it’s the IO, but with 2x Samsung 970 EVO Plus in RAID 0 and another PCIe 4.0 2TB drive, that maybe shouldn’t happen. IO response is consistent on both, ranging from very low (sub-ms) to 10-100 ms occasionally, but it stays similar across the loads I’ve tried. Memory is 4000 MiB/plot with 3 threads. The CPU is generally not maxed at all, and I have tons of memory, way more than is required.

Still, ~1 plot/hr average daily output is reasonable and over time accomplishes what is required, but inquiring minds want to know: why the limit?

Just wanted to give you guys an update. Added 4 case fans and another drive and now I’m plotting 30 jobs every ~11.5 hours.

Using SWAR with 5 job entries (1 for each NVMe); the global settings are 31 max concurrent, 12 max in P1, and 12 minutes between jobs.

Each job itself is set to 3 in P1, 6 concurrent, 4 threads/4 GB, and a 12-minute stagger, and each NVMe goes to its own HDD.

1 Like

I have a TR Pro 16/32 and a TR 24/48, both with 7 NVMes of different makes and models. It does not seem to matter what I do across which drive; the average for the drives is roughly the same, i.e. 2TB FireCudas vs a 1TB 970 Pro with the SAME jobs will have the same total time. It’s so strange, it’s like the TR balances it or something…

Can you post your Swar config?

I am on a WRX80 chipset with a 3975WX (32C/64T) and 8x 2TB Sabrent Rocket Plus NVMe drives, and I get 85-90 plots/day with a 1-hour stagger time.

I could probably optimize it further but don’t have the time. At this point Chia has diminishing returns…

Do you use Swar or Plotman, and what are your number of parallel jobs per drive and your phase 1 limit?

Thanks

Using SWAR and these are my settings:

max_concurrent: 31
max_for_phase_1: 12
minimum_minutes_between_jobs: 12

  - name: Samsung EVO 1
    max_plots: 60
    farmer_public_key:
    pool_public_key:
    temporary_directory: E:\ChiaTemp
    temporary2_directory:
    destination_directory: I:\ChiaPlot
    size: 32
    bitfield: true
    threads: 4
    buckets: 128
    memory_buffer: 4000
    max_concurrent: 6
    max_concurrent_with_start_early: 6
    initial_delay_minutes: 0
    stagger_minutes: 12
    max_for_phase_1: 3
    concurrency_start_early_phase: 4
    concurrency_start_early_phase_delay: 0
    temporary2_destination_sync: false
    exclude_final_directory: false
    skip_full_destinations: true
    unix_process_priority: 10
    windows_process_priority: 32
    enable_cpu_affinity: false
    cpu_affinity: [ 0, 1, 2, 3, 4, 5 ]

  - name: Samsung EVO 2
    max_plots: 60
    farmer_public_key:
    pool_public_key:
    temporary_directory: F:\ChiaTemp
    temporary2_directory:
    destination_directory: J:\ChiaPlot
    size: 32
    bitfield: true
    threads: 4
    buckets: 128
    memory_buffer: 4000
    max_concurrent: 6
    max_concurrent_with_start_early: 6
    initial_delay_minutes: 0
    stagger_minutes: 12
    max_for_phase_1: 3
    concurrency_start_early_phase: 4
    concurrency_start_early_phase_delay: 0
    temporary2_destination_sync: false
    exclude_final_directory: false
    skip_full_destinations: true
    unix_process_priority: 10
    windows_process_priority: 32
    enable_cpu_affinity: false
    cpu_affinity: [ 0, 1, 2, 3, 4, 5 ]

  - name: Samsung EVO 3
    max_plots: 60
    farmer_public_key:
    pool_public_key:
    temporary_directory: G:\ChiaTemp
    temporary2_directory:
    destination_directory: K:\ChiaPlot
    size: 32
    bitfield: true
    threads: 4
    buckets: 128
    memory_buffer: 4000
    max_concurrent: 6
    max_concurrent_with_start_early: 6
    initial_delay_minutes: 0
    stagger_minutes: 12
    max_for_phase_1: 3
    concurrency_start_early_phase: 4
    concurrency_start_early_phase_delay: 0
    temporary2_destination_sync: false
    exclude_final_directory: false
    skip_full_destinations: true
    unix_process_priority: 10
    windows_process_priority: 32
    enable_cpu_affinity: false
    cpu_affinity: [ 0, 1, 2, 3, 4, 5 ]

  - name: Samsung EVO 4
    max_plots: 60
    farmer_public_key:
    pool_public_key:
    temporary_directory: H:\ChiaTemp
    temporary2_directory:
    destination_directory: L:\ChiaPlot
    size: 32
    bitfield: true
    threads: 4
    buckets: 128
    memory_buffer: 4000
    max_concurrent: 6
    max_concurrent_with_start_early: 6
    initial_delay_minutes: 0
    stagger_minutes: 12
    max_for_phase_1: 3
    concurrency_start_early_phase: 4
    concurrency_start_early_phase_delay: 0
    temporary2_destination_sync: false
    exclude_final_directory: false
    skip_full_destinations: true
    unix_process_priority: 10
    windows_process_priority: 32
    enable_cpu_affinity: false
    cpu_affinity: [ 0, 1, 2, 3, 4, 5 ]

  - name: Sabrent
    max_plots: 60
    farmer_public_key:
    pool_public_key:
    temporary_directory: N:\ChiaTemp
    temporary2_directory:
    destination_directory: I:\ChiaPlot
    size: 32
    bitfield: true
    threads: 4
    buckets: 128
    memory_buffer: 4000
    max_concurrent: 6
    max_concurrent_with_start_early: 6
    initial_delay_minutes: 0
    stagger_minutes: 12
    max_for_phase_1: 3
    concurrency_start_early_phase: 4
    concurrency_start_early_phase_delay: 0
    temporary2_destination_sync: false
    exclude_final_directory: false
    skip_full_destinations: true
    unix_process_priority: 10
    windows_process_priority: 32
    enable_cpu_affinity: false
    cpu_affinity: [ 0, 1, 2, 3, 4, 5 ]

Sorry, I meant @ikbodud. I am in the same boat as you, but I would like to see his config to maybe help us both.

I think it’s the number of NVMes. It looks like (at least for me) 6 jobs per NVMe is the max, so if he’s doing 6 jobs across 8 drives, that’s 48 jobs at a time, which at roughly 12-hour plot times works out to about 96 per day.

I also have someone in a Reddit thread with a 3970x and 5x 980 Pros telling me he’s doing 100 plots per day with 5 jobs running at once, a 90-minute stagger, and 6 threads/6500 MiB, but I haven’t had the time to test that and won’t until next week.

If you try it out let me know how it works.

Thread for reference: https://www.reddit.com/r/chia/comments/nkwlva/why_is_my_32_core_amd_threadripper_biblically/gzglcc9/?context=3