2TB Samsung 970 much faster than 1TB 980 and Intel nvmes. Seems like odd OS scheduling

I have an Asus Zenith Extreme board with a Threadripper 1950X (16C/32T) and 64 GB of DDR4-3000.

Based on my CPU and RAM, and what I've read, I should be able to run 20 plots in parallel, given enough NVMe drives.
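Roughly sanity-checking that number against the hardware (assuming 2 threads per plot and the plotter's default ~3389 MiB buffer, which is what I use below):

    # Back-of-the-envelope check (assumes -r 2 and the default ~3389 MiB buffer per plot)
    echo "CPU limit: $((32 / 2)) plots (32 hardware threads / 2 per plot)"
    echo "RAM for 20 plots: $((20 * 3389 / 1024)) GiB (vs 64 GiB installed)"

So with the default buffer, 20 is actually a little optimistic on the RAM side, and 16 is the natural CPU-bound number with two threads each.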

I installed 4 NVMe disks. Two are Intel, on the DIMM.2 riser next to the RAM, and the Samsung NVMes are installed in PCIe slots 2 and 4. Per the manual, neither slot shares PCIe lanes unless U.2 storage is in use, which it is not.

I'm running 17 parallel processes. I use -r 2 and the default RAM setting. Let's see how far I got after about 11 hours.
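For reference, each process was launched with roughly this invocation (the temp and destination paths are placeholders, not my actual mount points):

    # One plotter process per screen window: k32, 2 threads, default RAM buffer
    chia plots create -k 32 -r 2 -n 1 \
        -t /mnt/tmp/samsung2tb/plot01 \
        -d /mnt/farm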

Samsung 2 TB nvme
4 parallel processes, 1hr stagger:

  1. Completed, and up to Stage 1 Phase 3:
    Computing table 2
    Forward propagation table time: 2286.548 seconds. CPU (114.340%) Mon Jun 21 01:29:29 2021
  2. Starting phase 4/4
  3. Stage 3: Compressing tables 5 and 6
    First computation pass time: 1340.103 seconds. CPU (78.650%)
  4. (forgot to start this one till later) Stage 1: Forward propagation table time: 2071.915 seconds. CPU (123.760%) Mon Jun 21 01:00:37 2021
    Computing table 3

4 x Mushkin 480 GB SSD, RAID 0, on a PCIe 2.0 x4 HBA
4 parallel processes, 1hr stagger:

  1. Stage 3: Total compress table time: 2875.590 seconds. CPU (29.310%) Mon Jun 21 00:43:17 2021
    Compressing tables 2 and 3
  2. Stage 2: Total backpropagation time:: 3196.063 seconds. CPU (29.950%) Mon Jun 21 01:35:57 2021
    Backpropagating on table 2
  3. Stage 2: Total backpropagation time:: 3254.953 seconds. CPU (33.220%) Mon Jun 21 01:18:43 2021
    Backpropagating on table 5
  4. Stage 2: Total backpropagation time:: 1991.244 seconds. CPU (21.580%) Mon Jun 21 01:34:53 2021
    Backpropagating on table 6

Intel nvme #1
3 parallel processes, 1hr stagger:

  1. Stage 2: Forward propagation table time: 10542.896 seconds. CPU (32.580%) Mon Jun 21 01:31:50 2021
    Computing table 6
  2. Stage 2: Forward propagation table time: 9100.620 seconds. CPU (38.460%) Sun Jun 20 23:29:54 2021
    Computing table 5
  3. (Forgot to start this one till later): Stage 1: F1 complete, time: 1288.21 seconds. CPU (15.69%) Mon Jun 21 00:46:49 2021
    Computing table 2

Intel nvme #2
3 parallel processes, 1hr stagger:

  1. (This one was paused for about 30 min) Stage 2: Total backpropagation time:: 2978.436 seconds. CPU (15.070%) Sun Jun 20 23:51:19 2021
    Backpropagating on table 6
  2. (This one was paused for about 30 min) Stage 1:
    Forward propagation table time: 5413.073 seconds. CPU (60.530%) Sun Jun 20 23:30:43 2021
    Computing table 7
  3. (Started this one after pausing others) Stage 1: Computing table 2
    Forward propagation table time: 2914.677 seconds. CPU (91.380%) Mon Jun 21 01:46:54 2021 (not bad so far)

Samsung 1TB nvme
3 parallel processes, 1hr stagger:

  1. Stage 1: Forward propagation table time: 8107.955 seconds. CPU (41.920%) Sun Jun 20 23:51:51 2021
    Computing table 6
  2. Stage 1: Forward propagation table time: 9605.114 seconds. CPU (36.580%) Sun Jun 20 23:55:08 2021
    Computing table 5
  3. Stage 1: Computing table 4
    Forward propagation table time: 9866.944 seconds. CPU (35.310%) Mon Jun 21 01:10:37 2021

So, as you can see, my 2 TB NVMe, running more processes, has completed an entire plot while the 1 TB NVMe from the same maker and a similar (one year newer) model is still struggling to get out of Stage 1.

I think this has something to do with how Linux is scheduling processor time. Notice that on one of the Intel NVMes, even with another 14 processes running, when I paused the two processes on that disk and started a third, the third process ran alone at a pretty respectable rate, despite being the 15th of 15 processes at the time I started it:

Stage 1 Table 2:

Forward propagation table time: 2914.677 seconds. CPU (91.380%) Mon Jun 21 01:46:54 2021

Compare that with the Samsung 1 TB for the same phase, with only one other parallel process on that storage and only 9 parallel processes overall:

Computing table 2
Forward propagation table time: 7332.526 seconds. CPU (40.050%) Sun Jun 20 19:20:28 2021

That is some crap time.

Finally, here is the 2 TB Samsung on its second pass, with 17 parallel processes overall and another three running on that same disk:

Computing table 2
Forward propagation table time: 2286.548 seconds. CPU (114.340%) Mon Jun 21 01:29:29 2021

WTF? Why does Linux give some processes consistently more CPU time than others, for no apparent reason?
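If anyone wants to check the scheduling angle themselves, per-process CPU and disk accounting from sysstat is probably the quickest way to see whether a given plotter is being starved of CPU or stuck waiting on its disk (the match string is just an example):

    # Per-process CPU and I/O stats every 10 seconds for all chia plotter processes
    # (pidstat is part of the sysstat package)
    pidstat -u -d -p "$(pgrep -d, -f 'chia plots create')" 10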


I've investigated a bit more, and found a drawing for a similar X399 chipset motherboard. This was a pre-release leak, but it looks pretty accurate:

You can see the motherboard I’m using here:

Correcting what I said earlier: there may be more shared PCIe lanes than I thought, but it still would not be enough to bottleneck me. I have an HBA on the x4 port (which is PCIe 2.0), and the NVMes are on the 2nd and 4th x16 slots (1 TB and 2 TB respectively).

The two Intel NVMes are on the DIMM.2 slot, which can be seen to the right of the RAM.

I've tried pinning each process to two cores (with the 17th floating), but it doesn't seem to have made a difference. I was also looking into numactl for controlling NUMA placement (which I assume the chip does a pretty good job of on its own), but I'll have to reboot and change a BIOS setting first.
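For anyone following along, this is roughly what I mean by pinning and by NUMA binding (the PID, core numbers, and paths are placeholders):

    # Pin an already-running plotter to two specific cores
    taskset -p -c 0,1 12345
    # Or launch a plotter bound to one NUMA node's cores and memory
    # (only meaningful once the BIOS exposes the CPU dies as separate NUMA nodes)
    numactl --cpunodebind=0 --membind=0 \
        chia plots create -k 32 -r 2 -t /mnt/tmp/intel1/plot01 -d /mnt/farm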

After these slower ones complete one pass, I'm going to unmount and reformat them. It seems like a dumb thing to do, but twice before I've noticed big differences after doing it, which makes me wonder whether it's TRIM rather than OS scheduling that's to blame. I originally did it because I thought my ZFS RAID array was somehow the culprit, and then I thought my md RAID array was the culprit. Now they are just discrete disks; there's no point in going RAID 0 just to fit in another VM.
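Before reformatting, an explicit trim of the temp filesystems might be enough to test the TRIM theory (mount points are placeholders):

    # Manually discard all unused blocks on a temp filesystem
    sudo fstrim -v /mnt/tmp/samsung1tb
    # Or trim every mounted filesystem that supports it
    sudo fstrim -av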

Next time I’m going to start the processes in the opposite order. I half-think that the order is what determines the speed of the processes more than anything, which would point to the scheduler being the problem.

That said, I have a lot of IO wait, around 30%, while overall CPU usage is around 25%, so I'm not so sure.

The points where IO delay peaks are where I was trimming. This seems to have helped, at least to an extent.
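To see which temp drives are actually saturated, rather than the plots simply not getting CPU, per-device stats should tell the story (device names are examples):

    # Extended per-device stats every 10 seconds; watch %util and await on the temp drives
    iostat -x 10
    # Or limit it to specific NVMe namespaces
    iostat -x 10 nvme0n1 nvme1n1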

The 2 TB NVMe has its processes on the first 8 cores. The 4-SSD array has the next 8. The remaining 16 are shared across 8 parallel processes on the other 3 NVMes.

Unfortunately, it often looks that way, with the first NVMe and the SSD array really busy, while the three 1 TB NVMes, which I'd expect to be the fastest with three processes apiece, are actually the slowest.

Yes, what you have found is accurate. Your issue is the restrictions of your PCIe capability on that board. The 980 Pro is faster. Even on PCIe 3.0. The 980 is a PCIe 4.0 drive and the 970 Evo is not. But even with the PCIe mismatch, the 980 is faster. I have both and have tested. If you take all other drives out and remove the restrictions of the lane sharing you will see the difference. Getting them to all work together the way you want on that board … Good luck.


Yeah, but if you read my post closely, you’ll see that the opposite is happening. The 970 is going four times faster overall, and neither is sharing lanes. In any case, I don’t think your theory is completely accurate.

These are in x16 slots, but I'm pretty sure the NVMes work at x4 or x8 link widths, because even at x4 PCIe 3.0 speeds you are talking roughly 4 GB/s, which is more than what your average NVMe is putting out. Dedicated M.2 slots on motherboards generally get only 4 lanes, not because nobody has figured out that you could add four more and make them faster, but because you don't need more yet.

Unless there is some known flaw with this particular board's dedicated PCIe lanes (of which it has many), that should not be a bottleneck.
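For what it's worth, the negotiated link width and speed can be read straight from the kernel, so this doesn't have to stay a guess (the bus address below is an example):

    # Find the NVMe controllers' PCI addresses
    lspci -nn | grep -i 'non-volatile'
    # Then check what link each one actually negotiated
    sudo lspci -s 41:00.0 -vv | grep -E 'LnkCap|LnkSta'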

It looks like the plots on the 2 TB Samsung NVMe could complete 3 passes, with 4 running in parallel, in about the time the 3 parallel plots on the 1 TB Samsung NVMe complete 1 pass. That works out to roughly a factor-of-four difference in throughput (12 plots versus 3 in the same window). There's just no way that difference can be accounted for by the specs of the drives or the specs of the motherboard, which again points to a software issue.

Thanks for the reply.

It's not a theory. I have tested it many times and the 980 is definitely faster. Absolutely no doubt about it. Not by much when running on PCIe 3, but still faster. If you are getting some other result, something is messed up. You may have a bad 980. But I can guarantee that the 980 is faster.


I was referring to the idea that the PCIe lanes on the motherboard are a bottleneck.

With respect to the 970 vs. the 980: I looked, and the 980 can operate at PCIe 4.0 speed, as you say. That makes it even odder that it is underperforming the 970.

I'm going to swap out the two Intel NVMes and put 3 more 980s in the rig; I'll have a total of 4 980s and 1 970 at that point. Maybe I'll start with ZFS RAID 0 for the volume management and filesystem, and work my way back to discrete disks if there's still a problem.
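If I do go the ZFS route, the pool would look something like this (device names are placeholders, and the tuning is just what I'd try first for throwaway plot temp space):

    # Striped (RAID 0) pool across the 980s for plot temp space
    sudo zpool create -o ashift=12 plottmp /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
    # Temp data is disposable, so drop some safety and overhead
    sudo zfs set sync=disabled compression=off atime=off plottmp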

Meanwhile, I’ll wait for this plotting to end and run some benchmarks, before and after, and see if I can’t make sense of it.

BTW, you should just switch to Madmax. Running all those parallel plots is a major pain. Just letting Madmax fire through one plot at a time is a huge convenience. No more "I'll wait for this plotting to end". Try it, you will never go back.

Thanks for the tip. I was actually considering that or plotman pretty soon. I wanted to get a feel for it hands-on first.

It takes me a good half hour to set up, between running screen, opening enough windows, naming them all, launching the processes, setting up windows for diagnostic info, pinning the processes, and so on.

It’s interesting the first two times…
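Most of that setup could probably be scripted away; something like this sketch is what I'd end up with (session names, core lists, and paths are all placeholders):

    # Launch pinned, staggered plotters in detached, named screen sessions
    screen -dmS plot-2tb-1 bash -c \
      'taskset -c 0,1 chia plots create -k 32 -r 2 -t /mnt/tmp/samsung2tb/p1 -d /mnt/farm'
    sleep 3600   # one hour stagger
    screen -dmS plot-2tb-2 bash -c \
      'taskset -c 2,3 chia plots create -k 32 -r 2 -t /mnt/tmp/samsung2tb/p2 -d /mnt/farm'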

Plotman and Swar's are plot managers that help you coordinate the parallel plots. If you are trying to do them without a manager, you are not going to get what you want. The plots will drift into each other and begin to sync up. After a couple of days, the timing is all messed up. A plot manager solves that. But that is "the old way", if it can be called that. Madmax removes the need to run parallel plots at all, and the tedium with it. So much easier. And I get an additional 11 plots a day per system using Madmax. It's just a better way to plot. Good luck.
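A typical Madmax run looks something like this (keys and paths are placeholders; check chia_plot's help output for the exact options in your build):

    # Madmax: one plot at a time, all cores, two temp dirs
    chia_plot -r 32 -u 256 \
        -t /mnt/tmp/nvme1/ -2 /mnt/tmp/nvme2/ -d /mnt/farm/ \
        -f <farmer_public_key> -p <pool_public_key>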

I’ll definitely look into it before I start plotting over. Thanks.