Brilliant, thank you Gladanimal.
So if compression of tables 5 and 6 starts at around 85% of total single plot time, then it seems using a secondary temporary drive shaves off 15% from the duration of each plot, without increasing disk usage.
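As a rough sanity check on what that 15% means in wall-clock time, here is a minimal shell sketch (using the ~10681-second 2-thread total from the results further down; purely illustrative):

total=10681                    # total single-plot time in seconds (2-thread run below)
saved=$(( total * 15 / 100 ))  # the last ~15% spent compressing tables 5 and 6
echo "~${saved}s (~$(( saved / 60 )) min) per plot could move to the secondary temp drive"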
That’s right! That’s why I started this testing.
New hardware:
Plotting parameters:
Plot size is: 32
Buffer size is: 28000MiB
Using 128 buckets
Using 2 threads of stripe size 65536
Using optimized chiapos
Overall:
Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)
Note: the following charts show average values per minute! Momentary values could be higher.
CPU usage:
Looking at the charts above (Plotting process measurements - #3 by gladanimal, Plotting process measurements - #23 by gladanimal), I found that the default buffer size (3389MiB) especially affects composing tables 4 and 5 in the first phase and thus increases temporary storage I/O. So 5000MiB seems optimal for performance at k=32 and u=128. A higher buffer size would not be utilized!
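For reference, this is roughly how those values map onto the stock CLI; the temp and destination paths below are placeholders, not taken from this thread:

# k=32, 128 buckets, 5000 MiB buffer, 2 threads; -t is the temp dir, -d the final dir
chia plots create -k 32 -u 128 -b 5000 -r 2 -t /mnt/nvme_tmp -d /mnt/hdd_final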
Would you mind trying 4 and 6 threads also?
Right now I have the raw 4-thread results. Charts are coming.
I captured real-time IOPS for the temporary drive. Looks very nice. The peak value is about 9000 IOPS, not as high as I expected. I assume slower CPUs would produce lower values (but that’s not a given).
So I assume IOPS is not the bottleneck on modern NVMe SSDs. For example, my ADATA SX8200 Pro is rated at 390K/380K IOPS, which is more than enough to handle the I/O of over 50 parallel plot seedings (although in practice its SLC and DRAM cache size and its R/W speed are the bottleneck).
It’s very easy to measure this on Linux. Run this command in a separate terminal before you start plot seeding:
iostat 1 | awk '/nvme1n1/ { print $2; fflush() }' >> your_iops_log.txt 2>&1
where nvme1n1 is the device name of your temporary drive and your_iops_log.txt is the full path of the file the log will be written to.
After plot seeding finishes, press Ctrl-C, then import the results from the log file into a spreadsheet and make a chart.
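If you just want the headline numbers rather than a chart, a small awk sketch over the same log works too (one tps value per line, as produced by the command above):

awk '{ sum += $1; if ($1 > max) max = $1 } END { if (NR) printf "samples: %d  avg: %.0f  peak: %.0f tps\n", NR, sum / NR, max }' your_iops_log.txt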
Your charts are welcome!
Plot size is: 32
Buffer size is: 28000MiB
Using 128 buckets
Using 4 threads of stripe size 65536
Using optimized chiapos
Overall:
Time for phase 1 = 3630.957 seconds. CPU (230.460%)
Time for phase 2 = 2212.198 seconds. CPU (90.180%)
Time for phase 3 = 3170.369 seconds. CPU (139.560%)
Time for phase 4 = 299.807 seconds. CPU (102.430%)
Total time = 9313.333 seconds. CPU (162.080%)
It’s about 1380 seconds (~23 minutes) faster than 2 threads. The first phase is shorter; the other phases are the same as with 2 threads. Now let’s look inside…
Peak memory usage decreased to 4392 MiB in the first phase, and a little bit in the second one.
Temporary drive space usage is the same as with 2 threads, just over a shorter period of time.
In the first phase CPU usage is higher, but not at 400% as expected. (These are average values per minute, I’ll check it later.)
It seems the optimal parameters for me are -r 3 -b 4500 for 128 buckets.
Great info @gladanimal, sorry for my noobness, do you mean -r 3 is for 3 threads and -b 4500 is for 4500MB RAM buffer?
No, the efficiency of each extra thread diminishes with every thread added! 1 thread 97%, 2 threads 150% :). Even 6 threads will only get you to around 280%.
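To put a number on that diminishing return, here is a quick sketch dividing the quoted CPU usage by the thread count (the 4-thread value is the phase 1 figure from the run above; treat all of these as rough averages):

printf '1 97\n2 150\n4 230\n6 280\n' | awk '{ printf "%s thread(s): %s%% CPU -> %.0f%% per thread\n", $1, $2, $2 / $1 }'

So each added thread buys less: roughly 97%, 75%, 58% and 47% per-thread utilisation.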
Exactly. I mean exactly this
Thanks for the info, I’ll test it.
Thanks for this testing (numbers).
Can you test with more threads, like 12-16 maybe?
I’ve run -k 32 -r 3 -b 4500 -u 128.
Plot size is: 32
Buffer size is: 4500MiB
Using 128 buckets
Using 3 threads of stripe size 65536
Using optimized chiapos
Overall:
Time for phase 1 = 4108.402 seconds. CPU (201.390%)
Time for phase 2 = 2186.909 seconds. CPU (90.780%)
Total compress table time: 643.945 seconds. CPU (95.100%)
Time for phase 4 = 289.450 seconds. CPU (102.300%)
Total time = 9693.975 seconds. CPU (153.760%)
Total time is about 380 seconds longer than the 4-thread run. The average CPU usage looks close, but it is only an average.
Comparison with the previous run (4 threads, 28000 MiB RAM): the amount of used memory looks really the same.
It’s OK! 4500 MiB is enough for 128 buckets with 3 and 4 threads.
Matches the previous test.
Average CPU usage is lower with 3 threads, so using 4 threads is better for phase 1 than 3 threads.
Using 4 threads and 4500 MiB of RAM is good enough for k=32 and 128 buckets.
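A quick way to line such runs up side by side is to pull the phase summaries straight out of the plotter logs (the log file names here are hypothetical):

grep -H -E 'Time for phase|Total time' plot_2threads.log plot_3threads.log plot_4threads.log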
Plot size is: 32
Buffer size is: 28000MiB
Using 2 threads of stripe size 65536
Using optimized chiapos
128 buckets (blue) overall
Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)
64 buckets (red) overall
Time for phase 1 = 5216.343 seconds. CPU (160.510%)
Time for phase 2 = 2314.673 seconds. CPU (90.890%)
Time for phase 3 = 3341.300 seconds. CPU (139.510%)
Time for phase 4 = 309.550 seconds. CPU (102.170%)
Total time = 11181.868 seconds. CPU (138.210%)
I didn’t find any reason to use 64 buckets (for 2 threads; I’ll try 4 threads later). Just higher CPU and memory usage, and 500 seconds slower. No benefits found yet.
Do you plan to run these with the optimized chiapos or the vanilla client? (The former is probably better for you, while the latter is more useful for the community.)
I’ve updated the test schedule ))
Btw, are the resulting plots from the optimized client smaller in size? (do they drop table 1?)
Plots are completely the same.