This is after compression of tables 5 and 6 begins.
A more detailed test is coming; it will include I/O measurements.
I've upgraded my script. Results will follow soon.
The script is launched by cron every minute and captures statistics for a single plot seeding. Output is tab-separated so it can be imported into a spreadsheet. Paths and device names are specific to my environment!
Here is the script; not perfect, but working:
#!/bin/bash
# Capture the date and time at which measurement begins
DATE=`/bin/date +%Y-%m-%d`
TIME=`/bin/date +%H:%M:%S`
# Settings section
TEMP1=/temp1/
TEMP1_DRIVE=nvme1n1
TEMP2=/temp2/
TEMP2_DRIVE=nvme0n1
DEST=/plots16t/
DEST_DRIVE=md127
OUTPUT=/home/lv/testing/log2.txt
# Capture IO averages for 60 seconds
IO=`iostat -m -y 60 1`
# Capture disk space usage
TMP1_S=`/bin/du -s $TEMP1 | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
TMP1_IO=`echo -e "$IO" | awk -v var="$TEMP1_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`
TMP2_S=`/bin/du -s $TEMP2 | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
TMP2_IO=`echo -e "$IO" | awk -v var="$TEMP2_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`
DST_S=`/bin/du -s $DEST | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
DST_IO=`echo -e "$IO" | awk -v var="$DEST_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`
# Capture total written bytes (TWB) for the NVMe drives
TMP1_TWB=`/usr/sbin/smartctl -a /dev/$TEMP1_DRIVE | awk '/Data Units Written/{gsub(",","",$4); print $4*512/1024}' OFMT="%3.0f"`
TMP2_TWB=`/usr/sbin/smartctl -a /dev/$TEMP2_DRIVE | awk '/Data Units Written/{gsub(",","",$4); print $4*512/1024}' OFMT="%3.0f"`
# Capture CPU stats and memory usage
MEM=`/usr/bin/smem -c "name uss" --processfilter="^/home/lv/chia-blockchain/venv/" | grep chia | awk '{print $2/1024}' OFMT="%3.0f"`
CPU=`echo -e "$IO" | awk '$1 ~ /^[[:digit:]]/ {print $1}'`
WA=`echo -e "$IO" | awk '$1 ~ /^[[:digit:]]/ {print $4}'`
# Make heading row for new file
if [ ! -f "$OUTPUT" ]; then
COMMONLBL="Phase\tTime\tCPU,%\tWA\tMem,MB\tTemp1,MB\tTemp2,MB\tDst,MB\tTWB1,MB\tTWB2,MB"
TMP1IOLBL="Tmp1 TPS\tTmp1 rs,MB/s\tTmp1 ws, MB/s\tTmp1 r, MB\tTmp1 w,MB"
TMP2IOLBL="Tmp2 TPS\tTmp2 rs,MB/s\tTmp2 ws, MB/s\tTmp2 r, MB\tTmp2 w,MB"
DSTIOLBL="Dest TPS\tDest rs,MB/s\tDest ws, MB/s\tDest r, MB\tDest w,MB"
echo -e "$COMMONLBL\t$TMP1IOLBL\t$TMP2IOLBL\t$DSTIOLBL" >> "$OUTPUT"
chown lv:lv "$OUTPUT"
fi
# Make output
COMMON="\t$DATE $TIME\t$CPU\t$WA\t$MEM\t$TMP1_S\t$TMP2_S\t$DST_S\t$TMP1_TWB\t$TMP2_TWB"
echo -e "$COMMON\t$TMP1_IO\t$TMP2_IO\t$DST_IO" >> "$OUTPUT"
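For reference, the cron setup mentioned above (run every minute) could look like the following crontab entry; the script filename is hypothetical, so adjust it to wherever you saved the script:

```shell
# crontab entry: run the capture script every minute
# (needs a user with smartctl privileges; the path is an assumption)
* * * * * /home/lv/testing/capture_stats.sh
```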
Brilliant, thank you Gladanimal.
So if compression of tables 5 and 6 starts at around 85% of total single plot time, then it seems using a secondary temporary drive shaves off 15% from the duration of each plot, without increasing disk usage.
That’s right! That’s why I started this testing.
Next measurement session
New hardware:
- i7-11700 8c/16t
- Asus Z590-F
- ADATA SX8200 PRO 1TB for the first temporary (tweaked XFS)
- ADATA SX8200 PRO 512GB for the second temporary (tweaked XFS)
- RAID0 of 4 x 4TB
- RAM 32GB 3200MHz
- running on Fedora 34
- using optimized chiapos
Plotting parameters:
- Plot size is: 32
- Buffer size is: 28000MiB
- Using 128 buckets
- Using 2 threads of stripe size 65536
- Using optimized chiapos
Overall:
Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)
Memory
Only about 4892 MiB of the 28000 MiB given was used in phase 1; 2276 MiB at peak in phase 2, 4168 MiB at peak in phase 3, and 1198 MiB in phase 4.
So one plotting process (128 buckets, 2 threads) can utilize only 4892 MiB of RAM (I will try mixing 4 threads and 64 buckets later).
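As a rough illustration (my arithmetic, not a measured result): with about 4892 MiB peak per process and the 32 GB of RAM from the hardware list, memory alone would allow roughly six parallel plots, ignoring OS overhead and CPU/I/O contention:

```shell
# Back-of-the-envelope: how many 4892 MiB plotting processes fit in 32 GiB of RAM?
# Illustrative only; real parallel plotting also contends for CPU and disk I/O.
awk 'BEGIN { total_mib = 32 * 1024; per_plot_mib = 4892; print int(total_mib / per_plot_mib) }'
```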
Disk space
Looks like the previous experiments, but peak usage is a little lower (244091 MiB). I think this is an effect of the higher memory usage.
CPU & IO
Note: the next charts show average values per minute! Momentary values could be higher.
CPU usage:
Total MB read and written on the first temporary drive per minute (the second temporary and the destination are not as interesting):
IOPS average (average per minute!):
Looking at the charts above (Plotting process measurements - #3 by gladanimal, Plotting process measurements - #23 by gladanimal) I found that the default buffer size (3389 MiB) especially affects composing tables 4 and 5 in the first phase, and thus increases temporary-storage I/O. So about 5000 MiB would be optimal for performance at k=32 and u=128. A higher buffer size would not be utilized!
Would you mind trying 4 and 6 threads also?
Right now I have raw results for 4 threads. Charts are coming.
IOPS
I captured real-time IOPS for the temporary drive. Looks very nice. The peak value is about 9000 IOPS, not as high as I thought before. I assume slower CPUs would produce lower values (but that is not certain).
So I assume IOPS is not the bottleneck on modern NVMe SSDs. For example, my ADATA SX8200 Pro is rated at 390K/380K IOPS, more than enough to handle the I/O of over 50 parallel plot seedings (in practice its SLC/DRAM cache size and R/W speed are the bottleneck).
It's very easy to measure on Linux. Run this command in a separate terminal before you start plot seeding:
iostat 1 | awk '/nvme1n1/{print $2; fflush()}' >> your_iops_log.txt 2>&1
where nvme1n1 is the name of your temporary drive and your_iops_log.txt is the full path of the log file.
After plot seeding finishes, press Ctrl-C, then import the results from the log file into a spreadsheet and make a chart.
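As a quick sanity check of the awk filter, you can feed it a couple of fabricated iostat-style lines (the device names and numbers below are made up); only the tps column of the matching device comes through:

```shell
# Two fake iostat device lines; the filter keeps column 2 (tps) for nvme1n1 only.
printf 'nvme1n1 8731.00 12.30 45.60\nsda 12.00 0.10 0.20\n' \
  | awk '/nvme1n1/{print $2; fflush()}'
```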
Your charts are welcome!
Four threads
Plot size is: 32
Buffer size is: 28000MiB
Using 128 buckets
Using 4 threads of stripe size 65536
Using optimized chiapos
Overall:
Time for phase 1 = 3630.957 seconds. CPU (230.460%)
Time for phase 2 = 2212.198 seconds. CPU (90.180%)
Time for phase 3 = 3170.369 seconds. CPU (139.560%)
Time for phase 4 = 299.807 seconds. CPU (102.430%)
Total time = 9313.333 seconds. CPU (162.080%)
It's really faster than 2 threads, by about 1368 seconds (~23 minutes). The first phase is shorter; the other phases are the same as with 2 threads. Now let's look inside…
Memory
Memory usage decreased to 4392 MiB at peak in the first phase, and a little in the second.
Disk space
Temporary drive space usage is the same as with 2 threads, just over a shorter time.
CPU & IO
In the first phase CPU usage is greater, but not at 400% as expected. (These are average values per minute; I will check momentary values later.)
I/O operations seem more aggressive in the first phase and the same in the others.
IOPS:
Total read and write:
Conclusions
It seems the optimal parameters for me are -r 3 -b 4500 for 128 buckets.
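For completeness, a plot-creation command with these parameters might look like the sketch below (the directory paths are from my setup earlier in the thread; adjust them to your environment):

```shell
# Sketch: chia plot creation with 3 threads, 4500 MiB buffer, 128 buckets
# (the /temp1, /temp2 and /plots16t paths are from my machine; change them for yours)
chia plots create -k 32 -r 3 -b 4500 -u 128 -t /temp1 -2 /temp2 -d /plots16t
```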
Great info @gladanimal, sorry for my noobness, do you mean -r 3 is for 3 threads and -b 4500 is for 4500MB RAM buffer?
No, extra-thread efficiency diminishes with every thread added: 1 thread 97%, 2 threads 150%. Even 6 threads will only get you to around 280%.
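To make the diminishing returns concrete, divide the observed CPU usage by the ideal (threads × 100%), using the figures quoted above:

```shell
# Parallel efficiency = observed CPU% / (threads * 100), using the numbers above
awk 'BEGIN {
  printf "1 thread:  %.0f%%\n", 97  / 1
  printf "2 threads: %.0f%%\n", 150 / 2
  printf "6 threads: %.0f%%\n", 280 / 6
}'
```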
Exactly. That's exactly what I mean.
Thanks for the info, I'll test it.
Thanks for this testing (numbers).
Can you test with more threads, like 12-16 maybe?
Looks so close to optimal.
I ran -k 32 -r 3 -b 4500 -u 128
Plot size is: 32
Buffer size is: 4500MiB
Using 128 buckets
Using 3 threads of stripe size 65536
Using optimized chiapos
Overall:
Time for phase 1 = 4108.402 seconds. CPU (201.390%)
Time for phase 2 = 2186.909 seconds. CPU (90.780%)
Total compress table time: 643.945 seconds. CPU (95.100%)
Time for phase 4 = 289.450 seconds. CPU (102.300%)
Total time = 9693.975 seconds. CPU (153.760%)
Total time is greater by about 380 seconds. So average CPU usage is close, but it is only an average.
Memory
Comparison with the previous run (4 threads, 28000 MiB RAM): the amount of used memory looks really the same.
It's OK! 4500 MiB is enough for 128 buckets with 3 or 4 threads.
Disk usage
Matches the previous test.
CPU & IO
Average CPU usage is lower, so 4 threads is better than 3 threads for phase 1.
IOPS closely matches the previous test.
Conclusion
Using 4 threads and 4500 MiB of RAM is good enough for k=32 and 128 buckets.
128 buckets vs 64 buckets
Plot size is: 32
Buffer size is: 28000MiB
Using 2 threads of stripe size 65536
Using optimized chiapos
128 buckets (blue) overall
Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)
64 buckets (red) overall
Time for phase 1 = 5216.343 seconds. CPU (160.510%)
Time for phase 2 = 2314.673 seconds. CPU (90.890%)
Time for phase 3 = 3341.300 seconds. CPU (139.510%)
Time for phase 4 = 309.550 seconds. CPU (102.170%)
Total time = 11181.868 seconds. CPU (138.210%)
Memory average
Temporary space usage
CPU average usage
IOPS average
Conclusion
I didn't find a reason to use 64 buckets (for 2 threads; I will try 4 threads later). Just higher CPU and memory usage, and 500 seconds slower. No benefits found yet.
Coming tests
- Using 6 threads
- Using 16 threads
- 64 buckets with 4 threads
- 4 threads and 3408 MiB (lower memory)
- Accurate CPU and RAM usage for 4 threads, 4500 MiB buffer size, 128 buckets and k=32
- Accurate comparison of "optimized chiapos" vs the "vanilla client"
Do you plan to run these with optimized chiapos or the vanilla client? (The former is probably better for you, while the latter is more useful for the community.)
I've updated the test schedule ))