Plotting process measurements

Hi Gladanimal, thanks for sharing this awesome test.

Could you clarify what you mean by “Starting next plot seeding after compressing tables 5 and 6 (phase 3) of previous plot begins”?

Does this mean you start the next plot after compressing tables 5 and 6 finishes, or right after compressing tables 5 and 6 begins?

Many thanks

After the beginning of compressing tables 5 and 6. :+1:
A more detailed test is coming. It will include I/O measurements.


I upgraded my script. Results will come soon.
The script is launched by cron every minute and captures single-plot seeding statistics. The output is tab-separated so it can be imported into a spreadsheet. Paths and device names are specific to my environment!
Here is the script; not perfect, but working:

#!/bin/bash

# Capture date and time when measurements begin
DATE=`/bin/date +%Y-%m-%d`
TIME=`/bin/date +%H:%M:%S`

# Settings section
TEMP1=/temp1/
TEMP1_DRIVE=nvme1n1

TEMP2=/temp2/
TEMP2_DRIVE=nvme0n1

DEST=/plots16t/
DEST_DRIVE=md127

OUTPUT=/home/lv/testing/log2.txt

# Capture IO averages for 60 seconds
IO=`iostat -m -y 60 1`

# Capture disk space usage
TMP1_S=`/bin/du -s $TEMP1 | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
TMP1_IO=`echo -e "$IO" | awk -v var="$TEMP1_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`

TMP2_S=`/bin/du -s $TEMP2 | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
TMP2_IO=`echo -e "$IO" | awk -v var="$TEMP2_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`

DST_S=`/bin/du -s $DEST | awk -F$'\t' '{print $1/1024}' OFMT="%3.0f"`
DST_IO=`echo -e "$IO" | awk -v var="$DEST_DRIVE" '$0~var{print $2,"\011",$3,"\011",$4,"\011",$6,"\011",$7}'`

# Capture TWB for nvmes
TMP1_TWB=`/usr/sbin/smartctl -a /dev/$TEMP1_DRIVE | awk '/Data Units Written/{gsub(",","",$4); print $4*512/1024}' OFMT="%3.0f"`
TMP2_TWB=`/usr/sbin/smartctl -a /dev/$TEMP2_DRIVE | awk '/Data Units Written/{gsub(",","",$4); print $4*512/1024}' OFMT="%3.0f"`

# Capture CPU stats and memory usage
MEM=`/usr/bin/smem -c "name uss" --processfilter="^/home/lv/chia-blockchain/venv/" | grep chia | awk '{print $2/1024}' OFMT="%3.0f"`
CPU=`echo -e "$IO" | awk '$1 ~ /^[[:digit:]]/ {print $1}'`
WA=`echo -e "$IO" | awk '$1 ~ /^[[:digit:]]/ {print $4}'`

# Make heading row for new file
if [ ! -f $OUTPUT ]; then
COMMONLBL="Phase\tTime\tCPU,%\tWA\tMem,MB\tTemp1,MB\tTemp2,MB\tDst,MB\tTWB1,MB\tTWB2,MB"
TMP1IOLBL="Tmp1 TPS\tTmp1 rs,MB/s\tTmp1 ws, MB/s\tTmp1 r, MB\tTmp1 w,MB"
TMP2IOLBL="Tmp2 TPS\tTmp2 rs,MB/s\tTmp2 ws, MB/s\tTmp2 r, MB\tTmp2 w,MB"
DSTIOLBL="Dest TPS\tDest rs,MB/s\tDest ws, MB/s\tDest r, MB\tDest w,MB"
echo -e "$COMMONLBL\t$TMP1IOLBL\t$TMP2IOLBL\t$DSTIOLBL" >> $OUTPUT
chown lv:lv $OUTPUT
fi

# Make output
COMMON="\t$DATE $TIME\t$CPU\t$WA\t$MEM\t$TMP1_S\t$TMP2_S\t$DST_S\t$TMP1_TWB\t$TMP2_TWB"
echo -e "$COMMON\t$TMP1_IO\t$TMP2_IO\t$DST_IO" >> $OUTPUT
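Since the post says the script is run by cron every minute, here is a sketch of what the crontab entry might look like. The script path /home/lv/testing/capture.sh is an assumption; substitute wherever you saved the script above.

```shell
# Hypothetical crontab entry (edit with `crontab -e` as root, since smartctl
# needs root access): run the capture script at the start of every minute.
# The path /home/lv/testing/capture.sh is assumed, not from the original post.
* * * * * /home/lv/testing/capture.sh >/dev/null 2>&1
```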

Brilliant, thank you Gladanimal.

So if compression of tables 5 and 6 starts at around 85% of the total single-plot time, then using a secondary temporary drive shaves about 15% off the duration of each plot, without increasing disk usage.


That’s right! That’s why I started this testing.


Next measurement session

New hardware:

  • i7-11700 8c/16t
  • Asus Z590-F
  • ADATA SX8200 PRO 1TB for the first temporary (tweaked XFS)
  • ADATA SX8200 PRO 512GB for the second temporary (tweaked XFS)
  • RAID0 of 4 x 4TB
  • RAM 32GB 3200MHz
  • running on Fedora 34
  • using optimized chiapos

Plotting parameters:

  • Plot size is: 32
  • Buffer size is: 28000MiB
  • Using 128 buckets
  • Using 2 threads of stripe size 65536
  • Using optimized chiapos

Overall:

Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)

Memory


Only about 4892MiB of the 28000MiB given was used in phase 1, 2276MiB at peak in phase 2, 4168MiB at peak in phase 3 and 1198MiB in phase 4.
So one plotting process (128 buckets, 2 threads) can utilize only 4892MiB of RAM (I will try mixing 4 threads and 64 buckets later).

Disk space


Looks like the previous experiments, but peak usage is a little lower (244091MiB). I think it is affected by the higher memory usage.

CPU & IO

Note: the next charts show average values per minute! Momentary values could be higher.
CPU usage:


Total MB read and written on the first temporary per minute (the second temporary and destination are not so interesting):

IOPS average (average per minute!):


Looking at the charts above (Plotting process measurements - #3 by gladanimal, Plotting process measurements - #23 by gladanimal) I found that the default buffer size (3389MiB) especially affects composing tables 4 and 5 in the first phase and thus increases temporary storage I/O. So 5000MiB would be optimal for performance with k=32 and u=128. A higher buffer size would not be utilized!


Would you mind trying 4 and 6 threads also?

Right now I have the raw results for 4 threads. Charts are coming :sunglasses:


IOPS

I captured real-time IOPS for the temporary drive. Looks very nice. The peak value is about 9000 IOPS, not as high as I thought before. I assume slower CPUs would produce lower values (but that is not certain).
So I assume IOPS is not the bottleneck on modern NVMe SSDs. For example, my ADATA SX8200 Pro is rated for 390K/380K IOPS, which is more than enough to handle the I/O of over 50 parallel plot seedings (in practice, its SLC and DRAM cache size and R/W speed are the bottleneck).

It’s very easy to measure on Linux. Run this command in a separate terminal before you start plot seeding:
iostat 1 | awk '/nvme1n1/ {print $2; fflush()}' >> your_iops_log.txt 2>&1
where nvme1n1 is the name of your temporary drive and your_iops_log.txt is the full path of the log file.
After plot seeding finishes, press Ctrl-C, then import the results from the log file into a spreadsheet and make a chart.
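If you just want the headline numbers without the spreadsheet round-trip, a quick summary of the captured log can be computed directly with awk. This is a sketch: your_iops_log.txt is the log file produced by the command above, and the first sample is skipped because iostat's first report covers the time since boot.

```shell
# Sketch: print peak and average IOPS from the captured log.
# Skips the first sample (iostat's first report is the since-boot average).
awk 'NR > 1 { if ($1 > max) max = $1; sum += $1; n++ }
     END { printf "peak: %d IOPS, avg: %.0f IOPS over %d samples\n", max, sum/n, n }' your_iops_log.txt
```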
Your charts are welcome!

Four threads

Plot size is: 32
Buffer size is: 28000MiB
Using 128 buckets
Using 4 threads of stripe size 65536
Using optimized chiapos 

Overall:

Time for phase 1 = 3630.957 seconds. CPU (230.460%)
Time for phase 2 = 2212.198 seconds. CPU (90.180%)
Time for phase 3 = 3170.369 seconds. CPU (139.560%)
Time for phase 4 = 299.807 seconds. CPU (102.430%)
Total time = 9313.333 seconds. CPU (162.080%)

It’s really faster than 2 threads, by about 1380 seconds (~23 minutes). The first phase is shorter; the other phases are the same as with 2 threads. Now let’s look inside…

Memory

Memory usage decreased to 4392 MiB at peak in the first phase, and a little in the second.

Disk space

Temporary drive space usage is the same as with 2 threads, but compressed in time.

CPU & IO

In the first phase CPU usage is greater, but not at 400% as expected. (These are average values per minute; I will check it later.)


I/O operations seem more aggressive in the first phase and the same for the others.
IOPS:

Total read and write:

Conclusions

It seems the optimal parameters for me are -r 3 -b 4500 with 128 buckets.

Great info @gladanimal, sorry for my noobness: do you mean -r 3 is for 3 threads and -b 4500 is for a 4500MB RAM buffer?


Note that extra thread efficiency diminishes with every thread added: 1 thread 97%, 2 threads 150%. Even 6 threads will only get you to around 280%.


Exactly. I mean exactly this.

Thanks for the info, I will test it.

Thanks for this testing (numbers).
Can you test with more threads, like 12-16 maybe?


Looks very close to optimal

I’ve run -k 32 -r 3 -b 4500 -u 128

Plot size is: 32
Buffer size is: 4500MiB
Using 128 buckets
Using 3 threads of stripe size 65536
Using optimized chiapos

Overall:

Time for phase 1 = 4108.402 seconds. CPU (201.390%)
Time for phase 2 = 2186.909 seconds. CPU (90.780%)
Total compress table time: 643.945 seconds. CPU (95.100%)
Time for phase 4 = 289.450 seconds. CPU (102.300%)
Total time = 9693.975 seconds. CPU (153.760%)

Total time is greater by about 380 seconds. Average CPU usage is very close, but it is only an average.

Memory

Comparison with the previous run (4 threads, 28000 MiB RAM). The amount of memory used looks exactly the same.
It’s OK! 4500 MiB is enough for 128 buckets with 3 and 4 threads.

Disk usage

Matches the previous test.

CPU & IO

Average CPU usage is lower, so 4 threads is better than 3 for phase 1.


IOPS closely matches the previous test.

Conclusion

Using 4 threads and 4500 MiB of RAM is good enough for k=32 and 128 buckets.
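As a sketch, that conclusion translates into a plot command like the following. Hedged: this assumes the standard `chia plots create` CLI, and the temp/dest paths are the ones from the measurement script earlier in the thread; adjust them for your own drives.

```shell
# Sketch: plotting with the parameters found above (k=32, 4 threads,
# 4500 MiB buffer, 128 buckets). The paths reuse the ones from the
# measurement script; they are not a recommendation.
chia plots create -k 32 -r 4 -b 4500 -u 128 \
    -t /temp1/ -2 /temp2/ -d /plots16t/
```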


128 buckets vs 64 buckets

Plot size is: 32
Buffer size is: 28000MiB
Using 2 threads of stripe size 65536
Using optimized chiapos

128 buckets (blue) overall

Time for phase 1 = 5074.508 seconds. CPU (162.030%)
Time for phase 2 = 2212.331 seconds. CPU (90.880%)
Time for phase 3 = 3109.913 seconds. CPU (139.740%)
Time for phase 4 = 284.687 seconds. CPU (102.400%)
Total time = 10681.442 seconds. CPU (139.220%)

64 buckets (red) overall

Time for phase 1 = 5216.343 seconds. CPU (160.510%)
Time for phase 2 = 2314.673 seconds. CPU (90.890%)
Time for phase 3 = 3341.300 seconds. CPU (139.510%)
Time for phase 4 = 309.550 seconds. CPU (102.170%)
Total time = 11181.868 seconds. CPU (138.210%)

Memory average

Temporary space usage

CPU average usage

IOPS average

Conclusion

I didn’t find any reason to use 64 buckets (for 2 threads; I will try 4 threads later). Just higher CPU and memory usage, and 500 seconds slower. No benefits found yet.

Coming tests

  • Using 6 threads
  • Using 16 threads
  • 64 buckets with 4 threads
  • 4 threads and 3408 MiB (lower memory)
  • Accurate CPU and RAM usage for 4 threads, 4500 MiB buffer size, 128 buckets and k=32
  • Accurate comparison of “optimized chiapos” vs “vanilla client”

Do you plan to run these with optimized chiapos or the vanilla client? (The former is probably better for you, while the latter is more useful for the community.)
