In search of: Extremely Technical details of Quality and Proof checking

First off, I believe harvesting from S3-compatible storage is possible and will be profitable. I’m currently farming 935 plots stored on Wasabi from a droplet on Digital Ocean, using goofys to present the S3-compatible storage as a FUSE file system. That seems to be good enough most of the time.

I don’t want to miss a proof though. So I’m looking to improve things.

I’ve started to look into the actual I/O going on in the chiapos project, but I’m looking for someone to talk me through what’s happening there. I’m relatively certain that if we replace the file open/seek/read operations with S3 range requests (cached, of course) we can get that last 5% of performance I’d like to see.
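
For concreteness, here’s a minimal sketch of that idea with boto3 (the bucket/key names are placeholders, and the cache is deliberately naive):

    import boto3

    s3 = boto3.client("s3")  # endpoint_url can point at Wasabi or another S3-compatible store
    _cache = {}

    def read_range(bucket: str, key: str, offset: int, length: int) -> bytes:
        """Replace seek()+read() on a local plot file with one HTTP range request."""
        cache_key = (bucket, key, offset, length)
        if cache_key not in _cache:
            resp = s3.get_object(
                Bucket=bucket,
                Key=key,
                Range=f"bytes={offset}-{offset + length - 1}",  # inclusive end byte
            )
            _cache[cache_key] = resp["Body"].read()
        return _cache[cache_key]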

It’s been done before in C#: Random-Access (Seekable) Streams for Amazon S3 in C# | by Lee Harding | circuitpeople | Medium

I’ve heard @storage_jm say things like the quality check is 7 reads, and getting a proof is 64 more reads. I’m after details like: is each of these reads dependent on data from prior reads, or could we fire them all off async in parallel? On a disk you probably wouldn’t simultaneously seek and read from 7 positions, but if you’re making HTTP requests, why not fire them all at once and await the responses? Is there any way to predict the “good” portions of the file? What are the implications of the 64 signage points within a block/challenge (I’d like all the right vocabulary here)? Are there efficiencies that fall out of that?

I’m not going to post my resume, but I do have the chops to get this done with a little push in the right direction.

7 Likes

I am trying a similar approach. I tried Backblaze, but the latency on challenges (even single ones) was way too high, so it couldn’t make the 30-second cutoff.

I am putting all new plots in AWS S3 and using s3fs to expose them on Ubuntu. I need to wait for the new node to finally finish syncing before I can see real challenge latency, but from running chia plots check it seems way faster (I have a small VM running in the same AWS region to keep latency as low as possible).

Any reason for choosing goofys over s3fs? Also, what kind of times are you seeing on challenges in your logs? With almost 1000 plots you must have quite a few multiple-plot challenges in a single round.

1 Like

I think this is the best reference for all the gory details: Chia Proof of Space Construction

Based on what I’ve grokked so far (partial), I don’t think there’s a lot of low-hanging fruit in terms of parallelizing the I/O for an individual PoS from a given plot; it’s following effectively-random pointers through the seven stored tables, inherently serial.

On the other hand, if you have multiple plots passing the filter for a challenge, I think it may be the case that the software currently goes through them serially when they could be searched in parallel. It’d be great to change that, but you can sort of MacGyver it just by splitting your farm into multiple nodes, each with a subset of your plots.
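
As a rough sketch of what that parallel search could look like (lookup_qualities here is just a stand-in callable for whatever per-plot quality lookup the harvester does, not the real API):

    from concurrent.futures import ThreadPoolExecutor

    def check_eligible_plots(lookup_qualities, eligible_plots, challenge):
        # Each lookup is latency-bound rather than CPU-bound, so overlapping the
        # waits with one worker per eligible plot is the whole point.
        with ThreadPoolExecutor(max_workers=len(eligible_plots)) as pool:
            futures = {plot: pool.submit(lookup_qualities, plot, challenge)
                       for plot in eligible_plots}
            return {plot: fut.result() for plot, fut in futures.items()}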

Besides using range requests appropriately, I can think of two other subtleties:

  1. Make sure to use a connection pool so that you’re not renegotiating TLS for every request (see the sketch after this list)
  2. It could be interesting to try firing off duplicate requests on multiple connections and take the first response; this might reduce the tail latency (but plausibly might make it worse too, depending on the backend architecture, so empirical testing would be needed)
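
For the connection pool point, a minimal sketch with boto3/botocore (assuming that’s the client in use; the pool size is just an example):

    import boto3
    from botocore.config import Config

    # One client, shared by all lookups, so connections (and their TLS sessions)
    # are reused from urllib3's pool instead of being renegotiated per request.
    s3 = boto3.client(
        "s3",
        config=Config(
            max_pool_connections=32,      # size of the connection pool
            retries={"max_attempts": 3},  # cheap insurance against transient blips
        ),
    )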
3 Likes

I started on s3fs and moved around between a few node types with different ratios of CPU and RAM, with use_cache both on and off. With it on, it kept filling the disk and crashing, even when I had a cron job trying to stay ahead of it. The final straw was it disconnecting even without use_cache turned on. It was consuming about 1 Gbps of bandwidth, and I think the average check time was 10 seconds. I was also accumulating many checks over 30 seconds. Too lazy to find the historical data, but it was hitting 30s more than 2% of the time.

When I switched to goofys, bandwidth dropped to under 1 Mbps (now up to 2 Mbps; this might be scaling with the volume of storage, suggesting roughly 1 Mbps per 10 TB of plots harvested). And the average search has slowly crept up from just under 2s to almost 3s. Fortunately my plots are in 50 TB folders, and today I’ll likely be distributing them to more harvesters to get some of that speed back. Currently I have 27 over 30s out of 48,092 chances in my logs, so that’s a 0.056% timeout rate, much improved. But again, these have been accumulating more recently, driving me to fan out.

Here’s the stats with at least 2 plots eligible based on chia-stats.py:

33,137 chances to win in your logs.
0 proofs found (these are wins, right?).
3.4998 Average time to check.
163.73842 Max time to check.
15 Count over 30s (NB: These are bad, fix it).

I’d love to see what you get from this script when running against S3 with s3fs in AWS. And I’m telling you, try out goofys :slight_smile:

2 Likes

Thanks for the link!

Awesome point on the connection pool; I’ll definitely keep that in mind now. I’d gut TLS altogether if possible. My plot data is meaningless without the farmer private key I have at home, so I wouldn’t even care if it gets snooped.

By multiple connections you’re suggesting keeping a few different open sockets, and throwing a request at each of them? In my current layout I wouldn’t have multiple physical connections, so I don’t know if there’s anything to be gained here.

From the metrics I added to chia plots check, it seems like after 2 rounds of checking qualities all the required blocks for that are getting cached (somewhere? goofys probably). Wow, I just re-ran an -n 5 scan of 9 plots that I did yesterday, and it returned everything in 1.1s, so that cache isn’t flushing very often even with all the harvesting going on.

Bumped up to -n 6 to see some new challenges. I’m seeing between 0 and 2 seconds on the quality checks, 0 to 2 ms for the repeat challenges, and 8 to 15 seconds on the fresh ones (median around 10).

1 Like

This is Backblaze with about 400 plots:

1022 chances to win in your logs.
0 proofs found (these are wins, right?).
25.2151 Average time to check.
204.53941 Max time to check.
227 Count over 30s (NB: These are bad, fix it).

i.e. not good :frowning:

@luckidog did you turn on caching specifically? I just have the default install running using goofys.

This is a great thread, I’m going to be following it closely. How are you plotting for those wasabi-stored plots? Yourself or from a cloud plotting service?

Plotting service: http://chiaappliance.com/. Tell them Aaron sent you :slight_smile:

1 Like

That looks close to in line with what I saw on s3fs; not good enough IMHO.

I did nothing specific to configure caching (not using catfs at this point), but it’s gotta be happening somewhere, based on the delta between fresh and repeat reads of the same bytes. It’s a file system, so maybe it’s just OS-level disk caching.

Also take a look at stolk/chiaharvestgraph on GitHub (graphs the activity of a chia harvester in a Linux terminal) for visualizing your response times.

Do you think it’s a valid option to use the S3 One Zone-IA storage class to store plots? I’m still syncing my only full node and not yet farming on my 600+ plots, which are presented to one c5n.xlarge EC2 instance using goofys. I’m just a bit concerned with the possibly large number of ListObjects API calls required by goofys when farming.

Also, how many farmers/harvesters or full nodes do you use to farm 900+ plots?

I was also thinking about Wasabi because of their low storage cost, but the high AWS transfer-out fee kills that idea.

Thank you! And this is a great topic.

I just finished splitting my single 100 TB harvester into two 50 TB harvesters. I’m trying to keep 50 TB per folder, all in the same bucket. One of those folders is full, and the others will continue to grow. I’ll be able to watch the performance of the fixed 50 TB harvester vs one starting at 50 TB and growing over the next week. That should give some metrics on the plots-to-harvester ratio and performance. The new harvester is slightly beefier.

You’d have to run the numbers, but I’d expect AWS S3 from EC2 to perform acceptably (how could it be worse than DO to Wasabi?). I don’t know the number of API calls or the billing for that, since it’s all included on Wasabi.

When I have free time I might plot more on spot instances to s3 and try out an AWS harvester for comparison.

I was also tempted by Wasabi because their API calls are free and the monthly cost is low, but they don’t seem to have the facility to plot, unfortunately, and it appears that getting plots out of wherever you’re plotting isn’t cheap.

I’m planning to use one full node to farm all my plots. I’m not sure if that’s doable or not; I think it really depends on how farming works. Hopefully it doesn’t list all the plots at all and only reads the plot(s) that pass the filter.
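
From what I’ve read in the consensus docs, that is roughly how it works: each signage point, a plot only gets any lookups if it passes the plot filter, which on mainnet lets about 1 in 512 plots through. A rough sketch of my reading of it (not the real harvester code; the 9-bit constant and field order are my understanding of the docs):

    import hashlib

    def passes_plot_filter(plot_id: bytes, challenge_hash: bytes,
                           sp_hash: bytes, filter_bits: int = 9) -> bool:
        # A plot is only looked up when sha256(plot_id + challenge_hash + sp_hash)
        # starts with `filter_bits` zero bits, so on average only 1 in 2**filter_bits
        # plots per signage point trigger any disk/S3 reads.
        digest = hashlib.sha256(plot_id + challenge_hash + sp_hash).digest()
        first_word = int.from_bytes(digest[:4], "big")
        return first_word >> (32 - filter_bits) == 0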

After running the numbers over and over, I’m very surprised to find that plotting/farming Chia in the public cloud can be profitable (at least for some time).

Wasabi is storage only, no compute. You can pay to set up a Direct Connect if you have a data center, or Digital Ocean, I believe, has a fast peered connection. I’ve spiked over 2 Gbps between DO and Wasabi. I haven’t done the math to figure out whether plotting on DO’s storage-optimized nodes is cost efficient, or whether they would even let you. They won’t unlock node types until you tell them what you’re going to be doing with them.

It is a good question; what does plots check actually do? Here is what the docs say:

Each plot will take each challenge (default 30) and:

  • Get the quality for the challenge (Is there a proof of space? You should expect 1 proof per challenge, but there may be 0 or more than 1.)
  • Get the full proof(s) for the challenge if a proof was present
  • Validate that the # of full proofs matches the # of expected quality proofs.

On the topic of full proofs vs expected proofs:

  • If the ratio is >1, your plot was relatively lucky for this run of challenges.
  • If the ratio is <1, your plot was relatively unlucky.
    • This shouldn’t really concern you unless your ratio is < 0.70. If so, do a more thorough chia plots check by increasing your -n

Which is elaborated upon thusly

The plots check challenge is a static challenge. For example if you run a plots check 20 times, with 30 tries against the same file, it will produce the same result every time. So while you may see a plot ratio << 1 for a plot check with x number of tries, it does not mean that the plot itself is worthless. It just means that given these static challenges, the plot is producing however many proofs. As the number of tries (-n) increases, we would expect the ratio to not be << 1. Since Mainnet is live, and given that the blockchain has new challenges with every signage point - just because a plot is having a bad time with one specific challenge, does not mean it has the same results versus another challenge. “Number of plots” and “k-size” are much more influential factors at winning blocks than “proofs produced per challenge”.

In theory, a plot with a ratio >> 1 would be more likely to win challenges on the blockchain. Likewise, a plot with a ratio << 1 would be less likely to win. However, in practice, this isn’t actually going to be noticeable. Therefore, don’t worry if your plot check ratios are less than 1, unless they’re significantly less than 1 for many -n.

So I guess the idea is you can run a quick -n check and see if any plots are particularly poor quality, but this is a bad use of your time compared to simply pumping out as many plots as possible!

I just checked 9 plots on my current machine via the minimum allowed -n 5 like so

./chia plots check -n 5 -g E:\chia-final\

and got:

Proofs 7 / 5, 1.4
Proofs 5 / 5, 1.0
Proofs 8 / 5, 1.6
Proofs 11 / 5, 2.2
Proofs 5 / 5, 1.0
Proofs 4 / 5, 0.8
Proofs 7 / 5, 1.4
Proofs 3 / 5, 0.6
Proofs 5 / 5, 1.0

So by this logic I should check out the plot that scored the worst, the 0.6 one. Let me try that, 30 challenges:

./chia plots check -n 30 -g E:\chia-final\plot-k32-2021-05-16-05-59-4c63a8cb933ee80cbc19ab554264a33679f84df2fe2ecc264a0fe5c27f7feb05.plot

and got

Proofs 23 / 30, 0.7667

Let’s try 100 challenges! Really work this plot out! :sweat_smile:

Proofs 93 / 100, 0.93

Seems OK?

Let me try the batch run again with -n 100 and see (note that a plot dropped since I ran the command, so it’s 10 plots this time):

Proofs 94 / 100, 0.94
Proofs 93 / 100, 0.93
Proofs 124 / 100, 1.24
Proofs 97 / 100, 0.97
Proofs 94 / 100, 0.94
Proofs 100 / 100, 1.0
Proofs 79 / 100, 0.79
Proofs 93 / 100, 0.93
Proofs 79 / 100, 0.79
Proofs 103 / 100, 1.03

I guess I’d like to emphasize the docs here:

:loudspeaker: Number of plots and k-size are much more influential factors at winning blocks than proofs produced per challenge

That being said, I highly recommend running plots check with the minimum (5) to make sure you don’t have invalid plots. My plotters do tend to produce the rare invalid plot every so often, seemingly at random…

3 Likes

Yes, the concept here is that your >30s proofs are probably dominated by a few of the requests taking an unusually long time for whatever reason. For example, perhaps the S3 API server you’re connected to is overloaded at that moment, and takes several seconds to respond. If you have connections to multiple API servers and send redundant requests through them, you’re perhaps a little more robust to those outliers, since you can use the first response you get.

You’re increasing the total traffic though, so this strategy could easily make things worse depending on how the backend works. There are refinements, such as sending the redundant request only after you see some delay in getting the first response. Further reading: The Tail at Scale by Google
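
That refinement is sometimes called a hedged or backup request. A minimal sketch, where read_range is any callable that does one ranged GET (e.g. the sketch earlier in the thread) and the 500 ms hedge delay is purely illustrative:

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def hedged_read(pool: ThreadPoolExecutor, read_range, request_args,
                    hedge_delay_s=0.5):
        # Fire the primary request; only if it hasn't completed within the hedge
        # delay do we pay for a duplicate, then take whichever finishes first.
        first = pool.submit(read_range, *request_args)
        done, _ = wait([first], timeout=hedge_delay_s)
        if done:
            return first.result()
        backup = pool.submit(read_range, *request_args)
        done, _ = wait([first, backup], return_when=FIRST_COMPLETED)
        return next(iter(done)).result()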

4 Likes

Thanks for all those details!

Here’s my PR for the plot checker that adds timing details: Add timing metrics to plot check by aarcro · Pull Request #5109 · Chia-Network/chia-blockchain · GitHub

Which is how I’m trying to validate that I will be fast enough for real challenges / proofs. Not particularly concerned with the proof ratio. Just looking to produce real I/O and time it.
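
For anyone curious, the shape of what gets measured is roughly this (a sketch, not the PR’s actual code; get_qualities_for_challenge and get_full_proof are the chiapos DiskProver calls the checker exercises, as I understand the bindings):

    import time

    def timed_check(prover, challenge):
        # Time the two I/O-heavy steps separately: the quality lookup (a handful
        # of reads) and each full-proof fetch (many more reads).
        start = time.monotonic()
        qualities = prover.get_qualities_for_challenge(challenge)
        quality_time = time.monotonic() - start

        proof_times = []
        for index in range(len(qualities)):
            start = time.monotonic()
            prover.get_full_proof(challenge, index)
            proof_times.append(time.monotonic() - start)
        return quality_time, proof_times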

4 Likes

Oh nice! Yeah, I got bitten early on by the 30-second harvester limit. Probably cost me 20 chia, but oh well. Live and learn :wink:

1 Like

Yup, I saw your thread/defect; that was my inspiration for digging into all the timing stuff.

I must be way off in my calculations. I looked into S3 storage, but the cost is very high. Could you provide some of the calculations you’re using?

Each plot would be around $2.50 a month, not including data transfers. Using https://chiacalculator.com/, one plot comes out to $7.20/mo, so it would be around $4.70/mo profit. None of this includes the cost of plotter servers or harvesters.

To me it seems better to just buy hard drives and just pay the cost up front.

Cost on Wasabi is $6/TB/mo, with no transfer or API charges. It’s “S3 compatible,” so all the S3 tools work, but it isn’t AWS.
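
A quick back-of-the-envelope, using the approximate size of a k32 plot (about 101.4 GiB) and the prices quoted in this thread; actual prices vary by region and tier:

    # Rough per-plot monthly storage cost; prices are the ones quoted above.
    K32_PLOT_GB = 101.4 * 1.0737  # ~108.9 GB for a k32 plot

    def monthly_cost_per_plot(price_per_gb_month: float) -> float:
        return K32_PLOT_GB * price_per_gb_month

    print(f"S3 Standard: ${monthly_cost_per_plot(0.023):.2f}/plot")     # ~$2.50
    print(f"Wasabi:      ${monthly_cost_per_plot(6 / 1000):.2f}/plot")  # ~$0.65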