Troubleshooting failure to farm with lots of plots, due to harvester 30s timeout

Yes.

plotter - a computer that does not have a full node. All it does is make plots.
full node - a computer with a farmer, harvester, wallet and full node.

I had my plotter share its drives over Samba so the harvester on the full node could access and farm them. But this was too slow. So I configured and started harvesters on my plotting computers and connected them to the farmer, following the tutorial.

To understand what I did to the config file, you have to look at an original one and compare it to the section I posted. You can find the config file here: ~/.chia/mainnet/config/config.yaml

Prior to changing the config file, I did not get any logs from the harvester on my plotting computer. Your experience may vary.
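For anyone trying to reproduce this, the piece of config.yaml that usually matters for a remote harvester is the farmer_peer entry under harvester (you also need the farmer's CA certificates on the harvester machine, e.g. via chia init -c as in the tutorial). A minimal sketch, where the IP address is just an example:

```yaml
# ~/.chia/mainnet/config/config.yaml on the plotting machine (example values)
harvester:
  farmer_peer:
    host: 192.168.0.2   # IP of the machine running the farmer, not localhost
    port: 8447          # default farmer port
```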

There is a script running on the plotters that keeps the HDDs from going to sleep:
while true; do date > /mnt/usb-hdd/keep.awake; sleep 20; done  # touch a file on the drive every 20 seconds so it never spins down
If the drive goes to sleep, you are looking at an access time of 10+ seconds.

3 Likes

That’s correct. There is a harvester filter pass with in-memory metadata so it doesn’t have to constantly read plots off the disks.
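For context, the filter works roughly like this: a cheap hash over in-memory metadata decides whether a plot is even eligible for this challenge, and only eligible plots get the expensive disk lookup. A rough Python sketch of the idea, not the actual consensus code; the 9-bit (1-in-512) constant is an assumption:

```python
import hashlib

FILTER_BITS = 9  # assumed 1-in-512 plot filter; check the consensus constants

def passes_plot_filter(plot_id: bytes, challenge_hash: bytes, sp_hash: bytes) -> bool:
    """Hash cheap in-memory metadata; only if the leading FILTER_BITS bits are
    zero does the harvester bother reading the plot file off disk."""
    digest = hashlib.sha256(plot_id + challenge_hash + sp_hash).digest()
    leading = int.from_bytes(digest[:2], "big") >> (16 - FILTER_BITS)
    return leading == 0
```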

2 Likes

@codinghorror (or anyone else who has found proofs frequently… or anyone who’s sufficiently familiar with the algorithm / code), are the times longer when you find a proof vs. when a plot passes the filter but no proof is found?

That would make it a lot more important to get no-proof times very low.

I only have one data point, so I can’t tell :sweat_smile:

1 Like

@storage_jm just posted a great walkthrough for setting up remote harvesters on Chia Decentral’s YouTube channel. The wiki is a lot to read through, but he did a great job just showing how to get it working and make sure it’s running, since the GUI and CLI are not useful for remote harvesting yet.

3 Likes

It’s much worse with multiple file copies and much worse when the drive is full. We actually had an event in our farm where we found a proof but didn’t get the reward, because we harvest as we plot. What you are describing is what we call “Quality of Service” in storage, which is the distribution of latency over time. Everything in the storage stack, including disk state, storage interface, filesystem, network, etc., adds latency.

3 Likes

There are at least two of those events I posted log entries for, in this very topic… where I found a proof in 35 or 36 seconds so I didn’t get a reward… you can use the search function and tick “search within this topic” to find them :wink:

Overall I think as plot farms get much, much larger there’s gonna be tremendous pressure on the project to relax that 30 second number a bit more – but it is also true that “run lots of harvesters” is a good solution for now. The problem is that setting up harvesters is still too complex, so hopefully that can be made much simpler over time.

And, as I said earlier, three things need to happen:

We got a form of this, which is a WARN if harvester times are over 5 seconds, so #1 is partially addressed… but #2 and #3 are equally important and haven’t happened yet.

In particular, since the harvester isn’t telling you which files it checked for proofs, you have literally no way of figuring out which device is slow, and that’s crippling. That’d be my suggested next change for the project team: just make the INFO log list exactly which files it checked for proofs; then you can correlate that with slow harvester times and say …

:bulb: Oh, interesting, every time it checks for a proof on {DRIVE X}, the times spike… maybe there’s a problem with {DRIVE X}?
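Until that lands, you can at least pull the slow lookups out of the log yourself. A minimal sketch, assuming the harvester lines in ~/.chia/mainnet/log/debug.log look roughly like “… plots were eligible for farming … Time: 1.23456 s. Total 105 plots” (the exact wording may vary between versions):

```python
import re
import sys

# Assumed line format: "... eligible for farming ... Time: 1.23456 s. Total 105 plots"
PATTERN = re.compile(r"eligible for farming.*Time: ([\d.]+) s\.")
THRESHOLD = 5.0  # seconds, matching the WARN threshold mentioned above

def slow_lookups(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = PATTERN.search(line)
            if m and float(m.group(1)) >= THRESHOLD:
                yield float(m.group(1)), line.rstrip()

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "debug.log"
    for secs, line in slow_lookups(log):
        print(f"{secs:8.2f}s  {line}")
```

Once per-file INFO logging exists, the same kind of script could group the slow lookups by drive.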

3 Likes

Traveling today and not an async Python expert, but it would be worth a PR to add an info log with the plot files that are still in the awaitable after a certain amount of time has passed here. Definitely couldn’t hurt.

Edit: read further into the function, like right below the line I was interested in, and it looks like it’s doing that now… :woozy_face:

2 Likes

Oh sweet, if they’ve added “tell us exactly which plot files the harvester chose to check for detailed proofs” to the logs then that’s awesome! It will be so helpful for the reasons I explained above :point_up_2:

2 Likes

Maybe they should write a little utility that does whatever logic is done when a proof is found and have it report how long it took. Run this utility on every storage device and report the results. This should help.

1 Like

Yes, that’s one of the other issues I didn’t bring up – there’s no check that exactly mimics the harvester proof check. The Chia team member recommended -n 30 as a reasonable approximation, though.
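If you just want to compare devices rather than mimic the harvester exactly, a crude per-drive probe is to time a few random reads from each plot file. This is only a sketch; the numbers are meaningful relative to each other, and the 16 KiB / 4-reads parameters are arbitrary:

```python
import os
import random
import sys
import time

CHUNK = 16 * 1024      # arbitrary read size
READS_PER_PLOT = 4     # arbitrary number of random reads per plot

def probe(plot_dir):
    """Time a few random reads from every .plot file in a directory.
    This is not the harvester's proof lookup, just a relative latency check."""
    for name in sorted(os.listdir(plot_dir)):
        if not name.endswith(".plot"):
            continue
        path = os.path.join(plot_dir, name)
        size = os.path.getsize(path)
        start = time.perf_counter()
        with open(path, "rb") as f:
            for _ in range(READS_PER_PLOT):
                f.seek(random.randrange(max(1, size - CHUNK)))
                f.read(CHUNK)
        print(f"{time.perf_counter() - start:7.3f}s  {path}")

if __name__ == "__main__":
    probe(sys.argv[1] if len(sys.argv) > 1 else ".")
```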

12 posts were split to a new topic: Why not use NAS and RAID for your Chia farms?

But we can still win the reward even if we get the 5 second warning, right? Because I have some too.

Yes. 30 seconds is the hard limit.

@codinghorror Did you solve your issue? I have the same problem using different external drives on different desktops…

Yes, I solved it by removing NASes from the equation!

1 Like

and I solved it by running remote harvesters on every node with local drives (including NASes).

1 Like

Thanks! That sounds great! So do you still use remote drives to store plots? I think I have a similar configuration: different external hard drives connected to different computers, using Windows shared folders to link them to the master farmer, something like //192.168.0.2/e, //192.168.0.2/d. Is that still a working solution? Thanks again!

Thanks! Does that mean we cannot use network drives for farming, even with 1 Gbps bandwidth on the local LAN?

It’s not recommended, and my observational experience with it was poor. With network drives, any sort of activity on the network caused response times to rise. I’m only using remote harvesters now, and that is working fantastically. Each has its own local drives and communicates only with the main farming node.

I use some network drives, yes, but as the network grows, the variability increases. The NAS proof times were always ~3x-4x the remote drive times, though; you can scroll up and read the stats yourself if you like.

1 Like