Troubleshooting failure to farm with lots of plots, due to harvester 30s timeout

Yes.

plotter - a computer that does not have a full node. All it does is make plots.
full node - a computer with a farmer, harvester, wallet and full node.

I had my plotter share its drives over Samba so the harvester on the full node could access and farm them. But this was too slow. So I configured and started harvesters on my plotting computers and connected them to the farmer, following the tutorial.

To understand what I did to the config file, you have to look at an original one and compare it to the section I posted. You can find the config file here: ~/.chia/mainnet/config/config.yaml

Prior to changing the config file, I did not get any logs from the harvester on my plotting computer. Your experience may vary.
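For anyone trying to reproduce this, the piece of config.yaml that usually matters for a remote harvester is the farmer_peer entry under harvester (you also need the farmer's CA certificates on the harvester machine, e.g. via chia init -c as in the tutorial). A minimal sketch, where the IP address is just an example:

```yaml
# ~/.chia/mainnet/config/config.yaml on the plotting machine (example values)
harvester:
  farmer_peer:
    host: 192.168.0.2   # IP of the machine running the farmer, not localhost
    port: 8447          # default farmer port
```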

There is a script running on the plotters that keeps the HDDs from going to sleep:
while true; do date > /mnt/usb-hdd/keep.awake; sleep 20; done  # touch a file on the drive every 20 seconds so it never spins down
If the drive goes to sleep, you are looking at an access time of 10+ seconds.

3 Likes

That’s correct. There is a harvester filter pass with in-memory metadata so it doesn’t have to constantly read plots off the disks.
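For context, the filter works roughly like this: a cheap hash over in-memory metadata decides whether a plot is even eligible for this challenge, and only eligible plots get the expensive disk lookup. A rough Python sketch of the idea, not the actual consensus code; the 9-bit (1-in-512) constant is an assumption:

```python
import hashlib

FILTER_BITS = 9  # assumed 1-in-512 plot filter; check the consensus constants

def passes_plot_filter(plot_id: bytes, challenge_hash: bytes, sp_hash: bytes) -> bool:
    """Hash cheap in-memory metadata; only if the leading FILTER_BITS bits are
    zero does the harvester bother reading the plot file off disk."""
    digest = hashlib.sha256(plot_id + challenge_hash + sp_hash).digest()
    leading = int.from_bytes(digest[:2], "big") >> (16 - FILTER_BITS)
    return leading == 0
```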

2 Likes

@codinghorror (or anyone else who has found proofs frequently… or anyone who’s sufficiently familiar with the algorithm / code), are the times longer when you find a proof vs. when a plot passes the filter but no proof is found?

That would make it a lot more important to get no-proof times very low.

I only have one data point, so I can’t tell :sweat_smile:

1 Like

@storage_jm just posted a great walkthrough for setting up remote harvesters on Chia Decentral’s YouTube channel. The wiki is a lot to read through, but he did a great job just showing how to get it working and make sure it’s running, since the GUI and CLI are not useful for remote harvesting yet.

3 Likes

It’s much worse with multiple file copies and much worse when the drive is full. We actually had an event in our farm where we found a proof but didn’t get the reward, because we harvest as we plot. What you are describing is what we call “Quality of Service” in storage, which is the distribution of latency over time. Everything in the storage stack, including disk state, storage interface, filesystem, network, etc., adds latency.

3 Likes

There are at least two of those events I posted log entries for, in this very topic… where I found a proof in 35 or 36 seconds so I didn’t get a reward… you can use the search function and tick “search within this topic” to find them :wink:

Overall I think as plot farms get much, much larger there’s gonna be tremendous pressure on the project to relax that 30 second number a bit more – but it is also true that “run lots of harvesters” is a good solution for now. The problem is that setting up harvesters is still too complex, so hopefully that can be made much simpler over time.

And, as I said earlier, three things need to happen:

We got a form of this, which is a WARN if harvester times are over 5 seconds, so #1 is partially addressed… but #2 and #3 are equally important and haven’t happened yet.

In particular, since the harvester isn’t telling you which files it checked for proofs, you have literally no way of figuring out which device is slow, and that’s crippling. That’d be my suggested next change for the project team: just make the INFO log list exactly which files it checked for proofs; then you can correlate that with slow harvester times and say …

:bulb: Oh, interesting, every time it checks for a proof on {DRIVE X}, the times spike… maybe there’s a problem with {DRIVE X}?
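Until that lands, you can at least pull the slow lookups out of the log yourself. A minimal sketch, assuming the harvester lines in ~/.chia/mainnet/log/debug.log look roughly like “… plots were eligible for farming … Time: 1.23456 s. Total 105 plots” (the exact wording may vary between versions):

```python
import re
import sys

# Assumed line format: "... eligible for farming ... Time: 1.23456 s. Total 105 plots"
PATTERN = re.compile(r"eligible for farming.*Time: ([\d.]+) s\.")
THRESHOLD = 5.0  # seconds, matching the WARN threshold mentioned above

def slow_lookups(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = PATTERN.search(line)
            if m and float(m.group(1)) >= THRESHOLD:
                yield float(m.group(1)), line.rstrip()

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "debug.log"
    for secs, line in slow_lookups(log):
        print(f"{secs:8.2f}s  {line}")
```

Once per-file INFO logging exists, the same kind of script could group the slow lookups by drive.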

3 Likes

Traveling today and not an async Python expert, but it would be worth a PR to add an info log with the plot files that are still in the awaitable after a certain amount of time has passed here. Definitely couldn’t hurt.

Edit: read further into the function, like right below the line I was interested in, and it looks like it’s doing that now… :woozy_face:

2 Likes

Oh sweet, if they’ve added “tell us exactly which plot files the harvester chose to check for detailed proofs” to the logs then that’s awesome! It will be so helpful for the reasons I explained above :point_up_2:

2 Likes

Maybe they should write a little utility that does whatever logic is done when a proof is found and have it report how long it took. Run this utility on every storage device and report the results. This should help.

1 Like

Yes, that’s one of the other issues I didn’t bring up – there’s no check that exactly mimics the harvester proof check. The Chia team member recommended -n 30 as a reasonable approximation, though.
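If you just want to compare devices rather than mimic the harvester exactly, a crude per-drive probe is to time a few random reads from each plot file. This is only a sketch; the numbers are meaningful relative to each other, and the 16 KiB / 4-reads parameters are arbitrary:

```python
import os
import random
import sys
import time

CHUNK = 16 * 1024      # arbitrary read size
READS_PER_PLOT = 4     # arbitrary number of random reads per plot

def probe(plot_dir):
    """Time a few random reads from every .plot file in a directory.
    This is not the harvester's proof lookup, just a relative latency check."""
    for name in sorted(os.listdir(plot_dir)):
        if not name.endswith(".plot"):
            continue
        path = os.path.join(plot_dir, name)
        size = os.path.getsize(path)
        start = time.perf_counter()
        with open(path, "rb") as f:
            for _ in range(READS_PER_PLOT):
                f.seek(random.randrange(max(1, size - CHUNK)))
                f.read(CHUNK)
        print(f"{time.perf_counter() - start:7.3f}s  {path}")

if __name__ == "__main__":
    probe(sys.argv[1] if len(sys.argv) > 1 else ".")
```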

12 posts were split to a new topic: Why not use NAS and RAID for your Chia farms?

But we can still win the reward even if we get the 5 second warning, right? Because I have some too.

Yes. 30 seconds is the hard limit.

@codinghorror Did you solve your issue? I have the same problem using different external drives on different desktops…

Yes, I solved it by removing NASes from the equation!

1 Like

and I solved it by running remote harvesters on every node with local drives (including NASes).

1 Like

Thanks! That sounds great! So do you still use remote drives to store plots? I think I have a similar configuration: different external hard drives connected to different computers, using Windows shared folders to link them to the master farmer, something like //192.168.0.2/e, //192.168.0.2/d. Is that still a working solution? Thanks again!

Thanks! Does that mean we cannot use network drives for farming, even with 1 Gbps bandwidth on the local LAN?

It’s not recommended, and my observational experience with it was poor. With network drives, any sort of activity on the network caused response times to rise. I’m only using remote harvesters now, and that is working fantastically. Each has its own local drives and communicates only with the main farming node.

I use some network drives, yes, but as the network grows, the variability increases. The NAS proof times were always ~3x-4x the remote drive times, though; you can scroll up and read the stats yourself if you like.

1 Like