Troubleshooting failure to farm with lots of plots, due to harvester 30s timeout

Hmm. Unless you’re seeing an increase in harvester times, I don’t know that I’d go by that. Set your log level to INFO and grep for

^.*?eligible.*?$

That’ll give you just the eligibility proofs from the harvester.
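If you’re on Windows, a minimal PowerShell equivalent of that grep (assuming the default log location under your user profile; adjust the path if your install differs):

```
# Pull only the harvester eligibility lines out of the Chia debug log.
# The path below is the Windows default; adjust it for your install.
Select-String -Path "$env:USERPROFILE\.chia\mainnet\log\debug.log" -Pattern 'eligible'
```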

I did eventually bite the bullet and split to 3 GUI farmers:

  1. on my main Phantom Canyon NUC, to all USB drives connected to it, plus three network shares that map to 18tb dump drives

  2. on my AMD Ryzen plotter, to all USB drives connected to it, plus its own 7 drives connected via mobo SATA (actually 8 but one is used for the OS)

  3. on an Intel 10th gen NUC, just to the two TerraMaster NASes that are left; I’ve managed to slowly, painfully extract 90tb from the other two full ones on the network as of today.

I have port-forwarded 8444 to the primary machine (#1 above) and turned UPnP off in the config.yaml on #2 and #3. I really wanted to avoid doing this, but I felt like I had no choice; my harvester times, even with the NASes removed, were all over the place…
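For reference, that’s one key in config.yaml on #2 and #3 (key name from memory, so double-check your own file):

```
full_node:
  enable_upnp: False   # leave only the primary machine answering on 8444
```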

This lets me compare harvester proof times on all 3 machines, as log level is set to INFO on all 3.

2 Likes

We’re going to have to start adding Chia network architect to our resumes!

3 Likes

Yes - I’m doing that. My understanding is that after you find a proof, there is some additional work to be done before you are awarded the block: you have to determine who, if there are multiple winners, has the best proof of time, and also create the new block. My thought is that having the CPU under load somehow affects this 2nd phase. It’s probably nothing, but winning every day for 3 days straight after a dry spell of multiple weeks gets my bugdar up. (Spent a couple decades managing software dev teams, so my bugdar is finely honed, LOL.)

2 Likes

OK, so here’s my harvester proof analysis now that I’ve broken this up into 3 farming machines rather than one master farmer. This is overnight, so about 8-10 hours of runtime at info log level. These are values gleaned from this line in the logs:

2021-04-26T00:16:49.777 - 2 plots were eligible for farming 294794196b... Found 0 proofs. Time: 0.49411 s. 22 plots

Specifically the Time: 0.49411 value, which I grepped out via the regular expression \d+\.\d{5,5}, meaning one or more digits, followed by a period, followed by exactly 5 digits.
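A minimal PowerShell sketch of that extraction plus the summary stats below (assuming the default Windows log path; adjust for your setup):

```
# Sketch: pull the "Time: N.NNNNN s" values out of the harvester
# eligibility lines, then summarize them.
$log    = "$env:USERPROFILE\.chia\mainnet\log\debug.log"
$times  = Select-String -Path $log -Pattern 'eligible' | ForEach-Object {
    if ($_.Line -match 'Time: (\d+\.\d{5}) s') { [double]$Matches[1] }
}
$sorted = $times | Sort-Object
[pscustomobject]@{
    Avg     = [math]::Round(($times | Measure-Object -Average).Average, 2)
    Median  = $sorted[[int][math]::Floor($sorted.Count / 2)]
    Max     = ($times | Measure-Object -Maximum).Maximum
    Min     = ($times | Measure-Object -Minimum).Minimum
    Over30s = @($times | Where-Object { $_ -gt 30 }).Count
}
```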

  1. HTPC Phantom Canyon, all USB connected drives plus 3 network shares that map to 18tb WD Elements dump drives

    • avg 1.94s
    • max 239s :warning:
    • min 0.05s
    • median 0.32s
    • 19 values over 30s
  2. AMD Ryzen 5950x megaplotter, attached to 7 SATA drives and 8 USB drives

    • avg 0.3s
    • max 35s
    • min 0.11s
    • median 0.43s
    • 1 value over 30s
  3. Intel NUC 6c/12t plotter, attached via 1gbps to two TerraMaster NASes on the same network switch

    • avg 4.11s
    • max 10s
    • min 0.04s
    • median 4.0s
    • 0 values over 30s

So you can draw a few conclusions here:

  • I need to make sure my network shares are being “woken up” regularly using a tool that pings the drives. I already do this locally on each machine with a little GUI tool that artificially reads drives every so often (I have it set to 30 seconds), but I’m going to double-shot it and run it both a) on the local machine and b) on the remote network machine, so it’s getting read-pinged both ways. We’ll see if that helps the Phantom Canyon numbers. It could also be that the dump drives are slow when they’re being dumped to, i.e. while plots are being written to them…

  • Local USB connections are definitely preferred and faster than network connections. The AMD machine has zero network connections and it’s got the best average times, though the median isn’t spectacularly better than the machine with the network connections.

  • NASes, even simple RAID 0 ones, are definitely slower than both network connections to plain drives and direct USB / SATA connections to plain drives.

You can see my theory here is essentially correct: NASes by themselves are not so bad. The problem is they add variability. The more NASes, the more variability, and that variability compounds, like a multiplier.

3 Likes

@codinghorror Can you share the script/command you used for this analysis (maybe as its own thread)? Seems really valuable for everyone to do this too.

1 Like

Yes, I already did so in the PowerShell topic.

1 Like

Just as a point of information: my experience with QNAP and Synology NAS units (not sure about other brands) is that they are generally configured to spin down the drives after a period of time, and will definitely spin drives down if they are starting to get warm and the fans cannot keep up, even if you configure them not to. Consumer SATA drives will happily go take a nap even if you tell them not to. A decent HBA (host bus adapter) AIC (add-in card) ships with firmware configured to keep drives awake, and the firmware on SAS drives is also, out of the box, configured to stay awake.

1 Like

No – these all default to “don’t spin down the drives”. It’s generally regarded as unhealthy to keep spinning drives down and up anyway, so I’d be surprised if that was the default on decent NASes…

1 Like

The Synology NAS and the QNAP NAS, regardless of what you tell them to do, will thermal throttle if the case fans cannot keep up with the heat dissipation needs, and will definitely spin down your drives. The SATA controller firmware gets a say in whether drives spin down. The SATA drive firmware gets a say in whether drives spin down. Agreed on the “generally unhealthy” statement, which is why SAS HDDs and certain special-use-case HDDs, e.g. security/AI/NVR drives, predominantly don’t spin down. But consumers care about power consumption and noise, not HDD lifespan measured in hundreds of thousands or millions of hours MTBF.

If you want to test whether your SATA drive gets a say in spin-down, stick it in a closed, unventilated box while it’s running. It’s going to nap (not a dirt nap) no matter what you configure it to do. Modern HDDs will thermal throttle. USB-to-SATA controllers will put your drive to sleep. Motherboard SATA controllers will put your drive to sleep. The OS (which you generally can control) will put your drive to sleep. You can configure hdparm to retrieve the bus state of the drive at regular intervals (example below) and watch a drive go to sleep even after you have configured it not to. Not in every case with every drive in every configuration, but I’ve seen it more times than I can recount.

That said, I don’t think your drives are thermal throttling. And it is possible that the high proof times have little to do with the drives taking a nap.
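For reference, the hdparm polling I mentioned looks like this on Linux (a sketch; /dev/sda is an example device):

```
# Poll the drive's power state every 60 seconds. hdparm -C reports
# whether the drive is active/idle, in standby, or sleeping.
watch -n 60 hdparm -C /dev/sda
```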

I don’t know how you’ve got your system configured, but from another post I glanced at, it sounds like you had multiple NAS systems on the same network and then moved to a single NAS, which partially solved your problems.

2 Likes

I agree, there is a problem!
I have similar problems.
I’m about to go crazy: eleventh day, 3,568 TiB and 0 XCH

2 Likes

I’m sorry, and it is affecting a lot of people! Spread the word! Please follow the advice in this topic. Set log level to INFO and scan your log for the harvester proof lines!

2021-04-26T00:16:49.777 - 2 plots were eligible for farming 294794196b... Found 0 proofs. Time: 0.49411 s. 22 plots

If that harvester proof check time is 5s or over, there are … issues.
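On Windows, a quick PowerShell filter to flag the bad ones (default log path assumed; adjust as needed):

```
# Show only harvester checks at or above the 5-second danger threshold.
Select-String -Path "$env:USERPROFILE\.chia\mainnet\log\debug.log" -Pattern 'eligible' |
    Where-Object { $_.Line -match 'Time: (\d+\.\d+) s' -and [double]$Matches[1] -ge 5 }
```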

100%. Two things helped:

  • do NOT put all plot files in :file_folder: one folder (this was a killer)
  • do NOT use multiple NAS :floppy_disk::floppy_disk::floppy_disk::floppy_disk: devices on your network, as they induce too many variables when combined

I think you can get by with one or two NASes just fine, though I would continue to advise against it for all the reasons stated up-topic ad nauseam :point_up_2:

I turned this sleep setting off. At least that’s how I understood the setting to work in my Synology NAS.

1 Like

Thank you for the tip. I did what you said, mate: my plots are in 6 folders, more than 10 terabytes, with a single external disk plugged in, but there are problems if this harvester proof check time is 5 seconds or more. That part I could not understand. I am using Windows 10. Can you tell me how to do it?

Found the HDD hibernation setting for Synology NASes:

  1. Open Control Panel
  2. If you’re not in Advanced Mode, click it at the top right.
  3. Go to the Hardware & Power settings
  4. Go to the HDD Hibernation tab
  5. Set both of the hibernation time settings (for internal and USB-connected drives) to None

2 Likes

What could cause my PC to be running proof checks only 5-10 minutes apart? My proof time is typically under 5 seconds.

Anyone, please!
If you check your log and find all proof checks under 5 sec, please share your system config and plot folder settings. How do you connect your HDDs? Over the network?

This is my problem too: 3 USB 3.0 external HDDs, and about 50% of proof checks take much more than 5 sec, typically 20-30 sec.
This makes me sad :cry:

I want to clarify this. Multiple NASes are FINE if you are running remote harvesters.

Running ONE harvester that accesses network-attached storage is an anti-pattern and will likely result in you being unable to earn chia.

I understand it’s conceptually simpler to run one harvester and access network mounts, but the problem is that this is not safe. You have to just get over it and run remote harvesters.

I have 8 different NASes on a 1gbit switch, copying files in and out all the time, and I have had zero issues. How? I run remote harvesters on each NAS.

3 Likes

Can you share how you managed to set that up? I’m highly interested :+1:

3 Likes

I go into an SSH shell, install the CLI version of chia (no GUI) per the official instructions, and follow the instructions for running remote harvesters: Farming on many machines · Chia-Network/chia-blockchain Wiki · GitHub

Takes about 5 minutes to set up per NAS.
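The rough shape of it, if your NAS gives you a shell (a sketch; paths and ports from memory, so treat the wiki above as authoritative):

```
# On each NAS, after installing the Chia CLI per the official instructions:

# 1. Copy the main farmer's CA directory over (it lives under
#    ~/.chia/mainnet/config/ssl/ca on the farmer), then init against it
#    so the harvester's certificates are signed by your farmer's CA:
chia init -c /path/to/copied/ca

# 2. In the harvester's ~/.chia/mainnet/config/config.yaml, point
#    harvester.farmer_peer at the farmer machine's IP, port 8447,
#    and add your plot directories.

# 3. Start only the harvester service:
chia start harvester
```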

4 Likes

Overnight update; these are all harvesting times:

  1. HTPC Phantom Canyon, all USB connected drives plus 3 network shares that map to 18tb WD Elements dump drives

    • avg 1.4s
    • median 0.49s
    • max 19.8s
    • min 0.06s
    • 0 values over 30s
  2. AMD Ryzen 5950x megaplotter, attached to 7 SATA drives and 8 USB drives

    • avg 4.56s
    • median 4.09s
    • max 112s :warning:
    • min 0.26s
    • 68 / 9727 values over 30s
  3. Intel NUC 6c/12t plotter, attached via 1gbps to two TerraMaster NASes on the same network switch

    • avg 1.97s
    • median 0.43s
    • max 106.4s :warning:
    • min 0.11s
    • 102 / 9746 values over 30s :warning:

I realized that the USB drives were actually sleeping even though I had a program set up to read from them every 30 seconds… I guess those reads were getting served from cache? One drive definitely woke up from time to time, because I could hear it spin up! I switched that program to write a file to them every 60 seconds instead.
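Something as simple as this PowerShell loop does the trick (a sketch; the drive letters are examples):

```
# Touch a tiny file on each plot drive every 60 seconds so the drives
# never sit idle long enough to spin down. Drive letters are examples.
$drives = 'D:\', 'E:\', 'F:\'
while ($true) {
    foreach ($d in $drives) {
        Set-Content -Path (Join-Path $d 'keepawake.txt') -Value (Get-Date)
    }
    Start-Sleep -Seconds 60
}
```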

Also I am doing a fair bit of plot copying on the AMD machine, so that could explain some of the outlier times there.

5 Likes