Watching the farmer like a hawk - SILENT failures

I have a handful of plotters (mostly Linux, 2 part-time Windows), 1 full node/farmer/harvester (Linux) and a main PC that is my daily-use machine and that i also use to monitor disk space and eligibility checks

Today my main PC’s PSU died and it took me an hour to get it back up and running - when it came back up i checked my farmer/harvester GUI, on another PC - and it all looked green - great

My main PC has an “eligible” monitor that i use for passive monitoring - fire it up and … nada

Despite the main farming GUI being all GREEN - scroll down and there had been no “Last Attempted Proof” entries for 2-3 hours - ie the harvester had died and there was no warning on the GUI

I restarted it all and am now waiting for it to sync and start farming again

THIS IS WHAT I MEAN BY A SILENT FAILURE and an example of why we need pools - i could have been looking at this green GUI for days, waiting for a reward and wondering what the problem was. Only fastidious monitoring picked this up

Farming status: Farming
Block rewards: 0.0
Plot count: 1158
Total size of plots: 114.624 TiB
Expected time to win: 1 week and 1 day

I agree that there can be a lot of improvement to the monitoring side. I think they are just making sure the core features work and the monitoring will be handled by other vendors.

My point is that without fastidious monitoring and feedback by REMUNERATION (ie pools), there are several SILENT failure modes and it’s difficult to know if something is broken or if i am just unlucky
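
To put a rough number on "broken or just unlucky", here is a minimal sketch. It assumes wins arrive as a Poisson process whose mean is the "Expected time to win" figure from the summary above (about 8 days). That model and the function name `prob_no_win` are assumptions for illustration, not something the Chia client reports.

```python
import math


def prob_no_win(days_elapsed, expected_days_to_win):
    """Probability of seeing zero rewards after days_elapsed days.

    Assumes wins follow a Poisson process with a mean interval of
    expected_days_to_win (an illustrative model, not a Chia API).
    """
    return math.exp(-days_elapsed / expected_days_to_win)


if __name__ == "__main__":
    # 8 days is roughly the "1 week and 1 day" from the farm summary.
    for days in (8, 16, 24):
        print(days, "days without a win:", round(prob_no_win(days, 8), 3))
```

Under that assumption, a perfectly healthy farm of this size still goes a full 8 days with no reward about 37% of the time, and 24 days about 5% of the time, so the absence of rewards is a very slow signal that the harvester has died.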

I know pools are coming but i also don’t want the devs to take their eye off the ball given that pools have been deferred from next week

If you gave me a choice between POOLS and fixing fragile software, i’d take pools first

Hmmm! My harvester died again - the last harvester log output was an eligibility check - then nada - the full node carried on logging - the GUI is green (the Last Attempted Proof coincided with the last harvester log entry)
There are no warnings or errors in the log after that point :frowning:
ps shows the farmer and harvester are still running …

0 S ianj       41544   41267  0  80   0 - 68133 ep_pol 12:41 pts/1    00:00:14 chia_farmer
0 S ianj       41545   41267  0  80   0 - 883136 ep_pol 12:41 pts/1   00:04:16 chia_harvester

No wonder i am one of the “unluckiest”
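
Since ps shows the processes alive while the harvester log has gone quiet, a log-freshness check is one way to catch this. A minimal sketch, assuming INFO logging is enabled and that eligibility lines look like the example in the comment. The regex, the function names and the 5 minute threshold are hypothetical choices for illustration, not part of Chia or chiadog.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape (requires INFO log level), e.g.:
# 2021-05-10T12:41:03.123 harvester chia.harvester.harvester: INFO  2 plots were eligible for farming ...
ELIGIBLE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ harvester .*plots were eligible for farming"
)


def last_eligibility_check(lines):
    """Return the timestamp of the newest eligibility line, or None if there is none."""
    latest = None
    for line in lines:
        match = ELIGIBLE_RE.match(line)
        if match:
            latest = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return latest


def harvester_is_stale(lines, now, max_silence=timedelta(minutes=5)):
    """True if no eligibility check was logged within max_silence of now."""
    last = last_eligibility_check(lines)
    return last is None or now - last > max_silence
```

Run it from cron against the lines of the debug log (typically ~/.chia/mainnet/log/debug.log) with `now=datetime.now()` and send yourself a message when it returns True. It looks only at harvester lines, so the full node carrying on logging does not mask the failure.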

Raised as a github issue and raised a more specific topic in this forum - there are many existing issues that are websocket related, but most are pretty sparse so i tried to add as much info as possible

81 minutes later - happened again - added details to github issue

Enabled chiadog watchdog so i get pushover messages when it happens - it’s gonna be a hard night … lol

Well, I just added an issue to github for Chiadog. So heads up. I keep getting “harvester appears to be offline” when it isn’t. So hopefully you don’t get false notifications to add to your stress.

I noticed that when i kill the chia processes and restart the gui/farmer/harvester - chiadog chimes in a while later to say it is still down - i now restart that at the same time