Farmer stops syncing every day after several hours

mark · January 1, 2024, 10:46pm

My farm and full node have stopped syncing every day after some number of hours (usually between 2 and 24 hours). I restart the farm and it starts working again, but then I have the same issue again after some hours.

In my logs, things look normal:

2024-01-01T12:05:26.718 harvester chia.harvester.harvester: INFO     3 plots were eligible for farming 80a00efc91... Found 0 proofs. Time: 0.26310 s. Total 1029 plots
2024-01-01T12:05:26.862 harvester chia.harvester.harvester: INFO     3 plots were eligible for farming 80a00efc91... Found 0 proofs. Time: 0.31689 s. Total 1029 plots
2024-01-01T22:28:47.177 full_node full_node_server        : INFO     Connection closed: 91.117.34.222, node id: a2040e6200fe9bc92ddd37c1b927b782206fbea7120e49f1a33b7da4ff54e9f8
2024-01-01T22:28:47.179 full_node chia.full_node.full_node: INFO     peer disconnected PeerInfo(_ip=IPv4Address('91.117.34.222'), _port=8449)
2024-01-01T22:28:53.436 full_node full_node_server        : INFO     Connection closed: 67.174.212.86, node id: 9c6deff5842fe774073fa34f307db9f77d7029556f1a637fab6a36eaa4b7bca4
2024-01-01T22:28:53.437 full_node chia.full_node.full_node: INFO     peer disconnected PeerInfo(_ip=IPv4Address('67.174.212.86'), _port=0)

Note that everything seems normal at 12:00 and then the next log entry is at 22:00 (which is when I logged in to check on things). So that means between 12:00 and 22:00 it was neither farming nor checking for proofs.
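
A gap like this can also be located mechanically instead of by eye. The following is a minimal sketch, assuming every regular log line starts with a timestamp in the format shown above; the function name find_log_gaps and the 600-second default threshold are made up for illustration.

```shell
# find_log_gaps [MAX_GAP_SECONDS] < debug.log
# Hypothetical helper: prints "<gap seconds> <previous timestamp> <next timestamp>"
# for each pair of consecutive timestamped lines further apart than the threshold.
find_log_gaps() {
    awk -v max_gap="${1:-600}" '
        # Days since 1970-01-01 for a Gregorian calendar date.
        function days(y, m, d) {
            if (m <= 2) { y -= 1; m += 12 }
            return 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d - 719469
        }
        # Lines without a leading timestamp (e.g. traceback lines) are skipped.
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
            # "+ 0" forces numeric values; substr() returns strings.
            y = substr($1, 1, 4) + 0; mo = substr($1, 6, 2) + 0; d = substr($1, 9, 2) + 0
            h = substr($1, 12, 2) + 0; mi = substr($1, 15, 2) + 0; s = substr($1, 18, 2) + 0
            t = days(y, mo, d) * 86400 + h * 3600 + mi * 60 + s
            if (seen && t - prev_t > max_gap)
                printf "%d %s %s\n", t - prev_t, prev_ts, $1
            prev_t = t; prev_ts = $1; seen = 1
        }
    '
}
```

For example, find_log_gaps 600 < ~/.chia/mainnet/log/debug.log would list every silence longer than ten minutes, assuming the default log location. Fractional seconds are ignored, which is fine for gaps measured in minutes or hours.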

Farming status: Not synced or not connected to peers

chia show -s
Network: mainnet    Port: 8444   RPC Port: 8555
Node ID: xxxxxxxx
Genesis Challenge: ccd5bb71183532bff220ba46c268991a3ff07eb358e8255a65c30a2dce0e5fbb
Current Blockchain Status: Not Synced. Peak height: 4739099

chia version 2.1.3

Ubuntu 20.04.3 LTS

Running on a Raspberry Pi4 with 8 GB of ram.

Chia is running off of a connected SSD with 440 GB of capacity and plenty of free space.

I have been farming on this setup since Chia launched in 2021. So I’m not sure what is causing it to suddenly not work.

I tried deleting my peers database and restarting Chia, but after syncing the problem returns after several hours.

What can I do to try to resolve this problem?

Jacek · January 2, 2024, 12:04am

1. when did this problem start?
2. are you still connected but just cannot sync, or completely disconnected?
3. how many peers are you connected to?
4. what is your log level?
5. where is your swap file?

Since mid Dec there is a dust storm going on, so your RPi may be getting overloaded.

mark · January 2, 2024, 1:13am

Hi, thanks for your response.

1. it could have been any time in the last couple months, to be honest, since I was not checking my farm regularly until the last week
2. it looks like the node is not syncing. It’s connected to the internet. What do you mean by connected, and how do I check if it’s connected?
3. When I try chia peer full_node -c it lists 50 FULL_NODE entries, which is equal to the target_peer_count in my config file
4. log_level is INFO
5. My OS swap is disabled. I checked with swapon --status which returned empty

Let me know if any of that looks suspicious. Otherwise yes perhaps it’s the dust storm. Maybe there’s a way I can monitor whether or not my RPi is getting overloaded

Jacek · January 2, 2024, 1:41am

1. as you don’t know whether it started before the dust storm or not, let’s blame the dust storm for now
2. as you see those peers in #3, your node is connected, but most likely overwhelmed (low hanging fruit to chase)
3. there is no difference (from your node perspective) whether you are running with 10 or 50 peers; however, each peer comes at processing cost, so I would drop it to 8-10 first (can go lower when monitoring the box, but let’s start with that)
4. drop the log level to WARNING; chia doesn’t have rate control over those log lines, so every single line produced is hitting the SSD slowing everything down, and there are possibly tons of those lines
5. that’s fine
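
Before and after editing, it helps to confirm what the config actually contains. A minimal sketch, assuming the default config location ~/.chia/mainnet/config/config.yaml; the function name show_chia_settings is hypothetical, and target_peer_count may show up under more than one service section, so check which section each reported line belongs to.

```shell
# show_chia_settings [CONFIG_FILE]
# Hypothetical helper: prints, with line numbers, every target_peer_count and
# log_level entry in the config so the values can be checked before a restart.
show_chia_settings() {
    grep -n -E '^[[:space:]]*(target_peer_count|log_level):' \
        "${1:-$HOME/.chia/mainnet/config/config.yaml}"
}
```

The node only picks up edited values after chia is restarted.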

I like either bpytop or btop (latter is preferred, as that is C vs Python - same dev). This one gives you CPU / HDs / network info (this is my preference, but if you use something else that should work as well - same raw data). I would also install lm-sensors, maybe you could get some useful info from that.

mark · January 2, 2024, 3:02am

Thanks, that’s very helpful info and suggestions.

I lowered the peer count to 10, and raised the log_level to WARNING.
Unfortunately lm-sensors reports that no sensors are available for the raspberry pi 4.

I have been using htop on an ad-hoc basis, and things seem normal at the moment (CPU load, memory pressure are well within normal range). Just installed btop from snap and it seems really cool.

Do you recommend I just leave btop running in a session the whole time? or is there some way to make it log anomalous conditions?

Just restarted chia, so let’s see how things go now.

Interesting thing about raising log_level to WARNING is that I won’t see the stream of logs about “Found 0 proofs”, so it’ll be harder to know at what time the system stopped working.

Jacek · January 2, 2024, 3:55am

Check this thread about lm-sensors - lm-sensors does not detect integrated temperature sensor on Raspberry Pi · Issue #30 · lm-sensors/lm-sensors · GitHub. Kind of old, but at least will give you an alternative to check temp(s).

I am not sure if btop logs anything, as I was not looking for it. However, I usually run it for hours without any noticeable side effects. It is C code, so it should be very efficient (the previous version, bpytop, was written in Python and still did not pull that many resources).

The main problem was not syncing, right? You can run ‘chia show -s’ and the fourth line should be ‘Current Blockchain Status: Full Node Synced’. I would run that command through watch with a 5 mins refresh (easy to monitor).
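
A minimal sketch of that check, assuming the status line reads exactly as quoted above when the node is synced; the function name is_synced is made up for illustration.

```shell
# is_synced: reads `chia show -s` output on stdin and succeeds only when the
# status line reports a fully synced node.
is_synced() {
    grep -q 'Current Blockchain Status: Full Node Synced'
}

# Examples (both need a running node, so they are left commented out):
#   watch -n 300 chia show -s
#   chia show -s | is_synced && echo "synced" || echo "NOT synced"
```

The watch form refreshes the full output every 300 seconds; the is_synced form gives an exit code that a script can act on.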

The problem with those logs is that those writes compete with db activity, so we want to minimize everything that creates additional traffic on the SSD that hosts bc db. If you have an old SSD, maybe you could use that and push logs there.

Saying that, the wallet db doesn’t have that much traffic and is faster (super small), so maybe taking the wallet db from the main SSD would help as well. Wallet db can sit on a crap drive, as it is fast to regenerate, and it also doesn’t interfere with the farming (if it is being regenerated).

Still, the peer count is what is usually the main problem (the bc db on SSD actually is). So, you may as well do just one thing at a time (bring your logs back up to INFO level, if that helps), and start from there. Just tail your logs to get a sense of how much is being pumped.

seymour.krelborn · January 2, 2024, 4:27am

What version were you running prior to 2.1.3?
Did you have this problem with that prior version?

mark · January 2, 2024, 5:34am

@Jacek :
I was able to use sudo vcgencmd measure_temp to check the CPU temp, but then I realized that btop also reports the CPU temperature, so that seemed redundant. Although I suppose I could log the results of CPU temp checks to file.
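
Logging those temperature checks to a file could look like the sketch below. It assumes vcgencmd measure_temp prints a single line of the form temp=48.7'C; the function names parse_temp and log_temp_line and the file name cpu_temp.log are hypothetical.

```shell
# parse_temp: reads `vcgencmd measure_temp` output on stdin and prints only
# the numeric reading in degrees Celsius.
parse_temp() {
    sed -n 's/^temp=\([0-9.]*\).*/\1/p'
}

# log_temp_line: prefixes the reading with a UTC timestamp.
log_temp_line() {
    printf '%s cpu_temp_c=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%S)" "$(parse_temp)"
}

# Example (Raspberry Pi only, so left commented out):
#   vcgencmd measure_temp | log_temp_line >> cpu_temp.log
```

Run from a loop or cron, this produces a timestamped series that can be lined up against the chia debug log.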

The result of chia show -s is Current Blockchain Status: Not Synced. when I check periodically on my machine (every 2 to 24 hours or so). So I have to restart chia to make it sync again.

What I meant about the “Found 0 proofs” INFO log is that this is kind of like a ‘heartbeat’ that tells me that my farmer was working during the time corresponding to the timestamps of the logs. So for example, when I log in to check my system, I can see the last INFO log about “Found 0 proofs” was from 10 hours ago, and no activity since then, which tells me the heartbeat died 10 hours ago. Maybe I can try adding a watch for chia show -s every few minutes like you suggested.

I could try moving the log folder to a separate SSD if I increase the log level to INFO again.

I have actually shut down the wallet, so I’m only running the farmer and associated services.

After reducing the target peer count, the farm has been farming successfully for the last 3 hours. I’ll check in tomorrow and see if it’s still alive. Thanks again

mark · January 2, 2024, 5:35am

@seymour.krelborn I forget exactly but it was either 1.8.x or 1.6.x . I didn’t have a sustained problem like this before

Jacek · January 2, 2024, 6:02am

You could conditionally redirect the output of chia show -s to a file and have the same info as from debug.log with basically no overhead (doing it every 5 mins or so). Of course, INFO level is the way to go, but you need to start somewhere first.

You can check config.yaml / data_layer / logging: / log_filename and change log location there to point it to one of your HDs for now (increasing it to INFO).

Looking forward to seeing how it will hold tomorrow.

seymour.krelborn · January 2, 2024, 6:05am

Two items to consider:

1:
Roll back to 1.8.x, and see if the problem persists, or if the problem vanishes.

2:
Are you wired to your router, or are you wireless?

Approximately 15 months ago, I kept seesawing between being synced and falling out of sync. Yet, everything on my end checked out good, including my strong wireless signal.

Having no other ideas, I disabled my wireless connection, and plugged in an Ethernet cable. Not only did I never fall out of sync again, but my response time to challenges improved.

Ronski · January 2, 2024, 6:34am

This is most likely due to the dust storm. My brother runs his farm on an old PC (about 15 years old!) with a dual core CPU which peaks at 3 GHz, and he’s been experiencing the same problem.

Apparently quite a bit of the Chia code is single threaded, and very poorly optimised, which causes problems.

You could try the latest Beta which addresses some of the problems caused by the dust storm.

@seymour.krelborn interesting point on the wi-fi, I’m pretty sure my brother is hard wired, but I’ll check.

drhicom · January 2, 2024, 4:22pm

“Oh the famous install the lan cable fix”
That was when I was younger!!!

mark · January 3, 2024, 12:40am

Well unfortunately I checked back today and my node is not synced again.

This was with log_level at WARNING and peer count at 12.

$ chia show -s 
Network: mainnet    Port: 8444   RPC Port: 8555
Node ID: xxxxx
Genesis Challenge: ccd5bb71183532bff220ba46c268991a3ff07eb358e8255a65c30a2dce0e5fbb
Current Blockchain Status: Not Synced. Peak height: 4743184
      Time: Tue Jan 02 2024 09:19:06 UTC                  Height:    4743184

$ # number of connected nodes
$ chia peer full_node -c | grep FULL_NODE | wc -l
12

I had had btop running in a tmux session, but unfortunately it says at the top “03:19:59 Terminated” and has returned to a command prompt.

and it’s now 00:22 UTC the next day.

The current time:

$ date -u
Wed Jan  3 00:26:29 UTC 2024

I tried grepping my debug logs for what happened around 03:19 UTC yesterday, and it shows:

2024-01-02T03:18:02.595 full_node chia.full_node.full_node: WARNING  Block pre-validation time: 17.55 seconds (32 blocks, start height: 4739643)
2024-01-02T03:19:02.147 full_node chia.full_node.full_node: WARNING  Block pre-validation time: 21.77 seconds (32 blocks, start height: 4739675)
2024-01-02T03:19:59.521 full_node chia.full_node.full_node: WARNING  Block pre-validation time: 17.12 seconds (32 blocks, start height: 4739707)
2024-01-02T03:20:52.808 full_node chia.full_node.full_node: WARNING  Block pre-validation time: 19.29 seconds (32 blocks, start height: 4739739)

I don’t know if this is a real problem or a red herring. But it looks like some block pre-validation took around 20 seconds. However those logs continue on and on until 2024-01-02T04:34:48.108, and are not super frequent (about once a minute). After that they decrease to once every ten minutes or so.
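
To judge whether those pre-validation times are drifting, the warnings can be summarized rather than read one by one. A minimal sketch, assuming the message format shown in the lines above; the function name prevalidation_stats is made up for illustration.

```shell
# prevalidation_stats < debug.log
# Hypothetical helper: summarizes "Block pre-validation time" warnings as the
# number of batches, the average time and the worst time, in seconds.
prevalidation_stats() {
    awk '
        /Block pre-validation time:/ {
            # The number is the field right after the literal word "time:".
            for (i = 1; i < NF; i++)
                if ($i == "time:") { secs = $(i + 1) + 0; break }
            n++; sum += secs
            if (secs > max) max = secs
        }
        END {
            if (n) printf "batches=%d avg=%.2fs max=%.2fs\n", n, sum / n, max
        }
    '
}
```

Fed the four lines quoted above, it reports batches=4 avg=18.93s max=21.77s, which works out to roughly 0.6 seconds per block for each 32-block batch.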

Then later on I start seeing tons of log lines like

2024-01-02T04:36:21.134 full_node chia.full_node.full_node: WARNING  Transaction queue full for peer 273ef3f1e7de3ff966939d8442088ac42c22c4b43479b03eebcedb27f4d59ce7

Again, not sure if these are red herrings.

Then I see many logs up until about 2024-01-02T09:30, when I see this error:

2024-01-02T09:30:00.773 full_node full_node_server        : ERROR    Exception:  <class 'asyncio.exceptions.TimeoutError'>, closing connection PeerInfo(_ip=IPv4Address('93.209.124.114'), _port=8444). Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/chia/server/ws_connection.py", line 425, in _api_call
    response: Optional[Message] = await asyncio.wait_for(wrapped_coroutine(), timeout=timeout)
  File "/usr/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

and

2024-01-02T09:30:00.777 full_node full_node_server        : WARNING  Banning 93.209.124.114 for 10 seconds

After that the number of log lines slows to about 4-5 lines per hour, the most recent being:

2024-01-02T21:13:44.577 full_node full_node_server        : WARNING  Banning 182.138.160.134 for 10 seconds
2024-01-02T21:13:44.579 full_node full_node_server        : WARNING  Incompatible network ID. Maybe the peer is on another network
2024-01-02T22:01:31.532 full_node full_node_server        : WARNING  Banning 182.138.160.134 for 10 seconds
2024-01-02T22:01:31.535 full_node full_node_server        : WARNING  Incompatible network ID. Maybe the peer is on another network
2024-01-02T22:01:59.853 full_node full_node_server        : WARNING  Cannot write to closing transport 73.119.171.145
2024-01-02T23:41:26.998 full_node full_node_server        : WARNING  Banning 162.212.153.137 for 10 seconds
2024-01-02T23:41:27.001 full_node full_node_server        : WARNING  Incompatible network ID. Maybe the peer is on another network
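
The drop in log volume can be made visible by counting lines per hour. A minimal sketch, assuming the log is in chronological order and each counted line starts with a timestamp as above; the function name lines_per_hour is hypothetical.

```shell
# lines_per_hour < debug.log
# Hypothetical helper: prints "<date>T<hour> <count>" for each hour that has
# at least one timestamped log line. Input must be in chronological order.
lines_per_hour() {
    awk '
        $1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:/ {
            hour = substr($1, 1, 13)
            if (hour != current) {
                if (current != "") print current, count
                current = hour; count = 0
            }
            count++
        }
        END { if (current != "") print current, count }
    '
}
```

On the seven lines above this prints 2, 3 and 2 lines for hours 21, 22 and 23. Hours with no log lines at all simply do not appear, so a missing hour is itself a signal.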

Hopefully there’s some insight coming from these logs.

Next I will try logging chia show -s in a loop with a timestamp; hopefully that will pinpoint exactly when the node stops syncing. And I need to find a way to keep btop running or log CPU, memory, network and IO load with timestamps too. And I will try upgrading to the newest beta, or downgrading to 1.8.x (assuming that is still valid for farming).

Jacek · January 3, 2024, 1:54am

Maybe drop the number of peers down to 5-7 for now.

I have to say that I used bpytop more, and only recently (a couple of months ago) switched to btop. However, I have never had any problems with it; it was always rock solid. Also, you are running the ARM version, so maybe that build is not that well tested. Assuming that it shouldn’t be crashing, maybe this indicates a low RAM situation? Have you tried going through the system logs? Maybe you could find something there. Also, maybe try creating a swap file on your SSD.
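
One way to go through the system logs for a low RAM situation is to look for traces of the kernel's OOM killer. A minimal sketch, assuming the usual kernel wording such as "Out of memory: Killed process" or "invoked oom-killer"; the function name find_oom_kills is made up for illustration.

```shell
# find_oom_kills: filters kernel log text on stdin for OOM-killer activity.
find_oom_kills() {
    grep -i -E 'out of memory|oom-kill|killed process'
}

# Examples (may need sudo, so they are left commented out):
#   dmesg -T | find_oom_kills
#   journalctl -k | find_oom_kills
```

If btop, or a chia process, shows up in a "Killed process" line around the time of the crash, low RAM is the likely cause. No output means only that the kernel log holds no such entry.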

That second log has an Exception in it. That is not normal behavior, but without diving deeper into the code it is hard to tell whether just that one connection task was taken down, or whether chia got unstable from that point on. My take is that in production code there should be no Exceptions, but chia has them all over the place, making it harder to understand how critical they are.

Banning peers is normal, so I would not worry about it (yet, see below). Although, the next block couples peer-banning with those peers being on a different network (not sure whether a testnet (no clue how it could be) or a forked branch). I have not seen those before but would think that those should be rather rare things, and not happening to a bunch of peers roughly at the same time.

My understanding is that a node cannot tell whether it is on the main chain or on the forked one. The node assumes that the main chain is the one that majority of peers agree with. What it means is that potentially your node got on a forked chain. I don’t know whether the node can recover by itself from being on a forked branch, or what to do to bring it back to the main chain.

From the node perspective, it is enough to have just 1 peer, as long as the peer is reliable. This setup is mostly used for a secondary (backup) node at home (kind of crude master/slave). Maybe just for testing, you could sync to just 1 or 2 peers for a couple of days. In config.yaml there should be an option to specify trusted peers and those take preference over other peers. Unfortunately, I am not running my chia node at the moment (I run a test one, but it is spotty). Maybe a couple of folks could PM you and provide their IPs to link to (they need to have port 8444 open, though).

As far as that chia show -s, just grep for “Synced” and only log those times.
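
A variation on that idea is to keep only the moments when the status flips, which makes the log far shorter. A minimal sketch, assuming each input line is a timestamp followed by the Current Blockchain Status line from chia show -s; the function name status_changes and the labels synced / not-synced are hypothetical, and anything other than "Full Node Synced" is counted as not-synced.

```shell
# status_changes < status.log
# Hypothetical helper: prints "<timestamp> <state>" only when the state
# differs from the previous line, so each line marks a transition.
status_changes() {
    awk '
        {
            state = ($0 ~ /Full Node Synced/) ? "synced" : "not-synced"
            if (state != prev) print $1, state
            prev = state
        }
    '
}
```

The first line of input is always printed, since there is no earlier state to compare against.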

As far as monitoring / logging your box, I use node-exporter / prometheus / grafana to do that. You would run node-exporter on your RPi and prometheus / grafana on another box. Here is a link about running it on RPi - Prometheus node exporter on Raspberry Pi - How to install - Linuxhit (I didn’t read it, though). Again, you don’t want to run prometheus / grafana on your RPi, so as not to introduce extra stress and further obscure the problem.

Nothing really stands out from what you have there. The btop crash may suggest low RAM, and that last block of log lines may suggest being forked out (although I don’t know how to explain that you are forked out so often - maybe remove that peers db, as that is the starting point for a new start).

Maybe the first thing to do would be to add that swap file and drop peer count to 5. If that leads nowhere, try to get a couple of reliable / trusted nodes and drop the count to 1-2. Also, maybe run fsck on the SSD, if you didn’t already. By the way, is your OS on the same SSD as chia, or on an SD card (I would move it to an SD card)?

Lastly, I would create a ticket on btop github (Issues · aristocratos/btop · GitHub) about that crash. Maybe someone will comment on what the most likely reason for it could be (basically pointing you in the right direction with your chia / RPi setup).

UPDATE:
I have started my (Win) node and was looking at logs. Actually, there are not that many lines being pumped right now on INFO level (maybe a couple per second), where the old dust storms were pushing multiple lines per millisecond. Therefore, ignore all my previous whining about those logs putting pressure (keep them at INFO and on your SSD). Not sure if logs improved (got trimmed), or this is a different type of dust storm that mainly attacks mempool (keeps pressure on fees).

Also, that box (i5-8250U) runs at about 5% of CPU. This would suggest that your RPi should be able to handle that load. RAM load is at a tad over 7GB total, so maybe close to a borderline for RPi without a swap.

As @JeffJN suggested, maybe v2.1.3 is buggy, or doesn’t work that well on ARM / RPi, so maybe upgrade to that v2.1.4-rc version.

JeffJN · January 3, 2024, 2:15am

I am suspecting it might be part of the mempool issue that has been going on the last few days. Possible fix is in the latest Release Candidate. https://github.com/Chia-Network/chia-blockchain/pull/17139

Yves · January 4, 2024, 12:32am

I have the exact same problem as mark. My system is on Ubuntu with a small Supermicro server with 16 GB of RAM. It was working well for over a year and the problem started ~1 month ago. I was on chia 2.1.1 and thought the system might need an update. I upgraded to 2.1.3 but the problem persists. When sync stops, restarting chia makes it work again, but sync stops again a few hours later.

mark · January 5, 2024, 5:32pm

Good news, I dropped my target_peer_count to 6, and my farm has been running smoothly for the last 2 days.
I added a short while loop to log the sync status every 5 minutes, and it has always been synced.
Here’s the script in case anyone finds it helpful:

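#!/bin/bash
# Log the node's sync status every 5 minutes, prefixed with a UTC timestamp.

while true
do
    STATUS_LOOP_DATE=$(date -u +%Y-%m-%dT%H:%M:%S)
    # Keep only the status line from the `chia show -s` output.
    STATUS_LOOP_STATUS=$(chia show -s | grep "Current Blockchain Status")
    echo "$STATUS_LOOP_DATE $STATUS_LOOP_STATUS" >> status_loop.log
    sleep 300
done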

I will probably append a log of the top CPU, memory, disk, and network processes to the end of this log to help diagnose in the future.
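
As a starting point for that, a system-wide snapshot is cheaper than a per-process listing. The sketch below records only the 1-minute load average and available memory, not per-process, disk or network figures. It assumes a Linux /proc filesystem; the function name resource_snapshot and the output field names are made up for illustration.

```shell
# resource_snapshot [LOADAVG_FILE] [MEMINFO_FILE]
# Hypothetical helper: prints one line with a UTC timestamp, the 1-minute
# load average and the available memory in MiB, read from /proc by default.
resource_snapshot() {
    snap_load1=$(cut -d ' ' -f 1 "${1:-/proc/loadavg}")
    snap_mem_kb=$(awk '$1 == "MemAvailable:" { print $2 }' "${2:-/proc/meminfo}")
    printf '%s load1=%s mem_available_mib=%d\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%S)" "$snap_load1" "$(( ${snap_mem_kb:-0} / 1024 ))"
}
```

Calling resource_snapshot >> status_loop.log inside the same loop would put the figures next to each sync status line. The two file arguments exist only so the function can be tried against sample files.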

Thanks for all the help and detailed explanations so far.