Syncing stuck at same height

Chiasaltmine · December 1, 2021, 7:57pm

Setting up a new system to re-plot for pools. Downloaded GUI version 1.2.11. I keep getting stuck syncing at height 552,064.

Everything works great up until that height. I’ve tried deleting the db and restarting, deleting the .chia folder, uninstalling and re-installing. Three times now it sticks at the same height (after almost a day each time). Confirmed port 8444 is open and time/date properly set.

Error log has lots of:

      ERROR failed fetching 552064 to 552096 from peers.

Then followed by:

      ERROR Error with syncing: <class 'RuntimeError'>Traceback (most recent call last): File "chia\full_node\full_node.py", line 807, in _sync

    RuntimeError: Weight proof did not arrive in time from peer: XXX.XX.XXX.XXX (xxx=various peer addresses)
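To check whether the same block range really fails every time, the log can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is built from the error line quoted above, the function name `count_failed_ranges` is made up for illustration, and the rest of the log line layout (timestamps, logger names) is an assumption, so the pattern searches inside each line rather than matching it whole.

```python
import re
from collections import Counter

# Pattern taken from the error line quoted above; everything else on the
# log line (timestamp, logger name) is assumed and simply ignored.
FETCH_FAIL = re.compile(r"failed fetching (\d+) to (\d+) from peers")


def count_failed_ranges(log_lines):
    """Count how often each (start, end) block range failed to download."""
    counts = Counter()
    for line in log_lines:
        match = FETCH_FAIL.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


# Hypothetical sample lines; in practice, pass the lines of the debug log.
sample = [
    "ERROR failed fetching 552064 to 552096 from peers.",
    "ERROR failed fetching 552064 to 552096 from peers.",
    "INFO some unrelated line",
]

for (start, end), n in count_failed_ranges(sample).most_common():
    print(f"{start}-{end}: {n} failures")
```

If a single range dominates the tally, the stall is tied to specific blocks rather than to random peer timeouts.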

Any thoughts on what to try next?

Jacek · December 1, 2021, 8:12pm

Are you trying to set up a second farmer at your location just for plotting?

Could you do a screenshot of your Connections section under Full Node panel?

Chiasaltmine · December 1, 2021, 8:37pm

Hi JaceK,
Trying to set up a full independent farmer/plotter (new wallet) specifically for pools. Plan to slowly add hard drives that I wipe/re-plot from a different system (that’s been running hpool).

Is this the screenshot you wanted to see? I seem to be connected to plenty of full nodes.

Jacek · December 1, 2021, 9:12pm

Yes, that is the screenshot I wanted - thank you. Looking at the Height column, it looks like your node is at least able to properly connect to other peers (if it connected but choked early, you would see zeros there). However, the MB Up/Down doesn’t look good. Unless you have other peers that are in the range of 10+ or so, those fractional downloads (I think) indicate the second choking point (your node gets some data, but cannot process it, or something like that).

If that is indeed the second choking point, then it is due to your node not being able to handle blockchain db. The first question is what is the drive that holds that db, is it an NVMe, or SSD (what brand), or HD.

Still assuming that this is the db choking problem, another way to relieve some pressure on it is to limit the number of peers (at least while you are syncing). Lower the peer count in your config.yaml, under the ‘full_node’ section, to 10:

target_peer_count: 10

And restart chia. After 15-30 mins, do another screenshot of those connections. Hopefully, we will see more bytes received.
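For anyone who prefers to script that change rather than edit the file by hand, here is a minimal sketch using only the standard library. It assumes `full_node:` is a top-level key and that `target_peer_count` is indented beneath it, as described above; the function name `set_target_peer_count` and the sample text are made up for illustration. It works on the file contents as a string, so reading and writing config.yaml (and keeping a backup) is left to you.

```python
import re


def set_target_peer_count(config_text, new_count):
    """Return config text with target_peer_count under full_node set to new_count.

    Assumes 'full_node:' starts at column 0 and the setting is indented
    beneath it. Any other unindented line (including a comment at column 0)
    is treated as the start of a different section.
    """
    out = []
    in_full_node = False
    for line in config_text.splitlines():
        if re.match(r"^\S", line):  # an unindented line starts a new section
            in_full_node = line.startswith("full_node:")
        if in_full_node:
            m = re.match(r"^(\s+target_peer_count:\s*)\d+(.*)$", line)
            if m:
                line = f"{m.group(1)}{new_count}{m.group(2)}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical config fragment, not a real chia config.yaml.
sample = (
    "other_section:\n"
    "  target_peer_count: 5\n"
    "full_node:\n"
    "  port: 8444\n"
    "  target_peer_count: 80\n"
)

print(set_target_peer_count(sample, 10))
```

Only the value inside the `full_node` section changes; a setting with the same name in another section is left alone.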

Kind of reflecting on the errors you provided. The second one (RuntimeError) indicates that the response was too slow or missing. All your peers look good, so there is no reason to suspect that your peers are not doing what they are supposed to do. That implies that the delay is on your side. The only delay that you can have on your system is due to chia not being able to handle the blockchain db. That would also point to your blockchain db handling (e.g., move it to a faster drive, lower the number of peers). It has nothing to do with your H/W, as that is just bad chia code.

Chiasaltmine · December 1, 2021, 9:20pm

Thanks for the feedback. I’ll try and follow up with results.

This is on a NUC system only running chia off a SSD (with 400+ GB free space). I plotted previously on the system with no issues (but I shut it off a few months ago when I was done plotting).

Jacek · December 1, 2021, 9:38pm

Yeah, my node is also i5 NUC (from Gigabyte), and I did add to it an NVMe, as db couldn’t be processed properly when traffic was getting heavy. It came with a Gigabyte SSD that is rather crap, I mean good for everything else, but was not good to support blockchain updates properly. (It was passing all the benchmarks though.)

I also did try to plot on it, but it was getting really hot, and CPU started to throttle. I did open the case, and put a fan next to it, but it couldn’t reach the CPU (was on the other side of the PCB).

xkredr59 · December 1, 2021, 9:44pm

I’m no expert at syncing issues (@Jacek is) but it’s very remarkable that even after restarting clients, wiping db files and even re-installing the sync process stalls at exactly the same blocks.
Error log’s last line states that several peers are tried for these blocks and all fail to respond.
It’s something very specific in your system I guess.
Long shot but what OS are you running and is any firewall or antimalware software active that falsely triggers and blocks on some coincidentally present fingerprint in its database?

Chiasaltmine · December 1, 2021, 9:51pm

Just as a follow up, it looks like there’s a bug report on GitHub for version 1.2.11 that sounds almost identical to this. They are stuck at 552032 (vs. my 552064). They’ve added the detailed error log that looks identical to mine. Think I’ll try to roll back to an earlier version to see if it resolves.

[Bug] no full node sync chialockchain 1.2.11 · Issue #9388 · Chia-Network/chia-blockchain · GitHub

Jacek · December 1, 2021, 10:04pm

That to me means that this is not a real error (e.g., shit happened), but rather the system cannot handle something beyond that point. Basically, the only task that his box is doing is syncing (i.e., no plots, just clean start). In that case, the offending factors would be either the ISP side or local. Assuming that it would be the ISP, we should still see some data trickling, and we don’t see it. So, we are left with the system. On the system level, you have two main processes involved. The one is handling getting data from all those peers, the other is channeling that data through the db. Again, assuming that the peer handling process is at fault, we basically don’t have much traffic there, as chia is handling only one peer at a time, but doing it in a round robin cycle. So, we can assume that this process has data to feed to the db handling process. That puts us at the db not being handled right, and basically screwing synchronization with that peer handling process. At least, this is how I see it working. Those two suggestions (NVMe, lower peer count) have slightly different implications. The first one (NVMe), lets chia handle db faster. The second one is lowering issues related to synchronization between those two processes (but it also indirectly lowers db reads/writes as for every peer, chia needs to go through blockchain db).

Actually, considering those two suggestions (NVMe and peers), assuming that it would be just the first problem (slow drive holding blockchain db), that would also manifest in that drive being choked most of the time. I don’t recall anyone stating that (and that was also not the case on my box). That further points to the synchronization code between that peer handling process and db updating process. Reducing peers is removing some stress exactly there.

If that would be the case, chia would not be able to connect to those peers. We see it connecting, and we see it working up to that ~500 level rather fine. So, that would not be my choice of what could go wrong.

Again, his box runs fine to that ~500 level, so up to that point everything is sound. It implies that there are no other interferences, as the syncing process is rather linear as far as received data. The only thing that is non-linear is blockchain db access.

Thank you for that, but I am also not an expert here. We are all like those blind people touching an elephant.

One hope for me is that Flex will have a full node potentially soon, and to me that would be the way to go, as I see no hope that chia will address those issues.

Jacek · December 1, 2021, 10:10pm

If you scroll to the end of that bug report, you will see an answer from chia there. To me, first, that guy didn’t bother to read the log, and second, his guess (a problem on the ISP side) is patently wrong. It is not the first time I have seen that guy basically brushing off issues as user error.

Yes, it is true that some people reported being stuck at around ~500-800 or so. Before downgrading, I would really lower that peer count, and give it a try.

Chiasaltmine · December 1, 2021, 10:33pm

Just to follow up again. Tried reducing the peer count to 10 with no change. However, I downloaded version 1.2.10 and the syncing is moving again (past the previous sticking point). Not sure what the underlying issue is, but that seems to work for now. Thanks for the suggestions.

edit - just a note that I also downloaded the beta 1.2.12 and still had the issue with being stuck

Jacek · December 1, 2021, 10:57pm

Did you restart chia / reboot your box?

Are you still on 10 peers, or did you revert to 80?

Chiasaltmine · December 1, 2021, 11:42pm

Rebooted after the original peer count change, but only restarted chia between version changes.

Left it at 10, seems to have a decent syncing speed still.