1.2.11: new issue, just stops syncing

So, over the last few days my box has had this happen twice.

The 1.2.11 GUI has just stopped being in sync, and I have needed to delete unconfirmed transactions, then end and restart Chia.

Both times Chia has not shut down fully. The first time I wrecked my DB; this time I killed the zombie process in Task Manager and a restart of the GUI had me moving again.

Haven't checked the logs for clues as yet.
Anyone else seeing this behavior?

I've never lost sync, ever, even during dust storms and while servicing eighty nodes. However, the shutdown issue has been persistent through the various releases. I find that the 'chia' processes simply do not shut down most of the time when the GUI is closed. Yet occasionally they do. Go figure. Perhaps the DB transfer to some of those nodes keeps the processes open?

Because I had to resync once back in the summer, I think, before I realized what was happening, I now watch Task Manager every time. Given time, the processes will eventually stop. Sometimes it's fairly quick, sometimes many minutes. I now know patience, in the case of Chia, is a virtue to maintain order.

Thanks.
I also service 80 nodes, and my box is far from stressed, normally using just over a quarter of my RAM and a tiny bit of my processor, so I don't understand it, or why closing and restarting the GUI would fix it.

Seems like a bug, but I'll probably just update the client if it continues.

I forget: what CPU, memory, and SSD are you using in your farming node?

FX-8350, 32 GB DDR3, and a Kingston SSD. I forget the exact one; it wasn't the latest, fastest model, but the one before, IIRC.
Looking up the code, I think it's an A400.

Some things you might consider to help troubleshoot the issue: looking at the Passmark ratings for both of our CPUs (mine = Ryzen 5 3600), they are 5959 and 17837.

My observation is that, under peak loads, almost exactly 1/3 of my CPU is used by all Chia processes combined to maintain proper response in a realtime fashion. So, although this is just my opinion backed by my analysis of the processes involved, your CPU is pretty close to the minimum necessary to keep an average node from potentially falling behind at high-load times (when all processes are peaking at once).

The Kingston A400 SSD seems pretty weak as well, but it shouldn't cause problems unless it's near full, at which point it can slow down. Still, it might be worth getting something a bit better performance-wise for a few $$.

Your sync issue happens rarely, as you say, but when it does, something has saturated under load. These dust storms, and potentially just heavy blockchain transaction use in the future, could cause problems for a lot of borderline-and-below nodes, IMO. Good luck!!

The dust storms don't generally affect me; they push a few responses from 5 to 10 seconds, and that's all.

These last two drops seem different.

Anyhow, thanks for your input; it's appreciated and food for thought.

I found something that works. When it gets stuck, go to the Full Node tab and drop down to Peers. Delete all peers that are behind or not updated. Allow new peers to be accepted and delete any new ones not meeting the criteria. Peers like that, especially during dust storms, can get the node stuck.
Hope it helps you.
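
If you would rather script that than click through the GUI each time, something along these lines should do the same pruning via the full node RPC. The method and key names (get_connections, close_connection, peak_height) are as I remember them from a ~1.2.x install, and the lag threshold is just a number picked for illustration, so treat this as a sketch to verify, not gospel.

```python
# Hypothetical helper: drop peers whose reported peak height lags well behind
# ours. RPC method and field names are from memory of chia-blockchain ~1.2.x.
import asyncio

from chia.rpc.full_node_rpc_client import FullNodeRpcClient
from chia.server.outbound_message import NodeType
from chia.util.config import load_config
from chia.util.default_root import DEFAULT_ROOT_PATH
from chia.util.ints import uint16

LAG_THRESHOLD = 10  # arbitrary: blocks behind our peak before we drop a peer


async def prune_lagging_peers() -> None:
    config = load_config(DEFAULT_ROOT_PATH, "config.yaml")
    client = await FullNodeRpcClient.create(
        config["self_hostname"],
        uint16(config["full_node"]["rpc_port"]),
        DEFAULT_ROOT_PATH,
        config,
    )
    try:
        state = await client.get_blockchain_state()
        our_peak = state["peak"].height if state["peak"] is not None else 0
        for conn in await client.get_connections():
            if conn.get("type") != NodeType.FULL_NODE.value:
                continue  # leave the local wallet/farmer connections alone
            peer_peak = conn.get("peak_height") or 0
            if our_peak - peer_peak > LAG_THRESHOLD:
                await client.close_connection(conn["node_id"])
    finally:
        client.close()
        await client.await_closed()


asyncio.run(prune_lagging_peers())
```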

You do realize that the only reason you see those "behind" peers is that your node cannot handle that number of peers. The only reason you have those not-behind / good peers is that they have stale counters. Therefore, when you drop those peers en masse, you are explicitly dropping the peer count, giving your box some breathing room. However, if it is a dust storm, the situation will just repeat. So you are wasting your time fighting your default config.yaml configuration.

One thing to check in Task Manager is to switch the CPU view to logical processors and watch the individual cores. The reason is that most of the Chia code is not multithreaded. In this view, you may see that just one core gets choked while all the others sit idle. Nothing can be done to help it until Chia realizes they need to rewrite the code.
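
If staring at Task Manager gets old, a tiny loop with the psutil package can flag a pinned core the moment it happens (this assumes you have Python handy; nothing Chia-specific here):

```python
# Minimal sketch (assumes the psutil package): print per-logical-core load so
# a single saturated core stands out even when the combined CPU graph looks idle.
import psutil

while True:
    per_core = psutil.cpu_percent(interval=5, percpu=True)  # one % value per logical core
    hot = [i for i, pct in enumerate(per_core) if pct > 90]
    print(per_core, "| saturated cores:", hot or "none")
```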

If you have extra memory (Chia runs fine in about 8-10 GB of RAM), maybe you can see whether something like PrimoCache would work? I would set it up to cache only the blockchain DB folder.

If the box does not look stressed (basically no CPU, drive, or network activity), that usually indicates that the code has synchronization problems and is timing out. That was confirmed by Chia during the first dust storm (they screwed up synchronization between the peer handling and DB handling code). It is kind of a nasty problem to have. My take is that v1.2.11 addressed part of that, but unfortunately did not really remove the problem.

What makes this synchronization worse is that the Chia setup applies the same settings to every installation (e.g., that magical 80-peer number). On the other hand, the code has either no, or really bad, bandwidth / peer management, so this is the choking point.

There is no point bragging that a box runs with 80 peers, as that number was just picked out of a hat. If the box shows problems with syncing, just drop that number by 20, let it run for a few days, and go from there.
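
For reference, that number lives in config.yaml. On my 1.2.x-era install the key is target_peer_count under the full_node section; double-check the key name on your version before editing:

```yaml
# Excerpt from ~/.chia/mainnet/config/config.yaml (key name as of ~1.2.x)
full_node:
  target_peer_count: 60   # default is 80; drop in steps of ~20 if syncing struggles
```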

Again, from a p2p topology standpoint, 3 peers is the bare minimum. Everything above that improves the stability of the network. The upper boundary is not really about what is good for the network (or what was picked out of a hat), but rather what is good for a given node, and therefore still good for the network (better to have a node that is alive than one that is dead).

The box does not seem stressed; there was no storm for event no. 2, and the CPU max looked like 13%.
It's only a 4-core CPU, so I assume I've got power left there.
I have spare RAM; I'll think about Primo, but that's new to me.

The odd thing is, when it happens, it's not even syncing, just sitting dead despite up and down traffic, so maybe it is, as you said, a timeout.

Edit: oops, 8 cores, so if it's pushing it all through one, that could well be the issue.

That is the problem. We all have plenty of power even with these older processors, but if the code is single-threaded, that just doesn't count. Your CPU is effectively seen as a single-core one with sub-standard processing power (that is the official position). Or rather, it's badly written code that follows some garbage numbers instead of adjusting to what the system can do (my take).

I saw some people mentioning memory leaks. I didn't follow those posts, though, but that could be part of the problem as well.

Maybe another way of looking at it is to move your wallet DB to one of your HDDs. The wallet DB is not as heavily used as the blockchain DB, so it should run fine from an HDD. However, when it is sitting on the same drive as the blockchain, it just makes the SSD work harder. Again, the problem is not really the SSD being slow on paper, but rather the fact that those DB reads / writes are small-chunk random access, and that is the thing that brings basically any media to a crawl.
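
The wallet section of config.yaml has a database_path entry for exactly this. Something like the excerpt below is the idea, but the key name and the CHALLENGE / KEY placeholders are from my ~1.2.x config, so verify yours (and that an absolute path behaves on your setup) before relying on it:

```yaml
# Excerpt from config.yaml, wallet section. CHALLENGE and KEY are placeholders
# that chia fills in at runtime; the leading folder is the only change here.
wallet:
  database_path: D:/chia-wallet/db/blockchain_wallet_v1_CHALLENGE_KEY.sqlite
```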

The zombie process was still using RAM even though every other Chia process was closed.

I edited my post; it was 8 cores.


That is just a lack of synchronization during the shutdown. The only bad part is that if one of those orphaned processes is still working on the DB (start_full_node, I think), it can lead to DB corruption.

As @Fuzeguy wrote, some patience is needed, but you can always argue that someone acted just one second too early. Again, this extra wait is just trying to compensate for that broken synchronization that Chia doesn't bother to address, but it is still worth a try, especially if it is just the one restart that you need to do.
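
If you want to take the guesswork out of the waiting, a small watcher script can tell you when the last Chia process has actually exited. This assumes psutil, and the name matching is just what works on my box:

```python
# Rough sketch: after closing the GUI, wait until no chia-related process is
# left before relaunching, instead of guessing how long "long enough" is.
import os
import time

import psutil


def chia_process_count() -> int:
    me = os.getpid()
    count = 0
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        if proc.info["pid"] == me:
            continue  # don't count this watcher script itself
        blob = " ".join(proc.info["cmdline"] or []) + " " + (proc.info["name"] or "")
        if "chia" in blob.lower() or "start_full_node" in blob.lower():
            count += 1
    return count


while (remaining := chia_process_count()) > 0:
    print(f"{remaining} chia process(es) still running, waiting...")
    time.sleep(10)
print("All chia processes are gone; safe to restart the GUI.")
```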

Actually, I wanted to verify that start_full_node process on my box right now. Usually (always, so far), when looking at all those processes under the Processes tab, there was one combined Chia entry with around 10 sub-processes. However, right now all my start_full_node processes are not in there. It looks like they got orphaned, as they moved to the "Background processes" group and are there by themselves. This is the first time I see this crap (maybe this is a v1.2.10 thing, though).

I forgot to mention, or rather should have stressed it better: that 80-node thing has rather little to do with "serving nodes." The higher peer count just adds resilience to the network in case of catastrophic collapses in some regions (i.e., to prevent the network from becoming a bunch of disjointed islands). A well-designed protocol will classify peers by their "connection power" and try to keep the higher-classified peers connected to strengthen network resiliency. However, that is network resiliency, not really the "peer serving" part.

"Serving nodes" per se is rather a function of the number of nodes and the bandwidth dedicated to each individual node. As the bandwidth has a clear limit (upload bitrate), when the code behaves right it will just push more data per peer if the number of peers is reduced.

Also, looking at it from the other side, if a node has fewer peers, it will just draw more data from each peer (kind of per session), still ending up with the same amount of data per time slot.

On the other hand, it looks like during a dust storm all nodes blindly push the same data to all peers, basically saturating those peers (the node needs to sort through that noise, thus the excessive DB activity and the synchronization problems). This is where peer reduction potentially has the biggest impact: it just cuts down on the junk that gets pushed around during those times (as the code does nothing to address it).

Also, happy New Year guys!!!


And to you. (2022, cheers)

Not sure you have the correct concept of this. The view below is from another task manager application, showing the "start_full_node" processes; i.e., there is one parent and several forks. Each fork takes CPU time randomly as needed. I don't know why there is that number of forks.

Looking at my default Windows Task Manager, sure, I see some are separated as you mention, but they still belong to the parent and are accessed and run regularly, as can be seen by watching them, or in the "CPU Ave" column below. Also, in fact, all those forked subprocesses are 'background processes'. There's no shame in being one.

-corrected to show columns

I am not really sure which part you think I don't understand, or explained differently. To me, you have just restated what I said, based on output from a slightly different tool (which looks like it uses a different way of pairing processes).

You are banging on me for using "background processes" with a pejorative connotation, but I didn't say that it is something good or bad (of course, both types are needed, and neither is superior or inferior). I rather stated that, for whatever reason, those start_full_node (SFN) processes slipped outside of the main Chia container, and therefore they look like orphaned processes. Again, I stated that I see this behavior for the first time and don't really have a good explanation for it.

However, when I shut down Chia just now, the main container closed in a few seconds, but those three SFN processes didn't shut down. I waited a couple of minutes but in the end gave up and killed them. Looking at the blockchain folder, the DB was not closed properly.

This, to me, further supports the notion that those processes were orphaned, or rather that the daemon process (which owns them) didn't bother to wait for them to shut down, or somehow lost track of them. Actually, knowing that the main SFN task owns the additional ones, I killed it first. However, that didn't kill the other two SFN processes; I had to kill all of them individually. That further implies that those processes really don't communicate well among each other (thus becoming "orphaned," doing whatever they want, not what they should be scheduled to do). Although they are started as independent processes, and it was a forceful termination, so maybe that doesn't support the notion all that much.

I restarted Chia, and all processes were included in that Chia container as normal. After some wait (all processes still under the Chia container), I shut Chia down again. This time all processes exited nicely, and the DB was closed properly. Apparently, the inter-process communication was working as intended this time, further supporting the notion of orphaned processes.

I have to say that I am using the term "orphaned" liberally. I mean that the daemon process looks like it lost track of those SFN processes, nothing more. The fact that they keep working is expected, as SFN by itself deals only with peers and blockchain syncing, so it really doesn't care whether the other processes are there or not. However, the main SFN process keeps ports open for the other processes (farmer, wallet) to communicate with if they are around. If those other processes are gone, the SFNs will just merrily proceed with syncing, not paying attention to system events like a request for a shutdown (i.e., potentially corrupting DBs).

It's kind of a long shot, but maybe this is the problem that is causing those syncing issues. Maybe the fact that those SFNs are "orphaned" implies that the synchronization for handling incoming transactions by those sub-SFN processes is busted, and so there are those timeouts and everything goes out of whack. Although, it would be a really lucky shot if this were the case.

Also, as you have noticed, all those processes get some little tasks to do. However, what I saw during the last dust storm was that only the main SFN process was choking one core; all the other SFN processes were basically doing nothing. Same as you, I would expect that the only reason those additional SFN processes are there is to partition the work, but for whatever reason this is not the case. Still, I am on v1.2.10, so maybe it is a bit different in v1.2.11.

Actually, reflecting on what @bones said, he saw about 13% CPU usage. He has 8 logical cores, so that 13% works out to roughly 100% of just one core (100% / 8 = 12.5%). Even if we see multiple cores active, it could be that whatever process (SFN?) is spawning multiple threads but serializing the work (waiting for each thread to finish before starting a new one), thus ending up with a combined load equal to 100% of one core. Maybe the next time that happens, he can monitor the separate cores, not the combined usage (and eventually switch to the Processes view to nail down which process or processes are hosing the CPU).
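
A rough psutil sample along these lines (the "chia" name matching is my assumption; adjust it for your setup) would show per-process CPU, where a reading near 100% means that one process is saturating a single core:

```python
# Quick sketch (assumes psutil): sample CPU per chia-related process over a few
# seconds. Process.cpu_percent() is relative to one core, so ~100 = one full core.
import time

import psutil

procs = [
    p for p in psutil.process_iter(["name", "cmdline"])
    if "chia" in (" ".join(p.info["cmdline"] or []) + (p.info["name"] or "")).lower()
]

for p in procs:
    p.cpu_percent(None)  # first call just primes the per-process counter
time.sleep(5)            # sampling window

for p in procs:
    try:
        print(f"pid {p.pid:>6}  {p.cpu_percent(None):6.1f}%  {p.name()}")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
```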

At least that is what I see and how I would read it, but of course I am also happy to read how you are interpreting it. Also, all of that is just reflecting on what you wrote, so I didn't sleep on it yet (not that I will).

Sorry, I misinterpreted your use and meaning of the phrase, "This is the first time I see this crap." Carry on. Hopefully the new year will bring improvements to Chia's implementation, making its operation plain and not requiring such ruminations on how it potentially is (not) working correctly.