Chia Lost Sync Overnight and is No Longer Syncing

I think both our statements are a bit too general.

When chia starts, it forks several processes (e.g., daemon, start_farmer, start_full_node, …). My understanding is that those processes are most likely single-threaded. However, start_full_node (SFN) spawns several additional SFN processes (processes, not threads) to parallelize transaction processing. Looking at my two boxes, the number of those sub-SFN processes looks like it is related to the number of physical cores (e.g., on a 4/8 CPU it spawned 4 sub-SFNs, and on a 10/20 CPU eight). So far, so good. Each of those sub-SFN processes sits on one core and maxes it out. That leaves a couple more physical cores unoccupied, plus an additional 10 logical cores basically idle (the overall CPU utilization is around 50%). So, we can say that during heavy syncing, most of the physical cores are being used. Looking at how fast it syncs right now, it will take around 2 days to get it fully synced. Potentially, we could calculate how many transactions per hour it can handle in this state, and compare that with a normal daily load to see what kind of dust-storm load it could potentially handle.
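(Side note: if anyone wants to reproduce these per-process numbers, here is a minimal sketch of how I would sample them. It assumes psutil is installed and that the node's processes show up with "full_node" somewhere in their name or command line, which is how they looked on my boxes; adjust the match string for your install.)

```python
import time
import psutil  # pip install psutil

def full_node_procs():
    procs = []
    for p in psutil.process_iter(["name", "cmdline"]):
        text = " ".join(p.info["cmdline"] or []) + " " + (p.info["name"] or "")
        if "full_node" in text:
            procs.append(p)
    return procs

procs = full_node_procs()
for p in procs:
    p.cpu_percent(None)          # prime the counters; the first call always returns 0
time.sleep(5)                    # sampling window
for p in procs:
    try:
        # 100% here means one fully loaded logical core (matches the per-core view)
        print(f"pid={p.pid:>7}  cpu={p.cpu_percent(None):6.1f}%  {p.name()}")
    except psutil.Error:
        pass                     # process exited while we were sampling
```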

However, all peer connections and blockchain db writes (the wallet does its own thing) are handled by just that one main-SFN process. That is potentially the single-threading problem during a dust storm, while the node tries to act as if everything were normal. The main SFN process gets overwhelmed handling all those peers. At least, this is what I saw during previous dust storms: one core choked, no strain on other resources (the NVMe holding the blockchain db, plenty of free RAM), yet the node struggling. Dropping the number of peers during those events helps that main-SFN process, and at some point all is good again (same storm traffic, just fewer peers).
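For reference, the knob I am talking about is target_peer_count under the full_node section of config.yaml (80 by default on mainnet, if I remember right). Here is a minimal sketch of dropping it, assuming the stock mainnet config path; note that dumping the YAML back strips the file's comments, so hand-editing that one line is just as easy. Restart the node afterwards.

```python
from pathlib import Path
import yaml  # pip install pyyaml

cfg_path = Path.home() / ".chia" / "mainnet" / "config" / "config.yaml"
cfg = yaml.safe_load(cfg_path.read_text())

print("current target_peer_count:", cfg["full_node"]["target_peer_count"])
cfg["full_node"]["target_peer_count"] = 10   # e.g., cling to ~10 peers during a storm
cfg_path.write_text(yaml.safe_dump(cfg))     # warning: rewrites the file without comments
```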

Kind of funny. I installed chia v1.2.11 on that bigger box (i9-10900), and started syncing from scratch. I tried to manually add my main node as another peer, but it preferred to draw data from Iran. When I killed that peer, it switched to Croatia, … Also, for whatever reason the connection between those two nodes was constantly aborted (I tried to re-add it a few times, but it got dropped within minutes, so I gave up).

UPDATE 1
After about 6 hours, the sync status is at around 35% (good!). That would indicate that it may take less than a day to sync. However, something changed, as the overall CPU usage is mostly at 10% total, with some spikes up to 50% (say, on average, below 20%).

Looking at the individual SFN processes, the main one chokes its core completely, whereas the sub-SFN processes drop down to 10% or so of their respective core loads. The SSD that holds the blockchain is at about 10%-20% load.

So, maybe this indicates that as the db grows, it is slowing down, and is maybe dragging everything down with it. Although, I would expect that as long as that SSD is not pushing 80%-100%, there should still be some headroom for those sub-SFNs to run at full speed.

Also, as the node is just leeching data for now, it is not pushing any data to other nodes, as every other node is mostly fully synced (so this part is not stressed as much as during those dust storms).

UPDATE 2
I switched the db from SSD (Samsung 870 EVO) to NVMe (WD Black SN 750). All those sub-SFN processes are now seesawing between 20%-100% of their respective core loads. On the SSD, once loads dropped down to 20%, they all stayed there for a rather long time. This time, they bounce between those two levels without much, if any, delay. Both peaks and valleys are sharp. So, it looks like that switch improved the overall sync speed.

Looking at that NVMe, it bounces between 5%-10% load with r/w speeds in the range of 1 MB/s reads / 10-15 MB/s writes. The reason those slow reads/writes still show such high load percentages is that the db is read and written in really small chunks, so the achievable throughput is way below the advertised big-chunk sequential numbers.
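A quick back-of-the-envelope on why such a slow-looking write stream still registers a noticeable busy percentage (assuming sqlite's default 4 KiB page size):

```python
PAGE = 4 * 1024                        # sqlite default page size, in bytes
for mb_per_s in (10, 15):
    iops = mb_per_s * 1024 * 1024 / PAGE
    print(f"{mb_per_s} MB/s of {PAGE // 1024} KiB pages ≈ {iops:,.0f} IOPS")
# roughly 2,500-3,800 small random writes per second; at low queue depth, and
# with sqlite fsyncing on commits, that is real work for the drive even though
# the MB/s number looks tiny next to the spec-sheet sequential figures
```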

By the way, the main-SFN is still hosed (running at 100% of its core load).

UPDATE 3
After a few hours, the db grew (40GB), and the main SFN is choking more and more. Right now, the total CPU load peaks are really sharp, but the valleys are at least twice as wide as the bases of those peaks (safe to say that all those sub-SFNs are operating at around 30% capacity). Those valleys are times when the sub-SFN processes are not processing any blocks.

Also, from time to time the CPU usage of the main SFN drops down to zero (of course, all sub-SFNs are also at zero). As the total CPU load at that point is at zero, this implies that the main-SFN's scheduling code went out of whack, and it is waiting on some timeouts before it can get back to work.

UPDATE 4
After about 30-35 hours, the sync is complete. I am not sure whether it was a real thing, or just me waiting for it to end, but it felt like the last ~5k blocks were processed really, really slowly.

So, to summarize this exercise, I have used:

Intel i9-10900, 10 physical / 20 logical cores, pushing close to 5 GHz when not loaded (non-K version, but tweaked to be close to it)
Samsung 870 EVO SSD (used for about 50% of syncing)  
WD Black SN 750 NVMe (used for the last 50% of syncing)
128GB 3,600 MHz DDR4 (used just a tiny bit of it)
Windows 11
Chia v1.2.11

There was nothing else running on that box, just a single node with those two drives (SSD and NVMe), no additional HDs, no plots, no nothing.

So, that setup is not that shabby. It has a decent amount of raw CPU power, decent single-core performance, and fast DDR4 RAM. Both the SSD and the NVMe are the best, or close to it, in their respective classes.

The main work during the syncing process was done by the start_full_node processes. There is the main SFN process that deals with all peers and the db, and schedules additional sub-SFN processes to crunch data for downloaded blocks. Those additional sub-SFN processes only work when they are given blocks to crunch; otherwise, their loads drop to zero.

There were 8 of those sub-SFN processes started. Looking at individual cores, those sub-SFNs are either single-threaded or serialize their threads if they are multithreaded (i.e., still acting single-threaded, but with the additional burden of thread switching).
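To make that main-SFN / sub-SFN relationship concrete, here is a minimal sketch of the pattern (this is not Chia's actual code, just the shape I am describing): a single coordinator process hands CPU-bound batches to a pool of single-threaded worker processes, and the workers only burn CPU while batches are in flight.

```python
from concurrent.futures import ProcessPoolExecutor
import hashlib
import os

def validate_batch(height: int) -> str:
    # stand-in for block validation: pure CPU work done in a worker process
    digest = height.to_bytes(8, "big")
    for _ in range(200_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

if __name__ == "__main__":
    workers = max((os.cpu_count() or 2) // 2, 1)   # roughly #physical cores, as observed
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # the coordinator hands out the work; if it stalls on peer traffic or db
        # writes instead of submitting, every worker in the pool idles at 0% CPU
        for height, digest in zip(range(64), pool.map(validate_batch, range(64))):
            print(height, digest[:16])
```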

With that setup, the box's total CPU load never went over 50-60%. As the box has 10 physical cores, we could assume that one more sub-SFN process could be started. To be more aggressive, another 2-4 sub-SFN processes could be started, and those would use hyperthreading to utilize potential idle time in the other processes.

If the concern is that having those extra sub-SFN processes could interfere with other processes (wallet, farmer, main-SFN), the extra ones should be started with lower priority, thus yielding to those that really need the CPU time.
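Chia does not expose a priority knob, but as a stopgap the existing workers can be demoted after the fact. A minimal sketch, assuming psutil, the same "full_node" name match as before, and that the sub-SFNs show up as children of the main SFN process (which is how they appear in the process tree on my box):

```python
import sys
import psutil  # pip install psutil

# BELOW_NORMAL priority class on Windows, niceness +10 elsewhere
LOWER = psutil.BELOW_NORMAL_PRIORITY_CLASS if sys.platform == "win32" else 10

for p in psutil.process_iter(["name", "cmdline"]):
    text = " ".join(p.info["cmdline"] or []) + " " + (p.info["name"] or "")
    if "full_node" not in text:
        continue
    # demote only the children (the sub-SFN workers), not the main SFN itself
    for child in p.children():
        try:
            child.nice(LOWER)
            print("demoted worker", child.pid)
        except psutil.Error as err:
            print("skipped", child.pid, err)
```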

During the whole process, the download speed from all those peers was around 1Mbps (125KBps). That speed implies that just one rather wimpy peer could satisfy that bitrate. So, the obvious question is why not reduce the number of peers if the node is way behind the current blockchain height. Sure, the idea is not to put too much strain on any one particular peer, but that could be handled by recycling peers after some time (an hour or two, …). The main-SFN needs to handle all those peer connections, and during syncing the node is basically dead to the network, as virtually the whole network is ahead of it. Dropping the number of peers when the node is behind would have another benefit: the node would behave much better during dust storms.
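Quick numbers behind that claim (treating the stock 80-peer default as an assumption):

```python
mbps = 1.0                                   # observed aggregate download speed
bytes_per_s = mbps * 1_000_000 / 8           # ≈ 125 KB/s total
hours = 32                                   # roughly the full sync time above
total_gb = bytes_per_s * hours * 3600 / 1e9
print(f"aggregate ≈ {bytes_per_s / 1e3:.0f} KB/s, ≈ {total_gb:.1f} GB over {hours} h")
print(f"per peer at 80 peers ≈ {bytes_per_s / 80 / 1e3:.2f} KB/s")
# i.e., any single peer with a ~1 Mbps upstream could have fed this sync by itself
```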

After the db grew to about 30GB or so, syncing started slowing down. The total CPU load, which had mainly been around 50%, started dropping down to about 10% (implying that only the main-SFN was working and all the sub-SFN processes were starving). There were more and more such valleys, to the point that the chart looked more like just occasional peaks. The overall average CPU load dropped to about 20-30%.

The SSD load at that point was around 10%, with reads around 1MBps and writes around 20-30MBps. So, on paper, it looked like the SSD still had plenty of headroom.

I switched from SSD to NVMe. There was a small improvement. The chart switched from mostly valleys to mostly peaks. However, after a few hours it looked the same as with the SSD again.

One thing that we kind of miss is that SSDs are advertised as having ~500 MBps r/w speeds, and NVMes around 3.3 GBps. However, those results are for sequential transfers using big blocks. I ran CrystalMark against my drives, and for small blocks the NVMe had only around a 50% advantage. I would imagine that going to even smaller chunks would further diminish the NVMe's advantage. Still, with prices being about a wash, I would go for an NVMe (on my primary node such a switch helped a lot, potentially also because it moved those dbs off the OS drive).

At that point, looking at single-core performance, the main SFN process was running 100% of the time at 100% load (so it was clearly choking). At the same time, during those extended valleys, the sub-SFN processes were dropping down to zero load. So, basically, most of the CPU was idle, with the main SFN struggling to handle peers and deal with the blockchain db.

With all that, it is obvious that the main choke point during syncing is the main SFN. That code is either single-threaded or has serialized threads, which boils down to it choking the single core it sits on and stalling the syncing process.

The second thing to look at is the blockchain db. Obviously, as the db grows, this problem will just get bigger and bigger. Improvements can come from three places: 1. cleaning up the db (better layout, better indexing), 2. the code handling the db, 3. dumping sqlite. It looks like chia v1.3 will have some improvements around #1. We don't know whether anything will be done for #2, and we know that sqlite is still being used.
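For anyone who wants to poke at #1 themselves, the db can be inspected with nothing but the Python stdlib. A minimal sketch, assuming the default v1 mainnet db path (open it read-only, ideally with the node stopped; the row counts can take a while on a 40GB file):

```python
import sqlite3
from pathlib import Path

db = Path.home() / ".chia" / "mainnet" / "db" / "blockchain_v1_mainnet.sqlite"
con = sqlite3.connect(f"file:{db}?mode=ro", uri=True)

page_size = con.execute("PRAGMA page_size").fetchone()[0]
page_count = con.execute("PRAGMA page_count").fetchone()[0]
print(f"page size {page_size} B, file ≈ {page_size * page_count / 1e9:.1f} GB")

# list every table with its row count to see what dominates as the db grows
for (name,) in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
):
    rows = con.execute(f'SELECT count(*) FROM "{name}"').fetchone()[0]
    print(f"{name:30s} {rows:>14,} rows")
con.close()
```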

Part of the problem with the code handling the db is related to the peer count. The more peers the node needs to handle, the more activity is put on the db. Of course, when the water is calm and there is only a small breeze, all looks cool; however, when dust storms/belches/puffs come, trying to keep all those peers around is just foolish. A dead node is just a dead node to the network, whereas a node that clings to just 10 peers but still works is an asset. How hard is that to understand?

I will try to repeat this exercise using the v2 db, although currently it looks like chia v1.3 will default to the v1 db.
