Syncing is slow and eventually grinds to a crawl

I have looked through here, and I found one person with the same problem but no solution.

I am syncing on a 3950x, to a HDD.

It's been 3 or 4 days now and I am still at height 860,000… I have a 1 Gb internet connection, so it is not that. I am unsure if the HDD is slowing syncing down, but my problems go beyond that.

Syncing starts fine and I connect to healthy nodes, but after a couple of hours of syncing something happens, and ALL new nodes show their height as 0 with barely any activity. Old nodes show up fine, but they eventually drop and get replaced with nodes at height 0. It seems to be a bug, or maybe something my ISP is doing?

Here is what it looks like:

FULL_NODE 182.136.144.250                        51972/8444  6f7af282... Nov 12 07:27:53      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 116.24.95.245                          47260/8444  4999d9b3... Nov 12 07:30:30      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 222.67.245.156                         55095/8444  097f3810... Nov 12 07:55:37      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 119.78.254.3                           31726/8444  fe28b36f... Nov 12 07:55:37      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 101.206.243.175                        48716/8444  e770c638... Nov 12 07:45:14      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 5.146.192.66                           31210/8444  1ab9780a... Nov 12 07:55:37      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 31.14.157.25                           59712/8444  cb44498f... Nov 12 07:55:23      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 139.210.135.64                         50843/8444  7f1a30af... Nov 12 07:55:38      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 116.208.173.94                         64639/8444  6511e597... Nov 12 07:55:03      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 116.22.208.154                         65194/8444  b254bb9a... Nov 12 07:55:37      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 183.228.216.170                        14029/8444  6c6b66ef... Nov 12 07:40:06      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 111.42.111.210                         36262/8444  5a4009b4... Nov 12 07:55:38      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 115.60.18.11                           53948/8444  6162b87c... Nov 12 07:54:56      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 115.60.138.118                         34631/8444  5f1f0e12... Nov 12 07:46:14      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 27.192.73.157                          57508/8444  08a5039e... Nov 12 07:55:37      0.0|0.1
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 36.148.109.128                         36425/8444  21a2c2f1... Nov 12 07:55:37      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 78.9.102.205                           38426/8444  6deb9695... Nov 12 07:55:37      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 49.86.82.160                            2341/8444  4d85defd... Nov 12 07:50:56      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...
FULL_NODE 125.127.116.248                         6405/8444  d2e2c8dd... Nov 12 07:46:03      0.0|0.0
                                                 -SB Height:        0    -Hash:  Info...

This is what the logs look like at INFO level when this happens:

2021-11-12T07:56:37.354 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '42.53.74.48', 'port': 8444}
2021-11-12T07:56:37.345 harvester chia.plotting.manager   : INFO     Found plot /mnt/Dummy3/plot-k32-2021-07-17-03-12-0bfd853704f541badf9a5f1bd50aa5e41c3b7ace14fa0d0cd03336137660d15b.plot of size 32
2021-11-12T07:56:37.355 full_node full_node_server        : INFO     Connection closed: 182.136.144.250, node id: 6f7af282bf0a1f448670553420c78314b232a265b05b600a98feee77d8dad9c6
2021-11-12T07:56:37.360 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '182.136.144.250', 'port': 8444}
2021-11-12T07:56:37.360 full_node full_node_server        : INFO     Connection closed: 116.24.95.245, node id: 4999d9b3b687d2d2b3f72175ab14671653145473fdaab45c1c67cef28c09319e
2021-11-12T07:56:37.361 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '116.24.95.245', 'port': 8444}
2021-11-12T07:56:37.361 full_node full_node_server        : INFO     Connection closed: 222.67.245.156, node id: 097f3810f0226c451048f388e87f9cf637225c5f600e4256e54cd2d7564e465c
2021-11-12T07:56:37.361 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '222.67.245.156', 'port': 8444}
2021-11-12T07:56:37.362 full_node full_node_server        : INFO     Connection closed: 119.78.254.3, node id: fe28b36f1168db02e8856db8bbb6a566d93a99630477984f47e496f872529099
2021-11-12T07:56:37.362 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '119.78.254.3', 'port': 8444}
2021-11-12T07:56:37.362 full_node full_node_server        : INFO     Connection closed: 101.206.243.175, node id: e770c638625a569210baafc08509961aae477562f2f35d79002ce992df873555
2021-11-12T07:56:37.362 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '101.206.243.175', 'port': 8444}
2021-11-12T07:56:37.345 harvester chia.plotting.manager   : INFO     Found plot /mnt/Dummy3/plot-k32-2021-07-17-13-01-0eb764ef1dde773ee1fac0bfe1777b886bcb2f192b159b2433dbf17a7a1eaefb.plot of size 32
2021-11-12T07:56:37.367 full_node full_node_server        : INFO     Connection closed: 5.146.192.66, node id: 1ab9780a340dce0179ecc1cb5200623488205cd91e600827458697e2ed760b73
2021-11-12T07:56:37.371 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '5.146.192.66', 'port': 8444}
2021-11-12T07:56:37.371 full_node full_node_server        : INFO     Connection closed: 31.14.157.25, node id: cb44498f5532d53086171d230623759905bf631d4192a3f9a1694e6c0e2f77e7
2021-11-12T07:56:37.372 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '31.14.157.25', 'port': 8444}
2021-11-12T07:56:37.372 full_node full_node_server        : INFO     Connection closed: 139.210.135.64, node id: 7f1a30affaedc764566233280bb5722be46a4e02c3aec3545eab108650bf378b
2021-11-12T07:56:37.372 full_node chia.full_node.full_node: INFO     peer disconnected {'host': '139.210.135.64', 'port': 8444}
2021-11-12T07:56:37.372 full_node full_node_server        : INFO     Connection closed: 116.208.173.94, node id: 6511e59709c9e23d888b2105651907ba16e1055c23280d20b8bc9c37751664a5

You are not alone. I am setting up a multi-system farming network here, with about 5 harvesters and a farmer/wallet system, and I decided to put the farmer/wallet system in a VM on Proxmox (LXC Ubuntu headless server) so I could easily back it up. Much the same experience as you are reporting: it started off fine, but at about 800,000 height it slowed to a crawl. It has been 10 days now, and I'm still not done. The speed is decreasing exponentially.

What appears to be a potential bottleneck here is the SQLite database. My day job is as a DBA on a very large SQL database, and this looks eerily like a non-indexed search whose performance collapses once a table fills with data, and at these row counts that would make some sense. That said, I haven't spent any time looking at the internal code or DB design, but I'm thinking that maybe that is not such a bad idea, at least to check the DB setup. If that is the case, I'm wondering if there might be a better-performing database for this number of rows than SQLite? I've also had corruption occur (particularly on upgrades), which forced me to resync from scratch, and is why I've gone for a VM solution that I can back up or put in a more HA environment.
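
If anyone wants to check the DB setup themselves, a starting point is listing what indexes SQLite actually has, straight out of `sqlite_master`. A rough sketch; the filename below is from my v1.x install, so adjust the path to yours:

```python
import sqlite3

# Rough sketch: list the indexes SQLite knows about in the blockchain db.
# The filename is an example from a v1.x install (it lives under
# ~/.chia/mainnet/db/) -- adjust it to your own setup.
DB_PATH = "blockchain_v1_mainnet.sqlite"

def list_indexes(db_path: str):
    """Return (table, index_name) pairs from sqlite_master."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT tbl_name, name FROM sqlite_master WHERE type = 'index'"
        ).fetchall()
    finally:
        con.close()

for table, index in list_indexes(DB_PATH):
    print(f"{table}: {index}")
```

If a column that the hot queries filter on doesn't show up under any index here, that would support the non-indexed-search theory.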

Anyway if you discover something as to where the issue is, please share it. You are not alone.

K

If you have a spare SSD, or space on your boot SSD, I would try using that for the .chia directory, at least as a test to see if it fixes the problem. Syncing is both CPU intensive (a regular bump in usage every couple of seconds) and drive intensive as it builds the db. HDDs just aren't the best, or even good at all, for that kind of db activity.

Which versions are you running?

Take a look at this thread, especially the IMPORTANT bullet. It talks about how v1.2.8/9 screwed up db access. v1.2.11 potentially fixed something, but not everything. On the other hand, it may be that v1.2.7 and earlier didn't have that problem.

My understanding is that SQLite serializes all requests, and if the db is either not indexed or poorly indexed, it may be that while one request is struggling to get its data, all the other requests in the queue are timing out. However, we don't see any errors in the logs to substantiate that.

Maybe you could try increasing the log level to DEBUG, and some related errors will show up?

OK, I have tracked both issues down.

Issue #1, where new nodes show up with height 0: that seems to be caused by too many connections to chia. I am going to assume it is my ISP doing it. I tried dropping the peer limit to 60 (the default is 80) and it did not help; the same thing happened when the list filled up. Then I dropped it to 40, and the issue is gone. Like I said, I think it is my ISP: once too many connections are open on the port, it blocks data or something. I am not sure. Anyway, drop your peer limit if this is happening to you.
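
For anyone who wants to script the change, here is a rough sketch. On my install the key is `target_peer_count` under the `full_node` section of `~/.chia/mainnet/config/config.yaml`; verify the key name on yours. It uses a plain text substitution to avoid depending on a YAML library:

```python
import re

# Sketch: rewrite the full node's peer limit in config.yaml text.
# The key name target_peer_count matches my install -- check yours.
def set_peer_limit(config_text: str, limit: int) -> str:
    """Return config text with the target_peer_count value replaced."""
    return re.sub(r"(target_peer_count:\s*)\d+",
                  rf"\g<1>{limit}", config_text)

sample = "full_node:\n  target_peer_count: 80\n"
print(set_peer_limit(sample, 40))
```

Restart chia after editing the file so the new limit takes effect.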

Issue #2, the slow sync: I am not sure what is actually causing it. After fixing issue #1 it seemed to speed up, but the sync is still really slow. I do believe it is partly due to HDD speed. I downloaded a DB from my friend, and on the HDD it took forever to load when starting chia. I moved it to my NVMe and it loaded much faster. I believe having it on NVMe would speed up syncing as well.

Hope this helps.

Putting the blockchain db on SSD/NVMe is basically in the same category as limiting the number of peers. On one hand, the db is used for internal work; on the other, all those peers trigger a big number of db accesses, so read/write requests potentially time out (whether locally, or due to not responding to peers).

So, that is a good catch.

As far as the peer count goes, even 3 peers let the network grow (just really slowly), so maybe for the time needed to catch up you can limit that number to 10 or so. From what I saw in my Connections tab, the local node may be getting big chunks of the db from just one peer at a time, eventually switching to the next one. I don't think that process is parallelized.

The more I look at this, the more it looks like the issue is with SQLite as the DB. There are many known issues with putting this number of rows in a table in SQLite; it was never designed for that level of content. A better choice (wisdom in hindsight) would have been a threaded database server like FirebirdSQL, PostgreSQL, or even MySQL, but that's probably not going to happen anytime soon.

Unfortunately, I'm struggling to find ways to optimize SQLite other than from within the database calls in the code (connection config parameters). This article pretty much explains the issues with using SQLite at the scale we are seeing:

https://phiresky.github.io/blog/2020/sqlite-performance-tuning/

I quote this paragraph: “SQLite is often seen as a toy database only suitable for databases with a few hundred entries and without any performance requirements, but you can scale a SQLite database to multiple GByte in size and many concurrent readers while maintaining high performance by applying the below optimizations.”

Now, I'm not a SQLite expert, but considering its intended use case that makes some sense. That said, nothing is impossible. The frustration is not being able to tweak the configuration settings on an install-by-install basis. The suggestion of moving the database to an SSD might work, but won't if you are running your farmer on a hypervisor or AWS or whatever; in those cases you can't augment or change the hardware.

Any SQLite experts here that know how to configure/optimize it when you can’t change the code that connects to the DB?
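
To make the question concrete, the kind of tuning that article describes looks roughly like the sketch below. Note that `journal_mode=WAL` is stored in the db file itself, so applying it once out-of-band persists, while `synchronous` and `cache_size` are per-connection and only help if set by the code that opens the db, which is exactly the limitation I'm asking about. Don't run this against the db while the node has it open:

```python
import sqlite3

# Sketch of phiresky-style tuning. journal_mode=WAL persists in the db
# file across connections; the other PRAGMAs apply only to this
# connection. Returns the resulting journal mode as a sanity check.
def tune(db_path: str) -> str:
    con = sqlite3.connect(db_path)
    try:
        con.execute("PRAGMA journal_mode = WAL")    # persistent
        con.execute("PRAGMA synchronous = NORMAL")  # per-connection only
        con.execute("PRAGMA cache_size = -64000")   # per-connection, ~64 MB
        return con.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        con.close()
```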

SQLite is not a toy, so your post is just negative speculation.

SQLite is used by many large, successful projects, like Chromium.

I was just working with Larry, trying to get his box up to speed (looks like the same problem).

We moved his db to NVMe and dropped the number of peers to 20 (I would not mind going down to 10, at least for the duration of syncing). When we tried to start from his "old" blockchain db, nothing worked (no connections, everything spinning). When we removed the blockchain db and started from scratch, syncing picked up at about 1% of the blockchain per 30 minutes or so. About 50-70% of his peers are now supplying him with blockchain data.

Although, I am not sure whether that ~700,000-800,000 height is some magical value where shit just starts falling apart.

He is running v1.2.9 right now. The goal is to fully sync, make a db backup, and eventually upgrade, as v1.2.8/9 are described by Chia as basically garbage.

Yep, that 700,000-800,000 height range is weird. I did a full blockchain sync on another box (an all-in-one on Ubuntu with the GUI), and it did complete, but again, at about the 850,000 height level it slowed down to a crawl.

I'm sure SQLite is great for small DB sizes and tasks, but given the large size of the Chia blockchain, it is exhibiting the performance you would get from non-indexed searches. If you know databases, a non-indexed search may have nothing to do with the make or model of the database; it can simply be a non-indexed column in a query that expected it to be indexed. This is an oversight many developers make; it's not a problem, just add the index. But it could also be that inserts slow to a crawl once the database reaches a certain size. I'm sure Chromium (the web browser) doesn't store millions of rows in its tables the way a blockchain does.
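
For anyone who wants to see this effect directly, `EXPLAIN QUERY PLAN` shows whether SQLite will scan the whole table or use an index. A toy sketch; the table and column names here are invented for illustration, not Chia's actual schema:

```python
import sqlite3

# Toy demonstration: the same query goes from a full-table SCAN to an
# index SEARCH once the filtered column gets an index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blocks(height INTEGER, header_hash TEXT)")

def plan(sql: str) -> str:
    """Join the detail column of EXPLAIN QUERY PLAN output."""
    return " | ".join(row[3] for row in
                      con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM blocks WHERE height = 500000"
before = plan(query)               # reports a SCAN of the table
con.execute("CREATE INDEX idx_height ON blocks(height)")
after = plan(query)                # reports a SEARCH using idx_height
print("before:", before)
print("after: ", after)
```

If the real db shows a SCAN on a hot query, adding the index would be the fix, exactly as described above.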

Which Chia version are you running? Yesterday, Sargonas (Chia) stated that v1.2.8/9 may have serious problems with syncing when the new version comes online this coming Tuesday. Maybe the problem is somehow related to that issue? I am still on v1.2.6, and it works great for me.

I'm using 1.2.11.dev0. It could be that, being a later version, the issue was not yet detected and resolved. Maybe this Tuesday the answer will become apparent.

Roughly, how long did it take you to fully sync on that box?

About 3-4 days, I think. That was about a month ago, so I'm sure it was an earlier version, probably similar to yours. My current sync has been running for 8 days now and is at about 1,030,000 height (it says it has to go to about 1,160,000 or so).

I am at full height (1,137,694) and don't have any problems with syncing. It may be that in v1.2.8 they introduced bad code that started to show problems during the dust storm. As v1.2.11 was rushed, I doubt they addressed all the issues. Also, I moved my db from SSD to NVMe last week.

If you are still struggling to get synced, really try dropping your peer count (in config.yaml, then restart Chia). That box I was helping to fix had about 80 connections, but basically zero Up/Down exchange on all of them, plus plenty of peers at height 0. Once we switched to 20 peers, about 15 of them started sending blockchain chunks.

As that article stated, the WAL option lets you use a write-ahead log, but SQLite is still synchronous, and when it gets overwhelmed by peers, my take is that it just gives up: hence the zero reads/writes, as everything just times out.

This is an easy assumption, but I think it is a wrong one. It looks like, with the current state of the blockchain db and the code that uses it, we are close to the max of what it can do. When you have those 80 peers (the default value), all those nodes are competing for db access, so the db starts to crawl. Just drop the number of peers (config.yaml) to something like 20, and you should start seeing more nodes serving you blockchain data.

When chia is syncing, it does not parallelize those downloads; it looks like it goes after one node at a time. In theory, you should be able to sync from just one peer. If you peer with 10 nodes, you just give each of those nodes a break while chia does round-robin downloads from them. (At least, that is how I understand it works.)

To be more specific: if you are running 1.2.8 or 1.2.9, after CATs are released you will keep functioning normally, and your ability to stay synced (if already synced) will not change in any way. The specific issue occurs if you fall out of sync by more than 20 blocks; you will then be unable to fully sync back up until you update.

It is definitely an issue we want people to avoid before it becomes a problem for them, but this does not equate to the rollout of CATs on Tuesday effectively bricking every 1.2.8/1.2.9 node instantly on the spot… that's a non-trivial misinterpretation.