What's with network space?

It has dropped drastically. What happened? Is the journey over?


I posted about it earlier too. I don’t know. Someone said it is an estimate, so it will fluctuate. I agree up to a point, but it has never fluctuated like this.

Fluctuate here, fluctuate there; looks like it’s time to dig up Fatboy Slim’s tracklist.

1 Like

This is, IMHO, lower-end-spec farmers getting pushed out of sync (which stops them farming) as part of the ripple from the recent attack. Dust storm > creates > connection zombies. Even if you are on decent hardware, you can get hit if you restart with deadweight zombies.

2 Likes

I would rather avoid the term “lower end spec.” Chia is using that term to try to get off the hook for this problem. Also, when you use one of the slowest languages out there for time-critical tasks, what else can you expect? It is slow by default. A good example is MadMax/BitBlade/the Flex client: all of those are written in straight C/C++ and clearly outperform Chia’s code. Of course, the engineers behind those projects are not entry level, unlike what we see from Chia (end of my rant).

As for the problem, it is not just the boxes that are behind on syncing, but also the boxes that provide data to those that are behind and thereby saturate their own upload bandwidth, which leads to a massive increase in stale partials - i.e., dropping net space, as the space behind stale partials is not counted. Hopefully, for now, that is over.

1 Like

I use the term because they actually are low-powered / low-spec devices. I do think Chia should test and validate against machine types much more rigorously than they have in the past, for sure, given the nature of what has happened here. I don’t think they are trying to get off the hook; they seem like they are learning. It is hard to anticipate things, but this one deserves a post-mortem for sure when/if it subsides. Also, isn’t chiapos written in C and wrapped in Python?

Is the dust storm over?

1 Like

Not all of the critical code is written in Python - for example, the chiapos library.

Also, you can see from this job opening page that they are already interested in refactoring into Rust and/or C++.

“Extend functionality and refactor codebases in Python, Rust, or C++.”

Chia is doing far better than the average start-up.

2 Likes

If stale pooling partials were to trigger automatic down-throttling of peer connections, maybe that would help mitigate the problem.
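Something like the following is what I have in mind - a rough sketch only, not Chia’s actual code; the names and thresholds are made up for illustration. It watches the stale ratio over a window and halves the peer target whenever it spikes:

```python
# Hypothetical sketch of stale-partial-driven peer down-throttling.
MAX_PEERS = 80          # normal target peer count (illustrative)
MIN_PEERS = 10          # floor so the node never fully isolates itself
STALE_THRESHOLD = 0.05  # down-throttle once >5% of partials go stale

class PeerThrottle:
    def __init__(self) -> None:
        self.target_peers = MAX_PEERS
        self.partials_sent = 0
        self.partials_stale = 0

    def record_partial(self, stale: bool) -> None:
        self.partials_sent += 1
        if stale:
            self.partials_stale += 1

    def adjust(self) -> int:
        """Halve the peer target while stales are high, recover slowly otherwise."""
        if self.partials_sent == 0:
            return self.target_peers
        stale_ratio = self.partials_stale / self.partials_sent
        if stale_ratio > STALE_THRESHOLD:
            self.target_peers = max(MIN_PEERS, self.target_peers // 2)
        else:
            self.target_peers = min(MAX_PEERS, self.target_peers + 5)
        # start a fresh measurement window
        self.partials_sent = self.partials_stale = 0
        return self.target_peers
```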

Another issue, with perhaps high impact: many use their plotting machine as their farmer and harvester. Instead of just telling people not to do this, it would help to implement suspension of plotting during a stale-partials event.

Edit: by “suspension”, I mean intermittent suspension/pause, in order to prioritize partials and proofs.
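For illustration only, a minimal sketch of that intermittent suspension, assuming a POSIX system; plotter_pids() and partials_are_stale() are hypothetical placeholders for whatever the farmer already knows:

```python
# Pause plotter processes with SIGSTOP while partials are stale,
# resume with SIGCONT once they clear. POSIX only.
import os
import signal
import time

def plotter_pids() -> list:
    """Placeholder: return PIDs of running plotter processes."""
    return []

def partials_are_stale() -> bool:
    """Placeholder: return True while the farmer is reporting stale partials."""
    return False

def supervise(poll_seconds: int = 30) -> None:
    paused = False
    while True:
        if partials_are_stale() and not paused:
            for pid in plotter_pids():
                os.kill(pid, signal.SIGSTOP)   # pause plotting
            paused = True
        elif not partials_are_stale() and paused:
            for pid in plotter_pids():
                os.kill(pid, signal.SIGCONT)   # resume plotting
            paused = False
        time.sleep(poll_seconds)
```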

HPool OG dropped by around 2 EiB over the last couple of days and has not recovered back to the usual 11 EiB it was at for the last month. Maybe some of the whales in China are dropping out.

1 Like

A lot of farmers went into Chia looking for short-term winnings - so did I - but it turned into a project to be in for the long term. Bigger farmers may be migrating to projects with better ROI. Additionally, the spam attack could decrease the network space.

Fair point. I cannot imagine the plotter running in Python. However, with the current v1.2.10, the main problem is around handling the blockchain db. It looks to me like that code is really unoptimized Python, and potentially a poorly indexed db. We can only assume the blockchain db will grow, so those problems will become more pronounced (currently that db is around 30 GB; it is possible that people facing issues still have the .chia folder on HDDs, although that is not really a big db and, if properly indexed, should still be manageable).
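As a quick sanity check, anyone curious can list which indexes the blockchain db actually carries with a few lines of Python. This assumes the default v1.x db location; the path may differ on your install:

```python
# List the indexes defined on the blockchain db (read-only).
import sqlite3
from pathlib import Path

db_path = Path.home() / ".chia" / "mainnet" / "db" / "blockchain_v1_mainnet.sqlite"

with sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) as conn:
    rows = conn.execute(
        "SELECT tbl_name, name, sql FROM sqlite_master WHERE type = 'index'"
    ).fetchall()
    for table, index, ddl in rows:
        print(f"{table}: {index}\n  {ddl}")
```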

When you have those stales, it is already a post-mortem scenario. So your suggestion is about how to balance things when already in an error condition. Yes, it should be there, to recover from potentially the worst scenarios.

There are two things wrong with how it is implemented right now. The first is that Chia is not properly managing your upstream bandwidth. There is no back-off mechanism. So when those “starving” boxes ask for data, Chia complies and saturates that bandwidth (bear in mind that the down/up-stream ratio is maybe 10:1, i.e., easy to request, difficult to satisfy). This should be managed first (ignored by Sargonas, implying nothing is being done to address it). Of course, one could think about one more param in config.yaml to limit the bandwidth, but that would kill the network, as people who don’t understand the reasoning behind it would basically throttle it down to zero.
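To make the idea concrete, here is a rough sketch - not Chia’s code, the names and numbers are illustrative - of the kind of upload budget with back-off I mean: a token bucket that caps how many bytes of sync data get served per second and shrinks while your own partials go stale:

```python
# Hypothetical token-bucket budget for serving sync data to peers.
import time

class UploadBudget:
    def __init__(self, bytes_per_second: int):
        self.rate = bytes_per_second
        self.tokens = float(bytes_per_second)
        self.last = time.monotonic()

    def try_send(self, nbytes: int) -> bool:
        """Return True if we may serve this block response now, False to defer it."""
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

    def back_off(self, factor: float = 0.5, floor: int = 64 * 1024) -> None:
        """Called when partials start going stale: shrink the serving budget."""
        self.rate = max(floor, int(self.rate * factor))
```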

The second thing is that when you handle any network connection, you need some rules for how to prioritize traffic. Chia doesn’t do anything to prioritize your partials over those starving boxes requesting data (confirmed by Sargonas).
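Again just a sketch with hypothetical message names: a single outbound priority queue where partials always drain before bulk sync responses would already be an improvement:

```python
# Hypothetical outbound queue that always sends partials before sync data.
import queue

PRIORITY_PARTIAL = 0   # proofs / partials: time-critical, must go first
PRIORITY_SYNC = 10     # bulk block responses for syncing peers

outbound = queue.PriorityQueue()
_seq = 0               # tie-breaker so equal priorities stay FIFO

def enqueue(priority: int, payload: bytes) -> None:
    global _seq
    outbound.put((priority, _seq, payload))
    _seq += 1

def next_message() -> bytes:
    # Partials (priority 0) always dequeue before sync data (priority 10).
    _priority, _n, payload = outbound.get()
    return payload
```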

Lastly, as stated already, when you have a deficient node that just cannot keep up (potentially a db on a very slow HDD), that node should request data only at the speed at which it is updating that db. That would force a network back-off situation for that node (fewer requests for updates). However, what we see is that those starving nodes are asking for far more data than they can handle. The only way to explain that is that the process requesting data from up-to-speed peers is overwhelming the process updating the db, so it drops the data it cannot use at the moment on the floor and requests it again, and again (no synchronization between those two processes, just a round-robin queue). I really have no other explanation for how a node that cannot keep up with updating its db is not in down-throttling mode.
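The fix is basic back-pressure between the two processes. A minimal sketch, with hypothetical request_blocks / write_block callables, using a bounded queue so a slow disk naturally paces the network requests:

```python
# Back-pressure sketch: the fetcher can only request more blocks once the
# db writer has drained what was already fetched.
import queue
import threading

# Small buffer: a full queue means "stop fetching until the disk catches up".
pending = queue.Queue(maxsize=32)

def fetch_loop(request_blocks, stop: threading.Event) -> None:
    """Network side: ask peers for blocks (hypothetical request_blocks callable)."""
    while not stop.is_set():
        for block in request_blocks():
            pending.put(block)        # blocks here whenever the writer is behind

def write_loop(write_block, stop: threading.Event) -> None:
    """Disk side: commit blocks to the db (hypothetical write_block callable)."""
    while not stop.is_set():
        block = pending.get()
        write_block(block)            # slow sqlite write paces the whole pipeline
        pending.task_done()
```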

Also, a lot of people with issues around slow starts / wallet access may have a corrupted db. I was just trying to use the sqlite backup utility to back up my blockchain db, but it got stuck at some point and didn’t move for more than an hour. Potentially, that means it encountered some mangled records with circular references. Not really sure, as there was no feedback from that util; bytes just stopped coming while the db was in full swing. This problem may not be obvious to Chia processes, as they potentially work mostly with the latest records and don’t really need to parse old records, but that may not be the case during upgrades. It would be nice to have a utility one could run to check the integrity of that db. A potential source of those corruptions is people trying to close Chia when it gets stuck, so the db processes are not closed cleanly.
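For what it’s worth, SQLite itself ships an integrity check of the kind I am asking for; this runs it read-only against the blockchain db (default v1.x path, adjust as needed):

```python
# Run SQLite's built-in integrity check against the blockchain db (read-only).
import sqlite3
from pathlib import Path

db_path = Path.home() / ".chia" / "mainnet" / "db" / "blockchain_v1_mainnet.sqlite"

with sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) as conn:
    # Prints a single "ok" when the file is sound, or a list of problems otherwise.
    for (line,) in conn.execute("PRAGMA integrity_check"):
        print(line)
```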

1 Like

@Jacek That message was long enough to overwhelm a node or a QA Engineer. :wink:

:drum:

The hope is for incremental improvements after stop-gap measures.

1 Like

Sounds like your QA engineers are slow like lisp.

1 Like

Interpreted languages can be sped up. It’s not useful to spread negativity about a particular language choice. If you feel that strongly, you can just develop an alternative.

1 Like

What is the point of telling other people to do something that you are not capable of doing yourself? Although, if you are looking for examples of such things, there are MadMax, BitBlade, and the Flex client. I think each of those programs was made by a single person in their spare time, and all outperform Chia’s code.

Also, having a kumbaya attitude is not really helping, when one person in a basement was basically close to bringing the whole network down… With his code and a few AWS instances, anyone could hold the Chia network hostage right now, for as long as they wish, as people would not be able to upgrade / sync once the fixes are out.

Poor pool implementations were the primary issue. Even if the client can mitigate it, the pool should be able to handle the client’s behavior.

And the source or reasoning behind that is …?

I guess you wanted to say that pools not having fees was the issue, but that is actually the lesser, and easily addressable, problem (most likely already addressed). The main issue was the loss of maybe 60-80% of nodes, and that was with just a light dust storm.

You want to change the subject, which was your hate bombs about lisp.

I change the subject now to: @Aspy68 Appreciation Day! Feb 29th… no, every 2nd Tuesday.

1 Like

I guess I am repeating myself, but you still don’t get it. The subject of this thread is the network space. Delayed payments / transaction fees are an orthogonal problem, and that is what you are rambling about. Any fixes for that do not address the network space loss.

On the other hand, network space is closely correlated with the number of nodes that make up that network. That 10-20% loss of network space is just one side of the coin; the other side is the nodes behind it. Most of the affected nodes represent small farms (low-end boxes), so that 10-20% of lost network space corresponds to a loss of roughly 60-80% of the nodes on the network.
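A back-of-the-envelope illustration (the numbers are made up, not measured) of how both percentages can be true at once:

```python
# Purely illustrative numbers: many small farms, a few large ones.
total_nodes = 200_000
small_nodes = 160_000          # 80% of nodes...
small_node_tib = 30            # ...each farming a hypothetical 30 TiB
large_nodes = total_nodes - small_nodes
large_node_tib = 700           # whales / large farms

small_space = small_nodes * small_node_tib
large_space = large_nodes * large_node_tib
total_space = small_space + large_space

# Prints roughly: small farms hold 80% of nodes but only ~15% of space,
# so losing most of them shows up as only a 10-20% netspace drop.
print(f"small farms: {small_nodes / total_nodes:.0%} of nodes, "
      f"{small_space / total_space:.0%} of space")
```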

When someone resorts to ad hominem attacks, it just implies that he cannot continue at the ad rem level.

I muted you so I can’t view your replies.

I will now count my gains from buying the dip. LOL

1 Like