Harvester not participating in challenges

An interesting data point… after the chaos this morning *cough* I decided to transfer a bunch of plots while the network was down. I patched my node with the hotfix and got things running pretty quickly, but I still had a bunch of plots transferring (at full speed, no bandwidth limit), and I noticed that I did not experience the missed challenges I'd previously seen during plot transfers.

I noticed the number of full nodes I'm connected to is much lower than before: about 10 now vs. more than 50 before. This might be due to the low percentage of nodes running the hotfix?

I don't know if it's just related to the number of node connections, or whether there is a lower rate of transactions after the stuck chain this morning.

FWIW, I'm also seeing this in my Chiadog.

I'm also copying plots over the network. I'm on 10 Gbps, but it seems I need to upgrade to 40 Gbps.

Out of interest, how do you know whether your farm is responding to challenges or not?

I'm also seeing this and working on a resolution. I agree that I see the most unhealthy signage point logs when transferring plots over the same link that the farmer/harvester uses to reach the NAS.
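To the question above about how you know whether your farm is responding: apart from watching Chiadog, I just scan debug.log for the harvester lookup lines (the same lines Chiadog parses). A rough Python sketch of the idea, assuming the default ~/.chia/mainnet/log/debug.log location and INFO log level; the exact wording of the log line may differ between Chia versions:

```python
# Rough sketch: pull the harvester lookup lines out of debug.log to see whether
# the farm is answering challenges. Default log location and wording assumed.
import re
from pathlib import Path

LOG = Path.home() / ".chia" / "mainnet" / "log" / "debug.log"
pattern = re.compile(r"plots were eligible for farming.*Time: ([\d.]+) s")

times = []
for line in LOG.read_text(errors="ignore").splitlines():
    match = pattern.search(line)
    if match:
        times.append(float(match.group(1)))

if times:
    slow = sum(1 for t in times if t > 5)  # 5 s is the usual "too slow" threshold
    print(f"{len(times)} lookups, avg {sum(times) / len(times):.2f} s, "
          f"max {max(times):.2f} s, {slow} over 5 s")
else:
    print("No lookup lines found - the harvester may not be getting signage points at all.")
```

If the lookups are there but slow (over the 5-second limit), the farm is struggling; if they stop appearing entirely, the harvester isn't getting signage points at all.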

I have a Synology NAS and am planning on using an RPi 4 as a farmer/harvester.

The NAS has multiple Ethernet ports, so I plan on giving the farmer/harvester its own Ethernet connection, separate from the plotter LAN where plots are transferred.

I'm also following this as a possible related bug: [BUG] My Raspberry Pi 4 4GB currently misses / doesn't finish plenty of signage points in a row · Issue #1796 · Chia-Network/chia-blockchain · GitHub


I am also experiencing this issue (we had a small discussion here and I wasn't the only one). Not sure how long this has been going on, but probably not always, since I've managed to win some coins in the past. There also seem to be quite a few different bug reports on GitHub, Reddit, etc. from even a few months ago.

The Chiadog bug on Windows is a great tip. However, although I am running everything on Windows, the stopping is real. I've gone through all the options I could imagine, and I've also started to wonder whether this has something to do with new plots joining the farm after they finish.

Are you guys using only one machine, or multiple harvesters? I've just realized that the problem may only have started after I tested harvesting with multiple machines using the official guidelines.

Does anyone know: when you set up a harvester on another machine, copy the ca folder, and configure the IP, is this all a one-way street, i.e. are all the changes only on the harvester machine, or does this also make some changes automatically on the main system? I've noticed that this second harvester shows up in the GUI, but now I just want to “erase” it completely to be sure it doesn't cause any issues. I've already disconnected the harvesting PC from the internet and stopped the harvester from the command line. But does the main machine, for example, still try to look for this second harvester?
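For reference, what I did on the harvester machine (following the official multi-harvester guide) was copy the farmer's ca folder, run `chia init -c <path to ca>`, and point the harvester at the main machine in config.yaml. A rough sketch that just prints the relevant setting so you can double-check it; the config path is the default, 192.168.1.50 is a placeholder for the farmer's LAN IP, and in practice I edited the file by hand:

```python
# Show the harvester's farmer_peer setting and what it should point at.
# Default config location assumed; the IP below is a placeholder.
from pathlib import Path
import yaml  # pip install pyyaml

config_path = Path.home() / ".chia" / "mainnet" / "config" / "config.yaml"
config = yaml.safe_load(config_path.read_text())

print("current harvester.farmer_peer:", config["harvester"]["farmer_peer"])
print("should point at the farmer:   ", {"host": "192.168.1.50", "port": 8447})
```

As far as I can tell all of those edits live on the harvester side only, which is partly why I'm asking whether the main machine keeps any record of it.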

I'm only running on a single machine. Full node, harvester, wallet, etc. are all running on my NAS.


Yup, same here. But I did test multiple harvesters before, out of curiosity and in preparation for scaling my system.

I've tried literally everything and haven't figured out this issue yet. Right now, I have a hunch that running 3 plots simultaneously (3x4 GB) on 16 GB of RAM might be causing memory issues. Although there should still be 4 GB left for other usage, this might not be the case in reality. Right now I am plotting only 2 in parallel, and memory usage is often around 70% due to all the other stuff (Chrome, etc.) running on the side. I'm curious to see whether some plotting phases create a spike on the system that could cause the shutdown of the harvester. At least for now it feels as if the harvester is stopping less frequently…
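One simple way to watch for that is to log free memory every few seconds while plotting, e.g. with a minimal sketch like this (psutil is the only dependency and the 10-second interval is arbitrary):

```python
# Log overall memory usage periodically so a spike that coincides with the
# harvester dropping off will show up in the output.
import time
import psutil  # pip install psutil

LOG_INTERVAL_S = 10

while True:
    mem = psutil.virtual_memory()
    print(f"{time.strftime('%H:%M:%S')}  used={mem.percent:.0f}%  "
          f"available={mem.available / 2**30:.1f} GiB")
    time.sleep(LOG_INTERVAL_S)
```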

Otherwise, I am starting to be out of ideas with this.

So far so good! I scaled back plotting on my full node / plotting / farming PC, and it seems like either the memory or the CPU is enjoying the changed situation, since I haven't had any issues for the past 12 hours.

Too bad this might mean that if I wanted to keep plotting on my PC, I'd need to build a separate farming setup, yet that isn't very practical given the dozen external HDDs that are working nicely with this PC.

I’ve also got a machine which is experiencing this problem; I’ll try without any plotting activity and see if it keeps working for longer.


Can this be solved with dual NICs, one for Chia and one for file transfers (e.g. new plots coming in)?


Good suggestion! Unfortunately I tried it today and it made no difference. Transferring the plots on a secondary network interface still resulted in my node de-syncing during the transfer.


Isn't it well known by now that Chia and NASes do not play particularly well together? And it's totally I/O based.

My poor TerraMaster NASes had gigabit Ethernet support (so a reasonably beefy CPU) and were set up as RAID 0, and their proof check times (by that I mean the actual disk I/O to check if a plot meets the proof) were consistently 4x those of the simple JBOD drive connections. :frowning:

It does seem like there’s fundamental overhead in NAS architectures, and Chia proof checks are sensitive to this… :confused:


Yeah, it seems like there are strange issues with NASes for sure.

I saw your thread about the issues you were having, and I think I'm running into something different. My NAS is completely stable while I'm not transferring plot files around. My average proof seek time is 0.7 seconds over the last 24 hours.

I think my Chia Docker container is getting resource starved somehow while transferring files. It seems like file transfers are heavily prioritized by Synology to the detriment of other processes running on the NAS. I was shocked that throttling the file transfer speed down to almost nothing yielded the same results.

I think the next step would be to contact Synology support to see if they have any input. Perhaps there is a way to configure the NAS so that it doesn't de-prioritize Docker.

I’ll be honest, I’m not very motivated to investigate further since I have filled the storage attached to this NAS and things are working smoothly now that I don’t have to move these plot files around.

If anyone else gets to the bottom of this I’d be interested to hear the solution, in case I ever decide to re-plot for pooling.


Ah, so just to confirm, you only have issues when files are being transferred to/from the NAS? For me, the 4x proof lookup speed penalty of “plot file on NAS” was always present, compared to “plot file on JBOD attached via USB”.

It was a workable delay, nowhere near the 5-second warning limit, but it was noticeable because it was easily 4x more than the JBOD time. It wasn't a small difference! Here are the last harvester times from 9 days ago:

| lookup time (s) | intel / nas | htpc | amd datacenter | JBOD |
|---|---|---|---|---|
| avg | 3.59 | 1.59 | 0.48 | 0.61 |
| median | 3.51 | 0.12 | 0.13 | 0.47 |

See what I mean? That's the last time I looked… I stopped looking once I removed the NASes from my environment.


Correct, I only have issues when I’m transferring files. When the NAS is “idle” my lookup times seem like they’re in line with your JBOD numbers.

I only recently stopped plotting to this NAS, so I’ll collect some more data on average lookup times and report back.

Overall, my actual wins have roughly matched my expected time to win (with a dry spell here or there), so things are actually working.


Thanks so much for this.

Are you using SSD cache on your Synology? Is it recommended or required?

What kind of file structure do you use for the volume you created? Is it RAIDed? Or should I just read the documentation and stop asking easily answered questions? My new hard drives just arrived and I'd love to get cracking (when I can find the time!)

I’m not using SSD caching and I don’t think it would be particularly useful for the Chia use case.

For the file structure, I have each drive as a separate volume with no RAID. The user interface will encourage you to use RAID and warn about potential data loss, but don't use RAID; it's a waste of farming space for Chia.

Just FYI, I found that transferring plots between my Windows plotter and Windows farmer was causing a similar issue, with challenges coming in delayed or sporadically. ChiaDog alerted me to this. I even set a throttle on the robocopy of the plots and the issue still remained.
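(By "throttle" I just mean capping the average copy rate, conceptually something like the sketch below, though robocopy's own /IPG throttling works differently under the hood; the rate and chunk size here are placeholders.)

```python
# Toy illustration of rate-capping a plot copy: write in chunks and sleep so the
# average throughput stays under max_mbps. Purely illustrative.
import time

def throttled_copy(src: str, dst: str, max_mbps: float = 200.0, chunk_mb: int = 16) -> None:
    min_seconds_per_chunk = (chunk_mb * 8) / max_mbps  # time a chunk should take at the cap
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            start = time.monotonic()
            data = fin.read(chunk_mb * 1024 * 1024)
            if not data:
                break
            fout.write(data)
            elapsed = time.monotonic() - start
            if elapsed < min_seconds_per_chunk:
                time.sleep(min_seconds_per_chunk - elapsed)
```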

I'm wondering if you need a separate network for moving plots around and another for farming, like @WolfGT is doing. So I added a wireless adapter to the farmer to test that out later.

RAID is useful when you cannot afford to lose data. Plot files are not like that, since you can recreate them. Why waste a disk on RAID to protect something when you could use that extra disk for holding plots? RAID can also only manage degradation up to a point; go one disk beyond what it tolerates (that might be 1, 2 or 3 disks, and it's unlikely to happen, but it is still a risk, and more protection means more committed disks) and you lose everything. Without RAID, if you lose half your plot disks you still have the other half remaining; it's inherently tolerant of losses. The only good reasons for RAID are to increase bandwidth so you can copy from the plotter to the destination faster, or cases where you cannot replace the plots.
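To put rough numbers on that trade-off, here's a toy comparison assuming independent disk failures with probability p over some period (purely illustrative, not a real reliability analysis; RAID 5 and 8 disks are just example choices):

```python
# Toy model: a RAID 5 array loses everything once 2+ disks fail, while a no-RAID
# (JBOD-style) setup only loses the plots on whichever disks actually failed.

def p_total_loss_raid5(n_disks: int, p: float) -> float:
    """Probability that 2 or more of n_disks fail (whole array lost)."""
    p_zero_failed = (1 - p) ** n_disks
    p_one_failed = n_disks * p * (1 - p) ** (n_disks - 1)
    return 1 - p_zero_failed - p_one_failed

def expected_fraction_lost_jbod(p: float) -> float:
    """Expected fraction of plots lost equals the per-disk failure probability."""
    return p

for p in (0.02, 0.05):
    print(f"p={p:.2f}: RAID5 over 8 disks, total-loss probability = "
          f"{p_total_loss_raid5(8, p):.4f}; "
          f"JBOD, expected fraction of plots lost = {expected_fraction_lost_jbod(p):.2f}")
```

And the JBOD losses are recoverable by re-plotting, which is the point above, while RAID 5 also gives up a whole disk of farming space.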