Periodic bad response times

It could be. But our farmer is a lot lighter on hardware and power :smiley: It needs less than 1% of the CPU and 100 MB of RAM, and a lot less data transfer as well. And even when things try to go to sleep, our farmer usually keeps them awake.

The 2.0 version, out soon, even reduces wear while speeding up lookups, so your drives last longer, use even less power, and produce less heat.

Neither one is the issue, is it? That mainly applies to RPis, as on a standard box, those things are just rounding errors.

That would imply that the problem is the Chia software, and it really doesn't look like it, as otherwise more people would be banging on about this issue.

======================

Kind of off topic, but since you started it.

You mean, you cache plot file results? Those HDs need to spin all the time, though.

Would it be possible to pre-parse those plots, save the results somewhere (not on those drives), and just run from those pre-parsed files? I mean, during those 5s lookups. Wouldn't that remove most of the need to hit those drives? Although, my take is that head wear is negligible (one small hit every so often), and spindle wear is normal (the drive just runs 24/7, no hard starts/stops).

Also, I see that your pool is sending maybe 10x more requests for partials, compared to pools based on the Chia code. So, maybe that reduced wear just addresses that part?

====

The only thing that I can think of is that FlexFarmer has better error handling, and is not hiding all kinds of errors in debug logs. I guess any farmer that does that would help both pools and farmers, as it looks like people are not realizing that their setups have issues (Chia's app is really good at hiding those, and the Chia team at ignoring such reports).

I am a happy member of your pool, and I wish your farmer's page would show in big red letters what errors were encountered during the last 24 hours, and how to remedy them (not just restating those problems in a more verbose form) - maybe with a link to an error FAQ.

One thing is making sure that the farmer is healthy (which I have no doubt yours is - I haven't tried it yet), and the second is to let users know that they have stuff to fix, eventually somehow forcing them to do it (even if only by constant nagging).

Recently, one YouTuber showed his farmer page for your pool, and my take is that even though he knows (much) more than an average farmer, he had about 20% bad partials without realizing it. He has been doing those videos for several months, so he has potentially been a Flex member for that long. Unfortunately, by you not helping him, both Flex and its members suffer, as those bad partials hurt the pool's winning chances.

Hard to explain as I don't fully understand it myself. I've been told IOPS is halved.

It often is.

Nope. Difficulty affects head movement, but movement already happens a lot due to the farming process, so it's not going to increase wear.

Indeed, much better. We are working on the farmer's error logs, but most problems have easy solutions, and we do have support in our Discord channel.

We show stales and invalids on the website and have active support on our website, Discord, and Telegram. But if people don't ask us, we can't help them, and I assume he's not running our software to farm. We don't pay for bad partials.

Example farmer page running chia gui

Example running flexfarmer

Note the low stales on the flexfarmer rig, the ability to name your workers, and that net space is reported.

Actually, many people are having the issue, so I've followed all the ideas that fixed it for most of them, seemingly to no avail.

Been great since I removed the other hub, once again hoping it's sorted.

13 hrs with no issue so far.

^This!! I have thought that this should be possible, I just have no clue how to go about it. If it were, everyone could have super-fast response times to challenges, pretty much regardless of setup. I would think this would greatly improve overall blockchain response and health, helping transactions clear.


I have no clue how pre-parsing works, but wouldn't that allow people to pre-parse, save that data, then delete the plot, undermining the space requirement?


Don't think so. Why? Because if they 'won' with the plot, they would then have to look up the plot particulars (and not be able to), so they would not actually complete the win.

I assumed you would pre-parse the particulars as well, otherwise what's the point of doing it.
As I said, I'm not really sure what was being discussed, or whether the further checks always look for particular things, or for random things from many possibilities.

Sorry, I don't know the terminology that well. To me, there are two searches: one that should finish in 5s, and the other in 30s. My take is that only those 5s ones could be pre-parsed, as those are not the winning searches, but rather ballpark checks of whether your farm has something worth digging into further.

Assuming someone is solo farming, speeding up those 5-second searches changes nothing as far as the protocol goes; it just makes the farm run faster. That's it. So, if one deletes those plots, there will be no wins, as no 30-second results can be produced.

However, for those that belong to various pools, that may be an issue. I am not sure whether pools pay for those 5s or 30s searches, but I think it is the 5s ones. If that is the case, and those pre-parsed results were somehow not bound to a farm, or not unique, people could run multiple farms on the same pre-parsed results, and/or share them between farmers, basically killing the pools.

Again, I don't understand the mechanics of those searches, so that was potentially just a dumb question.

If you can wrap your neurons around it, this explains all the details, and may well (well, most assuredly) hold the answers to all the questions here.

Personally :thinking: :thinking: :thinking:, I'll need some time.


Yes, that is what I really don't like, and I think it needs improvement. First, for an average person, there is really nothing wrong in those screenshots, as people may not understand what a stale partial is, or how many stales is too many. Therefore, I asked for BIG RED text that would draw attention to that problem.

The second is that the solution to those issues should be right there, next to those big red letters, to make it easier for people, so they don't need to check on Discord/Telegram/… just to understand that they have issues, and to eventually get some immediate help with the basics.

Lastly, I am not running your farmer, but my stale results are in the same range as your farmer's (about 0.05%). With that number of stales, they don't show up in yellow on that chart. I still don't know whether I can do anything to improve those numbers, or whether at that level this is just normal. On the other hand, those 20% stales were clearly visible in that YT guy's charts (if one understands what to look for). Your screenshot for the standard farmer looks exactly the same to me; yes, there are 2-3x more stales, but there is no baseline, no warning that something is wrong. This is why I say that whatever you do right now is not good enough, as apparently that YT guy should know more, yet it didn't make him realize that his setup is hurting him, and that he is also hurting his pool and fellow farmers.

As for Flex not paying for those stales, that is the wrong side of the coin. Pools pay out for partials only when blocks are won, and those stales are potential losses (e.g., a 20% lower chance to win a block for that YT guy).

If they did they would go broke.

The first search is the filter; only about 1 in 512 plots passes it.
If the plot passes that, it then needs to pass the next set of checks to see whether it also holds a good enough proof.
If not, there is no reward, so any pool paying out before a reward is earned is bound to fail, unless it has huge pockets.
No one with any sense would set up a pool like that.
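For what it's worth, that first filter is basically just a cheap hash check against the challenge. Here is a rough sketch of the idea in Python; the constant and the exact hash inputs are my assumptions, not copied from the chia-blockchain code:

```python
import hashlib

# Sketch only - the constant and the hash inputs are assumptions,
# not taken from the actual chia-blockchain implementation.
PLOT_FILTER_BITS = 9  # 2**9 = 512, i.e. roughly 1 plot in 512 passes


def passes_plot_filter(plot_id: bytes, challenge_hash: bytes, sp_hash: bytes) -> bool:
    """Cheap pre-check run against every plot for every challenge.

    Only plots that pass this are worth the slow disk lookup for an
    actual proof of space (the part that has to finish in time).
    """
    digest = hashlib.sha256(plot_id + challenge_hash + sp_hash).digest()
    prefix = int.from_bytes(digest[:4], "big")
    # Require the top PLOT_FILTER_BITS bits of the hash to be zero.
    return (prefix >> (32 - PLOT_FILTER_BITS)) == 0
```

Since the check only needs the plot ID and the challenge, it costs almost nothing; the expensive disk lookup only happens for the few plots that pass.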

Back on topic.
The hub wasn't the issue, kinda glad I didn't win that auction earlier; I was gambling my system was OK, and it was super cheap IMO.
But my ISP failed to fix the issue, so who knows, ffs.

Full of beer now, not tinkering today.


We do explain stales in our FAQ, but yes, colored text may help. There is a bit of an assumption that people know what stales are from ETH mining, but it's understandable that not everyone has mined before. We are pretty used to everyone coming to our Discord/Telegram to ask things and learn.

Well, I'm making progress. I pulled all the drives today; I did have a mix of NTFS folder mounts and drive letters, now they are all NTFS mounts, numbered so I know which disk is which for easier troubleshooting.

I've noticed everything can be fine, but adding any new disk throws driver errors on all my 5-bay docks, except the one I just added to.
I have to power cycle the whole lot to fix it.

When assigned a drive letter, they show up as a portable device, whereas when assigned an NTFS mount point they show as a JMicron generic USB device (dependent on the dock).
Both are on the same Microsoft driver and version, though.

Oddly, the GUI still sees the plots even when the driver fails and needs a power cycle.

So further analysis is needed, but I feel sure I'm heading in the right direction.

So my earlier analysis was not quite correct, but I'm gaining ground.

The system is using different drivers for NTFS mounts and drive-letter mounts.

Drive-letter mounts show up differently than NTFS mounts.

Adding a new disk is causing the error: all the other HDDs in the docks continue showing the NTFS mount symbol, but they also try to connect via the drive-letter mount symbol, so the 5-bay docks look like they have 10 devices each.

A power cycle returns them to 5, showing as NTFS mounts only.
If that's not done, every couple of minutes the drives try to re-initialise using the drive-letter mount drivers.

I will assume that, as the drives are constantly trying to load unneeded drivers, it is just slowing my lookup times.
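For anyone wanting to check their own rig for the same thing, one way to see every path a partition is mounted under is to ask PowerShell for the AccessPaths of each partition. This is just a sketch of the idea, not something I have verified end to end:

```python
import json
import subprocess

# Ask PowerShell for every partition and all the paths it is mounted under.
# A partition listing both a drive letter (e.g. "X:\") and a folder mount
# (e.g. "C:\mounts\disk07\") is a candidate for the "double device" behaviour.
cmd = (
    "Get-Partition | "
    "Select-Object DiskNumber, PartitionNumber, AccessPaths | "
    "ConvertTo-Json -Depth 3"
)
raw = subprocess.run(
    ["powershell", "-NoProfile", "-Command", cmd],
    capture_output=True, text=True, check=True,
).stdout

parts = json.loads(raw) if raw.strip() else []
if isinstance(parts, dict):  # a single partition comes back as an object, not a list
    parts = [parts]

for part in parts:
    paths = part.get("AccessPaths") or []
    # Drop the \\?\Volume{...} entry every partition has; keep the real mounts.
    user_paths = [p for p in paths if not p.startswith("\\\\?\\")]
    if len(user_paths) > 1:
        print(f"Disk {part['DiskNumber']} partition {part['PartitionNumber']} "
              f"is mounted at: {', '.join(user_paths)}")
```

If a drive shows up with more than one path there, that would match what the docks seem to be doing.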

So back to testing, but I'm confident I'm on the home run now.

Do you see any errors in Event Viewer?

I'll be honest, I've not even looked yet.
I had to do these steps first, as I had 37 drives and no idea of what was where, so I couldn't analyse anything sensibly.

After my sort-out today, I added 1 more drive and left it showing both symbols; I started getting lookup times of just over 5 seconds, so not as bad as before, but not great.

Leaving it now till the morning, as I pulled 1 drive to plot the last 2 TB that wasn't full.

So if there are any issues when I look, then I'll check it.

But looking at the device and the events, I could see that unneeded driver loading over and over, relentlessly.

Hmm, so 3 different errors / warnings.

1. One that Microsoft is aware of, but as of 2020 it seems they had yet to fix it since 2014, and it can be ignored.

"The storage optimizer couldn't complete retrim on Media because: The operation requested is not supported by the hardware backing the volume."

2. Same disk identifiers (KB2983588).

I've seen a recommended fix for this, but the issue has not occurred since I power cycled after last adding a disk.

3. No default permission for COM server application.

This only shows once and does not seem to affect anything.

I've copied the full paragraph for this error so I can look into it later.

Since I left it running with no double devices showing, it has run perfectly with no warnings on all but one drive; that drive returned times of just over 5 and 6 seconds.

I must look at that drive and work out why, but it's certainly better than all drives throwing issues, and not too bad: slow times only twice, 1 second apart, in 8 hrs.

The first error is potentially just due to laziness, and it is not really an error. I think what it represents is that when disk optimization is requested, the process doesn't check whether the disk is mechanical or SSD/NVMe, and the retrim step barfs on mechanical drives. The upper level that reports errors cannot tell what the reason is, and just reports the failure. Although, it is possible that once that step barfs, there is no further optimization (if that trim is included in all/full optimizations).

The third error is not relevant, and maybe not worth pursuing too much. I also have some errors that I don't know how to fix, and after spending some (short) amount of time trying to figure them out, I just gave up.

The second one is kind of telling, though. I had a similar problem when I was moving drives between computers. My understanding is that once the OS sees a drive, it remembers it and how it was added to the system (e.g., drive letter or mount point). When you remove that drive and, for instance, reboot your computer, that drive letter / mount point can be reused. Once you bring the drive back, it looks like the system gets confused from time to time. Although, maybe that "remembering" only applies to drives that we forced onto new mount points / drive letters.

I had to do some diskpart work to get it fixed, but I don't really remember what. I just checked some suggestions, and most of them ask you to generate a new GUID, but that is not what I did, as I don't want to generate "personal" GUIDs - that is the OS's job. I think that I forced the offline disk to be online - here is the article. Again, I don't really recall it, but I would say it is worth a try if you see one offline.
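If you want to try the force-it-online route, the diskpart part is small enough to script. Something along these lines should do it (the disk number is just an example, it needs an elevated prompt, and I haven't re-run this exact script, so treat it as a sketch):

```python
import os
import subprocess
import tempfile

DISK_NUMBER = 3  # example only - use the number "list disk" shows for the offline drive

# diskpart reads its commands from a script file passed with /s.
script = (
    f"select disk {DISK_NUMBER}\n"
    "online disk\n"
    "attributes disk clear readonly\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(script)
    script_path = f.name

try:
    # Must be run from an elevated (administrator) prompt.
    subprocess.run(["diskpart", "/s", script_path], check=True)
finally:
    os.remove(script_path)
```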

Having said that, in my case the problem came from me manually shuffling those drives. If in your case that is not what is happening (you are not shuffling those drives), then maybe this points to your hubs somehow going on/offline, and thus those drives being disconnected/reconnected.

The fact that when you reboot your system everything looks fine may be due to how those drives are being enumerated and mounted (e.g., in the original way).


What was the fix for the same identifiers? I only find stuff about Windows Server, etc.