How many long response times do you tolerate?

Question in the title!
The farm overall is very stable, and partials/payouts are exactly as expected (via Spacepool). The setup is all hardwired via Ethernet cables.

I've got about 90 drives across two harvesters and a separate main node - about 800 TiB in total.
The two remote harvesters are fine (all response times consistently <5 seconds, averaging 0.3 sec).

On my main node I get roughly one long response time (10-15 seconds) every 5-7 days, always from the same few drives. I can't quite figure out why - these are mounted SATA drives in Windows 10.
USB devices, sleep, power saving, anti-virus, updates etc. are all disabled.

It’s a bit irritating but given that all response times are under 30 seconds, would you just tolerate it?


It's a tiny percentage; I'd just leave it be.
However, the 30-second limit is for supplying the proof, not for passing the filter.

I would check which drives are slow, and if it's certain drives in particular, check those drives for issues.


USB 2.0 or USB 3.0? Just asking.

Have you ever run a chia plots check on that node?


Thanks.

Mostly 3.0, but interestingly it isn't the USB drives that show the slight excess of slow responses - it's the SATA-connected ones. I do have a third-party PCIe SATA expansion card, but I've used the same card in the other harvesters without issue.

I have run plots check in the past. Individually all plots seem OK, and there's no sign of any issue from partials etc.

Windows or Linux? And are any of the drives ones that were previously installed in other machines?


15 seconds is actually a telling number. If you get something in the range of 30+ seconds, that may point to problems with the drive or adapter card. On the other hand, 15 seconds most likely points to drive power-state changes.

It looks like your average drive size is below 10 TB. Maybe the drive in question is a smaller one and goes to sleep during a dry spell of lookups (no plots on it passing the filter for a while), as that idle gap is borderline close to its IDLE_B or STANDBY timer. Maybe you could either write a batch file that writes to that drive once per minute (see the sketch below) or use some app that will do that for you.
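A minimal sketch of that batch idea, assuming the sleepy drive is X: (a placeholder - substitute your actual drive letter or mount folder). Save something like this as touch-drive.bat (hypothetical name and path):

@echo off
rem write a timestamp to the suspect drive so it never idles long enough to spin down
rem X:\ is a placeholder for the drive (or mount folder) showing the slow lookups
echo %date% %time% > "X:\keepalive.txt"

Then have Task Scheduler run it once per minute:

schtasks /create /tn "DriveKeepAlive" /tr "C:\scripts\touch-drive.bat" /sc minute /mo 1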


What, if anything, do those same few drives have in common?
Do they share a specific drive model, that differs from all other drives in your farm?
Do they share the same controller on your PC?

Also, some drives are hard-coded to go to sleep unless they get written to periodically.
There are no Windows settings that will keep them awake.

I have a few such drives, and I have a keepalive.bat script that cycles every 100 seconds and copies a few bytes of data to each of those drives. Now they never sleep.
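For reference, a sketch of what such a keepalive.bat could look like (not the actual script - D:\, E:\ and F:\ are placeholders, so list your own drive letters or mount folders):

@echo off
rem every 100 seconds, write a few bytes to each listed drive so none of them spin down
:loop
for %%D in (D:\ E:\ F:\) do (
    echo %date% %time% > "%%Dkeepalive.txt"
)
timeout /t 100 /nobreak > nul
goto loop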

To see if the drives are sleeping, have a command staged so that you only have to press Enter to kick off a file-copy job. If there is an initial delay when you copy the file, then that drive was probably sleeping.

I recommend “robocopy.exe” for such a test, because by default it will show you the progress of the copy job. It makes it easy to see if there is a delay getting started.

robocopy.exe is included in Windows 10. I do not have Windows 11, but it is probably there, too.

Run:
robocopy /?
from the command prompt for the syntax.
A lot of info will get displayed, due to all of the options. You will not need them. The top of the help output is all you will need for your test.

Note that if you “cd” to a directory on that drive, that action might wake up the drive, which will invalidate your file copy test. So set everything up, and wait 30 minutes or so. Then press enter and see if the file copy happens instantly, or if it hesitates to get started.

I suggest that the source of your file copy test be an SSD drive, to eliminate the source drive from being a possible factor in any delays.
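For example (paths are placeholders - C:\test is a folder on an SSD holding a small test file, and X:\robotest is a folder on the suspect drive):

robocopy C:\test X:\robotest testfile.bin

If the progress output hesitates for several seconds before the copy begins, that drive was most likely asleep.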


Windows - I know I should have gone with Linux but didn’t fully expect this would become quite the project it has!

I've tried a similar thing with the respective drives - including a batch script and also a "Keep Alive" program - but if anything it made the problem worse (response times >30 seconds, albeit some time ago, so other variables may have been at play).

The only thing the drives have in common is that they are "mounted" - i.e. in Windows 10, with over 26 drives, the letters A: through Z: are all taken, so the remainder are mounted in folders. I can't understand how that would make a difference, though?

I had the same problems with the adapter card, so I reduced the number of drives connected to each card. Currently I have 8 HDDs on each card. The lookup times are very good now; at most maybe one lookup per day goes over 2 sec, with a 0.02 s average.


10-15 sec on my setup (2 harvesters + node, all Linux) indicates an HDD waking up from sleep, which I fix with a script executing every minute that keeps the drives spinning.

Another possibility is that something else is accessing the same drive at that moment, causing high IO and slowing down harvester access - but that would only cause a slowdown if there is heavy IO traffic (e.g. you are copying plots to/from that HDD).

That said, in the past I've also seen unusual behavior with a large number of USB drives connected to Intel USB controllers (that was on an old Surface Book Pro), which sort of resembled a brief hang before IO resumed. I have since moved off that system and these days use ARM64-based SoCs for harvesters (the 1st all USB, the 2nd all SATA) with no issues at all. The SoCs I use today are not RPi4s, as those have issues with a larger number of USB drives.

Forgot to mention that I monitor my max scan times and investigate anything >10 sec. I try to keep my harvesters humming with <10 sec plot scan times, which typically average <1 sec on the USB drives.
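On the Windows node, a rough way to spot-check for this (a sketch, assuming the default debug.log location and that the log level is set to INFO so the harvester's "eligible for farming ... Time: ... s." lines are being written) is to search for lookups with a two-digit number of seconds:

findstr /R /C:"Time: [0-9][0-9]\." "%USERPROFILE%\.chia\mainnet\log\debug.log"

Anything that command returns took 10 seconds or more and is worth a closer look.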
