Performance loss plotting on multiple HDDs in parallel

I dunno, it’s hard to take all this magical ram software cache stuff seriously after the failure of Intel Optane

and the failure of Vista’s ReadyBoost

https://www.anandtech.com/show/2163/6

not to mention the failure of weird hybrid HDDs that had partial SSDs for “cache” on them… remember those?

https://www.anandtech.com/show/5160/seagate-2nd-generation-momentus-xt-750gb-hybrid-hdd-review

What do all these techniques have in common? Trying to make a slow interface faster by globbing a little bit of something faster in between. And they all failed. In general this path – the path of using “a special cache layer” – has, at best, a checkered history in the computer industry.

It’s been the job of the operating system for the last forty years to cache frequently needed data in unused system memory as a matter of normal operation. This is an extremely mature area of computer science, to put it mildly.

So why the magical software? I mean heck we already know ramdisks are completely not worth it for chia… and that’s going all the way.

1 Like

No worries mate, you’ve obviously made your mind up on how it is all supposed to work.

I haven’t actually! I’m just saying the history here is… not good! Read through ReadyBoost! Read through Optane! Read through hybrid HDD/SSD drives! Decide for yourself!

It’s a special case: Windows aggressively caches reads but is write-through for safety. Not even auto-adapting settings can cover all workloads.

Caching like this has a longer history in the data center. FancyCache, since renamed to PrimoCache, appeared as a reaction to Linux bcache (written by a Googler), which in turn was a clone of earlier things meant to bring Linux on par with the likes of Dillon’s FreeBSD swapcache and, of course, ZFS’s tiered and metadata caching – which is slightly different, but covers the primary use-case and is what they want to bring btrfs up to. Of course the best would be no discrete memory hierarchy at all, in principle – like the System/38 (AS/400), where there isn’t even a concept of closing a program; they all live happily in unified, secure, capability-based memory forever. I thought NVMe would bring us closer to that forgotten utopia; instead, here we are.

So in our case: Chia writes a lot but then discards a lot! With a sufficiently large cache we can save on writes and only write out the final result.

This is a quick comparison for k=25, which finishes quicker but otherwise has the same characteristics, pushing the same GB/hour as k=32. Average seconds per 600MB micro-plot for a run of 36 jobs, 12 in parallel on 12 cores:

            ntfs   primocache
win12p       330          332

So no difference in time at all. But,

Just a tenth of the bytes written – just the final files.

(screenshot: total bytes written in the cached run)

Compared to 239,194,228,224 bytes in the test without it.
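To put that in perspective, here is the back-of-the-envelope arithmetic, using the ~239 GB figure above and the “about a tenth” ratio from this run (the ratio is the observation reported here, not a new measurement):

```python
# Back-of-the-envelope for the write reduction described above.
# The 239,194,228,224-byte total is from the uncached run; "about a tenth"
# is the observed ratio with the cache, not a new measurement.
uncached_bytes = 239_194_228_224
cached_bytes = uncached_bytes * 0.1        # "just a tenth of the bytes written"

print(f"uncached: {uncached_bytes / 1e9:6.1f} GB written to disk")
print(f"cached:   {cached_bytes / 1e9:6.1f} GB written to disk (approx.)")
print(f"saved:    {(uncached_bytes - cached_bytes) / 1e9:6.1f} GB per 36-job run")
```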

PrimoCache is very dumb in its caching strategies, and brittle and prone to data loss too. But it works for this case, or for compiling many small files… I think Windows could be tortured into supplying something like this by itself, but that isn’t available to mortals.

Of course for k=32 everything is pushed out of memory outright, especially with parallel throughput, so there’s little savings – and considering there’s no gain in time, no sense at all for real-world plots, outside of terabyte-memory-equipped datacenter blades.
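A rough sanity check on why that is (the per-plot temp figure below is a ballpark assumption for a stock k=32 plot, not something measured in this thread):

```python
# Why a RAM cache stops helping at k=32: the working set of the temp files
# dwarfs any realistic cache.  The per-plot temp figure is a ballpark
# assumption, not a measurement from this thread.
peak_temp_per_plot_gb = 250     # assumed peak temp usage for one k=32 plot
parallel_plots = 6
cache_gb = 32                   # a generous RAM write cache

working_set_gb = peak_temp_per_plot_gb * parallel_plots
print(f"~{working_set_gb} GB of hot temp data vs a {cache_gb} GB cache "
      f"({cache_gb / working_set_gb:.1%} coverage)")
```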

Of course a world where everything is designed to run as tidily as Stack Overflow, with no monsters lurking behind it, would be the loveliest.

1 Like

I’ve never lost data in a RAM cache except due to my own ineptitude. But then… UPS FTW. And of course, losing a plot in progress due to a failure of a caching layer is hardly a big deal. Caching strategies could be better, but I wouldn’t describe them as “dumb.” You can tune the behaviour with various controls, though I wish I could flag certain data as “don’t ever flush this.” But that use case doesn’t apply to plotting.

For a better test, turn off the read cache; it is a waste. A sizable write cache, when plotting in parallel, can help with SSD latency and read/write contention due to low QD, makes write amplification almost completely moot, and also, oddly enough, helps the thermals of the drive if cooling isn’t adequate. In essence, it smooths out those little stalls caused by multiple processes all trying to simultaneously dump 12GB of data to the drive. It isn’t possible to predict what multiple parallel plotting operations will want to read next, but it is very predictable that some parallel plot will want to flush 4GB of data to the drive. I configured a write cache in front of each temp SSD sized slightly larger than the number of plots being done in parallel on that SSD, and then further caches to sit in front of spinning-rust HDDs so that plots can be stacked up on slower drives if the network or destination isn’t keeping up with the supply of plots. I’ve noticed that parallel plotting can get very bursty at times in terms of IO.
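If it helps to make that sizing rule concrete, here is one way to read it as arithmetic. The 4GB per-flush figure comes from the paragraph above; the headroom factor is my own assumption:

```python
# Sketch of the write-cache sizing idea described above: make the cache big
# enough that every parallel plot on the SSD can dump one multi-GB flush
# without blocking.  The headroom factor is an assumption, not a measured value.
def suggested_write_cache_gb(parallel_plots: int,
                             flush_gb: float = 4.0,
                             headroom: float = 1.25) -> float:
    """Cache sized for one simultaneous flush per plot, plus some slack."""
    return parallel_plots * flush_gb * headroom

for n in (2, 4, 6):
    print(f"{n} parallel plots -> ~{suggested_write_cache_gb(n):.0f} GB write cache")
```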

Will this work for everybody? I don’t know. It works for me.

1 Like

Ah, ok, that makes sense. You can actually disable that on a per-device basis as I recall, for removable drives specifically? Yeah – in Device Manager, go to the drive and then right-click.

Again – I want to be clear, I’m not saying you are wrong! I’m just noting that there’s a loooong history of weird performance claims from things like Intel Optane, Vista ReadyBoost, and hybrid HDD/SSD drives that never really panned out long term.

(can you even buy hybrid HDD/SSDs any more?)

Yes, Seagate FireCuda Gaming SSHD, WD Enterprise SSHD. Several other models, including a few enterprise drives. HDDs are “hybrid” with their 128MB or 256MB of cache, but they are not pitched as such because the cache is so small. Almost all consumer SSDs are hybrid, except the hybrid is SLC/MLC in front of TLC/QLC, and we’re both aware of the “why are my parallel plots going slow?” problems that go with that.
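That SLC-in-front-of-TLC/QLC arrangement is exactly why parallel plots can suddenly slow down: once the fast cache fills, sustained writes drop to the native flash rate. A toy model of the effect (all sizes and speeds are illustrative assumptions, not specs for any particular drive):

```python
# Toy model of a consumer SSD's SLC write cache: fast until the cache fills,
# then sustained writes fall to the native TLC/QLC rate.  All numbers are
# illustrative assumptions, not specs for any particular drive.
slc_cache_gb = 100       # dynamic SLC cache size
slc_gb_per_s = 3.0       # write speed while the cache has room
native_gb_per_s = 0.4    # write speed once data goes straight to TLC/QLC

def burst_write_seconds(total_gb: float) -> float:
    fast = min(total_gb, slc_cache_gb)
    slow = max(0.0, total_gb - slc_cache_gb)
    return fast / slc_gb_per_s + slow / native_gb_per_s

for gb in (50, 100, 300, 600):
    print(f"{gb:4d} GB burst -> {burst_write_seconds(gb):6.0f} s "
          f"({gb / burst_write_seconds(gb):.2f} GB/s effective)")
```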

Some of those SKUs are getting long in the tooth, but they are just the ones I recall from memory, and recent price drops in SSDs have compressed demand in that market segment.

All major operating systems have a built-in feature to build a tiered SSD/HDD hybrid. Whilst Apple hasn’t shipped an iMac with a Fusion Drive in a few years, macOS still offers the ability to create one, which is really useful when you have a macOS machine that needs to do a bunch of builds and a RAM disk that fits in memory isn’t large enough – a tiered RAM cache in front of an SSD would be just perfect there. GitHub - JustinLloyd/fusion-drive

Windows Storage Spaces has the ability to create tiered storage built right into it, and it is available in some form even on the Home edition. Our second AIO kitchen computer uses a slow 5400rpm 2TB HDD with a faster 256GB mSATA SSD sitting in front of it, acting as a transparent cache. ReadyBoost from the Vista days has evolved and is now based on the SuperFetch service built into Windows 10, and it is still possible to create a ReadyBoost drive for lower-end hardware.

Synology NAS offers the ability to plug in two NVMe SSDs or two SATA SSDs running in RAID1 that will act as a fast cache in front of the regular SATA drives. But it gives mixed results depending on access patterns to the NAS, and I don’t recommend most users invest in it.

An Optane 905p AIC for the longest time would only work as a cache drive in front of a HDD until Intel finally “got it” and updated their software to let you use the drive as a regular drive. I sat in some really long meetings where the general consensus by the Intel higher ups was “wHy wOUld pEOple BuY An ExPEnSIve SsD tO StOrE gaMEs?!?” Then they finally enabled users to boot from the 905p AIC as well.

1 Like

Well right, this is kinda my point: the whole hybrid smooshing-fast-and-slow-things-together technique has too many downsides. It’s better to have a single fast 1TB NVMe drive than an 18TB HDD with a 512GB NVMe “supercache” fronting it. Users seem to understand that they don’t need 18TB of sometimes-usually-kinda-fast space; what they need is 1TB of consistently, always ultra-speedy space… for the same price.

This is truly hilarious and does not surprise me one bit… sigh…

I agree, most users don’t need that. And they understand the concept of “this drive is fast and stores my stuff” and “this drive is slow, and stores my other stuff.” In fact, most users don’t need 18TB at all except that one weird guy who always clicks away fast from the browser window whenever you walk by his desk.

Developers, gamers, and video content creators notwithstanding, of course. But those use cases are a tiny fraction of the overall consumer market.

Dumb caching is… dumb. Applying a cache (Registers > L1 > L2 > RAM > Optane RAM > SSD > HDD > Network) at a particular point in a process can alleviate a problem, and it should be done the same way code is profiled: “You don’t know what part of the code is slow until you instrument it.” I noticed that on my particular system, plotting was frequently waiting for drive buffers to flush before it would read the next X GB of data to perform a quicksort on, and if other processes were pushing data into the write queue, or reading from the drive, plots would temporarily stall. I cannot predict what the plotting process wants to read, but I can predict that it will want to dump several gigabytes of data at some point, so I can provide a write cache for it.
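That last point – unpredictable reads but very predictable multi-gigabyte write bursts – is basically the argument for a write-back buffer. A minimal sketch of the mechanism (the general idea only, not PrimoCache’s actual implementation):

```python
# Minimal sketch of a write-back buffer: absorb bursty writes in RAM and
# drain them to the slow device in the background.  This shows the general
# mechanism only; it is not how PrimoCache is actually implemented.
import queue
import threading
import time

class WriteBackBuffer:
    def __init__(self):
        self._pending: "queue.Queue[bytes]" = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, data: bytes) -> None:
        # Returns immediately; the writer never waits on the slow device.
        self._pending.put(data)

    def _drain(self) -> None:
        while True:
            chunk = self._pending.get()
            time.sleep(1.0)                     # stand-in for slow HDD/SSD I/O
            print(f"flushed {len(chunk):,} bytes to the backing store")

buf = WriteBackBuffer()
for _ in range(3):
    buf.write(b"x" * 4_000_000)                 # bursty writes complete instantly
time.sleep(4)                                   # let the background drain finish
```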

3 Likes

Oh, I was absent recently while you guys took the discussion to such a deep level that I can hardly catch up lol

Update: I’ve been plotting on my HDDs with the help of PrimoCache pretty well so far. My config is a 6c12t CPU, 32GB RAM, and 6 HDDs used as temp drives (-t), one plot each with a 2h stagger, with a 10GB cache set in PrimoCache. Average duration for each job is around 12hrs.
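One simple way to implement that kind of stagger is to launch one job per temp HDD yourself with a fixed delay between starts. A minimal sketch with placeholder paths; double-check the `chia plots create` flags against your installed version:

```python
# Minimal sketch of a staggered launcher for the setup described above:
# one plot job per temp HDD, started with a fixed delay between launches.
# Paths are placeholders; check the CLI flags against your chia version.
import subprocess
import time

TEMP_DIRS = [f"D:/plot-temp-{i}" for i in range(1, 7)]   # one per temp HDD
DEST_DIR = "E:/plots-final"
STAGGER_SECONDS = 2 * 60 * 60                            # 2-hour stagger

jobs = []
for temp in TEMP_DIRS:
    cmd = ["chia", "plots", "create", "-k", "32", "-t", temp, "-d", DEST_DIR]
    jobs.append(subprocess.Popen(cmd))
    time.sleep(STAGGER_SECONDS)

for job in jobs:
    job.wait()
```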

The trick here is that the biggest con of RAM cache tools such as PrimoCache – that you may lose important data which was supposed to have been written to the HDD during a power loss or crash – is hardly a problem when plotting, since your ongoing plot will be lost anyway.

1 Like

Miguel,
What settings are you using in PrimoCache?

Delayed write 300s, no prefetch

Can you explain to me how you set up your plotters? Do you have 3 hard drives for temporary storage and 3 that are written to? You say you use 2-hour staggers – do you let Chia handle that, or do you do it with a queue? Do you start a process every 2 hours? I myself have 20x 4TB disks that I want to plot to as quickly as possible. Advice is welcome.

From the tests I have done and from things I have read, I concluded that parallel plotting is not really that parallel: Chia uses a single write buffer, so a single slow HDD can delay all of your parallel plotting.

I have witnessed this to some extent. I wanted to see how slow an HDD plotter over SATA was (7200rpm 4TB CMR with about 180MB/s sequential writes) versus a new 10TB USB Seagate external hub model (which has weirdly slow sequential writes at QD1, only 85MB/s, but over 200MB/s at QD32 in CrystalDiskMark; that drive has slow writes even as a destination drive compared to the non-hub 10TB model).

Running both HDD jobs along with my 6-7 NVMe plots, everything was a crawl (despite excess RAM and CPU left over). I killed the USB job the next morning as it was the slowest and let the SATA HDD job continue… it took 22hrs (though some of that was slowed by the slow USB plot job). I did another test with just the 4TB SATA HDD with bucket changes. It caused big 0% CPU dips that seemed to sync with 100% spikes on the 4TB HDD. I killed the HDD plot job and CPU activity was steady sailing afterwards.

TL;DR: slow HDD plot jobs should probably not be mixed in with jobs on faster disks. And possibly don’t mix bucket sizes… more experiments needed on that. YMMV.

Yes, those “0% CPU dips” are probably caused by the same mechanism that slowed down my multiple HDDs – the CPU was obviously waiting for the HDD/USB drive to finish some R/W job, which caused the slowness. A memory cache can improve that, but it’s not a full remedy. I’m not sure whether there is a way to fully avoid that waiting behavior of the CPU.

I noticed that these 0% CPU drops only happen on SATA HDDs; they do not happen on USB HDDs.
And it only happens with the plotting software, while transfer speeds and disk throughput during Windows copies remain high.
It’s as if there were a software lock saying not to make two simultaneous accesses over SATA.
Strange, isn’t it?

I’ve used FancyCache, now PrimoCache, off and on for years. I’m currently giving it a shot at improving plots, but it really doesn’t help cut time on fast NVMe drives. Currently (after testing many scenarios) I’m using a 32GB write cache on one drive (of 4). No improvement in times. I also let the cache be shared across all 4 NVMe drives. Same. BTW, I’m doing 14 simultaneous plots across those 4 SSDs.

What it DOES improve, and dramatically it seems, is write wear on the SSD. It can absorb writes over an extended period and only write to the SSD occasionally, after having filtered and organized them to minimize SSD wear. Reading isn’t a wear issue for SSDs, and it’s fast, so I don’t bother caching reads.

The main takeaway I see is that Chia plotting doesn’t cache well, as it’s always new, unique accesses, so there’s little to cache. Spinning-rust disks, on the other hand, are probably helped greatly because of their nature… I just don’t use any of them anymore for active programs.

Is your CPU utilization at 100% all the time?

Not always, but regularly, yes, 100%. I’m also trying different block sizes for cache-to-SSD writes, a parameter that could help; it seems to default to 4KB, and perhaps larger will be faster.
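If you want to sanity-check whether larger blocks are even worth chasing, a crude file-level test like this at least shows the direction. It measures ordinary file writes on the target drive, not PrimoCache’s internal destaging, so treat it only as a rough indicator:

```python
# Crude block-size write test: time writing the same amount of data with
# different block sizes.  This measures plain file writes, not PrimoCache's
# internal destaging, so it is only a rough indicator of the trend.
import os
import time

TARGET = "blocktest.bin"     # put this on the drive you care about
TOTAL_MB = 256

for block_kb in (4, 64, 512, 4096):
    block = b"\0" * (block_kb * 1024)
    count = TOTAL_MB * 1024 // block_kb
    start = time.perf_counter()
    with open(TARGET, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    print(f"{block_kb:5d} KB blocks: {TOTAL_MB / elapsed:7.1f} MB/s")

os.remove(TARGET)
```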