Boosting madMAx plotting speed by focusing on IOPS using ZFS

Hi all, I thought I’d just share my experience which I think is somewhat different to the norm, in the case that it helps someone else.

So I’ll start with what I’ve got, what speed I get and then hopefully we can talk about what further optimisations we could make.

AMD Threadripper 1950X, Asus X399 Prime M/B, 128GB 2666MHz RAM (not overclocked), 4 x Intel DC S3520 240GB SSDs, and a second machine that acts as the node (this one is the plotter), connected over a gigabit network.

With this I get 29-minute plots, or 27 with a few tweaks that I don’t apply.

I personally think that’s pretty good considering the slow SSDs I’m using, my RAM is not that fast, and the CPU is old with a lower clock speed than modern standards (though it has plenty of threads).

To achieve this I’ve done a few things. Primarily, I’ve configured the SSDs into a ZFS stripe, enabled LZ4 compression and set a 1MB recordsize. On top of that: atime off, xattr=sa and redundant_metadata=most. I haven’t yet looked at primarycache/secondarycache changes, which may improve things further - I can explain any of this if you’re interested. I’ve done no bucket optimisation (running the standard 256), I use a 110G RAMdisk of course, and I’ve given it all 32 threads. There are a couple of remaining ZFS tweaks that increase speed (sync and checksum), but I haven’t applied them as I don’t really care about one more minute at the expense of a potential data-quality issue.
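For anyone wanting to replicate it, here’s a rough sketch of the commands involved (device names, pool/dataset names and mount points below are placeholders rather than my exact setup):

```
# Plain four-SSD stripe - no redundancy, so losing one drive loses the pool.
# Device names are assumptions; check yours with lsblk.
zpool create plotpool /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Dataset tuned for plot temp files
zfs create plotpool/plots
zfs set compression=lz4         plotpool/plots
zfs set recordsize=1M           plotpool/plots
zfs set atime=off               plotpool/plots
zfs set xattr=sa                plotpool/plots
zfs set redundant_metadata=most plotpool/plots

# 110G RAMdisk for the plotter's second temp dir
mkdir -p /mnt/ram
mount -t tmpfs -o size=110G tmpfs /mnt/ram

# madMAx then runs against both, e.g.
# chia_plot -r 32 -u 256 -t /plotpool/plots/ -2 /mnt/ram/ -d /plotpool/plots/ ...
```

Because compression and recordsize are dataset properties, they can be changed on the fly without rebuilding the array, which is what makes experimenting so painless.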

After each plot completes, it is copied over the network to my NAS / node (Unraid), which happens in the background while the next plot is running. The plotter is currently running Ubuntu.
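For what it’s worth, the background copy is nothing fancy - something along these lines would do it (host, paths and the .plot glob are illustrative assumptions, not my exact script):

```
#!/bin/bash
# Ship finished plots to the NAS while the next plot is still running.
SRC=/plotpool/plots              # assumed final dir on the plotter
DEST=nas:/mnt/user/chia/plots    # assumed NAS host and share

while true; do
  for f in "$SRC"/*.plot; do     # madMAx renames .plot.tmp -> .plot when finished
    [ -e "$f" ] || continue
    rsync -a --remove-source-files "$f" "$DEST" \
      && echo "$(date) moved $(basename "$f")"
  done
  sleep 60
done
```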

Nothing is overclocked. I’ve been experimenting a bit to find the best settings because, from understanding how databases work with tables and stripe sizes, it’s all about IOPS and matching the disk layout to the kinds of reads and writes the application needs.

I nearly get this speed on a single Intel D3-S4510 I have, which just goes to show how slow these four Intel M.2s are unless striped. If I had more of the faster Intel SSDs, I’d expect something even better. In fact, the 4 x S3520s have much lower write IOPS and much lower sustained read and write throughput, but once striped the reads are so fast that overall the array outperforms the S4510.

I’d encourage anyone with a few drives lying around to try a decent RAID technology for the plotting portion to increase IOPS and see how well it goes.

Also note that I tried this with some supposedly awesome NVMe drives, the Seagate FireCuda 520 (500GB version). They’re PCIe 4.0 and I only have PCIe 3.0, but nevertheless the plot time was an incredible 50 minutes. On top of that, the two runs I did on them reduced their health to 85%, so they’re not as durable as they say. Compare that to the Intel, which is still at 99% after literally hundreds of runs.

So my conclusion is that IOPS are extremely important, and that ZFS is awesome because you can set all of these options on the fly and per dataset, rather than having to reformat the whole array. The way ZFS handles record sizes across a stripe is pretty fancy too. I look forward to hearing whether anyone else has done anything similar!

What do you think?

Thanks,

Marshalleq.


Have you tried with something even simpler, like an mdadm stripe? ZFS has a lot of overhead, although I’ve heard other folks say that the transparent compression helps with plotting.

Well, if you don’t have an L2ARC device attached to the pool, secondarycache won’t do anything; and sequential reads bypass ARC as soon as they’re detected, so primarycache won’t do much good, either. If anything, you could try setting primarycache=metadata and decreasing zfs_arc_max to a fairly small value, just to mitigate any memory contention between ARC, the RAMdisk, and the madMAx plotter.
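In case it’s useful, that experiment would look roughly like this (pool/dataset name and the 4GiB cap are just example values):

```
# Keep only metadata in ARC for the plotting dataset
zfs set primarycache=metadata plotpool/plots

# Cap ARC at ~4 GiB (value in bytes); takes effect immediately on Linux
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# Persist the cap across reboots
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
```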

Are you seeing sync writes on the pool during plotting?


Hey, nope, I haven’t tried mdadm yet - just waiting for some more time outside work to have another play. I was quite excited to be getting the speed I am as it is! But yes, it’s possible it’s faster - though likely not easier. My main thought is that ZFS’s variable stripe sizes and compression, combined with recordsize, are likely to compensate for the unknown read/write particulars of creating a plot. If we could get an exact understanding of that from Chia, it would certainly help.

Thanks - I didn’t know ZFS automatically bypassed the primary cache for sequential reads - I must look into that. I’ve been setting up datasets like that manually, so perhaps I won’t need to!

Regarding sync writes - I applied the sync and checksum changes separately, and each resulted in a performance increase of around one minute, which didn’t seem worth it for the potential downsides. So I would say yes, I saw sync writes, though it’s not like I went overboard testing it.
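For completeness, these are the two tweaks I tried and then backed out of (dataset name is a placeholder; checksum=off in particular means ZFS would no longer detect silent corruption of a plot):

```
# Tested, each worth roughly a minute, but not kept
zfs set sync=disabled plotpool/plots   # never waits on the ZIL for sync writes
zfs set checksum=off  plotpool/plots   # skips data checksumming entirely
```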

Maybe next time I’ll pop it all in a spreadsheet. It will certainly be interesting to check out mdadm - I remember using it years ago, setting up stride and stripe sizes with online RAID calculators to try to get better performance. I’ll see if I can give it a quick test tonight when my plotting run finishes. I generally use XFS as a filesystem for large files, though I can’t say Chia is totally large-file bound while plotting. Anyway, if you think a different FS would be better, let me know and I’ll try that as well.

Out of interest have you tried any raid 0 setups for chia plotting?

Cheers!

I think this only comes into play if you’re using RAIDZ. I’m guessing you just have a bunch of single-device vdevs in your pool, i.e. no RAIDZ vdevs.

The default behavior in ZFS is to only do a sync write when the O_SYNC flag is passed to open(); otherwise it does an async write. I’m curious whether you were seeing sync writes in zpool iostat before you started fiddling with it. When you say you “applied” sync, do you mean sync=always or sync=disabled? Do you have a SLOG device attached to your pool?
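A quick way to check, with placeholder pool/dataset names:

```
# Queue statistics - the syncq_write columns show synchronous write activity
zpool iostat -q plotpool 1

# And confirm what the sync property is actually set to
zfs get sync plotpool/plots
```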

I don’t think you have to worry about stride size or stripe width if you’re doing a simple RAID0 volume in mdadm. You can format your md volume with whatever filesystem you want… XFS is fine.

I have not! I was going to do a stripe of fast 2.5" HDDs but then madMAx came out and I built my plotter around that and a small Optane drive instead.

Hey, so I did eventually try mdadm with default settings and XFS. It got me about one minute faster, so that isn’t exactly enticing, though it’s not bad for default settings - obviously there may be further optimisations I could do. That said, there would be on ZFS too. Anyway, I’ll have a further look at some point to see what else I can achieve.
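For the record, the mdadm test really was just the defaults, something like this (device names assumed, same four SSDs as before):

```
# 4-drive RAID0 with mdadm's default chunk size, then XFS on top
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.xfs /dev/md0
mkdir -p /mnt/plot
mount /dev/md0 /mnt/plot
```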

Answering some of your other questions:

  1. I’m not sure you need RAIDZ for variable stripes, given I’m still running a stripe, just without parity. Though you could be right - I’m not sure how I’d check this, and googling hasn’t helped so far.
  2. I was referring to changing sync to disabled from the default of standard.
  3. OK. I haven’t looked into it - I certainly didn’t worry about it this time anyway.
  4. Nice. What size is the Optane? I’ve been wondering if those new hybrid types are any good or if you need an actual big Optane. Though what you and I call small might be different - I’m seeing small as 32GB and below, because there are lots of those being sold around NZ.

If you google “zfs variable stripe” you’ll see that it’s a feature of RAIDZ, which you’re not using.

Well, I wanted a drive that could handle the full 220GiB tmpdir. The smallest one I could find that met that requirement while also being relatively affordable was the Optane 900P 280GB.

Those little ones are usually a hybrid of NAND flash and 3D XPoint, and I think some are bifurcated x2x2, so probably not a great fit for plotting.

Nice - yeah, those are great cards. I’ve looked at them a few times, TBH. We do have the hybrid Optanes here too - I think they’re different from the smaller ones, though - something I’ll look out for next time I see one!

Did you find something that says variable stripes are limited to RAIDZ? I didn’t, and I don’t understand what parity (in the striping sense, not the metadata sense) would have to do with it. Surely it applies to any striped array, as even a plain stripe with no parity has a stripe width, right?

I’m a little bit guessing though, because I can’t seem to confirm that the two are associated exclusively.