Why does bladebit exist if madmax can plot in ram using less ram?

Serious question, not trying to be snarky:

Bladebit requires 400+ gb to plot in ram, whereas with madmax you can get away with 223 gb (in my experience) if you wait for the move to finish before starting the next plot. You may be able to skip the wait if you use a script and an nvme buffer disk, but I have not tested that. That said, you can definitely skip the wait by adding another 110 gb of ram, and you’d still be under 384 gb.

That last bit is significant for platforms that max out at 768 or 1536 gb of ram, because it means you can run 2 or 4 jobs in parallel with madmax, rather than 1 or 3 (but 2 might be faster cuz NUMA) with Bladebit. My threadripper will not be able to use bladebit at all until 64 gb ddr4 udimms are a thing, which may take as long as never.
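To make the arithmetic explicit, here is a rough sketch of the job counts. The per-job figures are just my numbers from above (223 gb with the wait, 223 + 110 gb without it, and 400 gb as a floor for Bladebit), not measured requirements, and the function name is made up for illustration. It ignores whatever the OS itself needs.

```python
# Approximate RAM per plotting job in gb, taken from the figures in this post.
# These are rough, experience-based numbers, not official requirements.
RAM_PER_JOB_GB = {
    "madmax (wait for move)": 223,
    "madmax (no wait)": 223 + 110,
    "bladebit": 400,
}


def parallel_jobs(total_ram_gb: int, ram_per_job_gb: int) -> int:
    """Number of jobs that fit in RAM at once (OS overhead ignored)."""
    return total_ram_gb // ram_per_job_gb


for total in (768, 1536):
    for name, need in RAM_PER_JOB_GB.items():
        print(f"{total} gb: {parallel_jobs(total, need)} x {name}")
```

With the no-wait madmax figure this gives 2 and 4 jobs at 768 and 1536 gb, against 1 and 3 for Bladebit.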

I’m probably missing something, but assuming that plot times are similar between Madmax and Bladebit (as I’ve read is the case), why would I use Bladebit if it uses more ram and thus might ultimately be slower on certain systems, for the reason I outlined above?

I’ve seen reports of over 100 plots per day with bladebit; I’ve not seen that with madmax.

So on the surface without research I assumed it was just for speed.

2 Likes

On 8 year old dual socket (total 24c/48t) xeons I can currently get around 80 per day using tmpfs with nvme. I estimate that when I add ram and eliminate the nvme entirely, I’ll get close to 100 per day using madmax on that old hardware. (I’ll also be able to directly compare it to bladebit).

When I was plotting with a 1950x (16c/32t) on tmpfs and nvme, I was able to break 1300 seconds, and 1500 was easy with less aggressive clocks. I’m guessing, based on that, that a 2990wx would get comfortably over 120 per day plotting entirely in ram (if you could get the completed plot off the ramdisk before you run out of ram on the next plot).

In the link in my OP, someone reported 15 minutes with a threadripper/128gb ram with madmax, and 15 minutes on an epyc using madmax (presumably with tmpfs and nvme). On one hand the nvme would slow down the TR, but the TR is probably faster per core than the EPYC (both are 32 cores, but clock speeds were not reported for the TR), so I’m guessing it’s a wash for that reason.

1 Like

Bladebit outperforms madmax in all-RAM plotting.

On my dual Xeon e5-2680 v4 with 512GB of RAM and NVMe buffer it does so by 3 minutes, producing ~130 plots per day vs ~110 with madmax. When running parallel instances of madmax you’ll get slightly slower results than running madmax entirely on tmpfs with an NVMe buffer.

In terms of cost effectiveness, sure, the lower RAM requirement of madmax comes out cheaper, but if you want the highest performance and your system can plot entirely in RAM, then bladebit is the winner.

4 Likes

First of all, it exists just because someone created it. The creator of BB doesn’t need to ask anyone’s permission, nor would they stop pursuing their idea just because something similar exists.

As for objective differences: MM operates with “files”. The fact that you can create a RAM disk to hold those files is merely a workaround for performance or durability reasons. Operating a RAM disk like this incurs certain overhead on the CPU/memory and OS. BB addresses buckets directly in memory and writes a file only for the final output.

In extreme cases, when you have the hardware, with BB you can create plots faster than they can be written to a HDD. With appropriate SSD caching strategy and filling up multiple HDDs at once, I saw someone posting results of 6 min per plot and sustained speed of 220 per day.

6 Likes

This is the absolutely correct answer, and from my own experience I can create plots with BB a lot faster than I can move them around, since I still have no 3rd layer intermediary storage solution that can offload to the final resting place (farming drives). For the moment I have managed to top 540 plots per day.

4 Likes

Thanks for the replies, and for the technical explanation, rfc2324.

TL;DR: With respect to ram plotting and bb vs mm, I wonder if in certain scenarios mm might still be faster. Dual socket configurations, and configurations that take multiples of 384 gb of ram per numa node come to mind.

The older model dual xeons I mentioned above have a numa node per socket and 3 dimm slots per channel. I can configure them to be 256 gb per node at 1600 mhz ram, or up to 768 gb per node at 1066 mhz ram.

I presume running bb twice (once per node) in the 768/node (1.5tb total) configuration would be faster than running it once, simply because of the relatively high latencies between sockets.

At that point you’ve spent a lot of money on ram, and you could run mm four times (twice per node).

However, in both cases you are running at 1066 mhz ram, so whichever is faster, parallelizing may not be beneficial vs. running bb once (or mm twice) at the faster ram speed.

For instance, at 1600 mhz you can still have 512 gb, which is enough to run mm twice, given the -w option (wait for move operation). If you used a 3rd level of storage you might be able to do without -w, and then I would be surprised if mm is still faster (see above re: numa nodes and inter-socket latencies). BB would only run once in this configuration.

This is mostly food for thought, but I have enough ram currently to test 3rd level storage with mm, in order to avoid -w(aiting for move).
I plan to not use ssd for plotting much longer, so I will get around to it.

You should stop worrying about NUMA nodes. From my testing there seems to be no performance loss from crossing the NUMA boundary while plotting. The difference in performance between single and parallel plotting on dual socket systems appears to be due to some other inefficiency, most likely at the software/kernel level. I was actually able to achieve slightly better times parallel plotting without numactl than with it, even with around a 7% numa miss/foreign rate.

As for why bladebit is faster, it skips the step of having to transfer the plots out of working memory to a filesystem (even when using tmpfs, there is a slight performance hit when transferring from working memory to a filesystem as it works through the kernel layer). The faster the ram, the better gain you get from bladebit. With E5 v2 xeons, you will struggle to get a single plot much under 30 min with mad max, but with bladebit I’ve seen some v2 users getting around 18 minute plots.

When considering cost, you could likely find a dell r520/720 system with 256gb ram, add in some cheap NVME and possibly upgrade the CPUs for $600-700, and make 70 plots/day by parallel plotting with mad max. A similar system for bladebit would cost around $1100-1200 and only increase output by a small amount. The newer the system (like an r730) the higher the output will be, but also at significantly more cost. It would cost me around $1500 to upgrade my v3 system to 512gb RAM. So it is not really cost effective to buy a bladebit capable system, but if you’ve already got the hardware it will be more efficient.

3 Likes

I’m not sure exactly how you were using numactl, but keep in mind that, for instance, -t and -2 should be in the same numa node to maximize benefits. I haven’t tested without it recently, and I don’t have any more disks to plot on at the moment, but I recall a pretty big gain from 1x madmax to 2x. I don’t recall specifically what my results were from running 2x mm with and without numactl, however. I’ll check that again when I get some more disks.
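For clarity, this is roughly the kind of pinning I mean, sketched as a small helper that only builds the command lines (it does not run anything). The paths, thread count and function name are hypothetical, the farmer/pool key options are left out, and it assumes the madmax binary is called chia_plot. How the tmpfs mounts themselves end up placed on a node is not covered here.

```python
import shlex
from typing import List


def pinned_madmax_cmd(node: int, threads: int, tmpdir: str,
                      tmpdir2: str, finaldir: str) -> List[str]:
    """Build a madmax command pinned to one NUMA node via numactl.

    Key options (farmer/pool keys) are omitted; add them before running.
    """
    return [
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "chia_plot",
        "-r", str(threads),   # threads for this instance
        "-t", tmpdir,         # first temp dir
        "-2", tmpdir2,        # second temp dir
        "-d", finaldir,       # final destination
    ]


# One instance per socket; all paths below are placeholders.
for node in (0, 1):
    cmd = pinned_madmax_cmd(node, 24, f"/mnt/nvme{node}/",
                            f"/mnt/ram{node}/", "/mnt/plots/")
    print(shlex.join(cmd))
```

The idea is one instance per socket, each with its own -t and -2 locations, so the two jobs do not share temp space.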

Plotting entirely in ram, I make a plot every 1400 seconds. Plotting twice with -2 tmpfs and -t nvme I make a plot every 1000 seconds. I’m pretty sure I could get over 100 plots per day, if I had 512 gb of ram, and if I could move the plot to 3rd level storage before plotting runs out of ram. The unfortunate thing about this platform is the ram speed dropping from 1866 to 1600 to 1066 as you go to 256/512/768 gb of ram, so I’ll always need that 3rd tier for 2x ram plotting, I’m guessing…
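For anyone converting between the two ways these numbers get quoted, a quick sketch (the function names are just illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def plots_per_day(seconds_per_plot: float) -> float:
    """Daily output if a plot finishes every `seconds_per_plot` seconds."""
    return SECONDS_PER_DAY / seconds_per_plot


def seconds_per_plot(target_plots_per_day: float) -> float:
    """Plot interval needed to hit a daily target."""
    return SECONDS_PER_DAY / target_plots_per_day


for secs in (1400, 1000):
    print(f"{secs} s/plot -> {plots_per_day(secs):.1f} plots/day")
print(f"100 plots/day needs {seconds_per_plot(100):.0f} s/plot")
```

So 1400 seconds is about 62 plots per day, 1000 seconds is about 86, and getting to 100 per day means finishing a plot every 864 seconds.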

Incidentally, I’m not sure what V2s were used for the 18 minute plot times you saw. If they were also 2697s, then extrapolating my current time with parallel mm jobs, mm would indeed be faster than 1 bb, as I speculated. If they were the 10- or 8-core variety, then bb is probably significantly faster.

My 12 core V2s are making at least 80 plots a day right now (I said 1000 seconds before, but it’s more like 1050), but it depends very much on the nvme I’m using for -t. Cheaper, older ones can add 50% to plot times; samsung EVO 970s and 980s work well. As I said before, I’m pretty sure I can get over 100 plots a day with pure ram plotting.

They outperformed my OCed TR1950X pretty handily, though the TR has only 16 cores, and the dual xeons have 24 combined.

You make a good point about not getting much benefit from the extra ram needed to run bb. That said, if I’m still plotting a few years from now, the cost of those NVMe drives will add up. Eventually, it will become more expensive than plotting in RAM. Also, you’re amortizing the initial investment in RAM the entire time, and the RAM will have some resale value in the end (junked nvme obviously won’t).

When I get more disks, I’ll upgrade one of my servers to 512 so I can test 2xmm vs 1xbb.

BB is very well suited to a “consistent production” workflow for the R820 I have just gotten with 512GB and no buffer SSD. I can get a solid 72 plots per day without the need for an intermediate drive of any sort. That includes copy time. That’s writing direct to a USB3 hub attached enterprise drive. It took me like 3 tests to figure this out, but when I saw that I was sold on the value prop of BB vs MM. MM just needs a lot more hands on to get a consistent flow as you will have a large buffer usually of ssd or nvme to store temp plots.

There is an “easy” way around the single-drive bottleneck at the write-to-final stage, which is to use unraid in fill-up mode over a network share, on a system with no parity drive (but still using the array), disks with the same amount of empty space, and 10gb connections. It will write one plot to one disk, then another to the next disk and so on, and overcome the 150MB/s write wall as fast as the plots are created and the script to move them sees them.
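A back-of-the-envelope sketch of why one destination drive is enough at my rate but not at the higher rates mentioned in this thread. It assumes a k=32 plot is roughly 108.8 GB and that each drive sustains about 150 MB/s; both are assumptions, and real drives slow down as they fill.

```python
import math

PLOT_SIZE_GB = 108.8    # assumption: approximate size of one k=32 plot
HDD_WRITE_MB_S = 150    # assumption: sustained write speed of one drive
SECONDS_PER_DAY = 86400


def required_write_mb_s(plots_per_day: float) -> float:
    """Average write throughput needed to keep up with plot output."""
    return plots_per_day * PLOT_SIZE_GB * 1000 / SECONDS_PER_DAY


def drives_needed(plots_per_day: float) -> int:
    """Destination drives that must be written in parallel."""
    return math.ceil(required_write_mb_s(plots_per_day) / HDD_WRITE_MB_S)


for rate in (72, 220, 540):
    print(f"{rate} plots/day -> {required_write_mb_s(rate):.0f} MB/s, "
          f"{drives_needed(rate)} drive(s)")
```

Under those assumptions, 72 plots per day needs about 91 MB/s (one drive), 220 per day needs about 277 MB/s (two drives), and 540 per day needs about 680 MB/s (five drives).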

I think there is a solid TCO argument for older systems (r810/r910 (E7 class on the 910)) with 512GB ram, which can usually be scored in the 800-1000 range with decent processors and no need for SSD/NVME, as being the cheapest way to really fast plots.

I plan on doing a few days of benchmarking on BB vs MM after I get these new drives filled up. If there are any specific tests you want to see, LMK.

1 Like

I have not priced out the quad socket stuff, but I can definitely vouch for the TCO of the dual socket stuff from the same generation.

1 Like