Can Chia plots "go bad" over time? Mine seem to!

codinghorror · May 17, 2021, 11:57pm

OK so this is weird.

I know I have done the work on this one, so bear with me. I recently scanned EVERY drive in a datacenter hosted JBOD with the following command:

.\chia plots check -n 5 -g C:\mounts\hd01

… just to make sure all the plots were good. That 4U JBOD has a number of drives, you get the idea, like

C:\mounts\hd01
C:\mounts\hd02

etcetera. Real standard computing stuff – Windows, Mac, Linux, basic drive mount points on any OS. That 4U JBOD is driven by a 1U brain server.

So, I scanned each drive with plots check -n 5, and I removed any invalid plots. Each drive had from 0-2 plots that were invalid. This was 3 days ago. I scanned every drive and did a plot check on them.

Now fast forward to today, and I’m seeing bad plots get reported in the Chia GUI. Huh. That’s … odd? I scanned every drive a few days ago and made sure all the plots were good!

But here’s the Chia client telling me plots are bad and sure enough… when I double check… they are bad!

ERROR    Failed to open file C:\mounts\hd23\plot-k32-2021-05-12-13-45-f0e26426307bedc4d43f53502d453fb52800d274e1c55f744ace2f67927a0a22.plot. Invalid plot header magic Traceback (most recent call last):
  File "chia\plotting\plot_tools.py", line 189, in process_file
ValueError: Invalid plot header magic

2021-05-17T16:15:06.682  chia.plotting.plot_tools         : ERROR    Failed to open file C:\mounts\hd23\plot-k32-2021-05-12-13-50-117c991c91af976a63f7bd466a0cec19ac043deff4fb0f70fde6811d62df8158.plot. Invalid plot header magic Traceback (most recent call last):
  File "chia\plotting\plot_tools.py", line 189, in process_file
ValueError: Invalid plot header magic

2021-05-17T16:15:06.682  chia.plotting.plot_tools         : ERROR    Failed to open file C:\mounts\hd23\plot-k32-2021-05-12-13-57-cba497cd80a9c14b835a49107d1323c9159397ff209014a3ef70aa540113d6ae.plot. Invalid plot header magic Traceback (most recent call last):
  File "chia\plotting\plot_tools.py", line 189, in process_file

So this makes me wonder… can plot files go bad over time? This is kind of boggling my mind. A few possibilities:

Maybe -n 5 isn’t enough of a check?
Maybe the drives are actually bad?
Maybe there’s a communication error between the brain 1U and the 4U JBOD? But if so I’d expect that to show up randomly, here it is repeatable, it’s just these specific files that show up bad.

The error is always failed to open {plot}. Invalid plot header magic.

The plot farming process does not write to the disks at all, so I’m kind of wondering how plot files that were previously tested good via plots check could turn into bad plots, with no disk write activity? Nothing is writing to these disks, the only thing running on the machine is the Chia farming GUI.

I know the brain system itself, the 1U driving the JBOD, is definitely stable, it’s a Xeon, it’s got ECC memory, it passed memtest, it passed prime95/mprime overnight…

JeffJN · May 18, 2021, 12:07am

I noticed the gui reports bad plots if they have been moved to another drive, but if you restart the gui the plots are fine.

codinghorror · May 18, 2021, 12:13am

Right – that’s a possibility, thanks for noting that, but there was no movement of plots in this case. All plots are static, nothing is being moved around.

WolfGT · May 18, 2021, 12:26am

Sounds like something up with the hardware corrupting files, probably a drive that is going bad. But that is just a guess. My belief is that once a plot is created, and is good, it will be good as long as it doesn’t get modified in some way. Does that NAS have a way to look for hard/soft errors? That may help you find the culprit.

codinghorror · May 18, 2021, 12:28am

It’s not a NAS; just a JBOD. What surprises me is that nothing is really being written to these drives. It could be that the JBOD is bad? If so I’d expect random errors, but these appear to be specific to specific files.

vandy · May 18, 2021, 12:51am

Have you confirmed it’s not the same drive each time?? Hard drives have a failure rate just like any other hardware, they sometimes go bad.

codinghorror · May 18, 2021, 12:54am

I did record in my notebook how many drives had bad plots and I’ll try to correlate. I’ll even show a picture of my analog notebooking!

So all drives were checked on 5/14. The first number is the # of bad plots from plot check. The second number, circled, is the new bad plots that emerged on 5/17!

LuaKT · May 18, 2021, 12:59am

Oh I suppose just this would work

Format-Hex -Count 19 .\plot-file.plot

codinghorror · May 18, 2021, 1:10am

Ah! That’s a good one thank you! Check it out, very cool:

PS C:\Users\chia-farmer\AppData\Local\chia-blockchain\app-1.1.5\resources\app.asar.unpacked\daemon> Format-Hex -Count 19 C:\mounts\hd37\plot-k32-2021-04-01-17-16-87052438dac9c81d1cac92cc70c3c1445e0df8b4b14f3d4d57d844438d26fc24.plot

   Label: C:\mounts\hd37\plot-k32-2021-04-01-17-16-87052438dac9c81d1cac92cc70c3c1445e0df8b4b14f3d4d57d844438d26fc24.plot

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 50 72 6F 6F 66 20 6F 66 20 53 70 61 63 65 20 50 Proof of Space P
0000000000000010 6C 6F 74                                        lot

That’s on a known good plot. Now here on a bad plot.

PS C:\Users\chia-farmer\AppData\Local\chia-blockchain\app-1.1.5\resources\app.asar.unpacked\daemon> Format-Hex -Count 19 C:\mounts\hd37\plot-k32-2021-05-12-17-54-f1701e732fbcaae008c0424f99a65b4165be9afc5939ebc7efdf56d80004d910.plot

   Label: C:\mounts\hd37\plot-k32-2021-05-12-17-54-f1701e732fbcaae008c0424f99a65b4165be9afc5939ebc7efdf56d80004d910.plot

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000010 00 00 00

wow – super helpful. So that plot appears to be garbage, a bunch of zero bytes (!!)
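The same header check can be run across a whole mount point instead of one file at a time. Here is a minimal Python sketch, assuming that a healthy plot always begins with the 19 ASCII bytes "Proof of Space Plot" seen in the good hex dump above. The function names `header_status` and `scan_mount` are made up for illustration and are not part of Chia.

```python
from pathlib import Path

# The 19 bytes at the start of the known-good plot in the hex dump above
# (assumed here to be the header magic that the harvester checks).
PLOT_MAGIC = b"Proof of Space Plot"


def header_status(plot_path):
    """Return 'ok', 'zeroed', or 'bad' based on the first 19 bytes of the file."""
    with open(plot_path, "rb") as f:
        head = f.read(len(PLOT_MAGIC))
    if head == PLOT_MAGIC:
        return "ok"
    # A full-length header made only of zero bytes, like the bad plot above.
    if len(head) == len(PLOT_MAGIC) and not any(head):
        return "zeroed"
    return "bad"


def scan_mount(mount_dir):
    """Map each *.plot file directly under mount_dir to its header status."""
    plots = sorted(Path(mount_dir).glob("*.plot"))
    return {p.name: header_status(p) for p in plots}
```

Calling `scan_mount(r"C:\mounts\hd37")` would return one entry per plot on that drive. It only reads the first 19 bytes of each file, so it is quick, but it says nothing about the rest of the plot.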

codinghorror · May 18, 2021, 1:16am

Hmm. I’ve noticed these are all plots that have recent copy dates. The plot thickens… maybe I copied over bad plots, when I was filling in for other bad plots!

gerhard · May 18, 2021, 1:20am

BTW: Windows supports mounting as read-only as well. Might prevent accidents…

https://megamorf.gitlab.io/2019/07/14/mount-ntfs-partition-read-only/

codinghorror · May 18, 2021, 1:22am

Well yeah, I can Ctrl+A and set the files to read-only in Properties for sure! I’ll go ahead and do that.
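For a lot of drives, the read-only flag could also be set from a script rather than through Explorer. This is a rough sketch using Python's standard `os.chmod`, which on Windows only toggles the file's read-only attribute. The helper name `mark_plots_read_only` is hypothetical.

```python
import os
import stat
from pathlib import Path


def mark_plots_read_only(mount_dir):
    """Clear the write permission on every *.plot file directly under mount_dir."""
    changed = []
    for plot in sorted(Path(mount_dir).glob("*.plot")):
        mode = plot.stat().st_mode
        # Drop all write bits. On Windows this sets the read-only attribute;
        # on Linux/macOS it removes write permission for user, group and others.
        os.chmod(plot, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
        changed.append(plot.name)
    return changed
```

This guards against software accidentally overwriting a plot. It would not stop a failing drive, controller, or cable from corrupting data.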

luckidog · May 18, 2021, 5:46am

Consider pulling the drive that seems to have issues and putting it on a different controller, just to eliminate the JBOD as a source of errors? Could it just be misreading, rather than the data on disk actually being wrong?

codinghorror · May 18, 2021, 5:51am

As you can see from the journal pic, it is all over the place. There’s no “smoking gun” specific to any one drive.

Anyways, most holes are filled and the files are read only now. We’ll see if this helps. It is very… odd… though. I am 100% sure I did a full plot check (albeit only at -n 5) on 5/15 for every drive.

f1gm3nt · May 18, 2021, 6:09am

I’ve read a few places that plots go bad over time, but I found

Which puts that straight. There’s a few other helpful tidbits as well.

They keep the wiki pretty up to date as well so it’s a good resource to check every now and again

Tydeno · May 18, 2021, 6:45am

I also notice this behaviour on my farmer. Not at your scale, but from time to time I see plots turn “bad” with the same error you get. My farmer currently uses local storage only (E-ATX board with lots of drives connected through SATA). Until now I thought that maybe the copy job (I transfer the plots using an 8TB external HDD) invalidates the plot header.

Tydeno · May 18, 2021, 12:02pm

I just asked in Keybase and got some replies about possible root causes for this (not specific to your situation, but for anyone with invalid plot headers). I’m posting the replies as I got them:

If header magic invalid; this may be due to an incomplete copy or just plain corruption (bit rot)
Bad RAM does the same thing (marginally unstable OC’d)
Using a filesystem that leverages checksumming might help (ZFS)
Long interface cables and unreliable power won’t help
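One way to rule out the "incomplete copy" case from the list above is to compare checksums of the source and destination after each transfer. This is a minimal sketch using only Python's standard library. `sha256_of` and `copy_and_verify` are illustrative names, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks so large plots never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then re-read both files and compare their SHA-256 digests."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

A `False` result means the destination does not match the source and the copy should be redone. Note that a read straight after a copy may be served partly from the operating system's cache, so saving the digest and re-checking it later is a stronger test for on-disk corruption.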

casualChia · May 18, 2021, 3:31pm

How long have these plots been in place? Do you know for sure the harvester had previously seen them as good or was this possibly the first time they were examined by the harvester?

ianj · May 18, 2021, 4:09pm

When i checked all mine with small -n a few failed - reliably - but when i increased to 30+ they passed

So nowadays i only use -n 10 as a screener then check to -n 100 for a more confident view

codinghorror · May 18, 2021, 4:33pm

Very very interesting. Something is definitely happening because I know for sure I did a full plot check pass on all drives on 5/15!

We will see if marking the files read only helps.

The header is definitely all zeroes, thank you for that command @LuaKT