Obscure hardware issue, Invalid plot header magic

rthorntn · May 18, 2021, 11:21am

Hi,

tl;dr: getting “Invalid plot header magic” on 8 of my 14 plots. Server hardware, ECC RAM, datacentre NVMe, Ubuntu 20.04. No ECC errors and all disks pass SMART. Using three NVMe drives and five datacentre HDDs; the issue is found on plots from all three SSDs and on all five HDDs.

If it’s a hardware error it’s an obscure one, or multiple failures.

More detail:

Dual Xeon, DDR4 ECC Type: 8x32GB Multi-bit ECC (edac-util -rfull output: mc0:noinfo:all:UE:0 / mc0:noinfo:all:CE:0 / mc1:noinfo:all:UE:0 / mc1:noinfo:all:CE:0)

3 x nvme drives (plotting)

drive 1 - 3 plots
drive 2 - 3 plots
drive 3 - 1 plot

1 x nvme drive (-2) - drive that all temp files get written to before going to HDD (brand new drive)

5 x 14T sata hdd (farming) new drives

drive 1 - 4 plots (1 bad, from sequence 2) [nvme drive 1 & 2 plots come here]
drive 2 - 2 plots (2 bad, one from each sequence) [nvme drive 3 plots come here]
drive 3 - 4 plots (2 bad, from sequence 1) [nvme drive 1 plots come here]
drive 4 - 2 plots (1 bad, from sequence 1) [nvme drive 2 plots come here]
drive 5 - 2 plots (2 bad, one from each sequence) [nvme drive 2 plots come here]

14 plots complete (7 in sequence 1, 7 in sequence 2), 8 bad (5 in sequence 1, 3 in sequence 2) wtf!!!

Error:

2021-05-18T08:09:34.615 chia.plotting.plot_tools : ERROR Failed to open file /mnt/field02/plot-k32-2021-05-17-06-57-xxx.plot. Invalid plot header magic
Traceback (most recent call last):
  File "/home/rthorntn/chia-blockchain/chia/plotting/plot_tools.py", line 189, in process_file
    prover = DiskProver(str(filename))
ValueError: Invalid plot header magic

So I googled it and read that it could be disk issues or possibly RAM.

I have ECC right so it shouldn’t be that?
I didn’t get 8 plots from any one nvme, I guess all nvme drives could be bad?
The brand new Intel 750 could be bad but because it’s a single point of failure wouldn’t it corrupt all plots?
I have failed plots on all HDD drives, surely all 5 drives can’t be bad?
Cosmic rays, SATA bus corruption, SATA controller issue, who knows?
A bug, any other way to verify the plots?

I’m pretty pissed that I only have 6 good plots out of 14, less than 50% success rate. Please help, lol, preferably in a way that will get all of my 14 plots to pass the check… Should I just stop plotting until I figure it out, who knows…

I just lowered the chia plots RAM from 8000 to 4000

I just changed -2 to be the same drive as -t

Command:

screen -d -m -S chia01 bash -c 'cd /home/xxx/chia-blockchain && . ./activate && sleep 0h && chia plots create -k 32 -b 4000 -e -r 4 -u 128 -n 32 -t /mnt/1600gb_1/temp1 -2 /mnt/1600gb_1 -d /mnt/field01 |tee /home/rthorntn/chialogs/chia01_1_.log'

Here goes, I will check in 10 hours to see if it made any difference.

Tydeno · May 18, 2021, 11:24am

You farm the plots on the same machine as you plot them? So no copy job in between?
There's another thread about that issue. See here, maybe: Can Chia plots "go bad" over time? Mine seem to! - #18 by Tydeno

rthorntn · May 18, 2021, 11:25am

Thanks tydeno, same machine for plotting and farming.

rthorntn · May 18, 2021, 11:50am

Wtf is going on…

With this command:

hexdump -c plot-xxx.plot | less

Working plots show:

0000000 P r o o f o f S p a c e P
0000010 l o t

Bad plots show nothing and are also empty with ‘cat’, but the bad files use the same 102GB as the others.

I could handle corruption, but why would the files be empty?
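
For anyone who wants to run the same header check over a whole directory rather than one `hexdump` at a time, here is a minimal Python sketch. It assumes the magic is the 19-byte ASCII string `Proof of Space Plot` at offset 0, as seen in the working plots above; the function names are made up for illustration and are not part of Chia. It only looks at the first bytes, so a plot that passes could still be damaged further in (`chia plots check` is the proper test).

```python
from pathlib import Path

# Magic string seen at offset 0 of the working plots in the hexdump above.
PLOT_MAGIC = b"Proof of Space Plot"


def has_plot_magic(path):
    """Return True if the file starts with the expected plot header magic."""
    with open(path, "rb") as f:
        return f.read(len(PLOT_MAGIC)) == PLOT_MAGIC


def find_bad_plots(directory):
    """List the *.plot files in `directory` that do not start with the magic."""
    plots = Path(directory).glob("*.plot")
    return sorted(p for p in plots if not has_plot_magic(p))
```

Calling `find_bad_plots("/mnt/field02")` would return the paths of plots on that drive whose header is missing. Each check reads only a few bytes, so it is quick even on 102GB files.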

Tydeno · May 18, 2021, 12:00pm

I just asked in Keybase and got some replies about possible root causes (not specific to your situation, but covering anyone with invalid plot headers). I'm posting the replies as I got them:

If the header magic is invalid, this may be due to an incomplete copy or just plain corruption (bit rot)
Bad RAM does the same thing (e.g. a marginally unstable overclock)
Using a filesystem that leverages checksumming might help (ZFS)
Long interface cables and unreliable power won't help
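
On the incomplete-copy and bit-rot points: without a checksumming filesystem, one rough way to check a copy is to hash the file on both sides and compare. The sketch below assumes the source file still exists after the copy (not the case when the plotter moves the file itself) and uses made-up helper names. Reading a file back right after writing it may be served from the page cache, so a match does not prove the bytes on disk are good.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a ~100GB plot never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches(source, destination):
    """Return True when both files have identical SHA-256 digests."""
    return sha256_of(source) == sha256_of(destination)
```

Storing the digest next to each plot and re-hashing later would also show whether a plot that started good has changed over time.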

rthorntn · May 18, 2021, 12:18pm

Thanks @Tydeno

IMHO hardware issues create corruption, where some bits of data in the file are wrong. These files look totally empty.

With the majority of my 102GB files empty, I smell a bug with Chia and Linux.

I need to go on to Keybase

rthorntn · May 18, 2021, 12:27pm

$ hexdump plot-k32-2021-05-17-06-57-good.plot

0000000 7250 6f6f 2066 666f 5320 6170 6563 5020
0000010 6f6c 4074 a998 9ed1 5637 7a8b 0dd4 ee82
0000020 a75d cce9 7566 6246 08ee 0c4e 3163 7da6
0000030 0853 2024 0400 3176 302e 8000 f696 c07b
0000040 fed0 ef51 1cf4 6635 9a30 c54f 72db 0fe3
0000050 721a 4572 b887 0a20 ee5e 8d86 260e dd1b

$ hexdump plot-k32-2021-05-17-17-12-bad.plot

0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00000e0 0000 0000 0000 1000 26da 4487 0000 1400
00000f0 40eb 047a 0000 1900 bc0b 043a 0000 1900
0000100 d60b 0074 0000 1900 d60b b074 0000 0000
0000110 0000 0000 0000 0000 0000 0000 0000 0000
*
<I have to ^C on the bad hexdump as the cursor just freezes on the next line after the * at the bottom>

The bad plot looks like 102GB of absolute garbage.
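
The apparent freeze is expected: `hexdump` collapses repeated identical lines into `*`, but it still has to read through the whole run of zeros before it can print anything else. To get a rough idea of how much of a plot is zero-filled without reading all 102GB, a sampling approach like the sketch below may help. The function name, sample count and block size are arbitrary choices, and the result is only an estimate.

```python
import os


def zero_fraction(path, samples=64, block_size=4096):
    """Estimate what fraction of a file is zero-filled by sampling blocks."""
    size = os.path.getsize(path)
    if size == 0:
        return 1.0
    # Never take more samples than there are whole blocks in the file.
    samples = max(1, min(samples, size // block_size or 1))
    zero_blocks = 0
    with open(path, "rb") as f:
        for i in range(samples):
            # Evenly spaced offsets across the whole file.
            f.seek((size * i) // samples)
            block = f.read(block_size)
            if block.count(0) == len(block):
                zero_blocks += 1
    return zero_blocks / samples
```

A value close to 1.0 would suggest the file is mostly zeros (more like data that was never written than random corruption), while a value near 0.0 with a missing header would point at damage limited to the start of the file.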

rthorntn · May 18, 2021, 12:59pm

@Tydeno could you please do me a favour and ask about my issue on Keybase, maybe point them here?

I have a new Mac and had given up on Keybase, so I just had to do an SMS reset that takes 5 days. I can’t see how to create a new Keybase account without an invite code, so I’m a bit stuck.

rthorntn · May 18, 2021, 1:08pm

I posted an issue here:

github.com/Chia-Network/chia-blockchain

Invalid plot header magic, plots full of garbage, bug?

opened 01:06PM - 18 May 21 UTC

closed 10:47AM - 20 May 21 UTC

rthorntn


codinghorror · May 18, 2021, 8:49pm

Yes, I had a similar experience here:

rthorntn · May 18, 2021, 10:07pm

OK, so the latest sequence of 7 plots just completed and all check out, so it looks like either removing the separate -2 nvme drive from the equation or lowering the RAM from 8000 to 4000 might have fixed it. I say “might” because I guess I could still have the “go bad” over time issue and some of my older plots will go bad, who knows. I only started checking plots after the 2nd sequence had completed, so I don’t know if the 8 plots started bad or went bad.

Will be keeping a close eye on it.

philt · May 25, 2021, 1:34pm

I seem to have the same problem, but it seems to be restricted to my big Ryzen machines with 6-core/12-thread processors. Threading could be an issue. With four cores/threads, you’re less likely to have multiple Chia threads running simultaneously, because the OS and other non-Chia threads are competing for the same cores. With a 12+ thread machine, it’s more likely to occur and bring out threading issues. I’ll keep monitoring and trying different parameters and see what happens.