Plots keep dying in phase 3 on an Intel NUC8i5BEH running Ubuntu Desktop

I recently bought and set up this NUC: https://www.amazon.com/gp/product/B07GX59NY8/ref=ppx_yo_dt_b_asin_title_o00_s01?ie=UTF8&psc=1

With this NVME drive: https://www.amazon.com/gp/product/B07TLYWMYW/ref=ppx_yo_dt_b_asin_title_o04_s02?ie=UTF8&psc=1

I’ve tried running 3 separate plots now (including one where I just used default settings), and each time, the plot stalled in phase 3.

My question: where do I begin to investigate what’s going on? Are there logs I should be looking at? I still see CPU activity from Chia when this happens (oddly, the CPU seems to be stuck at 12%). I’ve been tailing the debug logs, but they only show farming activity (which seems to be running fine).

Thoughts?

I’m running a plot right now, 44% done. Been going for 3 hours nearly on the dot. We’ll see how things go, only just now started Phase 2.

So it seems to have happened yet again. My debug logs are still flowing fine from a farming perspective, but my Plotting is stuck at 71%, and strangely the Chia process is stuck at exactly 12% CPU again. Very confused what’s happening.

Alright, per some good feedback on the Keybase chat, I ran some smartctl/fsck checks on the drive and it all seemed fine, so now I’m trying a plot from the command line only to see how it goes.

Ran one plot overnight, from the CLI only, and it stalled out again in Phase 3, this time at the very beginning.

You can see here that one CPU thread is completely pegged at 100%, and this has been stalled for almost 9 hours now…

What the heck do I do here? The only thing I can think of is to downgrade chia to 1.0.1 or something just to see if it’s something weird in 1.0.3…

Just ran a memtest, seems I have some bad RAM…

Which seems consistent with this Reddit thread: https://reddit.com/r/chia/comments/mjbr6t/plotting_randomly_stopping_without_error_most/

I’m running memtests now on each RAM stick to see if it’s one or the other, and I’m already seeing lots of errors on one stick, so I’ve ordered some different RAM to try again tomorrow. Yay.

Alright, ran memtest86 against each individual RAM stick in my NUC, and discovered one is riddled with errors and the other seems fine. So I’m going to try running a plot now with just the one good stick in the machine and see what happens.

I’ve now been able to run 3 successful plots after removing the bad RAM stick, so I think it’s relatively safe to say this has been figured out.

In summary, what I saw, and what I did:

  1. My plotting was randomly freezing, typically around phase 3, with no errors, and no way to recover
  2. After suggestions from the Keybase group, I downloaded memtest86 (https://www.memtest86.com/) and flashed it to a USB stick using balenaEtcher
  3. Rebooted the machine with the USB in and booted into it, then ran the default memtest. Started seeing a lot of errors really quickly, so exited the test
  4. Removed one of the RAM sticks, and reran memtest. Still had errors.
  5. Switched to the other RAM stick, no errors. Left that in, and started the machine normally again.

Now since I ordered these as a pair from Amazon, I need to return them both. I have a pair of 8GB sticks arriving today that I’ll use in the machine going forward.

Hopefully this helps anyone stuck with a similar issue.

I have a policy of running memtest on every new build immediately after building it.

Pre-Chia, the only “computers” I had built in the last 10+ years were Raspberry Pis. I’ve been on MacBooks both professionally and personally for a long time now, so I’ve gotten pretty rusty with building machines and the newer components (I wasn’t even really aware of what an NVMe drive was before getting into Chia haha).

It’s been fun to dust off some old knowledge and learn some new things with this, albeit frustrating in this case chasing demons in the machine.

I was having chia plots stall randomly in phase 3 just like you were. I went so far as to fork plotman to detect a stalled job and kill it, thinking it was just a fluke. I ran memtest and found a bad stick in my new plotting rig. Thanks for the help!
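
For anyone who wants to spot a stall without forking plotman, here is a rough sketch of the idea, not the actual plotman change. It assumes the plotter's output is being redirected to a log file and treats the job as stalled once that file has gone unmodified for too long. The function names and the 3-hour default are made up for illustration, so pick a threshold that fits your own hardware.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago the file at log_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def is_stalled(log_path, max_idle_seconds=3 * 60 * 60, now=None):
    """Treat a plot as stalled when its log has been silent for too long.

    max_idle_seconds is an assumed threshold, not a value taken from
    Chia or plotman.
    """
    return seconds_since_last_write(log_path, now) > max_idle_seconds
```

Something like is_stalled("/path/to/plot.log") could be called from a cron job or a loop. Keep in mind that killing the job only hides the symptom; in this thread the real cause was bad RAM.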

I was curious if the plots made on the faulty machine that didn’t crash would be valid, so I ran chia plots check and about 5% of the plots ended up being invalid. Wouldn’t hurt to double check if you were able to get some plots complete while you had the memory issue.
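
If you want to script that re-check, a minimal sketch could look like the following. It assumes the chia CLI is on your PATH (for example inside the activated venv) and uses the -n (number of challenges) and -g (filename filter) flags as I remember them from the 1.x CLI; confirm with chia plots check -h before relying on it. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_check_command(num_challenges=30, grep_string=None):
    """Build the argument list for `chia plots check`.

    The -n and -g flags are assumptions based on the 1.x CLI; verify
    them with `chia plots check -h`.
    """
    cmd = ["chia", "plots", "check", "-n", str(num_challenges)]
    if grep_string:
        # Only check plots whose filename contains this string.
        cmd += ["-g", grep_string]
    return cmd


def run_plot_check(num_challenges=30, grep_string=None):
    """Run the check and return (exit code, combined stdout and stderr)."""
    if shutil.which("chia") is None:
        raise RuntimeError("chia CLI not found on PATH")
    result = subprocess.run(
        build_check_command(num_challenges, grep_string),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

Calling run_plot_check() returns the raw output so you can read the per-plot results yourself. I have not tried to parse it here because I am not certain of the exact log format.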