Harvesters' OS hanging/crashing/restarting

porkopops · July 13, 2021, 11:35am

I have 2 brand new PC’s for harvesting (they were brand new about 1.5 months ago and bought for this purpose). They both have Windows 10 Pro installed. No other s/w is installed except for the default s/w provided by the vendor, but not much wasteware (as is usual for some vendors).

The only thing they do is run MadMax.

The PC’s crash anywhere from 20 mins after power up/start harvesting, to over 2 hours or even over a day. I had to set up a monitor script on another PC that runs Test-NetConnection against them, so I get an alert when they become unresponsive.
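For anyone wanting a similar watchdog, here is a minimal sketch in Python of the idea described above. It is not the script used in this thread (that one uses Test-NetConnection). It checks a TCP port rather than sending a ping, and the host names, the port (3389, assuming Remote Desktop is enabled on the harvesters) and the polling interval are all placeholders.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False


def watch(hosts, port=3389, interval=60, rounds=None, alert=print):
    """Poll each host and call alert() when one goes from up to down.

    port=3389 is an assumption (RDP); rounds=None means poll forever.
    """
    last_up = {host: True for host in hosts}
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            up = is_reachable(host, port)
            if last_up[host] and not up:
                stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                alert(f"{stamp} {host} stopped responding")
            last_up[host] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return last_up
```

Calling watch(["harvester1", "harvester2"]) (hypothetical names) would poll both machines once a minute and print a line the first time either stops answering. Swapping print for a function that sends an email or a chat message is the obvious next step.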

They’ll sit there happily doing nothing, with no crashes but when using Madmax v0.1.1 (latest version) they start to crash.

Using HWiNFO, the CPU’s aren’t hot (60-70C).

All drivers are up to date and there are no obvious errors in Event Viewer.

As a test, I’ve disabled Intel SpeedStep and Turbo boost and this sorted out one of them, but not the other one. Obviously this hurts plotting time so isn’t ideal.

Any ideas?

cr1za · July 13, 2021, 11:45am

Maybe do a RAM check through your BIOS?

porkopops · July 13, 2021, 1:01pm

Thanks. I’m doing that right now

It’s just weird that 1 PC behaves when reducing the CPU load by disabling the options mentioned above.

cr1za · July 13, 2021, 1:31pm

I think SpeedStep doesn’t matter, turbo boost more so…

matthewjbauer · July 13, 2021, 1:36pm

Just food for thought… If you have an NVMe drive that doesn’t play nice with Chia plotting and it is the OS drive, that will cause the system to crash. My recommendation from what I have read is to not plot to your OS’s NVMe for this very reason. Just giving another possible solution here.

Also on the train of thought surrounding NVMe drives and regular SSDs, they can get hot when pegged to full load. What you will want to do is put a heat sink on them and preferably also have a fan blowing air over them to get exhausted/pulled out of the computer. You do not want the SSDs overheating as they can “glitch out” and cause weird behavior.

Bones · July 13, 2021, 1:40pm

What release are you running?
This was a known issue that was fixed in the update before last iirc.

Try the latest release.

porkopops · July 13, 2021, 3:51pm

Thanks for the feedback. For both PC’s, I’m using an SSD for the OS which isn’t shared with plotting and Enterprise Samsung PM1735 - HHHL Cards. I currently have them in software RAID0 in one PC, but the same issue occurred when I used them without RAID too.

Funnily enough, both PC’s have these HHHL cards in them. The PC that crashes more often has 3 of them and the other one has only 1 of them.

I’m currently doing a RAM test, which is taking hours. Once that’s done, I’ll kick off some plots and check the SSD temp’s…

porkopops · July 13, 2021, 3:52pm

Thanks @Bones. I’m using Chia 1.2.1 and Madmax 0.1.1 - I think they’re both the latest versions.

porkopops · July 13, 2021, 3:58pm

In case it helps, here’s the SMART HWInfo from the harvester which doesn’t crash so much. It’s been plotting for most of today without a glitch

It seems OK.

matthewjbauer · July 13, 2021, 4:21pm

Unless both PCs are identical, I’d try removing the SSDs and plotting to something else and see if the issue is resolved. There have been reported cases of Samsung (or Samsung based) SSDs having issues and being extremely slow or crashing a system under intense load.

Also, just because a RAM test passes today doesn’t mean that it won’t fail tomorrow. A single bit could flip at random and may cause stability issues if the RAM or system can’t compensate.

porkopops · July 13, 2021, 4:37pm

Good idea. I have a couple of Micron SSD’s in there, so I’ll plot to them and see if it’s more stable (unfortunately they’re SATA though, but at least it’s a good test)
FYI - The PC’s are different. One’s a Z490 and the other a Z590.

porkopops · July 15, 2021, 12:14pm

Update: I stopped using the Samsung disks and plotted using a 119GB RAMDisk (temp2) and the 2 Micron disks in RAID0 (temp1 and destination)…

But the same problem occurred…

If I wasn’t already bald, I’d be pulling my hair out

Edit: I’ve physically removed the Samsung SSD’s and am trying again.

porkopops · July 15, 2021, 7:15pm

So, It’s still happening… I’m gonna shave my head.

khominhvi · July 16, 2021, 12:48am

There’s just so little information to go on. MADMAX is pretty hard on the system.

A few questions, if you don’t mind:

CPU Model / Cores?
RAM Size?
PSU Wattage?
Madmax settings?
Overclocked?

If it were my pc, I would try to rule out Madmax and find another way to stress the system. Maybe try some long stress test or benchmarking software and see if the system is stable… no idea what could emulate madmax though…
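As a rough illustration of "another way to stress the system", here is a minimal Python sketch that loads every logical core for a fixed time. It is not an emulation of MadMax: it only exercises the CPU (no heavy RAM or disk traffic), and a dedicated stress-testing tool will be more thorough. The function name burn and the 1 MiB buffer size are arbitrary choices for this example.

```python
import hashlib
import os
import threading
import time


def burn(seconds, workers=None):
    """Hash a 1 MiB buffer in `workers` threads for `seconds`.

    Returns the total number of hashes completed. hashlib releases the GIL
    while hashing large buffers, so the threads can run on separate cores.
    """
    workers = workers or os.cpu_count() or 1
    block = os.urandom(1024 * 1024)
    deadline = time.monotonic() + seconds
    counts = [0] * workers  # one slot per thread, so no locking is needed

    def worker(slot):
        while time.monotonic() < deadline:
            hashlib.sha256(block).digest()
            counts[slot] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts)
```

Running burn(3600) would keep all cores busy for an hour. If the machine stays up through that but still falls over while plotting, that points away from a pure CPU/cooling problem and towards RAM, storage or power.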

porkopops · July 16, 2021, 10:58am

Sorry for the delay…

Intel® Core™ i9 Ten-Core Processor i9-10900K (3.7GHz) 20MB Cache (10 Cores / 20 Threads)
ASUS® ROG STRIX Z590-F GAMING WIFI (LGA1200, USB 3.2, PCIe 4.0) - ARGB Ready (2.5Gb Eth)
128GB Corsair VENGEANCE DDR4 3000MHz (4 x 32GB)
512GB PCS PCIe M.2 SSD (2000 MB/R, 1100 MB/W)
CORSAIR 550W CV SERIES™ CV-550 POWER SUPPLY
CoolerMaster Hyper 212 (120mm) Fan CPU Cooler Black Edition
Micron 5300 MAX SATA SSD Qty 2

Madmax CLI:
.\chia_plot.exe -t E:\ChiaTemp\ -2 R:\ChiaTemp\ -d E:\Plots\PoolPlots\ -n -1 -r 16 -u 256 -c key -f key

E: is 2x RAID0 Micron SATA SSD’s.
R: is RAMDrive

It’s not OC’d. I’ve been doing the opposite and ramping it down due to this issue, but no luck.

I’ve since removed some of the ROG s/w in case it was faulting while madmax is maxxing out the CPU.

EDIT: The RAM test came back clean after running for about 12 hours

khominhvi · July 16, 2021, 11:46am

Hmm I don’t see anything wrong with this setup. Apart from the budget friendly CPU cooler on a beastly 10900K, but I don’t think that’s the problem.

Have you tried this without RAID? I’m wondering if RAID 0 is the problem.
Motherboard BIOS up to date?

porkopops · July 16, 2021, 12:51pm

The BIOS is at the latest version, but there could still be a bug I guess.
I had the same issue before I started to use the RAID0 too.

TBH I’ve resigned myself to thinking that I’m just going to have to live with it until I’m fully plotted, which should just be about a week.

Thanks for your support.

porkopops · July 16, 2021, 2:24pm

Just in case someone feels kind enough to have a look at this, I logged around 1.5 hours with HWiNFO until the PC crashed.

I honestly can’t see anything obvious in the log at all that could cause these hangs/reboots. However, the CPU temp is higher than I thought, though not critically hot. I only checked the last ~20 rows, so just before the crash occurred.
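Since only the last rows were checked by eye, a small script that scans the whole log for the peak of every temperature column may help. This is a sketch under assumptions: it treats any column whose header contains "°C" as a temperature, and the sample column names in the comments are made up, so adjust the marker to whatever the headers in the actual csv look like.

```python
import csv
import io


def hottest_readings(csv_text, marker="°C"):
    """Return {column name: (max value, data row number)} for temperature columns.

    A column counts as a temperature if its header contains `marker`
    (an assumption about how the log labels units, e.g. "CPU Package [°C]").
    Columns sharing the same header name are merged into one entry.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = [i for i, name in enumerate(header) if marker in name]
    peaks = {}
    for row_no, row in enumerate(reader, start=1):
        for i in cols:
            try:
                value = float(row[i])
            except (IndexError, ValueError):
                continue  # skip blanks, text flags and short trailing rows
            name = header[i]
            if name not in peaks or value > peaks[name][0]:
                peaks[name] = (value, row_no)
    return peaks
```

To use it on a real log, read the file into a string first, for example open(path, encoding="latin-1").read(). The encoding is a guess; if the degree sign comes out garbled, try another one. Sorting the result by value shows straight away which sensor peaked and on which row.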

Here’s a link to download the csv file from my OneDrive: https://1drv.ms/u/s!AglB3Ty2EV3Eg5cHL8Q2xfHKbGxwRA?e=O4MjmU

Thanks in advance

Fuzeguy · July 16, 2021, 2:40pm

Well I did look at your HWMonitor sheet. You may not think your CPU is running hot, but if mine were those temps, I’d run down to BestBuy and buy this https://www.bestbuy.com/site/cooler-master-masterliquid-ml120l-rgb-120mm-radiator-cpu-liquid-cooling-system-with-rgb-lighting-black/6303597.p?skuId=6303597 immediately. 90+ deg is way too hot to expect no problems.

I did this for my two systems and temps cratered to 70-+ degree range under almost any load. For so cheap, it would be a slam dunk upgrade and well worth it.

porkopops · July 16, 2021, 3:12pm

Cheers @Fuzeguy

I didn’t notice those temps further up the list. Certainly when it crashed it was nowhere near 90, but as you say that is flippin’ hot to have at any time.

So better be safe… New purchase coming up. Thanks

Edit: 2 Water coolers are on order… I hope this is it… Mind you, I already shaved my head. nvm.