How do SSDs fail?

KryptoMine · October 30, 2022, 4:20pm

After 300 TB of plots, I’m nearing my first SSD failure.
Durability shows ~10%. Have you had an SSD failure already?
What will happen when it fails? Will MadMax produce corrupted blocks, or will it just be like: nope, I’m no longer working?

Best regards

drhicom · October 30, 2022, 5:16pm

Why do you feel you’re going to have a failure? I have been using some SSDs for a long time that some tools say are going bad, but they still keep making plots.

KryptoMine · October 30, 2022, 5:42pm

Well, I have had bad HDDs which report that they are degraded. They still run, but may have corrupted blocks, so I don’t use them for my personal data anymore.

I wonder if it’s the same for SSDs. I don’t want to put in a lot of effort only to end up with drives full of corrupted plots.

So I wonder how they will fail. Will they produce normal plots until, one day, the operating system says: “This disk can’t be used, it’s defective”? (That would be fine with me.)

Additionally, it’s no issue for me to replace the NVMe. I have ~20 500 GB SSDs lying around from old, defective laptops that were scrapped. No need to replace SSDs unnecessarily, though.

Captain_Plots-a-lot · October 30, 2022, 5:47pm

Sure, I killed an old Kingston 256GB from Chia database writes actually (wasn’t even the plotting). Failures manifest in different ways, mine would just “lose all the data” but the drive letter was still accessible if you popped in with another drive/OS to check it out. You could even reformat it, install everything, and it’d work for a few more days then lose everything again. I did this 3 times until I realized drives are like $50 and my time was worth way more than stressing with old junk (tossed the old drive). This was one of those first waves of janky SSDs though - I’m impressed it lasted as long as it did (probably from 2015 or something).

KryptoMine · October 30, 2022, 5:51pm

Thank you for the insight!

whosrdaddy · October 30, 2022, 6:16pm

I have a Samsung 256 GB M.2 SSD. I used it in the past for plotting; now it is running in a harvester. Ubuntu shows 140% wear, but no problems so far.
Here is the smartctl output of my drive:

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 36 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 143%
Data Units Read: 540,535,198 [276 TB]
Data Units Written: 785,338,408 [402 TB]
Host Read Commands: 3,279,873,929
Host Write Commands: 3,853,275,423
Controller Busy Time: 21,028
Power Cycles: 88
Power On Hours: 1,867
Unsafe Shutdowns: 28
Media and Data Integrity Errors: 0
Error Information Log Entries: 122
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 36 Celsius
Temperature Sensor 2: 43 Celsius
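
For anyone wondering what "Critical Warning: 0x04" means in the output above: the field is a bit mask defined by the NVMe specification, so it can be decoded bit by bit. Below is a minimal Python sketch of that decoding. The short descriptions are paraphrased from the spec, and the function name `decode_critical_warning` is made up for this example. A set bit 2 (0x04) means the drive itself reports degraded reliability, which fits a "Percentage Used" above 100%, even while "Media and Data Integrity Errors" is still 0.

```python
# Bit positions of the NVMe SMART "Critical Warning" byte.
# Descriptions are paraphrased from the NVMe specification.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value):
    """Return the descriptions of all bits set in a Critical Warning value."""
    return [
        meaning
        for bit, meaning in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


# smartctl prints the value as hex text, e.g. "0x04".
for text in ("0x00", "0x04"):
    flags = decode_critical_warning(int(text, 16))
    print(text, "->", flags if flags else "no warnings")
```

A value of 0x00 decodes to an empty list, and 0x04 decodes to the single "reliability degraded" flag.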

seymour.krelborn · October 30, 2022, 6:17pm

I believe that tools that give you “critical” warnings on an SSD health are doing so for two reasons:

1. When you reach the manufacturer’s terabytes written (TBW) value, you end your warranty. That is a convenient way for manufacturers to end warranties ahead of the warranty’s time period.
2. A percentage of people will see the warning and replace their perfectly working SSD. That results in more sales.

You can run:
$ chia plots check -g name-of-plot.plot
…to check each new plot.
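
If there are many new plots, the same command can be looped over a directory. Here is a rough Python sketch that only wraps the `chia plots check -g` command shown above. The helper names and the example directory are made up. It assumes `chia` is on the PATH and that the plot directory is already known to chia. It does not interpret the result; it just echoes chia's own output for you to read.

```python
import subprocess
from pathlib import Path


def build_check_commands(plot_dir):
    """Return one `chia plots check -g <file name>` argv list per .plot file."""
    plots = sorted(Path(plot_dir).glob("*.plot"))
    return [["chia", "plots", "check", "-g", p.name] for p in plots]


def run_checks(commands):
    """Run each command in turn and echo chia's own output for manual review."""
    for argv in commands:
        print("$", " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout + result.stderr)
```

Usage would be something like `run_checks(build_check_commands("/mnt/plots"))`, where `/mnt/plots` is a placeholder path.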

I have been using my Samsung 980 Pro NVMe drives long past their TBW value, with no issues. I am still plotting with them.

I do not know the number (and it probably varies by manufacturer and drive model), but you can probably get 2x, 3x, maybe 10x (who knows?) the writes beyond the manufacturer’s advertised TBW value.

Sometimes I wonder if the manufacturers know how many writes their SSDs can handle, before becoming an issue.

The reason I wonder is because I have been hammering away on my SSDs for 18 months, and they work like the day they were new. Unless I am missing some other type of testing, then it means that the manufacturers would have to hammer away at the SSDs for that long (or longer) to find out how much punishment their SSDs can handle.

I do not think that any manufacturers have postponed the release of their drives, for years, in search of the drive’s longevity. They know enough to know that their drives can handle “X” TBW, and advertise it as such, with #1 and #2 (above) factored in.

If your time is not worth the effort to run the checks on your new plots, and especially if you already have good SSDs lying around, then replace your (reported) “critical” SSD. Peace of mind has a value, too.

I will keep hammering away on mine, until they show me some real-life sign that they have had enough.

whosrdaddy · October 30, 2022, 6:23pm

The safety margins on TBW are very high on most SSDs, especially on Samsung drives; the quality of their NAND flash is simply excellent.

Fastskiguy · October 30, 2022, 9:30pm

I had a plotter drive that, according to crystal disk, was dead, but it continued to plot. I ran out of disk space and it sat unused for a year; then, when I fired it back up, it was unreadable. I kinda wonder if it would have lasted longer if I had kept using it.

Joe

drhicom · October 30, 2022, 11:31pm

Run some disk tools like

And see what you can do with it,

KryptoMine · October 31, 2022, 7:22am

Hmm, so far so good.
WDC 1 TB drive:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 139%
Data Units Read: 4,825,518,702 [2.47 PB]
Data Units Written: 4,473,494,086 [2.29 PB]
Host Read Commands: 8,615,315,131
Host Write Commands: 7,368,521,790
Controller Busy Time: 50,176
Power Cycles: 306
Power On Hours: 11,903
Unsafe Shutdowns: 176
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

Intel 500 GB drive:
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 46 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 140%
Data Units Read: 2,123,155,525 [1.08 PB]
Data Units Written: 1,939,824,940 [993 TB]
Host Read Commands: 9,463,079,266
Host Write Commands: 8,378,022,019
Controller Busy Time: 221,741
Power Cycles: 203
Power On Hours: 11,329
Unsafe Shutdowns: 94
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
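
As a side note on reading these numbers: the bracketed totals follow directly from the raw "Data Units" counters, because the NVMe specification defines one data unit as 1,000 blocks of 512 bytes. A small Python sketch of the arithmetic, using the "Data Units Written" values from the two drives above (the function names are made up for this example):

```python
# NVMe: one "data unit" is 1,000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def parse_counter(text):
    """Parse a smartctl counter such as '4,473,494,086' into an int."""
    return int(text.replace(",", ""))


def data_units_to_tb(data_units):
    """Convert a Data Units counter to decimal terabytes (10**12 bytes)."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


wdc_written = parse_counter("4,473,494,086")
intel_written = parse_counter("1,939,824,940")

print(f"WDC written:   {data_units_to_tb(wdc_written) / 1000:.2f} PB")
print(f"Intel written: {data_units_to_tb(intel_written):.0f} TB")
```

This gives about 2.29 PB and 993 TB, in line with what smartctl prints in brackets.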

I think temperature is also relevant. I’m using some beefy copper coolers from AliExpress, which can be reused.

Noj · October 31, 2022, 9:48am

I have a 1 TB M.2 SSD. It is way, way past its EOL, and it started to run really slowly when plotting. I simply ran mkfs (to rebuild the file system), and it worked fine again. I now run mkfs before I plot a large drive (probably unnecessary), and never have a problem.

mumtazali · October 31, 2022, 1:20pm

NVME drives may be good way past their theoretical EOL for temporarily holding data (e.g. plotting), but not good for long-term storage.

WolfGT · October 31, 2022, 1:47pm

Usually, in almost all cases, the SSD catches fire, then it jumps up out of the computer and runs to the nearest child’s bed and crawls under the covers. But I’m sure it will be fine.

drhicom · October 31, 2022, 1:49pm

Can you send me all your bad disks, where are you located for shipping costs.

mattrapid · October 31, 2022, 3:53pm

I’ve gone through 3 SSDs (1 TB, 1 TB, 2 TB), and my experience on Ubuntu was that the plotting progress in the Chia UI would just freeze. The plot log wasn’t helpful either. Then, after restarting the system, I would get stuck at a BIOS screen with a hardware error. I could not start the system without booting from a USB key. Only after looking at the SMART disk data, realizing that the drive was past its lifespan, and removing it from the system did everything work normally again. I was still able to attach the drives by USB and load some plots on them for farming.

Fuzeguy · October 31, 2022, 5:21pm

generally, yes…but here’s a Samsung example that isn’t excellent with hardly any use (relatively) >

And this is how it fails in practice, you can read, but write has gone to chit big time, about USB 2.0 speed FCOL >

KryptoMine · October 31, 2022, 10:18pm

Temperature is at 70 °C.
Was it there consistently, or even hotter? Proper cooling?

Maybe it’s also just a “Monday model” (a lemon), but I’d definitely consider 70 °C on the high side if running 24/7.
The failure still came way too quickly with 20 TB written, especially for a 4 TB model. I’d RMA it.

Bones · October 31, 2022, 10:46pm

Were you using your OS drive to plot on?
If so, don’t. Those bad drives may very well be fine if used just for plotting.

mattrapid · November 1, 2022, 12:21am

No, they were all separate NVMe temporary drives. The OS drive is just for the OS.