Testing out your hard drives via SMART checks

codinghorror · April 24, 2021, 7:12pm

Great point by Ryan! What SMART data should we look at, if we think a hard drive might be bad? First thing to do is check SMART stats:

Drive info


Model	WDC WD180EDFZ-11AFWA0
Serial Number	3WJY5LMK
Capacity	18,000,207,937,536 bytes [18.0 TB]
Firmware	81.00A81
ATA Version	Unknown(0x0ffc) (unknown minor revision code: 0x009c)
ATA Standard	SATA >3.2

Smart stats

ID	Attribute	Value	Worst	Threshold	Raw data	Status
0	Raw_Read_Error_Rate	100	100	1	0	ok
1	Throughput_Performance	134	134	54	104	ok
2	Spin_Up_Time	83	83	1	333	ok
3	Start_Stop_Count	99	99	0	656	ok
4	Reallocated_Sector_Ct	100	100	1	0	ok
5	Seek_Error_Rate	100	100	1	0	ok
6	Seek_Time_Performance	140	140	20	15	ok
7	Power_On_Hours	100	100	0	447	ok
8	Spin_Retry_Count	100	100	1	0	ok
9	Power_Cycle_Count	91	91	0	628	ok
10	Power-Off_Retract_Count	100	100	0	674	ok
11	Load_Cycle_Count	100	100	0	674	ok
12	Temperature_Celsius	42	42	0	38	ok
13	Reallocated_Event_Count	100	100	0	0	ok
14	Current_Pending_Sector	100	100	0	0	ok
15	Offline_Uncorrectable	100	100	0	0	ok
16	UDMA_CRC_Error_Count	100	100	0	0	ok

This is a disk I randomly picked from one of my NASes… so which SMART stats matter, and what should we be looking for?

codinghorror · April 24, 2021, 7:12pm

Per Wikipedia, Google has a great starting point:

A field study at Google covering over 100,000 consumer-grade drives from December 2005 to August 2006 found correlations between certain SMART information and annualized failure rates:

In the 60 days following the first uncorrectable error on a drive (SMART attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.

First errors in reallocations, offline reallocations (SMART attributes 0xC4 and 0x05 or 196 and 5) and probational counts (SMART attribute 0xC5 or 197) were also strongly correlated to higher probabilities of failure.

Conversely, little correlation was found for increased temperature and no correlation for usage level. However, the research showed that a large proportion (56%) of the failed drives failed without recording any count in the “four strong SMART warnings” identified as scan errors, reallocation count, offline reallocation and probational count.

Further, 36% of failed drives did so without recording any SMART error at all, except the temperature, meaning that SMART data alone was of limited usefulness in anticipating failures.

Right off the bat… “SMART data alone is of limited usefulness in anticipating failures”

But it looks like

uncorrectable errors
reallocations
offline reallocations

are the SMART stats we want to be mindful of?

ryan · April 24, 2021, 7:34pm

The most important in my experience are Reallocated_Sector_Ct, Reallocated_Event_Count, and Current_Pending_Sector. They should all be 0 ideally. Those are the main three I always look at. Any Current_Pending_Sector count that won’t go away would warrant extra investigation.

I think UDMA_CRC_Error_Count is sometimes an indicator of a bad cable.

codinghorror · April 24, 2021, 8:24pm

On Windows I use CrystalDiskInfo, which shows these stats and has a copy-to-clipboard function. This is my desktop and its two primary drives:

----------------------------------------------------------------------------
 (03) Samsung SSD 970 PRO 1TB
----------------------------------------------------------------------------
           Model : Samsung SSD 970 PRO 1TB
        Firmware : 1B2QEXP7
   Serial Number : S5JXNG0N110212W
       Disk Size : 1024.2 GB
     Buffer Size : Unknown
    # of Sectors : 
   Rotation Rate : ---- (SSD)
       Interface : NVM Express
   Major Version : NVM Express 1.3
   Minor Version : 
   Transfer Mode : PCIe 3.0 x4 | PCIe 3.0 x4
  Power On Hours : 2987 hours
  Power On Count : 784 count
      Host Reads : 409963 GB
     Host Writes : 370074 GB
     Temperature : 57 C (134 F)
   Health Status : Good (91 %)
        Features : S.M.A.R.T., TRIM, VolatileWriteCache
       APM Level : ----
       AAM Level : ----
    Drive Letter : C:

-- S.M.A.R.T. --------------------------------------------------------------
ID RawValues(6) Attribute Name
01 000000000000 Critical Warning
02 00000000014A Composite Temperature
03 000000000064 Available Spare
04 00000000000A Available Spare Threshold
05 000000000009 Percentage Used
06 0000333ECB69 Data Units Read
07 00002E4252DD Data Units Written
08 00005259E631 Host Read Commands
09 0000455BA848 Host Write Commands
0A 0000000027A5 Controller Busy Time
0B 000000000310 Power Cycles
0C 000000000BAB Power On Hours
0D 000000000037 Unsafe Shutdowns
0E 000000000000 Media and Data Integrity Errors
0F 000000000206 Number of Error Information Log Entries

----------------------------------------------------------------------------
 (02) Samsung SSD 970 EVO 2TB
----------------------------------------------------------------------------
           Model : Samsung SSD 970 EVO 2TB
        Firmware : 2B2QEXE7
   Serial Number : S464NB0M302420M
       Disk Size : 2000.3 GB
     Buffer Size : Unknown
    # of Sectors : 
   Rotation Rate : ---- (SSD)
       Interface : NVM Express
   Major Version : NVM Express 1.3
   Minor Version : 
   Transfer Mode : PCIe 3.0 x4 | PCIe 3.0 x4
  Power On Hours : 958 hours
  Power On Count : 783 count
      Host Reads : 471700 GB
     Host Writes : 427250 GB
     Temperature : 72 C (161 F)
   Health Status : Good (85 %)
        Features : S.M.A.R.T., TRIM, VolatileWriteCache
       APM Level : ----
       AAM Level : ----
    Drive Letter : D:

-- S.M.A.R.T. --------------------------------------------------------------
ID RawValues(6) Attribute Name
01 000000000000 Critical Warning
02 000000000159 Composite Temperature
03 000000000064 Available Spare
04 00000000000A Available Spare Threshold
05 00000000000F Percentage Used
06 00003AF63824 Data Units Read
07 00003567C886 Data Units Written
08 0000457BCF0B Host Read Commands
09 00002B08AEE7 Host Write Commands
0A 00000000276D Controller Busy Time
0B 00000000030F Power Cycles
0C 0000000003BE Power On Hours
0D 000000000035 Unsafe Shutdowns
0E 000000000000 Media and Data Integrity Errors
0F 000000000245 Number of Error Information Log Entries

Note that in addition to SMART you’re seeing total host reads and writes. I have been burning through a lot of plots on these two drives, so I’ll have to replace them eventually.
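For anyone who wants to read the raw hex column directly, here is a minimal Python sketch that decodes a few of the 970 PRO values from the dump above. It assumes the usual NVMe conventions: composite temperature is reported in kelvin, and one data unit is 1000 blocks of 512 bytes. Treating the health percentage as 100 minus Percentage Used is my assumption; it happens to match both drives here (91% and 85%). The helper names are made up for illustration.

```python
# Raw hex values copied from the 970 PRO dump above.
RAW = {
    "Composite Temperature": "00000000014A",
    "Percentage Used": "000000000009",
    "Data Units Written": "00002E4252DD",
    "Power On Hours": "000000000BAB",
}

def decode(raw_hex: str) -> int:
    """Interpret a raw hex string as an unsigned integer."""
    return int(raw_hex, 16)

def kelvin_to_celsius(kelvin: int) -> float:
    """NVMe composite temperature is reported in kelvin."""
    return kelvin - 273.15

def data_units_to_gib(units: int) -> float:
    """One NVMe data unit is 1000 * 512 bytes; convert to binary gigabytes."""
    return units * 512_000 / 2**30

temp_c = kelvin_to_celsius(decode(RAW["Composite Temperature"]))
health_pct = 100 - decode(RAW["Percentage Used"])  # assumption, see above
written_gib = data_units_to_gib(decode(RAW["Data Units Written"]))
hours = decode(RAW["Power On Hours"])

print(f"{temp_c:.0f} C, health {health_pct}%, {hours} hours, "
      f"about {written_gib:,.0f} GiB written")
```

The decoded temperature and power-on hours line up with the summary block (57 C, 2987 hours). The written total lands within a couple of GB of the 370074 GB shown, which also suggests the tool's "GB" are binary gigabytes.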

Blueoxx · April 24, 2021, 9:42pm

I would try to cool down that second drive. Running above 60C reduces the life expectancy. There are heatsinks on Amazon to help with cooling.

codinghorror · April 25, 2021, 1:23am

OK, so looking at

Reallocated_Sector_Ct
Reallocated_Event_Count
Current_Pending_Sector
UDMA_CRC_Error_Count

Each NAS has 5 drives, so I’ll put 0 if a drive has all zeroes in those fields:

A: 0, 0, 0, 0, 0
B: 0, 0, 0, 0, 0
C: 0, 0, 0, 0, 0
D: 0, 0, 0, 0, 0
E: 0, 0, 0, 0, 0
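The same "all zeroes in those four fields" check can be sketched in Python against a tab-separated table like the one in the first post (columns: ID, Attribute, Value, Worst, Threshold, Raw data, Status). The function name and the sample rows are illustrative only; the "suspect" row is a made-up value, not data from any of these drives.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)

def nonzero_watched(table_text: str) -> dict:
    """Return {attribute: raw value} for watched attributes whose raw value is not 0."""
    flagged = {}
    for line in table_text.strip().splitlines():
        cols = line.split("\t")
        # Skip headers, short lines, and attributes we are not watching.
        if len(cols) < 7 or cols[1] not in WATCHED:
            continue
        raw = int(cols[5])
        if raw != 0:
            flagged[cols[1]] = raw
    return flagged

def row(*cols: str) -> str:
    """Build one tab-separated table row."""
    return "\t".join(cols)

# Rows copied from the table in the first post.
HEALTHY = "\n".join([
    row("4", "Reallocated_Sector_Ct", "100", "100", "1", "0", "ok"),
    row("12", "Temperature_Celsius", "42", "42", "0", "38", "ok"),
    row("14", "Current_Pending_Sector", "100", "100", "0", "0", "ok"),
    row("16", "UDMA_CRC_Error_Count", "100", "100", "0", "0", "ok"),
])

# Hypothetical drive with pending sectors, for illustration only.
SUSPECT = row("14", "Current_Pending_Sector", "100", "100", "0", "8", "ok")

print(nonzero_watched(HEALTHY))
print(nonzero_watched(SUSPECT))
```

An empty result corresponds to a "0" in the per-drive summary above; anything else names the attribute that needs a closer look.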

codinghorror · April 25, 2021, 11:07pm

Here’s how to check SMART metrics for a drive in Linux:

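sudo apt-get install smartmontools
sudo smartctl -i /dev/sda      # show drive identity and whether SMART is supported
sudo smartctl -s on /dev/sda   # enable SMART if it is turned off
sudo smartctl -a /dev/sda      # print all SMART data for the drive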
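If you would rather script this than read the text output, smartctl can emit JSON with `-j` (smartmontools 7.0 or later). The sketch below is only an illustration: the field names (`ata_smart_attributes`, `table`, `raw`, `value`) are my reading of that JSON layout for ATA drives, so check them against your own output, and the sample report is hand-written rather than captured from a real drive.

```python
import json
import subprocess

def read_smart_json(device: str) -> dict:
    """Run `smartctl -a -j DEVICE` (usually needs root) and parse the JSON report."""
    proc = subprocess.run(
        ["smartctl", "-a", "-j", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses nonzero exit bits even for readable reports
    )
    return json.loads(proc.stdout)

def raw_values(report: dict) -> dict:
    """Map attribute name to raw value from an ATA attribute table."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {entry["name"]: entry["raw"]["value"] for entry in table}

# Hand-written stand-in for a real report, for illustration only.
SAMPLE_REPORT = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}},
        ]
    }
}

print(raw_values(SAMPLE_REPORT))
```

On a real machine you would call `raw_values(read_smart_json("/dev/sda"))` instead of using the sample.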

codinghorror · May 9, 2021, 7:34pm

Comparing against these stats from 15 days ago (that post is dated Saturday April 24th, 1:24pm):

Samsung 970 Pro 1tb (1 plot)
Good 91% – Reads 409,963 GB – Writes 370,074 GB
Samsung 970 Evo 2tb (2 plots)
Good 85% – Reads 471,700 GB – Writes 427,250 GB

Today is Sunday May 9th, 12:27pm… we can call this 15 days for sure, it’s only off by an hour.

Samsung 970 Pro 1tb (1 plot)
Good 88% – Reads 521,864 GB – Writes 473,329 GB
Samsung 970 Evo 2tb (2 plots)
Good 77% – Reads 687,301 GB – Writes 627,193 GB

So in 15 days we see a difference of:

Samsung 970 Pro 1tb (1 plot)
Good -3% – Reads +111,901 GB – Writes +103,255 GB
Samsung 970 Evo 2tb (2 plots)
Good -8% – Reads +215,601 GB – Writes +199,943 GB

Dividing by 15, that means every day is:

Samsung 970 Pro 1tb (1 plot)
Good -0.2%, Reads +7,460 GB – Writes +6,883 GB
Samsung 970 Evo 2tb (2 plots)
Good -0.53% – Reads +14,373 GB – Writes +13,329 GB

The math sort of adds up, though it looks like the EVO drive is degrading a bit quicker than the Pro.
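The arithmetic above can be reproduced with a short Python sketch. The `Snapshot` class and function names are mine; the numbers are the April 24th and May 9th readings quoted above. The last figure, health points lost per 100 TB written, is an extra ratio I added as one way to compare wear between the two drives; since the health percentage only moves in whole points, treat it as a rough indicator.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    health_pct: int  # "Good NN%" health figure
    reads_gb: int    # total host reads, GB
    writes_gb: int   # total host writes, GB

def daily_change(start: Snapshot, end: Snapshot, days: int) -> tuple:
    """Average change per day in (health %, reads GB, writes GB)."""
    return (
        (end.health_pct - start.health_pct) / days,
        (end.reads_gb - start.reads_gb) / days,
        (end.writes_gb - start.writes_gb) / days,
    )

def health_lost_per_100tb(start: Snapshot, end: Snapshot) -> float:
    """Health points lost per 100,000 GB written between two snapshots."""
    written = end.writes_gb - start.writes_gb
    return (start.health_pct - end.health_pct) / written * 100_000

DAYS = 15
pro = (Snapshot(91, 409_963, 370_074), Snapshot(88, 521_864, 473_329))
evo = (Snapshot(85, 471_700, 427_250), Snapshot(77, 687_301, 627_193))

for name, (start, end) in {"970 Pro": pro, "970 Evo": evo}.items():
    health, reads, writes = daily_change(start, end, DAYS)
    wear = health_lost_per_100tb(start, end)
    print(f"{name}: {health:+.2f}%/day, reads {reads:+,.0f} GB/day, "
          f"writes {writes:+,.0f} GB/day, "
          f"{wear:.1f} health points per 100 TB written")
```

By this measure the Evo loses about 4 points per 100 TB written versus about 2.9 for the Pro, which is consistent with the Evo degrading a bit quicker.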

codinghorror · May 19, 2021, 4:47pm

Here’s what I have today, May 19th, at 9:47am:

Samsung 970 Pro 1tb (1 plot)
Good 87% – Reads 587,828 GB – Writes 534,397 GB
Samsung 970 Evo 2tb (2 plots)
Good 73% – Reads 819,594 GB – Writes 750,838 GB

0x0 · May 19, 2021, 4:52pm

You can use something like Scrutiny to have a Web interface for smartctl. It runs on Docker, so the setup is almost one line (if you have Docker installed).
NVMe support is not working great (at least for my NVMe drives), but the SATA drives are showing just fine, and you’ll have temps and status checks of all your drives on a single page.

codinghorror · May 29, 2021, 9:08pm

OK! I’ve stopped chia plotting on this machine for good, let’s check the final SMART stats.

Started April 24th, 1:24 pm with these stats:

Samsung 970 Pro 1tb (1 plot)
Good 91% – Reads 409,963 GB – Writes 370,074 GB
Samsung 970 Evo 2tb (2 plots)
Good 85% – Reads 471,700 GB – Writes 427,250 GB

Finished May 29th, 1:59pm with these stats:

Samsung 970 Pro 1tb (1 plot)
Good 85% – Reads 662,635 GB – Writes 603,687 GB
Samsung 970 Evo 2tb (2 plots)
Good 67% – Reads 924,283 GB – Writes 885,834 GB

That’s a continuous plotting time of at least 35 days, with two parallel plots on the 970 Evo and one plot on the 970 Pro. Over those 35 days the drives accumulated this many reads and writes:

Samsung 970 Pro 1tb (1 plot)
Reads 252,672 GB – Writes 233,613 GB
Samsung 970 Evo 2tb (2 plots)
Reads 452,583 GB – Writes 458,584 GB

The 970 Evo figures should be divided in half since it was doing two plots, which makes:

Samsung 970 Evo 2tb (1 plot)
Reads 226,292 GB – Writes 229,292 GB
Samsung 970 Pro 1tb (1 plot)
Reads 252,672 GB – Writes 233,613 GB

35 days is 840 hours, so let’s see…

Average read and write for a single plot process over 35 days (using the halved 970 Evo figures) is thus

~269 GB/hour read, ~273 GB/hour write
~6.5 TB/day read, ~6.5 TB/day write
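As a sanity check on the units, here is the same division as a small Python sketch. The per-plot totals are in GB, so dividing by 840 hours gives GB per hour, and a day's worth comes out in the single-digit TB range. The names are illustrative, and 1 TB is taken as 1000 GB for simplicity.

```python
HOURS = 35 * 24  # 840 hours of plotting

# Per-plot totals in GB over the 35 days, taken from the figures above.
PER_PLOT_GB = {
    "970 Evo (half of two plots)": {"read": 226_292, "write": 229_292},
    "970 Pro (one plot)": {"read": 252_672, "write": 233_613},
}

def rates(total_gb: float, hours: int = HOURS) -> tuple:
    """Return (GB per hour, TB per day) for a total in GB, with 1 TB = 1000 GB."""
    gb_per_hour = total_gb / hours
    tb_per_day = gb_per_hour * 24 / 1000
    return gb_per_hour, tb_per_day

for drive, totals in PER_PLOT_GB.items():
    for kind, total in totals.items():
        per_hour, per_day = rates(total)
        print(f"{drive} {kind}: ~{per_hour:.0f} GB/hour, ~{per_day:.1f} TB/day")
```

The halved 970 Evo totals give roughly 269 GB/hour read and 273 GB/hour write; the 970 Pro's single plot comes out somewhat higher on reads, at roughly 301 GB/hour.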

zictes · May 29, 2021, 10:21pm

What do you do when smartctl reports that SMART is unavailable? I have all identical 12TB WD My Book drives, and only a few of them report smart data. 90% of them are unavailable for some reason, even though they’re all formatted the same way and attached to the same USB hub.

I also tried hddtemp, which says SMART is unsupported for all drives, even the ones where smartctl works.

codinghorror · May 29, 2021, 10:38pm

That’s weird, does SMART show up if you connect them directly vs. via the USB hub?

zictes · May 29, 2021, 11:12pm

Found the issue thanks to this Stack Exchange thread: I had to explicitly pass the device type.

smartctl -a -d sat /dev/sdX works for all of my drives, while smartctl -a /dev/sdX only works for some of them.
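A script can apply the same workaround automatically: try autodetection first, then retry with `-d sat`. The Python sketch below is one possible way to do that, not a tested tool. It relies on smartctl's exit status being a bitmask whose three lowest bits signal command-line, device-open, or command failures; depending on the USB bridge, smartctl might still exit cleanly while printing that SMART is unavailable, so you may also need to inspect the output text. The function names and the injectable `runner` parameter are my own additions.

```python
import subprocess

# Low three bits of smartctl's exit status: command line did not parse,
# device open failed, or a SMART/ATA command failed.
FAILURE_BITS = 0b111

def smartctl_args(device: str, dev_type=None) -> list:
    """Build the smartctl command line, adding -d only when a type is given."""
    args = ["smartctl", "-a"]
    if dev_type:
        args += ["-d", dev_type]
    return args + [device]

def run(args: list) -> tuple:
    """Run a command and return (exit status, stdout)."""
    proc = subprocess.run(args, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout

def read_with_fallback(device: str, dev_types=(None, "sat"), runner=run):
    """Try each device type in order; return (type used, output) or None."""
    for dev_type in dev_types:
        code, output = runner(smartctl_args(device, dev_type))
        if code & FAILURE_BITS == 0:
            return dev_type, output
    return None
```

Calling `read_with_fallback("/dev/sdX")` (as root) returns the device type that worked along with the report, so drives behind a USB bridge and directly attached drives can go through the same code path.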