Testing out your hard drives via SMART checks

Great point by Ryan! What SMART data should we look at, if we think a hard drive might be bad? First thing to do is check SMART stats:

Drive info

Model WDC WD180EDFZ-11AFWA0
Serial Number 3WJY5LMK
Capacity 18,000,207,937,536 bytes [18.0 TB]
Firmware 81.00A81
ATA Version Unknown(0x0ffc) (unknown minor revision code: 0x009c)
ATA Standard SATA >3.2

Smart stats

ID  Attribute                Value  Worst  Threshold  Raw data  Status
0   Raw_Read_Error_Rate      100    100    1          0         ok
1   Throughput_Performance   134    134    54         104       ok
2   Spin_Up_Time             83     83     1          333       ok
3   Start_Stop_Count         99     99     0          656       ok
4   Reallocated_Sector_Ct    100    100    1          0         ok
5   Seek_Error_Rate          100    100    1          0         ok
6   Seek_Time_Performance    140    140    20         15        ok
7   Power_On_Hours           100    100    0          447       ok
8   Spin_Retry_Count         100    100    1          0         ok
9   Power_Cycle_Count        91     91     0          628       ok
10  Power-Off_Retract_Count  100    100    0          674       ok
11  Load_Cycle_Count         100    100    0          674       ok
12  Temperature_Celsius      42     42     0          38        ok
13  Reallocated_Event_Count  100    100    0          0         ok
14  Current_Pending_Sector   100    100    0          0         ok
15  Offline_Uncorrectable    100    100    0          0         ok
16  UDMA_CRC_Error_Count     100    100    0          0         ok

This is a disk I randomly picked from one of my NASes… so which SMART stats matter, and what should we be looking for?

Per Wikipedia, a Google field study is a great starting point:

A field study at Google covering over 100,000 consumer-grade drives from December 2005 to August 2006 found correlations between certain SMART information and annualized failure rates:

  • In the 60 days following the first uncorrectable error on a drive (SMART attribute 0xC6 or 198) detected as a result of an offline scan, the drive was, on average, 39 times more likely to fail than a similar drive for which no such error occurred.
  • First errors in reallocations, offline reallocations (SMART attributes 0xC4 and 0x05 or 196 and 5) and probational counts (SMART attribute 0xC5 or 197) were also strongly correlated to higher probabilities of failure.
  • Conversely, little correlation was found for increased temperature and no correlation for usage level. However, the research showed that a large proportion (56%) of the failed drives failed without recording any count in the “four strong SMART warnings” identified as scan errors, reallocation count, offline reallocation and probational count.
  • Further, 36% of failed drives did so without recording any SMART error at all, except the temperature, meaning that SMART data alone was of limited usefulness in anticipating failures.

Right off the bat… “SMART data alone is of limited usefulness in anticipating failures” :sweat:

But it looks like

  • uncorrectable errors
  • reallocations
  • offline reallocations

are the SMART stats we want to be mindful of?

The most important in my experience are Reallocated_Sector_Ct, Reallocated_Event_Count, and Current_Pending_Sector. Ideally they should all be 0. Those are the main three I always look at. Any Current_Pending_Sector count that won’t go away warrants extra investigation.

I think UDMA_CRC_Error_Count is sometimes an indicator there’s a bad cable.
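
For anyone who wants to script that check, here is a rough Python sketch that scans rows shaped like the table in the first post (ID, attribute, value, worst, threshold, raw, status) and reports any of these attributes with a non-zero raw value. The function name and the seven-column row layout are my own assumptions for illustration; other tools print different columns.

```python
# Attributes singled out in this thread; ideally every raw value is 0.
WATCHLIST = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
)


def nonzero_watchlist(table_text):
    """Return {attribute: raw_value} for watchlist rows whose raw value is not 0.

    Assumes whitespace-separated rows of the form:
    ID  Attribute  Value  Worst  Threshold  Raw  Status
    """
    flagged = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a seven-field watchlist row.
        if len(fields) != 7 or fields[1] not in WATCHLIST:
            continue
        raw = int(fields[5])
        if raw != 0:
            flagged[fields[1]] = raw
    return flagged


sample = """\
4 Reallocated_Sector_Ct 100 100 1 0 ok
13 Reallocated_Event_Count 100 100 0 0 ok
14 Current_Pending_Sector 100 100 0 0 ok
16 UDMA_CRC_Error_Count 100 100 0 0 ok
"""
print(nonzero_watchlist(sample))  # {} for the healthy drive in the first post
```

An empty result means all of the watched counters are zero; anything else is worth a closer look.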

On Windows I use CrystalDiskInfo, which shows these stats and has a copy-to-clipboard function. Here are the two primary drives in my desktop:

----------------------------------------------------------------------------
 (03) Samsung SSD 970 PRO 1TB
----------------------------------------------------------------------------
           Model : Samsung SSD 970 PRO 1TB
        Firmware : 1B2QEXP7
   Serial Number : S5JXNG0N110212W
       Disk Size : 1024.2 GB
     Buffer Size : Unknown
    # of Sectors : 
   Rotation Rate : ---- (SSD)
       Interface : NVM Express
   Major Version : NVM Express 1.3
   Minor Version : 
   Transfer Mode : PCIe 3.0 x4 | PCIe 3.0 x4
  Power On Hours : 2987 hours
  Power On Count : 784 count
      Host Reads : 409963 GB
     Host Writes : 370074 GB
     Temperature : 57 C (134 F)
   Health Status : Good (91 %)
        Features : S.M.A.R.T., TRIM, VolatileWriteCache
       APM Level : ----
       AAM Level : ----
    Drive Letter : C:

-- S.M.A.R.T. --------------------------------------------------------------
ID RawValues(6) Attribute Name
01 000000000000 Critical Warning
02 00000000014A Composite Temperature
03 000000000064 Available Spare
04 00000000000A Available Spare Threshold
05 000000000009 Percentage Used
06 0000333ECB69 Data Units Read
07 00002E4252DD Data Units Written
08 00005259E631 Host Read Commands
09 0000455BA848 Host Write Commands
0A 0000000027A5 Controller Busy Time
0B 000000000310 Power Cycles
0C 000000000BAB Power On Hours
0D 000000000037 Unsafe Shutdowns
0E 000000000000 Media and Data Integrity Errors
0F 000000000206 Number of Error Information Log Entries

----------------------------------------------------------------------------
 (02) Samsung SSD 970 EVO 2TB
----------------------------------------------------------------------------
           Model : Samsung SSD 970 EVO 2TB
        Firmware : 2B2QEXE7
   Serial Number : S464NB0M302420M
       Disk Size : 2000.3 GB
     Buffer Size : Unknown
    # of Sectors : 
   Rotation Rate : ---- (SSD)
       Interface : NVM Express
   Major Version : NVM Express 1.3
   Minor Version : 
   Transfer Mode : PCIe 3.0 x4 | PCIe 3.0 x4
  Power On Hours : 958 hours
  Power On Count : 783 count
      Host Reads : 471700 GB
     Host Writes : 427250 GB
     Temperature : 72 C (161 F)
   Health Status : Good (85 %)
        Features : S.M.A.R.T., TRIM, VolatileWriteCache
       APM Level : ----
       AAM Level : ----
    Drive Letter : D:

-- S.M.A.R.T. --------------------------------------------------------------
ID RawValues(6) Attribute Name
01 000000000000 Critical Warning
02 000000000159 Composite Temperature
03 000000000064 Available Spare
04 00000000000A Available Spare Threshold
05 00000000000F Percentage Used
06 00003AF63824 Data Units Read
07 00003567C886 Data Units Written
08 0000457BCF0B Host Read Commands
09 00002B08AEE7 Host Write Commands
0A 00000000276D Controller Busy Time
0B 00000000030F Power Cycles
0C 0000000003BE Power On Hours
0D 000000000035 Unsafe Shutdowns
0E 000000000000 Media and Data Integrity Errors
0F 000000000245 Number of Error Information Log Entries

Note that in addition to SMART you’re seeing total writes; I have been burning through a lot of plots on these two drives so I’ll have to replace them eventually :cold_sweat:
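
The raw values in those dumps are hexadecimal, so here is a small Python sketch for decoding them. It assumes the usual NVMe conventions: composite temperature is reported in kelvins, and one data unit is 1000 blocks of 512 bytes. Treating the health percentage as 100 minus "Percentage Used" is my guess at how the tool derives it, though it does match both drives above. The function names are made up for illustration.

```python
KELVIN_OFFSET = 273           # NVMe composite temperature is in kelvins
DATA_UNIT_BYTES = 512 * 1000  # one NVMe data unit = 1000 x 512-byte blocks


def composite_temp_c(raw_hex):
    """Convert the raw Composite Temperature value to degrees Celsius."""
    return int(raw_hex, 16) - KELVIN_OFFSET


def health_percent(percentage_used_hex):
    """Assumed health figure: 100 minus Percentage Used, floored at 0."""
    return max(0, 100 - int(percentage_used_hex, 16))


def data_units_to_gib(raw_hex):
    """Convert Data Units Read/Written to GiB (2**30 bytes)."""
    return int(raw_hex, 16) * DATA_UNIT_BYTES / 2**30


# Raw values copied from the 970 PRO dump above.
print(composite_temp_c("00000000014A"))  # 57
print(health_percent("000000000009"))    # 91
print(int("000000000BAB", 16))           # 2987 power-on hours
print(round(data_units_to_gib("00002E4252DD")))
```

The last line lands within a couple of GB of the "Host Writes" figure in the dump, which suggests the tool's "GB" is really 2^30 bytes, but that is an inference rather than something the output states.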

I would try to cool down that second drive. Running above 60C reduces life expectancy. There are heatsinks on Amazon that help with cooling.

OK, so looking at

  • Reallocated_Sector_Ct
  • Reallocated_Event_Count
  • Current_Pending_Sector
  • UDMA_CRC_Error_Count

Each NAS has 5 drives, so I’ll put 0 if a drive has all zeroes in those fields:

  • A: 0, 0, 0, 0, 0
  • B: 0, 0, 0, 0, 0
  • C: 0, 0, 0, 0, 0
  • D: 0, 0, 0, 0, 0
  • E: 0, 0, 0, 0, 0

Here’s how to check SMART metrics for a drive in Linux:

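# Install smartmontools (Debian/Ubuntu)
sudo apt-get install smartmontools
# Show drive identity info
sudo smartctl -i /dev/sda
# Enable SMART on the drive if it is disabled
sudo smartctl -s on /dev/sda
# Print all SMART information
sudo smartctl -a /dev/sda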

Comparing against the stats I posted 15 days ago (that post is dated Saturday April 24th, 1:24pm):

  • Samsung 970 Pro 1tb (1 plot)
    Good 91% – Reads 409,963 GB – Writes 370,074 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good 85% – Reads 471,700 GB – Writes 427,250 GB

Today is Sunday May 9th, 12:27pm… we can call this 15 days for sure, it’s only off by an hour.

  • Samsung 970 Pro 1tb (1 plot)
    Good 88% – Reads 521,864 GB – Writes 473,329 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good 77% – Reads 687,301 GB – Writes 627,193 GB

So in 15 days we see a difference of:

  • Samsung 970 Pro 1tb (1 plot)
    Good -3% – Reads +111,901 GB – Writes +103,255 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good -8% – Reads +215,601 GB – Writes +199,943 GB

Dividing by 15, that means every day is:

  • Samsung 970 Pro 1tb (1 plot)
    Good -0.2% – Reads +7,460 GB – Writes +6,883 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good -0.53% – Reads +14,373 GB – Writes +13,329 GB

The math sort of adds up, though it looks like the EVO drive is degrading a bit quicker than the Pro.
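
The same subtraction and division can be redone in a few lines of Python if you want to repeat it with your own snapshots. This is only a sketch; the `Snapshot` class and `per_day_change` function are names I made up, and the numbers are the ones posted above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    health_pct: int
    reads_gb: int
    writes_gb: int


def per_day_change(before, after, days):
    """Average daily change in each field between two snapshots."""
    return {
        "health_pct": (after.health_pct - before.health_pct) / days,
        "reads_gb": (after.reads_gb - before.reads_gb) / days,
        "writes_gb": (after.writes_gb - before.writes_gb) / days,
    }


# April 24th vs. May 9th, 15 days apart.
pro = per_day_change(Snapshot(91, 409_963, 370_074), Snapshot(88, 521_864, 473_329), 15)
evo = per_day_change(Snapshot(85, 471_700, 427_250), Snapshot(77, 687_301, 627_193), 15)

for name, change in (("970 Pro", pro), ("970 Evo", evo)):
    print(
        f"{name}: {change['health_pct']:+.2f}%/day, "
        f"reads {change['reads_gb']:+,.0f} GB/day, "
        f"writes {change['writes_gb']:+,.0f} GB/day"
    )
```

Note that the printed values are rounded to the nearest GB, so a couple of them come out one higher than the truncated figures in the list.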

Here’s what I have today, May 19th, at 9:47am:

  • Samsung 970 Pro 1tb (1 plot)
    Good 87% – Reads 587,828 GB – Writes 534,397 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good 73% – Reads 819,594 GB – Writes 750,838 GB

You can use something like Scrutiny to get a web interface for smartctl. It runs on Docker, so the setup is almost a one-liner (if you have Docker installed).
NVMe support is not working great (at least for my NVMe drives), but the SATA drives show up just fine, and you get temps and status checks for all your drives on a single page.

OK! I’ve stopped chia plotting on this machine for good, let’s check the final SMART stats.

Started April 24th, 1:24 pm with these stats:

  • Samsung 970 Pro 1tb (1 plot)
    Good 91% – Reads 409,963 GB – Writes 370,074 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good 85% – Reads 471,700 GB – Writes 427,250 GB

Finished May 29th, 1:59pm with these stats:

  • Samsung 970 Pro 1tb (1 plot)
    Good 85% – Reads 662,635 GB – Writes 603,687 GB
  • Samsung 970 Evo 2tb (2 plots)
    Good 67% – Reads 924,283 GB – Writes 885,834 GB

That’s a continuous plotting time of at least 35 days, with two parallel plots on the 970 Evo and a single plot on the 970 Pro… so over those 35 days the drives accumulated this many reads and writes:

  • Samsung 970 Pro 1tb (1 plot)
    Reads 252,672 GB – Writes 233,613 GB
  • Samsung 970 Evo 2tb (2 plots)
    Reads 452,583 GB – Writes 458,584 GB

The 970 Evo totals should be halved since it was running two plots, which gives per-plot figures of:

  • Samsung 970 Evo 2tb (1 plot)
    Reads 226,292 GB – Writes 229,292 GB
  • Samsung 970 Pro 1tb (1 plot)
    Reads 252,672 GB – Writes 233,613 GB

35 days is 840 hours, so let’s see…

Average read and write for a single plot process over 35 days (using the per-plot 970 Evo figures, with totals in GB) is thus

  • ~269 GB/hour read, ~273 GB/hour write
  • ~6.5 TB/day read, ~6.6 TB/day write
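
Since the totals above are in GB, the division works out to GB per hour and TB per day. Here is a quick Python sketch of the arithmetic for both drives, assuming decimal units (1 TB = 1000 GB); the function name is mine.

```python
DAYS = 35
HOURS = DAYS * 24  # 840 hours, as in the post


def per_plot_rates(total_gb, plots):
    """Return (GB per hour, TB per day) for one plot process."""
    per_plot_gb = total_gb / plots
    gb_per_hour = per_plot_gb / HOURS
    tb_per_day = per_plot_gb / DAYS / 1000  # decimal units: 1 TB = 1000 GB
    return gb_per_hour, tb_per_day


# (total GB accumulated over the run, number of parallel plots)
drives = {
    "970 Evo reads": (452_583, 2),
    "970 Evo writes": (458_584, 2),
    "970 Pro reads": (252_672, 1),
    "970 Pro writes": (233_613, 1),
}

for name, (total_gb, plots) in drives.items():
    gb_h, tb_d = per_plot_rates(total_gb, plots)
    print(f"{name}: ~{gb_h:.0f} GB/hour, ~{tb_d:.1f} TB/day per plot")
```

The 970 Pro's single plot comes out a little higher than the per-plot Evo numbers, but both are in the same ballpark.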

What do you do when smartctl reports that SMART is unavailable? I have all identical 12TB WD My Book drives, and only a few of them report smart data. 90% of them are unavailable for some reason, even though they’re all formatted the same way and attached to the same USB hub.

I also tried hddtemp, which says SMART is unsupported for all drives, even the ones where smartctl works.

That’s weird, does SMART show up if you connect them directly vs. via the USB hub?

Found the issue thanks to this Stack Exchange thread: I had to explicitly pass the device type.

smartctl -a -d sat /dev/sdX works for all of my drives, while smartctl -a /dev/sdX only works for some of them.
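
If you are scripting this across a lot of drives, the same fallback can be sketched in Python: try plain `smartctl -a` first, and retry with `-d sat` if that fails. The helper names are hypothetical, and the retry rule is my own assumption. smartctl's exit status is a bitmask, and I treat the two lowest bits (command-line and device-open failures) as the cue to retry; check the man page for your version.

```python
import subprocess


def smartctl_command(device, device_type=None):
    """Build the smartctl argument list, optionally forcing a device type."""
    cmd = ["smartctl", "-a"]
    if device_type:
        cmd += ["-d", device_type]
    cmd.append(device)
    return cmd


def read_smart(device, run=subprocess.run):
    """Return smartctl output, retrying with `-d sat` if the first try fails.

    `run` is injectable so the logic can be exercised without real hardware.
    """
    first = run(smartctl_command(device), capture_output=True, text=True)
    # Assumption: bits 0 and 1 of the exit status mean smartctl could not
    # parse the command line or open/identify the device.
    if first.returncode & 0b11 == 0:
        return first.stdout
    second = run(smartctl_command(device, "sat"), capture_output=True, text=True)
    return second.stdout


# Example (needs root and smartmontools installed):
# print(read_smart("/dev/sdX"))
```

Drives that already answer to the plain command are left alone, so this should be safe to run over a mixed set of enclosures.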
