The Network Attached Storage (NAS) array on my network has been complaining lately. It is reporting that one of the drives has SMART errors and that this could be an indication of a pending drive failure. So what is SMART and when will this drive fail?
History
Self-Monitoring, Analysis and Reporting Technology (SMART) began as IBM’s Predictive Failure Analysis (PFA), a reliability-prediction technology developed in 1992. The idea was to have a drive monitor key attributes (mechanical, electronic and the disk media) and report issues. This would permit a drive to be replaced in a controlled fashion. Compaq, in conjunction with Seagate Technology, Conner Peripherals and Quantum, introduced a similar ability named IntelliSafe in 1995. The IBM and Compaq solutions were then merged to create the SMART standard.
When Do Hard Drives Fail?
Hard disk manufacturers calculate the anticipated failure rate of their drives from accelerated life tests on a small number of their own drives. In 2007, researchers at Google undertook a study of over 100,000 drives from a variety of manufacturers. Their study showed that of the drives that failed, 56% did so without any strong SMART signal warning and 36% failed with no SMART signal warning at all. So while it is possible that a drive will fail without any warning, what about the drives where there was a sign of trouble?
Reallocation counts are one of the SMART signals worth paying attention to. (The others are scan errors, offline reallocations and probational counts.) Reallocation occurs when a drive’s onboard logic believes a sector is probably damaged. The bad sector is eliminated from use and a sector from a spare area on the drive is mapped into its place. The Google study showed that once a single reallocation occurs, there is a 15% chance the drive will fail in the next 8 months. It’s even worse for older drives. If a drive has been operating for 10-20 months before that first reallocation, there is a 20% chance of failure within 8 months. For drives 20-60 months old the failure chance rises to about 24%.
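To keep those figures straight, here is a minimal Python sketch that collects them into a lookup. The function name is my own invention, and the handling of the edges is an assumption: the age ranges as quoted overlap at 20 months (I put 20 in the older bucket), and drives outside both ranges fall back to the overall 15% figure.

```python
# Chance of failure within 8 months of the first reallocation,
# using only the figures quoted in this post.
OVERALL_RISK = 0.15  # any drive, regardless of age


def failure_risk_after_first_reallocation(age_months: float) -> float:
    """Return the quoted 8-month failure chance for a drive of this age.

    Assumption: a drive exactly 20 months old is counted in the 20-60 bucket,
    and ages outside 10-60 months fall back to the overall figure.
    """
    if 10 <= age_months < 20:
        return 0.20
    if 20 <= age_months <= 60:
        return 0.24
    return OVERALL_RISK


if __name__ == "__main__":
    for age in (5, 12, 36):
        risk = failure_risk_after_first_reallocation(age)
        print(f"{age} months old: {risk:.0%} chance of failure within 8 months")
```

This is a lookup of the quoted numbers, not a prediction model.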
The bottom line is that once a drive suffers its first reallocation, watch it closely and be prepared to replace it.
An Example
One of the nice things about the NAS is it talks to me when it is expecting trouble. The first email that got my attention was this:
Day 2 - Reallocated sector count has increased in the last day. Disk 2: Previous count: 5 Current count: 12 Growing SMART errors indicate a disk that may fail soon. If the errors continue to increase, you should be prepared to replace the disk.
The jump from 5 to 12 was quite different from the previous increases of a single sector. After this, it got worse fast.
- Day 1: 5
- Day 2: 12
- Day 3: 12
- Day 4: 14
- Day 5: 26
- Day 6: 26
- Day 7: 26
- Day 8: 51
- Day 9: 123
- Day 10: 195
- Day 11: 1843
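The day-to-day growth is easier to see as differences. The short Python sketch below takes the counts listed above and prints each increase in roughly the "previous count / current count" shape of the NAS emails. The function name is hypothetical and this is not how the NAS itself computes its alerts.

```python
# Reallocated sector counts for days 1 through 11, copied from the list above.
counts = [5, 12, 12, 14, 26, 26, 26, 51, 123, 195, 1843]


def daily_increases(values):
    """Return the change from each day to the next."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


if __name__ == "__main__":
    # The first difference belongs to day 2, so start numbering there.
    for day, (prev, curr) in enumerate(zip(counts, counts[1:]), start=2):
        if curr > prev:
            print(f"Day {day}: previous count {prev}, current count {curr} (+{curr - prev})")
```

The increases work out to 7, 0, 2, 12, 0, 0, 25, 72, 72 and 1648: a few quiet days, then steady growth, then a jump of 1,648 sectors on the last day.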
On day 9 I ordered a new drive. On the morning of day 11 I received this email:
Reallocated sector count has increased in the last day. Disk 2: Previous count: 195 Current count: 1843 Disk 2 did not pass SMART self-assessment test. Please replace this disk as soon as possible. Reallocated sector count has increased in the last day.
1,843 reallocated sectors! Fortunately the new hard drive arrived in the afternoon and all is good with the NAS.
But Wait, There’s More
Don’t just toss your bad drive away. Aside from the possible security and privacy issues (even a bad drive can give up its data in the right hands), you may be able to get it replaced. Attach your drive to a computer (I use a Thermaltake BlacX for this) and use a utility like Seagate’s SeaTools to diagnose the problem. Check your warranty and follow your disk drive manufacturer’s process for returning the bad drive. In the case of my drive, I had three years left on the warranty. For the cost of one-way shipping, it is worth it to get a refurbished 1 TB drive.
References
S.M.A.R.T. site
Failure Trends in a Large Disk Drive Population