A New Picture of Flash Reliability

How do you assess flash reliability? The largest study of flash drives in production — six years, 10 device models, three types of flash, millions of drive days — offers new insight.

The best way to examine the reliability of a technology, one could argue, is to look at how it actually performs in real-world production rather than in a lab experiment under controlled conditions, with a small number of drives and synthetic workloads. That was the premise of a research study on the reliability of flash drives presented at FAST ’16, the 14th USENIX Conference on File and Storage Technologies, held February 22-25, 2016, in Santa Clara, California, and titled “Flash Reliability in Production: The Expected and the Unexpected.”
Covering a six-year period and millions of drive days, the study looks at reliability data from 10 device models with three types of flash technologies (MLC, eMLC, and SLC) and with feature sizes ranging from 24nm to 50nm. Two generations of drives were studied, with all drives of the same generation using the same driver, firmware, error correction codes, and algorithms for wear leveling. The data — collected daily from drives operating in Google data centers — included:
Error counts for a variety of error types
Workload (e.g., number of reads, writes, and erases)
Number of bad blocks developed during the day
Chips that had failed during the day
When a drive was swapped out for repair
The researchers also compared these statistics against those from previous studies of the reliability of hard disk drives. What follows here are highlights of the study’s findings, in Q&A format:

How do flash drives compare to hard disk drives on reliability?
The authors asked this question two ways: first, which technology has the higher replacement rate (answer: hard disk drives); and, second, which has the higher number of user-impacting errors, such as uncorrectable errors (answer: flash drives, by a significant margin).

How often will flash drives develop bad blocks and bad chips?
The authors found that 30-80% of flash drives developed at least one bad block and 2-7% developed at least one bad chip during the first four years in the field.

How do you know a drive will likely develop bad blocks or bad chips?
History is a good predictor of future reliability. Drives that have developed more than four bad blocks will likely go on to develop hundreds, and a count in the hundreds usually also indicates a chip failure.

How does flash feature size correlate with reliability?
One might assume that chips with smaller features are less reliable since there is less room, literally, for error. The authors found this assumption to be true when it comes to raw bit error rate (RBER) in general; however, chips with smaller features did not show a higher incidence of the types of errors that impact users (so-called non-transparent errors), such as uncorrectable errors.

Were some flash technologies found to be more reliable than others?
As the authors note, SLC drives are targeted at the enterprise market, while less-expensive MLC drives are not — a price difference at least partially based on the SLC’s perceived reliability advantage. However, the study found no evidence that this advantage exists.

Does age impact reliability regardless of usage?
Yes, the older a drive is, the less reliable it is likely to be, independent of usage.

What is (and is not) a good predictor of flash reliability?
Program/erase (PE) cycles are a standard metric drive vendors use to spec the lifetime wear-out of their products. (For example, all MLC models in this study were spec’d at a lifetime of 3,000 PE cycles.) As expected, the study found a high positive correlation between PE cycles and both RBER and UBER (uncorrectable bit error rate). However, this correlation varies greatly among drive models, with a 4X difference between the models with the highest and lowest RBER. Furthermore, contrary to vendors’ accelerated end-of-life tests, the study found the rate of increase of both types of errors to be consistently linear with usage, whereas vendors report an exponential increase starting at around 9,000 PE cycles. Moreover, RBER was not shown to be a good predictor of uncorrectable errors. Nor was there a correlation between uncorrectable errors and number of reads.
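To make the "linear with usage" finding concrete, here is a minimal Python sketch of how one might compute RBER and check its trend against PE cycles. It assumes RBER is taken as corrupted bits divided by total bits read, before error correction. The function names and the sample numbers are invented for illustration and are not taken from the study.

```python
def rber(corrupted_bits, bits_read):
    """Raw bit error rate: corrupted bits per bit read, before error correction."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return corrupted_bits / bits_read


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented sample points: (PE cycles, corrupted bits, bits read).
# These are NOT data from the study.
samples = [
    (1000, 20, 10**9),
    (2000, 40, 10**9),
    (3000, 60, 10**9),
    (4000, 80, 10**9),
]
pe_cycles = [s[0] for s in samples]
rates = [rber(s[1], s[2]) for s in samples]

# The slope is the estimated RBER increase per additional PE cycle.
slope, intercept = linear_fit(pe_cycles, rates)
```

With real per-drive logs, a roughly constant slope across the PE-cycle range would match the linear growth described here, while a slope that climbs steeply at high cycle counts would look more like the exponential wear-out vendors report.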

So how many errors should flash buyers expect?
The study found that 20-63% of drives experience at least one uncorrectable error during their first four years in the field, and that these errors affect 2-6 out of every 1,000 drive days. By contrast, the majority of drive days experience at least one correctable error.
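Per-drive and per-drive-day figures answer different questions, and the gap between them is easy to misread. The following Python sketch shows how both could be derived from daily logs. It assumes one record per drive per day carrying a count of uncorrectable errors; the record layout and function name are hypothetical, not the study's actual schema.

```python
from collections import defaultdict


def uncorrectable_error_summary(daily_records):
    """Summarize (drive_id, uncorrectable_error_count) records, one per drive day."""
    drive_days = 0
    affected_days = 0
    had_error = defaultdict(bool)  # drive_id -> has it ever seen an error?

    for drive_id, ue_count in daily_records:
        drive_days += 1
        had_error[drive_id] |= ue_count > 0
        if ue_count > 0:
            affected_days += 1

    if drive_days == 0:
        raise ValueError("no records supplied")

    return {
        # Share of drives that saw at least one uncorrectable error.
        "fraction_of_drives_affected": sum(had_error.values()) / len(had_error),
        # Drive days with at least one uncorrectable error, per 1,000 drive days.
        "affected_days_per_1000": 1000 * affected_days / drive_days,
    }
```

A drive counts as affected after a single bad day, so the per-drive fraction can be large even when only a small share of drive days sees an error.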

As flash drives have become widely adopted for persistent data storage, particularly in the embedded space, concerns about their reliability have always been top of mind. Hopefully, studies like this one — based on actual production experience — will go far in answering those concerns.
