SSDs may have been around for some time now, but they’re still something of an unknown. Their performance benefits are undeniable and their cost per gigabyte has plummeted year after year. However, in the enterprise space, their unknown status has led to a lot of hedged bets when it comes to their use. Most SSDs reserve a large portion of over-provisioned space to accommodate failed cells and wear levelling. Many SSDs are sold as “accelerators”, meant to speed up operations rather than hold critical data for any length of time. This all stems from a lack of good data on their reliability and failure rates, something which can only come with time and use. Thankfully Google has been gathering exactly that data and, at a recent conference, released a paper about its findings.
The paper focused on three different types of flash media: consumer-level MLC, the more enterprise-focused SLC and the somewhere-in-the-middle eMLC. These were all custom devices, sporting Google’s own PCIe interface and drivers, though the flash chips themselves were run-of-the-mill parts. The drives were divided into 10 categories: 4 MLC, 4 SLC and 2 eMLC. For each of these drive types several metrics were collected over their six-year lifetime: raw bit error rate (RBER), uncorrectable bit error rate (UBER), program/erase (PE) cycles and various failure rates (bad blocks, bad cells, etc.). All of these were then collated to provide insights into the reliability of SSDs, both in comparison to each other and to old-fashioned, spinning rust drives.
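For readers unfamiliar with the two error-rate metrics, here is a minimal sketch of how they are defined: both are simply error counts normalised by the number of bits read, differing only in whether the drive’s ECC managed to correct the error. The function names and example figures below are my own illustration, not Google’s actual telemetry schema.

```python
# Sketch of the two error-rate metrics the paper tracks.
# Field names and sample numbers are illustrative only.

def raw_bit_error_rate(corrected_bit_errors, bits_read):
    """RBER: bit errors the drive's ECC detected and corrected, per bit read."""
    return corrected_bit_errors / bits_read

def uncorrectable_bit_error_rate(uncorrectable_errors, bits_read):
    """UBER: errors the ECC could NOT correct, per bit read."""
    return uncorrectable_errors / bits_read

# Hypothetical drive: 10 TiB read, 5,000 corrected bit errors,
# 1 uncorrectable error.
bits_read = 10 * 2**40 * 8  # 10 TiB expressed in bits

print(raw_bit_error_rate(5_000, bits_read))        # ~5.7e-11
print(uncorrectable_bit_error_rate(1, bits_read))  # ~1.1e-14
```

The key point the paper makes is that these two rates are largely decoupled: a drive with a high RBER does not necessarily go on to produce a high UBER, which is part of why RBER turned out to be a poor predictor of real failures.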
Probably the most stunning finding of the report is that, in general, SLC drives are no more reliable than their MLC brethren. For both enterprises and consumers this is a big deal, as SLC-based drives are often several times the price of their MLC equivalents. This should allay any fears enterprises had about using MLC-based products, as they will likely be just as reliable and far cheaper. Indeed products like the Intel 750 series (one of which I’m using for big data analysis at home) provide the same capabilities as products that cost ten times as much and, based on Google’s research, will last just as long.
Interestingly, the biggest predictor of drive reliability wasn’t RBER, UBER or even the number of PE cycles. In fact the most predictive factor of drive failure was the physical age of the drive itself. What this means is that, for SSDs, there must be other factors at play which affect drive reliability. The paper hypothesizes that this might be due to silicon aging, but it doesn’t appear that they had enough data to investigate further. I’m very much interested in how this plays out, as it will likely come down to the way the chips are fabricated (i.e. different types of lithography, doping, etc.), something which does vary significantly between manufacturers.
It’s not all good news for SSDs, however, as the research showed that whilst SSDs have an overall failure rate below that of spinning rust, they exhibit a higher UBER. What this means is that SSDs will see a higher rate of unrecoverable errors, which can lead to data corruption. Many modern operating systems, applications and storage controllers are aware of this and can accommodate it, but it’s still an issue for systems holding mission- or business-critical data.
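To get a feel for what a higher UBER means in practice, a quick back-of-envelope calculation helps: multiply the rate by the number of bits read and you get the expected count of unrecoverable errors. The rates below are illustrative round numbers (1e-14 per bit is a typical consumer HDD spec figure), not the paper’s exact measurements.

```python
# Back-of-envelope: expected unrecoverable errors for a given read
# volume at a given UBER. Rates here are illustrative, not the
# paper's measured figures.

def expected_uncorrectable_errors(uber, terabytes_read):
    bits_read = terabytes_read * 1e12 * 8  # decimal TB -> bits
    return uber * bits_read

# One drive at a spec-sheet-typical 1e-14 errors/bit, another an
# order of magnitude worse, each after 100 TB of reads:
print(expected_uncorrectable_errors(1e-14, 100))  # ≈ 8 expected errors
print(expected_uncorrectable_errors(1e-13, 100))  # ≈ 80 expected errors
```

The takeaway is that an order-of-magnitude difference in UBER translates directly into an order-of-magnitude difference in corrupt reads over the same workload, which is why checksumming filesystems and controllers matter more on SSDs.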
This kind of insight into the reliability of SSDs is great and goes to show that even nascent technology can be quite reliable. The MLC vs SLC comparison is telling, showing that whilst one technology may exhibit a better characteristic (in this case PE cycle count), that might not be the true indicator of reliability. Indeed Google’s research shows that the factors we have been watching so closely might not be the ones we need to look at. Thus we need to develop new ways to assess the reliability of SSDs so that we can better predict their failures. Then, once we have that, we can work towards eliminating those failures, making SSDs even more reliable.