This post looks at how to interpret ‘durability’ claims of cloud storage providers. What does that long string of 9s mean when a service provider talks about how safe your data is? I’ll argue that these measures of apparent data safety are largely meaningless. Much better is to understand what risk analysis the provider has done, which risks have been mitigated, and how they’ve done it – but this is the one thing that most providers won’t actually tell you.
Let’s start with some basics of what a series of 9s might mean. If an SLA says availability is 99.99%, i.e. uptime, then this means that the service is allowed to be down for 1/10,000th of the time – on average. The time period matters. 99.99% uptime could mean a service is unavailable for 1 second every 3hrs, 1 minute every 7 days, or 1 hour a year.
Lots of little outages are very different from the occasional big outage, which is why availability is typically specified on a monthly basis. In this context, within the SLA, the service can be down for no more than approximately 4 minutes during the month, either in one go or spread across the month. This is all pretty black and white and easy to understand.
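To make that arithmetic concrete, here’s a back-of-the-envelope sketch in Python (assuming a 30-day month, which is roughly what the figures above use) that converts an availability figure into the downtime it allows over different periods:

```python
# Convert an availability figure (a 'number of 9s') into the downtime
# it allows over different periods. Assumes a 30-day month.

PERIODS_SECONDS = {
    "day": 24 * 3600,
    "month (30 days)": 30 * 24 * 3600,
    "year (365 days)": 365 * 24 * 3600,
}

def allowed_downtime_minutes(availability: float) -> dict:
    """Allowed downtime in minutes for each period at the given availability."""
    unavailable_fraction = 1.0 - availability
    return {name: secs * unavailable_fraction / 60 for name, secs in PERIODS_SECONDS.items()}

for period, minutes in allowed_downtime_minutes(0.9999).items():
    print(f"99.99% availability allows ~{minutes:.1f} minutes of downtime per {period}")
```

Run it and you get about 0.1 minutes a day, 4.3 minutes a month and just under an hour a year – which is where the ‘approximately 4 minutes during the month’ figure comes from.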
But what does a similar series of 9s mean for durability? For example, Amazon currently say for S3 that “Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years.”
First notice the bit about ‘designed’, i.e. it is not a commitment, nor is it a statement based on evidence – after all, Amazon haven’t been storing data for 10 years, let alone 10 million. Indeed, many cloud providers simply won’t give any quantified indication of the level of data safety they offer – not because they don’t offer safe storage, but because it’s really hard to provide meaningful numbers.
Next notice the reference to ‘loss of objects’. An object in S3 can be between 1 byte and 5TB. So does this mean that if I have 10,000 small files then I could aggregate them into one big zip, i.e. one object, and hence reduce the chance of data loss by a factor of 10,000? I’d have to store my zip for more than the age of the Universe before I could expect to lose it. Or conversely, if I split all my files into single-byte objects, does this mean I can expect to lose around 10 bytes a year out of every 1TB that I store? It would appear that simply changing the way I put data into Amazon could change data loss from an apparent impossibility to a near certainty!
Amazon themselves say they have over 2 trillion objects in S3, so if their durability is 11 nines, does that mean it’s already likely that they have lost at least one of those objects? I wonder who the unlucky customer is and whether they know? It could easily be that they haven’t actually lost any data yet. But that doesn’t mean they should necessarily advertise even more nines!
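To put some rough numbers on the objects argument – reading 11 nines naively as an annual loss probability of 10^-11 per object, regardless of its size – the expected number of lost objects depends only on how many objects you choose to store:

```python
# Expected annual object loss under a naive reading of '11 nines of
# durability' as a 1e-11 chance of losing each object per year.

ANNUAL_LOSS_RATE = 1e-11  # 99.999999999% durability

def expected_losses_per_year(num_objects: int) -> float:
    return num_objects * ANNUAL_LOSS_RATE

print(expected_losses_per_year(1))           # one big zip: 1e-11, effectively never
print(expected_losses_per_year(10_000))      # 1e-07, i.e. one loss every ~10 million years
print(expected_losses_per_year(10**12))      # 1TB as single-byte objects: ~10 per year
print(expected_losses_per_year(2 * 10**12))  # ~2 trillion objects: ~20 losses per year
```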
The real question is what risks Amazon have addressed in their durability estimates and what measures they use to counter them. I’m not singling out Amazon here either; they have a pretty good data safety approach and track record. Rather, it’s about the need for transparency, and that applies to all cloud storage providers, including Arkivum.
When we started the Arkivum business we looked at existing cloud storage services and the assertions they made about the safety of the data they stored. We immediately decided to short-cut all that confusion and offer a straightforward 100% data integrity guarantee, backed up with proper insurance. This makes it really simple for customers to understand what they are getting from us, and it helps differentiate us from others in the market. It also makes things simpler for us, because a 100% guarantee means putting data integrity first – no discussions of what corners to cut or how to write wriggle room into our contract and SLA.
But, inevitably, some of our customers are now asking us whether we can store data at a lower cost in exchange for a lower level of data safety. This is why we’ve launched a new 1+1 service that reduces redundancy in favour of a lower price. And then come the questions: ‘what are the chances of data loss?’ and ‘how many 9s do we offer?’. I’d argue that these actually aren’t the best questions to ask.
To illustrate the point, let’s take driving a car as an analogy. If you get in your car, then statistically speaking you have a 99.99% chance of making it to your destination without breaking down. But there are 600 trips per person per year on average in the UK and 70M people in the population, so the total number of car journeys on UK roads is very high. That’s why there are around 20,000 breakdowns on UK roads each day. So whilst it is a near certainty that you won’t break down when you drive your car, it is also a near certainty that someone else will! In statistical terms, this is what happens in a large ‘population’ subject to ‘rare events’.
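A couple of lines of arithmetic show the two viewpoints side by side (using the rough figures above, so expect the right order of magnitude rather than an exact match for the 20,000):

```python
# Rare events in a large population: the same per-trip odds, seen from
# the point of view of one driver versus the whole country.

p_breakdown_per_trip = 1e-4               # ~99.99% chance of completing any one trip
trips_per_person_per_year = 600
population = 70_000_000

trips_per_day = trips_per_person_per_year * population / 365
expected_breakdowns_per_day = trips_per_day * p_breakdown_per_trip

print(f"~{trips_per_day:,.0f} trips per day")
print(f"~{expected_breakdowns_per_day:,.0f} expected breakdowns per day")
# Same order of magnitude as the ~20,000 breakdowns a day quoted above:
# near-impossible for any one driver, near-certain across the population.
```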
The same approach is often used to model data safety – large ‘populations’ of drives or tapes and ‘rare’ corruption or failure events. The key word here is model – it’s a prediction not an observation. It’s not as if a storage provider will actually load up PBs of data onto thousands of drives and store it for 10s of years just to see how much gets lost. Therefore, the numbers you get from storage providers are only as good as their models.
For example, a really simple model might consider multiple copies of a data object (redundancy) and how likely it is for each copy to be lost, e.g. because it gets corrupted. A manufacturer of a SATA hard drive might say that the unrecoverable Bit Error Rate (BER) is 10^-14.
This means that, on average, when reading data back from disk, one bit will be wrong for every hundred trillion bits you read – 14 nines of reliability. You might then think that storing two copies on separate drives means there’s only a 10^-28 chance of losing any given bit when you try to read it back, because if a bit is lost on one drive then the same bit is very unlikely to be lost on the other. In other words, to all intents and purposes data loss is never going to happen.
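That naive calculation is nothing more than a product of two probabilities under an independence assumption – a quick sketch using the quoted BER:

```python
# The naive 'two independent copies' model: the chance of a given bit
# being unreadable from both copies is just the product of the two BERs.

BER = 1e-14  # quoted unrecoverable bit error rate for a single drive

p_bit_lost_from_one_copy = BER
p_bit_lost_from_both_copies = BER * BER  # independence assumption -> 1e-28

print(p_bit_lost_from_both_copies)
# The rest of this post is about why this independence assumption, and the
# focus on bit errors alone, is far too optimistic.
```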
But this is like saying that the chance of a car breaking down depends purely on whether one part of the engine fails or not. In the case of hard drives, it’s not only BER that matters, but also the Annual Failure Rate (AFR) of the drive as a whole, which can be anywhere between 1% and 15%, i.e. a much higher risk. Of course, this is why we have RAID storage arrays, erasure coding, and other clever ways of protecting against data corruption and drive failures.
Cloud providers use these as a matter of course. But whilst these systems and techniques make life a whole lot better, they don’t eliminate data loss completely, and they can introduce their own bugs and problems. Nothing is ever perfect. It is not the media failure rate (tapes or drives) that counts so much as how reliable the overall system is.
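To get a feel for why whole-drive failure dominates, here’s a deliberately crude comparison that keeps the optimistic independence assumption but uses AFR instead of BER – two copies, no repair or replacement during the year, which is not how any real system behaves, but it shows the size of the gap:

```python
# Crude comparison: chance of losing both of two independent copies within
# a year to whole-drive failure, for the quoted range of AFRs.

afr_low, afr_high = 0.01, 0.15  # quoted range of annual drive failure rates

print(afr_low * afr_low)    # 1e-4  (1% AFR)
print(afr_high * afr_high)  # ~2e-2 (15% AFR)
# Compare with the 1e-28 you get by considering bit error rate alone:
# drive failure, not bit error, is what the redundancy really has to cover.
```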
And this isn’t the end of the story by a long way. Thinking again of cars on the road, the chance of a breakdown for an individual car isn’t something that can be calculated in isolation. Breakdowns are often correlated. A queue of cars overheats in a summer traffic jam. An accident blackspot. The first cold day of the winter catching out those who haven’t kept their antifreeze topped up or their batteries in good order.
Contaminated fuel from a petrol station. A pile-up on the motorway. Faulty engine management software from a manufacturer. There’s a long list of ways in which something that goes wrong for one car will also affect several others at the same time.
Exactly the same is true for storage. Failures are often correlated. Faulty firmware. Bugs in software. Human error. A bad batch of tapes or drives. Failing to test the backups. This is compounded when a service provider uses the same hardware/firmware/processes/staff for all the copies of data that they hold. And homogenisation is popular with cloud operators because it helps drive their costs down. But it means that it’s no longer possible to say that the copies are independent. This dramatically increases the real likelihood of data loss. Worse still, if all the copies are online and linked, then a software error that deletes one copy can automatically propagate and cause loss of the other copies.
To illustrate this, suppose you have three copies of a data object on three separate storage servers, and you decide to check your data is OK at the end of the year. If the odds are, say, 1 in 1000 of losing one copy, e.g. because of a bug in a RAID array, then you might say that, because the three copies are independent, there’s a 1 in a billion chance of losing all three, i.e. permanent data loss because there are no ‘good’ copies left. But what if there is a 1 in 1000 chance of losing the first copy of the data, and correlation effects mean that if the first copy is lost then you revise the odds to 1 in 100 for also losing the second copy, and if that is also lost then you shorten the odds to 1 in 10 for the third copy?
The chances of data loss would now be 1 in a million. That’s a big change. Correlated failure modes make a lot of difference, yet this is typically overlooked when people build reliability models – probably because the maths gets a lot harder!
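The arithmetic behind that comparison takes only a few lines – it’s the independence assumption, not the maths, that does the damage:

```python
# Three copies, 1-in-1000 chance of losing the first copy in a year:
# independent versus correlated odds for losing all three.

p_first = 1 / 1000

# Independence: just cube the same odds.
p_all_lost_independent = p_first ** 3                     # 1 in a billion

# Correlation: once one copy is gone, the odds for the others shorten.
p_second_given_first = 1 / 100
p_third_given_first_two = 1 / 10
p_all_lost_correlated = (p_first * p_second_given_first
                         * p_third_given_first_two)       # 1 in a million

print(f"independent: 1 in {1 / p_all_lost_independent:,.0f}")
print(f"correlated:  1 in {1 / p_all_lost_correlated:,.0f}")
```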
Long-term storage requires still further thought. To keep data for any significant length of time means media refreshes, software updates, migrations of storage systems and a whole host of other things. It also means regular checking of data integrity to pick up failures and fix them. Do this well and the risks are low; in fact there is an opportunity to increase data safety, because regular integrity checks let you make repairs before it’s too late. But do all of this badly and the risk of data loss when something needs to change in the system can go up dramatically.
What it does mean is that you can’t simply convert statistics on the chances of data loss right now into statistics on losing data over the long term. A 99.99% chance of no car breakdown today doesn’t translate into 10,000 days between breakdowns (about 27 years). How long you can go without a breakdown clearly depends on how well you maintain your car, or whether you buy a new one that is more reliable (migration).
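Even the naive conversion – treating every day as an independent 99.99% chance of no breakdown and compounding it – gives a much less comforting answer than ‘10,000 days between breakdowns’:

```python
# Compounding a constant 99.99% daily 'no breakdown' probability over time.
# Real cars and real storage systems don't have constant risk, which is the
# point made above - but even this optimistic model is sobering.

p_ok_per_day = 0.9999

for days in (365, 3650, 10_000):
    p_survive = p_ok_per_day ** days
    print(f"{days:>6} days: {p_survive:.1%} chance of no breakdown at all")
# Over 10,000 days (about 27 years) the chance of never breaking down is
# only ~37%, even before ageing and wear make things worse.
```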
Just the same is true for storage. At this point, I’ve resisted delving into metrics such as Mean Time to Data Loss (MTTDL) or Mean Time Between Failures (MTBF) of systems. Partly because it would make this post even longer than it is now, partly because they overcomplicate the issue, and partly because they can be misleading.
An MTBF of 1,000,000 hrs for a hard drive doesn’t mean you can expect any one hard drive to last for 100 years! If you want more on this then it’s covered in a great paper called ‘Mean Time to Meaningless’, where the title sums up the value of these metrics very nicely.
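For what it’s worth, the conventional way to read a figure like that is as a population statistic: under a simple constant-failure-rate reading, an MTBF of 1,000,000 hours implies a bit under 1% annual failures per drive, which across a large fleet is a steady trickle of dead drives:

```python
# MTBF read as a population statistic rather than a drive lifetime,
# using the simple constant-failure-rate approximation AFR = hours/year / MTBF.

mtbf_hours = 1_000_000
hours_per_year = 24 * 365

annual_failure_rate = hours_per_year / mtbf_hours  # ~0.88% per drive per year
print(f"Implied AFR: {annual_failure_rate:.2%}")

fleet_size = 10_000
print(f"Expected drive failures per year in a {fleet_size:,}-drive fleet: "
      f"{fleet_size * annual_failure_rate:.0f}")
```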
The way of dealing with all this is to use survival analysis and an actuarial approach to risk. Work out all the ways in which data can be lost, the chances of these happening, and then the chance of data ‘surviving’ all these factors.
It’s not the individual failure modes that matter so much as the chance of surviving all of them, especially taking correlations into account. It also means dealing with uncertainty about the risks and the ‘spread’ of scenarios that could play out. That means some hard maths and the use of stochastic techniques if you want a more reasoned set of 9s.
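As a flavour of what that looks like, here’s a minimal Monte Carlo sketch of the survival idea: simulate a large number of possible years, including an occasional correlated event that takes out every copy at once, and count how often the data survives. All the rates here are made-up placeholders for illustration, not Arkivum’s figures:

```python
# Minimal Monte Carlo sketch of data survival with a correlated failure mode.
# All probabilities are illustrative placeholders only.

import random

P_COPY_LOST = 1 / 1000           # independent loss of any single copy per year
P_CORRELATED_EVENT = 1 / 10_000  # e.g. a software bug that deletes every copy
COPIES = 3
RUNS = 1_000_000

survived = 0
for _ in range(RUNS):
    if random.random() < P_CORRELATED_EVENT:
        continue  # correlated event: all copies gone at once
    if all(random.random() < P_COPY_LOST for _ in range(COPIES)):
        continue  # all copies lost independently (vanishingly rare)
    survived += 1

print(f"Estimated annual survival probability: {survived / RUNS:.6f}")
# The correlated event, rare as it is, dominates the result - which is why a
# survival model has to capture correlations rather than just multiplying 9s.
```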
In my opinion, the only real way forward is for a lot more transparency from cloud storage service providers so an informed judgement can be made by their users. Basically, go ask the provider for the risks they’ve considered and how they mitigate them. If it looks comprehensive and sensible then there’s a good chance that your data will be as safe as is reasonably possible.
There’s a lot to consider in long-term data storage. Data goes on a long and complex journey. To use the car analogy one last time, data makes hundreds of individual trips, both geographically as it gets transferred to and from systems and between locations, and also over time as it goes through migrations and checks. There are risks with every trip and these soon add up.
Multiple copies of the data in separate ‘cars’ is always a good strategy to help ensure that at least one makes it to the destination, as is the use of regular servicing to keep each car in order – but beware if those cars are all in the same fleet run by the same supplier or come from the same manufacturer – otherwise you might just get hit by a recall for a fault across all of them…
At Arkivum, we’re very happy to share our risk analysis and mitigation with our customers (and auditors) and show why we think a 100% data integrity guarantee is justified for the Arkivum100 service. Many of the measures we use are outlined in a previous post. And for those who opt for our reduced redundancy 1+1 service, whilst we might not put a string of 9s against it, it’s still possible to see exactly why data is still in safe hands.
Chief Technology Officer for Arkivum
Matthew is CTO of Arkivum and previously worked at the University of Southampton IT Innovation Centre. Over the last decade, Matthew has worked with a wide range of organisations on solving the challenges of long-term data retention and access, including in the lifesciences, aerospace, broadcasting and scientific sectors. By working with national archives and industry leaders across the UK and Europe, Matthew has investigated the issues of long-term data archiving, including preservation strategies, system architectures, total cost of ownership and how to mitigate the risk of loss of critical data assets. This resulted in IT Innovation spinning-out Arkivum ltd, which provides data archiving as a service. Matthew currently works on risk-based approaches to digital preservation and data retention including how to meet regulatory compliance as well as keep data accessible.