I've got a cool new Amazon S3 feature to tell you about, but I need to start with a definition!
Let's define durability (with respect to an object stored in S3) as the probability that the object will remain intact and accessible after a period of one year. 100% durability would mean that there's no possible way for the object to be lost, 90% durability would mean that there's a 1-in-10 chance, and so forth.
We've always said that Amazon S3 provides a "highly durable" storage infrastructure and that objects are stored redundantly across multiple facilities within an S3 region. But we've never provided a metric, or explained what level of failure S3 can withstand without losing any data.
Let's change that!
Using the definition that I stated above, the durability of an object stored in Amazon S3 is 99.999999999%. If you store 10,000 objects with us, on average we may lose one of them every 10 million years or so. This storage is designed in such a way that we can sustain the concurrent loss of data in two separate storage facilities.
If you are using S3 for permanent storage, I'm sure that you appreciate the need for this level of durability. It is comforting to know that you can simply store your data in S3 without having to worry about backups, scaling, device failures, fires, theft, meteor strikes, earthquakes, or toddlers.
But wait, there's less!
Not every application actually needs this much durability. In some cases, the object stored in S3 is simply a cloud-based copy of an object that actually lives somewhere else. In other cases, the object can be regenerated or re-derived from other information. Our research has shown that a number of interesting applications simply don't need eleven 9's worth of durability.
To accommodate these applications we're introducing a new concept to S3. Each S3 object now has an associated storage class. All of your existing objects have the STANDARD storage class, and are stored with eleven 9's of durability. If you don't need this level of durability, you can use the new REDUCED_REDUNDANCY storage class instead. You can set this on new objects when you store them in S3, or you can copy an object to itself while specifying a different storage class.
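Here is a rough sketch, in Python, of what choosing a storage class might look like at the HTTP level. It assumes the S3 REST API's `x-amz-storage-class` and `x-amz-copy-source` request headers; the helper names are hypothetical, and request signing and the actual PUT are left out.

```python
from urllib.parse import quote

# The two storage classes named in this post.
STORAGE_CLASSES = ("STANDARD", "REDUCED_REDUNDANCY")


def storage_class_headers(storage_class="STANDARD"):
    """Headers for a PUT that stores a new object with the given class."""
    if storage_class not in STORAGE_CLASSES:
        raise ValueError("unknown storage class: " + storage_class)
    return {"x-amz-storage-class": storage_class}


def self_copy_headers(bucket, key, storage_class):
    """Headers for a PUT that copies an object onto itself with a new class."""
    headers = storage_class_headers(storage_class)
    # The copy source names the same bucket and key that the PUT targets.
    headers["x-amz-copy-source"] = "/" + bucket + "/" + quote(key)
    return headers
```

A caller would merge these headers into an otherwise ordinary, signed PUT request for the object.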
The new REDUCED_REDUNDANCY storage class activates a new feature known as Reduced Redundancy Storage, or RRS. Objects stored using RRS have a durability of 99.99%, or four 9's. If you store 10,000 objects with us, on average we may lose one of them every year. RRS is designed to sustain the loss of data in a single facility.
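The arithmetic behind the two "10,000 objects" figures can be checked with a short Python sketch. It assumes that "N 9's" means a per-object loss probability of 1 in 10^N per year and that the quoted figures are simple averages (expected values); the function names are made up for illustration.

```python
from fractions import Fraction


def annual_loss_probability(nines):
    """Per-object chance of loss in one year at a durability of `nines` 9's."""
    return Fraction(1, 10 ** nines)


def expected_losses_per_year(num_objects, nines):
    """Average number of objects lost per year (expected value)."""
    return num_objects * annual_loss_probability(nines)


def years_per_expected_loss(num_objects, nines):
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(num_objects, nines)


# STANDARD: eleven 9's, 10,000 objects
print(years_per_expected_loss(10_000, 11))   # 10000000
# REDUCED_REDUNDANCY: four 9's, 10,000 objects
print(years_per_expected_loss(10_000, 4))    # 1
```

Exact fractions are used instead of floats so that tiny probabilities such as 1 in 10^11 are not distorted by rounding.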
RRS pricing starts at a base tier of $0.10 per Gigabyte per month, 33% cheaper than the more durable storage.
If Amazon S3 detects that an object has been lost, any subsequent requests for that object will return the HTTP 405 ("Method Not Allowed") status code. Your application can then handle this error in an appropriate fashion. If the object lives elsewhere you would fetch it, put it back into S3 (using the same key), and then retry the retrieval operation. If the object was designed to be derived from other information, you would do the processing (perhaps it is an image scaling or transcoding task), put the new image back into S3 (again, using the same key), and retry the retrieval operation.
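The recovery flow described above might be sketched in Python as follows. This is only an outline: `get_object`, `put_object`, and `regenerate` are hypothetical callables that your application would supply, and the single retry is an assumption rather than anything S3 requires.

```python
LOST_OBJECT_STATUS = 405  # what this post says S3 returns for a lost object


def get_with_recovery(key, get_object, put_object, regenerate):
    """Fetch `key`; if S3 reports it lost, rebuild it, re-store it, retry once.

    All three callables are hypothetical and supplied by the caller:
      get_object(key) -> (status_code, body)
      put_object(key, body) -> None   (stores under the same key)
      regenerate(key) -> body         (re-fetch from the origin or re-derive)
    """
    status, body = get_object(key)
    if status != LOST_OBJECT_STATUS:
        return status, body
    # The object's data is gone but its key is still valid: put it back.
    put_object(key, regenerate(key))
    return get_object(key)
```

Passing the operations in as callables keeps the sketch independent of any particular S3 client library.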
Update (for HTTP protocol geeks only):
I’d like to provide clarification regarding our choice of the HTTP 405 (“Method Not Allowed”) status code. Although 410 (“Gone”) may seem more appropriate, the HTTP 1.1 spec says that “this condition is expected to be considered permanent” and that clients “SHOULD delete references to the Request-URI”. In other words, the 410 status code indicates that the object has intentionally been removed and will not return. That is not necessarily true when data is lost. The object owner may wish to resolve the data loss by re-uploading the object, in which case it would have been inappropriate for S3 to return a 410 status code. We believe that 405 is most appropriate because other methods (e.g. PUT, POST, and DELETE) remain valid for the object even if the object’s data has gone missing. The object’s name (its URI) remains valid, but the data for the object is gone. The 422 and 424 status codes are specific to WebDAV and don’t apply here.
We expect to see management tools and toolkits add support for RRS in the very near future.
You can use either storage class with Amazon CloudFront, of course.
I anticipate many unanticipated uses for this cool new feature; please feel free to leave me a comment with your ideas.
-- Jeff;
PS - check out Amazon CTO Werner Vogels' take on RRS. His post goes into a bit more detail on how S3 was designed so that it will never lose data -- "Core to the design of S3 is that we go to great lengths to never, ever lose a single bit. We use several techniques to ensure the durability of the data our customers trust us with..."


> If Amazon S3 detects that an object has been lost any subsequent requests for that object will return the HTTP 405 ("Method Not Allowed") status code.
Wouldn't "410 Gone" have been more logical, since you know it did exist but no longer does?
Posted by: Sylvain Hellegouarch | May 19, 2010 at 12:01 AM
This is already supported in S3 Backup, BTW. http://s3bk.com/
Posted by: Sergey | May 19, 2010 at 12:09 AM
sounds cool!
is the API the same as for S3? In other words, is it just a configuration to switch from S3 to RRS?
greets, dirk
Posted by: Dirkdk | May 19, 2010 at 02:24 AM
Seems to me like you are setting up a case in which you will 'lose' an item every year to get people to upgrade.
Posted by: -dan | May 19, 2010 at 07:00 AM
I agree with Sylvain. "405 Method Not Allowed" is definitely the wrong status code. Please change it to "410 Gone", that's exactly what it is for. You'll need 405 at some point in the future to indicate something else (let's say when you have a read-only item, then you want to return a 405 error when someone DELETEs or PUTs to it).
Posted by: David Zuelke | May 19, 2010 at 07:15 AM
Shame there's no way of getting notification when an object is lost, short of regularly looking at all objects and checking the status. Is this something that's impossible because of the S3 architecture, or could it later be added as a feature?
Posted by: Dan | May 19, 2010 at 07:48 AM
I think 405 is not the right response for content missing on the server. 405 tells clients "stop performing that request, this server doesn't support it." Instead I suggest using 422 (http://www.webdav.org/specs/rfc4918.html#STATUS_422) or maybe 424 (http://www.webdav.org/specs/rfc4918.html#STATUS_424) from the WebDAV specs. Both of these response codes relate to the condition of the data on the server, not the protocol method used by the client.
Posted by: mamund | May 19, 2010 at 08:26 AM
The standard suggests that the correct status is 410 Gone, not 405 Method Not Allowed. Gone is for when a resource is missing, cannot be found elsewhere, and should be considered permanently missing. 405 is for when the resource is available, but the method (POST, GET, PUT, etc.) is not allowed for this one specific resource.
This will have a significant impact on proxies. Note that you are not returning a list of valid methods in the Allow header as the standard requires (and cannot, because there will be no allowed methods, meaning that 405 is an impossible response to make legal.)
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
Quoting the standard:
---------------------------
10.4.6 405 Method Not Allowed
The method specified in the Request-Line is not allowed for the resource identified by the Request-URI. The response MUST include an Allow header containing a list of valid methods for the requested resource.
---------------------------
10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.
---------------------------
David and Sylvain are correct. Please fix.
Posted by: John Haugeland | May 19, 2010 at 09:36 AM
I agree with those wanting 410 gone. The WebDAV errors that mamund suggested are not good because they suggest that the entry still exists or that it can possibly recover. 410 maps exactly to the problem: your file is gone, sorry.
Posted by: Eric | May 19, 2010 at 09:39 AM
Dan: disks fail. No amount of redundancy can rule out more than one disk failing simultaneously. One loss in ten million every ten million years - eleven nines - is ridiculously stable. If someone advertises 100% to you, they are lying through their teeth.
Amazon is a large industrial company. They cannot tell lies like that.
If you were to move the entirety of WordPress.com - the one that hosts tens of thousands of blogs - to this storage service, it's quite likely that you'd still lose less than one item (posts, comments, settings fields, et cetera) every ten thousand years.
So, y'know, a 20% chance that one thing would have been lost since we turned over Year 0.
That's not setting up for upgrades. That's just facing facts: electronics aren't perfect. Things fail, and even with extreme redundancy, it is possible that all the disks carrying copies of your data will fail before any may be replaced.
If one in ten million since the time of the dinosaurs isn't good enough for you, you might want to have hosting that isn't mass market value oriented.
It's more than good enough for almost anything legitimately imaginable, though. Unless you're a bank or a nuclear power plant, eleven nines should be just fine.
Besides, statistically, the thing you'll end up losing is going to be some form of spam. That's pretty much the only thing the internet generates anymore, besides pornography.
Posted by: John Haugeland | May 19, 2010 at 09:45 AM
Sorry, that's not how probabilities work.
Using your given definition, if the durability is d, then there is a probability of (1-d) that any given object will be lost. For simplicity, and because from your use of the term "object" it seems the concept is atomic, I will take their probabilities of loss to be independent. This is blatantly false if there is exactly one hard disk that can fail, and comes closer and closer to the truth as you guys devise better and better methods. If I store N objects with you, the probability that you lose M of them after one year is given by sampling the binomial distribution: C(N,M) * (1-d)^M * d^(N-M).
If I stay in business with you for more than 1 year because of how awesome your data storage service is, say T years, then I will want to estimate the probability that you lose a total of M objects over the course of those years. I won't go into the details, but you want the product of the binomial probability given above for each year, for each partition of those M objects over those T years. You might want to try your hand at coming up with a (somewhat) closed form solution.
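As a minimal Python sketch of the calculation described above: it assumes independent losses, and additionally assumes that a lost object is not replaced, so each object can be lost at most once. Under those assumptions an object survives T years with probability d^T, and the total lost after T years is again binomial, with per-object loss probability 1 - d^T. The function names are made up for illustration.

```python
from math import comb


def prob_losses_in_one_year(n, m, d):
    """P(exactly m of n objects are lost in one year), durability d per object."""
    return comb(n, m) * (1 - d) ** m * d ** (n - m)


def prob_losses_over_years(n, m, d, years):
    """P(exactly m of n objects are lost within `years` years).

    Assumes independence and that lost objects are never re-uploaded,
    so an object survives the whole period with probability d ** years.
    """
    survive = d ** years
    return comb(n, m) * (1 - survive) ** m * survive ** (n - m)
```

With durabilities very close to 1, floating-point rounding in `1 - d` becomes noticeable, so exact rational arithmetic may be a better choice for real figures.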
Why bring this up? Because the following two statements:
"Let's define durability (with respect to an object stored in S3) as the probability that the object will remain intact and accessible after a period of one year. 100% durability would mean that there's no possible way for the object to be lost, 90% durability would mean that there's a 1-in-10 chance, and so forth."
AND
"durability [...] is 99.999999999%. If you store 10,000 objects with us, on average we may lose one of them every 10 million years or so."
are incompatible, unless you have strong reason to believe that your failures occur in an idiosyncratic way such that multiple failures across time and objects are interconnected in a very specific way that I, for the moment, wouldn't want to work out mathematically.
It is not inconceivable that you guys might end up putting these figures in some legally binding contract. Don't get burned by math.
Posted by: Kshitij Lauria | May 19, 2010 at 02:48 PM
Are you going to *ever* roll out HTTPS support on CloudFront? What the heck is taking so long for this obviously critical capability?
Posted by: pwb | May 19, 2010 at 10:13 PM
99.999999999% is purely through technical failure, right? I reckon the odds of Amazon going broke (or similar improbable happenings) would be more likely than that...
Reduced redundancy at a cheaper rate is a nice addition.
Posted by: Matt | May 20, 2010 at 08:00 AM
I was looking for "eleven 9s durability" and this post serves that purpose well! Still, I want one more detail. If the durability is defined per object, shouldn't it vary with the size of the object? For example, would a 1 kB object be lost with the same probability as a ~2 GB object? If AWS's data loss granularity is per disk, and a given object is never distributed across multiple disks, it makes sense that each object has the same probability, though. Otherwise, will it be possible to know the durability as a function of object size? Thanks.
Posted by: Sewook Wee | May 21, 2010 at 11:33 AM
Pwb: Take a look at http://developer.amazonwebservices.com/connect/thread.jspa?threadID=45378&start=15&tstart=405
Looks like CloudFront HTTPS support is coming in the next few weeks.
Posted by: Andrew S | June 02, 2010 at 12:57 AM