Today's guest post is brought to you by Doug Grismore, Director of Storage Operations for AWS. Doug has some useful performance tips and tricks that will help you to get the best possible performance from Amazon S3. There's also information about a special S3 hiring event that will take place later this week in Seattle.
We've worked with a large number of customers over the last few years getting some truly massive workloads into and out of Amazon S3. What follows is a little bit of best practice guidance for getting big on S3, including some background on why the tricks work.
First: for smaller workloads (<50 total requests per second), none of the below applies, no matter how many total objects one has! S3 has a bunch of automated agents that work behind the scenes, smoothing out load all over the system, to ensure the myriad diverse workloads all share the resources of S3 fairly and snappily. Even workloads that burst occasionally up over 100 requests per second really don't need to give us any hints about what's coming...we are designed to just grow and support these workloads forever. S3 is a true scale-out design in action.
S3 scales to both short-term and long-term workloads far, far greater than this. We have customers continuously performing thousands of requests per second against S3, all day every day. Some of these customers simply 'guessed' how our storage and retrieval system works, on their own, or may have come to S3 from another system that partitions namespaces using similar logic. We worked with other customers through our Premium Developer Support offerings to help them design a system that would scale basically indefinitely on S3. Today we’re going to publish that guidance for everyone’s benefit.
Some high-level design concepts are necessary here to explain why the approach below works. S3 must maintain a 'map' of each bucket's object names, or 'keys'. Traditionally, some form of partitioning would be used to scale out this type of map. Given that S3 supports a lexicographically sorted list API, it would stand to reason that the key names themselves are used in some way in both the map and the partitioning scheme...and in fact that is precisely the case: each key in this 'keymap' (that's what we call it internally) is stored and retrieved based on the name provided when the object is first put into S3 - this means that the object names you choose actually dictate how we manage the keymap.
Internally, the keys are all represented in S3 as strings like this:
Further, keys in S3 are partitioned by prefix.
As we said, S3 has automation that continually looks for areas of the keyspace that need splitting. Partitions are split either due to sustained high request rates, or because they contain a large number of keys (which would slow down lookups within the partition). There is overhead in moving keys into newly created partitions, but with request rates low and no special tricks, we can keep performance reasonably high even during partition split operations. This split operation happens dozens of times a day all over S3 and simply goes unnoticed from a user performance perspective. However, when request rates significantly increase on a single partition, partition splits become detrimental to request performance. How, then, do these heavier workloads work over time? Smart naming of the keys themselves!
We frequently see new workloads introduced to S3 where content is organized by user ID, or game ID, or other similar semi-meaningless identifier. Often these identifiers are incrementally increasing numbers, or date-time constructs of various types. The unfortunate part of this naming choice where S3 scaling is concerned is twofold: First, all new content will necessarily end up being owned by a single partition (remember the request rates from above...). Second, all the partitions holding slightly older (and generally less 'hot') content get cold much faster than they would under other naming conventions, effectively wasting the operations per second that each of those older partitions could support.
The simplest trick that makes these schemes work well in S3 at nearly any request rate is to simply reverse the order of the digits in this identifier (use seconds of precision for date or time-based identifiers). These identifiers then effectively start with a random number - and a few of them at that - which then fans out the transactions across many potential child partitions. Each of those child partitions scales close enough to linearly (even with some content being hotter or colder) that no meaningful operations per second budget is wasted either. In fact, S3 even has an algorithm to detect this parallel type of write pattern and will automatically create multiple child partitions from the same parent simultaneously - increasing the system's operations per second budget as request heat is detected.
Consider this small sample of incrementally increasing game IDs in the fictional S3 bucket 'mynewgame':
All these reads and writes will basically always go to the same partition...but if the identifiers are reversed:
This pattern instructs S3 to start by creating partitions named:
These can be split even further, automatically, as key count or request rate increases over time.
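As a minimal sketch of the digit-reversal trick described above: the game IDs, the `reversed_id_key` helper, and the `data/start.png` suffix below are all made up for illustration and are not taken from a real bucket.

```python
def reversed_id_key(game_id: int, suffix: str) -> str:
    """Build a key that starts with the fastest-changing digits of the ID."""
    # Reverse the decimal string so the low-order digit comes first.
    return f"{str(game_id)[::-1]}/{suffix}"


# Hypothetical, incrementally increasing identifiers.
game_ids = [2134857, 2134858, 2134859, 2134860]
keys = [reversed_id_key(g, "data/start.png") for g in game_ids]

for key in keys:
    print(key)

# The sequential IDs all share the leading digits "21348...", while the
# reversed keys begin with different characters.
leading_chars = {key[0] for key in keys}
```

Here four consecutive IDs end up under four different leading characters, which is what lets the keys fan out across separate child partitions.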
Clever readers will no doubt notice that using this trick alone makes listing keys lexicographically (the only currently supported way) rather useless. For many S3 use-cases, this isn't important, but for others, another slightly more complex scheme is necessary in order to allow groups of keys to be easily listable. This scheme also works for more structured object namespaces. The trick here is to calculate a short hash (note: collisions don't matter here, just pseudo-randomness) and prepend it to the string you wish to use for your object name. This way, operations are again fanned out over multiple partitions. To list keys with common prefixes, several list operations can be performed in parallel, one for each unique character prefix in your hash.
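A rough sketch of the hash-prefix idea, under a few assumptions: SHA-256 is used only because it is handy (any pseudo-random hash would do), the `/` separator and the `hashed_key` helper are illustrative choices, and the log names are invented.

```python
import hashlib
from collections import Counter


def hashed_key(name, prefix_len=1):
    """Prepend the first hex character(s) of a hash of the name to the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{name}"


# Hypothetical object names that would otherwise share one long common prefix.
names = [f"service_log.2012-02-27-23.hostname{i}.mydomain.com" for i in range(1, 1001)]
keys = [hashed_key(n) for n in names]

# Count how many keys landed under each leading hex character.
spread = Counter(key[0] for key in keys)
print(sorted(spread.items()))
```

Because the hash is computed from the name itself, the same name always maps to the same key, so readers can recompute the prefix rather than store it.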
Take the following naming scheme in fictional S3 bucket 'myserverlogs':
With thousands or tens of thousands of servers sending logs every hour, this scheme becomes untenable for the same reason as the first example. Instead, combining hashing and reversing domain identifiers, a scheme like the following provides the best balance between performance and flexibility of listing:
These prefixes could be a mod-16 operation on the ASCII values in the string or really any hash function you like - the above are made up to illustrate the point. The benefits of this scheme, though, are now clear: it is possible to prefix-list useful sets of keys (an hour across all domains, or an hour in a single domain) while sustaining well over 1500 reads and writes per second across these keymap partitions (16 in all, using regular expression shorthand below):
16 prefixed reads get you all the logs for mydomain.com for that hour:
As you can see, some very useful selections from your data are more easily accessible given the right naming structure. The general pattern here is: after the partition-enabling hash, place the key name elements you'd most like to list by furthest to the left.
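The parallel listing could be sketched roughly as follows. Everything here is an assumption made for illustration: the key layout (`<hex>/service_log.<hour>.<reversed domain>.<host>`), the helper names, and the in-memory `store` list standing in for a bucket. In a real client, `list_keys` would be one prefix-list request against S3.

```python
from concurrent.futures import ThreadPoolExecutor

HEX = "0123456789abcdef"


def reverse_domain(host):
    """'hostname1.mydomain.com' -> 'com.mydomain.hostname1'."""
    return ".".join(reversed(host.split(".")))


def list_prefixes(common):
    """One list prefix per possible single-character hex hash."""
    return [f"{h}/{common}" for h in HEX]


def list_keys(store, prefix):
    """Stand-in for a single prefix-list request (filters an in-memory list)."""
    return [k for k in store if k.startswith(prefix)]


def list_all(store, common):
    """Issue the 16 prefixed lists in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(HEX)) as pool:
        chunks = list(pool.map(lambda p: list_keys(store, p), list_prefixes(common)))
    return sorted(k for chunk in chunks for k in chunk)


# Hypothetical keys; the leading hex characters are made up.
store = [
    "c/service_log.2012-02-27-23." + reverse_domain("hostname1.mydomain.com"),
    "4/service_log.2012-02-27-23." + reverse_domain("hostname2.mydomain.com"),
    "9/service_log.2012-02-27-23." + reverse_domain("hostname3.mydomain.net"),
    "2/service_log.2012-02-28-00." + reverse_domain("hostname1.mydomain.com"),
]

hour_all_domains = list_all(store, "service_log.2012-02-27-23")
hour_one_domain = list_all(store, "service_log.2012-02-27-23.com.mydomain")
```

Putting the hour before the reversed domain is what allows both selections: the shorter prefix returns every domain for that hour, and the longer one narrows it to a single domain.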
By the way: two or three prefix characters in your hash are really all you need. Here's why: if we assume conservative targets of 100 operations per second and 20 million stored objects per partition, a four character hex hash partition set in a bucket or sub-bucket namespace could theoretically grow to support millions of operations per second and over a trillion unique keys before we'd need a fifth character in the hash.
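The arithmetic behind that claim can be checked with a few lines. The two per-partition targets are the conservative figures quoted above; the `capacity` helper is just an illustrative name, and the calculation assumes one partition per distinct hex prefix.

```python
OPS_PER_PARTITION = 100            # conservative target from the text
KEYS_PER_PARTITION = 20_000_000    # conservative target from the text


def capacity(hash_chars):
    """Return (partitions, ops per second, keys) for a hex prefix of this length."""
    partitions = 16 ** hash_chars
    return (
        partitions,
        partitions * OPS_PER_PARTITION,
        partitions * KEYS_PER_PARTITION,
    )


for n in range(1, 5):
    partitions, ops, keys = capacity(n)
    print(f"{n} char(s): {partitions:>6,} partitions, {ops:>9,} ops/s, {keys:>17,} keys")
```

With four hex characters this gives 65,536 partitions, roughly 6.5 million operations per second, and about 1.3 trillion keys.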
This hashing 'trick' can also be used when the object name is meaningless to your application. Any UUID scheme where the left-most characters are effectively random works fine (base64 encoded hashes, for example) - if you use base64, we recommend using a URL-safe implementation and avoiding the '+' and '/' characters, instead using '-' (dash) and '_' (underscore) in their places.
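One possible way to produce such a name, assuming a random (version 4) UUID is acceptable as the object name; the `random_key` helper is hypothetical, and stripping the trailing `=` padding is a choice made here, not a requirement stated above.

```python
import base64
import uuid

URLSAFE_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)


def random_key():
    """Return a URL-safe base64 encoding of a random UUID, without padding."""
    raw = uuid.uuid4().bytes                      # 16 random-ish bytes
    encoded = base64.urlsafe_b64encode(raw)       # uses '-' and '_' instead of '+' and '/'
    return encoded.rstrip(b"=").decode("ascii")   # 16 bytes -> 22 characters


key = random_key()
print(key)
```

The standard library's `urlsafe_b64encode` already substitutes `-` and `_` for `+` and `/`, so no manual replacement is needed.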
For even more personalized help, our Premium Developer Support team is ready to lend a hand...rest assured whatever you want to do on S3, we've seen before and either know the best way to do it or can help you discover it.
Talk to Us
And finally: maybe you are finding this content super interesting from a career perspective...perhaps scale problems of this type are what you were destined to help us solve? If we've piqued your interest in that way, and you're a solid software development engineer, systems engineer, or engineering leader/manager, maybe you'd enjoy meeting us this coming Wednesday, March 7, 2012, upstairs at the Seattle Hard Rock Cafe?
We're looking for some new additions to our strong team and would LOVE to meet you, share what we're working on, and see if we have mutual interest. Be sure to bring your resume, or send it to firstname.lastname@example.org in advance, and be sure to have some time over the 8th, 9th, or even Saturday the 10th for interviews - we are moving fast and will in-person interview the most promising candidates immediately.
Not local to Seattle? We'll still review your resume, and if mutual interest develops, we also support relocation within North America in most cases.