David Yanacek of the Amazon DynamoDB team is back with another guest post, this one on the topic of optimizing the use of DynamoDB's unique provisioned throughput feature.
While I'm on the topic of DynamoDB, I should mention that we will be running an 8-hour DynamoDB bootcamp session at re:Invent. You'll need to have some development experience with a relational or non-relational database system in order to benefit.
-- Jeff;
DynamoDB offers scalable throughput and storage by horizontally
partitioning your data across a sufficient number of servers to meet your
needs. When you need more throughput for
your table, you simply use the AWS Management
Console or call an API. When your
table grows in size, DynamoDB automatically adds more partitions for storing
your growing dataset.
With traditional databases, you may be accustomed to buying
larger and larger database servers as your demands grow. This is known as a "scale-up" solution. However, when your dataset grows too large
for even the largest database server, you must scale out your database, implementing
logic in your application to route each query to the correct database
server. DynamoDB offers you this "scale-out"
architecture, while handling all the complex components that are needed to run
a secure, durable, scalable, and highly available data store.
While DynamoDB allows you to specify your level of
throughput, your application needs to be designed with the
DynamoDB architecture in mind to make use of it all. The Amazon
DynamoDB Developer Guide describes some best practices for achieving your
full provisioned throughput.
One of those recommendations is about how to efficiently store time series data in DynamoDB. Often when you store time series data, you access recent "hot" data more frequently than older, "cold" data. When storing time series data in DynamoDB, it is recommended that you spread your data across multiple tables - one per time period (month, day, etc). This article describes the reasons behind that advice, and the benefits of designing your application in this way.
Non-Uniform Workloads
To understand why hot and cold data separation is important,
consider the advice about Uniform Workloads in the developer guide:
When storing data, Amazon DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the hash key values. Distributing requests across hash key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed hash key elements, possibly even a single very heavily used hash key element, traffic is concentrated on a small number of partitions - potentially only one partition. If the workload is heavily unbalanced, meaning disproportionately focused on one or a few partitions, the operations will not achieve the overall provisioned throughput level. To get the most out of Amazon DynamoDB throughput, build tables where the hash key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
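To make the guide's point concrete, here is a minimal sketch that counts how many requests land on each partition for a uniform workload and for a skewed one. The MD5-based `partition_for` function, the key names, and the partition count of 4 are all assumptions made for illustration; DynamoDB's actual hashing and partition layout are internal and not exposed.

```python
import hashlib
from collections import Counter


def partition_for(hash_key, num_partitions):
    """Map a hash key to a partition index.

    Illustrative only: this uses MD5, not DynamoDB's internal hash function.
    """
    digest = hashlib.md5(hash_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def requests_per_partition(keys, num_partitions):
    """Count how many of the given requests land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Uniform workload: 1,000 requests spread over 1,000 distinct hash keys.
uniform = requests_per_partition([f"thread-{i}" for i in range(1000)], 4)

# Skewed workload: 1,000 requests, 900 of them for one very popular key.
skewed_keys = ["thread-0"] * 900 + [f"thread-{i}" for i in range(1, 101)]
skewed = requests_per_partition(skewed_keys, 4)

print("uniform:", uniform)
print("skewed: ", skewed)
```

In the skewed case, at least 900 of the 1,000 requests hit a single partition, no matter how many partitions the table has, because every request for the same hash key goes to the same place.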
Another example of a non-uniform workload is where an individual
request consumes a large amount of throughput.
Expensive requests are generally caused by Scan or Query operations, or
even single item operations when items are large. Even if these expensive requests are spread
out across a table, each request creates a temporary hot spot that can cause
subsequent requests to be throttled.
For instance, consider the Forum, Thread, and Reply tables
from the Getting
Started section of the developer guide, which demonstrate a forums web
application on top of DynamoDB. The
Reply table stores messages sent between users within a conversation; within
a thread, the messages are sorted by time.
If you Query the Reply table for all messages in a very
popular thread, that query could consume lots of throughput all at once from a
single partition. In the worst case,
this expensive query could consume so much of the partition's throughput that
it causes subsequent requests to be throttled for a few seconds, even if other partitions have throughput to spare.
Tip: Use Pagination
To spread that workload out over time, take advantage of the
pagination features of the Query operation, and limit the number of items
retrieved per call. Since a forums web
application displays a fixed number of replies to a thread at a time,
pagination lends itself well to this use case.
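As a rough sketch of what paginated access might look like, the function below fetches one page of replies at a time. It assumes a boto3-style low-level client whose `query` call accepts `Limit` and `ExclusiveStartKey` and returns `Items` and `LastEvaluatedKey`. The `query_page` helper, the page size of 25, and the hash key attribute name `Id` are assumptions for illustration, so adjust them to match your own table.

```python
def query_page(client, table_name, thread_id, page_size=25, start_key=None):
    """Fetch one page of replies for a thread.

    Returns (items, next_start_key). next_start_key is None when there are
    no more pages. `client` is assumed to be a boto3-style DynamoDB client.
    """
    kwargs = {
        "TableName": table_name,
        # Assumes the table's hash key attribute is named "Id".
        "KeyConditionExpression": "Id = :id",
        "ExpressionAttributeValues": {":id": {"S": thread_id}},
        # Cap the items read per call so that a single request cannot
        # consume a large share of one partition's throughput.
        "Limit": page_size,
    }
    if start_key is not None:
        # Resume where the previous page stopped.
        kwargs["ExclusiveStartKey"] = start_key
    response = client.query(**kwargs)
    return response.get("Items", []), response.get("LastEvaluatedKey")
```

A web page that shows 25 replies at a time would call `query_page` once per page view and pass the returned key back in when the user clicks through to the next page, rather than reading the whole thread in one request.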
Impact of Non-Uniform Workloads on Throughput for Large Tables
As your table grows in size, DynamoDB adds more partitions behind the scenes to handle your storage needs. As the number of partitions in your table increases, each partition is given a smaller portion of your overall throughput.
In the case of a non-uniform workload, as your dataset increases some of your requests could be throttled even though you did not see this throttling when the table was smaller in size, and even if you are not utilizing your table's full provisioned throughput. When you first create your table, you may have hot spots that go unnoticed because each partition is allotted a larger amount of your overall table throughput. However, when your application adds large amounts of data to your table, DynamoDB automatically adds more partitions, which decreases your per-partition throughput, and can lead to increased throttling for non-uniform workloads.
However, if your request workload is uniform across your
table, even as the number of partitions grows, your application will continue to
run smoothly.
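The arithmetic behind this effect can be sketched in a few lines. The figures below (1,000 provisioned units, a hot key that needs 300 units per second, and partition counts of 2, 4, and 10) are made up for illustration, and the sketch assumes the even per-partition split described in the developer guide. The actual number of partitions in a table is managed by DynamoDB.

```python
def per_partition_throughput(provisioned_units, num_partitions):
    """Throughput available to each partition, assuming an even split."""
    return provisioned_units / num_partitions


def is_throttled(hot_key_units, provisioned_units, num_partitions):
    """True if one hot key needs more than a single partition's share."""
    return hot_key_units > per_partition_throughput(provisioned_units, num_partitions)


# Hypothetical table: 1,000 provisioned units, one hot key needing 300 units/s.
for partitions in (2, 4, 10):
    share = per_partition_throughput(1000, partitions)
    throttled = is_throttled(300, 1000, partitions)
    print(f"{partitions} partitions: {share:.0f} units each, throttled={throttled}")
```

With 2 partitions each one gets 500 units, so the hot key fits. At 4 partitions the share drops to 250 units, and at 10 partitions to 100, so the same hot key is throttled even though the table as a whole is well under its provisioned level.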
Tip: Separate Hot and Cold Data
Some types of applications store a mix of hot and cold data
together in the same table. Hot data is
accessed frequently, like recent replies in the example forums application. Cold data is accessed infrequently or never,
like forum replies from several months ago.
Applications that store time series data, often with a range
key involving a timestamp, fall into this category. The developer
guide describes the best practices for
storing time series data in DynamoDB, which involve creating a new table for
each time period. This approach offers several
benefits, including:
- Cost: You can provision higher throughput for tables
which contain hot data, and provision lower throughput for the tables
containing cold data. This keeps your per-partition
throughput higher on your hot tables, helping them better tolerate
non-uniformity in your workloads.
- Simplified Analytics: When analyzing your data
for periodic reports, you can use the built-in integration with Amazon Elastic
MapReduce to run complex data analysis queries that are otherwise not supported
natively in DynamoDB. Since analytics
tends to recur on a scheduled basis, separating tables into time periods
means that the analytics job only needs to access the new data for analysis.
- Easier Archival: If older data is no longer
relevant to your application, you can simply archive it to cheaper storage systems
like Amazon S3 and delete the old table without having to delete items one at a
time, which would otherwise cost a great deal of provisioned throughput.
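One way to route requests to per-period tables is sketched below, assuming one table per month. The `table_for` and `throughput_for` helpers, the `Reply_YYYY_MM` naming scheme, and the hot and cold throughput figures are all hypothetical choices for illustration. They are not taken from the developer guide.

```python
from datetime import datetime


def table_for(timestamp, prefix="Reply"):
    """Name of the monthly table that holds items for the given timestamp.

    The prefix_YYYY_MM naming scheme is an assumption for this sketch.
    """
    return f"{prefix}_{timestamp:%Y_%m}"


def throughput_for(table_name, now, prefix="Reply", hot=(500, 500), cold=(10, 5)):
    """Pick (read_units, write_units) for a table.

    The current month's table is treated as hot and every other table as
    cold. The default figures are placeholders, not recommendations.
    """
    return hot if table_name == table_for(now, prefix) else cold


now = datetime(2012, 11, 20)
print(table_for(now))                       # table that receives new replies
print(throughput_for("Reply_2012_11", now)) # hot table settings
print(throughput_for("Reply_2012_10", now)) # cold table settings
```

New writes go to the table returned by `table_for(now)`, while a reader looking for older replies computes the table name from the period it wants. Once a month's table is no longer needed, it can be archived and deleted as a whole.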
In Conclusion
Unlike traditional databases, DynamoDB lets you scale up
your throughput requirements with the push of a button. DynamoDB also automatically manages your
storage as your table grows in size. As
your table grows, its provisioned throughput is spread out across your additional
partitions. In order to fully utilize
your provisioned throughput in DynamoDB, you have to take this partitioning into
consideration when you design your application.
You can find additional best practices and suggestions for application
designs that work well on DynamoDB in the Amazon
DynamoDB Developer Guide.
-- David Yanacek