I spent yesterday morning working in a coffee shop while waiting to have an informal discussion with a candidate for an open position. From my vantage point in the corner I was able to watch the shop's "processing pipeline" in action. There were three queues and three types of processing!
The customers were waiting to place an order, waiting to pay, or waiting for their coffee.
The employees functioned as order takers, cashiers, or baristas.
It was fascinating to watch this dynamically scaled system in action. Traffic ebbed and flowed over the course of the three hours that I spent in my cozy little corner. The line of people waiting to place an order grew from one person to twenty people in just a few minutes. When things got busy, the order taker advanced through the line, taking orders so that the barista(s) could get a head start. The number of baristas varied from one to three. I'm not sure what was happening behind the scenes, but it was clear that they could scale up, scale down, and reallocate processing resources (employees) in response to changing conditions.
You could implement a system like this using the Amazon Simple Queue Service. However, until now, there was no way to scale the amount of processing power up and down as the number of items in the queue varied.
We've added new Amazon CloudWatch metrics to make it easier to handle this particular case. The following metrics are now available for each SQS queue (all at 5-minute intervals):
- NumberOfMessagesSent
- SentMessageSize
- NumberOfMessagesReceived
- NumberOfEmptyReceives
- NumberOfMessagesDeleted
- ApproximateNumberOfMessagesVisible
- ApproximateNumberOfMessagesNotVisible
We have also added the following metrics for each Amazon SNS topic, also at 5-minute intervals:
- NumberOfMessagesPublished
- NumberOfNotificationsDelivered
- NumberOfNotificationsFailed
- PublishSize
You can create alarms on any of these metrics using the AWS Management Console and you can use them to drive Auto Scaling actions. You can scale up when ApproximateNumberOfMessagesVisible starts to grow too large for one of your SQS queues, and scale down once it returns to a more reasonable value. You can also watch NumberOfEmptyReceives to make sure that your application isn't spending too much of its time polling for new messages. A rapid increase in the value of ApproximateNumberOfMessagesNotVisible could indicate a possible bug in your code. Depending on your application, you could also watch NumberOfMessagesSent (SQS) or NumberOfMessagesPublished (SNS) to make sure that the application is still healthy. Here is how all of the pieces (an SQS queue, its metrics, CloudWatch, Auto Scaling, and so forth) fit together:

You can read more about these features in the newest version of the CloudWatch Developer Guide.
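If you would rather script the alarm than click through the console, here is a rough sketch of what a queue-depth alarm could look like in Python. It assumes the boto3 library and its CloudWatch `put_metric_alarm` call; the function names, the alarm naming scheme, and the choice of two evaluation periods are illustrative assumptions, and the policy ARN is whatever your own Auto Scaling policy returns.

```python
def queue_depth_alarm_params(queue_name, policy_arn, threshold, scale_up=True):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    direction = "high" if scale_up else "low"
    return {
        "AlarmName": f"{queue_name}-depth-{direction}",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Average",
        "Period": 300,  # the SQS metrics arrive at 5-minute intervals
        "EvaluationPeriods": 2,  # assumption: require two periods in a row
        "Threshold": float(threshold),
        "ComparisonOperator": (
            "GreaterThanThreshold" if scale_up else "LessThanThreshold"
        ),
        "AlarmActions": [policy_arn],  # ARN of an Auto Scaling policy
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

You would typically create a pair of alarms per queue: one with `scale_up=True` attached to a scale-up policy, and one with `scale_up=False` and a lower threshold attached to a scale-down policy.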
-- Jeff;


Jeff,
this all sounds great in theory but doesn't work that well in practice.
Two problems:
1. SQS basically guarantees out of order message arrival. That may or may not be important but for certain use cases, that IS important. And then you're stuck mitigating out of order arrival in your application. It'd be nice if there were a parameter that can be set to delay delivery of a message until the one prior arrives. I understand it's a trade-off between reliability and delivery speed but not having this configurable sucks.
2. AutoScale has no knowledge of which servers to kill. As a result, it is suited for the most basic applications only, like stateless HTTPD. Anything more complex and it fails entirely. For example, it's not suited for media transcoding because a server that's actually doing work might get killed off, while an idle instance is left alone. Not cool. So, again, one is stuck doing this logic in a custom app some place.
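For point 1, the kind of application-side mitigation being described might look roughly like the sketch below. It assumes the producer stamps every message with its own consecutive sequence number (SQS does not supply one here), and the `Resequencer` class and its method names are made up for illustration.

```python
class Resequencer:
    """Release messages in sequence order, buffering any that arrive early."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq  # sequence number we are waiting for
        self.pending = {}          # early arrivals, keyed by sequence number

    def accept(self, seq, body):
        """Take one message; return the bodies that are now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # already seen (duplicate delivery), so drop it
        self.pending[seq] = body
        ready = []
        # Drain the buffer for as long as the next expected message is present.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

The cost is exactly the trade-off mentioned above: a single late message holds back everything behind it, and the buffer grows until it shows up.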
Posted by: Susan | July 21, 2011 at 08:11 PM
I missed this service for SQS badly, that's why we wrote our own solution to auto-scale based on SQS length. Please read here: http://tech-queries.blogspot.com/2011/04/suicide-workers.html
"So when our SQS length is more than a threshold, we know that our current conversion fleet is not enough to handle increased load."
Posted by: AkashAgrawal | July 21, 2011 at 10:02 PM
Akash,
your solution is the first that comes to mind but it's crude and not very elegant. (I can say this because we also went and abandoned this idea.:) )
The issue is that looking at an absolute metric such as SQS queue depth tells you nothing about the TREND. And it's the trend you're after.
Consider this example. Let's say your threshold is 5. And queue depth metrics go 1, 3, 5, 200, 199, 198, 197, etc...
What happens is at 200 you spawn a server. You spawn another one at 199. By the time you get to 190, you have enough servers to ensure a steady downward trend on the queue depth. BUT! Your solution will keep launching servers, to the point where you end up with 100s of EC2 machines but your queue depth is 6.
A much better way is to look at moving avg to smooth out the spikes. Even better still is to look at rate of change of moving averages. You don't need to launch machines on a consistent downward trend.
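As a rough sketch of that idea (the function names, the window of 3 samples, and the exact rule are assumptions for illustration, not anyone's production logic), a decision based on the moving average and its direction could look like this:

```python
def moving_average(samples, window):
    """Return the mean of the last `window` samples (or of all, if fewer)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


def should_scale_up(depths, threshold, window=3):
    """Scale up only if the smoothed depth is high AND not already falling."""
    if len(depths) < window + 1:
        return False  # not enough history to estimate a trend
    current = moving_average(depths, window)
    previous = moving_average(depths[:-1], window)  # average one sample ago
    return current > threshold and current >= previous


# The queue depths from the example above, with a threshold of 5.
depths = [1, 3, 5, 200, 199, 198, 197]
decisions = [should_scale_up(depths[:i], 5) for i in range(1, len(depths) + 1)]
```

With a window of 3, the decision stays true while the spike is still entering the window, then turns false once the average starts to fall, even though the raw depth is still far above the threshold.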
Posted by: Susan | July 22, 2011 at 04:09 PM
@Susan: We actually check whether the SQS length persists above this threshold for 30 mins, and only then spawn a new instance. Also, as soon as an instance is launched, the SQS length should drop significantly.
@Jeff: I couldn't find the pricing information about SQS monitoring on any product page. Can you please let me know the pricing info?
Posted by: AkashAgrawal | July 22, 2011 at 11:06 PM