Continuing along in our quest to give you the tools that you need to build ridiculously powerful web sites and applications in no time flat at the lowest possible cost, I'd like to introduce you to Amazon CloudSearch. If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch. You can now have a very powerful and scalable search system (indexing and retrieval) up and running in less than an hour.
You, sitting in your corporate cubicle, your coffee shop, or your dorm room, now have access to search technology at a very affordable price. You can start to take advantage of many years of Amazon R&D in the search space for just $0.12 per hour (I'll talk about pricing in depth later).
What is Search?
Search plays a major role in many web sites and other types of online applications. The basic model is seemingly simple. Think of your set of documents or your data collection as a book or a catalog, composed of a number of pages. You know that you can find the desired content quickly and efficiently by simply consulting the index.
Search does the same thing by indexing each document in a way that facilitates rapid retrieval. You enter some terms into a search box and the site responds (rather quickly if you use CloudSearch) with a list of pages that match the search terms.
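To make the book-index analogy concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration of the general technique, not a description of how CloudSearch is implemented; the function names and sample pages are made up for this example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start with the pages for the first term, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data.
pages = {
    1: "dog training basics",
    2: "golden retriever care",
    3: "training your golden retriever dog",
}
index = build_index(pages)
print(sorted(search(index, "dog training")))  # prints [1, 3]
```

The lookup touches only the entries for the query terms instead of scanning every page, which is why an index keeps retrieval fast as the collection grows.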
As is the case with many things, this simple model masks a lot of complexity and might raise a lot of questions in your mind. For example:
- How efficient is the search? Did the search engine simply iterate through every page, looking for matches, or is there some sort of index?
- The search results were returned in the form of an ordered list. What factor(s) determined which documents were returned, and in what order (commonly known as ranking)? How are the results grouped?
- How forgiving or expansive was the search? Did a search for "dogs" return results for "dog?" Did it return results for "golden retriever," or "pet?"
- What kinds of complex searches or queries can be used? Does a search for "dog training" return the expected results? Can you search for "dog" in the Title field and "training" in the Description field?
- How scalable is the search? What if there are millions or billions of pages? What if there are thousands of searches per hour? Is there enough storage space?
- What happens when new pages are added to the collection, or old pages are removed? How does this affect the search results?
- How can you efficiently navigate through and explore search results? Can you group and filter the search results in ways that take advantage of multiple named fields (often known as faceted search)?
Needless to say, things can get very complex very quickly. Even if you can write code to do some or all of this yourself, you still need to worry about the operational aspects. We know that scaling a search system is non-trivial. There are lots of moving parts, all of which must be designed, implemented, instantiated, scaled, monitored, and maintained. As you scale, algorithmic complexity often comes into play; you soon learn that algorithms and techniques which were practical at the beginning aren't always practical at scale.
What is Amazon CloudSearch?
Amazon CloudSearch is a fully managed search service in the cloud. You can set it up and start processing queries in less than an hour, with automatic scaling for data and search traffic, all for less than $100 per month.
CloudSearch hides all of the complexity and all of the search infrastructure from you. You simply provide it with a set of documents and decide how you would like to incorporate search into your application.
You don't have to write your own indexing, query parsing, query processing, results handling, or any of that other stuff. You don't need to worry about running out of disk space or processing power, and you don't need to keep rewriting your code to add more features.
With CloudSearch, you can focus on your application layer. You upload your documents, CloudSearch indexes them, and you can build a search experience that is custom-tailored to the needs of your customers.
How Does it Work?
The Amazon CloudSearch model is really simple, but don't confuse simple with simplistic -- there's a lot going on behind the scenes!
Here's all you need to do to get started (you can perform these operations from the AWS Management Console, the CloudSearch command line tools, or through the CloudSearch APIs):
- Create and configure a Search Domain. This is a data container and a related set of services. It exists within a particular Availability Zone of a single AWS Region (initially US East).
- Upload your documents. Documents can be uploaded as JSON or XML that conforms to our Search Document Format (SDF). Uploaded documents will typically be searchable within seconds. You can, if you'd like, send data over an HTTPS connection to protect it while it is in transit.
- Perform searches.
There are plenty of options and goodies, but that's all it takes to get started.
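As a rough sketch of the upload step, the Python below builds a small JSON batch of add and delete operations. The key names ("type", "id", "version", "lang", "fields") reflect my understanding of the SDF layout and should be treated as an assumption; please check the SDF documentation before relying on them. The document ids and field values are hypothetical, and the code only builds the payload (it does not call any CloudSearch endpoint).

```python
import json


def add_op(doc_id, version, fields, lang="en"):
    """Build an 'add' operation (key names are an assumption about SDF)."""
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }


def delete_op(doc_id, version):
    """Build a 'delete' operation (key names are an assumption about SDF)."""
    return {"type": "delete", "id": doc_id, "version": version}


# Hypothetical documents for illustration only.
batch = [
    add_op("movie_1", 1, {"title": "Dog Training 101", "genre": "documentary"}),
    delete_op("movie_2", 2),
]

# Serialize the batch; this string is what you would upload to your domain.
payload = json.dumps(batch, indent=2)
print(payload)
```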
Amazon CloudSearch applies data updates continuously, so newly changed data becomes searchable in near real-time. Your index is stored in RAM to keep throughput high and to speed up document updates. You can also tell CloudSearch to re-index your documents; you'll need to do this after changing certain configuration options, such as stemming (converting variations of a word to a base word, such as "dogs" to "dog") or stop words (very common words that you don't want to index).
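To show what stemming and stop words do to text before it is indexed, here is a deliberately naive Python sketch. The stop word list and the "strip a trailing s" rule are assumptions made for illustration; they are not the rules that CloudSearch applies.

```python
# Hypothetical stop word list; a real one would be longer and configurable.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def naive_stem(word):
    """Strip a trailing 's' from longer words; a toy rule, not a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase the text, drop stop words, and stem what remains."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


print(normalize("The dogs and the retrievers"))  # prints ['dog', 'retriever']
```

Because these rules change which terms end up in the index, changing them means that existing documents have to be processed again, which is why a re-index is needed.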
Amazon CloudSearch has a number of advanced search capabilities including faceting and fielded search:
Faceting allows you to categorize your results into sub-groups, which can be used as the basis for another search. You could search for "umbrellas" and use a facet to group the results by price, such as $1-$10, $10-$20, $20-$50, and so forth. CloudSearch will even return document counts for each sub-group.
Fielded searching allows you to search on a particular attribute of a document. You could locate movies in a particular genre or with a particular actor, or products within a certain price range.
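The two ideas can be sketched in a few lines of Python. This is an in-memory illustration of the concepts, not the CloudSearch API; the product data, field names, and the choice to treat each price range as half-open (so that $10 falls into $10-$20) are all assumptions.

```python
from collections import Counter

# Price ranges from the umbrella example, treated as [low, high).
PRICE_BUCKETS = [(1, 10), (10, 20), (20, 50)]


def price_bucket(price):
    """Return the label of the price range that contains the price."""
    for low, high in PRICE_BUCKETS:
        if low <= price < high:
            return f"${low}-${high}"
    return "other"


def facet_counts(docs):
    """Count documents per price range, like the counts returned with a facet."""
    return Counter(price_bucket(d["price"]) for d in docs)


def fielded_search(docs, field, value):
    """Return the documents whose named field equals the given value."""
    return [d for d in docs if d.get(field) == value]


# Hypothetical catalog.
products = [
    {"name": "compact umbrella", "category": "umbrellas", "price": 8.50},
    {"name": "golf umbrella", "category": "umbrellas", "price": 24.00},
    {"name": "travel umbrella", "category": "umbrellas", "price": 12.99},
    {"name": "rain boots", "category": "footwear", "price": 35.00},
]

umbrellas = fielded_search(products, "category", "umbrellas")
for label, count in sorted(facet_counts(umbrellas).items()):
    print(label, count)
```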
Search Scaling
Behind the scenes, CloudSearch stores data and processes searches using search instances. Each instance has a finite amount of CPU power and RAM. As your data expands, CloudSearch will automatically launch additional search instances and/or scale to larger instance types. As your search traffic expands beyond the capacity of a single instance, CloudSearch will automatically launch additional instances and replicate the data to the new instance. If you have a lot of data and a high request rate, CloudSearch will automatically scale in both dimensions for you.
Amazon CloudSearch will automatically scale your search fleet up to a maximum of 50 search instances. We'll be increasing this limit over time; if you have an immediate need for more than 50 instances, please feel free to contact us and we'll be happy to help.
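One way to picture scaling in both dimensions is as a grid: enough instances to hold the data, multiplied by enough copies to handle the traffic. The Python sketch below is a back-of-the-envelope model only. The per-instance capacities are hypothetical parameters that you would have to supply, and the formula is an assumption, not the algorithm that CloudSearch uses.

```python
import math

MAX_INSTANCES = 50  # fleet limit stated in this post


def estimate_instances(data_gb, queries_per_sec, gb_per_instance, qps_per_instance):
    """Estimate (partitions, replicas, total, within_limit) for a workload.

    gb_per_instance and qps_per_instance are hypothetical capacities.
    """
    # Instances needed to hold the data.
    partitions = max(1, math.ceil(data_gb / gb_per_instance))
    # Copies of the data needed to serve the query load.
    replicas = max(1, math.ceil(queries_per_sec / qps_per_instance))
    total = partitions * replicas
    return partitions, replicas, total, total <= MAX_INSTANCES


# Hypothetical workload and capacities.
print(estimate_instances(data_gb=12, queries_per_sec=250,
                         gb_per_instance=5, qps_per_instance=100))
# prints (3, 3, 9, True)
```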
The net-net of all of this automation is that you don't need to worry about having enough storage capacity or processing power. CloudSearch will take care of it for you, and you'll pay only for what you use.
Pricing Model
The Amazon CloudSearch pricing model is straightforward:
You'll be billed based on the number of running search instances. There are three search instance sizes (Small, Large, and Extra Large) at prices ranging from $0.12 to $0.68 per hour (these are US East Region prices, since that's where we are launching CloudSearch).
There's a modest charge for each batch of uploaded data. If you change configuration options and need to re-index your data, you will be billed $0.98 for each Gigabyte of data in the search domain.
There's no charge for in-bound data transfer, data transfer out is billed at the usual AWS rates, and you can transfer data to and from your Amazon EC2 instances in the Region at no charge.
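Here is a quick worked example of the arithmetic, using the two rates quoted in this post. The 30-day month is an assumption, and the estimate leaves out the per-batch upload charge and data transfer, since those rates are not listed here.

```python
SMALL_INSTANCE_PER_HOUR = 0.12  # lowest search instance price quoted above
REINDEX_PER_GB = 0.98           # re-indexing price quoted above
HOURS_PER_MONTH = 24 * 30       # assumption: a 30-day month


def monthly_instance_cost(instances, hourly_rate=SMALL_INSTANCE_PER_HOUR):
    """Cost of running the given number of search instances for a month."""
    return instances * hourly_rate * HOURS_PER_MONTH


def reindex_cost(gigabytes):
    """Cost of re-indexing a search domain of the given size."""
    return gigabytes * REINDEX_PER_GB


print(f"${monthly_instance_cost(1):.2f}")  # prints $86.40
print(f"${reindex_cost(5):.2f}")           # prints $4.90
```

A single instance at the lowest rate running around the clock comes to $86.40, which is where the "less than $100 per month" figure comes from.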
Advanced Searching
Like the other Amazon Web Services, CloudSearch allows you to get started with a modest effort and to add richness and complexity over time. You can easily implement advanced features such as faceted search, free text search, Boolean search expressions, customized relevance ranking, field-based sorting and searching, and text processing options such as stopwords, synonyms, and stemming.
CloudSearch Programming
You can interact with CloudSearch through the AWS Management Console, a complete set of Amazon CloudSearch APIs, and a set of command line tools. You can easily create, configure, and populate a search domain through the AWS Management Console.
Here's a tour, starting with the welcome screen:

You start by creating a new Search Domain:

You can then load some sample data. It can come from local files, an Amazon S3 bucket, or several other sources:

Here's how you choose an S3 bucket (and an optional prefix to limit which documents will be indexed):

You can also configure your initial set of index fields:

You can also create access policies for the CloudSearch APIs:

Your search domain will be initialized and ready to use within twenty minutes:

Processing your documents is the final step in the initialization process:

After your documents have been processed you can perform some test searches from the console:

The CloudSearch console also provides you with full control over a number of indexing options including stopwords, stemming, and synonyms:

CloudSearch in Action
Some of our early customers have already deployed some applications powered by CloudSearch. Here's a sampling:
- Search Technologies has used CloudSearch to index Wikipedia (see the demo).
- NewsRight is using CloudSearch to deliver search for news content, usage and rights information to over 1,000 publications.
- ex.fm is using CloudSearch to power their social music discovery website.
- CarDomain is powering search on their social networking website for car enthusiasts.
- Sage Bionetworks is powering search on their data-driven collaborative biological research website.
- Smugmug is using CloudSearch to deliver search on their website for over a billion photos.
As you can see, these early applications represent a very diverse set of use cases. How do you plan to use Amazon CloudSearch? Leave me a comment and let us know!
Interested in learning more? Please visit the Amazon CloudSearch overview page and watch a video that shows how to build a search application using Amazon CloudSearch. You can also sign up for the Introduction To Amazon CloudSearch webinar on May 10.
-- Jeff;


Can you index the contents of PDF files/.doc files stored on S3?
Posted by: Luke | April 12, 2012 at 12:17 AM
Does it support S3? do you plan to index S3 and offer the search for it?
Posted by: Josh | April 12, 2012 at 01:22 AM
How about different languages? At the moment it's not working for Russian.
Posted by: topbot | April 12, 2012 at 01:44 AM
My impression is that this is a good service, but it won't be enough for medium and large web applications, because it lacks some good-to-have features like spatial queries and support for normalization rules in different languages. This is exactly what I need for my app, which I already run on the AWS stack, and I am looking towards utilizing Elastic Search.
Is there a plan to add these features in the near future?
Posted by: Anton Babenko | April 12, 2012 at 01:50 AM
In order to completely outsource the searching-service and query the AWS servers by JavaScript without using proxy servers, you need to add some more features:
- either an optional function call, which encloses the JSON response (e.g.: ...&callback=displayResults)
- or CNAMEs for search domains
I did browse through the documentation but have not found that, yet.
Posted by: Mark Kubacki | April 12, 2012 at 01:56 AM
Do you provide tools to easily integrate the search results into an existing web site according to its theme? It seems that this step of presenting the search results in a user-friendly way is also important to really decrease the cost of searching. For example, Acquia provides a Drupal module on tops of its cloud search that facilitates theming the search results.
It seems that your solution gives a bit more of flexibility concerning stopwords, stemming and synonyms. Do you think it can work well for another language than English? If yes, how much time do you think it can be adapted to another language than English?
Posted by: A Facebook User | April 12, 2012 at 02:26 AM
Jeff --
Thanks for the walkthrough. We've been eagerly waiting for this to happen. One thing that we miss, however, is a pre-existing Web index that we can get access to. It seems that all you can search is 8 M of your own documents. Can we sign up as early customers for Web search as a service prior to its release?
Sincerely, Linda
Posted by: Linda | April 12, 2012 at 02:36 AM
Congrats! Awesome timing, guys! We started evaluating SphinxSearch for StrikeBase [our SaaS product] a week ago. We now plan to use CloudSearch extensively. Could I have a date by which CloudSearch will be added to the AWS PHP SDK [it doesn't seem to be there right now]?
Cheers!
Posted by: Gaurav DCosta | April 12, 2012 at 02:40 AM
Very interesting. Good stuff. Is there support for languages other than English? How about multi-byte languages such as Chinese and Arabic?
Posted by: Aric Rosenbaum | April 12, 2012 at 04:15 AM
You guys never fail to amaze me :)
Do you have geospatial search coming up? Right now we have over 20 dedicated sphinx instances running just for this purpose. It would be incredibly nice if you could deliver a solution for geospatial search!
Thanks
Posted by: Olivier Janssens | April 12, 2012 at 04:20 AM
Looks like more fun than building and managing my own Solr cluster, Jeff! Hope you're good. :-)
Posted by: A Facebook User | April 12, 2012 at 04:25 AM
Sounds like a great offering, but it looks like the only option for getting data into it is to feed up all of the documents in their completed form. Is there any option for this service to crawl a website or URL? If not, that would be a logical thing to add.
While many CMS systems come with a built-in site search, website owners often want to present blended results that combine site pages (and blog posts, etc.) with more structured results. Being able to get all that search power in one place would be an even bigger value.
Just my two cents.
Jeff Greenhouse
President, 201 Proof - http://www.201proof.com
Posted by: Jeff Greenhouse | April 12, 2012 at 04:27 AM
Do you have a comparison/checklist chart against Solr & ElasticSearch?
Posted by: Learnwell | April 12, 2012 at 04:32 AM
Sounds amazing! What about autocomplete searches? Does it support autocomplete?
Thanks!
Posted by: Rafael Costa | April 12, 2012 at 05:17 AM
This is really great. All I would say is that I have a few clients who I have built custom Apache Solr-based web applications for, and who host those on virtual cloud servers that cost not that much more than the Small rate you have here just for search. Whilst I can definitely see the added value Amazon brings with scaling and I would love to move my Apache Solr clients to this, it's cost prohibitive I think, my clients would not pay the extra right now.
Also, do you support custom boosting of certain document types for example, or boosting by field? We have 4 types of document and we boost one of them as more important than the others. Then, within that, we boost further on a field of that document type if it exists/set to a certain value.
Keen to learn more about the advanced features (custom query handler, how much can you configure query slop, proximity etc..) and whether you index NGRAMs?
Posted by: PorridgeBear | April 12, 2012 at 05:50 AM
I've been working on trying to implement Xapian for search for the last few months, this service is a godsend (just wish I knew about it before we wasted all of our time trying to roll it out ourselves!)
Posted by: Kopertop | April 12, 2012 at 06:43 AM
Will the service support spatial search?
Posted by: Sian Kit Tjie | April 12, 2012 at 07:06 AM
It would be good if you also displayed the AWS responses to people's comments.
Posted by: Niall | April 12, 2012 at 07:12 AM
Can/will this be multi-tenant? What if I have hundreds or thousands of customer-specific data sets to index, but they can only be searched by the customer?
Posted by: GJ | April 12, 2012 at 07:21 AM
Is there any way to know ahead of time how large of a search instance will be needed and potentially stop the search before it ends up costing more than I expect ?
Posted by: SearchMe | April 12, 2012 at 07:53 AM
This is definitely a good service offering; however, the devil is in the details. I'll be playing with the service for the next couple of months to learn more about it to see if it's a possible fit for us. It would be nice if spatial was part of the mix, boosting, custom query handlers etc.
Posted by: zoomage | April 12, 2012 at 08:32 AM
Congratulations Jeff. We were excited to use this service in beta and it was a great time saver over building our own search infrastructure with Solr / Lucene. Happy it is now launched and we can discuss publicly!
Posted by: Sciencereengineered.wordpress.com | April 12, 2012 at 09:10 AM
Great comments - thanks everyone. Keep them coming.
If you have specific queries or feedback about CloudSearch, the best place to get them in front of the CloudSearch team is on the forum:
https://forums.aws.amazon.com/forum.jspa?forumID=137&start=0
We have operators standing by.
Posted by: Matt Wood | April 12, 2012 at 09:56 AM
Could you clarify: are there any limitations on the number of search domains?
How are "search instance" and "search domain" related? Are these notions equal?
Currently, we have a separate Lucene index for each registered user, and the number of users is unlimited.
To migrate our current architecture, should we create a dedicated search domain for each user, or is the number of domains limited, so that we should separate users' data by filtering search queries and results?
Posted by: Petro Sasnyk | April 12, 2012 at 01:29 PM
Do you think this is a better choice than using DynamoDB with custom index tables? I'm developing a website with Elastic Beanstalk (Java) and DynamoDB right now, and some of the tables currently have around 500,000 entries.
Posted by: Derek | April 12, 2012 at 04:34 PM
It is not clear anywhere how many documents an instance can index. I know it depends on the size of the documents, so some sort of calculator would be helpful. I am working on an application that needs to index millions of documents every month, so it would be great to know when I will hit the max 10 instance limit.
Posted by: Tahseen Ur Rehman Fida | April 12, 2012 at 07:55 PM
If you add geospatial I am *in*, currently using SOLR and worried about my cluster.
Posted by: Welocally | April 13, 2012 at 01:00 AM
Definitely faster and easier than setting up and managing a Solr instance. We were rapidly able to leverage this technology to index 10,000+ gene expression data sets in our application, more discussion at http://wp.me/p2faIU-T
Posted by: MichaelKellen | April 14, 2012 at 12:55 PM
How to create custom field type amazon cloud search like date, textgen, textTight, float
Posted by: vsreddy | May 21, 2012 at 02:03 AM
The CloudSearch forum at https://forums.aws.amazon.com/forum.jspa?forumID=137&start=0 is the best place to ask technical questions!
Posted by: Jeff Barr | May 21, 2012 at 01:59 PM