This document summarizes Rackspace's use of Hadoop to process and query logs from multiple datacenters. Key points:
- Rackspace needed to query logs from its mail and application servers to answer support and analytics questions. Earlier approaches that loaded logs into a single database could not scale as log volume grew across datacenters.
- Hadoop allowed ingesting raw logs, building Lucene indexes for querying, and storing data across multiple datacenters. Real-time queries were served by Solr, while batch queries ran as MapReduce jobs (see the MapReduce sketch after this list).
- The implementation collected logs into Hadoop, used SolrOutputFormat to generate the indexes, and answered queries via distributed Solr and MapReduce (a SolrJ query sketch also follows below). This provided scalable storage, analysis, and querying across datacenters.
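
The batch-query path described above runs MapReduce jobs directly over the raw logs stored in Hadoop. The following is a minimal sketch of that pattern, not Rackspace's actual job: a small Hadoop job in Java that counts log lines per originating host. The input/output paths and the assumption that the host is the second whitespace-separated field of each line are purely illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts log lines per sending host. Assumes the host name is the
// second whitespace-separated field of each log line (illustrative only).
public class LogLinesPerHost {

  public static class HostMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text host = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\\s+");
      if (fields.length > 1) {          // skip malformed lines
        host.set(fields[1]);
        context.write(host, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log lines per host");
    job.setJarByClass(LogLinesPerHost.class);
    job.setMapperClass(HostMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregated counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```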
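
For the real-time path, support tools query the Solr instances that serve the generated indexes. Below is a minimal SolrJ sketch of such a lookup; the Solr URL, collection name (`logs`), and field names (`message`, `timestamp`) are assumptions for illustration, not details from the case study.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class LogSearch {
  public static void main(String[] args) throws Exception {
    // URL, collection, and field names are illustrative assumptions.
    try (SolrClient solr = new HttpSolrClient.Builder(
        "http://solr.example.com:8983/solr/logs").build()) {

      // Find the 20 most recent log entries mentioning a delivery failure.
      SolrQuery query = new SolrQuery("message:\"delivery failed\"");
      query.addSort("timestamp", SolrQuery.ORDER.desc);
      query.setRows(20);

      QueryResponse response = solr.query(query);
      for (SolrDocument doc : response.getResults()) {
        System.out.println(doc.getFieldValue("timestamp") + "  "
            + doc.getFieldValue("message"));
      }
    }
  }
}
```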