How Rackspace Uses MapReduce and Hadoop
to Query Terabytes of Log Data

    Case Study, by Schubert Zhang 2009-04-30
Rackspace
• Rackspace has more than 50K devices and 7 data
  centers.

• The mail system and logging servers are currently in 3 of
  the Rackspace data centers.

• The system stores over 800 million objects (an object = a
  user event such as receiving an email or logging into
  IMAP) within Solr and 9.6 billion records within Hadoop,
  which equals 6.3 TB compressed.

• Several hundred gigabytes of email log data are
  generated each day (roughly 140 GB after cleanup).
Background on Mailtrust
• Email hosting company
• Founded in 1999, merged with Rackspace in 2007,
  previous name: Webmail.us
• 80K business customers, 700K mailboxes.
• 2 hosted mail products: Noteworthy, MS Exchange
• The Noteworthy System:
   – Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds,
     shared calendaring, Outlook sync, Blackberry sync.
   – ~600 servers, commodity hardware, designed to work around
     frequent failures.
• The MS Exchange System:
   – MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
   – ~100 servers, higher-end hardware, SAN & DAS storage.
Problems
•   Hundreds of gigabytes of new data each day streaming in from over 600 hyperactive
    servers.
•   The log processing system went through three generations:
     –   (1) Flat text files stored on each machine.
           •   Had to be manually searched by engineers logging into each individual machine.
     –   (2) Relational database solution that just couldn't compete. MySQL.
           •   Inserts quickly became the bottleneck.
           •   A lot of index churn.
           •   Data was then broken into Merge Tables based on time so index updates weren't a problem.
           •   Load and operational problems.
      –   (3) Hadoop based solution that works well and has virtually unlimited scalability potential.
           •   Hadoop
           •   Lucene and Solr.
•   The familiar problem they now faced: lots and lots of data streaming in.
     –   Where do you store all that data?
     –   How do you do anything useful with it?
     –   How do you retrieve the data you want from that sea of data?

•   Examine mail logs in order to troubleshoot problems for our customers.
•   The query/search should be fast and accurate.
The New System
• The advantage of their new system is that they can now
  look at their data any way they want:
   – Nightly MapReduce jobs collect statistics about their mail system,
     such as spam counts by domain, bytes transferred and number
     of logins (a minimal job sketch follows this slide).
   – When they wanted to find out which part of the world their
     customers logged in from, a quick MapReduce job was created
     and they had the answer within a few hours. That is not really
     possible in a typical ETL system.
• "Now whenever we think of complex question about our
  customers’ usage patterns, we can pull the answer from
  our logs within hours via MapReduce. This is powerful
  stuff."
The Platform
•   Hadoop MapReduce
•   Hadoop Distributed File System (HDFS)
•   Lucene
•   Solr
•   Tomcat
The Architecture
• Raw logs get streamed from hundreds of mail servers to
  the Hadoop Distributed File System ("HDFS") in real time.

• MapReduce jobs are scheduled to run to index the new
  data using Apache Lucene and Solr (a minimal indexing
  sketch follows this slide).

• Once the indexes have been built, they are compressed
  and stored away in HDFS.

• Each Hadoop datanode runs a Tomcat servlet container,
  which hosts a number of Solr instances that pull and
  merge the new indexes, and provide really fast search
  results to our support team.
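
The indexing step can be pictured with a small Lucene sketch. In the real pipeline this work happens inside the reducer with an embedded SolrCore (see Stu Hood's comments later); the standalone version below simply builds an index on local disk from a batch of log lines. The field names and the space-separated log layout are assumptions, and the API shown is modern Lucene rather than the 2009-era version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

/** Builds a Lucene index on local disk from one batch of mail log lines (hypothetical format). */
public class IndexLogBatch {
  public static void main(String[] args) throws IOException {
    Path logFile = Paths.get(args[0]);    // one batch of raw log lines
    Path indexDir = Paths.get(args[1]);   // local directory for the new index

    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);
         BufferedReader reader = Files.newBufferedReader(logFile)) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Assume "sender recipient rest-of-line" purely for illustration.
        String[] parts = line.split(" ", 3);
        if (parts.length < 3) {
          continue;
        }
        Document doc = new Document();
        doc.add(new StringField("sender", parts[0], Field.Store.YES));     // exact-match field
        doc.add(new StringField("recipient", parts[1], Field.Store.YES));  // exact-match field
        doc.add(new TextField("body", parts[2], Field.Store.YES));         // tokenized field
        writer.addDocument(doc);
      }
    }
    // Once the writer is closed, the index directory would be compressed and
    // copied into HDFS for a Solr node to pull and merge later.
  }
}
```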
The System Evolution
              Logging v1.0
• Logs were stored in flat text files on the local disk of
  each mail server and were kept for 14 days.

• Our support techs did not have login access to the
  servers, so in order to search the logs they would have
  to escalate a ticket to our engineers. The engineers
  would then have to ssh into each mail server and grep
  /var/log/maillog.

• Problems: Once we grew much past a dozen servers,
  this manual process of logging into each server became
  too time consuming for our engineers.
Logging v1.1
•   Sped up the search process by writing a script that would search
    multiple servers via one command run from a centralized server
    (a rough sketch of the idea follows this slide).

•   Still grep, just run remotely.

•   Problems: The support techs still had to escalate a ticket to the
    engineers in order to perform a search. As the number of customers
    and servers increased, this began to take too much of our
    engineers' scarce time. Also, storing and searching the logs on a
    live server was negatively affecting the performance of the servers.
    To make matters worse, the engineering team had grown and we
    started running into the problem where two engineers would perform
    a search at the same time, which really slowed things down.
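
As a rough illustration of the v1.1 idea (the actual tool was presumably a shell script), the sketch below fans a single grep out over a list of mail servers from the centralized box via ssh. The host names are placeholders; /var/log/maillog is the path mentioned earlier.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

/** Runs one grep across many mail servers from a central server (hypothetical host list). */
public class RemoteLogGrep {
  public static void main(String[] args) throws Exception {
    String pattern = args[0];                     // e.g. a customer's email address
    List<String> hosts = Arrays.asList("mail01.example.com", "mail02.example.com");

    for (String host : hosts) {
      // Relies on key-based ssh from the centralized search server.
      Process p = new ProcessBuilder("ssh", host, "grep", pattern, "/var/log/maillog")
          .redirectErrorStream(true)
          .start();
      try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        String line;
        while ((line = r.readLine()) != null) {
          System.out.println(host + ": " + line);  // tag each hit with its server
        }
      }
      p.waitFor();
    }
  }
}
```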
Logging v2.0
•   A web-based tool that let support techs search the logs.
•   It allowed searching by the sender or recipient's email address, domain name or IP
    address.
•   All of these were indexed fields in a MySQL database on the centralized log server
    (a schema sketch follows this slide).

•   Each day's logs were stored in a separate table, so that we could clean up old data by
    simply dropping and recreating MySQL tables.
•   Log data was only kept for 3 days in order to keep the MySQL database down to a
    reasonable size.
•   Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the
    data set was very large and these queries would be horribly slow.

•   Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As
    the tables grew, indexing each entry as it was inserted became slow. Within the first
    hours of testing, the inserts began slowing and could not keep up with the rate at
    which data was received. Version 2.0 of the logging system was never used in
    production.
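
A plausible shape for the v2.0 schema, sketched through JDBC: one MyISAM table per day with indexes on each searchable field, so old data can be dropped cheaply. The column names, types and credentials are assumptions; only the overall layout (daily tables, indexed sender/recipient/domain/IP, no wildcard searches) comes from the slide.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Creates one day's log table with indexed search fields (hypothetical schema). */
public class CreateDailyLogTable {
  public static void main(String[] args) throws Exception {
    String day = "20090430";   // one table per day; cleanup = DROP TABLE
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost/maillogs", "loguser", "secret");
         Statement st = conn.createStatement()) {
      st.executeUpdate("DROP TABLE IF EXISTS maillog_" + day);
      // Searchable fields get indexes; wildcard LIKE scans stay forbidden.
      st.executeUpdate(
          "CREATE TABLE maillog_" + day + " (" +
          "  logged_at DATETIME NOT NULL," +
          "  sender VARCHAR(255)," +
          "  recipient VARCHAR(255)," +
          "  domain VARCHAR(255)," +
          "  client_ip VARCHAR(45)," +
          "  message TEXT," +
          "  INDEX idx_sender (sender)," +
          "  INDEX idx_recipient (recipient)," +
          "  INDEX idx_domain (domain)," +
          "  INDEX idx_ip (client_ip)" +
          ") ENGINE=MyISAM");
    }
  }
}
```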
Logging v2.1
•   Fixed the MySQL INSERT bottleneck by queuing up the log entries
    in local text files on the centralized log server and periodically bulk
    loading them into the database. As syslog-ng received logs on its 6
    ports, the data would be streamed to 6 separate text files. Every 10
    minutes a script would rotate those text files and execute a MySQL
    LOAD to load the data into the database (a minimal sketch follows
    this slide). This was orders of magnitude faster than inserting the
    log data one record at a time.

•   Problems: The LOADs would get progressively slower as the
    database grew because MySQL indexing performance decreases as
    the table you are inserting into gets larger. This version was fast
    enough to be released into production, but we knew the system
    would not scale too far without additional work.
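
The v2.1 fix can be sketched as the piece of the 10-minute script that issues the bulk load after rotating a file. In production this was driven by syslog-ng plus a script; the file path, table name and connection settings below are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Bulk-loads one rotated log file into the current day's table (all names hypothetical). */
public class BulkLoadRotatedLog {
  public static void main(String[] args) throws Exception {
    String rotatedFile = "/var/spool/maillog/port1.rotated";   // one of the 6 syslog-ng streams
    try (Connection conn = DriverManager.getConnection(
             // allowLoadLocalInfile lets Connector/J send the client-side file
             "jdbc:mysql://localhost/maillogs?allowLoadLocalInfile=true",
             "loguser", "secret");
         Statement st = conn.createStatement()) {
      // A single bulk LOAD is far faster than inserting rows one at a time.
      int rows = st.executeUpdate(
          "LOAD DATA LOCAL INFILE '" + rotatedFile + "' " +
          "INTO TABLE maillog_20090430 " +
          "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
      System.out.println("Loaded " + rows + " log records");
    }
  }
}
```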
Logging v2.2
•   Introduced Merge Tables in order to speed up loading the log data into the database.
•   Every 10 minutes our script would create a new database table and then load the text
    logs into the empty table (a MERGE-table sketch follows this slide).
•   After the data was loaded, the script would modify a set of Merge Tables that
    combined all of the 10-minute tables together.
•   The web search tool was modified to allow searching within the different time ranges.
    Corresponding Merge Tables existed for each of those time ranges, and were
    modified every 10 minutes as new tables were created.

•   Problems: The database LOAD operations would take 2-3 minutes to run, and the
    server was now always under heavy CPU and disk I/O load.
•   Searches were being performed more frequently and were becoming slow. We
    started to see some strange problems such as random errors while trying to create
    new tables or modify the Merge Tables. These errors progressively became more
    frequent, resulting in missing log data. The support team began to lose confidence in
    the system's accuracy.
•   The logging system had no redundancy.

•   We needed a new solution that would be fast, reliable and could scale indefinitely
    with our growth. We needed something truly scalable.
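
A minimal sketch of the MERGE-table trick, issued here through JDBC: each 10-minute batch lands in its own MyISAM table, and a MERGE table is redefined to union the tables for a given time range so one query spans them all. The table names and columns are simplified assumptions; the underlying maillog_* tables must already exist as MyISAM tables with identical definitions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Repoints a MySQL MERGE table at the latest 10-minute tables (all names hypothetical). */
public class RebuildMergeTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost/maillogs", "loguser", "secret");
         Statement st = conn.createStatement()) {

      // The MERGE table must share the exact column and index definitions of the
      // underlying MyISAM tables it unions.
      st.executeUpdate(
          "CREATE TABLE IF NOT EXISTS maillog_last_hour (" +
          "  logged_at DATETIME NOT NULL," +
          "  sender VARCHAR(255)," +
          "  recipient VARCHAR(255)," +
          "  INDEX idx_sender (sender)," +
          "  INDEX idx_recipient (recipient)" +
          ") ENGINE=MERGE UNION=(maillog_0950, maillog_1000) INSERT_METHOD=NO");

      // Every 10 minutes, after a new table is loaded, repoint the MERGE definition
      // so searches against maillog_last_hour also cover the newest data.
      st.executeUpdate(
          "ALTER TABLE maillog_last_hour " +
          "UNION=(maillog_0950, maillog_1000, maillog_1010)");
    }
  }
}
```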
Logging v3+
• Goal: avoid limiting our ability to build new features down
  the road.
• For example, we wanted to build a tool that would allow
  our customers to search their logs directly.

• It scales its workload out horizontally by adding servers
  and distributing the data and MapReduce jobs amongst
  the servers.

• In about 3 months we built a fresh new log processing
  system using Hadoop, Lucene and Solr.

• Put the log search tool in the hands of our customers.
Stu Hood’s Detailed Comments
•   The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it
    reaches a size just below the block size, or until it times out, and then we close it and move it to
    where it will be processed.
•   Our processing jobs run every 10 minutes or so, meaning that the logs become available for
    Customer Care after about 15 minutes. We’ve executed around 150K jobs on this cluster with 3
    restarts.

•   We create the indexes on local disk in our reducer, and compress them into HDFS after they are
    complete.
•   When we pull the index to make it available for search, we decompress it to local disk and merge
    it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance
    (a minimal sketch of this merge-and-commit step follows this slide). The Nutch project created an
    IndexReader that can do read-only access on HDFS, but for speed reasons, we decided not to
    take that approach.
•   Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the
    reducer.

•   We have 10 Hadoop data nodes with 3.5 TB of disk each, about 35 TB in total.
•   We are currently indexing an average of 140 GB per day.

•   The merged indexes are not replicated at all… only one Solr node has a copy of each index, so
    failover involves a brief downtime for queries. If we lose a node, other nodes (consistent hashing)
    become responsible and merge the indexes from the copies we always have in Hadoop.
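
The merge-and-commit step described above might look roughly like the sketch below: open the live index directory on the search node, fold in the freshly decompressed index with IndexWriter.addIndexes, then hit the core's update handler with commit=true so Solr reopens its searcher. The paths, core name and Solr URL are assumptions, the Lucene API shown is a modern one, and in a real deployment this has to be coordinated with the write lock held by Solr's own index writer.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Merges a decompressed index into a core's live index, then commits via HTTP (hypothetical paths). */
public class MergeNewIndex {
  public static void main(String[] args) throws IOException {
    Directory live = FSDirectory.open(Paths.get("/data/solr/maillog/index"));
    Directory incoming = FSDirectory.open(Paths.get("/tmp/decompressed-index"));

    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(live, cfg)) {
      writer.addIndexes(incoming);   // the IndexWriter.addIndexes merge mentioned above
    }

    // Ask the Solr core to commit so the merged documents become searchable.
    URL commit = new URL("http://localhost:8983/solr/maillog/update?commit=true");
    HttpURLConnection conn = (HttpURLConnection) commit.openConnection();
    System.out.println("Solr commit returned HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}
```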
Future
• Creating reports and running ad-hoc queries.
• Writing more MapReduce jobs as new questions
  come up.
References
• How Rackspace Now Uses MapReduce
  and Hadoop to Query Terabytes of Data
• MapReduce at Rackspace
