Cross Datacenter Logs Processing
Rackspace Hosting – Hadoop World 2009
Stu Hood – Search Team Technical Lead
Date: October 2, 2009
Overview
- Use case: Background, Log Types, Querying
- Previous Solutions
- The Hadoop Solution
- Implementation: Collection, Index time, Query time
- Advantages of Hadoop: Storage, Analysis, Scalability, Community
Use Case: Background
- “Rackapps” – Email and Apps Division
  - Founded 1999, merged with Rackspace 2007
- Hybrid Mail Hosting
  - 40% of accounts have a mix of Exchange and Rackspace Email
  - Fantastic Control Panel to juggle accounts
  - Webmail client with calendar/contact/note sharing
  - More Apps to come
- Environment
  - 1K+ servers at 3 of 6 Rackspace datacenters
  - Breakdown: 80% Linux, 20% Windows
- Platforms: “Rackspace Email” (a custom email and application platform) and Microsoft Exchange
Use Case: Log Types
- MTA (mail delivery) logs: Postfix, Exchange, Momentum
- Spam and virus logs: Amavis
- Access logs: Dovecot, Exchange, httpd
Use Case: Querying
- Support Team needs to answer basic questions:
  - Mail Transfer – Was it delivered?
  - Spam – Why was this (not) marked as spam?
  - Access – Who (checked | failed to check) mail?
- Engineering asks more advanced questions:
  - Which delivery routes have the highest latency?
  - Which are the spammiest IPs?
  - Where in the world do customers log in from?
- Elsewhere: Cloud teams use Hadoop for even more mission-critical statistics
Previous Solutions
- V1 – Query at the Source (founding – 2006)
  - No processing: flat log files on each source machine
  - To query, Support escalates a ticket to Engineering
  - Queries take hours
  - 14 days available, single datacenter
- V2 – Bulk load to MySQL (2006 – 2007)
  - Process logs, bulk load into denormalized schema
  - Add merge tables for common query time ranges
  - SQL self joins to find log entries for a path (illustrated below)
  - Queries take minutes
  - 1 day available, single datacenter
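To make the V2 approach concrete, here is a hypothetical illustration of the "self join" query style: with per-stage log rows bulk loaded into a denormalized MySQL table, joining the table to itself on a queue id stitches the hops of one delivery back together. The table, columns, and connection details below are invented for the sketch, not the actual Rackspace schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PathLookup {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details and schema; the real system layered
    // merge tables (partitioned by time range) on top of tables like this.
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://loghost/maillogs", "support", "secret");

    // Self join: pair the "received" row with the "delivered" row that share
    // a queue id, reconstructing a single message's path.
    String sql =
        "SELECT a.queue_id, a.ts AS received_at, b.ts AS delivered_at, b.relay " +
        "FROM mta_log a JOIN mta_log b ON a.queue_id = b.queue_id " +
        "WHERE a.stage = 'received' AND b.stage = 'delivered' AND a.recipient = ?";

    try (PreparedStatement stmt = db.prepareStatement(sql)) {
      stmt.setString(1, "user@example.com");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("queue_id") + " delivered via " + rs.getString("relay"));
        }
      }
    }
    db.close();
  }
}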
The Hadoop Solution
- V3 – Lucene Indexes in Hadoop (2007 – Present)
- Raw logs collected and processed in Hadoop, with Lucene indexes as the intermediate format
- “Realtime” queries via Solr
  - Indexes merged to Solr nodes with 15 minute turnaround
  - 7 days stored uncompressed
  - Queries take seconds
- Long term querying via MapReduce and high level languages
  - Hadoop InputFormat for Lucene indexes (sketched below)
  - 6 months available for MR queries
  - Queries take minutes
- Multiple datacenters
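The deck mentions a Hadoop InputFormat for Lucene indexes so that archived index shards in HDFS can be fed directly into MapReduce jobs; the actual Rackspace implementation is not shown in the slides. The following is a minimal sketch of the idea, assuming Hadoop 2.x-era and Lucene 3.x-era APIs; the class name, local staging path, and stored field names are hypothetical.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Treat every shard directory under the input path as one unsplittable unit.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path dir : getInputPaths(job)) {
      FileSystem fs = dir.getFileSystem(job.getConfiguration());
      for (FileStatus shard : fs.listStatus(dir)) {
        if (shard.isDirectory()) {
          splits.add(new FileSplit(shard.getPath(), 0, shard.getLen(), new String[0]));
        }
      }
    }
    return splits;
  }

  @Override
  public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<NullWritable, Text>() {
      private IndexReader reader;
      private int doc = -1;
      private final Text value = new Text();

      public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        // Stage the shard from HDFS onto local disk so Lucene can open it.
        Path shard = ((FileSplit) split).getPath();
        File local = new File(System.getProperty("java.io.tmpdir"), shard.getName());
        FileSystem fs = shard.getFileSystem(context.getConfiguration());
        fs.copyToLocalFile(shard, new Path(local.getAbsolutePath()));
        reader = IndexReader.open(FSDirectory.open(local));
      }

      public boolean nextKeyValue() throws IOException {
        while (++doc < reader.maxDoc()) {
          if (reader.isDeleted(doc)) continue;
          Document d = reader.document(doc);
          // Hypothetical stored fields; emit one tab-separated record per document.
          value.set(d.get("sender") + "\t" + d.get("timestamp") + "\t" + d.get("recips"));
          return true;
        }
        return false;
      }

      public NullWritable getCurrentKey() { return NullWritable.get(); }
      public Text getCurrentValue() { return value; }
      public float getProgress() { return reader == null ? 0f : (float) doc / reader.maxDoc(); }
      public void close() throws IOException { if (reader != null) reader.close(); }
    };
  }
}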
The Hadoop Solution: Alternatives
- Splunk: great for realtime querying, but weak for long term analysis; archived data is not easily queryable
- Data warehouse package: weak for realtime querying, great for long term analysis
- Partitioned MySQL: mediocre solution to either goal; needed something similar to MapReduce for sharded MySQL
Implementation: Collection
- Software
  - Transport: syslog-ng, SSH tunnel between datacenters; considering Scribe/rsyslog/?
  - Storage: app to deposit logs into Hadoop using the Java API (see the sketch below)
- Hardware
  - Per datacenter: 2-4 collector machines, hundreds of source machines
  - Single datacenter: 30 node Hadoop cluster, 20 Solr nodes
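The collector-side "app to deposit to Hadoop using the Java API" is not shown in the deck. A minimal sketch of what such a depositor might look like with the standard HDFS FileSystem API follows; the class name and destination layout are invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogDepositor {
  public static void main(String[] args) throws Exception {
    // args[0]: a rotated log batch handed over by syslog-ng on the collector.
    Path localBatch = new Path(args[0]);

    // Picks up fs.defaultFS (the namenode address) from the cluster config on the classpath.
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    // One file per collector per batch; the destination layout here is hypothetical.
    Path dest = new Path("/logs/raw/" + System.currentTimeMillis() + "-" + localBatch.getName());
    hdfs.copyFromLocalFile(localBatch, dest);
    hdfs.close();
  }
}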
Implementation: Indexing/Querying
- Indexing
  - Unique processing code for schema’d and unschema’d logs
  - SolrOutputFormat generates compressed Lucene indexes
- Querying
  - “Realtime”: sharded Lucene/Solr instances merge index chunks from Hadoop
    - Using the Solr API; a plugin optimizes sharding so queries are distributed only to relevant nodes, and Solr merges the results (see the sketch below)
  - Raw Logs: using Hadoop Streaming and unix grep
  - MapReduce
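The slides describe querying the merged indexes through the Solr API, with a custom plugin routing each query only to the shards that can match; that plugin is not published in the deck. The sketch below shows only stock SolrJ distributed search, assuming a reasonably recent SolrJ client, with hostnames, core name, and field names invented for illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RealtimeLogQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient.Builder("http://solr01:8983/solr/logs").build();

    SolrQuery q = new SolrQuery("recips:user@example.com");
    // Stock Solr distributed search: fan the query out to the listed shards and
    // let the coordinating node merge the results. The Rackspace plugin instead
    // computed this shard list per query so only relevant nodes were hit.
    q.set("shards", "solr01:8983/solr/logs,solr02:8983/solr/logs");
    q.setRows(50);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("timestamp") + "\t" + doc.getFieldValue("sender"));
    }
    solr.close();
  }
}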
Implementation: Example
Implementation: Timeframe
- Development
  - Developed by a team of 1.5 in 3 months (indexing, statistics)
- Deployment
  - Developers acted as the operations team; the Cloudera deployment resolved this problem
- Roadblocks
  - Bumped into job-size limitations (resolved now)
Advantages: Storage
- Raw Logs: 3 days
  - For debugging purposes and use by Engineering
  - Stored in HDFS
- Indexes: 7 days
  - Queryable via the Solr API
  - Stored on local disk
- Archived Indexes: 6+ months
  - Queryable via Hadoop, or use the API to ask for old data to be made accessible in Solr again
  - Stored in HDFS
Advantages: Analysis
- Java MapReduce API
  - For optimal performance of frequently run jobs (minimal sketch below)
- Apache Pig
  - Ideal for one-off queries; interactive development
  - No need to understand MapReduce (SQL replacement)
  - Extensible via UDFs
- Hadoop Streaming
  - For users comfortable with MapReduce who are in a hurry
  - Use any language (frequently Python)
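As a companion to the Pig example on the next slide, here is roughly what the "Java MapReduce API" route could look like for a frequently run report: counting messages per sender from tab-separated log records. The field layout, input path, and class names are hypothetical, not Rackspace code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SenderCounts {
  public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable key, Text line, Context ctx) throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        ctx.write(new Text(fields[0]), ONE);   // fields[0] assumed to be the sender
      }
    }
  }

  public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text sender, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(sender, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sender-counts");
    job.setJarByClass(SenderCounts.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}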
Pig Example

records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501');
flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips);
filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehouse\\.gov$';
cleantimes = FOREACH filtered GENERATE sender, (us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) AS timestamp, rip, recips;
grouped = GROUP cleantimes BY (sender, rip, timestamp);
counts = FOREACH grouped GENERATE group, COUNT(*);
hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) AS host, group.timestamp, $1;
DUMP hostcounts;
Advantages: Scalability, Cost, Community
- Scalability
  - Add or remove nodes at any time
  - Linearly increase processing and storage capacity
  - No code changes
- Cost
  - Only expansion cost is hardware; no licensing
- Community
  - Constant development and improvements; a stream of patches adding capability and performance
  - Companies like Cloudera exist to abstract away patch selection, trivialize deployment, and provide emergency support
Fin! Questions?

Editor's Notes

  • #14: All storage in Hadoop: no filers or SANs