Cross Datacenter Logs Processing
Rackspace Hosting – Hadoop World 2009
Stu Hood – Search Team Technical Lead
Date: October 2, 2009
Overview
- Use case: Background, Log Types, Querying
- Previous Solutions
- The Hadoop Solution
- Implementation: Collection, Index time, Query time
- Advantages of Hadoop: Storage, Analysis, Scalability, Community
Use Case: Background
- “Rackapps” – Email and Apps Division
  - Founded 1999, merged with Rackspace 2007
- Hybrid Mail Hosting
  - 40% of accounts have a mix of Exchange and Rackspace Email
  - Fantastic Control Panel to juggle accounts
  - Webmail client with calendar/contact/note sharing
  - More Apps to come
- Environment
  - 1K+ servers at 3 of 6 Rackspace datacenters
  - Breakdown: 80% Linux, 20% Windows
- Platforms: “Rackspace Email” (a custom email and application platform) and Microsoft Exchange
Use Case: Log Types
- MTA (mail delivery) logs: Postfix, Exchange, Momentum
- Spam and virus logs: Amavis
- Access logs: Dovecot, Exchange, httpd
Use Case: Querying
- Support Team needs to answer basic questions:
  - Mail Transfer – Was it delivered?
  - Spam – Why was this (not) marked as spam?
  - Access – Who (checked | failed to check) mail?
- Engineering asks more advanced questions:
  - Which delivery routes have the highest latency?
  - Which are the spammiest IPs?
  - Where in the world do customers log in from?
- Elsewhere: Cloud teams use Hadoop for even more mission-critical statistics
Previous Solutions
- V1 – Query at the Source (founding – 2006)
  - No processing: flat log files on each source machine
  - To query, Support escalates a ticket to Engineering
  - Queries take hours
  - 14 days available, single datacenter
- V2 – Bulk load to MySQL (2006 – 2007)
  - Process logs, bulk load into denormalized schema
  - Add merge tables for common query time ranges
  - SQL self joins to find log entries for a path (illustrated below)
  - Queries take minutes
  - 1 day available, single datacenter
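To make the V2 approach concrete, here is a hypothetical illustration of the "self join" query style: with per-stage log rows bulk loaded into a denormalized MySQL table, joining the table to itself on a queue id stitches the hops of one delivery back together. The table, columns, and connection details below are invented for the sketch, not the actual Rackspace schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PathLookup {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details and schema; the real system layered
    // merge tables (partitioned by time range) on top of tables like this.
    Connection db = DriverManager.getConnection(
        "jdbc:mysql://loghost/maillogs", "support", "secret");

    // Self join: pair the "received" row with the "delivered" row that share
    // a queue id, reconstructing a single message's path.
    String sql =
        "SELECT a.queue_id, a.ts AS received_at, b.ts AS delivered_at, b.relay " +
        "FROM mta_log a JOIN mta_log b ON a.queue_id = b.queue_id " +
        "WHERE a.stage = 'received' AND b.stage = 'delivered' AND a.recipient = ?";

    try (PreparedStatement stmt = db.prepareStatement(sql)) {
      stmt.setString(1, "user@example.com");
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("queue_id") + " delivered via " + rs.getString("relay"));
        }
      }
    }
    db.close();
  }
}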
The Hadoop Solution
- V3 – Lucene Indexes in Hadoop (2007 – Present)
- Raw logs collected and processed in Hadoop, with Lucene indexes as the intermediate format
- “Realtime” queries via Solr
  - Indexes merged to Solr nodes with 15 minute turnaround
  - 7 days stored uncompressed
  - Queries take seconds
- Long term querying via MapReduce and high level languages
  - Hadoop InputFormat for Lucene indexes (sketched below)
  - 6 months available for MR queries
  - Queries take minutes
- Multiple datacenters
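The deck mentions a Hadoop InputFormat for Lucene indexes so that archived index shards in HDFS can be fed directly into MapReduce jobs; the actual Rackspace implementation is not shown in the slides. The following is a minimal sketch of the idea, assuming Hadoop 2.x-era and Lucene 3.x-era APIs; the class name, local staging path, and stored field names are hypothetical.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexInputFormat extends FileInputFormat<NullWritable, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Treat every shard directory under the input path as one unsplittable unit.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path dir : getInputPaths(job)) {
      FileSystem fs = dir.getFileSystem(job.getConfiguration());
      for (FileStatus shard : fs.listStatus(dir)) {
        if (shard.isDirectory()) {
          splits.add(new FileSplit(shard.getPath(), 0, shard.getLen(), new String[0]));
        }
      }
    }
    return splits;
  }

  @Override
  public RecordReader<NullWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<NullWritable, Text>() {
      private IndexReader reader;
      private int doc = -1;
      private final Text value = new Text();

      public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        // Stage the shard from HDFS onto local disk so Lucene can open it.
        Path shard = ((FileSplit) split).getPath();
        File local = new File(System.getProperty("java.io.tmpdir"), shard.getName());
        FileSystem fs = shard.getFileSystem(context.getConfiguration());
        fs.copyToLocalFile(shard, new Path(local.getAbsolutePath()));
        reader = IndexReader.open(FSDirectory.open(local));
      }

      public boolean nextKeyValue() throws IOException {
        while (++doc < reader.maxDoc()) {
          if (reader.isDeleted(doc)) continue;
          Document d = reader.document(doc);
          // Hypothetical stored fields; emit one tab-separated record per document.
          value.set(d.get("sender") + "\t" + d.get("timestamp") + "\t" + d.get("recips"));
          return true;
        }
        return false;
      }

      public NullWritable getCurrentKey() { return NullWritable.get(); }
      public Text getCurrentValue() { return value; }
      public float getProgress() { return reader == null ? 0f : (float) doc / reader.maxDoc(); }
      public void close() throws IOException { if (reader != null) reader.close(); }
    };
  }
}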
The Hadoop Solution: Alternatives
- Splunk: great for realtime querying, but weak for long term analysis; archived data is not easily queryable
- Data warehouse package: weak for realtime querying, great for long term analysis
- Partitioned MySQL: mediocre solution to either goal; needed something similar to MapReduce for sharded MySQL
Implementation: Collection
- Software
  - Transport: syslog-ng, SSH tunnel between datacenters; considering Scribe/rsyslog/?
  - Storage: app to deposit logs into Hadoop using the Java API (see the sketch below)
- Hardware
  - Per datacenter: 2-4 collector machines, hundreds of source machines
  - Single datacenter: 30 node Hadoop cluster, 20 Solr nodes
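The collector-side "app to deposit to Hadoop using the Java API" is not shown in the deck. A minimal sketch of what such a depositor might look like with the standard HDFS FileSystem API follows; the class name and destination layout are invented.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogDepositor {
  public static void main(String[] args) throws Exception {
    // args[0]: a rotated log batch handed over by syslog-ng on the collector.
    Path localBatch = new Path(args[0]);

    // Picks up fs.defaultFS (the namenode address) from the cluster config on the classpath.
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);

    // One file per collector per batch; the destination layout here is hypothetical.
    Path dest = new Path("/logs/raw/" + System.currentTimeMillis() + "-" + localBatch.getName());
    hdfs.copyFromLocalFile(localBatch, dest);
    hdfs.close();
  }
}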
Implementation: Indexing/Querying
- Indexing
  - Unique processing code for schema’d and unschema’d logs
  - SolrOutputFormat generates compressed Lucene indexes
- Querying
  - “Realtime”: sharded Lucene/Solr instances merge index chunks from Hadoop
    - Using the Solr API; a plugin optimizes sharding so queries are distributed only to relevant nodes, and Solr merges the results (see the sketch below)
  - Raw Logs: using Hadoop Streaming and unix grep
  - MapReduce
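The slides describe querying the merged indexes through the Solr API, with a custom plugin routing each query only to the shards that can match; that plugin is not published in the deck. The sketch below shows only stock SolrJ distributed search, assuming a reasonably recent SolrJ client, with hostnames, core name, and field names invented for illustration.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class RealtimeLogQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrClient solr = new HttpSolrClient.Builder("http://solr01:8983/solr/logs").build();

    SolrQuery q = new SolrQuery("recips:user@example.com");
    // Stock Solr distributed search: fan the query out to the listed shards and
    // let the coordinating node merge the results. The Rackspace plugin instead
    // computed this shard list per query so only relevant nodes were hit.
    q.set("shards", "solr01:8983/solr/logs,solr02:8983/solr/logs");
    q.setRows(50);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("timestamp") + "\t" + doc.getFieldValue("sender"));
    }
    solr.close();
  }
}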
Implementation: Example
Implementation: Timeframe
- Development
  - Developed by a team of 1.5 in 3 months (indexing, statistics)
- Deployment
  - Developers acted as the operations team; the Cloudera deployment resolved this problem
- Roadblocks
  - Bumped into job-size limitations (resolved now)
Advantages: Storage
- Raw Logs: 3 days
  - For debugging purposes and use by Engineering
  - Stored in HDFS
- Indexes: 7 days
  - Queryable via the Solr API
  - Stored on local disk
- Archived Indexes: 6+ months
  - Queryable via Hadoop, or use the API to ask for old data to be made accessible in Solr again
  - Stored in HDFS
Advantages: Analysis
- Java MapReduce API
  - For optimal performance of frequently run jobs (minimal sketch below)
- Apache Pig
  - Ideal for one-off queries; interactive development
  - No need to understand MapReduce (SQL replacement)
  - Extensible via UDFs
- Hadoop Streaming
  - For users comfortable with MapReduce who are in a hurry
  - Use any language (frequently Python)
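As a companion to the Pig example on the next slide, here is roughly what the "Java MapReduce API" route could look like for a frequently run report: counting messages per sender from tab-separated log records. The field layout, input path, and class names are hypothetical, not Rackspace code.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SenderCounts {
  public static class Map extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable key, Text line, Context ctx) throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        ctx.write(new Text(fields[0]), ONE);   // fields[0] assumed to be the sender
      }
    }
  }

  public static class Reduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text sender, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) total += c.get();
      ctx.write(sender, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sender-counts");
    job.setJarByClass(SenderCounts.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}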
Pig Example

records = LOAD 'amavis' USING us.webmail.pig.io.SolrSlicer('sender,timestamp,rip,recips', '1251777901', '1252447501');
flat = FOREACH records GENERATE FLATTEN(sender), FLATTEN(timestamp), FLATTEN(rip), FLATTEN(recips);
filtered = FILTER flat BY sender IS NOT NULL AND sender MATCHES '.*whitehouse\\.gov$';
cleantimes = FOREACH filtered GENERATE sender, (us.webmail.pig.udf.FromSolrLong(timestamp) / 3600 * 3600) AS timestamp, rip, recips;
grouped = GROUP cleantimes BY (sender, rip, timestamp);
counts = FOREACH grouped GENERATE group, COUNT(*);
hostcounts = FOREACH counts GENERATE group.sender, us.webmail.pig.udf.ReverseDNS(group.rip) AS host, group.timestamp, $1;
DUMP hostcounts;
Advantages: Scalability, Cost, Community
- Scalability
  - Add or remove nodes at any time
  - Linearly increase processing and storage capacity
  - No code changes
- Cost
  - Only expansion cost is hardware; no licensing
- Community
  - Constant development and improvements; a stream of patches adding capability and performance
  - Companies like Cloudera exist to abstract away patch selection, trivialize deployment, and provide emergency support
Fin! Questions?

Editor's Notes

  • #14: All storage in Hadoop: no filers or SANs