Hadoop For OpenStack Log Analysis
Havana Summit, April 16th, 2013

Mike Pittaro
Principal Architect, Big Data Solutions


@pmikeyp            Freenode: mikeyp
michael_pittaro@dell.com
The Problem: Operating OpenStack at Scale




2                                     Revolutionary Cloud Team
The Search for the Holy Grail of OpenStack Operations

    • Imagine if we could follow a request …
       –   Through the entire system …
       –   Across compute, storage, network …
       –   Independent of physical nodes …
       –   With timestamps …
       –   Correlated with events outside OpenStack …




It’s Easy!

    • Collect Logs
    • Analyze?
    • Sleep!




OpenStack Log Analysis is a Big Data Problem

    Big Data is when the data itself is part of the problem.

    • Volume
       – A large amount of data, growing at a high rate
    • Velocity
       – The speed at which the data must be processed
    • Variety
       – The range of data types and data structures


Initial Focus and Scope of Our Efforts

                        • The Operators
                           – Assist in running OpenStack
                           – Not for tenants
                        • The Data
                           – Load all detail into Hadoop
                           – Extract and index significant fields
                           – Enable future analysis
                        • The Patterns
                           – What works
                           – What is repeatable across installs
                           – How can we collaborate



Hadoop 101 – Simplified Block Diagram

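    [Block diagram, listed from the top layer to the bottom layer]
    • Oozie (Workflow), HUE (Web UI), Mahout (Machine Learning)
    • Pig (Batch Language), Hive (SQL-ish query)
    • HBase Database, MapReduce Distributed Processing
    • HDFS Distributed Storage
    • Alongside the stack: Sqoop, Flume, JDBC, APIs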


The Big Pieces

    • Log Collection
       – Continuous streaming of log data into Hadoop from OpenStack
    • Intelligent Log parsing, Indexing and Search
          – Should ‘know’ about OpenStack

    • Well Defined Storage Organization
          – Defined schema for the data
          – Predefined queries for a high-level status dashboard

    • Straightforward implementation pattern
          – Add as little complexity as possible

    • Ability to perform deeper analysis
          – Hadoop enables this



OpenStack Log Analysis Block Diagram
    [Block diagram]
    • OpenStack Nodes: Python Logging, Syslog, FlumeNG Agent
    • Nagios + Ganglia, Sqoop / Flume
    • Avro (labelled on the paths toward HDFS)
    • HDFS
    • Query: Pig, Hive
    • MapReduce Jobs
    • Search: Solr Cloud, Lucene Indexes



Current Development Status

     • Batch Only, no Flume Collection
     • Converting logs to Avro format
     • First cut of schema in place
     • Loading into Hadoop
     • Processing into Solr indexes
     • Starting to look at data
        – Solr Searches
        – Pig scripts




Schema Thoughts
     2013-03-26 11:57:41 WARNING nova.db.sqlalchemy.session [req-ace2ccc0-919e-4fd1-9f3a-671c0c87d28f None None] Got mysql server has gone away: (2006, 'MySQL server has gone away')

     {"namespace": "logfile.openstack",
      "type": "record",
      "name": "logentry",
      "fields": [
          {"name": "hostname", "type": "string"},
          {"name": "date", "type": "string"},
          {"name": "time", "type": "string"},
          {"name": "level", "type": "string"},
          {"name": "module", "type": "string"},
          {"name": "request_id1", "type": ["string", "null"]},
          {"name": "request_id2", "type": ["string", "null"]},
          {"name": "request_id3", "type": ["string", "null"]},
          {"name": "data", "type": "string"}
      ]
     }
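
     As a rough illustration of how the sample log line could map onto the schema fields, here is a minimal Python sketch. It is not the parser used in this work. The regular expression only targets the line format shown on this slide, the function name parse_log_line is hypothetical, the hostname is assumed to be supplied by the collector because it does not appear in the line itself, and turning the literal text "None" into a null value is an assumption.

```python
import re

# Pattern for the sample line on this slide only; other OpenStack log
# formats (an assumption) would need their own patterns.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<module>\S+) "
    r"\[(?P<request_id1>\S+) (?P<request_id2>\S+) (?P<request_id3>\S+)\] "
    r"(?P<data>.*)$"
)


def parse_log_line(line, hostname):
    """Return a dict keyed by the schema field names, or None if no match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # The hostname is not part of the line; the caller provides it (assumption).
    record["hostname"] = hostname
    # Map the literal text "None" to a null value for the nullable fields.
    for key in ("request_id1", "request_id2", "request_id3"):
        if record[key] == "None":
            record[key] = None
    return record
```

     Lines that do not match return None, so a batch job could count or set aside unparsed input instead of failing.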


Demo: Where we are today

     • Solr Indexing and Search
        – Example of indexed fields and searching.
     • Pig for batch analysis
        – Reconstruct a sequence of messages related to an API request
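
     The demo used Pig for this step. Purely as an illustration of the idea, the grouping can be sketched in Python. The sketch assumes each record is a dict with the schema's date, time and request_id1 fields, and that request_id1 carries the API request ID. The function name group_by_request is hypothetical.

```python
from collections import defaultdict


def group_by_request(records):
    """Group log records by request ID and order each group by timestamp."""
    by_request = defaultdict(list)
    for record in records:
        request_id = record.get("request_id1")
        # Records without a request ID cannot be tied to an API request.
        if request_id is not None:
            by_request[request_id].append(record)
    for entries in by_request.values():
        # Zero-padded date and time strings sort chronologically as text.
        # The sort is stable, so records sharing a one-second timestamp
        # keep their input order.
        entries.sort(key=lambda r: (r["date"], r["time"]))
    return dict(by_request)
```

     Because the sample timestamps have one-second resolution, the order of messages within the same second is only as reliable as the order in which they were read.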




Data Collection Thoughts

     • Sources
        –   OpenStack subsystems
        –   Syslog files
        –   Nagios and Ganglia
        –   General Infrastructure Data
             – Network Switches / Routers

     • Input Formats
        – Mostly Semi-structured text
        – Subsystem, timestamp, hostname, severity and error level are important
     • Output Formats
        – Avro
        – Thrift
        – Protocol Buffers

Log Collection Thoughts

     • Well understood patterns
        – Evolving best practices
     • Commonly Used Tools
        – Kafka
        – Scribe
        – Flume and FlumeNG
     • Key Requirements
        –   Distributed
        –   Reliable
        –   Aggregators – consolidate streams
        –   Store and Forward – when links are down
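
     To make the store-and-forward requirement concrete, here is a minimal in-memory Python sketch. It does not show how Kafka, Scribe or Flume implement it. The class name StoreAndForwardBuffer, the send callable and the max_events limit are all hypothetical, and a real collector would presumably persist pending events to disk rather than hold them in memory.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hold events while the downstream link is down, then forward in order."""

    def __init__(self, send, max_events=10000):
        # send is a caller-supplied callable that raises ConnectionError
        # when the link is down (assumption for this sketch).
        self._send = send
        # A bounded deque silently drops the oldest event once it is full.
        self._pending = deque(maxlen=max_events)

    def submit(self, event):
        """Queue an event and try to forward everything that is pending."""
        self._pending.append(event)
        return self.flush()

    def flush(self):
        """Forward pending events oldest first; return how many were sent."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # Link is still down; keep the event for later.
            self._pending.popleft()
            sent += 1
        return sent
```

     An event is removed from the buffer only after send returns without raising, so a failed attempt leaves it queued for the next flush.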



Storage Organization Thoughts

     • File Organization within Hadoop
        – Naming Convention
        – Directory Organization


     • Data Lifecycle
        –   Input and Staging
        –   ‘Hot’ and ‘Cold’ Data
        –   Tiered Indexes
        –   Compression
        –   Archival and Deletion
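
     One possible naming and directory convention is sketched below in Python, only to make the discussion concrete. The layout (root, lifecycle stage, year, month, day, then a file named after host and module) is an assumption and not a convention taken from this work. The function name storage_path, the default root /logs/openstack and the stage names are hypothetical.

```python
import posixpath


def storage_path(record, stage="staging", root="/logs/openstack"):
    """Build a date-partitioned HDFS-style path for a parsed log record.

    The layout is hypothetical: <root>/<stage>/<YYYY>/<MM>/<DD>/<host>-<module>.avro
    """
    # The schema stores the date as a YYYY-MM-DD string.
    year, month, day = record["date"].split("-")
    filename = "{}-{}.avro".format(record["hostname"], record["module"])
    # posixpath keeps forward slashes regardless of the local platform.
    return posixpath.join(root, stage, year, month, day, filename)
```

     Partitioning by date first would let lifecycle steps such as compression, archival and deletion operate on whole directories, which is one reason to settle the convention early.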




What should we do next?

     • Document the basic patterns so far?
     • Are there any related efforts?
     • Begin deeper discussions
        – Lots of decisions to make
        – Need community input and suggestions
     • Collaborate on Schema Design
     • What upstream OpenStack changes are needed?
     • Get sample logs into the hands of Hadoopians
        – Cleansed reference log sets would be useful



References

     • The unified logging infrastructure for data analytics at Twitter
     • Building LinkedIn’s Real-time Activity Data Pipeline
     • Advances and challenges in log analysis
     • BP: Ceilometer HBase Storage Backend
     • BP: Cross Service Request ID
     • Log Everything All the Time


     • Holy Grail, Gangam Style



