Inside Flume

                            Henry Robinson
                          henry@cloudera.com
                               @henryr




Tuesday, 17 August 2010
Who am I?

  • Distributed systems guy

  • Apache ZooKeeper committer

  • I work at Cloudera on Flume, ZooKeeper, Hue, more...

  • p.s. Cloudera is hiring!




                          Copyright 2010 Cloudera Inc. All rights reserved


About Cloudera

  • Software, services and support for Hadoop
  • Built around an open core
        • All our patches get contributed upstream
        • Flume and Hue are open-source
        • We just started the Whirr project
  • We maintain, package and support Cloudera’s Distribution
    for Hadoop
        • Smoothing off a lot of the rough edges around Hadoop
        • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
          Pig, Hue, Flume and more.


What’s the problem?

  • Data collection is currently a priori and ad hoc

  • A priori - decide what you want to collect ahead of time

  • Ad hoc - Each kind of data source goes through its own
    collection path
        • Usually a collection of fragile, custom scripts




What is Flume? (and how can it help?)

  • Flume is:
        •   A distributed data collection service
        •   Scalable
        •   Configurable
        •   Extensible
        •   Manageable
        •   Open source
  • How can it help?
        • One-stop solution for data collection of all formats
        • Flexible reliability guarantees allow careful performance tuning
        • Enables quick iteration on new collection strategies


The Flume Model

  • Built around the concept of flows
  • A single flow corresponds to a type of data source
        • Like web server logs
        • Or machine monitoring metrics
  • Different flows might have different compression,
    batching or reliability setups
        • Flume multiplexes many flows onto one service instance
  • Flows are composed of nodes chained together
        • Each Flume process can run many nodes, so resources are
          shared
        • Each node receives data at its source, and sends it to its sink


Flume Flows

  • Three typical flows, all on the same Flume service


        • Flow 1: Web-clicks (reliable delivery, compressed, batched)
        • Flow 2: Process monitoring (best-effort delivery)
        • Flow 3: Advert impressions (reliable delivery)

  [Figure: each flow takes in DATA on one side and emits EVENTS on the other.]




Anatomy of a Flume node

  • Data come in through a source...
  • ... are optionally processed by one or more decorators...
  • ... and then are transmitted out via a sink
  • Each of these components is (re-)configurable at run time
  • Each has a very simple API, and a plugin interface that
    makes customizing Flume very easy
  • These simple abstractions are sufficient to build more
    complex features like acknowledged delivery, filtering,
    compression
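
The source, decorator and sink chain described above can be sketched in a few lines of Python. This is only an illustration of the shape of a node, not Flume's real plugin API: the names run_node and add_host, the dict used as an event, and the "web01" value are all assumptions made for the example.

```python
from typing import Callable, Iterable, List

# Hypothetical types for illustration; these are not Flume's real interfaces.
Event = dict                                # stand-in for a Flume event
Decorator = Callable[[Event], Event]        # transforms one event
Sink = Callable[[Event], None]              # consumes one event


def run_node(source: Iterable[Event], decorators: List[Decorator], sink: Sink) -> int:
    """Pull events from the source, apply each decorator in order, pass to the sink."""
    count = 0
    for event in source:
        for decorate in decorators:
            event = decorate(event)
        sink(event)
        count += 1
    return count


def add_host(event: Event) -> Event:
    # Example decorator: attach an extra key to the event (value is made up).
    return {**event, "host": "web01"}


delivered: List[Event] = []
n = run_node([{"body": b"GET /"}, {"body": b"GET /about"}], [add_host], delivered.append)
```

Because each stage is just a callable here, swapping the source, adding a decorator, or replacing the sink does not touch the other two, which is the property the slide attributes to the node design.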

Agents and Collectors

  • Nodes that receive data from an application are called
    agents
  • Flume supports many sources for agents, including:
        •   Syslog
        •   Tailing a file
        •   Unix processes
        •   Scribe API
        •   Twitter
  • Nodes that write data to permanent storage are called
    collectors
        • Most often they write to HDFS


Flume Nodes

  • Each role may be played by many different nodes
  • Usually require substantially fewer collectors than agents

  [Figure: three nodes chained together, HTTPD to HDFS]
        • Agent: source tails the Apache HTTPD logs; sink sends to the
          downstream processor node
        • Processor: source is the upstream agent node; decorator extracts
          the browser name from the log string and attaches it to the
          event; sink sends to the downstream collector node
        • Collector: source is the upstream processor node; sink writes to
          HDFS://namenode/weblogs/%{browser}/


Flume Events

  • All data are transformed into a series of events

  • Events are a pair (body, metadata)

  • Body is a string of bytes

  • Metadata is a table mapping keys to values
        • Flume can use this to inform processing
        • Or simply write it with the event


The Flume Configuration Language

  • Node configurations are written in a simple language
        • my-flume-node : src | { decorator => sink }
  • For example: a configuration to read HTTP log data from
    a file and send it to a collector:
        • web-log-agent : tail("/var/log/httpd.log") | agentBESink
  • On the collector, receive data and bucket it according to
    browser:
        • web-log-collector : autoCollectorSource
          | { regex("(Firefox|Internet Explorer)", "browser") =>
          collectorSink("hdfs://namenode/flume-logs/%{browser}") }
  • Two lines to set up an entire flow
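
To make the collector example concrete, the Python sketch below imitates what the regex decorator and the %{browser} escape appear to do: pull a value out of the event body, store it as metadata, then substitute it into the output path. The functions regex_decorator and expand_path are hypothetical, and the "unknown" fallback for a missing key is an assumption rather than documented Flume behaviour.

```python
import re
from typing import Dict, Tuple

Event = Tuple[bytes, Dict[str, str]]   # (body, metadata); hypothetical representation


def regex_decorator(pattern: str, key: str):
    """Return a decorator that stores the first capture group under `key`."""
    compiled = re.compile(pattern)

    def decorate(event: Event) -> Event:
        body, metadata = event
        match = compiled.search(body.decode("utf-8", errors="replace"))
        if match:
            metadata = {**metadata, key: match.group(1)}
        return body, metadata

    return decorate


def expand_path(template: str, metadata: Dict[str, str], default: str = "unknown") -> str:
    """Replace each %{name} in the template with the matching metadata value."""
    return re.sub(r"%\{(\w+)\}", lambda m: metadata.get(m.group(1), default), template)


tag_browser = regex_decorator("(Firefox|Internet Explorer)", "browser")
body, metadata = tag_browser((b"Mozilla/5.0 Gecko/20100722 Firefox/3.6.8", {}))
path = expand_path("hdfs://namenode/flume-logs/%{browser}", metadata)
```

With the sample log line above, the event ends up bucketed under a Firefox directory; a line matching neither browser would fall into the assumed "unknown" bucket.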


Keeping Track of Nodes

  • The master service monitors all Flume nodes
        • A single port-of-call for checking on the health of your Flume
          service
  • Send commands to the master, and it will forward them
    to the nodes
  • The Flume Shell is a convenient, scriptable command-line
    tool
  • Web-based UIs are also available



Flume as a Distributed System

  • Fundamental principle: Keep state out of the data path
    where possible
        •   Replication is costly
        •   Consistency is problematic
        •   Global knowledge is impractical
        •   Follow the end-to-end principle - put smarts at the edges
  • Advantages
        • Failures become much cheaper
        • Performance is better
  • Disadvantages
        • Have to weaken some delivery guarantees
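
One way to read "put smarts at the edges" is that only the first and last node keep delivery state, while anything in between simply forwards. The Python sketch below works under that assumption. The Agent and Collector classes, the CRC32 batch tag and every method name are invented for illustration; this is not Flume's actual acknowledgement protocol.

```python
import zlib
from typing import Dict, List


class Agent:
    """Edge node: remembers unacknowledged batches so it can resend them."""

    def __init__(self) -> None:
        self.pending: Dict[int, List[bytes]] = {}

    def send(self, batch: List[bytes]) -> int:
        tag = zlib.crc32(b"".join(batch))   # hypothetical batch identifier
        self.pending[tag] = batch
        return tag

    def acknowledge(self, tag: int) -> None:
        self.pending.pop(tag, None)         # safe to forget once stored

    def unacknowledged(self) -> List[List[bytes]]:
        return list(self.pending.values())


class Collector:
    """Edge node: stores a batch, then reports its tag as durable."""

    def __init__(self) -> None:
        self.stored: List[bytes] = []

    def write(self, tag: int, batch: List[bytes]) -> int:
        self.stored.extend(batch)
        return tag


agent, collector = Agent(), Collector()
first = agent.send([b"a", b"b"])
second = agent.send([b"c"])
agent.acknowledge(collector.write(first, agent.pending[first]))
# `second` was never delivered, so the agent still holds it for a retry.
```

A resend after a lost acknowledgement would store the batch twice, which is one concrete form of the weakened delivery guarantee listed under Disadvantages.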


Scalability and reliability in Flume

  • The data path is ‘horizontally scalable’
        • Add more machines, get more performance
        • Typically the bottleneck is write performance at the collector
        • If machines fail, others automatically take their place
  • The master only requires a few machines
        • Consistency and replication handled by ZooKeeper + gossip
        • A cluster of five or seven machines can handle thousands of
          nodes
        • Can add more if you manage to hit the limit
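
The "others automatically take their place" behaviour on the data path can be pictured as a sink that falls back to the next collector when one is unreachable. The Python sketch below is an assumption about the general technique, not Flume's implementation; FailoverSink and dead_collector are hypothetical names.

```python
from typing import Callable, List, Sequence

Sink = Callable[[bytes], None]


class FailoverSink:
    """Try each downstream sink in order until one accepts the event."""

    def __init__(self, sinks: Sequence[Sink]) -> None:
        self.sinks = list(sinks)

    def append(self, body: bytes) -> int:
        for index, sink in enumerate(self.sinks):
            try:
                sink(body)
                return index        # index of the sink that accepted the event
            except ConnectionError:
                continue            # this collector is down; try the next one
        raise ConnectionError("no collector available")


def dead_collector(body: bytes) -> None:
    # Simulates a collector machine that has failed.
    raise ConnectionError("collector-a unreachable")


received: List[bytes] = []
sink = FailoverSink([dead_collector, received.append])
used = sink.append(b"click")
```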



Flume as Open Source

  • http://github.com/cloudera/flume
  • Already vibrant contributor community
  • Flume 0.9.1 is at release candidate 0 right now

  • Cloudera provides
        • Packages
        • Standardisation
        • Support




