Arinto Murdopo
Josep Subirats
Group 4
EEDC 2012
Outline
● Current problem
● What is Apache Flume?
● The Flume Model
  ○ Flows and Nodes
  ○ Agent, Processor and Collector Nodes
  ○ Data and Control Path
● Flume goals
  ○ Reliability
  ○ Scalability
  ○ Extensibility
  ○ Manageability
● Use case: Near Realtime Aggregator
 
Current Problem
● Situation:
You have hundreds of services running on different servers
that produce many large logs, which should be analyzed
together. You have Hadoop to process them.
 
● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!
What is Apache Flume?
● It is a distributed data collection service that takes
  flows of data (such as logs) from their sources and
  delivers them, aggregated, to where they will be processed.
● Goals: reliability, scalability, extensibility,
  manageability.

Exactly what I needed!
The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
  logs, machine monitoring metrics...).
● Flows are composed of nodes chained together (see
  "Agent, Processor and Collector Nodes").
The Flume Model: Flows and Nodes
● In a node, data come in through a source...
   ...are optionally processed by one or more decorators...
   ...and are then transmitted out via a sink.

   Source examples: Console, Exec, Syslog, IRC,
   Twitter, other nodes...

   Sink examples: Console, local files, HDFS, S3,
   other nodes...

   Decorator examples: wire batching, compression,
   sampling, projection, extraction...
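
The source, decorator and sink path through a single node can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Flume's API: the names `Node`, `sample_every` and `to_upper` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
# A decorator (in the Flume sense) transforms a stream of events.
Decorator = Callable[[Iterable[Event]], Iterator[Event]]


def sample_every(n: int) -> Decorator:
    """Hypothetical sampling decorator: keep every n-th event."""
    def apply(events: Iterable[Event]) -> Iterator[Event]:
        for i, event in enumerate(events):
            if i % n == 0:
                yield event
    return apply


def to_upper(events: Iterable[Event]) -> Iterator[Event]:
    """Hypothetical decorator that rewrites each event body."""
    for event in events:
        yield event.upper()


class Node:
    """Toy node: reads a source, applies decorators in order, writes to a sink."""

    def __init__(self, source: Iterable[Event], decorators: List[Decorator],
                 sink: Callable[[Event], None]) -> None:
        self.source = source
        self.decorators = decorators
        self.sink = sink

    def run(self) -> None:
        stream: Iterable[Event] = self.source
        for decorate in self.decorators:
            stream = decorate(stream)
        for event in stream:
            self.sink(event)


collected: List[Event] = []
node = Node(source=["a", "b", "c", "d"],
            decorators=[sample_every(2), to_upper],
            sink=collected.append)
node.run()
print(collected)  # ['A', 'C']
```

Because a sink is just a callable here, the sink of one node could be the source-side entry point of another, which is how "other nodes" can appear in both the source and the sink examples.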
The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
    receives data from an
    application.
 
● Processor (optional):
    performs intermediate processing.
 
● Collector:
    writes data to permanent
    storage.
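
The three tiers can be sketched as functions chained together. This is a minimal illustration, not Flume code: `make_agent`, `make_processor` and `make_collector` are hypothetical names, the list `storage` stands in for permanent storage, and dropping DEBUG events is just one example of intermediate processing.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
Forward = Callable[[Event], None]


def make_collector(storage: List[Event]) -> Forward:
    """Collector tier: appends every event to the stand-in permanent storage."""
    def collect(event: Event) -> None:
        storage.append(event)
    return collect


def make_processor(downstream: Forward) -> Forward:
    """Optional processor tier: drops DEBUG events and forwards the rest."""
    def process(event: Event) -> None:
        if event["level"] != "DEBUG":
            downstream(event)
    return process


def make_agent(host: str, downstream: Forward) -> Callable[[str, str], None]:
    """Agent tier: receives a log line from an application and tags it with its host."""
    def receive(level: str, body: str) -> None:
        downstream({"host": host, "level": level, "body": body})
    return receive


storage: List[Event] = []
collector = make_collector(storage)
processor = make_processor(collector)
web1 = make_agent("web1", processor)
web2 = make_agent("web2", processor)

web1("INFO", "GET /index.html")
web2("DEBUG", "cache miss")
web2("ERROR", "upstream timeout")
print([e["host"] + ":" + e["level"] for e in storage])  # ['web1:INFO', 'web2:ERROR']
```

Since the processor tier is optional, an agent could also be wired straight to the collector by passing `collector` as its downstream.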
The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters are coordinated through ZooKeeper (ZK).
● Specify sources and sinks, and control data flows.
Flume Goals: Reliability
Tunable Failure Recovery Modes
 
● Best Effort
 
● Store on Failure and Retry
 
● End to End Reliability
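
The difference between the three modes can be sketched with a toy downstream that is temporarily unavailable. This is a simplified model, not Flume's implementation: all class names are hypothetical, in-memory lists stand in for local disk, and a `True` return from `send` stands in for an acknowledgement.

```python
from typing import Callable, List

Send = Callable[[str], bool]  # True means the downstream acknowledged the event


class FlakyDownstream:
    """Stand-in for the next hop; refuses events while `up` is False."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def send(self, event: str) -> bool:
        if not self.up:
            return False
        self.received.append(event)
        return True


class BestEffort:
    """Send once; a failed send is ignored, so the event is lost."""

    def __init__(self, send: Send) -> None:
        self.send = send

    def append(self, event: str) -> None:
        self.send(event)


class StoreOnFailure:
    """Send once; on failure keep the event locally and retry later."""

    def __init__(self, send: Send) -> None:
        self.send = send
        self.backlog: List[str] = []

    def append(self, event: str) -> None:
        if not self.send(event):
            self.backlog.append(event)

    def retry(self) -> None:
        self.backlog = [e for e in self.backlog if not self.send(e)]


class EndToEnd:
    """Log every event first; forget it only once it has been acknowledged."""

    def __init__(self, send: Send) -> None:
        self.send = send
        self.log: List[str] = []

    def append(self, event: str) -> None:
        self.log.append(event)
        self.flush()

    def flush(self) -> None:
        self.log = [e for e in self.log if not self.send(e)]


def outage_then_recovery(mode_cls) -> List[str]:
    """Append one event during an outage, recover, and report what arrived."""
    downstream = FlakyDownstream()
    mode = mode_cls(downstream.send)
    downstream.up = False
    mode.append("e1")
    downstream.up = True
    for hook in ("retry", "flush"):
        if hasattr(mode, hook):
            getattr(mode, hook)()
    return downstream.received


print(outage_then_recovery(BestEffort))      # []
print(outage_then_recovery(StoreOnFailure))  # ['e1']
print(outage_then_recovery(EndToEnd))        # ['e1']
```

In this toy the last two modes deliver the same result. The intended distinction is that store-on-failure only protects against a failing next hop, while end-to-end logs every event before sending and treats the acknowledgement as coming from the final destination, which costs more per event.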
Flume Goals: Scalability
Horizontally Scalable Data Path

Load Balancing
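
One simple way to picture load balancing on the data path is to spread agents over the available collectors, so that adding a collector lowers the load on each. The round-robin policy below is an assumption chosen for illustration, not necessarily the policy Flume uses; `assign_round_robin` and the node names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_round_robin(agents: List[str], collectors: List[str]) -> Dict[str, str]:
    """Map each agent to a collector, cycling through the collectors in order."""
    return {agent: collectors[i % len(collectors)]
            for i, agent in enumerate(agents)}


agents = [f"agent{i}" for i in range(6)]
two = assign_round_robin(agents, ["collectorA", "collectorB"])
three = assign_round_robin(agents, ["collectorA", "collectorB", "collectorC"])

# Agents per collector before and after adding a third collector.
print(Counter(two.values()))
print(Counter(three.values()))
```

With six agents, each of two collectors serves three agents; with three collectors, each serves two.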
Flume Goals: Scalability
Horizontally Scalable Control Path
Flume Goals: Extensibility
● Simple Source and Sink API
  ○ Event streaming and composition of simple
       operations
 
● Plug-in Architecture
  ○ Add your own sources, sinks, decorators
    
    
Flume Goals: Manageability
Centralized Data Flow Management Interface
 
Flume Goals: Manageability
Configuring Flume
 
 
   Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
Output Bucketing
   /logs/web/2010/0715/1200/data-xxx.txt
   /logs/web/2010/0715/1200/data-xxy.txt
   /logs/web/2010/0715/1300/data-xxx.txt
   /logs/web/2010/0715/1300/data-xxy.txt
   /logs/web/2010/0715/1400/data-xxx.txt
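
The paths above suggest that output is bucketed by year, month and day, and hour. A minimal sketch of how such a path could be derived from an event timestamp follows; `bucket_path` and its `file_id` parameter are hypothetical, and the layout is inferred from the example paths rather than taken from Flume's configuration syntax.

```python
from datetime import datetime


def bucket_path(base: str, timestamp: datetime, file_id: str) -> str:
    """Build an hourly bucket path of the form <base>/YYYY/MMDD/HH00/data-<id>.txt."""
    bucket = timestamp.strftime("%Y/%m%d/%H00")
    return f"{base}/{bucket}/data-{file_id}.txt"


print(bucket_path("/logs/web", datetime(2010, 7, 15, 12, 34), "xxx"))
# /logs/web/2010/0715/1200/data-xxx.txt
print(bucket_path("/logs/web", datetime(2010, 7, 15, 13, 5), "xxy"))
# /logs/web/2010/0715/1300/data-xxy.txt
```

Events from the same hour land in the same directory, which lets a downstream Hadoop job process one hour of logs by reading one directory.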
Use Case: Near Realtime Aggregator
Conclusion
Flume is
● A distributed data collection service
 
● Suitable for enterprise settings
 
● Aimed at large amounts of log data that need processing
Q&A
Questions to be unveiled?
 
 
References
● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
● http://www.slideshare.net/cloudera/inside-flume
● http://www.slideshare.net/cloudera/flume-intro100715
● http://www.slideshare.net/cloudera/flume-austin-hug-21711
