APACHE FLUME
A SERVICE FOR STREAMING LOGS INTO
HADOOP
Apache Flume is a tool/service/data-ingestion mechanism
for collecting, aggregating, and transporting large amounts
of streaming data, such as log files and events, from various
sources to a centralized data store.
• Flume is a highly reliable, distributed, and
configurable tool. It is principally designed to copy
streaming data (log data) from various web servers
to HDFS.
ADVANTAGES OF FLUME
• The advantages of using Flume include:
• Using Apache Flume, we can store data in any of
several centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at
which data can be written to the destination, Flume acts
as a buffer between data producers and the
centralized stores, providing a steady flow of data
between them.
• Flume provides contextual routing.
• Transactions in Flume are channel-based: two
transactions (one on the sender side and one on the
receiver side) are maintained for each message, which
guarantees reliable message delivery.
• Flume is reliable, fault-tolerant, scalable, manageable,
and customizable.
FEATURES OF FLUME
• Some of the notable features of Flume are as follows:
• Flume efficiently ingests log data from multiple web
servers into a centralized store (HDFS, HBase).
• Using Flume, we can move data from multiple servers
into Hadoop as soon as it is produced.
• Along with log files, Flume can also import
huge volumes of event data produced by social
networking sites such as Facebook and Twitter, and
e-commerce websites such as Amazon and Flipkart.
• Flume supports a large set of source and destination
types.
• Flume supports multi-hop flows, fan-in and fan-out
flows, contextual routing, and more.
• Flume can be scaled horizontally.
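For example, a fan-out flow can be expressed directly in an agent's configuration by attaching one source to multiple channels (the agent name a1 and component names r1, c1, c2, k1, k2 below are hypothetical):

```properties
# One source replicating every event to two channels,
# each drained by its own sink (fan-out)
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```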
STREAMING / LOG DATA
• Most of the data to be analyzed is
produced by data sources such as application
servers, social networking sites, cloud servers, and
enterprise servers. This data arrives in the form of log
files and events.
• Log file: a file that lists
events/actions that occur in a system. For
example, web servers record every request made to the
server in their log files.
• By harvesting such log data, we can:
• monitor application performance and locate various software
and hardware failures.
• analyze user behavior and derive better business insights.
FLUME EVENT
• An event is the basic unit of data transported
through Flume. It consists of a byte-array payload
that is to be transported from the source to the
destination, accompanied by optional headers
(key-value pairs of metadata).
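Headers are commonly attached by interceptors configured on a source. As a sketch (the agent name a1 and source name r1 are hypothetical), the built-in timestamp and host interceptors add a timestamp and the agent's hostname as headers on every event:

```properties
# Attach interceptors to source r1 so each event carries
# "timestamp" and "host" headers alongside its byte-array body
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```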
FLUME AGENT
• An agent is an independent daemon process (a JVM)
in Flume. It receives data (events) from clients
or other agents and forwards them to the next
destination (a sink or another agent). A Flume deployment
may have more than one agent. The following diagram
represents a Flume agent.
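As a minimal sketch, an agent is defined in a properties file that names its components and wires them together (the agent name a1 and component names r1, c1, k1 are arbitrary):

```properties
# Name the agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read newline-terminated text from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory queue between source and sink
a1.channels.c1.type = memory

# Sink: log events (useful for testing)
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is then started with `flume-ng agent --conf conf --conf-file example.conf --name a1`.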
SINKS
• Sinks give Flume agents pluggable output
capability: if you need to write to a new type of
storage, you write a Java class that implements the
necessary interfaces. Like sources, sinks correspond
to a type of output: writes to HDFS or HBase,
remote procedure calls to other agents, or any
number of other external repositories. Sinks remove
events from the channel in transactions and write
them to the output. A transaction closes only when the
events are successfully written, ensuring that all events
are committed to their final destination.
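For instance, an HDFS sink is configured by setting its type and type-specific properties (the agent/component names and the path below are illustrative):

```properties
# Write events from channel c1 to HDFS, rolling files every 10 minutes
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# Use the agent's local time to resolve the %Y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write raw event bodies rather than SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 600
```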
ANATOMY OF A FLUME AGENT
• Apache Flume deploys as one or more agents,
each contained within its own instance of the Java
Virtual Machine (JVM). Agents consist of three
pluggable components: sources, sinks, and
channels. An agent must have at least one of each
in order to run. Sources collect incoming data
as events. Sinks write events out, and channels
provide a queue to connect the source and sink.
SOURCES
• Flume sources listen for and consume events.
Events can range from newline-terminated strings
on stdout to HTTP POSTs and RPC calls, depending
on what sources the agent is configured to
use. A Flume agent may have more than one
source, but must have at least one. Sources require
a name and a type; the type then dictates additional
configuration parameters.
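To illustrate how the type dictates the additional parameters: a netcat source needs a bind address and port, while a spooling-directory source needs a directory to watch (the names and path below are hypothetical):

```properties
# A netcat source requires a bind address and port...
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# ...while a spooling-directory source requires a directory to watch
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /var/log/flume-spool
```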
CHANNELS
• Channels are the mechanism by which Flume
agents transfer events from their sources to their
sinks. Events written to the channel by a source are
not removed from the channel until a sink removes
them in a transaction. This allows Flume sinks
to retry writes after a failure of the external
repository (such as HDFS) or of an outgoing network
connection. For example, if the network between a
Flume agent and a Hadoop cluster goes down, the
channel keeps all events queued until the sink
can successfully write to the cluster and close its
transactions with the channel.
• Channels come in two main types: in-memory
queues and durable disk-backed queues.
In-memory channels provide high
throughput but no recovery if an agent fails.
File- or database-backed channels, on the
other hand, are durable. They support full
recovery and event replay in the case of
agent failure.
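The two channel types are selected by configuration; a durable file channel takes on-disk checkpoint and data directories, while a memory channel only needs capacity settings (the names and paths below are hypothetical):

```properties
# Durable, disk-backed channel: survives agent restarts
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# In-memory channel: fast, but events are lost on failure
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 100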