Introduction to Flume
Flume
• Flume is a distributed, reliable tool/service for collecting large amounts of streaming data into centralized storage.
• Put simply, Flume is useful when we need to load/collect data continuously in real time, unlike a traditional RDBMS that loads data in periodic batches.
• One of its biggest advantages: when the rate of incoming data exceeds the rate at which data can be stored at its destination, Flume acts as a middleman between the data source and the data storage, providing a steady flow of data between them.
• Example: a log file is, in general, a file/record of system events; for instance, software writes to its log file whenever one of its operations fails. By analyzing such data one can understand the software's behavior and locate its failures.
• Whenever we transfer data to HDFS using the -put or -copyFromLocal commands, we can only transfer one file at a time. Flume was created to overcome this by transferring streaming data without such delays (see the sketch after this list).
• Another advantage of Flume is reliability when transferring data to HDFS: during a direct file transfer, the file's size in HDFS stays zero until the transfer finishes, so if a network issue or power failure strikes mid-transfer, that data never makes it into HDFS.
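For context, the one-file-at-a-time transfer mentioned above looks like this (a hedged sketch: the paths are hypothetical, and it assumes a running HDFS):

#Each invocation copies exactly one finished file into HDFS;
#a continuously growing log would need repeated manual copies.
hadoop fs -put /var/log/myapp/access.log /data/logs/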
Apache Flume - Architecture
• Source: extracts data from the clients and transfers it to one or more of Flume's channels. Source types include avro, netcat, seq, exec, syslogudp, http, twitter, etc.
• Channel: acts as a mediator between the source and the sink. It temporarily stores the events coming from the source, buffering them until they are consumed by the sinks. Channels come in many types: Memory Channel, File Channel, JDBC Channel, custom channels, etc.
• Sink: consumes events from the channel and transfers them to centralized storage such as HBase or HDFS. Some of the sink types are logger, avro, hdfs, irc, file_roll, etc. (A minimal end-to-end example follows.)
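To make the three components concrete, here is a minimal, self-contained agent definition in the same properties format used in the Configuration slides below (the names a1, r1, c1, k1 are placeholders, not from the original deck). It wires a netcat source through a memory channel to a logger sink:

#Name the components of a hypothetical agent 'a1'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#netcat source: listens on a TCP port for newline-separated events
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
#memory channel: buffers events in RAM
a1.channels.c1.type = memory
#logger sink: writes events to Flume's log, handy for testing
a1.sinks.k1.type = logger
#Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1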
• Agent: an independent daemon process; a collection of sources, channels, and sinks that receives data from clients or from other Flume agents and forwards it to its destination. A Flume agent can have multiple sources, sinks, and channels (a fan-out sketch follows below).
(Diagram: Flume Architecture)
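As a quick sketch of that multiplicity (placeholder names again; Flume's default replicating channel selector copies every event to each channel listed for the source):

#One source fanned out to two channels, each drained by its own sink
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2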
Channel Types:
• Memory Channels: backed by volatile memory, so Flume's capacity is limited by the available RAM. Whenever there is an interruption such as a power failure or a network issue, any data not yet delivered is lost. However, volatile memory carries one universal advantage: SPEED. Memory channels are faster than file-based channels.
• File Channels: robust channels that use disk instead of RAM to store events. A bit slower than memory channels, but with a solid advantage: events are not lost even if Flume's operation is interrupted by a power failure or a network issue. (A configuration sketch of both channel types follows.)
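A hedged side-by-side sketch of the two channel types in Flume's properties format (the agent/channel names and directories are hypothetical; capacity, transactionCapacity, checkpointDir, and dataDirs are standard channel properties):

#Memory channel: fast, but undelivered events vanish on failure
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 10000
agent1.channels.mc1.transactionCapacity = 1000
#File channel: slower, but events survive restarts and power failures
agent1.channels.fc1.type = file
agent1.channels.fc1.checkpointDir = /var/flume/checkpoint
agent1.channels.fc1.dataDirs = /var/flume/data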
Configuration
First enter Flume's conf folder and create the configuration file:
: cd conf
: ls
: vi flumepractice.conf then press i (insert mode) and type:
#Name the components
test.sources = ts1
test.sinks = tk1
test.channels = tc1
#Describe/Configure the source
test.sources.ts1.type = exec
test.sources.ts1.command = tail -F /home/cloudera/hadoop/logs/ ………..
#Describe the sink
test.sinks.tk1.type = hdfs
test.sinks.tk1.hdfs.path = hdfs://localhost:9001/flume
#Describe the channel
test.channels.tc1.type = memory
Configuration (continued)
#Join/Bind the source and sink to the channel
#Joining the source to the channel
test.sources.ts1.channels = tc1
#Join the sinks to the channel
test.sinks.tk1.channel = tc1
Press ‘esc’ and then type ‘:wq!’ to save and exit.
Then run the following command to start the Flume job:
Flume$ bin/flume-ng agent -n test -f conf/flumepractice.conf
where,
flume-ng is the Flume executable,
agent specifies that a Flume agent is to be executed,
-n gives the name of the agent as defined in the configuration file,
-f gives the path of the configuration file.
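Once the agent is running, one way to verify that events are landing (a usage sketch, assuming the sink path configured above; by default the HDFS sink writes files with the FlumeData prefix) is:

#List the sink directory to see the files Flume has rolled out
hadoop fs -ls hdfs://localhost:9001/flume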
Next
• Sqoop, to transfer bulk data between HDFS and structured databases.