APACHE FLUME
A SERVICE FOR STREAMING LOGS INTO
HADOOP
Apache Flume is a tool/service/data-ingestion mechanism
for collecting, aggregating, and transporting large amounts
of streaming data, such as log files and events, from various
sources to a centralized data store.
• Flume is a highly reliable, distributed, and
configurable tool. It is principally designed to copy
streaming data (log data) from various web servers
to HDFS.
ADVANTAGES OF FLUME
• The advantages of using Flume include:
• Using Apache Flume, we can store data in any of
several centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at
which data can be written to the destination, Flume acts
as a buffer between data producers and the
centralized stores, providing a steady flow of data
between them.
• Flume provides contextual routing.
• Transactions in Flume are channel-based: two
transactions (one on the sender side and one on the
receiver side) are maintained for each message, which
guarantees reliable message delivery.
• Flume is reliable, fault-tolerant, scalable, manageable,
and customizable.
FEATURES OF FLUME
• Some of the notable features of Flume are as follows:
• Flume efficiently ingests log data from multiple web
servers into a centralized store (HDFS, HBase).
• Using Flume, we can move data from multiple servers
into Hadoop as soon as it is produced.
• Along with log files, Flume can also import
huge volumes of event data produced by social
networking sites such as Facebook and Twitter, and
e-commerce websites such as Amazon and Flipkart.
• Flume supports a large set of source and destination
types.
• Flume supports multi-hop flows, fan-in and fan-out
flows, contextual routing, and more.
• Flume can be scaled horizontally.
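For example, a fan-out flow can be expressed directly in an agent's configuration by attaching one source to multiple channels (the agent name a1 and component names r1, c1, c2, k1, k2 below are hypothetical):

```properties
# One source replicating every event to two channels,
# each drained by its own sink (fan-out)
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
```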
STREAMING / LOG DATA
• Most of the data to be analyzed is
produced by data sources such as application
servers, social networking sites, cloud servers, and
enterprise servers. This data arrives in the form of log
files and events.
• Log file: a file that lists
events/actions that occur in a system. For
example, web servers record every request made to the
server in their log files.
• By harvesting such log data, we can:
• monitor application performance and locate various software
and hardware failures.
• analyze user behavior and derive better business insights.
FLUME EVENT
• An event is the basic unit of data transported
through Flume. It consists of a byte-array payload
that is to be transported from the source to the
destination, accompanied by optional headers
(key-value pairs of metadata).
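Headers are commonly attached by interceptors configured on a source. As a sketch (the agent name a1 and source name r1 are hypothetical), the built-in timestamp and host interceptors add a timestamp and the agent's hostname as headers on every event:

```properties
# Attach interceptors to source r1 so each event carries
# "timestamp" and "host" headers alongside its byte-array body
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```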
FLUME AGENT
• An agent is an independent daemon process (a JVM)
in Flume. It receives data (events) from clients
or other agents and forwards them to the next
destination (a sink or another agent). A Flume deployment
may have more than one agent. The following diagram
represents a Flume agent.
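As a minimal sketch, an agent is defined in a properties file that names its components and wires them together (the agent name a1 and component names r1, c1, k1 are arbitrary):

```properties
# Name the agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: read newline-terminated text from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory queue between source and sink
a1.channels.c1.type = memory

# Sink: log events (useful for testing)
a1.sinks.k1.type = logger

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is then started with `flume-ng agent --conf conf --conf-file example.conf --name a1`.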
SINKS
• Sinks give Flume agents pluggable output
capability: if you need to write to a new type of
storage, you write a Java class that implements the
necessary interfaces. Like sources, sinks correspond
to a type of output: writes to HDFS or HBase,
remote procedure calls to other agents, or any
number of other external repositories. Sinks remove
events from the channel in transactions and write
them to the output. A transaction closes only when the
events are successfully written, ensuring that all events
are committed to their final destination.
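For instance, an HDFS sink is configured by setting its type and type-specific properties (the agent/component names and the path below are illustrative):

```properties
# Write events from channel c1 to HDFS, rolling files every 10 minutes
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# Use the agent's local time to resolve the %Y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write raw event bodies rather than SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 600
```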
ANATOMY OF A FLUME AGENT
• Apache Flume deploys as one or more agents,
each contained within its own instance of the Java
Virtual Machine (JVM). Agents consist of three
pluggable components: sources, sinks, and
channels. An agent must have at least one of each
in order to run. Sources collect incoming data
as events. Sinks write events out, and channels
provide a queue to connect the source and sink.
SOURCES
• Flume sources listen for and consume events.
Events can range from newline-terminated strings
on stdout to HTTP POSTs and RPC calls, depending
on what sources the agent is configured to
use. A Flume agent may have more than one
source, but must have at least one. Sources require
a name and a type; the type then dictates additional
configuration parameters.
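To illustrate how the type dictates the additional parameters: a netcat source needs a bind address and port, while a spooling-directory source needs a directory to watch (the names and path below are hypothetical):

```properties
# A netcat source requires a bind address and port...
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# ...while a spooling-directory source requires a directory to watch
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /var/log/flume-spool
```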
CHANNELS
• Channels are the mechanism by which Flume
agents transfer events from their sources to their
sinks. Events written to the channel by a source are
not removed from the channel until a sink removes
them in a transaction. This allows Flume sinks
to retry writes after a failure of the external
repository (such as HDFS) or of an outgoing network
connection. For example, if the network between a
Flume agent and a Hadoop cluster goes down, the
channel keeps all events queued until the sink
can successfully write to the cluster and close its
transactions with the channel.
• Channels come in two main types: in-memory
queues and durable disk-backed queues.
In-memory channels provide high
throughput but no recovery if an agent fails.
File- or database-backed channels, on the
other hand, are durable. They support full
recovery and event replay in the case of
agent failure.
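The two channel types are selected by configuration; a durable file channel takes on-disk checkpoint and data directories, while a memory channel only needs capacity settings (the names and paths below are hypothetical):

```properties
# Durable, disk-backed channel: survives agent restarts
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# In-memory channel: fast, but events are lost on failure
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 100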