Apache Flume

Arinto Murdopo
Josep Subirats
Group 4
EEDC 2012

Outline
● Current problem
● What is Apache Flume?
● The Flume Model
○ Flows and Nodes
○ Agent, Processor and Collector Nodes
○ Data and Control Path
● Flume goals
○ Reliability
○ Scalability
○ Extensibility
○ Manageability
● Use case: Near Realtime Aggregator

Current Problem
● Situation:
You have hundreds of services running on different servers
that produce many large logs, which should be analyzed
together. You have Hadoop to process them.

● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!

What is Apache Flume?
● It is a distributed data collection service that takes
flows of data (such as logs) from their sources and
aggregates them at the place where they will be processed.
● Goals: reliability, scalability, extensibility,
manageability.

Exactly what I needed!

The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are comprised of nodes chained together (see
slide 7).

The Flume Model: Flows and Nodes
● In a Node, data come in through a source...
...are optionally processed by one or more decorators...
...and then are transmitted out via a sink.

Source examples: Console, Exec, Syslog, IRC,
Twitter, other nodes...

Sink examples: Console, local files, HDFS, S3,
other nodes...

Decorator examples: wire batching, compression,
sampling, projection, extraction...
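
The source, decorator and sink chain inside a node can be sketched in a few lines of Python. This is only an illustration of the idea, not Flume's API: the names `list_source`, `sample_every`, `upper_case` and `run_node` are hypothetical, and events are assumed to be plain strings.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[Event]) -> Iterator[Event]:
    """Hypothetical source: yields events from an in-memory collection."""
    yield from lines


def sample_every(n: int) -> Decorator:
    """Hypothetical sampling decorator: keeps every n-th event, starting with the first."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for index, event in enumerate(events):
            if index % n == 0:
                yield event
    return decorate


def upper_case(events: Iterator[Event]) -> Iterator[Event]:
    """Hypothetical transforming decorator: upper-cases each event."""
    for event in events:
        yield event.upper()


def run_node(source: Iterator[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> None:
    """Pull events from the source, pass them through each decorator in order, push to the sink."""
    events = source
    for decorator in decorators:
        events = decorator(events)
    for event in events:
        sink(event)


collected: List[Event] = []
run_node(list_source(["a", "b", "c", "d"]),
         [sample_every(2), upper_case],
         collected.append)
print(collected)  # ['A', 'C']
```

Because the decorators are optional, passing an empty list makes the node a plain source-to-sink pipe.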

The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
receives data from an
application.

● Processor (optional):
intermediate processing.

● Collector:
writes data to permanent
storage.
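
As a rough sketch of how the three tiers fit together, the Python below wires two agents through one processor into one collector. All class names and the `receive` method are hypothetical, and a list stands in for permanent storage.

```python
class Collector:
    """Last tier: writes events to permanent storage (a list stands in for it here)."""

    def __init__(self):
        self.storage = []

    def receive(self, event):
        self.storage.append(event)


class Processor:
    """Optional middle tier: transforms each event and forwards it downstream."""

    def __init__(self, downstream, transform):
        self.downstream = downstream
        self.transform = transform

    def receive(self, event):
        self.downstream.receive(self.transform(event))


class Agent:
    """First tier: receives lines from an application and tags them with the host name."""

    def __init__(self, host, downstream):
        self.host = host
        self.downstream = downstream

    def receive(self, line):
        self.downstream.receive({"host": self.host, "body": line})


collector = Collector()
# The processor strips surrounding whitespace from the event body.
processor = Processor(collector, lambda e: {**e, "body": e["body"].strip()})
web1 = Agent("web1", processor)
web2 = Agent("web2", processor)

web1.receive("GET /index.html \n")
web2.receive("GET /about.html \n")
print(collector.storage)
```

Since the processor tier is optional, an agent could equally be constructed with the collector as its direct downstream.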

The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.

The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters are coordinated through ZooKeeper (ZK).
● Specify sources, sinks and control data flows.

Flume Goals: Reliability
Tunable Failure Recovery Modes

● Best Effort

● Store on Failure and Retry

● End to End Reliability
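
The difference between the three modes can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not Flume's implementation: `FlakySink` is a hypothetical downstream node that fails a fixed number of times, a list stands in for local disk, a dict stands in for a write-ahead log, and a `send` call that returns normally is treated as an acknowledgement.

```python
class FlakySink:
    """Hypothetical downstream node that rejects its first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def send(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("downstream unavailable")
        self.received.append(event)


def best_effort(events, sink):
    """Send each event once; an event whose delivery fails is lost."""
    for event in events:
        try:
            sink.send(event)
        except ConnectionError:
            pass  # dropped


def store_on_failure_and_retry(events, sink, max_rounds=3):
    """Store only the events that failed, then retry them; returns what is still undelivered."""
    stored = []  # stand-in for local disk
    for event in events:
        try:
            sink.send(event)
        except ConnectionError:
            stored.append(event)
    for _ in range(max_rounds):
        if not stored:
            break
        retry, stored = stored, []
        for event in retry:
            try:
                sink.send(event)
            except ConnectionError:
                stored.append(event)
    return stored


def end_to_end(events, sink, max_rounds=3):
    """Log every event before sending and forget it only once acknowledged."""
    log = {seq: event for seq, event in enumerate(events)}  # stand-in for a write-ahead log
    for _ in range(max_rounds):
        for seq in list(log):
            try:
                sink.send(log[seq])
            except ConnectionError:
                continue  # no acknowledgement: keep the entry for the next round
            del log[seq]  # acknowledged
        if not log:
            break
    return log


lossy = FlakySink(failures=1)
best_effort(["a", "b", "c"], lossy)
print(lossy.received)  # ['b', 'c']
```

The stronger modes trade extra storage and retries for fewer lost events, which is why the mode is tunable per flow.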

Flume Goals: Scalability
Horizontally Scalable Data Path

Load Balancing
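
One simple way to picture load balancing on the data path is to spread agents over the available collectors. The sketch below uses round-robin assignment, which is an assumption chosen for illustration; the slides do not say which strategy Flume applies, and `assign_agents` is a hypothetical helper.

```python
from itertools import cycle


def assign_agents(agents, collectors):
    """Assign each agent to a collector in round-robin order."""
    ring = cycle(collectors)
    return {agent: next(ring) for agent in agents}


assignment = assign_agents(["agent1", "agent2", "agent3"], ["collectorA", "collectorB"])
print(assignment)
# {'agent1': 'collectorA', 'agent2': 'collectorB', 'agent3': 'collectorA'}
```

Adding a collector to the list and recomputing the assignment is what makes this tier horizontally scalable in the sketch.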

Flume Goals: Scalability
Horizontally Scalable Control Path

Flume Goals: Extensibility
● Simple Source and Sink API
○ Event streaming and composition of simple
operations

● Plug-in Architecture
○ Add your own sources, sinks, decorators
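
A plug-in architecture of this kind usually comes down to a registry that maps names to factories. The Python sketch below shows the general pattern only; `SINKS`, `register_sink` and `build_sink` are hypothetical names and do not correspond to Flume's actual (Java) plug-in API.

```python
SINKS = {}


def register_sink(name):
    """Decorator that records a sink factory under the given name."""
    def wrap(factory):
        SINKS[name] = factory
        return factory
    return wrap


@register_sink("memory")
def memory_sink(store):
    """Hypothetical user-supplied sink: appends events to a list."""
    return store.append


def build_sink(name, *args):
    """Look up a registered factory by name and build the sink."""
    return SINKS[name](*args)


events = []
sink = build_sink("memory", events)
sink("hello")
print(events)  # ['hello']
```

The core never needs to know about `memory_sink` in advance; it only looks names up in the registry, which is what lets users add their own sources, sinks and decorators.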

Flume Goals: Manageability
Centralized Data Flow Management Interface

Flume Goals: Manageability
Configuring Flume

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
Output Bucketing
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxy.txt
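
The bucketed paths above follow a year / month-day / hour pattern. A minimal Python sketch of how such a path could be derived from an event timestamp is shown below; `bucket_path` is a hypothetical helper, and the layout is inferred from the three example paths rather than taken from Flume's documentation.

```python
from datetime import datetime


def bucket_path(root, timestamp, filename):
    """Build a path bucketed by year, month+day and hour (minutes fixed to 00)."""
    return f"{root}/{timestamp:%Y}/{timestamp:%m%d}/{timestamp:%H}00/{filename}"


print(bucket_path("/logs/web", datetime(2010, 7, 15, 12, 34), "data-xxx.txt"))
# /logs/web/2010/0715/1200/data-xxx.txt
```

Events from the same hour land in the same directory, so a downstream Hadoop job can process one hour of logs by reading a single directory.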

Use Case: Near Realtime Aggregator

Conclusion
Flume is
● A distributed data collection service

● Suitable for enterprise settings

● Able to handle large amounts of log data

References
● http://www.cloudera.com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsieh_hadoop_log_processing/
● http://www.slideshare.net/cloudera/inside-flume
● http://www.slideshare.net/cloudera/flume-intro100715
● http://www.slideshare.net/cloudera/flume-austin-hug-21711
