Log Ingestion on Big Data Platform
with Flume
Agenda
•Why Centralized Logging on Hadoop
•Flume Introduction
•Simple Flume Logging
•Centralized and Scalable Flume Logging
•Leveraging log data
•Example
2
Use Case: Centralized Logging Requirements
•Applications generate large volumes of logs.
•These logs are stored on local disks on individual nodes.
•Log records need to be archived in near real time to create value from them.
•Enable analytics on logs for diagnosing issues on the Hadoop platform.
3
Centralized Log Management & Analytics : Goals
•Have a central repository to store large volumes of machine-generated data
from all sources and tiers of applications and infrastructure
•Feed log data from multiple sources to the common repository in a
non-intrusive way and in near real time
•Enable analytics on log data using standard analytical solutions
•Provide the capability to search and correlate information across different
sources for quick problem isolation and resolution
•Improve operational intelligence
•Be centralized, without the redundancy of multiple agents on every host for
log collection
4
Solution Components for centralized logging
Flume
•Flume is a distributed streaming service from the Apache Hadoop ecosystem,
used primarily as a reliable way of getting stream and log data into HDFS. Its
pluggable architecture supports any consumer. A correctly configured Flume
pipeline is guaranteed not to lose data, provided durable channels are used.
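As a minimal sketch of how these three components are wired together in an agent configuration file (the agent, source, channel, and sink names here are hypothetical):

```properties
# A single agent named "agent1" with one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = k1

# Netcat source: listens on a TCP port and turns each line into an event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 4444
agent1.sources.src1.channels = ch1

# Memory channel: fast but volatile (see the Channels section below)
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Logger sink: writes events to the agent's log, useful for testing
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = ch1
```

Such an agent would be started with `flume-ng agent --name agent1 --conf-file agent1.conf`.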
•Each Flume agent consists of three major components: sources, channels, and sinks.
Sources
An active component that receives events from a specialized location or mechanism
and places them on one or more Channels.
Different Source types:
Specialized sources for integrating with well-known systems, for example
Syslog and Netcat:
AvroSource, NetcatSource, SpoolDirectorySource,
ExecSource, JMSSource, SyslogTcpSource, SyslogUDPSource
5
Channels
A passive component that buffers the incoming events until they are drained by
Sinks.
Different Channels offer different levels of persistence:
Memory Channel: volatile
Data is lost if the JVM or machine restarts
File Channel: backed by a WAL implementation
Data is not lost unless the disk dies; when the agent comes back up,
the data can be accessed again
Channels are fully transactional
Provide weak ordering guarantees (in case of failures / rollbacks)
Can work with any number of Sources and Sinks
Absorb upstream bursts, acting as buffers between upstream and downstream components
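A durable File Channel is configured by pointing it at local directories for its checkpoint and data files. A sketch, with placeholder paths and an illustrative agent name:

```properties
# File channel backed by a write-ahead log on local disk
agent1.channels.fch1.type = file
agent1.channels.fch1.checkpointDir = /var/lib/flume/checkpoint
agent1.channels.fch1.dataDirs = /var/lib/flume/data
# Maximum number of events the channel will buffer
agent1.channels.fch1.capacity = 1000000
# Maximum events per transaction with a source or sink
agent1.channels.fch1.transactionCapacity = 10000
```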
Sinks
An active component that removes events from a Channel and transmits them
to their next hop destination.
Different types of Sinks:
Terminal sinks that deposit events to their final destination. For example:
HDFS, HBase, Kite-Solr, Elasticsearch
Sinks support serialization to the user's preferred formats.
HDFS sink supports time-based and arbitrary bucketing of data while writing to
HDFS.
IPC sink for Agent-to-Agent communication: Avro
Require exactly one channel to function
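The HDFS sink's time-based bucketing can be sketched with escape sequences in the path; the path, names, and roll settings below are illustrative:

```properties
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = fch1
# %Y%m%d/%H buckets files by day and hour of the event timestamp;
# this requires a "timestamp" header (e.g. from the timestamp interceptor)
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/data/logs/%Y%m%d/%H
agent1.sinks.hdfs1.hdfs.filePrefix = events
# DataStream writes plain text rather than SequenceFiles
agent1.sinks.hdfs1.hdfs.fileType = DataStream
# Roll files every 5 minutes; disable size- and count-based rolling
agent1.sinks.hdfs1.hdfs.rollInterval = 300
agent1.sinks.hdfs1.hdfs.rollSize = 0
agent1.sinks.hdfs1.hdfs.rollCount = 0
```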
Flume Multi Tier Setup
[Client]+ → Agent → [Agent]* → Destination
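The hop between two tiers in this topology is typically an Avro sink on the upstream agent paired with an Avro source on the downstream one. A sketch with hypothetical agent names, host, and port:

```properties
# Upstream (listener) agent: forward events over Avro IPC
tier1.sinks.avro-out.type = avro
tier1.sinks.avro-out.hostname = collector.example.com
tier1.sinks.avro-out.port = 4545
tier1.sinks.avro-out.channel = ch1

# Downstream (collector) agent: accept events from upstream agents
tier2.sources.avro-in.type = avro
tier2.sources.avro-in.bind = 0.0.0.0
tier2.sources.avro-in.port = 4545
tier2.sources.avro-in.channels = ch1
```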
Centralized logging with Flume
Interceptors
Flume has the capability to modify/drop events in-flight. This is done with the help of
interceptors. An interceptor can modify or even drop events based on any criteria
chosen by the developer of the interceptor.
Built-in Interceptors allow adding headers such as timestamps, hostname, static
markers etc.
Custom interceptors can introspect event payload to create specific headers where
necessary
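Built-in interceptors are attached to a source. The sketch below adds a timestamp, the host name, and a static marker header; the interceptor names and the key/value pair are illustrative:

```properties
agent1.sources.src1.interceptors = ts hn env
# Adds a "timestamp" header with the event's receive time
agent1.sources.src1.interceptors.ts.type = timestamp
# Adds a "host" header with the agent host's name or IP
agent1.sources.src1.interceptors.hn.type = host
# Adds a fixed key/value marker to every event
agent1.sources.src1.interceptors.env.type = static
agent1.sources.src1.interceptors.env.key = environment
agent1.sources.src1.interceptors.env.value = production
```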
Configuration Example: Flume Agents
● Hierarchical
● Flow of components
11
Contextual Routing with Interceptors
Achieved using Interceptors and Channel Selectors
Terminal Sinks can directly use Headers to make destination selections
HDFS Sink can use header values to create a dynamic path for the files that events
will be added to.
# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1
Flume Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume/Syslog log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system that the event data is consumed from
• Not needed in all cases
Client Applications
Configuration Example: Log4j
Below is a log4j configuration snippet that enables Java applications to send
events via syslog:
log4j.appender.syslog=org.apache.log4j.net.SyslogAppender
log4j.appender.syslog.Facility=LOCAL3
log4j.appender.syslog.FacilityPrinting=false
log4j.appender.syslog.Header=true
log4j.appender.syslog.SyslogHost=FlumedestinationHost:4444
log4j.appender.syslog.layout=org.apache.log4j.PatternLayout
log4j.appender.syslog.layout.ConversionPattern= TYPE: DUMMY %p: (%F:%L) %x %m %n
14
For Non log4j Applications
Rsyslog
•Rsyslog is an open-source software utility used on UNIX and Unix-like computer systems for
forwarding log messages in an IP network. It implements the basic syslog protocol, extends it
with content-based filtering, rich filtering capabilities, flexible configuration options and adds
features such as using TCP for transport.
● Used in most Linux distros as the standard logger
● Has multiple facilities for application use: local0-local7 (avoid local7)
● Can poll any file on the system and send new events over the network to syslog destinations
● Restart after configuration changes: service rsyslog restart
$ModLoad imfile
$InputFileName /var/log/NEWAPP/NEWAPP.log
$InputFileTag TYPE:_NEWAPP
$WorkDirectory /var/spool/rsyslog/NEWAPP
$InputFileStateFile NEWAPP-log
$InputFileFacility local7
$InputFilePersistStateInterval 10
$InputFileSeverity info
$RepeatedMsgReduction off
$InputRunFileMonitor
local7.* @@flumehost:4444
Solution: Near Real Time Log Archive to Hadoop Platform
16
Event Flow :: Simple Flume Logging
Solution: Near Real Time Log Archive to Hadoop Platform
17
•Less centralized, avoiding a single point of failure.
•If a collector fails, events are still not lost.
•Scope for further scalability, with minimal configuration.
Configuration Example: Flume Multi-tier Config
●Flume Listener Agents
■ This tier gathers events from multiple applications.
■ It can also perform event inspection using interceptors.
■ Each event is analyzed and forwarded with appropriate header-only updates so the next agent
can make sense of it.
■ A file channel, or any other durable channel, can be used here.
■ Events are aggregated for the next tier.
●Flume Writer Tier
■ Keeps the number of connections to HDFS to a minimum.
■ This tier gets events from the aggregator and reads their headers.
■ Based on the headers, events are written to the relevant location on HDFS.
18
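Putting the two tiers together, a configuration sketch might look like the following; hosts, ports, channel details, and paths are all hypothetical:

```properties
# --- Listener agent: syslog in, durable file channel, Avro out ---
listener.sources = syslog-in
listener.channels = fch
listener.sinks = avro-out
listener.sources.syslog-in.type = syslogtcp
listener.sources.syslog-in.port = 4444
listener.sources.syslog-in.channels = fch
listener.channels.fch.type = file
listener.sinks.avro-out.type = avro
listener.sinks.avro-out.hostname = writerhost
listener.sinks.avro-out.port = 4545
listener.sinks.avro-out.channel = fch

# --- Writer agent: Avro in, file channel, HDFS out ---
writer.sources = avro-in
writer.channels = fch
writer.sinks = hdfs-out
writer.sources.avro-in.type = avro
writer.sources.avro-in.bind = 0.0.0.0
writer.sources.avro-in.port = 4545
writer.sources.avro-in.channels = fch
writer.channels.fch.type = file
writer.sinks.hdfs-out.type = hdfs
# %{TYPE} expands to the value of the "TYPE" event header,
# giving each application type its own HDFS location
writer.sinks.hdfs-out.hdfs.path = hdfs://namenode/data/logmgmt/%{TYPE}/%Y%m%d
writer.sinks.hdfs-out.channel = fch
```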
DDL for creating a Hive table over the log data:
CREATE TABLE logData_H2 (
  Ltype STRING,
  event_time STRING,
  porder STRING,
  SEVERITY STRING,
  SCLASS STRING,
  PHO STRING,
  MESG STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/logmgmt/_DUMMY/raz-XPS14/150703/';
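Once the table is defined, the archived logs can be queried with standard HiveQL. A sketch that pulls recent high-severity records; the literal 'ERROR' is an assumption about what the loggers actually emit in the SEVERITY field:

```sql
SELECT event_time, PHO, MESG
FROM logData_H2
WHERE SEVERITY = 'ERROR'
ORDER BY event_time DESC
LIMIT 100;
```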
Thank you
