#backdaybyxebia
Sylvain Lequeux
@slequeux
Event Ingestion in HDFS
Back To Basics
Basics : Event ?
Asynchronism …
… to message systems …
… to event systems
Basics : Kafka
A distributed messaging system
… multi-queues (called “topics”)
split into partitions
… multi-clients
Basics : Hadoop Distributed FileSystem
Distributed & scalable
Highly fault-tolerant
Standard storage layer on which Big Data jobs run
“Moving computation is cheaper than moving data”
VS
Flume
http://flume.apache.org/
Flume
Concepts
Flume
➔ Top level Apache project
➔ “Item” streaming based on data flow
Flume
1. An “item” exists somewhere
Initially, “items” were log files
2. A source is a way to pull and transform this data into Flume events
3. A channel is a way to transport data (memory, file)
4. A sink is a way to put a Flume event somewhere
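The two channel types named in step 3 trade speed for durability: a memory channel is fast but loses buffered events if the agent stops, while a file channel persists them on local disk. As a minimal sketch, a file channel could be declared as follows. The agent name a1, the channel name c1 and both directory paths are placeholders chosen for illustration, not values from the talk.

```properties
# Hypothetical agent "a1" with a single channel "c1" (illustrative names)
a1.channels = c1
# File channel: events are written to local disk and survive an agent restart
a1.channels.c1.type = file
# Placeholder directories; point them at real local disks
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
```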
Flume + Kafka = Flafka
Flume
How it works
Simple Flume configuration file
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
1 - Define an agent
2 - Describe the sources
3 - Describe the sinks
4 - Describe the channels
5 - Connect the pieces
Flafka source configuration
# Mandatory config
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.zookeeperConnect = localhost:2181
a1.sources.r1.topic = MyTopic
# Optional config
a1.sources.r1.batchSize = 1000
a1.sources.r1.batchDurationMillis = 1000
a1.sources.r1.consumer.timeout.ms = 10
a1.sources.r1.auto.commit.enabled = false
a1.sources.r1.groupId = flume
HDFS Sink configuration
# Mandatory config
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:54310/data/flume/%{topic}/%y-%m-%d
# A lot of optional configs, including:
# Compression
# File types : SequenceFile, Avro, etc.
# Possibility to use a custom Writable
# Kerberos configuration
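To make the optional settings listed above more concrete, here is a hedged sketch of a compressed SequenceFile output on the same k1 sink. The property names are those of the Flume HDFS sink; the chosen values (gzip codec, Text write format, local timestamps) are illustrative assumptions, not the configuration used in the talk.

```properties
# Write compressed SequenceFiles (illustrative values)
a1.sinks.k1.hdfs.fileType = SequenceFile
a1.sinks.k1.hdfs.codeC = gzip
a1.sinks.k1.hdfs.writeFormat = Text
# The %y-%m-%d escapes in hdfs.path are normally resolved from the event's
# "timestamp" header; this setting makes the sink use the agent's local time instead
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```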
Flume : data transformation
Flume Interceptors
# Mandatory config
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = ...
..
➔ Transformation executed
◆ After event is generated
◆ Before sending it to channel
➔ Some predefined interceptors
◆ Timestamp
◆ UUID
◆ Filtering
◆ Morphline
◆ ...
➔ You can write your own (pure Java)
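As an illustration of the predefined interceptors, the sketch below chains a timestamp interceptor and a regex filter on source r1. The interceptor names i1 and i2 and the regular expression are hypothetical; only the type aliases and property names are taken from the Flume documentation.

```properties
# Interceptors run in the order in which they are listed
a1.sources.r1.interceptors = i1 i2
# i1: add a "timestamp" header (epoch milliseconds) to every event
a1.sources.r1.interceptors.i1.type = timestamp
# i2: drop events whose body matches the (hypothetical) pattern
a1.sources.r1.interceptors.i2.type = regex_filter
a1.sources.r1.interceptors.i2.regex = ^DEBUG.*
a1.sources.r1.interceptors.i2.excludeEvents = true
```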
Flume : how to run it ?
Command line :
flume-ng agent \
  -n a1 \
  -c /usr/lib/flume-ng/conf/ \
  -f /usr/lib/flume-ng/conf/flume-kafka.conf &
Also included in distribs
Camus
linkedin/camus
Camus
Concepts
Camus
➔ Open-source project developed by LinkedIn
➔ Based entirely on MapReduce
Camus
A batch consists of three steps :
- P1 : Gets metadata : topics & partitions, latest offsets
- P2 : Pulls new events
- P3 : Updates local metadata
Camus
How it works
Time to write some code
Just explain how to transform
INTO
Round 1 : getting started
Flume
➔ A simple configuration file is enough to make it work
➔ The Morphline interceptor’s syntax is quite complex
Camus
➔ Needs a complete dev environment (including Maven) to use it
➔ Developers should understand MapReduce concepts
Round 2 : running time
Flume
➔ Flume events are ingested with no delay
Camus
➔ MapReduce setup adds incompressible time
➔ 31 sec incompressible + ~ 1 sec / 500 messages / node
=> 111 sec for 1 message
=> 117 sec for 50 messages
=> 116 sec for 1,000 messages
=> 127 sec for 10,000 messages
Round 3 : maintainability
Flume
➔ When used through CM, the server is easy to maintain, but the config is not
Camus
➔ Full Maven project. Just use a version control system (Git, SVN, etc.)
Round 4 : customization
Flume
➔ Interceptors are fully customizable
➔ Event headers make the HDFS path highly configurable
Camus
➔ Transforming data can be done easily
Round 5 : deployment
Flume
➔ When used through CM, just include your conf, that’s it
➔ Without a manager, everything needs to be done manually
Camus
➔ MapReduce jobs may be included in any MR orchestrator (Oozie for instance)
Round 6 : state of the project
Flume
➔ Released 1.0.0 in 2012
➔ Included by default with Hadoop distributions
Camus
➔ Currently in v 0.1.0-SNAPSHOT
➔ Heavily used by LinkedIn in production
➔ Almost no documentation
Summary
Flume Camus
Getting Started
Running time
Maintainability
Customization
Deployment
State of the project
Global Feedback
Debugging
Debugging Flume is quite complex
Some really critical bugs like [FLUME-2578]
Documentation
Flume has good-quality documentation
Camus only has a README file, and it is not up to date!
Camus & M/R
Camus suffers from its use of Map/Reduce.
Using another engine such as Spark might give better performance.
Flume quantity of files
Flume needs a very precise configuration to avoid generating a large number of files.
It is easy to make it generate many small files, which is a problem for Big Data workloads.
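The number of files is mostly driven by the HDFS sink roll settings. A minimal sketch, assuming the k1 sink from the earlier slides and illustrative thresholds that would need tuning for a real workload:

```properties
# Roll on size only: a value of 0 disables the corresponding trigger
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
# Roll when a file reaches roughly 128 MB (value in bytes, illustrative)
a1.sinks.k1.hdfs.rollSize = 134217728
# Close a file that has received no event for 10 minutes (value in seconds, illustrative)
a1.sinks.k1.hdfs.idleTimeout = 600
```

With the documented defaults (a roll every 30 seconds, 1024 bytes or 10 events, whichever comes first), small files appear very quickly, which is why these settings usually need to be overridden.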
Thank you
Questions ?

Backday Xebia : Lessons learned from ingesting events into HDFS