Open Source
Data Collection/Ingestion
Treasure Data, Inc.
www.treasuredata.com
Hello!
- “Committer” of Fluentd
- Treasure Data, Inc.
- Former Algorithmic Trader
- Stanford Math and CS
Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.
Data Collection/Ingestion is HARD
(Big) Data Pipeline
[Diagram: Data Sources -> (Data Collection and Ingestion) -> Raw Data Storage -> (Data Pre-processing) -> Processed Data -> (Data Fetching) -> Analysis Environment. Owners: Data Engineers]
If Data Collection Goes Awry...
[Diagram: Data Sources -> (Data Collection and Ingestion) -> Raw Data Storage -> (Data Pre-processing) -> Processed Data -> (Data Fetching) -> Analysis Environment. Owners: Data Engineers]
Collection v. Ingestion
Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
log.error("FUUUUU....WHY!?")
cln.send({"uid":1,"action":"died"})
200 GET a.com/?utm=big%20data
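The push/pull and batch/streaming distinctions above can be sketched in a few lines. This is a minimal illustration, not any particular tool's API: `PushClient`, `pull_new_lines`, and the `transport` callback are hypothetical names, and the newline-delimited JSON payload is an assumption.

```python
import json
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]


class PushClient:
    """Push model: the application hands events to the collector itself."""

    def __init__(self, transport: Callable[[str], None], batch_size: int = 1):
        self.transport = transport    # hypothetical sender, e.g. a socket write
        self.batch_size = batch_size  # 1 behaves like streaming, >1 like batching
        self.pending: List[Event] = []

    def send(self, event: Event) -> None:
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.pending:
            # One newline-delimited JSON payload per flush.
            self.transport("\n".join(json.dumps(e) for e in self.pending))
            self.pending = []


def pull_new_lines(log_lines: List[str], offset: int) -> Tuple[List[str], int]:
    """Pull model: the collector reads what the app already wrote,
    starting from the offset it saved last time."""
    return log_lines[offset:], len(log_lines)
```

With `batch_size=1` every `send` goes out immediately; with a larger value events wait until the batch fills or `flush` is called.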
Data Ingestion
- Receives data
- Sometimes coupled with storage
- Routing data
[Diagram: Data Ingestion Layer]
ex. Data Collection Tools
rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure
rsyslog
https://github.com/rsyslog/rsyslog/blob/master/ChangeLog
Scribe
- Written originally at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely abandoned
Flume-ng
- Written and maintained by Cloudera (successor to Flume)
- Commercial support by Cloudera. Track record for Hadoop
- Java can be heavy-handed for some orgs/cases
Logstash
- Pluggable architecture, rich ecosystem
- The “L” of the ELK stack by Elastic
- JRuby
- HA uses Redis as a queue
http://apuntesdetrabajo.es/?p=263
Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation needed (Go’s limitation, may change)
Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-destination log routing
Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for contributors!)
ex. Data Ingestion Tools
RabbitMQ
- Written in Erlang, supported by Pivotal
- Implements AMQP
Kafka
- Begun at LinkedIn, now Confluent
- Topic-based Message Broker: Producer/Broker/Consumer
- Distributed design
- Provides at-least-once or at-most-once delivery, depending on the consumer
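Why the consumer decides between at-least-once and at-most-once: it comes down to whether the consumer commits its offset before or after processing a message. The sketch below is a plain-Python simulation with no Kafka client involved; `consume`, `make_flaky_handler`, and `run_with_restart` are hypothetical names.

```python
from typing import Callable, List


def consume(log: List[str], start: int, handle: Callable[[str], None],
            commit_first: bool) -> int:
    """Process log[start:] and return the committed offset.

    If handle() raises, stop and return what was committed so far,
    which is where a restarted consumer would resume.
    """
    committed = start
    for offset in range(start, len(log)):
        try:
            if commit_first:
                committed = offset + 1  # at-most-once: commit, then process
                handle(log[offset])
            else:
                handle(log[offset])     # at-least-once: process, then commit
                committed = offset + 1
        except RuntimeError:
            break
    return committed


def make_flaky_handler(seen: List[str], crash_after_effect: bool):
    """Handler that crashes once on message "b", before or after its side effect."""
    state = {"crashed": False}

    def handle(msg: str) -> None:
        if msg == "b" and not state["crashed"]:
            state["crashed"] = True
            if crash_after_effect:
                seen.append(msg)
            raise RuntimeError("consumer crashed")
        seen.append(msg)

    return handle


def run_with_restart(commit_first: bool, crash_after_effect: bool) -> List[str]:
    log, seen = ["a", "b", "c"], []
    handle = make_flaky_handler(seen, crash_after_effect)
    offset = consume(log, 0, handle, commit_first)
    consume(log, offset, handle, commit_first)  # restart from committed offset
    return seen
```

Committing after processing replays "b" on restart (a duplicate); committing first skips it (a loss).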
Fluentd!?
- Used (abused?) as a bus/MQ
- tag-based event routing
- Can be combined with RabbitMQ/Kafka, etc.
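Tag-based event routing, roughly: each event carries a tag, and the first route whose pattern matches the tag receives the event. The sketch below is an illustration only. `TagRouter` is a hypothetical class, and it uses Python's `fnmatch` globbing, which is an assumption and not Fluentd's actual match syntax.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Sink = Callable[[str, Event], None]


class TagRouter:
    """Routes (tag, event) pairs to the first sink whose pattern matches."""

    def __init__(self) -> None:
        self.routes: List[Tuple[str, Sink]] = []

    def match(self, pattern: str, sink: Sink) -> None:
        # Routes are checked in the order they were registered.
        self.routes.append((pattern, sink))

    def emit(self, tag: str, event: Event) -> bool:
        for pattern, sink in self.routes:
            if fnmatchcase(tag, pattern):
                sink(tag, event)
                return True
        return False  # no route matched; the event was not delivered


app_events: List[Event] = []
everything_else: List[Event] = []

router = TagRouter()
router.match("app.*", lambda tag, e: app_events.append(e))
router.match("*", lambda tag, e: everything_else.append(e))

router.emit("app.login", {"uid": 1})
router.emit("sys.cpu", {"load": 0.7})
```

A sink here could just as well be a function that publishes to RabbitMQ or Kafka, which is how the combination in the last bullet would fit in.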
case study: Async App Logging
Application Logging
- Common ask: “How’s our new feature doing?”
[Diagram: GET /foobar -> API Server -> 200 {...}]
Application Logging
- What NOT to do: synchronous logging
[Diagram: GET /foobar -> API Server -> write -> Data Backend -> ack -> API Server -> 200 {...}]
Application Logging
- What NOT to do: synchronous logging
[Diagram: GET /foobar -> API Server -> write -> Local Data Collector (Buffer) -> ack -> API Server -> 200 {...}. The Local Data Collector later flushes its buffer to the Data Backend]
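The buffered path in the diagram above can be sketched as an in-process queue drained by a background thread. This is a minimal sketch under stated assumptions: `AsyncLogger` and its parameters are hypothetical, the `flush` callback stands in for the write to the data backend, and error handling and retries are left out.

```python
import queue
import threading
from typing import Callable, Dict, List

Event = Dict[str, object]
_STOP = object()  # sentinel that tells the worker to drain and exit


class AsyncLogger:
    def __init__(self, flush: Callable[[List[Event]], None],
                 batch_size: int = 100, flush_interval: float = 1.0,
                 max_buffer: int = 10_000):
        self._flush = flush
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._buffer: queue.Queue = queue.Queue(maxsize=max_buffer)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event: Event) -> bool:
        """Called on the request path: never blocks, drops if the buffer is full."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _run(self) -> None:
        batch: List[Event] = []
        while True:
            timed_out = stopping = False
            try:
                item = self._buffer.get(timeout=self._flush_interval)
                if item is _STOP:
                    stopping = True
                else:
                    batch.append(item)
            except queue.Empty:
                timed_out = True
            # Flush when the batch is full, the buffer went idle, or on shutdown.
            if batch and (timed_out or stopping or len(batch) >= self._batch_size):
                self._flush(batch)
                batch = []
            if stopping:
                return

    def close(self) -> None:
        self._buffer.put(_STOP)
        self._worker.join()
```

The request handler only pays for an in-memory enqueue; the cost is that events still in the buffer are lost if the process dies, and events are dropped when the buffer is full.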
But wait...
- Is writing to a local log collector safe?
- What if the log collector retries after an error?
- A lot of problems to think about!
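One common answer to the retry question, offered here as an assumption and not something the slides prescribe: attach a unique ID to each event at the source and let the backend ignore IDs it has already stored, so a retried batch does not create duplicates. All names below (`with_id`, `send_with_retry`, `DedupBackend`, `lossy_ack`) are hypothetical.

```python
import uuid
from typing import Callable, Dict, List, Set

Event = Dict[str, object]


def with_id(event: Event) -> Event:
    """Stamp the event with a unique ID once, where it originates."""
    return {"id": str(uuid.uuid4()), **event}


def send_with_retry(batch: List[Event], send: Callable[[List[Event]], None],
                    max_attempts: int = 3) -> bool:
    """Retry the whole batch on connection errors (no backoff in this sketch)."""
    for _ in range(max_attempts):
        try:
            send(batch)
            return True
        except ConnectionError:
            continue
    return False


class DedupBackend:
    """Backend that stores each event ID at most once."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.rows: List[Event] = []

    def write(self, batch: List[Event]) -> None:
        for event in batch:
            if event["id"] not in self.seen:
                self.seen.add(event["id"])
                self.rows.append(event)


def lossy_ack(backend: DedupBackend) -> Callable[[List[Event]], None]:
    """Simulates a write that succeeds but whose first ack is lost."""
    calls = {"n": 0}

    def send(batch: List[Event]) -> None:
        backend.write(batch)
        calls["n"] += 1
        if calls["n"] == 1:
            raise ConnectionError("ack lost")

    return send
```

Without the ID check, the lost ack would make the collector resend a batch the backend had already written.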
“Much of the blame, little of the glory”
(Just kidding. The entire data team relies on YOU!)
Thank you!
(...and we are hiring!)
www.treasuredata.com/careers
Bibliography
- Software
- www.fluentd.org
- hekad.readthedocs.org
- logstash.org
- kafka.apache.org
- Embulk.org
- www.rabbitmq.com
- Ideas
- https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
- http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-pipeline.html
