Open source data ingestion

Open Source
Data Collection/Ingestion
Treasure Data, Inc.
www.treasuredata.com

Hello!
- “Committer” of Fluentd
- Treasure Data, Inc.
- Former Algorithmic Trader
- Stanford Math and CS

Table of Contents
1. Why you should care
2. Data Collection v. Data Ingestion
3. Examples: Data Collection Tools
4. Examples: Data Ingestion Tools
5. Case Study: Async App Logging
Links to be added after the talk.

Data Collection/Ingestion is HARD

(Big) Data Pipeline
[Diagram: Data Sources -> Data Collection and Ingestion -> Raw Data Storage -> Data Pre-processing -> Processed Data -> Data Fetching -> Analysis Environment. Label: Data Engineers.]

If Data Collection Goes Awry...
[Diagram: Data Sources -> Data Collection and Ingestion -> Raw Data Storage -> Data Pre-processing -> Processed Data -> Data Fetching -> Analysis Environment. Label: Data Engineers.]

Data Collection
- Happens where data originates
- “logging code”
- Batch v. Streaming
- Pull v. Push
log.error("FUUUUU....WHY!?")
cln.send({"uid":1,"action":"died"})
200 GET a.com/?utm=big%20data
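
The push v. pull and batch v. streaming distinctions above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's API: `PushSource`, `PullCollector`, and the in-memory lists are hypothetical stand-ins for an application, a collector, and a log file.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class PushSource:
    """Push: the application sends each event to the collector as it happens."""

    def __init__(self, send: Callable[[Event], None]) -> None:
        self._send = send

    def record(self, event: Event) -> None:
        self._send(event)  # streaming: one event at a time


class PullCollector:
    """Pull: the collector periodically reads what the source has accumulated."""

    def __init__(self, source_log: List[Event]) -> None:
        self._source_log = source_log
        self._offset = 0  # position of the next unread event

    def poll(self) -> List[Event]:
        batch = self._source_log[self._offset:]
        self._offset = len(self._source_log)
        return batch  # batch: everything written since the last poll


# Push: the event arrives at the collector immediately.
received: List[Event] = []
PushSource(received.append).record({"uid": 1, "action": "died"})

# Pull: the collector picks up accumulated events on its own schedule.
app_log: List[Event] = [{"uid": 1, "action": "login"}, {"uid": 1, "action": "died"}]
puller = PullCollector(app_log)
first_batch = puller.poll()
second_batch = puller.poll()  # nothing new since the first poll
```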

Data Ingestion
- Receives data
- Sometimes coupled with storage
- Routing data
[Diagram label: Data Ingestion Layer]

rsyslog
- The grandfather of data collectors
- Streaming
- Installed by default, widely understood
- Not as easy to extend/configure

rsyslog
https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Scribe
- Written originally at Facebook
- Streaming
- Fast (C++)
- Nightmare to build, largely abandoned

Flume-ng
- Written and maintained by Cloudera (successor to Flume)
- Commercial support by Cloudera. Track record for Hadoop
- Java can be heavy-handed for some orgs/cases

Logstash
- Pluggable architecture, rich ecosystem
- The “L” of the ELK stack by Elastic
- JRuby
- HA uses Redis as a queue
http://apuntesdetrabajo.es/?p=263

Heka
- Developed at Mozilla
- Written in Go, extensible w/ Lua
- Plugin system, but compilation needed (Go’s limitation, may change)

Fluentd
- Plugin architecture
- Built-in HA
- CRuby (JRuby on the roadmap)
- google-fluentd, td-agent
- Lightweight multi-source, multi-destination log routing

Embulk
- Plugin architecture
- Focuses on Batch workloads
- Java/JRuby
- Very new! (looking for contributors!)

RabbitMQ
- Written in Erlang, supported by Pivotal
- Implements AMQP

Kafka
- Begun at LinkedIn, now Confluent
- Topic-based Message Broker: Producer/Broker/Consumer
- Distributed design
- Provides at-least-once or at-most-once delivery, depending on how consumers commit offsets
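
How a consumer ends up with at-least-once or at-most-once behavior comes down to whether it records its position before or after it processes a message. The sketch below is a toy in-memory simulation of that ordering, not the Kafka client API; `ToyConsumer`, `Crash`, and `simulate` are hypothetical names, and the crash is injected between the two steps for one message.

```python
from typing import List, Optional


class Crash(Exception):
    """Simulated consumer failure."""


class ToyConsumer:
    """In-memory stand-in for a consumer reading a log by offset."""

    def __init__(self, log: List[str], commit_first: bool) -> None:
        self.log = log
        self.commit_first = commit_first
        self.committed = 0  # where a restarted consumer resumes
        self.processed: List[str] = []

    def run(self, crash_on: Optional[str] = None) -> None:
        """Consume from the committed offset, optionally crashing mid-message."""
        for offset in range(self.committed, len(self.log)):
            message = self.log[offset]
            if self.commit_first:
                self.committed = offset + 1      # commit, then process
                if message == crash_on:
                    raise Crash(message)         # message is never processed
                self.processed.append(message)
            else:
                self.processed.append(message)   # process, then commit
                if message == crash_on:
                    raise Crash(message)         # offset is never committed
                self.committed = offset + 1


def simulate(commit_first: bool) -> List[str]:
    """Crash once on 'b', restart, and report everything that was processed."""
    consumer = ToyConsumer(["a", "b", "c"], commit_first)
    try:
        consumer.run(crash_on="b")
    except Crash:
        pass
    consumer.run()  # restart from the last committed offset
    return consumer.processed


at_most_once = simulate(commit_first=True)    # 'b' is lost
at_least_once = simulate(commit_first=False)  # 'b' is processed twice
```

Committing first trades duplicates for possible loss; processing first trades loss for possible duplicates, which downstream storage then has to tolerate.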

Fluentd!?
- Used (abused?) as a bus/MQ
- tag-based event routing
- Can be combined with RabbitMQ/Kafka, etc.
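
Tag-based event routing can be illustrated with a small matcher. This is a simplified approximation written for this example, not Fluentd's implementation: it assumes dot-separated tags, `*` matching exactly one tag part, `**` matching zero or more parts, and the first matching route winning. `TagRouter` and `tag_matches` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Output = Callable[[str, Record], None]


def _match(pattern: List[str], parts: List[str]) -> bool:
    if not pattern:
        return not parts
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # '**' may absorb zero or more tag parts
        return any(_match(rest, parts[i:]) for i in range(len(parts) + 1))
    if not parts:
        return False
    if head == "*" or head == parts[0]:
        return _match(rest, parts[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if a dot-separated tag matches a dot-separated pattern."""
    return _match(pattern.split("."), tag.split("."))


class TagRouter:
    """Send each event to the first output whose pattern matches its tag."""

    def __init__(self) -> None:
        self._routes: List[Tuple[str, Output]] = []

    def add(self, pattern: str, output: Output) -> None:
        self._routes.append((pattern, output))

    def emit(self, tag: str, record: Record) -> bool:
        for pattern, output in self._routes:
            if tag_matches(pattern, tag):
                output(tag, record)
                return True
        return False  # no route matched; the event is not delivered


web_events: List[Tuple[str, Record]] = []
other_app_events: List[Tuple[str, Record]] = []

router = TagRouter()
router.add("app.web.*", lambda tag, rec: web_events.append((tag, rec)))
router.add("app.**", lambda tag, rec: other_app_events.append((tag, rec)))

router.emit("app.web.access", {"status": 200})
router.emit("app.db", {"query_ms": 12})
unrouted = not router.emit("sys.kernel", {"msg": "oom"})
```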

Application Logging
- Common ask: “How’s our new feature doing?”
[Diagram: a client sends GET /foobar to the API Server, which responds 200 {...}]

Application Logging
- What NOT to do: synchronous logging
[Diagram: a client sends GET /foobar to the API Server; the API Server writes to the Data Backend and receives an ack; the client gets 200 {...}]

Application Logging
- What NOT to do: synchronous logging
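[Diagram: a client sends GET /foobar to the API Server; the API Server writes to a Local Data Collector and receives an ack; the collector holds data in a Buffer and flushes it to the Data Backend; the client gets 200 {...}]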
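
The buffer-and-flush arrangement in this diagram can be sketched in-process as a queue drained by a background thread. This is a minimal sketch under stated assumptions, not how any specific collector works: `BufferedLogger` and its parameters are hypothetical, the `flush` callable stands in for a write to the data backend, and records are dropped when the buffer is full. A real collector would typically also flush on a timer and persist its buffer.

```python
import queue
import threading
from typing import Callable, Dict, List

Record = Dict[str, object]


class BufferedLogger:
    """Enqueue records in memory; a background thread flushes them in batches."""

    def __init__(
        self,
        flush: Callable[[List[Record]], None],
        batch_size: int = 100,
        capacity: int = 10_000,
    ) -> None:
        self._flush = flush
        self._batch_size = batch_size
        self._queue = queue.Queue(maxsize=capacity)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, record: Record) -> bool:
        """Return immediately; report False if the buffer is full."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            return False  # assumption: drop rather than block the request

    def close(self) -> None:
        """Flush whatever is left and stop the background thread."""
        self._queue.put(None)  # sentinel that tells the worker to stop
        self._worker.join()

    def _run(self) -> None:
        batch: List[Record] = []
        while True:
            item = self._queue.get()
            if item is None:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._flush(batch)
                batch = []
        if batch:
            self._flush(batch)


# The list stands in for the data backend; each element is one flushed batch.
backend: List[List[Record]] = []
logger = BufferedLogger(backend.append, batch_size=2)
for i in range(3):
    logger.log({"uid": i, "action": "viewed"})
logger.close()
```

The request path only pays for an in-memory enqueue; the cost of talking to the backend moves to the background thread.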

But wait...
- Is writing to a local log collector safe?
- What if the log collector retries on error?
- A lot of problems to think about!
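
One of those problems, sketched: if a collector retries a write that had in fact succeeded, the backend sees the same event twice. A common mitigation is to attach a unique ID where the event originates and deduplicate on ingest. The names below (`new_event`, `DedupingSink`, the `event_id` field) are hypothetical, and the unbounded in-memory set is only for illustration.

```python
import uuid
from typing import Dict, List, Set

Record = Dict[str, object]


def new_event(payload: Record) -> Record:
    """Attach a unique ID at the source so retries can be recognized later."""
    return {"event_id": str(uuid.uuid4()), **payload}


class DedupingSink:
    """Store each event once, even if the collector delivers it repeatedly."""

    def __init__(self) -> None:
        self.stored: List[Record] = []
        self._seen: Set[str] = set()  # assumption: fits in memory

    def write(self, record: Record) -> bool:
        event_id = str(record["event_id"])
        if event_id in self._seen:
            return False  # duplicate delivery from a retry; ignore it
        self._seen.add(event_id)
        self.stored.append(record)
        return True


sink = DedupingSink()
event = new_event({"uid": 1, "action": "died"})
first_write = sink.write(event)
retried_write = sink.write(event)  # the collector retries the same event
```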

“Much of the blame, little of the glory”
(Just kidding. The entire data team relies on YOU!)

Thank you!
(...and we are hiring!)
www.treasuredata.com/careers

Bibliography
- Software
  - www.fluentd.org
  - hekad.readthedocs.org
  - logstash.org
  - kafka.apache.org
  - embulk.org
  - www.rabbitmq.com
- Ideas
  - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  - http://radar.oreilly.com/2015/04/the-log-the-lifeblood-of-your-data-pipeline.html
