IPSJ (Information Processing Society of Japan) Exciting Coding! Treasure Data

Treasure Data  
Exciting Coding!
Nov 2013
Presented by

Masahiro Nakagawa
Senior Software Engineer

www.treasuredata.com


Who are you?
•  Masahiro Nakagawa
–  @repeatedly
–  masa@treasure-data.com or d@

•  Treasure Data, Inc
–  Senior Software Engineer
•  Fluentd / Client libraries / etc...

–  Since 2012/11

•  Open Source projects
–  D Programming Language
–  MessagePack: D, Python, etc…
–  Fluentd: Core, Mongo, Logger, etc…
–  Etc…


Company & Service Introduction

Board Meeting Presentation

August 15th, 2013 - 3:30PM PDT

Presented by

Hironobu Yoshikawa – CEO
Kazuki Ohta – CTO
Rich Ghiossi – VP, Marketing
Keith Goldstein – VP, Sales
Kengo Hirouchi – Director, Japan
Ankush Rustagi – Director, Marketing



Company Background
•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis

•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported

•  Treasure Data has 100+ customers in production
–  Incl. Fortune 500 companies
–  500+ Billion new records / month
–  Around 2 Trillion records under management
–  Variety of use cases and verticals

The Treasure Data Team
•  Hiro Yoshikawa – CEO
–  Open source business veteran
•  Kaz Ohta – CTO
–  Founder of the world’s largest Hadoop Group
•  Jeff Yuan – Director, Engineering
–  LinkedIn, MIT / Michael Stonebraker Lab
•  Keith Goldstein – VP Sales & Bus Dev
–  VP of Bus Dev from Tibco and Talend
•  Rich Ghiossi – VP Marketing
–  VP of Marketing from ParAccel

Notable Investors
•  Othman Laraki
–  Ex-VP of Growth at Twitter
•  Jerry Yang
–  Founder of Yahoo!
•  Yukihiro “Matz” Matsumoto
–  Creator of “Ruby” programming language
•  James Lindenbaum
–  Founder of Heroku


Problem Statement
•  Many companies today produce Big Data from “New Data Sources” (sensors, web logs, etc.)
–  But few have the resources to build a Big Data Analytics system

•  60-70% of a company’s Big Data time & budget
consumed by:
–  Infrastructure setup & Maintenance
–  Building Collection & Storage Flows
–  Hiring/Training Hadoop Expertise

•  On average, it takes 6 months to get
a Hadoop environment into production

Treasure Data’s Focus (80% of the needs)


Treasure Data Service: Overview

Acquire
•  Treasure Agent: Streaming Log Collector (JSON)
–  Sources: Web logs, App logs, Sensor
•  Bulk Import: Parallel Upload from CSV, MySQL, etc.
–  Sources: RDBMS, CRM, ERP

Store
•  Treasure Data Cloud: Flexible, Scalable, Columnar Storage

Analyze
•  BI Connectivity: REST API, SQL, Pig, JDBC / ODBC
–  BI Tools: Tableau, Metric Insights, QlikView, Excel, etc.
•  Result Push: REST API, SQL, Pig
–  Dashboards, Custom App, Local DB, FTP Server, etc.

Time to Value | Economy & Flexibility | Simple & Supported


Our Value Propositions
•  Faster time to value

On-demand cloud infrastructure & versatile streaming data collection agent
–  Instantly provision a fully tuned & managed infrastructure
–  Go live into production on average in 14 days (collection, analytics, & BI)

•  Cloud flexibility and economics

Fraction of the cost of traditional solutions by leveraging cloud storage and processing,
which scales to meet your needs
–  Leverage the cost-advantage of the cloud
–  Leverage the elasticity of the cloud – scale on demand
–  Predictable monthly subscription fee
–  No upfront costs & no long-term commitment

•  Simple and well supported
We are passionate about simplicity, and customer support excellence
–  Focus your time on analyzing your data
–  Rely on us to keep your data secure & online
–  We love making customers successful & building long-term relationships


Initial Setup & Onboarding – Two Weeks

1. Data Collection
•  Setup, tuning, and monitoring of Treasure Agent
•  Embed Treasure Agent code into applications

2. Data Storage
•  Basic log templates (register, pay, login, etc.)
•  Basic KPI queries (DAU, MAU, ARPU, etc.)

3. Data Analysis
•  Setup dashboards with basic KPIs
•  Training on creating customized reports and ad hoc querying

4. Service & Support
•  Assigned a dedicated technical account manager
•  Real-time support via email, online chat, and call


Solutions Accelerators

Solution Components:
-  Treasure Data Platform
-  Event Collection Template
-  Pre-configured Treasure Agent Configuration
-  BI Dashboard with KPIs

Stack (top to bottom): Out-of-the-Box Reporting, Treasure Data Platform, Configured Treasure Agent


- Vision -
gle Analytics Platform for the Wo


Treasure Data Platform

Architecture Overview
Presented by Kazuki Ohta – CTO


Data Acquisition – Streaming Capture

Application Server
# Application Code
...
...
# Post event to Treasure Data
TD.event.post('access', {:uid=>123})
...
...

Treasure Data Library
Java, Ruby, PHP, Perl, Python, Scala, Node.js

Treasure Agent (local)
•  Automatic Microbatching
•  Local buffering Fallback
•  Network Tolerance

Treasure Data Cloud

Open-Sourced as Fluentd Project ( http://fluentd.org/ )
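
The micro-batching and local-buffering behavior listed on this slide can be pictured with a small sketch. This is only an illustration under stated assumptions: `MicroBatchBuffer`, its `batch_size` and `sender` parameters, and the JSON payload shape are hypothetical names chosen for the example, not the Treasure Data library or Treasure Agent API.

```ruby
require 'json'

# Hypothetical client-side buffer (not the real TD library): it collects
# events, sends them in micro-batches, and keeps them locally on failure.
class MicroBatchBuffer
  attr_reader :pending

  # sender: any object responding to call(payload), truthy on success.
  def initialize(batch_size:, sender:)
    @batch_size = batch_size
    @sender = sender
    @pending = []
  end

  # Queue one event; try to flush once a full batch has accumulated.
  def post(tag, record, time = Time.now.to_i)
    @pending << { 'tag' => tag, 'time' => time, 'record' => record }
    flush if @pending.size >= @batch_size
  end

  # Send everything queued so far. On failure the events stay in the
  # local buffer so that a later flush can retry them.
  def flush
    return true if @pending.empty?
    ok = begin
      @sender.call(JSON.generate(@pending))
    rescue StandardError
      false
    end
    @pending.clear if ok
    ok ? true : false
  end
end

# Usage: the "network" is down for the first flush and up for the retry.
sent = []
network_up = false
sender = lambda do |payload|
  next false unless network_up
  sent << payload
  true
end

buffer = MicroBatchBuffer.new(batch_size: 2, sender: sender)
buffer.post('access', { uid: 123 }, 1384160400)
buffer.post('access', { uid: 456 }, 1384160401) # flush fails, events stay buffered
network_up = true
buffer.flush                                    # retry delivers both events in one batch
```

Keeping failed batches in memory is the simplest possible fallback; a production agent would more likely spill to disk so that events survive a process restart.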


Data Acquisition – Bulk Loader

Sources: RDBMS, App, SaaS, FTP
Formats: CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP

Bulk Loader: Prepare → Upload → Perform → Commit

Treasure Data Cloud


Data Storage

Treasure Data Cloud

•  Stored “schema-less” as JSON
–  Schema can be applied/updated AFTER storage
•  Compressed & columnar format
–  For higher query performance
•  Optimized for time-based filtering
•  Quickly scale-up processing power
–  WITHOUT reloading/redistributing the data

Default (schema-less)
time         v
1384160400   {"ip":"135.52.211.23", "code":"0"}
1384162200   {"ip":"45.25.38.156", "code":"-1"}
1384164000   {"ip":"97.12.76.55", "code":"99"}

SELECT v['ip'] as ip, v['code'] as code …

Schema applied (~30% Faster)
time         ip : string      code : int
1384160400   135.52.211.23    0
1384162200   45.25.38.156     -1
1384164000   97.12.76.55      99

SELECT ip, code …
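
To make "schema applied after storage" concrete, here is a rough sketch that casts schema-less JSON rows into typed columns and filters them by time. The `ROWS` sample mirrors the values on the slide; `SCHEMA`, `apply_schema`, and `select_range` are hypothetical helpers written for this illustration and say nothing about how the actual storage engine is implemented.

```ruby
require 'json'

# Rows as stored by default: a time column plus a schema-less JSON value.
ROWS = [
  { time: 1384160400, v: '{"ip":"135.52.211.23","code":"0"}' },
  { time: 1384162200, v: '{"ip":"45.25.38.156","code":"-1"}' },
  { time: 1384164000, v: '{"ip":"97.12.76.55","code":"99"}' }
].freeze

# Hypothetical schema, declared after the data was already stored.
SCHEMA = { 'ip' => :string, 'code' => :int }.freeze

# Cast one raw JSON value into typed columns according to the schema.
def apply_schema(row, schema)
  raw = JSON.parse(row[:v])
  typed = schema.each_with_object({}) do |(name, type), out|
    value = raw[name]
    out[name] = type == :int ? Integer(value) : value.to_s
  end
  { 'time' => row[:time] }.merge(typed)
end

# Time-based filtering: keep rows whose time falls in [from, to).
def select_range(rows, from, to)
  rows.select { |r| r[:time] >= from && r[:time] < to }
end

typed_rows = select_range(ROWS, 1384160400, 1384164000)
             .map { |r| apply_schema(r, SCHEMA) }
typed_rows.each { |r| puts r.inspect }
```

The stored rows are never rewritten here; the schema is only an interpretation applied at read time, which is why it can be changed later without reloading data.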


Data Analysis

Treasure Data Cloud

REST API

Heavy Lifting SQL (Hive):
-  Hive’s Built-in UDFs
-  TD Added Functions:
   -  Time Functions
   -  First, Last, Rank
   -  Sessionize

Interactive SQL: Treasure Query Accelerator (Impala)

Scripted Processing (Pig):
-  DataFu (LinkedIn)
-  Piggybank (Apache)

Scheduled Jobs
-  SQL, Pig Scripts
-  Data Pushes

JDBC Connectivity:
-  Custom Java Apps
-  Standards-based
-  BI Tool Integration

Tableau ODBC connector
-  Leverages Impala

Push Query Results:
-  MySQL, PostgreSQL
-  Google Spreadsheet
-  Web, FTP, S3
-  Leftronic, Indicee
-  Treasure Data Table
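
The slide lists "Sessionize" among the added functions without describing it. As a general illustration of what sessionizing event data usually means, the sketch below starts a new session whenever a user has been idle longer than a timeout. The `sessionize` helper, the `:uid` / `:time` field names, and the session label format are assumptions for this example, not the signature or behavior of the Treasure Data function.

```ruby
# Assign a session label to each event: a new session starts whenever the
# gap since the same user's previous event exceeds `timeout` seconds.
# Events are hashes with :uid and :time (Unix seconds); names are illustrative.
def sessionize(events, timeout)
  last_seen = {}   # uid => time of that user's previous event
  session_no = {}  # uid => current session counter for that user
  events.sort_by { |e| [e[:uid], e[:time]] }.map do |e|
    uid = e[:uid]
    if !last_seen.key?(uid) || e[:time] - last_seen[uid] > timeout
      session_no[uid] = session_no.fetch(uid, 0) + 1
    end
    last_seen[uid] = e[:time]
    e.merge(session: "#{uid}-#{session_no[uid]}")
  end
end

events = [
  { uid: 1, time: 1384160400 },
  { uid: 1, time: 1384160700 }, # 5 minutes later: same session
  { uid: 1, time: 1384164000 }, # 55 minutes later: new session
  { uid: 2, time: 1384160500 }
]
sessionize(events, 1800).each { |e| puts e.inspect }
```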


Treasure Data

General Use Cases


A case: “14 Days” from Signup to Success

1.  Europe’s largest mobile ad exchange.
2.  Serving >60 billion impressions/month for >30,000 mobile apps (Q4 2013)
3.  Immediate need for analytics infrastructure: ASAP!
4.  With TD, MobFox went into production in only 14 days, with one engineer.

"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."
Julian Zehetmayr, CEO & Founder

A case: “Replace” in-house Hadoop with TD

1.  Global “Hulu”: an online video service with millions of users
2.  Video content is distributed in over 150 languages.
3.  Before: had a hard time maintaining a Hadoop cluster
4.  After: with TD, Viki retired its in-house Hadoop cluster and uses its engineers for the core business.

"Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service."
Huy Nguyen, Software Engineer

A case: Treasure Data with BI Tool (Tableau)

1.  World’s largest Android application market
2.  Serving >3 billion app downloads for >100 million users
3.  Only one engineer managing the data infrastructure
4.  With TD, the data engineer can focus on analyzing data with the existing BI tool

"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."
Simon Dong, Principal Architect - Data Engineering


Treasure Data Platform

Fluentd Overview


What is Fluentd?
•  Open-source log collector written in Ruby
–  Easy to use, reliable, and performant
–  Streaming event processing

•  Uses the RubyGems ecosystem to distribute plugins

Fluentd
the missing log collector
fluentd.org

Data processing pipeline
Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)

Data processing pipeline
Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
Collect: Important, but no de facto middleware!

Fluentd general example

Flow: Web Server → tail → Fluentd (event buffering) → insert

Web Server (apache.log):
127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
...

Fluentd event:
2012-02-04 01:33:51
apache.log
{
"host": "127.0.0.1",
"method": "GET",
...
}
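
The step from a raw access-log line to a structured event (tag, time, record) can be sketched as below. This is a simplified illustration, not Fluentd's tail input or its Apache parser: `LINE_PATTERN` and `parse_line` are hypothetical, the sample line's trailing `HTTP/1.1" 200 512` is made up because the slide elides it, and the timestamp is assumed to be UTC since the sample lines carry no zone offset.

```ruby
require 'time'

# Simplified pattern for the access-log lines in this example: host, two
# placeholder fields, bracketed timestamp, then request method and path.
LINE_PATTERN = %r{\A(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)}

# Turn one log line into a [tag, time, record] event, or nil when the
# line does not match. The timestamp is treated as UTC (an assumption).
def parse_line(tag, line)
  m = LINE_PATTERN.match(line)
  return nil unless m
  time = Time.strptime("#{m[:time]} +0000", '%d/%b/%Y:%H:%M:%S %z').to_i
  [tag, time, { 'host' => m[:host], 'method' => m[:method], 'path' => m[:path] }]
end

line = '127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1" 200 512'
event = parse_line('apache.log', line)
puts event.inspect
```

Returning `nil` for unparseable lines keeps the sketch short; a real collector would typically log or count such lines rather than drop them silently.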

Pluggable Architecture

Pluggable components around the Engine:

Input
> Forward
> HTTP
> File tail
> dstat
> ...

Engine

Output
> rewrite
> ...

Buffer
> File
> Memory

Output
> Forward
> File
> MongoDB
> ...
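
One way to picture the input / buffer / output split is a tiny engine that only knows three interfaces. The sketch below is a conceptual illustration: `ArrayInput`, `MemoryBuffer`, `CollectOutput`, and `Engine` are hypothetical classes and do not mirror Fluentd's actual plugin API.

```ruby
# Input plugin: yields [tag, time, record] events from an in-memory list.
class ArrayInput
  def initialize(events)
    @events = events
  end

  def each_event(&block)
    @events.each(&block)
  end
end

# Buffer plugin: holds events in memory until a chunk is full.
class MemoryBuffer
  def initialize(chunk_size)
    @chunk_size = chunk_size
    @chunk = []
  end

  # Returns a full chunk when one is ready, otherwise nil.
  def push(event)
    @chunk << event
    take if @chunk.size >= @chunk_size
  end

  # Hand over whatever is buffered and start a new chunk.
  def take
    chunk = @chunk
    @chunk = []
    chunk
  end
end

# Output plugin: collects written chunks (a stand-in for File, MongoDB, ...).
class CollectOutput
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  def write(chunk)
    @chunks << chunk
  end
end

# Engine: wires one input to one output through a buffer. Any object with
# the same methods can be swapped in without touching the engine.
class Engine
  def initialize(input, buffer, output)
    @input = input
    @buffer = buffer
    @output = output
  end

  def run
    @input.each_event do |event|
      chunk = @buffer.push(event)
      @output.write(chunk) if chunk
    end
    rest = @buffer.take
    @output.write(rest) unless rest.empty?
  end
end

events = [
  ['apache.log', 1355210787, { 'method' => 'GET' }],
  ['apache.log', 1355210790, { 'method' => 'GET' }],
  ['apache.log', 1355210792, { 'method' => 'GET' }]
]
output = CollectOutput.new
Engine.new(ArrayInput.new(events), MemoryBuffer.new(2), output).run
puts output.chunks.map(&:size).inspect
```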


Meet your requirements by writing a plugin

Inputs
•  Access logs: Apache
•  App logs: Frontend, Backend
•  System logs: syslogd
•  Databases

Fluentd: filter / buffer / routing

Outputs
•  Alerting: Nagios
•  Analysis: MongoDB, MySQL, Hadoop
•  Archiving: Amazon S3
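
Routing in this picture means deciding, per event tag, which output receives the event. A minimal sketch of first-match routing follows. It uses Ruby's `File.fnmatch` as a stand-in glob matcher, which is a simplification and not Fluentd's own match-pattern semantics; the `ROUTES` table, its tags, and the destination symbols are hypothetical.

```ruby
# Ordered routing table: the first pattern that matches a tag wins.
# Patterns, tags, and destinations are illustrative only.
ROUTES = [
  ['apache.*', :mongodb],
  ['app.*',    :hadoop],
  ['syslog.*', :s3]
].freeze

# Return the destination for a tag, or nil when no pattern matches.
def route(tag, routes)
  _pattern, output = routes.find { |pattern, _| File.fnmatch(pattern, tag) }
  output
end

%w[apache.access app.frontend syslog.auth unknown].each do |tag|
  puts "#{tag} -> #{route(tag, ROUTES).inspect}"
end
```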

Treasure Agent (td-agent)
•  Open sourced distribution package of Fluentd
–  ETL part of Treasure Data
–  deb / rpm / homebrew

•  Including useful components
–  Ruby, jemalloc, fluentd
–  3rd party gems: td, mongo, webhdfs, etc…
–  Init script

•  http://packages.treasuredata.com/

Treasure Data Platform

Backend Overview


AWS components
•  RDS
–  Store user information, job, status, etc…
–  Queue Worker / Scheduler

•  EC2
–  API Server, Hadoop Cluster, Job Worker / Scheduler

•  S3
–  Columnar storage
•  Realtime / Archive storage
•  MessagePack columnar

•  ELB

Plazma (Hadoop, Storage, Queue and Workers)

Components: Frontend, Queue, Worker, Hadoop

Metrics flow
•  Applications push metrics to Fluentd (via local Fluentd)
•  Fluentd sums up the data per minute (partial aggregation)
•  Treasure Data for historical analysis
•  Librato Metrics for realtime analysis
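
Partial aggregation here means collapsing many raw metric points into one sum per metric per minute before forwarding them. A minimal sketch, assuming metric points are hashes with `:name`, `:time` (Unix seconds), and `:value`; these field names, the metric name, and the `aggregate_by_minute` helper are illustrative and not taken from the actual setup.

```ruby
# Sum metric values per (name, minute) so that only one data point per
# metric and minute needs to be forwarded downstream.
def aggregate_by_minute(points)
  points.each_with_object(Hash.new(0)) do |p, sums|
    minute = p[:time] - (p[:time] % 60) # truncate to the start of the minute
    sums[[p[:name], minute]] += p[:value]
  end
end

points = [
  { name: 'jobs.finished', time: 1384160405, value: 2 },
  { name: 'jobs.finished', time: 1384160450, value: 3 },
  { name: 'jobs.finished', time: 1384160465, value: 1 }
]
aggregate_by_minute(points).each do |(name, minute), sum|
  puts "#{name} #{minute} #{sum}"
end
```

Because sums are associative, a downstream stage can add these partial results together again without losing accuracy, which is what makes the aggregation "partial".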


Treasure Data

Development Philosophy


Open-Source Culture
•  TD prefers engineers who contribute to OSS products
–  MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc.
–  https://github.com/treasure-data?tab=members

•  Reasons
–  Fixing & improving other people’s code is crucial for a distributed team.
–  TD’s engineering workflow is very similar to an OSS product workflow.
–  A+ OSS engineers will bring another A+ OSS engineer!

OSS vs. Proprietary
•  OSS Everything on the Client Side
–  http://github.com/treasure-data/
–  http://fluentd.org/
•  TD is helping the world to collect more data in an analytics-ready format
•  2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) use it as an OSS product. 3-4% of those users are TD customers.
•  We also leverage other OSS products as much as possible.

•  Closed Source on the Cloud Side
–  The core value must remain proprietary to sustain the business.
–  Individual components can be OSS, but most of the system will remain proprietary to create the value chain.

How to decide Product Roadmap?
•  Solving the Customer Pain is the #1 Priority
–  Developers support customers directly and spend 30%-40% of development time talking with them
–  Developers are the BEST people to come up with the solution.
–  # of code lines != value

•  Suffering Oriented Development
–  First, make it possible
–  Then, make it beautiful
–  Then, make it fast

•  The Largest Customer Pain is NOT always applicable to other customers.
–  Need to be brave to say NO. NO. NO. NO. NO….

•  TD doesn’t have a 1-year Product Roadmap. A 3-month roadmap accelerates development, and other teams (marketing / sales) too.

Distributed Team (International)
•  13 Engineers as of Nov. 2013
–  5 Engineers in Tokyo, Japan
–  8 Engineers in Mountain View, USA
–  40% of the whole company

•  Asynchronous Communication
–  Use async communication tools as much as possible:
Chat, JIRA, Email, Github, etc.
–  Use video conferencing for weekly sync-up

•  English is the primary communication language
–  If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.

Distributed Team (Deployment)
•  Predictable Deployment Cycle
–  Weekly Deployment
•  Continuous Deployment didn’t fit our B2B SaaS application; our customers want changes to be predictable.
•  As a distributed team, it’s hard to track every change and its deployment status.
–  Track every change in JIRA; the QA engineer is also responsible for the deployment.

•  Continuous Deployment for Staging
–  Single branch, always automatically deployed to the staging environment
–  Monitoring is continuous testing

•  On-Call Alert Schedule, based on the Timezone
–  No need to get up around 3am


Leverage Cloud Services
•  Use Cloud Services as Much as Possible
–  Don’t hire people, use cloud services.
–  Outsource everything except your core value.
–  Developers tend to forget their own cost. One hour of your time already costs the company around $50.

•  Examples
–  EC2 (IaaS)
–  CopperEgg (Infrastructure Monitoring)
–  New Relic (Application Performance Management)
–  Hosted Chef (Configuration Management)
–  Librato Metrics (Application Metrics)
–  PagerDuty (Alerting)
–  Logentries (Log Search)
–  CircleCI, TravisCI (Continuous Integration)
–  HipChat, JIRA, Confluence (Development Tool)
–  Etc….

Treasure Data

Conclusion


Key points
•  Treasure Data, Inc
–  Cloud-based Data Service for the world
–  Customer-oriented development

•  Our Unique Products and Culture
–  Fluentd / Plazma (backend)
–  OSS enthusiasts

•  Use Cloud or not?
–  The cloud gives an idea leverage, but it is not a differentiator
–  Focus on your own vision!
