Treasure Data 

Exciting Coding!
Nov 2013
Presented by



Masahiro Nakagawa
Senior Software Engineer


www.treasuredata.com

1
Who are you
•  Masahiro Nakagawa
–  @repeatedly
–  masa@treasure-data.com or d@

•  Treasure Data, Inc
–  Senior Software Engineer
•  Fluentd / Client libraries / etc...

–  Since 2012/11

•  Open Source projects
–  D Programming Language
–  MessagePack: D, Python, etc…
–  Fluentd: Core, Mongo, Logger, etc…
–  Etc…

2
Company & Service Introduction

Board Meeting Presentation
August 15th, 2013 - 3:30PM PDT

Presented by


Hironobu Yoshikawa – CEO 
Kazuki Ohta – CTO 
Rich Ghiossi – VP, Marketing
Keith Goldstein – VP, Sales
Kengo Hirouchi – Director, Japan
Ankush Rustagi – Director, Marketing


www.treasuredata.com

3
Company Background
•  Founded 2011 in Mountain View, CA
–  The first cloud service for the entire data pipeline
–  Including: Acquisition, Storage, & Analysis

•  Provide a “Cloud Data Service”
–  Fast Time to Value
–  Cloud Flexibility and Economics
–  Simple and Well Supported

The Treasure Data Team
•  Hiro Yoshikawa – CEO
–  Open source business veteran
•  Kaz Ohta – CTO
–  Founder of the world’s largest Hadoop group
•  Jeff Yuan – Director, Engineering
–  LinkedIn, MIT / Michael Stonebraker Lab
•  Keith Goldstein – VP Sales & Bus Dev
–  VP of Bus Dev from Tibco and Talend
•  Rich Ghiossi – VP Marketing
–  VP of Marketing from ParAccel

•  Treasure Data has 100+ customers in production
–  Incl. Fortune 500 companies
–  500+ billion new records / month
–  Around 2 trillion records under management
–  Variety of use cases and verticals

Notable Investors
•  Othman Laraki
–  Ex-VP of Growth at Twitter
•  Jerry Yang
–  Founder of Yahoo!
•  Yukihiro “Matz” Matsumoto
–  Creator of the “Ruby” programming language
•  James Lindenbaum
–  Founder of Heroku

4
Problem Statement
•  Many companies today generate Big Data from “New Data Sources” (sensors, web logs, etc.)
–  But few have the resources to build a Big Data analytics system

•  60-70% of a company’s Big Data time & budget is consumed by:
–  Infrastructure setup & maintenance
–  Building collection & storage flows
–  Hiring/training Hadoop expertise

•  On average, it takes 6 months to get a Hadoop environment into production
5
6
Treasure Data’s Focus (80% of the needs)

7
8
Treasure Data Service: Overview
Acquire → Store → Analyze

•  Acquire
–  Treasure Agent: streaming log collector (JSON) for web logs, app logs, and sensor data
–  Bulk Import: parallel upload from CSV, MySQL, etc. for RDBMS, CRM, and ERP data

•  Store
–  Treasure Data Cloud: flexible, scalable, columnar storage

•  Analyze
–  BI Connectivity (REST API, SQL, Pig, JDBC / ODBC): BI tools such as Tableau, Metric Insights, QlikView, Excel, etc.
–  Result Push (REST API, SQL, Pig): dashboards, custom apps, local DBs, FTP servers, etc.

Time to Value / Economy & Flexibility / Simple & Supported

9
Our Value Propositions 
•  Faster time to value

On-demand cloud infrastructure & versatile streaming data collection agent
–  Instantly provision a fully tuned & managed infrastructure
–  Go live in production in 14 days on average (collection, analytics, & BI)

•  Cloud flexibility and economics

Fraction of the cost of traditional solutions by leveraging cloud storage and processing,
which scales to meet your needs
–  Leverage the cost-advantage of the cloud
–  Leverage the elasticity of the cloud – scale on demand
–  Predictable monthly subscription fee
–  No upfront costs & no long-term commitment

•  Simple and well supported
We are passionate about simplicity and excellent customer support
–  Focus your time on analyzing your data
–  Rely on us to keep your data secure & online
–  We love making customers successful & building long-term relationships

10
Initial Setup & Onboarding – Two Weeks
1. Data Collection
•  Setup, tuning, and monitoring of Treasure Agent
•  Embed Treasure Agent code into applications

2. Data Storage
•  Basic log templates (register, pay, login, etc.)
•  Basic KPI queries (DAU, MAU, ARPU, etc.)

3. Data Analysis
•  Set up dashboards with basic KPIs
•  Training on creating customized reports and ad hoc querying

4. Service & Support
•  A dedicated technical account manager is assigned
•  Real-time support via email, online chat, and phone

11
Solutions Accelerators

Stack (top to bottom): Out-of-the-Box Reporting / Treasure Data Platform / Configured Treasure Agent

Solution Components:
-  Treasure Data Platform
-  Event Collection Template
-  Pre-configured Treasure Agent Configuration
-  BI Dashboard with KPIs

12
- Vision -
gle Analytics Platform for the Wo

13
Treasure Data Platform

Architecture Overview

14
Data Acquisition – Streaming Capture
Application Server → Treasure Agent (local) → Treasure Data Cloud

# Application Code
...
...
# Post event to Treasure Data
TD.event.post('access', {:uid=>123})
...
...

Treasure Data Library
–  Java, Ruby, PHP, Perl, Python, Scala, Node.js

Treasure Agent (local)
•  Automatic Microbatching
•  Local buffering Fallback
•  Network Tolerance
•  Open-Sourced as Fluentd Project ( http://fluentd.org/ )
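
The agent behaviors named on this slide (automatic micro-batching, local buffering as a fallback, tolerance of network failures) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Treasure Agent or Fluentd implementation: the `MicroBatcher` class, its `batch_size` parameter, and the `sender` callable are hypothetical names introduced here.

```ruby
# Hypothetical sketch of micro-batching with a local buffering fallback.
# MicroBatcher, batch_size, and sender are illustrative names, not part of
# the Treasure Data library or Fluentd.
class MicroBatcher
  attr_reader :pending

  def initialize(batch_size:, sender:)
    @batch_size = batch_size
    @sender = sender   # callable that uploads an array of events
    @pending = []      # local buffer of events not yet delivered
  end

  # Queue one event; flush automatically once a full batch has accumulated.
  def post(table, record)
    @pending << { table: table, record: record }
    flush if @pending.size >= @batch_size
  end

  # Try to upload everything buffered so far. On a network error the events
  # stay in the local buffer and are retried on the next flush.
  def flush
    return true if @pending.empty?
    @sender.call(@pending.dup)
    @pending.clear
    true
  rescue IOError, SystemCallError
    false
  end
end

# Usage: a sender that fails once (simulated outage), then succeeds.
uploaded = []
attempts = 0
sender = lambda do |batch|
  attempts += 1
  raise IOError, "network down" if attempts == 1
  uploaded.concat(batch)
end

batcher = MicroBatcher.new(batch_size: 2, sender: sender)
batcher.post('access', { uid: 123 })
batcher.post('access', { uid: 124 })  # triggers a flush that fails; events are kept
batcher.post('access', { uid: 125 })  # triggers a retry that delivers all three
```

The application only calls `post`; batching and retries stay inside the buffer, which is the point of putting a local agent between the application and the cloud.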

15
Data Acquisition – Bulk Loader
Sources: RDBMS, App, SaaS, FTP
Formats: CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP

Sources → Bulk Loader → Treasure Data Cloud

Prepare → Upload → Perform → Commit
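
The four bulk-load phases can be pictured as a small session object that only lets the phases run in order. The slide names the phases but not their internals, so what each phase does below is an assumption made for illustration, and `BulkLoadSession` and its methods are hypothetical names, not the Treasure Data bulk loader API.

```ruby
# Hypothetical sketch of a four-phase bulk load. BulkLoadSession and its
# methods are illustrative names; what each phase does here is an assumption.
class BulkLoadSession
  PHASES = [:prepare, :upload, :perform, :commit].freeze

  def initialize(table)
    @table = table   # committed rows, visible to queries
    @done = []       # phases completed so far
    @parts = []      # prepared chunks of rows
    @staged = []     # rows uploaded but not yet committed
  end

  # Prepare: split source rows into fixed-size parts.
  def prepare(rows, part_size: 2)
    advance(:prepare)
    @parts = rows.each_slice(part_size).to_a
  end

  # Upload: move every part into the staging area.
  def upload
    advance(:upload)
    @staged = @parts.flatten(1)
  end

  # Perform: validate staged rows (here: each row must carry a time).
  def perform
    advance(:perform)
    @staged = @staged.select { |row| row.key?(:time) }
  end

  # Commit: make all staged rows visible in one step.
  def commit
    advance(:commit)
    @table.concat(@staged)
  end

  private

  # Enforce the Prepare -> Upload -> Perform -> Commit order.
  def advance(phase)
    expected = PHASES[@done.size]
    raise ArgumentError, "expected #{expected.inspect}, got #{phase.inspect}" unless phase == expected
    @done << phase
  end
end

table = []
session = BulkLoadSession.new(table)
session.prepare([{ time: 1, v: 'a' }, { v: 'no time' }, { time: 3, v: 'c' }])
session.upload
session.perform
session.commit
```

Keeping commit as a separate final step means a failed or partial upload never becomes visible in the table.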

16
Data Storage

Treasure Data Cloud

Default (schema-less)

time         v
1384160400   {"ip":"135.52.211.23", "code":"0"}
1384162200   {"ip":"45.25.38.156", "code":"-1"}
1384164000   {"ip":"97.12.76.55", "code":"99"}

SELECT v['ip'] as ip, v['code'] as code …

Schema applied (~30% Faster)

time         ip : string      code : int
1384160400   135.52.211.23    0
1384162200   45.25.38.156     -1
1384164000   97.12.76.55      99

SELECT ip, code …

•  Stored “schema-less” as JSON
–  Schema can be applied/updated AFTER storage

•  Compressed & columnar format
–  For higher query performance

•  Optimized for time-based filtering

•  Quickly scale-up processing power
–  WITHOUT reloading/redistributing the data
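
Applying a schema after storage can be sketched as a pass that turns the schema-less rows into one typed array per column. This is only an illustration of the idea, not Treasure Data's storage format: the `SCHEMA` hash, `apply_schema`, and `rows_between` are hypothetical names, and the rows are the three sample records from the slide.

```ruby
require 'json'

# Rows as stored by default: a time plus a schema-less JSON value.
ROWS = [
  [1384160400, '{"ip":"135.52.211.23", "code":"0"}'],
  [1384162200, '{"ip":"45.25.38.156", "code":"-1"}'],
  [1384164000, '{"ip":"97.12.76.55", "code":"99"}']
].freeze

# Hypothetical schema applied after storage: column name => caster.
SCHEMA = {
  'ip'   => ->(raw) { raw.to_s },
  'code' => ->(raw) { Integer(raw) }
}.freeze

# Build one typed array per column (a columnar layout).
def apply_schema(rows, schema)
  columns = { 'time' => [] }
  schema.each_key { |name| columns[name] = [] }
  rows.each do |time, json|
    value = JSON.parse(json)
    columns['time'] << time
    schema.each { |name, cast| columns[name] << cast.call(value[name]) }
  end
  columns
end

# Time-based filtering only needs to scan the time column.
def rows_between(columns, from, to)
  columns['time'].each_index.select { |i| (from..to).cover?(columns['time'][i]) }
end

columns = apply_schema(ROWS, SCHEMA)
hits = rows_between(columns, 1384162000, 1384165000)
```

Because the stored rows are untouched, changing `SCHEMA` later (for example, reading `code` as a string again) only changes how the columns are materialized.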

17
Data Analysis
REST API → Treasure Data Cloud

Heavy Lifting SQL (Hive):
-  Hive’s Built-in UDFs
-  TD Added Functions:
   -  Time Functions
   -  First, Last, Rank
   -  Sessionize

Interactive SQL:
-  Treasure Query Accelerator (Impala)

Scripted Processing (Pig):
-  DataFu (LinkedIn)
-  Piggybank (Apache)

Scheduled Jobs:
-  SQL, Pig Scripts
-  Data Pushes

JDBC Connectivity:
-  Custom Java Apps
-  Standards-based
-  BI Tool Integration

Tableau ODBC connector:
-  Leverages Impala

Push Query Results:
-  MySQL, PostgreSQL
-  Google Spreadsheet
-  Web, FTP, S3
-  Leftronic, Indicee
-  Treasure Data Table
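
As an illustration of what a "Sessionize" function computes, the sketch below groups each user's events into sessions separated by a period of inactivity. It is a plain Ruby rendering of the general technique, not the Treasure Data Hive function; the `sessionize` method, the `timeout` parameter, and the session label format are assumptions made here.

```ruby
# Hypothetical sketch of sessionization: events from the same user belong to
# one session until a gap longer than `timeout` seconds appears.
def sessionize(events, timeout:)
  last_seen = {}            # uid => time of that user's previous event
  session_no = Hash.new(0)  # uid => number of the user's current session
  events.sort_by { |e| e[:time] }.map do |event|
    uid = event[:uid]
    previous = last_seen[uid]
    # Start a new session on the first event or after a long gap.
    session_no[uid] += 1 if previous.nil? || event[:time] - previous > timeout
    last_seen[uid] = event[:time]
    event.merge(session: "#{uid}-#{session_no[uid]}")
  end
end

events = [
  { uid: 1, time: 0 },
  { uid: 1, time: 100 },
  { uid: 2, time: 150 },
  { uid: 1, time: 5000 }
]
sessions = sessionize(events, timeout: 3600).map { |e| e[:session] }
```

With a one-hour timeout, user 1's first two events share a session and the event at time 5000 starts a new one.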

18
Treasure Data

General Use Cases

19
A case: “14 Days” from Signup to Success

1.  Europe’s largest mobile ad exchange.
2.  Serving >60 billion impressions/month for >30,000 mobile apps (Q4 2013)
3.  Immediate need for analytics infrastructure: ASAP!
4.  With TD, MobFox got into production in only 14 days, with one engineer.

"Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it."

Julian Zehetmayr, CEO & Founder
20
A case: “Replace” in-house Hadoop with TD

1.  A global “Hulu”: online video service with millions of users
2.  Video content is distributed in over 150 languages.
3.  Before: had a hard time maintaining a Hadoop cluster
4.  After: with TD, Viki retired its in-house Hadoop cluster and uses its engineers for the core business.

"Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service."

Huy Nguyen, Software Engineer
21
A case: Treasure Data with BI Tool (Tableau)

1.  World’s largest Android application market
2.  Serving >3 billion app downloads for >100 million users
3.  Only one engineer managing the data infrastructure
4.  With TD, the data engineer can focus on analyzing data with the existing BI tool

"I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business."

Simon Dong, Principal Architect - Data Engineering


22
Treasure Data Platform

Fluentd Overview

23
What is Fluentd?
•  Open source log collector written in Ruby
–  Easy to use, reliable, and fast
–  Streaming event processing

•  Uses the RubyGems ecosystem to distribute plugins

Fluentd
the missing log collector
fluentd.org
24
Data processing pipeline
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring
25
Data processing pipeline
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring

Collect: important, but no de facto middleware!
26
Fluentd general example
Web Server → apache.log → tail → Fluentd (event buffering) → insert

apache.log:
127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2012:07:27:01] "GET / ...
...

Event:
2012-02-04 01:33:51
apache.log
{
  "host": "127.0.0.1",
  "method": "GET",
  ...
}
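
The tail step on this slide, turning one access-log line into a tagged, timestamped event with a JSON-style record, might look like the sketch below. It is not Fluentd's tail plugin or parser: `parse_access_line` and `LINE_PATTERN` are hypothetical names, and because the slide truncates the log lines, the timezone, protocol, status, and size fields in the sample line are assumptions.

```ruby
require 'time'

# Hypothetical pattern for an Apache-style access log line.
LINE_PATTERN = /\A(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d{3}) (?<size>\d+|-)/

# Turn one log line into [tag, unix_time, record], or nil if it does not parse.
def parse_access_line(tag, line)
  m = LINE_PATTERN.match(line)
  return nil unless m
  time = Time.strptime(m[:time], '%d/%b/%Y:%H:%M:%S %z')
  record = {
    'host'   => m[:host],
    'method' => m[:method],
    'path'   => m[:path],
    'code'   => m[:code].to_i
  }
  [tag, time.to_i, record]
end

# Sample line; everything after "GET /" is assumed, not taken from the slide.
line = '127.0.0.1 - - [11/Dec/2012:07:26:27 +0000] "GET / HTTP/1.1" 200 777'
event = parse_access_line('apache.log', line)
skipped = parse_access_line('apache.log', 'not an access log line')
```

Lines that do not match return `nil`, so a caller can skip or count malformed input instead of failing on it.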
27
Pluggable Architecture
Engine with pluggable components:

Input
> Forward
> HTTP
> File tail
> dstat
> ...

Buffer
> File
> Memory

Output
> Forward
> File
> MongoDB
> ...

Output
> rewrite
> ...
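
The input / buffer / output split can be sketched as a tiny engine to which outputs are attached by tag pattern. This is a simplified illustration, not Fluentd's plugin API: `Engine`, `MemoryBuffer`, and `ArrayOutput` are hypothetical names, and `File.fnmatch` stands in for Fluentd's own tag matching rules.

```ruby
# Hypothetical in-memory buffer: events are grouped per tag until drained.
class MemoryBuffer
  def initialize
    @chunks = Hash.new { |hash, tag| hash[tag] = [] }
  end

  def push(tag, event)
    @chunks[tag] << event
  end

  # Yield each tag with its buffered events, then empty the buffer.
  def drain
    @chunks.each { |tag, events| yield tag, events }
    @chunks.clear
  end
end

# Hypothetical engine: inputs call emit, outputs are matched by tag pattern.
class Engine
  def initialize(buffer)
    @buffer = buffer
    @matches = []   # [pattern, output] pairs, checked in order
  end

  # Register an output plugin: any object that responds to write(tag, events).
  def add_output(pattern, output)
    @matches << [pattern, output]
  end

  # Called by input plugins for every event they produce.
  def emit(tag, record)
    @buffer.push(tag, record)
  end

  # Deliver buffered events to the first output whose pattern matches the tag.
  def flush
    @buffer.drain do |tag, events|
      _pattern, output = @matches.find { |pattern, _| File.fnmatch(pattern, tag) }
      output.write(tag, events) if output
    end
  end
end

# Hypothetical output plugin that just collects what it receives.
class ArrayOutput
  attr_reader :written

  def initialize
    @written = []
  end

  def write(tag, events)
    events.each { |event| @written << [tag, event] }
  end
end

engine = Engine.new(MemoryBuffer.new)
access_out = ArrayOutput.new
other_out = ArrayOutput.new
engine.add_output('apache.*', access_out)
engine.add_output('*', other_out)
engine.emit('apache.access', { 'method' => 'GET' })
engine.emit('app.login', { 'uid' => 123 })
engine.flush
```

Swapping `MemoryBuffer` or `ArrayOutput` for another object with the same methods changes where events are held or sent without touching the engine, which is what makes each stage pluggable.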

28
Meet your requirements by writing a plugin

Inputs
•  Access logs: Apache
•  App logs: Frontend, Backend
•  System logs: syslogd
•  Databases

→ filter / buffer / routing →

Outputs
•  Alerting: Nagios
•  Analysis: MongoDB, MySQL, Hadoop
•  Archiving: Amazon S3
29
Treasure Agent (td-agent)
•  Open source distribution package of Fluentd
–  ETL part of Treasure Data
–  deb / rpm / homebrew

•  Includes useful components
–  Ruby, jemalloc, fluentd
–  3rd party gems: td, mongo, webhdfs, etc…
–  Init script

•  http://packages.treasuredata.com/
30
Fluentd users

31
Treasure Data Platform

Backend Overview

32
AWS components
•  RDS
–  Store user information, job, status, etc…
–  Queue Worker / Scheduler

•  EC2
–  API Server, Hadoop Cluster, Job Worker / Scheduler

•  S3
–  Columnar storage
•  Realtime / Archive storage
•  MessagePack columnar

•  ELB
33
Plazma (Hadoop, Storage, Queue, and Workers)

Components: Frontend, Queue, Worker, Hadoop

•  Applications push metrics to Fluentd (via local Fluentd)
•  Fluentd sums up data per minute (partial aggregation)
•  Fluentd → Treasure Data, for historical analysis
•  Fluentd → Librato Metrics, for realtime analysis
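
The partial aggregation step can be sketched as summing metric values per metric name and minute, so that only one data point per minute is forwarded for realtime analysis. This is an illustration of the idea rather than the Fluentd configuration used at Treasure Data; `aggregate_by_minute` and the metric names are hypothetical.

```ruby
# Hypothetical sketch of partial aggregation: sum metric values per
# (metric name, minute) so only one data point per minute is forwarded.
def aggregate_by_minute(events)
  sums = Hash.new(0)
  events.each do |event|
    minute = event[:time] - (event[:time] % 60)   # start of the minute
    sums[[event[:name], minute]] += event[:value]
  end
  sums.map { |(name, minute), total| { name: name, time: minute, value: total } }
end

events = [
  { name: 'api.requests', time: 1384160401, value: 3 },
  { name: 'api.requests', time: 1384160430, value: 2 },
  { name: 'api.requests', time: 1384160465, value: 4 },
  { name: 'job.queued',   time: 1384160402, value: 1 }
]
points = aggregate_by_minute(events)
```

The raw events can still go to long-term storage unchanged; only the realtime path receives the smaller, pre-summed stream.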

34
Treasure Data

Development Philosophy

35
Open-Source Culture
•  TD prefers engineers who contribute to OSS products
–  MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc.
–  https://github.com/treasure-data?tab=members

•  Reasons
–  Fixing and improving other people’s code is crucial for a distributed team.
–  TD’s engineering workflow is very similar to an OSS project workflow.
–  An A+ OSS engineer will bring in another A+ OSS engineer!
36
OSS vs. Proprietary
•  OSS Everything on the Client Side
–  http://github.com/treasure-data/
–  http://fluentd.org/
•  TD is helping the world to collect more data in an analytics-ready format
•  2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) use it as an OSS product. 3-4% of these users are TD customers.
•  We also leverage other OSS products as much as possible.

•  Closed Source on the Cloud Side
–  The core value must be proprietary to sustain the business.
–  Individual components can be OSS, but most of the system will remain proprietary to create the value chain.
37
How to Decide the Product Roadmap?
•  Solving the Customer Pain is the #1 Priority

–  Developers support customers directly and spend 30%-40% of their development time talking with customers
–  Developers are the BEST people to come up with solutions.
–  # of code lines != value

•  Suffering Oriented Development
–  First, make it possible
–  Then, make it beautiful
–  Then, make it fast

•  The Largest Customer Pain is NOT always applicable to other customers.
–  Need to be brave to say NO. NO. NO. NO. NO….

•  TD doesn’t have a 1-year product roadmap. A 3-month roadmap accelerates development, and other teams (marketing / sales), too.
38
Distributed Team (International)
•  13 Engineers as of Nov. 2013
–  5 Engineers in Tokyo, Japan
–  8 Engineers in Mountain View, USA
–  40% of the whole company

•  Asynchronous Communication
–  Use async communication tools as much as possible:
Chat, JIRA, Email, Github, etc.
–  Use video conferencing for weekly sync-up

•  English is the primary communication language
–  If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.
39
Distributed Team (Deployment)
•  Predictable Deployment Cycle
–  Weekly Deployment

•  Continuous deployment didn’t fit a B2B SaaS application; our customers want changes to be predictable.
•  As a distributed team, it’s hard to track every change and the deployment status.

–  Track every change in JIRA; the QA engineer is also responsible for deployment.

•  Continuous Deployment for Staging

–  Single branch, always automatically deployed to the staging environment
–  Monitoring is continuous testing

•  On-Call Alert Schedule, based on the Timezone
–  No need to get up around 3am

40
Leverage Cloud Services
•  Use Cloud Services as Much as Possible

–  Don’t hire people, use cloud services.
–  Outsource everything except your core value.
–  Developers tend to forget their own cost. One hour of your time already costs the company around $50.

•  Examples
–  EC2 (IaaS)
–  CopperEgg (Infrastructure Monitoring)
–  NewRelic (Application Performance Management)
–  Hosted Chef (Configuration Management)
–  Librato Metrics (Application Metrics)
–  Pager Duty (Alerting)
–  Logentries (Log Search)
–  CircleCI, TravisCI (Continuous Integration)
–  HipChat, JIRA, Confluence (Development Tool)
–  Etc.
41
Treasure Data

Conclusion


42
Key points
•  Treasure Data, Inc
–  Cloud-based Data Service for the world
–  Customer-oriented development

•  Our Unique Products and Culture
–  Fluentd / Plazma (backend)
–  OSS enthusiasts

•  Use Cloud or not?
–  The cloud gives an idea leverage, but it is not a differentiator
–  Focus on your own vision!
43

Information Processing Society of Japan (IPSJ): Exciting Coding! Treasure Data

  • 1. Treasure Data 
 Exciting Coding! Nov 2013 Presented by Masahiro Nakagawa Senior Software Engineer www.treasuredata.com 1
  • 2. Who are you •  Masahiro Nakagawa –  @repeatedly –  masa@treasure-data.com or d@ •  Treasure Data, Inc –  Senior Software Engineer •  Fluentd / Client libraries / etc... –  Since 2012/11 •  Open Source projects –  D Programming Language –  MessagePack: D, Python, etc… –  Fluentd: Core, Mongo, Logger, etc… –  Etc… 2
  • 3. Company & Board Meeting Presentation Service Introduction August 15th, 2013 - 3:30PM PDT Presented by Hironobu Yoshikawa – CEO Kazuki Ohta – CTO Rich Ghiossi – VP, Marketing Keith Goldstein – VP, Sales Kengo Hirouchi – Director, Japan Ankush Rustagi – Director, Marketing www.treasuredata.com 3
  • 4. Company Background •  Founded 2011 in Mountain View, CA –  The first cloud service for the entire data pipeline –  Including: Acquisition, Storage, & Analysis •  Provide a “Cloud Data Service” –  Fast Time to Value –  Cloud Flexibility and Economics –  Simple and Well Supported The Treasure Data Team Hiro Yoshikawa – CEO Open source business veteran Kaz Ohta – CTO Founder of world’s largest Hadoop Group Jeff Yuan – Director, Engineering LinkedIn, MIT / Michale Stonrebrraker Lab Keith Goldstein – VP Sales & Bus Dev VP of Bus Dev from Tibco and Talend Rich Ghiossi – VP Marketing VP of Marketing from ParAccel Notable Investors •  Treasure Data has over 100+ customers in production –  Incl. Fortune 500 companies –  500+ Billion new records / month –  Around 2 Trillion records under management –  Variety of use cases and verticals Othman Laraki Ex-VP of Growth at Twitter Jerry Yang Founder of Yahoo! Yukihiro “Matz” Matusmoto Creator of “Ruby” programming language James Lindenbaum Founder of Heroku 4
  • 5. Problem Statement •  Lots of companies today produce Big Data by having “New Data Sources” (Sensor, Weblog, etc) –  But few have the resources to build a Big Data Analytics system •  60-70% of a company’s Big Data time & budget consumed by: –  Infrastructure setup & Maintenance –  Building Collection & Storage Flows –  Hiring/Training Hadoop Expertise •  On average, it takes 6 months to get a Hadoop environment into production 5
  • 6. 6
  • 8. 8
  • 9. Treasure Data Service: Overview Acquire Store Analyze Web logs Treasure Agent App logs BI Connectivity Streaming Log ! Collector (JSON)! REST API, SQL, Pig, JDBC / ODBC! Sensor Tableau, Metric Insights, QlikView, Excel, etc. Treasure Data Cloud RDBMS Bulk Import CRM BI Tools Parallel Upload from CSV, MySQL, etc.! Flexible, Scalable, Columnar Storage! ERP Time to Value Economy & Flexibility Result Push REST API, SQL, Pig! Dashboards Custom App, Local DB, FTP Server, etc. Simple & Supported 9
  • 10. Our Value Propositions •  Faster time to value On-demand cloud infrastructure & versatile streaming data collection agent –  Instantly provision a fully tuned & managed infrastructure –  Go live into production on average in 14 days (collection, analytics, & BI) •  Cloud flexibility and economics Fraction of the cost of traditional solutions by leveraging cloud storage and processing, which scales to meet your needs –  Leverage the cost-advantage of the cloud –  Leverage the elasticity of the cloud – scale on demand –  Predictable monthly subscription fee –  No upfront costs & no long-term commitment •  Simple and well supported We are passionate about simplicity, and customer support excellence –  Focus your time on analyzing your data –  Rely on us to keep your data secure & online –  We love making customers successful & building long-term relationships 10
  • 11. Initial Setup & Onboarding – Two Weeks 1. Data Collection 2. Data Storage •  Setup, tuning, and monitoring of Treasure Agent •  Embed Treasure Agent code into applications •  Basic log templates (register, pay, login, etc.) •  Basic KPI queries (DAU, MAU, ARPU, etc.) 3. Data Analysis 4. Service & Support •  Setup dashboards with basic KPIs •  Training on creating customized reports and adhoc querying •  Assigned a dedicated technical account manager •  Real-time support via email, online chat, and call 11
  • 12. Solutions Accelerators … Out-of-the Box Reporting Treasure Data Platform Configured Treasure Agent Solution Components: -  Treasure Data Platform -  Event Collection Template -  Pre-configured Treasure Agent Configuration -  BI Dashboard with KPIs 12
  • 13. - Vision - gle Analytics Platform for the Wo 13
  • 14. Treasure Board Meeting DataPresentation Platform August 15th, 2013 - 3:30PM PDT Architecture Overview Presented by Hironobu Yoshikawa – CEO Kazuki Ohta – CTO Rich Ghiossi – VP, Marketing Keith Goldstein – VP, Sales Kengo Hirouchi – Director, Japan Ankush Rustagi – Director, Marketing www.treasuredata.com 14
  • 15. Data Acquisition – Streaming Capture Application Server # Application Code ... ... # Post event to Treasure Data TD.event.post('access', {:uid=>123}) •  Automatic Microbatching •  Local buffering Fallback •  Network Tolerance ... ... Treasure Data Library Java, Ruby, PHP, Perl, Python, Scala, Node.js Treasure Data Cloud Treasure Agent (local) Open-Sourced as Fluentd Project ( http://fluentd.org/ ) 15
  • 16. Data Acquisition – Bulk Loader RDBMS App SaaS CSV, TSV, JSON, MessagePack, Apache, regex, MySQL, FTP FTP Treasure Data Cloud Bulk Loader Prepare ! Upload ! Perform ! Commit 16
  • 17. Data Storage Treasure Data Cloud Default (schema-less) time v 13841604 00 {“ip”:”135.52.211.23”, “code”:”0”} 13841622 00 {“ip”:”45.25.38.156”, “code”:”-1”} 13841640 00 {“ip”:”97.12.76.55”, “code”:”99”} •  Stored “schema-less” as JSON –  Schema can be applied/updated AFTER storage •  Compressed & columnar format SELECT v[‘ip’] as ip, v[‘code’] as code … Schema applied ~30% Faster time ip : string 135.52.211.23 45.25.38.156 97.12.76.55 •  Quickly scale-up processing power –  WITHOUT reloading/redistributing the data -1 138416400 0 •  Optimized for time-based filtering 0 138416220 0 For higher query performance code : int 138416040 0 –  99 SELECT ip, code … 17
  • 18. Data Analysis REST API Treasure Data Cloud Heavy Lifting SQL (Hive): -  Hive’s Built-in UDFs -  TD Added Functions: -  Time Functions -  First, Last, Rank -  Sessionize Scheduled Jobs -  SQL, Pig Scripts -  Data Pushes JDBC Connectivity: -  Custom Java Apps -  Standards-based -  BI Tool Integration Tableau ODBC connector -  Leverages Impala Interactive SQL Push Query Results: Treasure Query Accelerator -  MySQL, PostgreSQL (Impala) -  Google Spreadsheet -  Web, FTP, S3 Scripted Processing (Pig): -  Leftronic, Indicee -  DataFu (LinkedIn) -  Treasure Data Table -  Piggybank (Apache) 18
  • 19. Treasure Board Meeting Presentation Data August 15th, 2013 - 3:30PM PDT General Use Cases Presented by Hironobu Yoshikawa – CEO Kazuki Ohta – CTO Rich Ghiossi – VP, Marketing Keith Goldstein – VP, Sales Kengo Hirouchi – Director, Japan Ankush Rustagi – Director, Marketing www.treasuredata.com 19
  • 20. A case: “14 Days” from Signup to Success 1.  Europe’s largest mobile ad exchange. 2.  Serving >60 billion imps/ month for >30,000 mobile apps (Q4 2013) 3.  Immediate need of analytics infrastructure: ASAP! 4.  With TD, MobFox got into production only in 14 days, by one engineer. "Time is the most precious asset in our fast-moving business, and Treasure Data saved us a lot of it." 
 Julian Zehetmayr, CEO & Founder 20
  • 21. A case: “Replace” in-house Hadoop to TD Before 1.  Global “Hulu” - Online Video Service with millions of users 2.  Video contents are distributed to over 150 languages. After 3.  Had hard time maintaining Hadoop cluster 4.  With TD, Viki deprecated their in-house Hadoop cluster and use engineer for core businesses. “Treasure Data has always given us thorough and timely support peppered with insightful tips to make the best use of their service." Huy Nguyen, Software Engineer 21
  • 22. A case: Treasure Data with BI Tool (Tableau) 1.  World’s largest Android application market. 2.  Serving >3 billion app downloads for >100 million users. 3.  Only one engineer managing the data infrastructure. 4.  With TD, the data engineer can focus on analyzing data with the existing BI tool. "I will recommend Treasure Data to my friends in a heartbeat because it benefits all three stakeholders: Operations, Engineering and Business." Simon Dong, Principal Architect - Data Engineering
  • 23. Fluentd Overview
  • 24. What is Fluentd? •  Open-source log collector written in Ruby –  Easy to use, reliable, and performant –  Streaming event processing •  Uses the RubyGems ecosystem to distribute plugins. Fluentd: the missing log collector (fluentd.org)
  • 25. Data processing pipeline: Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
  • 26. Data processing pipeline: Important, but no de facto middleware! Data source → Collect → Store → Process → Visualize (Reporting, Monitoring)
  • 27. Fluentd general example •  Web Server writes access log lines such as 127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / ... (further lines at 07:26:30, 07:26:32, 07:26:40, 07:27:01, ...) •  Fluentd reads them with tail and turns each line into an event: time 2012-02-04 01:33:51, tag apache.log, record { "host": "127.0.0.1", "method": "GET", ... } •  Event buffering, then insert
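To make the tail-to-event step concrete, here is a minimal Python sketch of turning one access log line into a (tag, time, record) triple. It is an illustration only, not Fluentd's parser: the regular expression is a simplified pattern written for this example, the sample line's trailing fields are made up because the slide elides them, and treating the timestamp as UTC is an assumption.

```python
import calendar
import re
from datetime import datetime

# Simplified pattern (assumption): host, two ignored fields, [timestamp],
# then the start of the quoted request: "METHOD PATH ...
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)'
)


def parse_line(tag, line):
    """Return (tag, epoch_seconds, record), or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    parsed = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S")
    # The slide's lines carry no time zone, so UTC is assumed here.
    time = calendar.timegm(parsed.timetuple())
    record = {
        "host": match.group("host"),
        "method": match.group("method"),
        "path": match.group("path"),
    }
    return (tag, time, record)


# Hypothetical sample line; everything after "GET /" is invented for illustration.
SAMPLE = '127.0.0.1 - - [11/Dec/2012:07:26:27] "GET / HTTP/1.1" 200 777'
event = parse_line("apache.log", SAMPLE)
```

Returning `None` for unparseable lines keeps the sketch simple; a real collector would more likely route such lines somewhere visible rather than drop them.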
  • 28. Pluggable Architecture: Engine with pluggable components •  Input: Forward, HTTP, File tail, dstat, ... •  Buffer: File, Memory •  Output: Forward, File, MongoDB, ... •  Output: rewrite, ...
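The input, engine, buffer and output split can be sketched as below. This is a toy model under stated assumptions, not Fluentd's implementation (which is written in Ruby): the class names, the `chunk_limit` parameter and the use of `fnmatch` for tag matching are all hypothetical stand-ins, and Fluentd's real match rules differ.

```python
from fnmatch import fnmatch


class MemoryBuffer:
    """Buffer plugin stand-in: holds events until a chunk is full."""

    def __init__(self, chunk_limit):
        self.chunk_limit = chunk_limit
        self.events = []

    def append(self, event):
        """Store one event; return True when the chunk should be flushed."""
        self.events.append(event)
        return len(self.events) >= self.chunk_limit

    def take_chunk(self):
        """Hand over the buffered events and start a new empty chunk."""
        chunk, self.events = self.events, []
        return chunk


class ListOutput:
    """Output plugin stand-in: collects flushed chunks in a list."""

    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(chunk)


class Engine:
    """Routes events by tag to a (buffer, output) pair; first match wins."""

    def __init__(self):
        self.routes = []

    def add_route(self, pattern, buffer, output):
        self.routes.append((pattern, buffer, output))

    def emit(self, tag, time, record):
        for pattern, buffer, output in self.routes:
            if fnmatch(tag, pattern):
                if buffer.append((tag, time, record)):
                    output.write(buffer.take_chunk())
                return True
        return False  # no route matched this tag

    def flush(self):
        """Force out whatever is still buffered."""
        for _, buffer, output in self.routes:
            chunk = buffer.take_chunk()
            if chunk:
                output.write(chunk)


engine = Engine()
apache_out = ListOutput()
engine.add_route("apache.*", MemoryBuffer(chunk_limit=2), apache_out)
engine.emit("apache.access", 1, {"method": "GET"})
engine.emit("apache.access", 2, {"method": "POST"})
engine.emit("apache.error", 3, {"level": "warn"})
unrouted = engine.emit("syslog.auth", 4, {"msg": "ignored"})
engine.flush()
```

Because the engine only depends on `append`, `take_chunk` and `write`, a file-backed buffer or a database output could be swapped in without touching the routing code, which is the point of the pluggable design.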
  • 29. Resolve your requirements by writing plugins •  Sources: Access logs (Apache); App logs (Frontend, Backend); System logs (syslogd); Databases •  Fluentd: filter / buffer / routing •  Destinations: Alerting (Nagios); Analysis (MongoDB, MySQL, Hadoop); Archiving (Amazon S3)
  • 30. Treasure Agent (td-agent) •  Open-source distribution package of Fluentd –  ETL part of Treasure Data –  deb / rpm / homebrew •  Includes useful components –  Ruby, jemalloc, fluentd –  3rd party gems: td, mongo, webhdfs, etc… –  Init script •  http://packages.treasuredata.com/
  • 32. Backend Overview
  • 33. AWS components •  RDS –  Store user information, job, status, etc… –  Queue Worker / Scheduler •  EC2 –  API Server, Hadoop Cluster, Job Worker / Scheduler •  S3 –  Columnar storage •  Realtime / Archive storage •  MessagePack columnar •  ELB
  • 34. Plazma (Hadoop, Storage, Queue and Workers) •  Components: Frontend, Worker, Queue, Hadoop •  Applications push metrics to Fluentd (via local Fluentd) •  Fluentd sums up data per minute (partial aggregation) •  Treasure Data for historical analysis •  Librato Metrics for realtime analysis
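Partial aggregation, as mentioned on this slide, can be sketched as summing metric points into time buckets before forwarding them. The sketch below is an assumption-laden illustration rather than the actual Fluentd setup: one-minute buckets are assumed, `aggregate_by_minute` is a hypothetical helper, and the metric names are invented.

```python
from collections import defaultdict


def aggregate_by_minute(points):
    """Sum (time, name, value) points into per-minute buckets per metric name."""
    sums = defaultdict(float)
    for time, name, value in points:
        minute = time - (time % 60)  # start of the minute, in epoch seconds
        sums[(minute, name)] += value
    return [(minute, name, total) for (minute, name), total in sorted(sums.items())]


# Hypothetical metric points; names and values are made up for illustration.
POINTS = [
    (1384160401, "jobs.finished", 1),
    (1384160430, "jobs.finished", 2),
    (1384160459, "queue.wait_sec", 1.5),
    (1384160460, "jobs.finished", 4),
]
summary = aggregate_by_minute(POINTS)
```

Sending one summed point per metric per minute, instead of every raw point, reduces the volume pushed to the downstream metrics service while keeping totals exact.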
  • 35. Development Philosophy
  • 36. Open-Source Culture •  TD prefers engineers who contribute to OSS products –  MessagePack, Fluentd, ZeroMQ, Hadoop, MongoDB, Angular.js, Huahin, D-Lang, etc. –  https://github.com/treasure-data?tab=members •  Reasons –  Fixing and improving other people’s code is crucial for a distributed team. –  TD’s engineering workflow is very similar to the OSS product workflow. –  A+ OSS engineers will bring in another A+ OSS engineer!
  • 37. OSS vs. Proprietary •  OSS Everything on the Client Side –  http://github.com/treasure-data/ –  http://fluentd.org/ •  TD is helping the world to collect more data in an analytics-ready format •  2000+ companies (e.g. Nintendo, SlideShare/LinkedIn) use it as an OSS product. 3-4% of the users are TD’s customers. •  We also leverage other OSS products as much as possible. •  Closed Source on the Cloud Side –  The core value must be proprietary to sustain the business. –  The components can be OSS, but most of the system will remain proprietary to create the value chain.
  • 38. How to decide Product Roadmap? •  Solving the Customer Pain is the #1 Priority –  Developers provide support to customers directly and spend 30%-40% of their development time talking with customers. –  Developers are the BEST people to come up with the solution. –  # of code lines != value •  Suffering Oriented Development –  First, make it possible –  Then, make it beautiful –  Then, make it fast •  The Largest Customer Pain is NOT always applicable to other customers. –  Need to be brave to say NO. NO. NO. NO. NO…. •  TD doesn’t have a 1-year Product Roadmap. Having a 3-month roadmap accelerates development, and other teams (marketing / sales), too.
  • 39. Distributed Team (International) •  13 Engineers as of Nov. 2013 –  5 Engineers in Tokyo, Japan –  8 Engineers in Mountain View, USA –  40% of the whole company •  Asynchronous Communication –  Use async communication tools as much as possible: Chat, JIRA, Email, Github, etc. –  Use video conferencing for weekly sync-up •  English is the primary communication language –  If you cannot speak English, your value is nearly zero on the Treasure Data engineering team.
  • 40. Distributed Team (Deployment) •  Predictable Deployment Cycle –  Weekly Deployment •  Continuous Deployment didn’t fit a B2B SaaS application; our customers want changes to be predictable. •  As a distributed team, it’s hard to track every change and the deployment status. –  Track every change in JIRA; the QA engineer is responsible for the deployment too. •  Continuous Deployment for Staging –  Single branch, always automatically deployed to the staging environment –  Monitoring is continuous testing •  On-Call Alert Schedule, based on the Timezone –  No need to get up around 3am
  • 41. Leverage Cloud Services •  Use Cloud Services as Much as Possible –  Don’t hire people, use cloud services. –  Outsource everything, except your core value. –  Developers tend to forget their own cost. If you spend 1 hour, it already costs the company around $50. •  Examples –  EC2 (IaaS) –  CopperEgg (Infrastructure Monitoring) –  NewRelic (Application Performance Management) –  Hosted Chef (Configuration Management) –  Librato Metrics (Application Metrics) –  Pager Duty (Alerting) –  Logentries (Log Search) –  CircleCI, TravisCI (Continuous Integration) –  HipChat, JIRA, Confluence (Development Tool) –  Etc….
  • 42. Conclusion
  • 43. Key points •  Treasure Data, Inc –  Cloud based Data Service for the world –  Customer oriented development •  Our Unique Products and Culture –  Fluentd / Plazma (backend) –  OSS enthusiast •  Use Cloud or not? –  Cloud leverages an idea, but is not a differentiator –  Focus on your own vision!

Editor's Notes

  • #10, #12: Time to Value: setup time and load time for data collection (td-agent) – 1 week; analysis capabilities out of the box; simple integration with existing ecosystem (DI & BI). Cloud flexibility and economies: scalable (cloud), extensible (elastic), flexible (schemaless); lower TCO compared to on-premise, hosted, or homegrown; on-demand ability to scale, adjust, meet future business requirements. Simple and supported: “full” solutions from collection to visualization; great customer service, support, setup, and SLAs; easy to extend on your own / self-service – DIY big data.