Augmenting MongoDB with Treasure Data

Augmenting
MongoDB with Treasure Data

About Me
• A recovering software engineer turned digital artist,
once interested in fractals
• Now into data visualization based on large datasets
rendered directly to the GPU (RGL, various Python GL
libraries, etc.)
• It’s easier these days to manipulate large datasets
with limited effort

Images courtesy of Edureka, 10gen, MongoDB,
clipart panda and aperfectworld.org

relatively speaking
Images courtesy of Edureka and MongoDB

Strengths: Format,
Convenience
Schema courtesy of Emily Stolfo, MongoDB

“not so strengths” of
MongoDB
• Horowitz was also very honest about where and how
MongoDB is lacking in its current offering – most notably in
terms of integration capabilities and some areas of high
performance.
• “In the relational world you’ve got a few big boxes, in the
MongoDB world you could have 2,000 commodity servers,
so you need really great management tools for that.
That’s a huge problem for us.”
• “The other big thing is automation, where you can have
automation tools that let you manage very large clusters all
from a very simple pane of glass.”
http://diginomica.com/2014/11/10/mongodb-cto-mongo-works-doesnt/
From an interview with MongoDB CTO Eliot Horowitz

testimony of John R Jensen, Cengage

hmmm…moar “not so strengths” ;)
of MongoDB
• The dreaded “Write Lock”
• https://news.ycombinator.com/item?id=1691748
• http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
- is the data actually relational or not?
• Slideshare “where not to use MongoDB”
• Slideshare “Hive vs. Cassandra vs. MongoDB”
• http://www.slideshare.net/johnrjenson/mongodb-pros-and-cons
• “choosing the right NoSQL database” video:
https://www.youtube.com/watch?t=34&v=gJFG04Sy6NY
• limits of MongoDB: http://docs.mongodb.org/manual/reference/limits/

Share some use-cases
with us, won’t you?

Complementing MongoDB
• Featurewise?
• Elasticsearch
• Use-Case: Stripe
• Oplog feature as a queryable log of diffs
• Hadoop (good for large-scale processing), Hive
(Treasure Data)
• Use-Case: Wish

What’s wish.com?
• Mobile eCommerce -
world’s largest mobile
shopping mall
• Top 10 app on iOS &
Android
• Product
discovery/personalization

Your Fun, Personalized Shopping Mall

Wish & MongoDB
• Primary database
• 64x mongod
• AWS -> bare metal (SSDs ftw!)

Architecture
Diagram: app servers (Wish App + fluentd) → fluentd aggregation
servers → Hadoop/Hive
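
The first hop in this architecture is the app handing a structured event to a local fluentd. A minimal sketch of that hop follows, assuming a fluentd HTTP input is listening on localhost:9880 and that the URL path serves as the event tag. The tag `wish.purchase` and the record fields are hypothetical; the deck does not show Wish's actual event schema.

```python
import json
import time
import urllib.request


def build_event_request(tag, record, host="localhost", port=9880):
    """Build (but do not send) an HTTP POST carrying one JSON event.

    Assumes a fluentd HTTP input on host:port, with the URL path used as
    the event tag. Host, port, and tag are placeholders.
    """
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/{tag}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical event; the field names are illustrative only.
event = {"user_id": 42, "action": "purchase", "ts": int(time.time())}
request = build_event_request("wish.purchase", event)

# To actually send it (requires a running fluentd with the HTTP input):
# urllib.request.urlopen(request, timeout=5)
```

In practice an app would more likely use a fluentd logger library than raw HTTP; the point here is only that each event is a tag plus a JSON record.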

Complementing MongoDB
• Operationally?
• Managing MongoDB is hard (spin up instances
with MongoLab)
• MongoDB (monitoring products: Ops Manager
and MMS)
• What’s your pain? What are you missing?

You know what?
Managing ANY data is
hard.

Treasure Data as a Cloud
Way of Complementing MongoDB

Diagram: Ingest → Analyze → Distribute
• Ingestion: batch imports, streaming imports
• Storage (schema-on-read); batch processing (Hive, Presto)
• Access and output labels shown: Console, ODBC, JDBC, CLI,
Prestogres, Luigi-TD, AWS S3, FTP, GDocs, HTTP Put, Python,
PostgreSQL, Redshift, SFDC, Yahoo BDI, Tableau Server, MySQL, …

DATA ACCESS > ADVANCED ANALYTICS
• Product analytics: data
access is a major issue.
• “Machine learning” is still
simple and “small” in scale
(can be done inside Python)
• Future work:
productized/operationalized
machine learning
bluetooth
iOS/Android SDK
Fluentd
Python/Pandas SDK
Data Science Team (5 people)

SCHEMA(LESS) COUNTS
• Redshift: lots of co-use
cases
• Event data is semi-
structured → Can be
modeled as JSON but
schemas change
• Treasure Data provides a
SQL-accessible, semi-
structured data lake.
email
Source of truth/JSON
More intensive data processing
Hourly/Daily load
Big data mart
More interactive data processing
Ad hoc queries
for new data
Ad hoc queries
for cached data
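
Why schema-on-read matters for event data can be shown with a toy example. The sketch below is not Treasure Data's implementation; it only illustrates the idea, using hypothetical events whose fields change over time and a hypothetical helper `read_with_schema` that applies a column list at query time.

```python
import json

# Hypothetical event log: one JSON object per line, with fields that
# appear as the producing app evolves.
RAW_EVENTS = """\
{"time": 1420070400, "user": "a", "action": "view"}
{"time": 1420070460, "user": "b", "action": "buy", "price": 9.5}
{"time": 1420070520, "user": "a", "action": "buy", "price": 3.0, "coupon": "NY15"}
"""


def read_with_schema(lines, columns):
    """Apply a column list at read time; missing fields become None."""
    for line in lines.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)


# Roughly: SELECT user, price, coupon FROM events WHERE action = 'buy'
rows = [
    row[:3]
    for row in read_with_schema(RAW_EVENTS, ["user", "price", "coupon", "action"])
    if row[3] == "buy"
]
```

No migration was needed when `price` and `coupon` appeared: older records simply yield `None` for columns they never had.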

DATA COLLECTION IS HARD
• Want to assume all data is
on S3 or HDFS, but reality is
murkier.
• Sensor readings available as
email attachments
• Provide data collection tools
for 90% of the use cases.
Have APIs ready for 10%.
Diagram: GH SCADA → email → parse & transform → import via REST,
import data as JSON → analyze via SQL → query results →
data-informed maintenance
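
The "parse & transform" step for sensor readings that arrive as email attachments might look like the sketch below. It uses only the Python standard library and stops at producing newline-delimited JSON; the REST import call itself is left out because the deck does not specify it. The CSV columns (`time`, `sensor`, `value`) and the sample message are assumptions made for illustration.

```python
import csv
import io
import json
from email import message_from_bytes
from email.message import EmailMessage


def attachment_rows(raw_email):
    """Yield one dict per CSV row found in the CSV attachments of an email."""
    message = message_from_bytes(raw_email)
    for part in message.walk():
        filename = part.get_filename()
        if not filename or not filename.lower().endswith(".csv"):
            continue  # skip the body text and non-CSV parts
        text = part.get_payload(decode=True).decode("utf-8")
        for row in csv.DictReader(io.StringIO(text)):
            yield row


def to_json_lines(rows):
    """Transform parsed rows into newline-delimited JSON, casting readings."""
    out = []
    for row in rows:
        out.append(json.dumps({
            "time": int(row["time"]),
            "sensor": row["sensor"],
            "value": float(row["value"]),
        }))
    return "\n".join(out)


# Build a hypothetical email in memory so the sketch runs without a mailbox.
sample = EmailMessage()
sample["Subject"] = "Daily readings"
sample.set_content("Readings attached.")
sample.add_attachment(
    b"time,sensor,value\n1420070400,pump-1,3.5\n1420070460,pump-1,3.7\n",
    maintype="application",
    subtype="octet-stream",
    filename="readings.csv",
)

json_lines = to_json_lines(attachment_rows(sample.as_bytes()))
```

The resulting `json_lines` string is what an import step would send onward, one JSON record per reading.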

Some revised scenarios
• Revised scenario 1: Using Treasure Data for
ingestion and analytics; exporting results to
MongoDB for reporting
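
A rough sketch of the export half of scenario 1: turning tabular query results into MongoDB documents that can be upserted repeatedly without creating duplicates. The column names, the sample rows, and the database and collection names in the closing comment are all hypothetical. `upsert_all` expects a pymongo-style collection object, so the sketch itself runs without a server.

```python
import hashlib
import json


def rows_to_documents(columns, rows, key_columns):
    """Turn tabular query results into documents with a deterministic _id.

    The _id is a hash of the key columns, so re-running the same export
    overwrites earlier results instead of duplicating them.
    """
    docs = []
    for row in rows:
        doc = dict(zip(columns, row))
        key = json.dumps([doc[c] for c in key_columns])
        doc["_id"] = hashlib.sha1(key.encode("utf-8")).hexdigest()
        docs.append(doc)
    return docs


def upsert_all(collection, docs):
    """Write documents through a pymongo-style collection object."""
    for doc in docs:
        collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)


# Hypothetical daily aggregate, as a SQL engine might return it.
columns = ["day", "country", "orders"]
rows = [("2015-01-01", "US", 120), ("2015-01-01", "DE", 45)]
docs = rows_to_documents(columns, rows, key_columns=["day", "country"])

# With a real deployment (assumed names; requires pymongo and a server):
# from pymongo import MongoClient
# upsert_all(MongoClient()["reports"]["daily_orders"], docs)
```

Keying the `_id` on the aggregate's natural key is a design choice, not a requirement: it makes scheduled exports idempotent, which suits a reporting collection that is rebuilt hourly or daily.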

Some revised scenarios
• Revised scenario 2: Ingesting data into MongoDB
and exporting to Treasure Data
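
For scenario 2, documents leaving MongoDB usually need light reshaping before they fit a SQL-oriented store. The sketch below assumes the destination wants flat columns and a Unix-epoch `time` column. The helper `to_record`, the field names, and the sample document are hypothetical, and the `_id` is shown as a plain hex string to avoid a bson dependency.

```python
import json
from datetime import datetime, timezone


def to_record(doc, time_field="created_at"):
    """Convert a MongoDB-style document into a flat, JSON-safe record.

    Assumption: the destination wants a Unix-epoch `time` column and flat
    columns; nested fields are joined with underscores.
    """
    record = {}

    def walk(value, prefix):
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(inner, f"{prefix}_{key}" if prefix else key)
        elif isinstance(value, datetime):
            record[prefix] = int(value.timestamp())
        elif isinstance(value, (str, int, float, bool, list)) or value is None:
            record[prefix] = value  # lists are passed through unchanged
        else:
            record[prefix] = str(value)  # e.g. an ObjectId becomes a string

    walk(doc, "")
    record["time"] = record.pop(time_field)
    return record


# Hypothetical document as it might come back from a find() call.
sample_doc = {
    "_id": "54a4a1c0e4b0c1a2b3c4d5e6",
    "created_at": datetime(2015, 1, 1, tzinfo=timezone.utc),
    "user": {"id": 7, "geo": {"country": "US"}},
    "items": ["sku-1", "sku-2"],
}

record = to_record(sample_doc)
line = json.dumps(record)  # one line of newline-delimited JSON for import
```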

Treasure Data is good for
some of the same things…
• less overhead in setup
• less - make that practically no - effort to scale
• less overhead/effort to use
• but -> less fine-tuned control over outcome

The Tradeoff
Control (sharding and replication) vs. Ease
(scaling automatically in the cloud)
?

MongoDB is already excellent for a lot of things
+

jhammink@treasure-data.com
@Rijksband
www.treasure-data.com
Questions?
