1
Distributed, fault-tolerant, transactional
Real-Time Integration: MongoDB and SQL Databases
Eugene Dvorkin
Architect, WebMD
2
WebMD: A lot of data; a lot of traffic
~900 million page views a month
~100 million unique visitors a month
3
How We Use MongoDB
User Activity
4
Why Move Data to RDBMS?
Preserve existing investment in BI
and data warehouse
To use an analytical database such as Vertica
To use SQL
5
Why Move Data In Real-time?
Batch process is slow
No ad-hoc queries
No real-time reports
6
Challenge in moving data
Transform Document to Relational Structure
Insert into RDBMS at high rate
7
Challenge in moving data
Scale easily as data volume and velocity
increase
8
Our Solution to move data in Real-time: Storm
Storm – an open-source, distributed, real-time computation system.
Developed by Nathan Marz; acquired by Twitter
9
What Hadoop is to batch processing, Storm is to real-time processing
Our Solution to move data in Real-time: Storm
10
Why STORM?
JVM-based framework
Guaranteed data processing
Supports development in multiple languages
Scalable and transactional
Easy to learn and use
11
Overview of Storm cluster
Master node (Nimbus) assigns work to the cluster
ZooKeeper handles cluster coordination
Supervisor nodes run worker processes
12
Storm Abstractions
Tuples, Streams, Spouts, Bolts and Topologies
13
Tuples
("ns:events", "email:edvorkin@gmail.com")
Ordered list of elements
14
Stream
Unbounded sequence of tuples
Example: Stream of messages from
message queue
15
Spout
Read from a stream of data – queues, web logs, API calls, MongoDB oplog
Emit documents as tuples
Source of Streams
16
Bolts
Process tuples and create new streams
17
Bolts
Apply functions /transforms
Calculate and aggregate
data (word count!)
Access DB, API , etc.
Filter data
Map/Reduce
Process tuples and create new streams
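As a concrete illustration (a generic example, not code from this talk), a minimal bolt in the 0.8-era Storm Java API consumes each tuple, transforms one field, and emits the result on a new stream:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UppercaseBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Transform the first field of the incoming tuple and re-emit it
        collector.emit(new Values(tuple.getString(0).toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}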
18
Topology
19
Topology
Storm is transforming and moving data
20
MongoDB
How To Read All Incoming Data
from MongoDB?
21
MongoDB
How To Read All Incoming Data
from MongoDB?
Use MongoDB OpLog
22
What is OpLog?
Replication
mechanism in
MongoDB
It is a Capped
Collection
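For reference, an oplog entry looks roughly like this (abridged; the values are illustrative):

{ "ts" : Timestamp(1366656600, 1),             // when the operation was applied
  "op" : "i",                                  // i = insert, u = update, d = delete
  "ns" : "test.people",                        // namespace: database.collection
  "o"  : { "_id" : 1, "name" : "John Backus" } // the document or update payload
}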
23
Spout: reading from OpLog
Located in the local database, in the oplog.rs collection
24
Spout: reading from OpLog
Operations: Insert, Update, Delete
25
Spout: reading from OpLog
Namespace (ns): database.collection – identifies the collection, which maps to a table
26
Spout: reading from OpLog
Data object (o): the inserted document or the applied update
27
Sharded cluster
28
Automatic discovery of sharded cluster
29
Example: Shard vs Replica set discovery
30
Example: Shard discovery
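The discovery code is shown as an image in the deck; as a sketch of the idea (class and method names are ours), a client connected to mongos can read the config.shards collection and collect each shard's replica-set hosts, while a plain replica set has no such entries:

import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import java.util.ArrayList;
import java.util.List;

public class ShardDiscovery {
    // Returns host strings such as "rs0/host1:27017,host2:27017" (2.x Java driver)
    public static List<String> shardHosts(MongoClient mongos) {
        List<String> hosts = new ArrayList<String>();
        DBCollection shards = mongos.getDB("config").getCollection("shards");
        for (DBObject shard : shards.find()) {
            hosts.add((String) shard.get("host"));
        }
        return hosts;
    }
}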
31
Spout: Reading data from OpLog
How to Read data continuously
from OpLog?
32
Spout: Reading data from OpLog
How to Read data continuously
from OpLog?
Use Tailable Cursor
33
Example: Tailable cursor – like tail -f
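The slide's code is shown as an image; a minimal sketch with the 2.x Java driver (host and the println handler are ours) behaves like tail -f on the oplog:

import com.mongodb.BasicDBObject;
import com.mongodb.Bytes;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class OplogTail {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection oplog = client.getDB("local").getCollection("oplog.rs");
        DBCursor cursor = oplog.find()
                .sort(new BasicDBObject("$natural", 1))   // natural (insertion) order
                .addOption(Bytes.QUERYOPTION_TAILABLE)    // keep the cursor open
                .addOption(Bytes.QUERYOPTION_AWAITDATA);  // block briefly for new data
        while (cursor.hasNext()) {
            DBObject entry = cursor.next();   // returns when a new entry arrives
            System.out.println(entry);        // stand-in for real processing
        }
    }
}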
34
Manage timestamps
Use the ts field (the timestamp in each oplog entry) to track processed records
If the system restarts, resume from the recorded ts
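A sketch of resuming from a saved position, continuing the tailable-cursor example above (loadLastTs, saveTs, and process are hypothetical helpers backed by any durable store):

static void tailFrom(DBCollection oplog) {
    BSONTimestamp lastTs = loadLastTs();  // hypothetical: read the last saved ts
    DBCursor cursor = oplog.find(new BasicDBObject("ts", new BasicDBObject("$gt", lastTs)))
            .addOption(Bytes.QUERYOPTION_TAILABLE)
            .addOption(Bytes.QUERYOPTION_AWAITDATA)
            .addOption(Bytes.QUERYOPTION_OPLOGREPLAY);  // fast-forwards the ts scan
    while (cursor.hasNext()) {
        DBObject entry = cursor.next();
        process(entry);                           // hypothetical handler
        saveTs((BSONTimestamp) entry.get("ts"));  // hypothetical: persist progress
    }
}

(Uses org.bson.types.BSONTimestamp and the same com.mongodb imports as above.)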
35
Spout: reading from OpLog
36
SPOUT – Code Example
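The original slide is a screenshot; a stripped-down sketch of such a spout (class name and wiring are ours, not the project's actual code), using the 0.8/0.9-era Storm API and the 2.x Java driver:

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import com.mongodb.Bytes;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;
import java.util.Map;

public class OplogSpout extends BaseRichSpout {
    private transient SpoutOutputCollector collector;
    private transient DBCursor cursor;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            DBCollection oplog = new MongoClient("localhost", 27017)
                    .getDB("local").getCollection("oplog.rs");
            cursor = oplog.find()
                    .addOption(Bytes.QUERYOPTION_TAILABLE)
                    .addOption(Bytes.QUERYOPTION_AWAITDATA);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        // Simplified: a real spout would avoid blocking here and track ts for replay
        if (cursor.hasNext()) {
            collector.emit(new Values(cursor.next()));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("document"));  // read later via getValueByField("document")
    }
}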
37
TOPOLOGY
38
Working With Embedded Arrays
An array represents a one-to-many relationship in an RDBMS
39
Example: Working with embedded arrays
40
Example: Working with embedded arrays
{ "_id": 1,
  "ns": "person_awards",
  "o": { "award": "National Medal of Science",
         "year": 1975,
         "by": "National Science Foundation" }
}
{ "_id": 1,
  "ns": "person_awards",
  "o": { "award": "Turing Award",
         "year": 1977,
         "by": "ACM" }
}
41
Example: Working with embedded arrays
public void execute(Tuple tuple) {
    // ...
    if (field instanceof BasicDBList) {
        // Each array element becomes its own tuple on the "documents" stream,
        // anchored to the input tuple for reliable processing
        BasicDBObject arrayElement = processArray(field);
        // ...
        outputCollector.emit("documents", tuple, new Values(arrayElement));
    }
}
42
Parse documents with Bolt
43
{"ns": "people", "op":"i",
o : {
_id: 1,
name: { first: 'John', last:
'Backus' },
birth: 'Dec 03, 1924’
}
["ns": "people", "op":"i",
“_id”:1,
"name_first": "John",
"name_last":"Backus",
"birth": "DEc 03, 1924"
]
Parse documents with Bolt
44
@Override
public void execute(Tuple tuple) {
    // ...
    // Pull the raw oplog entry emitted by the spout and extract its "o" payload
    final BasicDBObject oplogObject =
        (BasicDBObject) tuple.getValueByField("document");
    final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
    // ...
    outputValues.add(flattenDocument(document));
    outputCollector.emit(tuple, outputValues);
}
Parse documents with Bolt
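flattenDocument itself is not shown in the deck; per the speaker notes it flattens nested documents with a loop or recursion. A sketch (helper names are ours):

import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// {name: {first: "John"}} becomes {name_first: "John"}
static BasicDBObject flattenDocument(DBObject doc) {
    BasicDBObject flat = new BasicDBObject();
    flatten("", doc, flat);
    return flat;
}

static void flatten(String prefix, DBObject doc, BasicDBObject flat) {
    for (String key : doc.keySet()) {
        Object value = doc.get(key);
        String name = prefix.isEmpty() ? key : prefix + "_" + key;
        if (value instanceof DBObject) {
            flatten(name, (DBObject) value, flat);  // recurse into embedded documents
        } else {
            flat.put(name, value);                  // leaf: copy under the flattened key
        }
    }
}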
45
Write to SQL with SQLWriter Bolt
46
Write to SQL with SQLWriter Bolt
["ns": "people", "op":"i",
“_id”:1,
"name_first": "John",
"name_last":"Backus",
"birth": "Dec 03, 1924"
]
insert into people (_id,name_first,name_last,birth) values
(1,'John','Backus','Dec 03,1924') ,
insert into people_awards (_id,awards_award,awards_award,awards_by)
values (1,'Turing Award',1977,'ACM'),
insert into people_awards (_id,awards_award,awards_award,awards_by)
values (1,'National Medal of Science',1975,'National Science Foundation')
47
@Override
public void prepare(.....) {
    // ...
    // Load the Vertica JDBC driver and open one connection per bolt instance
    Class.forName("com.vertica.jdbc.Driver");
    con = DriverManager.getConnection(dBUrl, username, password);
}

@Override
public void execute(Tuple tuple) {
    String insertStatement = createInsertStatement(tuple);
    try {
        Statement stmt = con.createStatement();
        stmt.execute(insertStatement);
        stmt.close();
    } // ... handle SQLException, ack/fail the tuple
Write to SQL with SQLWriter Bolt
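The speaker notes suggest using tick tuples to batch inserts; a sketch of requesting ticks for this bolt (the 5-second frequency is illustrative):

import backtype.storm.Config;
import backtype.storm.Constants;
import java.util.HashMap;
import java.util.Map;

// Ask Storm to deliver a system "tick" tuple to this bolt every 5 seconds.
// In execute(), flush the accumulated batch when
// Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent()).
@Override
public Map<String, Object> getComponentConfiguration() {
    Map<String, Object> conf = new HashMap<String, Object>();
    conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
    return conf;
}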
48
Topology Definition
TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5)
       .shuffleGrouping(spoutId);
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt())
       .shuffleGrouping(arrayExtractorId, documentsStreamId);
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword))
       .shuffleGrouping(mongoDocParserId);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
51
Topology Definition
TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5)
       .shuffleGrouping(spoutId);
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt())
       .shuffleGrouping(arrayExtractorId, documentsStreamId);
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword))
       .shuffleGrouping(mongoDocParserId);
// Production: submit to the cluster instead of running a LocalCluster
StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
52
Lesson learned
By leveraging the MongoDB oplog (or another capped collection), tailable cursors, and the Storm framework, you can build a fast, scalable, real-time data-processing pipeline.
53
Resources
Book: Getting started with Storm
Storm Project wiki
Storm starter project
Storm contributions project
Running a Multi-Node Storm cluster tutorial
Implementing real-time trending topic
A Hadoop Alternative: Building a real-time
data pipeline with Storm
Storm Use cases
54
Resources (cont’d)
Understanding the Parallelism of a Storm
Topology
Trident – high level Storm abstraction
A practical Storm’s Trident API
Storm online forum
Mongo connector from 10gen Labs
MoSQL streaming Translator in Ruby
Project source code
New York City Storm Meetup
55
Questions
Eugene Dvorkin, Architect, WebMD edvorkin@webmd.net
Twitter: @edvorkin LinkedIn: eugenedvorkin


Editor's Notes

  • #3: Leading source of health and medical information.
  • #4: Data is raw. Data is immutable; data is true. Dynamic personalized marketing campaigns.
  • #14: The main data structure in Storm: a named list of values, where each value can be of any type. Tuples know how to serialize primitive data types, strings, and byte arrays; for any other type, register a serializer for that type.
  • #24: The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.
  • #29: How do you know if you are connected to a sharded cluster?
  • #36: Use the MongoDB oplog as a queue.
  • #37: The spout extends Storm's spout interface.
  • #40: The awards array in the Person document is converted into two documents, each carrying the parent document's _id.
  • #41: The awards array is converted into two documents, each with the parent document's _id. The namespace is used later to insert the data into the correct table on the SQL side.
  • #42: An embedded array arrives as an instance of BasicDBList in Java.
  • #44: Flatten out your document structure – use a loop or recursion. Hopefully you don't have deeply nested documents, which would go against MongoDB's schema-design guidelines.
  • #48: Use tick tuples and update in batches.
  • #49: Local mode vs. production mode.
  • #50: Increasing the parallelism of a bolt: say you want 5 bolts to process arrays because that is a more time-consuming operation, or more SQLWriterBolts because inserts take a long time; use the parallelism hint parameter in the bolt definition, and the system will create the corresponding number of workers to process your tuples.
  • #51: Local mode vs. production mode.
  • #52: Local mode vs. production mode.