FROM ORACLE TO CASSANDRA WITH SPARK
@TheAllantGroup | @SVDataScience 2
© 2015. ALL RIGHTS RESERVED.
WHO ARE WE?
Shambho Krishnasamy, Fausto Inestroza
CUSTOMER RECOGNITION
Challenges in the Digital age
–  Scalability
–  Throughput
–  Cost
CUSTOMER RECOGNITION
[Diagram labels: Key Management; Tailoring, Key Assignment, Hygiene; Functional Buckets]
LEGACY APPLICATION
[Diagram labels: Input, Output; Recognition Bus Service; Party, Address, HHLD, Indv, Digital, Keying Lookup, Reference; Address, Household, Individual, DigitalKey, Digi-Asso, Reference]
LEGACY SOLUTION
JMS	
  
NEED FOR CHANGE?
RE-PLATFORM!
RE-ARCHITECT!
LIMITATIONS TO SCALE – MESSAGE PROCESSING ARCHITECTURE
•  Message processing engine
•  Common API to handle real-time and batch
•  Batch is converted into messages
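The three bullets above describe one message-processing path shared by real-time and batch input. A minimal sketch of that idea, assuming a hypothetical `recognize` handler and an in-memory queue standing in for JMS (neither name comes from the deck):

```python
from queue import Queue


def recognize(message: dict) -> dict:
    """Hypothetical per-record handler standing in for the recognition logic."""
    return {"id": message["id"], "status": "processed"}


def handle_realtime(message: dict) -> dict:
    # A real-time request goes through the handler directly, one message at a time.
    return recognize(message)


def handle_batch(records: list) -> list:
    # A batch is first converted into individual messages on a queue
    # (an in-memory stand-in for JMS), then drained through the same handler.
    queue = Queue()
    for record in records:
        queue.put(record)
    results = []
    while not queue.empty():
        results.append(recognize(queue.get()))
    return results
```

In a design like this, every batch record pays the per-message cost, so batch throughput is bounded by the message rate.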
LIMITATIONS TO SCALE – DATA THROUGHPUT
4-8 MM records/hour
Volume, Performance
Scale to meet Allant’s Audience Interconnect® customer recognition needs
LIMITATIONS TO SCALE – SCALING HORIZONTALLY
Locking!	
  
LIMITATIONS TO SCALE – SCALING VERTICALLY
WHAT DO WE WANT?
–  Increase throughput (but don’t compromise on real-time API capability!)
–  Improve scalability (but contain cost!)
–  Elastic infrastructure (well… so we went Cloud)
WHAT TO RE-PLATFORM?
?	
  
JMS	
  
CASSANDRA
SWITCH DATA STORE
Consistent Reads!
Consistent Writes!
JMS
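The slide does not say how consistent reads and writes were configured. One common way to get them in Cassandra is to pick read and write consistency levels whose replica counts overlap, that is R + W > N for replication factor N; QUORUM on both sides satisfies this. A small sketch of that rule, where the function names and the replication factor of 3 are illustrative assumptions rather than details from the deck:

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond at consistency level QUORUM."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    # If the read set and the write set must overlap in at least one replica,
    # a read is guaranteed to touch a replica that holds the latest write.
    return read_replicas + write_replicas > replication_factor


# Assumed example: replication factor 3, QUORUM reads and QUORUM writes.
rf = 3
q = quorum(rf)
overlap = reads_see_latest_write(q, q, rf)
```

With these assumed values, `q` is 2 and `overlap` is true; reading and writing at ONE (1 + 1 = 2, not greater than 3) would not give the same guarantee.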
  
WE’RE DONE!
BUT… APPLICATION LAYER IS STILL A BOTTLENECK
MUST MAINTAIN EXISTING LOGIC!
RECAP…	
  
RECAP – BASELINE
JMS	
  
RECAP - LEGACY APPLICATION
[Diagram labels: Input, Output; Recognition Bus Service; Party, Address, HHLD, Indv, Digital, Keying Lookup, Reference; Address, Household, Individual, DigitalKey, Digi-Asso, Reference]
HOW DID WE DO IT?
INTRODUCED CASSANDRA…
JMS
But Cassandra is very bored…
HOW TO RE-ARCHITECT?
JMS
What about this part?
Cassandra is very bored…
NOW INTRODUCE HADOOP
We employed Distributed Data Management Technology end-to-end…
Cassandra is very happy!
PERFORMANCE BENCHMARK RESULTS - ENVIRONMENT
•  12 Cassandra Nodes
–  4 CPU
–  15GB RAM
–  80GB SSD
•  6 Hadoop Nodes
–  32 CPU
–  60GB RAM
–  640GB SSD
PERFORMANCE BENCHMARK RESULTS - MAPREDUCE
Benchmark 1: Smaller Input (~15 Million Profiles)
Environment              Results
JMS – Oracle             4.5 Million / Hour
MapReduce – Cassandra    44 Million / Hour
Speedup: ~10x

Benchmark 1: Larger Input (~400 Million Profiles)
Environment              Results
JMS – Oracle             2.5 Million / Hour
MapReduce – Cassandra    45 Million / Hour
Speedup: ~20x
From 6-7 days down to ~8 hours!
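The rounded figures on this slide can be checked from the stated rates. A minimal sketch, assuming each rate holds constant for the whole run; the function names are illustrative and not from the deck:

```python
def hours_to_process(profiles_millions: float, rate_millions_per_hour: float) -> float:
    """Run time in hours at a constant processing rate."""
    return profiles_millions / rate_millions_per_hour


def speedup(new_rate: float, old_rate: float) -> float:
    """Ratio of the new throughput to the old throughput."""
    return new_rate / old_rate


# Larger input: ~400 million profiles.
oracle_hours = hours_to_process(400, 2.5)      # 160 hours
oracle_days = oracle_hours / 24                # about 6.7 days
mapreduce_hours = hours_to_process(400, 45)    # about 8.9 hours

small_input_speedup = speedup(44, 4.5)         # about 9.8, shown as ~10x
large_input_speedup = speedup(45, 2.5)         # 18, shown as ~20x
```

The computed values (about 6.7 days, about 8.9 hours, 9.8x and 18x) are consistent with the slide's "6-7 days", "~8 hours", "~10x" and "~20x" once rounding is allowed for.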
COULD WE DO BETTER?
INTRODUCE SPARK
Cassandra is ecstatic!
EMPLOY DATASTAX LIGHTNING FAST CONNECTOR
PERFORMANCE BENCHMARK RESULTS - SPARK
Benchmark 1: Larger Input (~400 Million Profiles)
Environment              Results
JMS – Oracle             2.5 Million / Hour
MapReduce – Cassandra    45 Million / Hour
Spark – Cassandra        125 Million / Hour
                         [185 Million / Hour for “match only”]
Speedup: ~50x
From 6-7 days down to ~3 hours!
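Worked out from the stated rates, and assuming a constant rate over the full ~400 million profiles, the Spark figures come to:

```latex
\[
\frac{125\ \text{M/hour}}{2.5\ \text{M/hour}} = 50,
\qquad
\frac{400\ \text{M}}{125\ \text{M/hour}} = 3.2\ \text{hours}
\]
\[
\text{match only:}\quad
\frac{185\ \text{M/hour}}{2.5\ \text{M/hour}} = 74,
\qquad
\frac{400\ \text{M}}{185\ \text{M/hour}} \approx 2.2\ \text{hours}
\]
```

The first line matches the slide's "~50x" and "~3 hours". The match-only line is derived here from the bracketed rate and is not a figure the slide states.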
TAKEAWAYS
•  We did contain cost! – with better throughput & scalability
•  Putting Cassandra to work by employing MapReduce and Spark
•  Unimpeded throughput regardless of the data-store volume
•  Unique Key Generation under distributed data technology
•  Resolving Latency vs. Throughput - Traditional Conflict
•  In our use-case, the data-store
–  Is encapsulated!
–  Has only controlled access!
–  Does only Reads and Writes!
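The takeaways list unique key generation under distributed data technology without describing the mechanism. One common approach when there is no single database sequence to draw from is to build each key from a worker identifier plus a local counter, so workers never need to coordinate. The sketch below illustrates that general technique only; `KeyGenerator`, `worker_id` and the 10-bit worker field are hypothetical and are not the authors' method:

```python
import itertools


class KeyGenerator:
    """Issues integer keys that are unique across workers without coordination."""

    def __init__(self, worker_id: int, worker_bits: int = 10):
        if not 0 <= worker_id < (1 << worker_bits):
            raise ValueError("worker_id does not fit in worker_bits")
        self._worker_id = worker_id
        self._worker_bits = worker_bits
        self._counter = itertools.count()  # local sequence: 0, 1, 2, ...

    def next_key(self) -> int:
        # High bits carry the local sequence number, low bits the worker id,
        # so two workers with different ids can never produce the same key.
        return (next(self._counter) << self._worker_bits) | self._worker_id


# Usage: two workers generate keys independently.
worker_a = KeyGenerator(worker_id=1)
worker_b = KeyGenerator(worker_id=2)
keys = [worker_a.next_key() for _ in range(3)] + [worker_b.next_key() for _ in range(3)]
```

A design like this trades a globally ordered sequence for the ability to issue keys in parallel; the counter here lives in memory, so a real deployment would also need to persist or re-seed it across restarts.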
  
THANK YOU

Silicon Valley Data Science: From Oracle to Cassandra with Spark
