1©MapR Technologies
Expect More from Hadoop
Jack Norris, MapR Technologies
Hadoop Growth
Important Drivers for Hadoop
• Data on compute
• You don’t need to know what questions to ask beforehand
• Simple algorithms on Big Data
• Analysis of unstructured data
The Cost of Enterprise Storage
                     SAN Storage      NAS Filers       Local Storage
Cost per gigabyte    $2 - $10         $1 - $5          $0.02
$1M gets:
  Capacity           0.5 petabytes    1 petabyte       50 petabytes
  IOPS               200,000          400,000          10,000,000
  Throughput         1 GB/sec         2 GB/sec         800 GB/sec
1/100 to 1/20 the cost
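The capacity figures on this slide follow from dividing the budget by the price per gigabyte. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and using the low end of the SAN and NAS price ranges; the function name is illustrative, not from the deck:

```python
# Capacity that a fixed budget buys at a given price per gigabyte.
# Prices are the slide's figures; decimal units are an assumption.
GB_PER_PB = 1_000_000


def petabytes_for_budget(budget_usd: float, usd_per_gb: float) -> float:
    """Return how many petabytes `budget_usd` buys at `usd_per_gb`."""
    return budget_usd / usd_per_gb / GB_PER_PB


BUDGET = 1_000_000  # $1M, as on the slide

san = petabytes_for_budget(BUDGET, 2.00)    # low end of the SAN range
nas = petabytes_for_budget(BUDGET, 1.00)    # low end of the NAS range
local = petabytes_for_budget(BUDGET, 0.02)  # local storage

print(f"SAN: {san:.1f} PB, NAS: {nas:.1f} PB, local: {local:.1f} PB")
```

At the high end of the SAN and NAS ranges the same budget buys proportionally less, so the slide's capacity numbers are best read as the most favorable case for enterprise storage.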
MapReduce: A Paradigm Shift
• Distributed, scalable computing platform
  – Data/Compute framework
  – Commodity hardware
• Pioneered at Google
• Commercially available as Hadoop
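To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word counting. It is plain Python, not Hadoop's API, and the function names are illustrative; in a real cluster each phase would run in parallel across many nodes, next to the data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on big clusters", "data on compute"])
print(counts)
```

The map and reduce functions never share state, which is what lets a framework spread them across commodity machines.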
MapR Distribution for Apache Hadoop
• Complete Hadoop distribution
• Comprehensive management suite
• Industry-standard interfaces
• Enterprise-grade dependability
• Higher performance
How do you Benefit?
Expanding data for existing applications
Use Case #1
• Major telecom vendor
• Key step in billing pipeline handled by enterprise data warehouse (EDW)
• EDW at maximum capacity
• Multiple rounds of software optimization already done
• Revenue limiting (= career limiting) bottleneck
Original Flow
[Diagram: CDR billing records -> extract and load -> data warehouse (transformation) -> billing reports and customer bills]
Problem Analysis
• 70% of EDW load is related to call detail record (CDR) normalization
  – Less than 10% of total lines of code
  – CDR normalization difficult within the EDW
  – Binary extraction and conversion
• Data rates are too high for upstream transform
  – Requires high volume joins
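The two hard parts named above, binary extraction and high-volume joins, map naturally onto a map step and a reduce step. The sketch below is only an illustration: the record layout, field names, and rate-plan table are hypothetical, since the deck does not describe the real CDR format, and the code runs in a single process rather than on a cluster.

```python
import struct
from collections import defaultdict

# HYPOTHETICAL record layout (not from the slides): big-endian
# caller (uint64), callee (uint64), start time (uint32), duration in seconds (uint16).
CDR_FORMAT = ">QQIH"
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 22 bytes per record


def extract(blob: bytes):
    """Map step: decode fixed-width binary records into normalized dicts."""
    for caller, callee, start, duration in struct.iter_unpack(CDR_FORMAT, blob):
        yield {"caller": caller, "callee": callee,
               "start": start, "duration_s": duration}


def join_and_total(records, plans):
    """Reduce step: join each record to the caller's plan, total billable seconds."""
    totals = defaultdict(int)
    for rec in records:
        plan = plans.get(rec["caller"], "unknown")
        totals[(rec["caller"], plan)] += rec["duration_s"]
    return dict(totals)


# Three made-up records and a made-up plan table, for illustration only.
blob = b"".join([
    struct.pack(CDR_FORMAT, 15550001, 15550002, 1_700_000_000, 120),
    struct.pack(CDR_FORMAT, 15550001, 15550003, 1_700_000_300, 45),
    struct.pack(CDR_FORMAT, 15550009, 15550001, 1_700_000_600, 30),
])
plans = {15550001: "prepaid"}

totals = join_and_total(extract(blob), plans)
print(totals)
```

Decoding each record independently is what makes the extraction easy to parallelize; the join is the part that needs records grouped by key, which is the work a shuffle phase does.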
With ETL Offload
[Diagram: CDR billing records -> Hadoop cluster (ETL) -> data warehouse -> billing reports and customer billing]
Simplified Analysis
• 70% of EDW consumed by ETL processing; offload frees capacity
• EDW direct hardware cost is approximately $30 million vs. Hadoop cluster at 1/50 the cost
• Additional EDW only increases capacity by 50% due to poor division of labor
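One way to read these numbers, offered as an assumption rather than the presenter's stated derivation: if ETL consumes 70% of the warehouse, removing it leaves only 30% in use, so the same hardware has room for about 1 / (1 - 0.7) ≈ 3.3 times the remaining workload. That is in line with the roughly 3x figure on the results slide, while a second EDW yields only 1.5x per the bullet above.

```python
def headroom_after_offload(etl_fraction: float) -> float:
    """Capacity multiple available to non-ETL work once ETL is offloaded."""
    if not 0.0 <= etl_fraction < 1.0:
        raise ValueError("etl_fraction must be in [0, 1)")
    return 1.0 / (1.0 - etl_fraction)


offload = headroom_after_offload(0.70)  # 70% ETL share, from the slide
second_edw = 1.0 + 0.50                 # +50% capacity, from the slide

print(f"offload: {offload:.2f}x, second EDW: {second_edw:.2f}x")
```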
The Results
• EDW strategy
  – 1.5x performance
  – $30 million
• MapR strategy
  – 3x faster
  – 20x cost/performance advantage for MapR strategy
  – With high availability and data protection
Use Case #2
Combine Many Different Data Sources
Use Case #2 – Customer Example
• Global credit card issuer
• Launching a new location-based service
• Benefits both merchants and consumers
Combining different feeds on one platform
[Diagram: real-time data feed from social network, historical purchase information, billing data, and other sources stored in Hadoop; Hadoop and HBase provide storage and processing]
Predictive analytics from historical data combined with NoSQL querying on real-time social networking data
Results
• New service rolled out in 1 quarter
• Processing time cut from 20 hours per day to 3
• Recommendation engine load time decreased from 8 hours to 3 minutes
• Includes data versioning support for easier development and updating of models
Use Case #3
New Application from New Data Source
Ancestry.com – Family Tree
Overview and Requirements
• Collect and collate information from disparate sources (text files, images, etc.)
• Leverage new data source: Spit
• Machine learning techniques and DNA matching algorithms
The Results
• Storage infrastructure for billions of small and large files
• Blob store for large images through NoSQL solutions
• Multi-tenant capability for data-mining and machine-learning algorithm development
• One highly available, efficient platform
MapR M7: Making HBase Enterprise Grade
[Diagram: other distributions stack HBase on a JVM, on DFS, on a JVM, on ext3, on disks; the second stack is a single unified layer on disks]
Easy                         Dependable                          Fast
No RegionServers             No compactions                      Consistent low latency
Seamless splits              Instant recovery from node failure  Real-time in-memory configuration
Automatic merges             Snapshots                           Disk and network compression
In-memory column families    Mirroring                           Reduced I/O to disk
Use Case
New Analytics on Existing Data
Analytic Flexibility
• MapReduce-enabled machine learning algorithms
• Enhanced search
• Real-time event processing
• No need to sample the data
Applications: Fraud Detection, Target Marketing, Consumer Behavior Analysis, …
Hadoop Expands Analytics
“Simple algorithms and lots of data trump complex models”
Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems
Use Case #4
Combine All Three
Where do you Start?
One Platform for Big Data
[Diagram: workloads (batch processing, file-based applications, SQL, database, search, stream processing, …) spanning batch, interactive, and real-time use; platform capabilities: 99.999% HA, data protection, disaster recovery, scalability & performance, enterprise integration, multi-tenancy]
World Record Performance
Why is MapR faster and more efficient?
– C/C++ vs. Java
– Distributed metadata
– Optimized shuffle
New Minute Sort World Record: 1.5 TB in 1 minute on 2103 nodes
Thank You

Editor's Notes

  • #4: Let’s start with this chart, to reinforce that you’re in the right room and picked the right session: Hadoop. Not only is it the fastest-growing Big Data technology, it is one of the fastest-growing technologies, period. Hadoop adoption is happening across industries and across a wide range of application areas. What’s driving this adoption?
  • #5: There are many drivers for Hadoop adoption…
  • #6: One of the drivers for Hadoop adoption is storage cost: local storage is dramatically cheaper. You might say, “I can’t use raw disks because I need high-end availability, data protection, and speed.” We agree with you, and that’s where MapR focused: bringing the performance and features of high-end storage to direct-attached storage. This is a paradigm shift.
  • #7: MapReduce is a paradigm shift. Google is the poster child. What exactly does Hadoop look like?
  • #8: This is a Hadoop distribution. It includes a series of open source packages that are tested, hardened, and combined into a complete suite. With MapR, we’ve combined this with our own innovations at the data platform level to make it highly available, dependable, and easier to access and integrate through industry standards like NFS, ODBC, etc.
  • #9: How do you benefit? I mentioned that Hadoop is used in a wide variety of use cases. I’ve generalized these into 4 groups. The first…
  • #10: …is expanding data: going from sampled data to all of the transactions. Netflix recommends 5 movies to you because they look at everybody’s movie watching and ratings and identify clusters of individuals like you. Risk triangles for insurance companies go from the zip code level down to the neighborhood street. Trading information goes from the last 3 months to 7 years.
  • #11: Let’s look at a specific example…
  • #12: Load CDRs (call detail records) into the data warehouse and transform the data into the proper format for processing and analysis.
  • #13: The problem with this process is that 70% of the EDW load is related to the CDR normalization process. AI: Why is this the case? CDR normalization is difficult within the EDW, and binary extraction and conversion to SQL is difficult.
  • #33: The first is “simple algorithms and lots of data trump complex models.” This comes from an IEEE article written by three research directors at Google, titled “The Unreasonable Effectiveness of Data.” It was a reaction to an article called “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” which made the point that simple formulas can explain the complex natural world, the most famous example being E = mc² in physics. Their paper talked about how economists were jealous, since they lacked similar models to neatly explain human behavior. But they found that in natural language processing, a notoriously complex area that has been studied for years with many AI attempts at addressing it, relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm is used to eliminate something in a picture, a car for instance, and fill in the missing background based on a corpus of thousands of pictures. This algorithm did rather poorly until they increased the corpus to millions of photos; with that amount of data, the same algorithm performed extremely well. While not a direct example from financial services, I think it’s a great analogy. After all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?
  • #34: Okay, interesting graphs. How does this translate to the real world? Here are some broad examples.
  • #37: Start with the right platform: one with the power to address your needs and the flexibility to grow with your expansion. If you haven’t started with this platform, it is easy to switch.
  • #41: Take all of Twitter: 400 x 10^6 tweets per day < 400 GB per day < 40 MB/s
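The bound in the last note can be checked with a few lines of arithmetic. This is a sketch under one assumption that the note only implies: each tweet, with metadata, occupies at most about 1 KB, which is what makes 400 x 10^6 tweets fit under 400 GB.

```python
TWEETS_PER_DAY = 400 * 10**6   # from the note
BYTES_PER_TWEET = 1_000        # ASSUMPTION: upper bound of about 1 KB per tweet
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
gb_per_day = bytes_per_day / 10**9
mb_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"{gb_per_day:.0f} GB/day, {mb_per_second:.1f} MB/s on average")
```

Under that assumption the average rate is well below the 40 MB/s ceiling quoted in the note, which leaves room for bursts above the daily average.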