© comScore, Inc. Proprietary.
Syncsort & MapR @ comScore
Michael Brown, CTO | July 9th, 2014
The comScore Story
Analytics for a Digital World™
The Digital World is Complex
comScore’s Mission
Be the Leader in
Digital Media Analytics.
Measure all forms of
media—content and
advertising—at scale,
across all platforms, in
real-time, globally.
comScore Brings it Together
Tablet, PC/Mac, TV, Smartphone, Gaming
comScore is a leading internet technology company that provides Analytics for a Digital World™
NASDAQ: SCOR
Clients: 2,400+ worldwide
Employees: 1,200+
Headquarters: Reston, Virginia, USA
Global coverage: measurement from 172 countries; 44 markets reported
Local presence: 32 locations in 23 countries
Providing Analytics for More Than 2,400 Clients Globally
Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology
Turning Big Data into Powerful Insight
Collection
• Census: Tags & Data Feeds
• Panels: PC, iOS, Android
• Survey: Non-behavioral elements
Calibration
• Methods: Aggregation, Dictionaries, Taxonomies
• Models: Weighting, Projection, De-Duplication, Attribution
Delivery
• Syndicated Data Platform: Media Metrix, vCE
• Consulting Analysis
• Client Analytics Platform: Digital Analytix
Panel Heat Map
Average Records Captured per Day (2005-2009)
[Chart: records captured per day, y-axis 0 to 1,800,000,000; x-axis monthly from 9/26/2005 to 3/26/2009]
Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration
Unified Digital Measurement (UDM): Patent-Pending Methodology
• Panel: Global PERSON Measurement
• Census: Global DEVICE Measurement
Adopted by 90% of Top 100 U.S. Media Properties
Beacon Heat Map
Monthly Records Collection
[Chart: number of records per month, y-axis 0 to 2,000 billion; series: Beacon Records, Panel Records]
Total records collected in June 2014 = 1,726,563,202,649
Total records collected YTD 2014 = 10,037,131,368,475
DMX @ comScore
DMX use at comScore
Purchased our first 4 licenses in 2000!
We use DMX from Syncsort across hundreds of servers for efficient data
processing and aggregation.
We currently run more than 100 unique jobs every day.
With these jobs we process over 150 billion rows of data through DMX!
Connect, Design, Process, Accelerate
Compression w/Sorting
Compress log files when processing large volumes of log data.
Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance
Example:
• One hour of one source of our data: 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data: 509 GB (22% of original)
• Standard compression of a sorted set: 324 GB (14% of original)
When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
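The effect described on the Compression w/Sorting slide can be reproduced at toy scale. The sketch below is only an illustration under stated assumptions: the `cookie,url` record layout, the value ranges, and the use of `zlib` as the "standard compression" are all hypothetical and do not come from the deck. It also re-derives the two percentages quoted on the slide.

```python
import random
import zlib


def make_log(rows=20000, cookies=500, urls=40, seed=7):
    """Build a synthetic log in arrival order: one 'cookie,url' record per line.

    The record layout and value ranges are assumptions for illustration only.
    """
    rng = random.Random(seed)
    return [
        f"cookie{rng.randrange(cookies):06d},/page/{rng.randrange(urls):03d}"
        for _ in range(rows)
    ]


def compressed_size(lines):
    """Size in bytes of the zlib-compressed, newline-joined records."""
    return len(zlib.compress("\n".join(lines).encode("utf-8"), 6))


def percent_of_original(compressed, raw):
    """Compressed size as a percentage of the raw size."""
    return 100.0 * compressed / raw


time_ordered = make_log()
sorted_rows = sorted(time_ordered)  # same records, sorted before compressing

raw_size = len("\n".join(time_ordered).encode("utf-8"))
size_time_ordered = compressed_size(time_ordered)
size_sorted = compressed_size(sorted_rows)

print("raw bytes:", raw_size)
print("compressed, time ordered:", size_time_ordered)
print("compressed, sorted:", size_sorted)

# Percentages from the slide: 509 GB and 324 GB out of 2,315 GB raw.
print(round(percent_of_original(509, 2315)), round(percent_of_original(324, 2315)))
```

Sorting places identical and near-identical records next to each other, so the compressor finds longer repeated runs. The absolute sizes here say nothing about production data; only the direction of the effect carries over.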
Hadoop @ comScore
Why Hadoop?
• comScore built our own distributed computing stack in 2002.
• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.
• We recognized that switching to Hadoop would let us scale our infrastructure seamlessly to meet the needs of the business.
• Hadoop lets us add compute, storage, and memory linearly and process data at tremendous scale.
• Partnered with Syncsort on their Hadoop efforts starting in October 2010.
• Evaluated the beta of MapR in the fall of 2011.
90 Days of Data
[Chart: 90 days of data by year; y-axis 0 to 6,000 trillion; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; data labels 1,148; 1,919; 3,049; 4,862; 5,084]
High Level Data Flow
[Diagram components: Panel, Census, Custom Code +, ADW, EDW, Delivery]
Our Cluster
Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710, and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM, and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3
Leveraging Partitions from MapR
Validation Funnel & Target Effectiveness
Our growth
As our volume has grown, we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows
Solution to reduce the shuffle
The Problem:
• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
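The idea of merging daily, cookie-sorted partitions can be sketched outside Hadoop. This is a minimal in-memory illustration, not comScore's InputFormat (which would be written against the Hadoop Java API); the function names and the `(cookie, event_count)` record shape are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_daily_partitions(daily_partitions):
    """Lazily merge per-day record lists that are each already sorted by cookie."""
    return heapq.merge(*daily_partitions, key=itemgetter(0))


def aggregate_by_cookie(merged_records):
    """Sum event counts per cookie in a single streaming pass.

    Because the merged stream is ordered by cookie, all records for one
    cookie are adjacent and can be aggregated without a shuffle.
    """
    for cookie, rows in groupby(merged_records, key=itemgetter(0)):
        yield cookie, sum(count for _, count in rows)


# Hypothetical daily partitions: (cookie, event_count), sorted by cookie.
day1 = [("a", 2), ("c", 1)]
day2 = [("a", 1), ("b", 4)]
day3 = [("b", 1), ("c", 3)]

monthly = list(aggregate_by_cookie(merge_daily_partitions([day1, day2, day3])))
print(monthly)
```

The design point is that the sort is paid once per day, and the monthly job only needs a cheap merge of inputs that are already in order.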
Custom Input Format with Map Side Aggregation
[Diagram: three Mapper tasks with mixed key inputs (CB, BA, AC) and three Reduce tasks; three Map tasks with single-key inputs (A, B, C), each with a Combiner]
Risks for Partitioning
Data locality
• A custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that data written to a volume stays on one node
Map failures might result in long run times
• The size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
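The deck does not say how cookies are assigned to the roughly 10K volumes, and it does not show the MapR commands used to create them. As a sketch only, one plausible scheme is a stable hash of the cookie modulo the volume count; the CRC32 hash and the directory layout below are assumptions, not details from the presentation.

```python
import zlib

NUM_VOLUMES = 10_000  # the slide mentions a large number (10K) of volumes


def volume_for(cookie: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie to a volume index with a stable hash.

    CRC32 is an assumption chosen because it gives the same result on
    every run and every machine, unlike Python's built-in hash() for str.
    """
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


def partition_path(day: str, cookie: str) -> str:
    """Hypothetical layout: one directory per volume, one sorted file per day."""
    return f"/agg/vol{volume_for(cookie):05d}/{day}.sorted"


print(partition_path("2014-06-01", "cookie-123"))
print(partition_path("2014-06-02", "cookie-123"))
```

Under this scheme a given cookie lands in the same volume every day, so a mapper assigned to one volume sees every daily record for its cookies, and the number of volumes bounds how much data any single mapper has to process.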
Partitioning Summary
Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal
Results:
• Took a job from 35 hours to 3 hours with no hardware changes
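A small example may help show why partitioning by cookie lets map-side work replace the shuffle for a non-additive aggregation such as counting unique cookies. The data and split boundaries below are made up for illustration.

```python
def distinct_per_split(splits):
    """Count distinct cookies inside each split independently (map-side work)."""
    return [len(set(split)) for split in splits]


events = ["a", "b", "a", "c", "b", "a", "d", "c"]

# Arbitrary block-style splits: the same cookie can appear in several splits.
arbitrary_splits = [events[0:3], events[3:6], events[6:8]]

# Cookie-partitioned splits: every event for a cookie lands in exactly one split.
cookie_splits = [
    [e for e in events if e in ("a", "b")],
    [e for e in events if e in ("c", "d")],
]

true_count = len(set(events))
naive_sum = sum(distinct_per_split(arbitrary_splits))
partitioned_sum = sum(distinct_per_split(cookie_splits))

print(true_count, naive_sum, partitioned_sum)
```

With arbitrary splits the per-split counts overlap and cannot simply be added, so the raw keys have to be shuffled to reducers. Once each cookie lives in exactly one split, the per-split results add up to the correct total.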
DMX-h @ comScore
Reasons for comScore selecting DMX-h
Performance
• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses.
• The increase in throughput also allows us to deliver our data more quickly to our customers, which makes the data more valuable to them.
Speed of Development
• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.
• The ease of development also allows us to democratize access to the Hadoop platform through a point-and-click GUI.
Performance - DMX-h Pluggable Sort Testing Results
First comparison run on our dev cluster
Pig scripts called with the Syncsort plug-in
GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which lead to more data to sort.
• Observed up to a 20% decrease in job runtime
Filter Operations
• Searching for a specific value
• Observed a 5% to 10% decrease in job runtime
• Dependent on type of filter and size of job output
40 GB compressed data: base run is 86 min, test run is 68 min; savings of about 20%
Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
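The savings figure follows directly from the two run times on the slide. The helper below is a hypothetical name introduced only to make the arithmetic explicit.

```python
def runtime_decrease_pct(base_minutes: float, test_minutes: float) -> float:
    """Percentage decrease in runtime relative to the base run."""
    return 100.0 * (base_minutes - test_minutes) / base_minutes


# Values from the slide: base run 86 min, test run 68 min.
savings = runtime_decrease_pct(86, 68)
print(round(savings, 1))
```

The result, about 20.9%, matches the roughly 20% quoted on the slide.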
Speed of Development - POC
We took an existing process that runs in our Hadoop cluster and converted
it to DMX-h to validate the new capabilities.
The existing process:
• Written in 75 lines of Pig with 3 Java UDFs
• Developed in about 25 hours
• Processes 3.5 billion input rows per day
• Takes 35 minutes to run on a daily basis
DMX-h Process
Speed of Development - POC
The new process in DMX-h:
• Developed a new job with 13 tasks
• No Java UDF required
• Runs on the same data and in the same environment.
• Developed in 12 hours.
• Runs in 11 minutes, about 1/3 of the time of the Pig & Java code.
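The POC figures from the two slides can be put side by side. This sketch assumes both versions process the same 3.5 billion input rows per day (the slide says the new job runs on the same data); the `ProcessStats` class is a hypothetical helper for the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessStats:
    name: str
    dev_hours: float         # development effort reported on the slides
    runtime_minutes: float   # daily run time reported on the slides

    def rows_per_second(self, rows_per_day: float) -> float:
        """Average throughput, assuming the whole daily input is processed in one run."""
        return rows_per_day / (self.runtime_minutes * 60)


ROWS_PER_DAY = 3.5e9  # from the slide: 3.5 billion input rows per day

pig = ProcessStats("Pig + 3 Java UDFs", dev_hours=25, runtime_minutes=35)
dmxh = ProcessStats("DMX-h", dev_hours=12, runtime_minutes=11)

runtime_ratio = dmxh.runtime_minutes / pig.runtime_minutes
dev_ratio = dmxh.dev_hours / pig.dev_hours

print(f"runtime ratio: {runtime_ratio:.2f}")
print(f"development time ratio: {dev_ratio:.2f}")
print(f"{pig.name}: {pig.rows_per_second(ROWS_PER_DAY):,.0f} rows/s")
print(f"{dmxh.name}: {dmxh.rows_per_second(ROWS_PER_DAY):,.0f} rows/s")
```

The runtime ratio comes out near 0.31, consistent with the "1/3 of the time" claim, and development time is a little under half. The 25-hour figure is given on the slide as approximate.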
Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com
© 2014 MapR Technologies
Today’s Presenters
Steve Wooledge
VP - Product Marketing
@swooledge
Jorge Lopez
Director - Product Marketing
@zanilli
Mike Brown
CTO
comScore
Leveraging MapR and Syncsort
Big Data is Overwhelming Traditional Systems
Trend 1
Enterprise Data Architecture: enterprise users, operational systems, analytical systems, outside sources
Production requirements for operational systems:
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
Production requirements for analytical systems:
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery
Hadoop: The Disruptive Technology at the Core of Big Data
Trend 2
[Chart: job trends from Indeed.com, Jan '06 to Jan '14]
Hadoop Relieves the Pressure from Enterprise Systems
Reality 1
Operational systems, analytical systems, enterprise users
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
Keys for Production Success
1. Reliability and DR
2. Interoperability
3. High performance
4. Supports operations and analytics
Architecture Matters for Success
Reality 2
Foundation
• Data protection & security
• High performance
• Multi-tenancy
• Operational & analytical workloads
• Open standards for integration
New applications, SLAs, trusted information, lower TCO
The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem on top of the MapR Data Platform, with Management]
Sections: Execution Engines; Data Governance and Operations
Category labels: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Data Integration & Access; Workflow & Data Governance; Provisioning & coordination; Security
Components: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue
* Certification/support planned for 2014
MapR Distribution for Hadoop
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency
MapR: Best Solution for Customer Success
• Top ranked
• Exponential growth
• 500+ customers
• Premier investors
• 3X bookings, Q1 '13 to Q1 '14
• 80% of accounts expand 3X
• 90% software licenses
• <1% lifetime churn
• >$1B in incremental revenue generated by 1 customer
MapR and Syncsort Reference Architecture
Sources
• Relational, SaaS, mainframe
• Documents, emails
• Log files, clickstreams
• Blogs, tweets, link data
MapR Distribution for Hadoop
• Batch (MR, Spark, Hive, Pig, …)
• Interactive (Impala, Drill, …)
• Streaming (Spark Streaming, Storm…)
• MapR Data Platform: MapR-DB, MapR-FS
Data marts, data warehouse
Business Intelligence / Visualization
Do You Know Syncsort?
• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• A history of innovation
• 25+ issued & pending patents
• Large global customer base
• 12,000+ deployments in 80 countries, serving 87 of the Fortune 100
• First-to-market, fully integrated approach to Hadoop ETL
• Among the top 7 contributors to Hadoop, based on number of lines of code changed in 2013
Our customers are achieving the impossible, every day!
Key Partners
The Hadoop Challenge
Most organizations use Hadoop to…
• Extract
• Transform
• Load
Collect, Process (Sort, Join, Aggregate, Copy, Merge), Distribute
Turning Hadoop into a Feature-rich ETL Solution
Collect
• Broad-based connectivity with automated parallelism
• Best-in-class mainframe data access & translation
Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop
Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web-based monitoring console
• Sort-work compression for storage savings
DMX-h ETL: Collect, Process & Distribute, Optimize & Secure
A Roadmap to Hadoop Success
Agile Data Exploration & Visualization
Next-gen Analytics
Cheap Storage
Offload Data Warehouse
Enabling the Data-driven Organization
Solving the Intractable IT Problem
MapR + Syncsort Solutions
• Data Warehouse Optimization: Shift ELT Workloads to Hadoop
• Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
• Click-stream Analysis: Collect, Process & Analyze More Data from Your Website
Q&A
Engage with us!
1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr
3. Learn best practices for Hadoop ETL: www.mapr.com/EDH

How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying Hadoop for Deeper Consumer Insights

  • 1. © comScore, Inc. Proprietary. Syncsort & MapR @ comScore Michael Brown, CTO | July 9th, 2014
  • 2. © comScore, Inc. Proprietary.© comScore, Inc. Proprietary. The comScore Story Analytics for a Digital World™
  • 3. © comScore, Inc. Proprietary. 3 The Digital World is Complex V0113
  • 4. © comScore, Inc. Proprietary. 4 comScore’s Mission Be the Leader in Digital Media Analytics. Measure all forms of media—content and advertising—at scale, across all platforms, in real-time, globally.
  • 5. © comScore, Inc. Proprietary. 5 comScore Brings it Together TabletPC/Mac TV SmartphoneGaming V0113
  • 6. © comScore, Inc. Proprietary. 6 comScore is a leading internet technology company that provides Analytics for a Digital World™ NASDAQ SCOR Clients 2,400+ Worldwide Employees 1,200+ Headquarters Reston, Virginia, USA Global Coverage Measurement from 172 Countries; 44 Markets Reported Local Presence 32 Locations in 23 Countries V0113
  • 7. © comScore, Inc. Proprietary. 7 Providing Analytics For More Than 2,400+ Clients Globally Media Agencies Telecom/Mobile Financial Retail Travel CPG Health Technology V0113
  • 8. © comScore, Inc. Proprietary. 8 Census Tags & Data Feeds Panels PC, iOS, Android Survey Non-behavioral elements Methods Aggregation Dictionaries Taxonomies Syndicated Data Platform Media Metrix vCE Collection Calibration Delivery Consulting Analysis Models Weighting Projection De-Duplication Attribution Turning Big Data into Powerful Insight Client Analytics Platform Digital Analytix
  • 9. © comScore, Inc. Proprietary. 9
  • 10. © comScore, Inc. Proprietary. 10 Panel Heat Map
  • 11. © comScore, Inc. Proprietary. 11 Average Records Captured per Day (2005-2009) - 200,000,000 400,000,000 600,000,000 800,000,000 1,000,000,000 1,200,000,000 1,400,000,000 1,600,000,000 1,800,000,000 9/26/2005 10/26/2005 11/26/2005 12/26/2005 1/26/2006 2/26/2006 3/26/2006 4/26/2006 5/26/2006 6/26/2006 7/26/2006 8/26/2006 9/26/2006 10/26/2006 11/26/2006 12/26/2006 1/26/2007 2/26/2007 3/26/2007 4/26/2007 5/26/2007 6/26/2007 7/26/2007 8/26/2007 9/26/2007 10/26/2007 11/26/2007 12/26/2007 1/26/2008 2/26/2008 3/26/2008 4/26/2008 5/26/2008 6/26/2008 7/26/2008 8/26/2008 9/26/2008 10/26/2008 11/26/2008 12/26/2008 1/26/2009 2/26/2009 3/26/2009
  • 12. © comScore, Inc. Proprietary. 12 CENSUS Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration Adopted by 90% of Top 100 U.S. Media Properties PANEL Unified Digital Measurement (UDM) Patent-Pending Methodology Global PERSON Measurement Global DEVICE Measurement V0411
  • 13. © comScore, Inc. Proprietary. 13 Beacon Heat Map
  • 14. © comScore, Inc. Proprietary. 14 Monthly Records Collection Billion 200 Billion 400 Billion 600 Billion 800 Billion 1,000 Billion 1,200 Billion 1,400 Billion 1,600 Billion 1,800 Billion 2,000 Billion #ofrecords Beacon Records Panel Records Total records collected in June 2014 = 1,726,563,202,649 Total records collected YTD 2014 = 10,037,131,368,475
  • 15. © comScore, Inc. Proprietary. DMX @ comScore
  • 16. © comScore, Inc. Proprietary. 16 DMX use at comScore Purchased our first 4 licenses in 2000! We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation. We currently run over 100+ unique jobs every day. With these jobs we process over 150 billion rows of data through DMX! Connect Design Process Accelerate
  • 17. © comScore, Inc. Proprietary. 17 Compression w/Sorting Compress Log Files when processing large volumes of log data Several advantages to Sorting Data First:  Reduces the size of the data  Improves application performance Examples:  1 Hour of one source of our data 2,315 GB raw (2.9 billion rows)  Standard compression of time ordered data is 509 GB (22% of original)  Standard compression on a sorted set is 324 GB (14% of original) When applied to all our sources we save  5.0 TB per day  155 TB per month  460 TB per quarter
  • 18. © comScore, Inc. Proprietary. Hadoop @ comScore
  • 19. © comScore, Inc. Proprietary. 19 Why Hadoop? • comScore built our own distributed computing stack in 2002. • In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack. • We recognized the benefit of switching to Hadoop which would allow for seamless scaling of our infrastructure to meet the needs of the business. • Hadoop allows us to add compute, storage and memory linearly and allows you to process things at tremendous scale. • Partnered with SyncSort on their Hadoop efforts from Oct 2010 • Evaluated the beta of MapR in the fall of 2011
  • 20. © comScore, Inc. Proprietary. 20 90 Days of Data 1,148 1,919 3,049 4,862 5,084 Trillion 1,000 Trillion 2,000 Trillion 3,000 Trillion 4,000 Trillion 5,000 Trillion 6,000 Trillion 2009 2010 2011 2012 2013 2014 2016
  • 21. © comScore, Inc. Proprietary. 21 High Level Data Flow Panel Census Custom Code + ADW EDW Delivery
  • 22. © comScore, Inc. Proprietary. 22 Our Cluster Production Hadoop Cluster  400+ nodes: Mix of Dell 720xd, R710 and R510 servers  Each R720xd has (24x1.2TB drives; 128GB RAM; 32 cores)  13,800+ total CPUs  31.6 TB total memory  8.2 PB total disk space  Our distro is MapR M5 2.1.3
  • 23. © comScore, Inc. Proprietary. Leveraging Partitions from MapR
  • 24. © comScore, Inc. Proprietary.
  • 25. © comScore, Inc. Proprietary. Validation Funnel & Target Effectiveness
  • 26. © comScore, Inc. Proprietary. 26 Our growth As our volume has grown we have the following stats:  Over 683 billion events per month  Daily Aggregate 1.8 billion  160 billion aggregate records for 92 days  146K Campaigns  Over 50 countries  We see 15 billion distinct cookies in a month  We only need to output 26 million rows
  • 27. © comScore, Inc. Proprietary. 27 Solution to reduce the shuffle The Problem:  Most aggregations within comScore can not take advantage of combiners, leading to large shuffles and job performance issues The Idea:  Partition and sort the data by cookie on a daily basis  Create a custom InputFormat to merge daily partitions for monthly aggregations
  • 28. © comScore, Inc. Proprietary. 28 Custom Input Format with Map Side Aggregation CB Mapper MapperMapperMap Map Map Reduce ReduceReduce BA AC A B C A B C Combiner Combiner Combiner A B C
  • 29. © comScore, Inc. Proprietary. 29 Risks for Partitioning Data locality  Custom InputFormat requires reading blocks of the partitioned data over the network  This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero which guarantees that the data written to a volume will stay on one node Map failures might result in long run times  Size of the map inputs is no longer set by block size  This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
  • 30. Partitioning Summary. Benefits: • A large portion of the aggregation can be completed in the map phase • Applications can now take advantage of combiners • Shuffle sizes are minimal. Results: • Took a job from 35 hours to 3 hours with no hardware changes
  • 31. DMX-h @ comScore
  • 32. Reasons for comScore selecting DMX-h. Performance: • DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses. • The increase in throughput also allows us to deliver our data more quickly to our customers, which makes the data more valuable to them. Speed of Development: • The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business. • The ease of development also allows us to democratize access to the Hadoop platform through a point-and-click GUI.
  • 33. Performance: DMX-h Pluggable Sort Testing Results. First comparison, run on our dev cluster: Pig scripts called with the Syncsort plug-in. GroupBy / Distinct operations: • Counting uniques • These have large shuffle steps, which lead to more data to sort • Observed up to a 20% decrease in job runtime. Filter operations: • Searching for a specific value • Observed a 5% to 10% decrease in job runtime • Dependent on type of filter and size of job output. 40 GB compressed data: base run is 86 min, test run is 68 min; savings of about 20%. Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
  • 34. Speed of Development: POC. We took an existing process that runs in our Hadoop cluster and converted it to DMX-h to validate the new capabilities. The existing process: • Written in 75 lines of Pig with 3 Java UDFs • Developed in about 25 hours • Processes 3.5 billion input rows per day • Takes 35 minutes to run on a daily basis
  • 35. DMX-h Process
  • 36. Speed of Development: POC. The new process in DMX-h: • Developed a new job with 13 tasks • No Java UDF required • Runs on the same data and in the same environment • Developed in 12 hours • Runs in 11 minutes, about 1/3 of the time of the Pig & Java code
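As a quick check on the figures quoted in the performance and POC slides, the sketch below computes the percentage reductions directly from the before/after numbers given in the talk. The helper name `pct_reduction` is ours, not from the presentation.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Pluggable sort test: 86 min base run vs 68 min test run.
sort_test = pct_reduction(86, 68)

# POC runtime: 35 min (Pig + Java UDFs) vs 11 min (DMX-h).
poc_runtime = pct_reduction(35, 11)

# POC development effort: about 25 hours vs 12 hours.
poc_dev_time = pct_reduction(25, 12)

print(f"sort test: {sort_test:.1f}%")
print(f"POC runtime: {poc_runtime:.1f}%")
print(f"POC dev time: {poc_dev_time:.1f}%")
```

The 11-minute run is 11/35, roughly 31% of the original runtime, which is consistent with the slide's "1/3 of the time" claim.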
  • 37. Useful Factoids: Visit www.comscoredatamine.com or follow @datagems for the latest gems. Colorful, bite-sized graphical representations of the best discoveries we unearth.
  • 38. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
  • 39. © 2014 MapR Technologies
  • 40. Today’s Presenters: Steve Wooledge, VP - Product Marketing (@swooledge); Jorge Lopez, Director - Product Marketing (@zanilli); Mike Brown, CTO
  • 41. comScore
  • 42. Syncsort & MapR @ comScore • Michael Brown, CTO | July 9th, 2014
  • 43. Leveraging MapR and Syncsort
  • 44. Trend 1: Big Data is Overwhelming Traditional Systems. Enterprise Data Architecture (diagram labels: Enterprise Users, Outside Sources). Operational Systems production requirements: • Mission-critical reliability • Transaction guarantees • Deep security • Real-time performance • Backup and recovery. Analytical Systems production requirements: • Interactive SQL • Rich analytics • Workload management • Data governance • Backup and recovery
  • 45. Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data (chart: job trends from Indeed.com, Jan ’06 to Jan ’14)
  • 46. Reality 1: Hadoop Relieves the Pressure from Enterprise Systems (diagram labels: Operational Systems, Analytical Systems, Enterprise Users) • Data staging • Archive • Data transformation • Data exploration • Streaming, interactions. Keys for Production Success: 1. Reliability and DR 2. Interoperability 3. High performance 4. Supports operations and analytics
  • 47. Reality 2: Architecture Matters for Success. Foundation: • Data protection & security • High performance • Multi-tenancy • Operational & analytical workloads • Open standards for integration. Diagram labels: New applications; SLAs; Trusted information; Lower TCO
  • 48. The Power of the Open Source Community. Diagram of the Apache Hadoop and OSS ecosystem on the MapR Data Platform (layer labels: Management, Security, YARN, Execution Engines, Data Governance and Operations). Category labels: Batch; Streaming; NoSQL & Search; Provisioning & Coordination; ML, Graph; SQL; Workflow & Data Governance; Data Integration & Access. Components: Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue. * Certification/support planned for 2014
  • 49. MapR Distribution for Hadoop (same ecosystem diagram as the previous slide, annotated with platform capabilities). Enterprise-grade: • High availability • Data protection • Disaster recovery. Interoperability: • Standard file access • Standard database access • Pluggable services • Broad developer support. Security: • Enterprise security authorization • Wire-level authentication • Data governance. Operational: • Ability to support predictive analytics, real-time database operations, and high-arrival-rate data. Multi-tenancy: • Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators. Performance: • 2X to 7X higher performance • Consistent, low latency
  • 50. MapR: Best Solution for Customer Success. Top Ranked; Exponential Growth; 500+ Customers; Premier Investors. • 3X bookings, Q1 ’13 to Q1 ’14 • 80% of accounts expand 3X • 90% software licenses • <1% lifetime churn • >$1B in incremental revenue generated by 1 customer
  • 51. MapR and Syncsort Reference Architecture. Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; blogs, tweets, link data. MapR Distribution for Hadoop (MapR Data Platform: MapR-DB, MapR-FS): Batch (MR, Spark, Hive, Pig, …); Interactive (Impala, Drill, …); Streaming (Spark Streaming, Storm…). Also shown: data marts, data warehouse, business intelligence / visualization
  • 52. Do You Know Syncsort? • Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data” • Fastest sort technology in the market: powering 50% of mainframes’ sort • A history of innovation: 25+ issued & pending patents • Large global customer base: 12,000+ deployments in 80 countries, serving 87 of the Fortune 100 • First-to-market, fully integrated approach to Hadoop ETL • Top-7 contributor to Hadoop, based on number of lines of code changed in 2013. Our customers are achieving the impossible, every day! Key Partners
  • 53. The Hadoop Challenge. Most organizations use Hadoop to… Extract, Transform, Load (ETL): Collect; Process (Sort, Join, Aggregate, Copy, Merge); Distribute
  • 54. Turning Hadoop into a Feature-rich ETL Solution with DMX-h ETL. Collect: • Broad-based connectivity with automated parallelism • Best-in-class mainframe data access & translation. Process & Distribute: • No manual coding: GUI for developing & maintaining MR jobs • No code generation: engine runs natively on each node • Develop & test locally in Windows; run natively on Hadoop. Optimize & Secure: • Faster throughput per node • Full support for Kerberos & LDAP • Web-based monitoring console • Sort-work compression for storage savings
  • 55. A Roadmap to Hadoop Success (diagram labels: Agile Data Exploration & Visualization; Next-gen Analytics; Cheap Storage; Offload Data Warehouse; Enabling the Data-driven Organization; Solving the Intractable IT Problem)
  • 56. MapR + Syncsort Solutions. Data Warehouse Optimization: Shift ELT Workloads to Hadoop. Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop. Click-stream Analysis: Collect, Process & Analyze More Data from Your Website
  • 57. Q&A. Engage with us! 1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox 2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr 3. Learn best practices for Hadoop ETL: www.mapr.com/EDH