Self-Service Analytics on Hadoop: Lessons Learned
June 29, 2016
Drew Leamon
Director – Advanced Technology Solutions
Comcast: Shaping the Future of Media and Technology
High Speed Internet
Video
IP Telephony
Home Security / Automation
Universal Parks
Media Properties
Forecast
Engineering Design
Budget
Engineering Analysis: Global Central Analysis Team
Animals are Best Suited in Their Native Habitat
Spreadsheets: The Natural Habitat of Analysts
Evolution of Self Service Analytics
SSRS
Self Service: Native Habitat
Limitations of the Spreadsheet Native Habitat
• 1 Million Row Max
Self Service
• Not Even Medium Data
• Not Collaborative
• No Automation
• Not Repeatable
IT Analyst
Self Service: How We Started
Analyst goes to IT, makes a request, and waits weeks for results
SSRS
• 10 TB Storage
• 1 Compute Node
Not Self Service
• 10 TB (Medium Data)
• Limited Compute
• IT Hand-off
• Consultative service
• Not self service.
IT Analysts
Bigger database still meant building dashboards for team
IT Analysts
Still Not Self Service
• 100s TBs (Large Data)
• Data silos
• IT Hand-off
• Consultative service
• Analysts not SQL experts
Graduated to Specialized Databases
• Clustered Storage
• Columnar Compression
• Clustered Compute
Datameer, native on Hadoop, enables self-service for big data
Analysts
True Self Service
• PB == Big Data
• Data Lake
• Excel-like UI
• No more waiting for IT
Self Service: The New Way
• Clustered Storage
• Columnar Compression
• Clustered Compute
• Liberated Data
Multiple Configurations for Big Data
Engineering Analysis
IP Telephony
Video Research
IP Video Engineering
X1 Operations
Advanced Advertising
Web Analytics
Enterprise Business Intelligence
Network Engineering
Mature
Evolving
On-Boarded
On-Deck
Expanding Use Cases with Datameer
Use Case #1: Comcast Digital Voice
One Of The Largest IP Telephony Networks
Anonymized Call Detail Records (CDR) Data Set
Data complexity from network
Data size: TBs/month
Discovered Unusual Patterns
Noticed large spikes for high cost areas
Hypothesis: Network Abuse
30% of this traffic was coming from three accounts.
Analysis Shows Traffic Concentrated in a Few Accounts
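The kind of aggregation behind a finding like this can be sketched in a few lines of Python. This is only an illustration, not the analysts' actual Datameer workbook: the record layout (an anonymized account id and call minutes), the function name `top_account_share`, and the sample numbers are all assumptions made up for the example.

```python
from collections import Counter


def top_account_share(records, n=3):
    """Return (top_accounts, share), where share is the fraction of all
    minutes that is carried by the n busiest accounts."""
    minutes_by_account = Counter()
    for account_id, minutes in records:
        minutes_by_account[account_id] += minutes

    total = sum(minutes_by_account.values())
    if total == 0:
        return [], 0.0

    top = minutes_by_account.most_common(n)
    share = sum(minutes for _, minutes in top) / total
    return [account for account, _ in top], share


# Hypothetical sample: (anonymized account id, minutes to high-cost areas).
# Three heavy accounts plus twenty ordinary ones; the numbers are invented.
sample_cdrs = [("acct-01", 150), ("acct-02", 100), ("acct-03", 50)]
sample_cdrs += [(f"acct-{i:02d}", 35) for i in range(4, 24)]

accounts, share = top_account_share(sample_cdrs, n=3)
print(f"Top {len(accounts)} accounts carry {share:.0%} of traffic")
```

On real call detail records the same group-and-rank step would run on the cluster rather than in memory; the point of the sketch is only the shape of the calculation.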
Ongoing Monitoring of Future Abuse
Analyst scheduled a Tableau Data Extract and built a Tableau dashboard
- Now the business can keep an eye out for further abuse.
Result: Future Abuse Prevented and More
• Abuse detected
• Analysts empowered
• Resources saved
• No IT hand-off
• Value to organization
• Automated and repeatable
Engineering Analysis
IP Telephony
Video Research
IP Video Engineering
X1 Operations
Advanced Advertising
Web Analytics
Enterprise Business Intelligence
Network Engineering
Mature
Evolving
On-Boarded
On-Deck
Expanding Use Cases with Datameer
Use Case #2: Customer Perspective
How to measure customer experience from the customer perspective
Millions of Viewing Experiences
Improved Customer Experience through Data Analytics
Findings / Analysis
Best Practices
Improved Customer Experience
Data driven scheduling
Dataflow Automation
Solution:
- Build views quickly & aggregate large datasets.
- Early visibility of data in Hadoop
- Create repeatable processes through automated workflow
• Aggregations of large datasets from disparate data sources.
- RDBMS, HDFS, APIs
• Data Joins / Data Quality Checks / Pipeline between clusters
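A data quality check followed by a join across two sources might look roughly like the Python sketch below. It is a toy, in-memory illustration under stated assumptions, not the Datameer workflow from the talk: the field names (`device_id`, `buffering_ratio`, `region`), the helper names, and the sample rows are all hypothetical.

```python
def quality_check(rows, required):
    """Split rows into (good, bad). A row is bad if any required
    field is missing or None."""
    good, bad = [], []
    for row in rows:
        is_complete = all(row.get(field) is not None for field in required)
        (good if is_complete else bad).append(row)
    return good, bad


def inner_join(left, right, key):
    """Hash join: index the right side by key, then stream the left side
    and emit one merged row per matching pair."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical viewing-session rows (e.g. landed in HDFS).
sessions = [
    {"device_id": "d1", "buffering_ratio": 0.02},
    {"device_id": "d2", "buffering_ratio": None},  # fails the quality check
    {"device_id": "d3", "buffering_ratio": 0.10},
]
# Hypothetical device reference rows (e.g. from an RDBMS).
devices = [
    {"device_id": "d1", "region": "east"},
    {"device_id": "d3", "region": "west"},
]

clean, rejected = quality_check(sessions, ["device_id", "buffering_ratio"])
joined = inner_join(clean, devices, "device_id")
print(f"{len(joined)} joined rows, {len(rejected)} rejected")
```

Running the check before the join keeps incomplete rows out of downstream aggregates and leaves a rejected set that can be inspected separately.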
Result: Data-driven Customer Viewing Experience Enhancements
• Customer Experience Improved
• Analysts empowered
• Capital Spend Directed Intelligently
• No IT hand-off
• Value to organization
• Automated and repeatable

Editor's Notes

  • #2: Welcome. Self-introduction. Journey to self-service big data, based on lessons learned from the work that we have done at Comcast.
  • #3: Comcast introduction. Cable organization: High Speed Internet; Emmy-winning video platform; Home Security & Automation; IP Telephony. NBC Universal: media properties; Universal theme parks. Scale: 10s of millions of customers / 100s of millions of devices.
  • #4: Intro to my team. Initial charter: start with massive amounts of data; deliver budget guidance; deliver forecasts; engineering design guidance. My specific goal is to empower all of these activities and more with technology.
  • #5: Hadoop Summit – Data Lake. Safari in Africa. Musth – testosterone spikes 60x. You will never experience that in a zoo, nor in a theme park. Native habitat is critical.
  • #7: We started with self-service analytics: Excel on a laptop; single resource / no handoffs; contained; scaled to 1M rows. Migrated to SQL Server / SSRS – not self service: IT infrastructure / handoff; limit at 250 GB. 8 years ago, before big data was cool, we had big data problems – enter Vertica: columnar data store; 100s of TBs; still have silos. Enter Datameer on Hadoop to bring us back to self-service analytics.
  • #8: Technical limitations: 1M row max (not even medium data); not collaborative; no automation; not repeatable.
  • #9: Consultative model. Limit for SQL Server at ~250 GB. IT handoffs. Model is consultative. We actually moved away from self service: in Excel, analysts had access to data. IT service analytics.
  • #10: Now we can store TBs of data in clusters of servers. If you really have big data, you are still going to end up with silos. IT handoff, and still consultative. Analysts don’t know SQL, or at least don’t know it well enough to avoid causing problems.
  • #11: Open source. Dataset blending. Have true self-service. No IT handoffs. Datameer 5000-row sample.
  • #12: Multiple configurations/distributions. Mixture of bare metal and virtualized. Multiple distributions. When we use “big data” like this, we do so in compliance with all applicable privacy and security requirements and laws.
  • #13: Diagram details the maturity of different use cases. Many are being targeted and are at varying levels of maturity. At this point I’m going to focus in on a specific use case. Lots of consultative work. Invested to make them self-sufficient with Datameer.
  • #14: Comcast Digital Voice. We are one of the largest telephone carriers in the country. This is an important line of business for us. There are many parts of the business, including wholesale and peering relationships, that need to be managed. All of these rely on data to make decisions on how to manage the network and the relationships.
  • #15: IP telephony is complex: a deep engineering field. Intricacies: session border controllers, media gateways. SMEs and analysts have deep engineering knowledge; my team did not have that knowledge. Consultative approach was very challenging: handoff errors; built the wrong thing; extremely iterative and costly.
  • #16: Solution: get the SMEs and analysts into the data with Datameer. Data: anonymized CDRs; TBs of data per day. Datameer UI – 5000-row sample of data; real-time feedback. Create your data pipeline via an XLS-like UI. Instantaneous feedback.
  • #17: What happened (second hand). Data discovery. Profiling – understand the data. Let the data tell its story. Noticed something strange in the data: spike to high-cost areas (international?). Question: what does it mean?
  • #18: Hypothesis: network abuse. Not legitimate use. Violation of the terms of service. Not going to give a course in how to abuse our services.
  • #19: The SMEs’/analysts’ hypothesis. Dug deeper / created aggregations. Large percentage of traffic was coming from a handful of accounts.
  • #20: Datameer has visualization capability (infographics). Tableau is fairly well adopted. Using Datameer’s integration with Tableau, SMEs created an automation in Datameer to push a TDE to Tableau Server.
  • #21: Abuse detected and addressed. Analyst directed and empowered. No IT handoff. Value delivered to the organization. Automated and repeatable.
  • #22: Diagram details the maturity of different use cases. Many are being targeted and are at varying levels of maturity. At this point I’m going to focus in on a specific use case. Lots of consultative work. Invested to make them self-sufficient with Datameer.
  • #25: Sausage funnel. Inputs: 3rd-party QoE; network QoS; in-home QoS. Outputs: improved IP video QoE; improved NPS.
  • #26: Blend, analyze, share. Rapid prototyping – disparate data sets.
  • #27: Changing how we prioritize capital spend. Optimizing for CX – the right KPI.