ABC of Distributed Data Processing.
Achieving Buzzword Compliance.
Piyush Verma
Oogway Consulting
Idiosyncrasies
When will my Data become Big Data?
Hive Data Will Save.
How did we reach here?
Data :: Business
Types of Workload
When do I call it Big Enough?
Why bother with Data Engineering?
Why do analysis at all?
Descriptive
- Historical.
- Deterministic.
- Inferential.
- Managers make pretty graphs.
Predictive
- Future.
- Probabilistic.
- Based on Descriptive.
- This is how pundits predict stocks.
Prescriptive
Architecture: Round 1
What does data look like?
Storage Choice 1
Storage Choice 2
Oh no
Challenges: Round 1
Scaling
Archival Policy
Garbage / Purging
All related entities end up in complex joins
All Relationships complicate over Dimension of time
Anatomy
Challenges: Round 2
Star Schema
Snowflake Schema
De-Duplication
Bloom Filters
Cuckoo Filters
- A negative answer means the item does not exist for sure.
- A positive answer means the item may or may not exist (false positives are possible).
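The two answers above can be illustrated with a minimal Bloom filter sketch used for de-duplication. This is not taken from the slides: the class name `BloomFilter`, the helper `deduplicate`, the bit-array size, and the salted SHA-256 hashing are all assumptions chosen for readability rather than efficiency.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter (illustrative, not tuned for production)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Assumption: derive several positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False: the item was never added (does not exist for sure).
        # True: the item may or may not have been added.
        return all(self.bits[pos] for pos in self._positions(item))


def deduplicate(event_ids, seen: BloomFilter):
    """Yield only the ids that the filter reports as definitely new."""
    for event_id in event_ids:
        if not seen.might_contain(event_id):
            seen.add(event_id)
            yield event_id
```

Because a positive answer can be a false positive, this style of de-duplication may occasionally drop an event that was in fact new; it never lets a true duplicate through. Whether that trade-off is acceptable depends on the workload.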
Slowly Changing Dimensions
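One common way to handle a dimension that changes over time is to keep every version of a row with a validity interval (often called a "Type 2" approach). The slides do not say which approach the talk used, so the following is only a sketch under that assumption; `DimensionRow`, `apply_change`, `city_as_of`, and the customer/city fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimensionRow:
    customer_id: str
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[DimensionRow], customer_id: str,
                 new_city: str, changed_on: date) -> None:
    """Close the open row if the value changed, then append the new version."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == new_city:
            return  # nothing changed, keep the open row as it is
        current.valid_to = changed_on
    history.append(DimensionRow(customer_id, new_city, changed_on))


def city_as_of(history: List[DimensionRow], customer_id: str,
               when: date) -> Optional[str]:
    """Return the city that was valid on the given day, if any."""
    for r in history:
        if (r.customer_id == customer_id
                and r.valid_from <= when
                and (r.valid_to is None or when < r.valid_to)):
            return r.city
    return None
```

Keeping closed rows lets a fact recorded in the past join against the dimension value that was true at that time, at the cost of a larger dimension table and range conditions in every join.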
Batching vs. Streaming
Out-of-Order Processing
Cubes
● Efficiency of Retrieval
● Warehouse:Cube :: DB:Table
● View: Dimension + Measure
● Slice, Dice & Rotate
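The cube operations listed above can be sketched in plain Python over a small in-memory fact table. The data, the dimension names (`region`, `product`, `quarter`), and the helper names `build_view`, `slice_`, and `dice` are all hypothetical and only illustrate the idea; a real warehouse would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
FACTS = [
    ("north", "tea", "Q1", 10),
    ("north", "coffee", "Q1", 20),
    ("south", "tea", "Q1", 5),
    ("south", "tea", "Q2", 7),
    ("north", "tea", "Q2", 3),
]
DIMENSIONS = ("region", "product", "quarter")


def build_view(facts, dims):
    """View = chosen dimensions + an aggregated measure (sum of sales)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    view = defaultdict(int)
    for row in facts:
        view[tuple(row[i] for i in idx)] += row[-1]
    return dict(view)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMENSIONS.index(dim)
    return [row for row in facts if row[i] == value]


def dice(facts, **allowed):
    """Dice: keep rows whose dimensions fall inside the given value sets."""
    return [
        row for row in facts
        if all(row[DIMENSIONS.index(d)] in values
               for d, values in allowed.items())
    ]


# Rotate (pivot): the same aggregate with the dimension order swapped.
by_region_quarter = build_view(FACTS, ("region", "quarter"))
by_quarter_region = build_view(FACTS, ("quarter", "region"))
```

Here `build_view(slice_(FACTS, "quarter", "Q1"), ("product",))` gives sales per product for one quarter, while `dice(FACTS, region={"north"}, product={"tea"})` narrows two dimensions at once before aggregating.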
Architecture: Revisited
Sample Solution
Thank you!
Piyush Verma
@meson10
Oogway Consulting
http://oogway.in
