How to build
a data warehouse?
Dmytro Popovych, SE @ Tubular
Theory vs practice
Quote #441422
While engineers in white lab coats are bolting a beautiful engine onto a perfect wing, a crew of dishevelled oddballs led by a mad adventurer flies over them on a contraption built from a minibus, a fence, and two industrial hair dryers, heading for their second round of investment.
Beautiful projects never take off, because they run out of time to take off.
Agenda
• Problem statement:
• Data Ingestion
• Data Normalisation
• Data Access
• How we solve the problem :)
About us
• Video intelligence for the cross-platform world
• 30+ video platforms including YouTube, Facebook, Instagram
• 7M creators
• 3B videos
• 2 TB of newly ingested data a day
• 150 TB of data in the warehouse
What is a data warehouse?
A central repository of data collected from disparate sources.
[Diagram: analysts, engineers, and services all work with the data warehouse.]
Key features
Ingestion
Store raw data extracted from disparate data sources
Normalisation
Cleanup / combine raw data
Access
Help users retrieve data
What problems does it solve in Tubular?
• For engineers / analysts:
• data discovery
• prototyping / analysis
• For services:
• data exchange
Data Ingestion
Data Ingestion Problems
• Real time data:
• tweets, comments, shares, views
• Periodical snapshots:
• dump of real time data
• results of the data analysis
• databases from internal services (in some cases)
Real time data
DATABUS / event log / message queue
Powered by KAFKA
Data serialised with AVRO
Keeps all events for the last N days
[Diagram: services publish events to the databus; from there they flow to permanent storage.]
Why did we choose Kafka?
• Stores streams of records in a fault-tolerant way
• Designed to serve multiple consumers per topic
• Allows keeping the last N days of records
• Battle-tested at very large companies: LinkedIn, Twitter, Uber, Airbnb, ...
Why did we choose Avro?
• Strict schema definition
• Safe schema evolution
• Compact (binary serialisation format)
• Cross-technology format (Java, Python, …)
• Has an ecosystem around it (Schema Registry, CLI consumers, …)
• Hadoop-friendly
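Not from the slides: a minimal sketch of what publishing an Avro-serialised event to a databus topic could look like in Python, assuming the kafka-python and fastavro libraries; the topic name, schema, and broker address are hypothetical, and the Schema Registry mentioned above is omitted here.

```python
import io

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer

# Hypothetical schema for a "video view" event; field names are illustrative.
VIEW_EVENT_SCHEMA = parse_schema({
    "type": "record",
    "name": "VideoView",
    "fields": [
        {"name": "video_id", "type": "string"},
        {"name": "platform", "type": "string"},
        {"name": "views", "type": "long"},
        {"name": "observed_at", "type": "long"},  # unix timestamp, milliseconds
    ],
})


def serialize(event: dict) -> bytes:
    """Serialise an event with the Avro schema (schemaless body, no container header)."""
    buffer = io.BytesIO()
    schemaless_writer(buffer, VIEW_EVENT_SCHEMA, event)
    return buffer.getvalue()


producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],  # assumed broker address
    value_serializer=serialize,
)

producer.send("video-views", {
    "video_id": "abc123",
    "platform": "youtube",
    "views": 42,
    "observed_at": 1_500_000_000_000,
})
producer.flush()
```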
Periodical Snapshots
[Diagram: a data import tool (powered by Spark) pulls snapshots from the databus and from services backed by Elastic, Cassandra, and MySQL, and writes them to permanent storage on S3, serialised as Parquet.]
Why did we choose S3?
• Nothing for us to operate or maintain (fully managed)
• Compatible with Hadoop ecosystem
• Relatively stable & cheap
Why did we choose Parquet?
• Column-oriented format (perfect for analytics and partial reads)
• Supports complex data structures
• Compatible with Hadoop ecosystem
Why did we choose Spark?
• Scalable data processing engine
• Faster than Hadoop MapReduce
• Has connectors for all popular storage systems: JDBC, Elastic, Cassandra, Kafka
• Has Python bindings
• Built-in support of Parquet
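To make the import path concrete, here is a minimal sketch of a snapshot import job in PySpark, assuming the stock JDBC reader, a MySQL driver on the classpath, and an s3a-backed bucket; the database, table, credentials, and bucket names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snapshot-import").getOrCreate()

# Pull a snapshot of a table from an internal MySQL-backed service (hypothetical DSN and table).
creators = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://internal-db:3306/creators")
    .option("dbtable", "creators")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# Write the snapshot to permanent storage on S3 as Parquet, under a dated prefix.
(
    creators
    .write
    .mode("overwrite")
    .parquet("s3a://warehouse/raw/creators/dt=2017-01-01/")
)
```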
Data Normalisation
Data Normalisation Problems
• Clean up duplicates
• Partition by year / month / day / hour
• Join various data sources
Normalisation of real time data (example)
[Diagram: Service #1 (powered by Elastic) consumes streams from the databus and serves the UI; the databus also feeds permanent storage.]
The service joins multiple data streams by sending
partial updates to Elastic.
Note: this isn't the only way to implement a real-time join; a more generic solution could be built with Apache Samza.
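As an illustration of the partial-update approach (not code from the talk), a minimal sketch using the elasticsearch Python client; the cluster address, index name, and document fields are hypothetical.

```python
from elasticsearch import Elasticsearch

# Assumed cluster address and index layout; both are illustrative.
es = Elasticsearch(["http://elastic:9200"])


def apply_view_count(video_id: str, views: int) -> None:
    """Merge one stream's fields into the video document via a partial update.

    Other streams (comments, shares, ...) update their own fields the same way,
    so the document effectively becomes the join of all streams.
    """
    es.update(
        index="videos",
        id=video_id,
        body={
            "doc": {"views": views},
            "doc_as_upsert": True,  # create the document if it doesn't exist yet
        },
    )


apply_view_count("abc123", 42)
```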
Why did we choose Elastic?
• Provides real time search and analytics
• Has relatively cheap partial updates
• Easy to scale
Normalisation of previously imported data
[Diagram: a data normalisation tool (powered by Spark) reads from and writes back to permanent storage.]
The tool:
• joins various datasets
• removes duplicates
• creates partitions by time-range buckets
Why did we choose Spark?
• Scalable data processing engine
• Has a built-in SQL API to transform data (perfect for joins and deduplication)
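A minimal sketch of such a normalisation pass with Spark's DataFrame API, assuming a raw Parquet dataset with video_id, observed_at (timestamp), and ingested_at columns; all paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("normalise-views").getOrCreate()

raw = spark.read.parquet("s3a://warehouse/raw/video-views/")

# Deduplicate: keep only the most recently ingested record per (video_id, observed_at) key.
window = Window.partitionBy("video_id", "observed_at").orderBy(F.col("ingested_at").desc())
deduped = (
    raw.withColumn("rank", F.row_number().over(window))
    .filter(F.col("rank") == 1)
    .drop("rank")
)

# Derive time-range bucket columns from the event timestamp.
normalised = (
    deduped
    .withColumn("year", F.year("observed_at"))
    .withColumn("month", F.month("observed_at"))
    .withColumn("day", F.dayofmonth("observed_at"))
)

# Write back to permanent storage, partitioned by the time buckets.
(
    normalised.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3a://warehouse/normalised/video-views/")
)
```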
Data Access
Data Access Problems
• Dataset discovery
• Unified data access interface
Metadata Storage
[Diagram: files in permanent storage (Parquet, Avro, CSV, ...) are registered as tables in the metadata storage, powered by Hive Metastore.]
Why did we choose Hive Metastore?
• Supported by Hadoop ecosystem
• Simple (a Thrift API on top of MySQL tables)
• Supported by Hue (UI to access metadata)
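To illustrate how a dataset in permanent storage becomes a discoverable table (a sketch under assumed names, not the exact Tubular setup), Spark with Hive support can register and query it through the metastore; the table name, columns, and location are hypothetical.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive Metastore as its table catalog.
spark = (
    SparkSession.builder
    .appName("register-table")
    .enableHiveSupport()
    .getOrCreate()
)

# Register a Parquet dataset in permanent storage as a table in the metastore.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS video_views (
        video_id STRING,
        platform STRING,
        views BIGINT
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    LOCATION 's3a://warehouse/normalised/video-views/'
""")

# Discover partitions already present under the table location.
spark.sql("MSCK REPAIR TABLE video_views")

# Anyone with access to the metastore (Spark, Hue, ...) can now find and query the table.
spark.sql("SELECT platform, SUM(views) FROM video_views GROUP BY platform").show()
```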
Let's summarise...
System Overview
[Diagram: the warehouse comprises the databus, permanent storage, metadata storage, the import tool, and the normalisation tool; services feed data in and consume it, while analysts and engineers work on top.]
* Data flows for the metadata storage are explained verbally, too many arrows...
Thanks! Questions?
Check this out: https://github.com/Tubular/sparkly