Data Integration
Contents
1 Introduction
2 Data Ingestion
3 Data Processing
4 Data Architectures
5 Workshop
1. Introduction
vision
products
data science
Data access
data
infrastructure
Data Needs
Relational DBs
Log files
Search indexes
NoSQL DBs
Message queue
Monitoring
Data Sources
Data Warehouse
ETL
Data Warehouse Ingestion
Source → Extract → Transform → Load → Sink
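The Extract, Transform and Load stages between a source and a sink can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a warehouse implementation: the in-memory `source_rows` list, the `sink` list and the three function names are hypothetical and do not come from the slides.

```python
from typing import Iterable, Iterator

# Hypothetical source records; a real source would be a database or log file.
source_rows = [
    {"user": " Alice ", "amount": "10.5"},
    {"user": "bob", "amount": "3"},
]

def extract(rows: Iterable[dict]) -> Iterator[dict]:
    """Read raw records from the source, one at a time."""
    for row in rows:
        yield dict(row)

def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Clean and type the records before they reach the sink."""
    for row in rows:
        yield {
            "user": row["user"].strip().lower(),
            "amount": float(row["amount"]),
        }

def load(rows: Iterable[dict], sink: list) -> int:
    """Append transformed records to the sink; return how many were written."""
    count = 0
    for row in rows:
        sink.append(row)
        count += 1
    return count

sink: list = []  # stands in for the data warehouse
written = load(transform(extract(source_rows)), sink)
```

In ETL the transform runs before the load, so only cleaned records reach the warehouse. An ELT pipeline would load the raw rows first and transform them later inside the target store.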
1990 Data Warehousing
- Drop relational assumption
- Programmability
- Open Source
2008 Hadoop + MapReduce
- Batch → Real-time
- Daily → Continuous
2015 Kafka + Streaming data
2. Data Ingestion
From ETL to ELT: Flume, Sqoop, Kafka
Sqoop
Flume
Data Lake
Kafka Producer Kafka Producer
Kafka Consumer
Data Lake Ingestion
Kafka
Channel
Channel
Processor
Interceptor #1
Interceptor #N
Sink
Source
Flume Agent
Apache Flume
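The agent layout on this slide (source, interceptor chain, channel, sink) can be modelled roughly as below. This is a toy sketch of the data flow only, not Flume's Java API or configuration format: the `Agent` class, the two interceptors and the `web-01` host name are assumptions made for illustration.

```python
from queue import Queue
from typing import Callable, List, Optional

Event = dict
Interceptor = Callable[[Event], Optional[Event]]

def add_host(event: Event) -> Optional[Event]:
    """Interceptor that tags each event with a (made-up) host name."""
    return {**event, "host": "web-01"}

def drop_debug(event: Event) -> Optional[Event]:
    """Interceptor that filters out debug-level events by returning None."""
    return None if event.get("level") == "DEBUG" else event

class Agent:
    def __init__(self, interceptors: List[Interceptor]) -> None:
        self.interceptors = interceptors
        self.channel: Queue = Queue()      # buffer between source and sink
        self.delivered: List[Event] = []   # stands in for HDFS, Kafka, etc.

    def source(self, event: Event) -> None:
        """Receive an event, run the interceptor chain, then enqueue it."""
        current: Optional[Event] = event
        for intercept in self.interceptors:
            current = intercept(current)
            if current is None:
                return  # an interceptor dropped the event
        self.channel.put(current)

    def sink(self) -> None:
        """Drain the channel into the destination."""
        while not self.channel.empty():
            self.delivered.append(self.channel.get())

agent = Agent([add_host, drop_debug])
agent.source({"level": "INFO", "body": "started"})
agent.source({"level": "DEBUG", "body": "noise"})
agent.sink()
```

The channel decouples the two ends: the source can keep accepting events while the sink drains at its own pace.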
Avro
Thrift
Kafka
Exec
JMS
Spool dir
Twitter
Netcat
Syslog
HTTP
HDFS
Kafka
Hive
Logger
Avro
Thrift
IRC
HBase
Elastic
RDBMS
Apache Sqoop
Sqoop Tool
Import
Export
Data Pipeline Problem
Inter-process
communication
channel
Data Pipeline Problem
Metrics
Pub/Sub
A publish/subscribe
System
Data Pipeline Problem
Metrics
Pub/Sub
Logging
Pub/Sub
Multiple
publish/subscribe
Systems
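The publish/subscribe idea behind these slides can be shown with a small in-memory broker. It is a sketch under simplifying assumptions (single process, no persistence, no partitions), and the `Broker` class and topic names are hypothetical, not part of any Kafka client library.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]

class Broker:
    """Routes each published message to every subscriber of its topic."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message and return the number of receivers."""
        handlers = self.subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)

broker = Broker()
dashboard: List[str] = []
alerts: List[str] = []
log_archive: List[str] = []

# Two independent consumers of the metrics topic, one of the logging topic.
broker.subscribe("metrics", dashboard.append)
broker.subscribe("metrics", alerts.append)
broker.subscribe("logging", log_archive.append)

receivers = broker.publish("metrics", "cpu=0.93")
broker.publish("logging", "user login")
```

Producers only know the topic name, so a new consumer can be added without touching them. Sharing one broker across the metrics and logging topics avoids running a separate system per data type.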
Apache Kafka
Broker 1 Broker 2 Broker 3
Kafka Cluster
Consumer
Kafka as reliable Flume channel
Flume + Kafka
Source Sink
Channel
Producer
Flume as Kafka producer/consumer
3. Data Processing
Batch Processing
Data Lake
Batch
Processing
Pageviews
[url, timestamp]
[url, timestamp]
[url, timestamp]
[url, timestamp]
DB
Rollups
[url, hour, count]
[url, hour, count]
[url, hour, count]
{url+hour : count}
{url+hour : count}
{url+hour : count}
mapreduce mapreduce Data Analysis
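The batch rollup on this slide turns raw `[url, timestamp]` pageviews into `[url, hour, count]` rows and then into a `{url+hour : count}` lookup. A minimal single-machine sketch of that map and reduce follows. The sample pageviews, the ISO timestamp format and the function names are assumptions, and a real job would run distributed over the data lake.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Iterator, Tuple

# Hypothetical pageview records: (url, ISO-8601 timestamp).
pageviews = [
    ("/home", "2016-05-01T10:05:00"),
    ("/home", "2016-05-01T10:40:00"),
    ("/docs", "2016-05-01T10:59:59"),
    ("/home", "2016-05-01T11:01:00"),
]

def hour_of(timestamp: str) -> str:
    """Truncate a timestamp to its hour bucket."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H")

def map_phase(views: Iterable[Tuple[str, str]]) -> Iterator[Tuple[Tuple[str, str], int]]:
    """Emit ((url, hour), 1) for every pageview."""
    for url, timestamp in views:
        yield (url, hour_of(timestamp)), 1

def reduce_phase(pairs: Iterable[Tuple[Tuple[str, str], int]]) -> Counter:
    """Sum the emitted ones per (url, hour) key."""
    counts: Counter = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts

# First job: pageviews -> rollup rows [url, hour, count].
rollups = [
    (url, hour, count)
    for (url, hour), count in sorted(reduce_phase(map_phase(pageviews)).items())
]

# Second job: rollup rows -> key-value pairs {url+hour: count} for the DB.
key_value = {f"{url}+{hour}": count for url, hour, count in rollups}
```

Because the whole input is available up front, the job can be rerun from scratch at any time, which is what makes batch results easy to correct.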
Stream Processing
Real Time Technologies
Data
Source
flume
Kafka producer
Events /
DB writes
Process
Stream
Event
Stream
Output
Stream
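In contrast to a batch job, a stream processor handles each event as it arrives and emits an output stream of updated results. The generator below is a rough sketch of that pattern. The event values and the `process_stream` name are hypothetical, and a production system would read from Kafka or Flume rather than a Python list.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Iterator, Tuple

def process_stream(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume an event stream of URLs and emit (url, running_count) per event."""
    counts: DefaultDict[str, int] = defaultdict(int)
    for url in events:
        counts[url] += 1          # state is updated incrementally
        yield url, counts[url]    # one output record per input event

event_stream = ["/home", "/docs", "/home"]
output_stream = list(process_stream(event_stream))
```

The processor never sees the full data set. It keeps only the running state it needs, so results are available continuously instead of once a day.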
4. Data Architectures
Data Lake
Batch
Processing
Data Processing Architecture
Data
Source
flume
Kafka producer
Data Analysis
Data Lake
Batch
Processing
Stream
Processing
Data Processing Architecture
Data
Source
flume
Kafka producer
Data Analysis
Lambda Architecture
Serving Layer
New Data
Stream
Batch Views
Real-Time Views
Partial
Aggregate
Partial
Aggregate
Partial
Aggregate
Real-Time Data
Batch Layer
Precompute Views (MapReduce)
Batch
Processing
Real-Time
Layer
Increment Views
Stream
Processing
Process
Stream
Merged
View
query
merge
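The merge step of the Lambda architecture combines a precomputed batch view with the real-time view that covers events since the last batch run. A small sketch, assuming both views are per-URL counters (the sample numbers and function names are illustrative only):

```python
from typing import Dict

# Precomputed by the batch layer up to the last batch run.
batch_view: Dict[str, int] = {"/home": 100, "/docs": 40}

# Incremented by the real-time layer for events since that run.
realtime_view: Dict[str, int] = {"/home": 3, "/blog": 1}

def query(url: str) -> int:
    """Answer a query by merging the two partial aggregates."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

def merged_view() -> Dict[str, int]:
    """Materialize the merged view over every known key."""
    keys = sorted(set(batch_view) | set(realtime_view))
    return {key: query(key) for key in keys}

home_count = query("/home")
full_view = merged_view()
```

Addition works here because counts are a mergeable aggregate. Metrics such as distinct users need a different merge function, which is one source of the architecture's complexity.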
Data Lake
Batch
Processing
Stream
Processing
Data Processing Architecture
Data
Source
flume
Kafka producer
Serving
Layer
Data Analysis
Kappa Architecture
Serving Layer
query
Serving DB
Output Table n
Output Table n+1
Stream Processing System
Job Version n
Job Version n+1
Data Storage
1
New Data
Stream
2 3 ..
Where everything is a stream
Real-Time Layer
query
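The Kappa slide shows job version n+1 building a new output table while version n keeps serving. One way to picture it is to replay the retained log through the new job and then switch the serving pointer. The log contents, both job functions and the table names below are assumptions made for this sketch.

```python
from typing import Callable, Dict, List

# Append-only event log retained in data storage (offsets 0, 1, 2, ...).
log: List[str] = ["/home", "/docs", "/Home", "/home"]

def job_v1(events: List[str]) -> Dict[str, int]:
    """Job version n: counts URLs exactly as they appear."""
    counts: Dict[str, int] = {}
    for url in events:
        counts[url] = counts.get(url, 0) + 1
    return counts

def job_v2(events: List[str]) -> Dict[str, int]:
    """Job version n+1: same count, but normalizes URL case first."""
    return job_v1([url.lower() for url in events])

def run(job: Callable[[List[str]], Dict[str, int]]) -> Dict[str, int]:
    """Reprocess the whole log from offset 0 with the given job version."""
    return job(log)

tables: Dict[str, Dict[str, int]] = {"output_n": run(job_v1)}
serving = "output_n"
before = tables[serving].get("/home", 0)

# Deploy the new version: replay the log into a new table, then switch.
tables["output_n_plus_1"] = run(job_v2)
serving = "output_n_plus_1"
del tables["output_n"]  # the old table is dropped once the switch is done

def query(url: str) -> int:
    return tables[serving].get(url, 0)

after = query("/home")
```

There is no separate batch code path: reprocessing is the same streaming job started again from the beginning of the log.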
5. Workshop
THANKS!
Any questions?
@datiobd
flasheras@datiobd.com rbravo@datiobd.com
datio-big-data
