Breaking ETL barrier
with Real-time reporting
using Kafka, Spark Streaming
About us
Concur (now part of SAP) provides travel and
expense management services to
businesses.
Data Insights
A team that is building solutions to provide
customer access to data, visualization and
reporting.
Expense
Travel
Invoice
About me
Santosh Sahoo
Principal Architect III, Data Insights
Stack so far..
App -> OLTP -> ETL -> OLAP -> Report
Numbers
7K OLTP database sources
14K OLAP Reporting dbs
28K ETL Jobs
2B row changes
300M rows (Compacted)
Only ~20 failures a night
Traditional ETL challenges
Scheduled (High latency)
Hard to scale.
Failover and recovery.
Monolithic-ness
Spaghetti (Logic + SQL)
Moving forward
Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual Consistency
Streaming Data Pipeline
Source
Flow Management
Processor
Storage
Querying
Data Source
Event bus for business events
Log scraping
Transaction log scraping
(Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog)
Change Data Capture
Application messaging/JMS
Micro batching
(High-watermark queries, change tracking)
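As a minimal sketch of high-watermark micro-batching: each poll reads only rows whose change marker is above the last watermark, then advances the watermark. The `Row` shape, the `version` column and the `HighWatermark` object are hypothetical names for illustration, not part of the original deck, and the table is an in-memory stand-in for a real query.

```scala
// Hypothetical row shape: `version` stands in for a monotonically increasing
// change marker (for example a rowversion or a last-modified sequence).
final case class Row(id: Int, payload: String, version: Long)

object HighWatermark {
  // Return the rows changed since `watermark`, plus the new watermark to persist.
  def nextBatch(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changed = table.filter(_.version > watermark).sortBy(_.version)
    val newMark = if (changed.isEmpty) watermark else changed.last.version
    (changed, newMark)
  }
}
```

Calling `HighWatermark.nextBatch(table, lastMark)` on a schedule gives one micro-batch per poll; persisting the returned watermark only after the batch is delivered is what would let a failed run be retried.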
Kafka - Flow Management
No nonsense logging
100K/s throughput vs. 20K/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best in class integration with Spark
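Log compaction keeps at least the latest value for each key, which is how a long stream of row changes can shrink to one record per row. The sketch below models that idea on an in-memory log; `Change` and `Compaction` are hypothetical names, and it does not reproduce how Kafka's log cleaner actually works (segments, tombstone retention and so on are left out).

```scala
// None models a delete marker (tombstone) for the key.
final case class Change(key: String, value: Option[String])

object Compaction {
  // Keep only the most recent change for each key, preserving log order of the survivors.
  def compact(log: Seq[Change]): Seq[Change] = {
    // toMap keeps the last index seen for each key.
    val lastIndex: Map[String, Int] =
      log.zipWithIndex.map { case (c, i) => c.key -> i }.toMap
    log.zipWithIndex.collect { case (c, i) if lastIndex(c.key) == i => c }
  }
}
```

A consumer that replays the compacted log from the start ends up with the same final state per key as one that replayed the full log.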
Columnar Storage
Optimized for analytic query
performance.
Vertical partitioning
Column Projection
Compression
Loosely coupled schema.
HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
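To make vertical partitioning, column projection and compression concrete, here is a small in-memory sketch. The `Expense` record, the `Columnar` object and the use of run-length encoding are assumptions for illustration; the stores listed above each use their own formats and encodings.

```scala
final case class Expense(id: Int, currency: String, amount: Double)

object Columnar {
  // Vertical partitioning: one vector per column instead of one record per row.
  final case class Columns(ids: Vector[Int], currencies: Vector[String], amounts: Vector[Double])

  def toColumns(rows: Seq[Expense]): Columns =
    Columns(rows.map(_.id).toVector, rows.map(_.currency).toVector, rows.map(_.amount).toVector)

  // Run-length encoding: repeated adjacent values collapse to (value, count).
  def runLengthEncode[A](column: Seq[A]): Vector[(A, Int)] =
    column.foldLeft(Vector.empty[(A, Int)]) {
      case (acc, v) if acc.nonEmpty && acc.last._1 == v =>
        acc.init :+ (v -> (acc.last._2 + 1))
      case (acc, v) =>
        acc :+ (v -> 1)
    }
}
```

An aggregate such as `Columnar.toColumns(rows).amounts.sum` touches only the `amounts` vector (column projection), and low-cardinality columns such as `currencies` tend to compress well because equal values sit next to each other.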
Hadoop/HDFS
Pro - Scale
Con - Latency
Spark Streaming
What? A data processing framework to build
scalable fault-tolerant streaming
applications.
Why? It lets you reuse the same code for
batch processing, join streams against
historical data, or run ad-hoc queries on
stream state.
Spark Streaming Architecture
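[Diagram: a Driver (Master) assigns Tasks to Executors running on three Workers. A Receiver reads from the Source into data blocks D1-D4; blocks are replicated across Executors (D1, D2 shown on a second Executor) and written to a WAL in the Data Store.]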
DStream - Discretized Stream, a sequence of RDDs
RDD - Resilient Distributed Dataset
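The "discretized" idea, and the claim that batch code can be reused on a stream, can be sketched without Spark: cut the events into fixed-width intervals and apply the same batch function to each interval. This is an in-memory model only; `Event`, `Discretize` and `intervalMs` are hypothetical names, and none of this is Spark's API.

```scala
final case class Event(timestampMs: Long, user: String)

object Discretize {
  // The "batch" logic: count events per user in one collection.
  def countByUser(batch: Seq[Event]): Map[String, Int] =
    batch.groupBy(_.user).map { case (user, events) => user -> events.size }

  // Cut the stream into fixed-width intervals, then reuse the batch logic on each one.
  // Returns (interval start in ms, per-user counts); empty intervals are not emitted.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(_.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (n, batch) => (n * intervalMs, countByUser(batch)) }
}
```

In Spark Streaming each interval would be an RDD and the per-interval function would run on the cluster; the sketch only shows why one function can serve both the batch and the streaming path.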
Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
How
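import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Kafka brokers to connect to, as one comma-separated string.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
// streamingContext is an existing StreamingContext; keys and values are decoded as Strings.
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)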
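With the direct approach, each batch is defined by a range of Kafka offsets per topic partition rather than by whatever a receiver happened to buffer. The sketch below models that bookkeeping in plain Scala, assuming the latest offsets have already been fetched; `BatchRange` and `DirectOffsets` are hypothetical stand-ins and are not Spark classes.

```scala
// Offsets in [fromOffset, untilOffset) of one topic partition make up one batch slice.
final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

object DirectOffsets {
  type TopicPartition = (String, Int)

  // For each partition with new data, the next batch covers
  // [last consumed offset, latest available offset).
  def nextRanges(consumed: Map[TopicPartition, Long],
                 latest: Map[TopicPartition, Long]): Seq[BatchRange] =
    latest.toSeq.sortBy(_._1).collect {
      case ((topic, part), until) if until > consumed.getOrElse((topic, part), 0L) =>
        BatchRange(topic, part, consumed.getOrElse((topic, part), 0L), until)
    }
}
```

Because a range is fully described by its offsets, a failed batch can be re-read from Kafka with the same bounds, which is why this approach does not need a write-ahead log for received data.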
Architecture
App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App
High level view
Complete Architecture
[Diagram; labels as extracted, grouped by component as far as the layout can be recovered]
OLTP sources: Expense, Travel, TTX, API
Import: FTP, HTTP, SMTP
Service bus
Broker: Kafka (P, C; Protobuf, JSON)
Stream processor: Spark, Samza, Storm, Flink; Tachyon; Standby
Archive: Flume, Camus; HDFS
Batch: Sqoop snapshot; Pig/Hive/MR normalization; Hive/Spark SQL
Storage: HANA (load balance, failover, replication)
Reporting: Cognos, Tableau ?
Other labels: Extract, Compensate, Data {Quality, Correction, Analytics}, Migrate method, API/SQL
Can Spark Streaming
survive Chaos Monkey?
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Lambda Architecture
Lambda architecture is a data-processing
pattern designed to handle massive
quantities of data by taking advantage of
both batch- and stream-processing methods.
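One way to picture the pattern is at query time: a batch view that is complete but stale is combined with a speed view that holds only what arrived since the last batch run. The sketch assumes additive counters keyed by name; `LambdaQuery` and both view names are hypothetical.

```scala
object LambdaQuery {
  // Batch view: complete up to the last batch run.
  // Speed view: increments from the stream since that run.
  def merge(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ speedView.keySet).map { key =>
      key -> (batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L))
    }.toMap
}
```

When the next batch run completes, its output replaces the batch view and the matching part of the speed view is discarded, so errors in the streaming path are eventually corrected.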
Demo
….
QnA
concur.com/en-us/careers
We are hiring
Thank you!
