BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Big Data Architectures
Guido Schmutz
Guido Schmutz
Working for Trivadis for more than 18 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz
Agenda
1. Introduction
2. Traditional Architecture for Big Data
3. Streaming Analytics Architecture for Fast Data
4. Lambda/Kappa/Unified Architecture for Big Data
5. Summary
Introduction
Big Data is still “work in progress”
Choosing the right architecture is key for any (big data) project
Big Data is still quite a young field and therefore there are no standard architectures
available which have been used for years
In the past few years, a few architectures have evolved and have been discussed online
Know the use cases before choosing your architecture
Having one or a few reference architectures helps in choosing the right components
Hadoop Ecosystem – many choices ….
[Figure: Hadoop ecosystem products grouped by category: Management/Monitoring, Core,
Analytics, Workflow/Job, Unstructured Data Sources, Structured Data Sources, SQL on Hadoop,
Serialization, Data Storage, Security.]
Important Properties for Choosing a Big Data Architecture
Latency
Keep raw and un-interpreted data “forever”?
Volume, Velocity, Variety, Veracity
Ad-Hoc Query Capabilities needed?
Robustness & Fault Tolerance
Scalability
…
From Volume and Variety to Velocity
Big Data has evolved …
and the Hadoop Ecosystem as well ….
Past: Big Data = Volume & Variety
Present: Big Data = Volume & Variety & Velocity
Past: Batch Processing, time to insight of hours
Present: Batch & Stream Processing, time to insight in seconds
Adapted from Cloudera blog article
Traditional Architecture for Big Data
“Traditional Architecture” for Big Data
[Figure: architecture diagram with the stages Data Sources, Data Collection, (Analytical) Data
Processing, Result Store and Data Consumer.
Data sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine.
Building blocks: Channel, Stage, Batch compute, Raw Data (Reservoir), Computed Information,
Result Store, Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Use Case 1) – Click Stream analysis: 360 degree view of customer
[Figure: the “Traditional Architecture” diagram reduced to the parts used in this use case.
Data sources: Logfiles.
Building blocks: Channel, Batch compute, Raw Data (Reservoir), Computed Information,
Result Store, Query Engine.
Data consumers: Reports, Analytic Tools.]
Use Case 2) – Ingest Relational Data into Hadoop and make it accessible
[Figure: the “Traditional Architecture” diagram reduced to the parts used in this use case.
Data sources: RDBMS.
Building blocks: Batch compute, Raw Data (Reservoir), Computed Information, Result Store,
Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.]
Use Case 2a) – Ingest Relational Data into Hadoop and make it accessible
[Figure: the “Traditional Architecture” diagram reduced to the parts used in this use case.
Data sources: RDBMS (CDC).
Building blocks: Channel, Batch compute, Raw Data (Reservoir), Computed Information,
Result Store, Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.]
“Hadoop Ecosystem” Technology Mapping
[Figure: the “Traditional Architecture” diagram (Data Sources, Data Collection with Channel and
Staging, (Analytical) Data Processing with Batch compute, Raw Data (Reservoir) and Computed
Information, Result Store with Query Engine, Data Consumer) with Hadoop ecosystem products
mapped onto its building blocks. The product names were part of the image and are not
preserved in this transcript.
Legend: Data in Motion, Data at Rest.]
Apache Spark – the new kid on the block
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 at UC Berkeley’s AMPLab
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x
faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+
organizations
• Open-sourced in 2010 – part of the Apache Software Foundation since 2014
• Supported by many vendors
Motivation – Why Apache Spark?
Apache Hadoop MapReduce: Data Sharing on Disk
Apache Spark: Speed up processing by using Memory instead of Disks
[Figure: two processing chains from Input to Output. MapReduce: map, reduce, ... with an
HDFS read before and an HDFS write after each step. Spark: op1, op2, ... chained between
Input and Output.]
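The difference between the two data-sharing styles can be illustrated with a small plain-Python sketch. It is not Hadoop or Spark code: the function names `run_on_disk` and `run_in_memory` are hypothetical, and local JSON files merely stand in for HDFS.

```python
import json
import os
import tempfile


def run_on_disk(records, steps, workdir):
    """MapReduce-style chaining: every step reads its input from a file
    and writes its output to a new file (files stand in for HDFS here)."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as f:
        json.dump(records, f)
    for i, step in enumerate(steps):
        with open(path) as f:                      # "HDFS read"
            data = json.load(f)
        path = os.path.join(workdir, f"step{i}.json")
        with open(path, "w") as f:                 # "HDFS write"
            json.dump([step(x) for x in data], f)
    with open(path) as f:
        return json.load(f)


def run_in_memory(records, steps):
    """Spark-style chaining: intermediate results stay in memory."""
    data = list(records)
    for step in steps:
        data = [step(x) for x in data]
    return data


if __name__ == "__main__":
    steps = [lambda x: x + 1, lambda x: x * 2]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_on_disk([1, 2, 3], steps, workdir))
    print(run_in_memory([1, 2, 3], steps))
```

Both functions compute the same result; the disk variant pays one read and one write per step, which is the overhead the in-memory approach avoids.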
Apache Spark “Ecosystem”
[Figure: Spark stack in layers.
Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming
(Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing).
Core Runtime: Spark Core API and Execution Model.
Cluster Resource Managers: Spark Standalone, Mesos, YARN.
Data / Data Stores: HDFS, Elasticsearch, NoSQL, S3.]
Use Case 3) – Predictive Maintenance through Machine Learning on collected data
[Figure: the “Traditional Architecture” diagram reduced to the parts used in this use case.
Data sources: Machine.
Building blocks: DB, File, Staging, Channel, Batch compute, Raw Data (Reservoir),
Computed Information, Result Store, Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.]
“Spark Ecosystem” Technology Mapping
[Figure: the “Traditional Architecture” diagram (Data Sources, Data Collection with Channel and
Stage, (Analytical) Data Processing with Batch compute, Raw Data (Reservoir) and Computed
Information, Result Store with Query Engine, Data Consumer) with Spark ecosystem products
mapped onto its building blocks. The product names were part of the image and are not
preserved in this transcript.
Legend: Data in Motion, Data at Rest.]
Traditional Architecture for Big Data
• Batch Processing
• Not for low latency use cases
• Spark can speed up processing, but when positioned as an alternative to Hadoop
MapReduce, it is still Batch Processing
• The Spark ecosystem offers a lot of additional advanced analytic capabilities
(machine learning, graph processing, …)
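The batch flow of this section (collect into a raw data reservoir, run a batch compute, serve queries from the result store) can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the class `BatchPipeline`, its method names and the click-event shape are hypothetical and not taken from any Hadoop or Spark API.

```python
from collections import Counter


class BatchPipeline:
    """Toy model: raw data reservoir -> batch compute -> result store."""

    def __init__(self):
        self.raw_data = []       # raw data (reservoir), data at rest
        self.result_store = {}   # computed information, served to queries

    def collect(self, click):
        # Collection only appends to the reservoir; nothing is computed yet.
        self.raw_data.append(click)

    def run_batch(self):
        # Recompute the complete view from all raw data.
        self.result_store = dict(Counter(c["customer"] for c in self.raw_data))

    def query(self, customer):
        # Queries only see the result of the last batch run.
        return self.result_store.get(customer, 0)


if __name__ == "__main__":
    pipeline = BatchPipeline()
    pipeline.collect({"customer": "c1", "page": "/home"})
    print(pipeline.query("c1"))   # not visible before the batch run
    pipeline.run_batch()
    print(pipeline.query("c1"))
```

New events stay invisible to queries until the next batch run, which is why this style does not fit low latency use cases.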
Streaming Analytics Architecture for Big Data
Streaming Analytics Architecture for Big Data
aka (Complex) Event Processing
[Figure: architecture diagram with the stages Data Sources, Data Collection, (Analytical)
Real-Time Data Processing, Result Store and Data Consumer.
Data sources: Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine.
Building blocks: Channel, Messaging, Stream/Event Processing, Batch compute, Result Store.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Use Case 4) Alerting in Internet of Things (IoT)
[Figure: the streaming analytics diagram reduced to the parts used in this use case.
Data sources: Sensor, Machine.
Building blocks: Channel, Messaging, Stream/Event Processing, Batch compute, Result Store.
Data consumers: Analytic Tools, Alerting Tools.]
Use Case 5) Real-Time Analytics on Sensor Events
[Figure: the streaming analytics diagram reduced to the parts used in this use case.
Data sources: Sensor, Machine.
Building blocks: Channel, Messaging, Stream/Event Processing, Batch compute, Result Store.
Data consumers: Analytic Tools.]
Unified Log (Event) Processing
Stream processing allows for computing feeds off of other feeds
Derived feeds are no different than the original feeds they are computed off
Single deployment of “Unified Log” but logically different feeds
[Figure: unified log example with meter readings.
Processing steps: Collector, Enrich / Transform, Aggregate by Minute (twice), Persist (twice).
Feeds: Meter Readings, Raw Meter Readings, Customer, Meter with Customer,
Meter by Customer by Minute, Meter by Minute.]
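The idea of feeds computed off other feeds can be sketched with an in-memory stand-in for the unified log. Everything here is an assumption for illustration: the class `UnifiedLog`, the helper functions, the feed names in snake_case and the event fields (`meter`, `ts` in seconds, `kwh`) are hypothetical and do not belong to any messaging product.

```python
from collections import defaultdict


class UnifiedLog:
    """One deployment, many logically separate append-only feeds."""

    def __init__(self):
        self.feeds = defaultdict(list)

    def append(self, feed, event):
        self.feeds[feed].append(event)

    def read(self, feed, offset=0):
        return self.feeds[feed][offset:]


def enrich(log, customers):
    # Derived feed: raw readings joined with the customer owning the meter.
    for reading in log.read("raw_meter_readings"):
        enriched = {**reading, "customer": customers[reading["meter"]]}
        log.append("meter_with_customer", enriched)


def aggregate_by_minute(log, source, target, key):
    # Derived feed: sum of kWh per key and minute, written back to the log.
    totals = defaultdict(float)
    for event in log.read(source):
        totals[(event[key], event["ts"] // 60)] += event["kwh"]
    for (value, minute), kwh in sorted(totals.items()):
        log.append(target, {key: value, "minute": minute, "kwh": kwh})


if __name__ == "__main__":
    log = UnifiedLog()
    log.append("raw_meter_readings", {"meter": "m1", "ts": 0, "kwh": 1.0})
    log.append("raw_meter_readings", {"meter": "m1", "ts": 30, "kwh": 2.0})
    enrich(log, {"m1": "alice"})
    aggregate_by_minute(log, "raw_meter_readings", "meter_by_minute", "meter")
    aggregate_by_minute(log, "meter_with_customer",
                        "meter_by_customer_by_minute", "customer")
    print(log.read("meter_by_minute"))
    print(log.read("meter_by_customer_by_minute"))
```

A consumer reads `meter_by_minute` exactly the way it reads `raw_meter_readings`, so a derived feed is indistinguishable from an original one.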
Streaming Analytics Technology Mapping
[Figure: the streaming analytics diagram (Data Sources, Data Collection with Channel,
(Analytical) Real-Time Data Processing with Messaging and Stream/Event Processing,
Result Store, Data Consumer) with streaming products mapped onto its building blocks.
The product names were part of the image and are not preserved in this transcript.
Legend: Data in Motion, Data at Rest.]
Streaming Analytics Architecture for Big Data
The solution for low latency use cases
Process each event separately => low latency
Process events in micro-batches => increases latency but offers better
reliability
Previously known as “Complex Event Processing”
Keep the data moving / Data in Motion instead of Data at Rest => raw events
are (often) not stored
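The two processing styles named above can be contrasted in a short plain-Python sketch. The function names are hypothetical, and batching by a fixed event count is a simplifying assumption; real engines usually cut micro-batches by time interval.

```python
def process_per_event(events, handle):
    """Hand every event to the handler as soon as it arrives."""
    for event in events:
        handle([event])


def process_micro_batches(events, handle, batch_size):
    """Buffer events and hand them to the handler in small groups.

    An event waits until its batch is full, which adds latency but lets
    the handler treat each batch as one unit of work.
    """
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:              # flush the incomplete last batch
        handle(batch)


if __name__ == "__main__":
    readings = [71.5, 84.0, 79.9, 90.2, 65.0]
    process_per_event(readings, print)
    process_micro_batches(readings, print, batch_size=2)
```

In the per-event variant the handler runs five times with one reading each; in the micro-batch variant it runs three times, and the first reading is only handled once the second one has arrived.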
Lambda Architecture for Big Data
“Lambda Architecture” for Big Data
[Figure: architecture diagram with the stages Data Sources, Data Collection, (Analytical) Batch
Data Processing, (Analytical) Real-Time Data Processing, Result Store and Data Consumer.
Data sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine.
Building blocks: Channel, Messaging, Batch compute (in both processing stages),
Stream/Event Processing, Raw Data (Reservoir), Computed Information, Result Store (one per
processing stage), Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Use Case 6) Social Media and Social Network Analysis
[Figure: the “Lambda Architecture” diagram applied to this use case.
Data sources: Social.
Building blocks: Channel, Messaging, Batch compute (in both processing stages),
Stream/Event Processing, Raw Data (Reservoir), Computed Information, Result Store (one per
processing stage), Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Lambda Architecture for Big Data
Combines (Big) Data at Rest with (Fast) Data in Motion
Closes the gap from high-latency batch processing
Keeps the raw information forever
Makes it possible to rerun analytics operations on whole data set if necessary
=> because the old run had an error or
=> because we have found a better algorithm we want to apply
Have to implement functionality twice
• Once for batch
• Once for real-time streaming
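The points above can be made concrete with a minimal plain-Python sketch of the two paths and the merged query. The class `LambdaCounts` and the event shape are hypothetical; a real system would run the batch and the streaming path on separate engines.

```python
from collections import Counter


class LambdaCounts:
    """Counts events per user with a batch view plus a real-time view."""

    def __init__(self):
        self.raw_events = []         # raw information, kept forever
        self.batch_view = Counter()  # result of the last batch run
        self.speed_view = Counter()  # events seen since the last batch run

    def ingest(self, event):
        self.raw_events.append(event)          # feeds the batch path
        self.speed_view[event["user"]] += 1    # streaming implementation

    def run_batch(self):
        # Batch implementation of the same logic, rerun on the whole data set.
        self.batch_view = Counter(e["user"] for e in self.raw_events)
        self.speed_view.clear()                # now covered by the batch view

    def query(self, user):
        # Merge both views to close the latency gap of the batch path.
        return self.batch_view[user] + self.speed_view[user]


if __name__ == "__main__":
    counts = LambdaCounts()
    counts.ingest({"user": "u1"})
    counts.run_batch()
    counts.ingest({"user": "u1"})
    print(counts.query("u1"))
```

The counting logic appears twice, once in `ingest` and once in `run_batch`, which is the duplication the last bullet refers to.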
Kappa Architecture for Big Data
“Kappa Architecture” for Big Data
[Figure: architecture diagram with the stages Data Sources, Data Collection, “Raw Data
Reservoir”, (Analytical) Real-Time Data Processing, Result Store and Data Consumer.
Data sources: Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine.
Building blocks: Messaging (in data collection and in real-time processing), Batch compute,
Stream/Event Processing, Raw Data (Reservoir), Result Store.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Kappa Architecture for Big Data
Today, stream processing infrastructures are as scalable as Big Data
processing architectures
• Some use the same base infrastructure, e.g. Hadoop YARN
Only implement processing / analytics logic once
Can replay historical events out of a historical (raw) event store
• Provided by either the Messaging or Raw Data (Reservoir) component
Updates of processing logic / event replay are handled by deploying a new
version of the logic in parallel to the old one
• The new logic reprocesses events until it has caught up with the current events; then
the old version can be decommissioned.
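The replay-based upgrade can be sketched with an in-memory event store. The names `EventLog`, `StreamJob`, `sum_logic` and `max_logic` are hypothetical and only illustrate the procedure; they are not the API of any messaging or stream processing product.

```python
class EventLog:
    """Append-only (raw) event store that supports reading from an offset."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]


class StreamJob:
    """One deployed version of the processing logic with its own offset."""

    def __init__(self, logic):
        self.logic = logic
        self.offset = 0
        self.view = {}

    def poll(self, log):
        for event in log.read_from(self.offset):
            self.logic(self.view, event)
            self.offset += 1

    def caught_up(self, log):
        return self.offset == len(log.events)


def sum_logic(view, event):       # old version: total per sensor
    view[event["sensor"]] = view.get(event["sensor"], 0) + event["value"]


def max_logic(view, event):       # new version: maximum per sensor
    current = view.get(event["sensor"], event["value"])
    view[event["sensor"]] = max(current, event["value"])


if __name__ == "__main__":
    log = EventLog()
    log.append({"sensor": "s1", "value": 3})
    log.append({"sensor": "s1", "value": 5})
    old, new = StreamJob(sum_logic), StreamJob(max_logic)
    old.poll(log)
    new.poll(log)                 # replays history from offset 0
    serving = new if new.caught_up(log) else old
    print(serving.view)
```

The old job keeps serving results while the new one replays from offset 0; once `caught_up` returns true, traffic switches and the old job can be removed.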
Unified Architecture for Big Data
“Unified Architecture” for Big Data
[Figure: architecture diagram with the stages Data Sources, Data Collection, (Analytical) Batch
Data Processing (Calculate Models of incoming data), (Analytical) Real-Time Data Processing,
Result Store and Data Consumer.
Data sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine.
Building blocks: Channel, Messaging, Batch compute (in both processing stages),
Stream/Event Processing, Raw Data (Reservoir), Computed Information, Prediction Models,
Result Store (one per processing stage), Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.
Legend: Data in Motion, Data at Rest.]
Use Case 7) Fraud Detection
[Figure: the “Unified Architecture” diagram applied to this use case.
Data sources: RDBMS, Sensor, ERP, Logfiles, Mobile, Machine.
Building blocks: Channel, Messaging, Batch compute (in both processing stages),
Stream/Event Processing, Raw Data (Reservoir), Computed Information, Prediction Models,
Result Store (one per processing stage), Query Engine.
Data consumers: Reports, Service, Analytic Tools, Alerting Tools.]
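The diagram labels suggest that the batch side calculates prediction models which the real-time side then applies to incoming events. A minimal plain-Python sketch of that split follows. The functions `train_model` and `is_suspicious`, the transaction fields and the "mean plus k standard deviations" rule are all assumptions chosen for illustration, not a fraud detection method taken from the slides.

```python
from statistics import mean, pstdev


def train_model(history):
    """Batch side: derive a per-card model (mean, std dev) from stored data."""
    amounts_by_card = {}
    for tx in history:
        amounts_by_card.setdefault(tx["card"], []).append(tx["amount"])
    return {card: (mean(a), pstdev(a)) for card, a in amounts_by_card.items()}


def is_suspicious(model, tx, k=3.0):
    """Real-time side: score one incoming transaction against the model."""
    if tx["card"] not in model:
        return False   # no model for this card yet (illustrative choice)
    mu, sigma = model[tx["card"]]
    return tx["amount"] > mu + k * sigma


if __name__ == "__main__":
    history = [{"card": "A", "amount": a} for a in (10, 12, 11, 9)]
    model = train_model(history)     # recalculated periodically in batch
    for amount in (11, 500):         # scored one by one as events arrive
        print(amount, is_suspicious(model, {"card": "A", "amount": amount}))
```

The model is refreshed by rerunning `train_model` on the raw data, while `is_suspicious` stays cheap enough to run per event.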
Summary
Summary
Know your use cases and then choose your architecture and the relevant
components/products/frameworks
You don’t have to use all the components of the Hadoop Ecosystem to be successful
Big Data is still quite a young field and therefore there are no standard architectures
available which have been used for years
The Lambda and Kappa architectures are best-practice architectures which you have to adapt
to your environment
Big Data Architecture
Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
