Architectures of AI systems
Engineering for Big Data & AI
HCMC, Sep 6th 2019
Herve Roussel, herve@quod.ai
What is Data Engineering?
Is this data engineering?
UploadData.java
upload_data.py
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Data engineering?
Event data → Program → Transformed data
Backend vs Data?
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Event data
Transform
Transformed data
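The shell pipeline above can be read as a tiny transform: lines in, filtered lines out. A minimal Python sketch of the same idea, assuming the log is available as a list of lines (the function name `keep_errors` and the sample log are illustrative, not from the talk):

```python
from typing import Iterable, List


def keep_errors(lines: Iterable[str], needle: str = "ERROR") -> List[str]:
    """Return only the lines that contain `needle`, like `grep "ERROR"`."""
    return [line for line in lines if needle in line]


# Event data (raw log lines) -> transform -> transformed data (errors only).
sample_log = [
    "12:00:01 INFO server started",
    "12:00:05 ERROR cannot reach database",
    "12:00:09 INFO retrying",
]
errors = keep_errors(sample_log)
print(errors)
```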
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT
*
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY
posts.timestamp DESC
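On a small dataset this query can be run directly. A minimal sketch using Python's built-in `sqlite3`, assuming a hypothetical schema (`posts.author_id`, `friends.user_id`, `friends.friend_id`) and a join condition and filter that the slide leaves elided:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        content TEXT, timestamp INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    INSERT INTO posts VALUES (1, 20, 'first', 100), (2, 30, 'second', 200),
                             (3, 40, 'not a friend', 300);
    INSERT INTO friends VALUES (10, 20), (10, 30);
    """
)


def news_feed(viewer_id: int) -> list:
    """Posts written by the viewer's friends, newest first."""
    rows = conn.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (viewer_id,),
    )
    return rows.fetchall()


print(news_feed(10))
```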
Notify? Web, mobile?
Who can see this?
Racist? Vulgar?
Is this a face? Who’s this? Friend? Celebrity?
Courtney likes. Is that good or bad?
Paddy commented. Is that good or bad?
Chris posted. Is that good or bad?
Anybody tagged?
What rank in feed?
Copyright violation?
Is Big Data just for big companies?
300K QPS [R]
6K QPS [W]
As of JULY 8, 2013
1B+ QPM [P]
250M+ QPM [R]
400M LOC [P]
1.8 TB per year [P]
Data Engineering
Event data → Program → Augmented data
Event data
Transform
Augmented data
Big Data Engineering + AI
Pipeline (Transform)
Source (Event data)
Sink (Augmented data)
What is a source?
Synchronous (10-100 ms)
Where is data coming from?
Main data
Event source
Why split?
Asynchronous (3-5 s)
What’s in event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …
  updated_at: …
  author_id: 67890,
  …
}
PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted",
  …
}
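The event is much smaller than the record it describes: it carries a reference and a type, not the content. A minimal sketch of deriving one from the other, assuming `story_id` refers to the post's `id` (the slide uses the same value, 12345, for both) and using a hypothetical helper name:

```python
from typing import Any, Dict


def post_created_event(post: Dict[str, Any]) -> Dict[str, Any]:
    """Build a small event that points at the post instead of copying it."""
    return {"story_id": post["id"], "type": "story_posted"}


post = {"id": 12345, "content": "hello world", "author_id": 67890}
event = post_created_event(post)
print(event)
```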
Job 1
Job 2
Scheduler
What’s batch processing?
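A minimal sketch of the batch idea, assuming events have already accumulated in a list and that each job reads the whole batch. The job names and event types are hypothetical, and in practice a scheduler such as cron would trigger the run on a fixed interval:

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Job = Callable[[List[Event]], object]


def run_batch(jobs: Dict[str, Job], events: List[Event]) -> Dict[str, object]:
    """Run every job once over the whole batch of accumulated events."""
    return {name: job(events) for name, job in jobs.items()}


def count_events(events: List[Event]) -> int:
    return len(events)


def count_by_type(events: List[Event]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for event in events:
        key = str(event["type"])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Here the "scheduler" is a single manual call.
batch = [
    {"type": "story_posted"},
    {"type": "story_liked"},
    {"type": "story_posted"},
]
results = run_batch({"job_1": count_events, "job_2": count_by_type}, batch)
print(results)
```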
Which DB for event source?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
              MySQL   MongoDB   JSON on S3 (or GCS)
30 GB         OK      Good      Very good
10K WPS       OK      Good      Very good
1K RPS        OK      Good      Very good
Range read    OK      Good      Very good
Cost          $$      $$$       $
                  MySQL   MongoDB
30 GB             OK      Good
10K WPS           OK      Good
1K RPS            OK      Good
Sequential read   OK      Good
Cost              $$      $$$
How to store events?
Who wants to become architect?
Job 1
Job 2
Scheduler
What’s the problem with batch?
LATENCY
How to process real-time?
Stream processing
How can 2 processes talk?
QUEUE
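A minimal in-process sketch of the pattern, using two threads and Python's standard `queue.Queue` as a stand-in for a real broker. The event shape and the `STOP` sentinel are assumptions made for the example:

```python
import queue
import threading

events = queue.Queue()
processed = []
STOP = None  # sentinel that tells the consumer to finish


def producer() -> None:
    """Publish three events, then signal that no more will come."""
    for i in range(3):
        events.put({"id": i, "type": "story_posted"})
    events.put(STOP)


def consumer() -> None:
    """Handle events in arrival order until the sentinel shows up."""
    while True:
        event = events.get()
        if event is STOP:
            break
        processed.append(event["id"])


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(processed)
```

The producer never waits for the consumer, which is the decoupling the queue provides.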
Why not use database?
                  Importance   MySQL              Kafka               Redis
10K WPS           1.0          5                  10                  10
1K RPS            1.0          5                  10                  10
Sequential read   1.0          10 (with B-tree)   10                  10 (using Lists)
Order guarantee   0.2          10                 0                   10
Durability        0.1          10                 5 (but perf. hit)   0
Deployability     0.5          10                 5                   7.5
Score                          5.6 / 10           6.6 / 10            7.15 / 10
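The scores in the table can be reproduced as a weighted sum of the ratings. A minimal sketch: the weights and ratings are copied from the table, while the divisor of 5 is an assumption inferred from the fact that it yields exactly the three scores shown:

```python
WEIGHTS = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

RATINGS = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}

DIVISOR = 5  # assumption: inferred from the slide's results, not stated on it


def score(ratings: dict) -> float:
    """Weighted sum of the ratings, scaled by the inferred divisor."""
    total = sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)
    return round(total / DIVISOR, 2)


scores = {name: score(ratings) for name, ratings in RATINGS.items()}
print(scores)
```

Changing a weight (for instance raising Durability) and re-running shows how sensitive the ranking is to the chosen importance values.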
Why not database?
What is a transform?
Transforms
Source
Sink
Functional vs OOP
Librarian.startShift()
Catalog.open() Library.close()
Books.create()
Operations on things
Add more things
find(book)
assign(book)
Things with operations
Add more operations
remove(book)
load_cover(book)
Functional vs OOP
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
Things with operations
Add more operations
alert_subscribers(vid_uploaded)
generate_thumbnails(vid_uploaded)
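In the functional style, adding an operation means adding one more function over the same event, without touching the event itself. A minimal sketch using function names from the slide; the bodies are placeholders and the `video_id` field is hypothetical:

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
Transform = Callable[[Event], str]


def find_similar(vid_uploaded: Event) -> str:
    # Placeholder body: a real transform would query a similarity index.
    return f"find_similar:{vid_uploaded['video_id']}"


def transcribe_captions(vid_uploaded: Event) -> str:
    # Placeholder body: a real transform would call a speech-to-text model.
    return f"transcribe_captions:{vid_uploaded['video_id']}"


TRANSFORMS: List[Transform] = [find_similar, transcribe_captions]


def generate_thumbnails(vid_uploaded: Event) -> str:
    # A new operation: one more function, one more registry entry.
    return f"generate_thumbnails:{vid_uploaded['video_id']}"


TRANSFORMS.append(generate_thumbnails)


def run_all(event: Event) -> List[str]:
    """Apply every registered transform to the same immutable event."""
    return [transform(event) for transform in TRANSFORMS]


outputs = run_all({"type": "vid_uploaded", "video_id": "v1"})
print(outputs)
```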
What’s supporting data?
Transform
Supporting data
event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}
Friends or city DB
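A minimal sketch of a transform that augments the event with supporting data, assuming a tiny in-memory city table with approximate coordinates. Both the table and the plain Euclidean distance on degrees are simplifications for illustration; a real pipeline would use a geo index:

```python
import math
from typing import Any, Dict, Sequence

# Hypothetical supporting data: city name -> (latitude, longitude), approximate.
CITY_DB = {
    "Ho Chi Minh City": (10.82, 106.63),
    "Hanoi": (21.03, 105.85),
    "Da Nang": (16.05, 108.21),
}


def nearest_city(coordinates: Sequence[float]) -> str:
    """Return the city whose stored coordinates are closest to the input."""
    lat, lon = coordinates
    return min(
        CITY_DB,
        key=lambda name: math.hypot(lat - CITY_DB[name][0], lon - CITY_DB[name][1]),
    )


def augment(event: Dict[str, Any]) -> Dict[str, Any]:
    """Copy the event and add a `city` field looked up from the supporting data."""
    return {**event, "city": nearest_city(event["coordinates"])}


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = augment(event)
print(augmented["city"])
```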
Who uses ext. supporting data?
API vs Pipeline: availability?
API: requests in thread
Pipeline: long running
API vs Pipeline: performance?
100 ms
⇓
10 ms
0.1 s * 300,000 / 60 / 60 ≈ 8.3 h
⇓
0.01 s * 300,000 / 60 = 50 min
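The arithmetic can be checked directly. A minimal sketch, assuming the 300,000 figure is a count of calls made one after another: at 100 ms each they take about 8.3 hours in total, and at 10 ms each about 50 minutes.

```python
def total_hours(latency_ms: float, calls: int) -> float:
    """Total wall-clock time, in hours, for `calls` sequential calls."""
    return latency_ms / 1000 * calls / 60 / 60


slow = total_hours(100, 300_000)
fast = total_hours(10, 300_000)
print(f"{slow:.2f} h vs {fast * 60:.0f} min")
```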
Where is the data coming from?
Is this a face? Who’s this? Friend? Celebrity?
Data pipelines & AI
Transform
AI model
How can 2 processes talk?
Transform
AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read
Data scientist
Sales
What are the read use cases?
Aggregation: "Give me summary report of last month’s activity"
Full text search: "Give me posts that contain the words Donald Trump, Trump or President"
Bulk data, filtered: "Give me all posts by female, age 18-35"
ACID
Denormalization: good or bad?
What is BCNF?
What are distributed data systems?
Why re-run the pipeline?
Transform
AI model
Transform v2
Idempotency & backfill
f(f(x)) = f(x)
POST "/BankAccount/AddFunds"
{ value: 1000, token: TX123 }
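One common way to make such a request idempotent is to remember which tokens have already been applied, so that replaying the same request leaves the state unchanged. A minimal in-memory sketch; the class and method names are hypothetical, and a real service would persist the seen tokens:

```python
class BankAccount:
    def __init__(self) -> None:
        self.balance = 0
        self._seen_tokens = set()

    def add_funds(self, value: int, token: str) -> int:
        """Apply the deposit once per token; replays do not change the balance."""
        if token not in self._seen_tokens:
            self._seen_tokens.add(token)
            self.balance += value
        return self.balance


account = BankAccount()
account.add_funds(1000, "TX123")
account.add_funds(1000, "TX123")  # same request replayed, e.g. during a backfill
print(account.balance)
```

Applying the request twice gives the same state as applying it once, which is the f(f(x)) = f(x) property above.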
Another reason for backfill?
What if the AI model improves?
Transform
AI model v2
AI systems ≠ traditional systems?
93.2%
Probabilistic
Deterministic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
What have we learned?
Source: Uber Engineering
[DE] Collect data
[DE] Process data
[DS] Build DL model
[BE/FE] Use DL model in app
[DA] Validate DL model
Which NFR for Big Data?
• Scalability
• Availability
• Interoperability
• Portability
• Modifiability
• Maintainability
• Testability
• Usability
• Buildability
• Deployability
• Ease of Development
• Performance
• Security
• Localization
• Legal
• Reusability
• Supportability
• Monitorability
Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
What have we learned?
Want to learn more about
AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join
Herve Roussel, herve@quod.ai

Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI