A Tale of Lambdas, Kappas & Pancakes
@osamakhn
Who am I?
Osama Khan
Big Data Engineer @ACLServices
Grad Student @GTComputing
AWS Big Data Specialist+
Vancouver, BC
Languages: Java; C# (met thru J#); Python; Golang, NodeJS; Scala
Previously: Robot Soccer, BI, Credit Rating, AML,
O&G Portfolio, NLP/Governance, Doctor Triage,
Energy Monitoring, Consulting, Private Equity
Recently: Data/ML Pipeline, Tools & Platforms
What am I going to talk
about today?
The goal of this talk is to provide a high level overview
of the big data landscape to help software engineers
distinguish signal from noise
1) How big is BIG: Let's get our scales recalibrated to
understand what we mean by BIG Data
2) Lineage: Evolution of the Big Data ecosystem; from
EDW to Data Lakes
3) Lambda & Kappa Architectures: The foundation
of data pipelines and machine learning systems
4) Technology Choices: SMACK that PANCAKE
BUT I ❤ Serverless
5) Demos: Athena, EMR, Redshift, Quicksight,
Sagemaker, ModelDB
How big is BIG?
Big is BIG when Bieber breaks the Google Cloud (wat!?!)
Let's get our scales recalibrated to understand what we
mean by BIG Data
This is BIG …
§ 390 Hyperscale Datacenters ( < 300, 2016)
§ Hyperscale == (5k servers, 10k sq.ft space)
§ > 400, 100M+ total servers
§ 56% web content in English
§ 8,000 languages spoken globally
§ Hello, friend.
§ 100M+ active users, 40M+ subscribers
§ 30M+ songs, 20K new per day, 2B+ playlists, 1B+ plays per day
§ 2,500 node Hadoop cluster, 100 PB+ Disk, 100TB+ RAM
§ 60TB+ per day log ingestion, 20k+ jobs per day
§ Listening History Query
§ user x track x [day/week/month/all time]
§ 300B elements
§ 800 workers, 32 core, 208 GB ram
§ 240TB in, 90TB out
§ Top Tracks in Vancouver (June 2017)
§ 30 date partitioned tables, 60TB data
§ 1 metadata table, 418GB
§ 94.2s, 4.82TB processed
§ Despacito – Remix (Luis Fonsi)
§ (2017)
§ 2.8 B+ US Tweets
§ Donald Trump (901.8 M)
§ Hillary Clinton (123.2 M)
§ Mike Pence (31.4 M)
§ 30x more than VP, 7x more than opponent
(2013)
§ 170M individual metrics (timeseries) per minute
§ 200M queries served/day, 47 charts/user
Lineage
Evolution of the Big Data ecosystem; from EDW to
Data Lakes
A journey from ETL to Distributed Transactions via the ELT alley…
Timeline axis: 2000, 2003, 2004, 2007, 2009, 2010, 2011, 2014, 2017
Milestones labeled on the timeline:
§ FaunaDB, Aurora
§ Lambda, Kappa Architectures
§ UC Berkeley Spark
§ Google File System
§ LinkedIn Kafka
§ FB Cassandra
§ AWS DynamoDB
§ Google Dremel
§ 2012: Google Spanner
§ IBM, Oracle, MSFT, Teradata, SAP
§ CAP Theorem
§ Google MapReduce
§ AWS, GCP, Azure, Hana
§ 2006: Yahoo! Hadoop
Occupy the Cloud:
Distributed
Computing for the
99%
Big Data Landscape
Distributed systems rule the !
Yet Another Big Data Framework (YABDF)
Doesn’t fit on a slide or two … and you
thought you had library fatigue in the JS
world !
http://mattturck.com/wp-content/uploads/2017/05/Matt-Turck-FirstMark-2017-Big-Data-Landscape.png
Lambda & Kappa Architectures
The foundation of data pipelines for enterprise insights
Lambda Architecture: First Principles & Desired Properties
Data: special information from which everything else is derived
Information: processed data
Data System: query = function(ALL_DATA)
1. Robustness & Fault Tolerance
2. Low Latency Read & Update
3. Scalability
4. Generalization
5. Ad-hoc Queries
6. Minimal Maintenance
7. Debuggability
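The first principle, query = function(ALL_DATA), can be read literally: a query is a pure function evaluated over the entire master dataset. The sketch below is only an illustration of that idea. The event fields (`user`, `track`, `city`) and the function name `top_tracks` are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical master dataset: an immutable, append-only list of play events.
ALL_DATA = [
    {"user": "u1", "track": "despacito", "city": "Vancouver"},
    {"user": "u2", "track": "despacito", "city": "Vancouver"},
    {"user": "u3", "track": "shape of you", "city": "Vancouver"},
    {"user": "u4", "track": "despacito", "city": "Toronto"},
]


def top_tracks(all_data, city, n=1):
    """query = function(ALL_DATA): scan every event on every call."""
    counts = Counter(e["track"] for e in all_data if e["city"] == city)
    return counts.most_common(n)


print(top_tracks(ALL_DATA, "Vancouver"))
```

Scanning everything on each query is correct but slow at scale, which is the motivation for precomputing views in the layers that follow.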
The Lambda Architecture
[Diagram: new data feeds the batch layer (master dataset) and the speed layer; batch views and real-time views are exposed through the serving layer; queries read from both kinds of view]
∑ ALL_DATA
Δ NEW_DATA
The Lambda Architecture (in the enterprise)
[Diagram: the Lambda Architecture (new data, master dataset, batch layer, batch views, speed layer, real-time views, serving layer, queries) with a governance overlay]
Governance overlay: Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR
The Lambda Architecture (deep dive)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
> home of ‘master data’
> precomputed_batch_view = fn(ALL_DATA)
> user_query = fn(precomputed_batch_view)
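A minimal sketch of the two formulas on this slide, assuming a play-count view keyed by (city, track). The event shape and the names `build_batch_view` and `plays` are hypothetical. In a real system the batch function would run as a distributed job rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical master dataset of immutable events.
ALL_DATA = [
    {"track": "despacito", "city": "Vancouver"},
    {"track": "despacito", "city": "Vancouver"},
    {"track": "shape of you", "city": "Vancouver"},
]


def build_batch_view(all_data):
    """precomputed_batch_view = fn(ALL_DATA): recomputed from scratch each run."""
    view = defaultdict(int)
    for event in all_data:
        view[(event["city"], event["track"])] += 1
    return dict(view)


def plays(precomputed_batch_view, city, track):
    """user_query = fn(precomputed_batch_view): a cheap lookup, no scan."""
    return precomputed_batch_view.get((city, track), 0)


batch_view = build_batch_view(ALL_DATA)
print(plays(batch_view, "Vancouver", "despacito"))
```

Recomputing the whole view on each run keeps the batch function simple and makes a buggy view repairable by rerunning it over the master dataset.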
The Lambda Architecture (deep dive)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
> (distributed) db to store batch view data
> produce fast results for known queries
> allow (random) reads by users/systems
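The three bullets above can be sketched with an in-memory stand-in for the (distributed) database. This is an assumption-laden toy: the class name `ServingLayer` and its `load` and `get` methods are hypothetical, and a plain dict replaces whatever store a real deployment would use.

```python
class ServingLayer:
    """Toy stand-in for a (distributed) db that serves batch views."""

    def __init__(self):
        self._view = {}

    def load(self, batch_view):
        # Replace the whole view in one assignment, so readers see either
        # the previous complete view or the new complete view.
        self._view = dict(batch_view)

    def get(self, key, default=0):
        # Random read by key: fast because the answer was precomputed.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({("Vancouver", "despacito"): 2})
print(serving.get(("Vancouver", "despacito")))
```

The store only needs bulk loads and random reads here; it does not need random writes, because the batch layer produces each view wholesale.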
The Lambda Architecture (deep dive)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
> compensate for high latency of batch updates
> run fast, incremental algorithms (probabilistic data structures, for the win)
> realtime_view = fn(realtime_view, new_data)
> user_query = fn(realtime_view)
> user_query = fn(precomputed_batch_view)
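One way to picture realtime_view = fn(realtime_view, new_data) is an incremental update over a probabilistic data structure. The sketch below uses a Count-Min Sketch as an example of such a structure; the talk does not say which structures were meant, so this choice, the class, and the function name `update_realtime_view` are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash per row, derived by prefixing the row number to the key.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # The smallest cell across rows is the least polluted by collisions.
        return min(self.rows[row][self._index(row, key)]
                   for row in range(self.depth))


def update_realtime_view(realtime_view, new_data):
    """realtime_view = fn(realtime_view, new_data): fold in only the delta."""
    for event in new_data:
        realtime_view.add(event["track"])
    return realtime_view


view = CountMinSketch()
view = update_realtime_view(view, [{"track": "despacito"}, {"track": "despacito"}])
view = update_realtime_view(view, [{"track": "shape of you"}])
print(view.estimate("despacito"))
```

Memory use is fixed by `width` and `depth` regardless of how many distinct keys arrive, at the cost of possible overcounting when keys collide.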
The Lambda Architecture (deep dive)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
> user_query = fn(precomputed_batch_view, realtime_view)
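A minimal sketch of the merged query, assuming both views are additive counts keyed the same way and that the real-time view holds only events not yet absorbed by the latest batch run. Both assumptions, and the sample numbers, are illustrative and not taken from the talk.

```python
def user_query(precomputed_batch_view, realtime_view, key):
    """user_query = fn(precomputed_batch_view, realtime_view)."""
    # Batch count (complete but stale) plus the recent delta (fresh but partial).
    return precomputed_batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical views keyed by track name.
precomputed_batch_view = {"despacito": 1000, "shape of you": 400}
realtime_view = {"despacito": 12, "new single": 3}

print(user_query(precomputed_batch_view, realtime_view, "despacito"))
```

Addition works here because counts merge cleanly; other aggregates (distinct users, medians) need a merge function suited to the aggregate.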
The Lambda Architecture (ingest? speedlayer?)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
The Kappa Architecture
[Diagram: architecture diagram with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
The Lambda Architecture (SMACK)
[Diagram: the Lambda Architecture with the governance overlay (Data Catalog, Impact Analysis, Data Lineage, Access Control, Secure Data Rest, GDPR); components numbered 1 to 5]
Big Picture of Metadata Management for Data Governance
[Diagram: the Lambda Architecture (new data, master dataset, batch layer, batch views, speed layer, real-time views, serving layer, queries) annotated with metadata flows]
Governance: Access Control, Secure Data Rest, GDPR
Metadata: lineage, impact analysis, semantic lineage, Enterprise Vocabulary, Semantic Mapping
(metadata harvesting)
(metadata stitching)
Machine Learning Pipelines
BAML PIT == $$100MM
BAML PIT: Blockchain based Adversarial Machine Learning Platform for IoT Testing
classic model
Pancake Stack
[Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch, Spark, TensorFlow, Algebird, CoreNLP, Kibana]
data science silo
Stages: Data Source, Data & Feature Engineering, Model Building, Deploy, Monitor
(Adaptation of slide by Ben Lorica)
maturity spectrum
what’s changing(-ed)?
1. Cloud (faas, serverless data pipelines, ml-as-a-service)
2. Consumer demand for ML features/products/applications
3. Targeted Models (we need to manage 20MM models for 10MM users maybe)
4. Localization (ASEAN facial recognition)
5. Security (Adv. ML, Side-channel attacks)
6. Transparency (Bias is a BUG)
7. Many sophisticated solutions remain toys, while conventional, simpler techniques (regression)
still deliver more business value!
8. Monitoring to ensure deployed models are making high quality predictions
9. Need practices to maintain (update or rebuild) models over time
10. and ….
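Item 8 in the list above (monitoring deployed models) can be sketched as a rolling accuracy check. This is one possible reading only: the class name `PredictionMonitor`, the window size, and the accuracy threshold are hypothetical, and real systems often track other signals when ground-truth labels arrive late.

```python
from collections import deque


class PredictionMonitor:
    """Tracks accuracy over the most recent labeled predictions."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off the left
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_action(self):
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = PredictionMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_action())
```

A `needs_action()` result of True would be the trigger for item 9, updating or rebuilding the model.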
feature engineering, wat?
By @MLpuppy
rise of machine learning engineers
Online Machine Learning Pipeline
Model Inventory
Model Output Monitoring
Take Action
ML Serving Layer
Hyper-parameter Tuning
www.productionml.org

Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and Pancakes
