SPARK MBUTO
Design & Engineering Machine Learning Pipelines
Gianvito Siciliano
Use Case: Image Classification and Retrieval
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
OUTLINE
1. Spark ‘Mbuto intro
• Abstractions
• Basic Examples
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
SPARK MBUTO
• A Spark proof of concept (PoC) to easily create, run and test pipelines and workflows
• A pipeline is a sequence of steps in a SparkJobApp
• Each step is a SparkJob
• All jobs share the same Spark/SQL context
• The JobRunner runs the jobs one after another
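The abstractions above can be sketched in a few lines. This is only an illustrative Python sketch, not the project's actual code: the method names `execute`, `run` and `main` come from the slides, while the plain dict standing in for the shared Spark/SQL context, the two demo jobs, and the idea that each job receives the previous job's output are assumptions.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step."""

    @abstractmethod
    def execute(self, context, data):
        """Run the step; `context` stands in for the shared Spark/SQL context."""


class JobRunner:
    """Runs jobs one after another against the same context."""

    def __init__(self, context):
        self.context = context

    def run(self, jobs, data=None):
        for job in jobs:
            # Assumption: the output of one job is the input of the next.
            data = job.execute(self.context, data)
        return data


class SparkJobApp:
    """An application is just an ordered list of jobs."""

    def __init__(self, jobs):
        self.jobs = jobs

    def main(self, context):
        return JobRunner(context).run(self.jobs)


# Two hypothetical jobs, for illustration only.
class LoadNumbers(SparkJob):
    def execute(self, context, data):
        return list(range(context["n"]))


class SquareNumbers(SparkJob):
    def execute(self, context, data):
        return [x * x for x in data]


result = SparkJobApp([LoadNumbers(), SquareNumbers()]).main({"n": 4})
```

Keeping every step behind the same `execute` signature is what lets the runner treat a pipeline as a plain list.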
SPARKJOB
JOBRUNNER
SPARKJOBAPP
PIPELINE
[Diagram: App .main → JobRunner .run → Job .execute → next job → Job .execute]
JOB READY TO USE
READABLE APP
PERFORMANCE LOOKUP
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
• Classification
• Retrieval
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
IMAGE CLASSIFICATION
• Multiclass image classification:
1. Choose a model (NN, SVM, tree…)
2. Train/test the model (with labeled images)
3. Predict the label of new images
4. Tune the model
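The train/test/predict steps can be illustrated on toy data. In this hedged sketch a nearest-centroid classifier stands in for the NN/SVM/tree models named above, and the 2-D points stand in for image feature vectors; the labels, the synthetic data and the 60/30 split are all assumptions made for the example.

```python
import random
from collections import defaultdict


def train(samples):
    """Nearest-centroid 'model': the mean feature vector of each label."""
    sums, counts = {}, defaultdict(int)
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, samples):
    return sum(predict(model, f) == label for f, label in samples) / len(samples)


# Synthetic labeled "images": three well-separated classes in 2-D.
random.seed(1)
centers = {"cat": (0.0, 0.0), "dog": (5.0, 5.0), "bird": (0.0, 5.0)}
data = [((cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)), label)
        for label, (cx, cy) in centers.items() for _ in range(30)]
random.shuffle(data)

train_set, test_set = data[:60], data[60:]   # step 2: train/test split
model = train(train_set)
test_accuracy = accuracy(model, test_set)
new_label = predict(model, (4.8, 5.2))       # step 3: label a new sample
```

Tuning (step 4) would repeat the train/evaluate loop while varying the model's settings and keeping the configuration with the best held-out accuracy.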
IMAGE RETRIEVAL
• Image retrieval:
1. Choose metric (Euclidean, cosine…)
2. Build dictionary
3. Train/test the model
4. Query and search
5. Tune the model
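The choice of metric in step 1 changes what "most similar" means. The sketch below, with made-up 2-D vectors and hypothetical image names, ranks the same tiny index under Euclidean distance and under cosine distance; it is a linear scan for illustration, not an efficient search.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # Assumes neither vector is all zeros (the norm would be 0).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(index, query, metric, k=2):
    """Return the names of the k entries closest to `query` under `metric`."""
    ranked = sorted(index, key=lambda name: metric(index[name], query))
    return ranked[:k]


# Hypothetical index: image name -> feature vector.
index = {
    "img_a": (1.0, 0.0),
    "img_b": (10.0, 0.5),
    "img_c": (0.0, 1.0),
}
query = (2.0, 0.1)

by_euclidean = search(index, query, euclidean)
by_cosine = search(index, query, cosine_distance)
```

Here `img_b` points in the same direction as the query but is much longer, so cosine distance ranks it first while Euclidean distance ranks it last.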
WHAT CHANGES?
• Pipeline architecture
• Classification logic
• How to update the model?
CLASSIFICATION PIPELINE
[Diagram: DATA → TRAIN CLASSIFIER → MODEL; NEW DATA → MODEL → PREDICTION]
RETRIEVAL PIPELINE
[Diagram: DATA → TRAIN CLASSIFIER → MODEL; QUERY → MODEL → PREDICTION]
OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline
CLASSIFICATION & RETRIEVAL
• Extract keypoints from each image
• Cluster the whole keypoint universe
• Represent each image as a weighted cluster vector
• Train & test the model
• Query the model (find the most similar images)
Feature engineering → Build the dictionary → Build the classifier → Query the model
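Representing an image as a cluster vector can be sketched as follows. The sketch assumes the cluster centroids have already been learned (for example by KMeans) and uses tiny 2-D descriptors in place of real keypoint descriptors; the function names are illustrative, not taken from the project.

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid closest to one keypoint descriptor."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
    return min(range(len(centroids)), key=lambda i: squared_distance(centroids[i]))


def cluster_vector(descriptors, centroids):
    """Normalised histogram: share of the image's keypoints in each cluster."""
    counts = [0] * len(centroids)
    for descriptor in descriptors:
        counts[nearest_cluster(descriptor, centroids)] += 1
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(centroids)
    return [count / total for count in counts]


# Assumed, pre-computed centroids and the keypoints of one image.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
image_keypoints = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (1.0, 9.0)]

vector = cluster_vector(image_keypoints, centroids)
```

Every image ends up with a vector of the same length (one entry per cluster), whatever its number of keypoints, which is what makes images comparable.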
C. & R. JOBS
• Load the whole dataset
• Extract keypoints
• Reduce the keypoint universe
• Transform the feature space
• Create the dictionary (aka Codebook)
• Train, test & evaluate the classifier
• Query and get prediction
[Diagram: DATA → TRAIN CLASSIFIER → MODEL → PREDICTION]
DICTIONARY
[Diagram, legend TRANSFORMER / ESTIMATOR. Feature engineering: Image LOADER, Sift EXTRACTOR (.transform). Build the dictionary: KMeans QUANTISER (.fit), CLUSTERS, CfIif TRANSFORMER, ClusterVector PIVOTER, CODEBOOK. Also labelled: KMeans CLASSIFIER.]
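The slides do not define the CfIif transformer. Its name suggests a TF-IDF-style weighting in which clusters play the role of terms and images the role of documents (cluster frequency times inverse image frequency). Under that assumption, and with hypothetical image ids and counts, the weighting could look like this:

```python
import math


def cf_iif(image_counts):
    """Weight per-image cluster counts, TF-IDF style.

    image_counts maps an image id to its list of keypoint counts per cluster.
    cf  = share of the image's keypoints that fall in the cluster
    iif = log(number of images / number of images containing the cluster)
    """
    n_images = len(image_counts)
    n_clusters = len(next(iter(image_counts.values())))
    image_freq = [sum(1 for counts in image_counts.values() if counts[j] > 0)
                  for j in range(n_clusters)]
    weights = {}
    for image_id, counts in image_counts.items():
        total = sum(counts)
        row = []
        for j in range(n_clusters):
            if total == 0 or image_freq[j] == 0:
                row.append(0.0)
            else:
                row.append((counts[j] / total) * math.log(n_images / image_freq[j]))
        weights[image_id] = row
    return weights


# Hypothetical counts: two images, three clusters.
weights = cf_iif({"a": [2, 2, 0], "b": [1, 0, 3]})
```

A cluster that appears in every image (the first one here) gets weight zero, so only the clusters that discriminate between images contribute to the codebook vectors.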
CLASSIFIER
[Diagram, legend TRANSFORMER / ESTIMATOR. Train classifier: Vector ASSEMBLER, Label INDEXER, KNN CLASSIFIER and KMeans CLASSIFIER, connected by .transform and .fit steps, with a TRAIN/TEST .split. Evaluate classifier: EVALUATOR, INSAMPLE PREDICTION, OUTSAMPLE PREDICTION.]
KNN IMPLEMENTATION
• KNN is a comparison-based model: the similarity metric is crucial!
• Nearest-neighbour search (in the codebook) is the critical point:
• KDTree: not parallel (although…)
• LSH: hyperparameters are difficult to tune
• MetricTree: splits the feature points into disjoint areas
• Spill tree: too many shared points
=> HybridTree
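As a baseline for the tree structures listed above, KNN classification by brute force is only a few lines. This sketch scans the whole codebook for every query, which is exactly the cost the trees try to avoid; the codebook entries and labels are invented for the example, and the metric is a parameter because the result depends on it.

```python
import math
from collections import Counter


def knn_predict(codebook, query, k=3, metric=math.dist):
    """Majority label among the k codebook vectors closest to `query`.

    codebook is a list of (vector, label) pairs; the scan is linear in its size.
    """
    neighbours = sorted(codebook, key=lambda item: metric(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical codebook of labeled image vectors.
codebook = [
    ((0.1, 0.9), "cat"),
    ((0.2, 0.8), "cat"),
    ((0.9, 0.1), "dog"),
    ((0.8, 0.2), "dog"),
    ((0.7, 0.3), "dog"),
]

label_1 = knn_predict(codebook, (0.15, 0.85))
label_2 = knn_predict(codebook, (0.75, 0.25))
```

`math.dist` is the Euclidean distance from the Python standard library (3.8+); any function of two vectors can be passed instead.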
HYBRIDTREE
• The TopTree is a metric tree
• The SubLeaf trees are spill trees, trained in parallel
• Nodes can be:
• OVERLAP => defeatist search
• NON OVERLAP => backtracking
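The overlap / non-overlap distinction can be sketched on a single machine. This is a simplified illustration of the hybrid idea, not the deck's distributed implementation: each node splits its points along the line between two far-apart pivots; if a split with an overlap buffer `tau` keeps both children below a fraction `rho` of the node, the node is marked overlap and searched without backtracking, otherwise it falls back to a disjoint split and is searched with backtracking. The names `tau`, `rho` and `leaf_size` and their default values are assumptions.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build(points, tau=0.1, rho=0.7, leaf_size=4):
    """Build a hybrid tree as nested dicts."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    lv = max(points, key=lambda p: dist(p, points[0]))  # first pivot
    rv = max(points, key=lambda p: dist(p, lv))         # second pivot, far from lv
    width = dist(lv, rv)
    if width == 0:                                      # all points identical
        return {"leaf": points}
    mid = [(a + b) / 2 for a, b in zip(lv, rv)]
    unit = [(b - a) / width for a, b in zip(lv, rv)]

    def side(p):
        """Signed distance of p from the splitting hyperplane."""
        return sum((x - m) * u for x, m, u in zip(p, mid, unit))

    proj = [(side(p), p) for p in points]
    # Spill-tree split: points within tau of the boundary go to both children.
    left = [p for t, p in proj if t < tau]
    right = [p for t, p in proj if t >= -tau]
    overlap = max(len(left), len(right)) <= rho * len(points)
    if not overlap:
        # Too many shared points: fall back to a disjoint (metric-tree style) split.
        left = [p for t, p in proj if t < 0]
        right = [p for t, p in proj if t >= 0]
    return {"side": side, "overlap": overlap,
            "left": build(left, tau, rho, leaf_size),
            "right": build(right, tau, rho, leaf_size)}


def nearest(node, query, best=None):
    """Return (distance, point) of the nearest point found."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = dist(p, query)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    t = node["side"](query)
    near, far = ("left", "right") if t < 0 else ("right", "left")
    best = nearest(node[near], query, best)
    # Overlap node: defeatist search, never visit the other child.
    # Non-overlap node: backtrack only if the other side could hold a closer point.
    if not node["overlap"] and abs(t) < best[0]:
        best = nearest(node[far], query, best)
    return best


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
tree = build(points)
found_distance, found_point = nearest(tree, (0.5, 0.5))
```

With `rho=0.0` no overlap split is ever accepted, every node backtracks, and the search is exact; larger `tau` and `rho` trade accuracy for fewer visited nodes.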
NEURAL NETWORK
• Convolutional networks work well with images
• Hyperparameter tuning is the critical point, but it can be automated (see the new algorithm)
• Training is not trivial, and updating the model is a frequent source of complaints
WHAT MORE?
• Feature engineering
• Hyperparameter tuning
• Parallel optimizations
• Persist/update steps
• Ensemble models
[Diagram: DATA, Normalizer, Combiner, pipelineModel, Cross Validator, PREDICTION]
https://github.com/gianvi
Thanks!

Image Classification and Retrieval on Spark