Image Classification and Retrieval on Spark

SPARK MBUTO
Design & Engineering Machine Learning Pipelines
Gianvito Siciliano
Use Case: Image Classification and Retrieval

OUTLINE
1. Spark ‘Mbuto intro
2. ML problems overview
3. Classification & retrieval logic
4. Classification Models
5. Image Pipeline

OUTLINE
• Abstractions
• Basic Examples
5. Image Pipeline

SPARK MBUTO
• A Spark proof of concept for easily creating, running and testing pipelines and workflows
• A pipeline is a sequence of steps inside a SparkJobApp
• Each step is a SparkJob
• All jobs share the same Spark/SQL context
• The JobRunner runs the jobs one after another
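
The structure described above can be sketched in plain Python. This is only an illustration of the pattern, not the Spark Mbuto code: the names SparkJob, JobRunner, .execute and .run come from the slides, while LoadJob, SquareJob and the dict used as a stand-in for the shared Spark/SQL context are hypothetical.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step; receives the shared context and returns it."""

    @abstractmethod
    def execute(self, context: dict) -> dict:
        ...


class LoadJob(SparkJob):
    def execute(self, context):
        # Hypothetical stand-in for loading a dataset.
        context["data"] = [1, 2, 3]
        return context


class SquareJob(SparkJob):
    def execute(self, context):
        context["data"] = [x * x for x in context["data"]]
        return context


class JobRunner:
    """Runs the jobs consecutively, passing the same context to each."""

    def __init__(self, jobs):
        self.jobs = list(jobs)

    def run(self, context=None):
        context = {} if context is None else context
        for job in self.jobs:
            context = job.execute(context)
        return context


def main():
    # Plays the role of the SparkJobApp entry point (App .main).
    return JobRunner([LoadJob(), SquareJob()]).run()
```

Keeping every step behind the same execute interface is what would let a runner chain arbitrary jobs without knowing what each one does.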

PIPELINE
App .main -> JobRunner .run -> Job .execute -> (next job) -> Job .execute

READABLE APP
App .main -> JobRunner .run -> Job .execute -> (next job) -> Job .execute

OUTLINE
• Classification
• Retrieval
5. Image Pipeline

IMAGE CLASSIFICATION
• Multiclass image classification:
1. Choose a model (NN, SVM, tree…)
2. Train/test model (with labeled images)
3. Predict the label of new images
4. Tune the model

IMAGE RETRIEVAL
• Similarity-based image retrieval:
1. Choose metric (Euclidean, cosine…)
2. Build dictionary
3. Train/test the model
4. Query and search
5. Tune the model
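
Steps 1 and 4 above (choose a metric, then query and search) can be illustrated with a brute-force search in plain Python. This is a minimal sketch: the index contents and the function names are hypothetical, and a real dictionary would hold the image vectors built by the pipeline.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat it as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def search(index, query, metric, k=2):
    """Return the ids of the k indexed vectors closest to the query."""
    ranked = sorted(index, key=lambda image_id: metric(index[image_id], query))
    return ranked[:k]


# Hypothetical toy index: image id -> feature vector.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}
```

Passing the metric as a parameter makes it easy to compare how Euclidean and cosine distance rank the same query.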

WHAT CHANGES?
• Pipelines architecture
• Classification logic
• How to update the model?

CLASSIFICATION PIPELINE
DATA -> TRAIN CLASSIFIER -> MODEL
NEW DATA -> MODEL -> PREDICTION

RETRIEVAL PIPELINE
DATA -> TRAIN CLASSIFIER -> MODEL
QUERY -> MODEL -> PREDICTION

CLASSIFICATION & RETRIEVAL
• Extract keypoints from each image
• Cluster the keypoint universe
• Represent each image as a weighted cluster vector
• Train & test the model
• Query the model (finding the most similar images)
Stages: Feature engineering -> Build the dictionary -> Build the classifier -> Query the model
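
The step that represents an image as a cluster vector can be sketched as follows. It assumes the cluster centroids are already available (for example from k-means) and uses a plain normalised histogram as the weighting; the function names and the toy 2-D descriptors are hypothetical.

```python
def nearest_cluster(descriptor, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(c):
        return sum((d - x) ** 2 for d, x in zip(descriptor, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def cluster_vector(descriptors, centroids):
    """Normalised histogram of cluster assignments for one image."""
    counts = [0] * len(centroids)
    for d in descriptors:
        counts[nearest_cluster(d, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centroids)


# Hypothetical toy data: two centroids and four keypoint descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
image_descriptors = [[0.5, 0.2], [9.0, 11.0], [0.1, 0.1], [1.0, 0.0]]
vector = cluster_vector(image_descriptors, centroids)
```

Every image ends up with a vector of the same length (the number of clusters), whatever its number of keypoints, which is what makes the images comparable.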

C. & R. JOBS
• Load whole dataset
• Extract keypoints
• Reduce the keypoints universe
• Transform the features space
• Create the dictionary (aka Codebook)
• Train, test & evaluate the classifier
• Query and get prediction
DATA -> TRAIN CLASSIFIER -> MODEL -> PREDICTION

[Diagram: feature engineering and dictionary building]
Features Engineering: Image LOADER (.transform) -> Sift EXTRACTOR -> KMeans QUANTISER (.fit) -> CLUSTERS
Build the Dictionary: CfIif TRANSFORMER -> ClusterVector PIVOTER -> CODEBOOK -> DICTIONARY
Also shown: KMeans CLASSIFIER
Legend: TRANSFORMER, ESTIMATOR
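
The slides do not define the CfIif weighting. A plausible reading, and it is only an assumption, is that it is the TF-IDF analogue for visual words: cluster frequency within an image multiplied by the log inverse of the number of images containing that cluster. A minimal sketch under that assumption, with a hypothetical function name and toy counts:

```python
import math


def cf_iif(cluster_counts):
    """cluster_counts: one list of per-cluster keypoint counts per image."""
    n_images = len(cluster_counts)
    n_clusters = len(cluster_counts[0])
    # Number of images in which each cluster occurs at least once.
    image_freq = [sum(1 for counts in cluster_counts if counts[j] > 0)
                  for j in range(n_clusters)]
    weighted = []
    for counts in cluster_counts:
        total = sum(counts)
        row = []
        for j, c in enumerate(counts):
            cf = c / total if total else 0.0
            # Assumed formula: log(N / image frequency), 0 for unused clusters.
            iif = math.log(n_images / image_freq[j]) if image_freq[j] else 0.0
            row.append(cf * iif)
        weighted.append(row)
    return weighted


# Hypothetical toy counts: 2 images, 3 clusters.
counts = [[2, 2, 0], [4, 0, 0]]
weights = cf_iif(counts)
```

Under this weighting a cluster that appears in every image gets weight 0, because it does not help to tell images apart.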

[Diagram: training and evaluating the classifier]
Train classifier: Vector ASSEMBLER (.transform) -> Label INDEXER (.fit, .transform) -> TRAIN/TEST .split -> KNN CLASSIFIER or KMeans CLASSIFIER (.fit) -> CLASSIFIER
Evaluate classifier: INSAMPLE PREDICTION, OUTSAMPLE PREDICTION -> EVALUATOR
Legend: TRANSFORMER, ESTIMATOR
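
The train/test split and the in-sample versus out-of-sample evaluation named in the diagram can be illustrated with a brute-force k-nearest-neighbour classifier in plain Python. This is a sketch on hypothetical toy data, not the Spark implementation; all function names are illustrative.

```python
import random
from collections import Counter


def knn_predict(train, point, k=3):
    """train: list of (vector, label). Majority vote over the k nearest."""
    def sq_dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, point))
    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, samples, k=3):
    hits = sum(knn_predict(train, v, k) == label for v, label in samples)
    return hits / len(samples)


def split(data, test_fraction=0.25, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: two well-separated classes of 2-D vectors.
data = [([i * 0.1, 0.0], "cat") for i in range(8)] + \
       [([5.0 + i * 0.1, 5.0], "dog") for i in range(8)]
train, test = split(data)
in_sample = accuracy(train, train)    # evaluated on the training set itself
out_sample = accuracy(train, test)    # evaluated on held-out images
```

The gap between the two accuracies is the usual signal of overfitting; on this toy data the classes are so far apart that both are perfect.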

KNN IMPLEMENTATION
• KNN is a comparison model: the similarity metric is crucial!
• Nearest-neighbour search (in the codebook) is the pain point:
• KDTree: not parallel (although…)
• LSH: hyperparameters are difficult to tune
• MetricTree: splits the feature points into disjoint regions
• Spill tree: too many shared points
=> HybridTree

HYBRIDTREE
• The TopTree is a metric tree
• The sub-leaf trees are spill trees, trained in parallel
• Nodes can be:
• OVERLAP => defeatist search
• NON OVERLAP => backtracking
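
The two node types can be illustrated with a small single-machine tree. This is a simplified sketch and not the HybridTree implementation: it splits on coordinate axes instead of metric-tree pivots, it is not parallel, and LEAF_SIZE, TAU and RHO are hypothetical parameters. A node keeps an overlapping (spill) split only if neither child receives more than RHO of the points; otherwise it falls back to a disjoint split.

```python
LEAF_SIZE = 2   # hypothetical leaf capacity
TAU = 0.5       # hypothetical half-width of the overlap buffer
RHO = 0.7       # hypothetical balance threshold for keeping an overlap split


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points):
    if len(points) <= LEAF_SIZE:
        return {"leaf": points}
    # Split on the dimension with the largest spread, at its median value.
    dim = max(range(len(points[0])),
              key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    values = sorted(p[dim] for p in points)
    mid = values[len(values) // 2]
    left = [p for p in points if p[dim] < mid + TAU]
    right = [p for p in points if p[dim] >= mid - TAU]
    overlap = max(len(left), len(right)) <= RHO * len(points)
    if not overlap:
        # Too many shared points: fall back to a disjoint split.
        left = [p for p in points if p[dim] < mid]
        right = [p for p in points if p[dim] >= mid]
        if not left or not right:
            return {"leaf": points}
    return {"dim": dim, "mid": mid, "overlap": overlap,
            "left": build(left), "right": build(right)}


def search(node, query, best=None):
    """Approximate nearest neighbour: exact only where it backtracks."""
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or sq_dist(p, query) < sq_dist(best, query):
                best = p
        return best
    go_left = query[node["dim"]] < node["mid"]
    near, far = ((node["left"], node["right"]) if go_left
                 else (node["right"], node["left"]))
    best = search(near, query, best)
    if not node["overlap"]:
        # Backtracking: visit the other side if it could hold a closer point.
        gap = (query[node["dim"]] - node["mid"]) ** 2
        if best is None or gap < sq_dist(best, query):
            best = search(far, query, best)
    # Overlap nodes use defeatist search: the far side is never visited.
    return best


points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
          [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]]
tree = build(points)
```

Defeatist search is cheap because the overlap buffer makes it likely that the true neighbour sits on the chosen side; backtracking costs more but stays exact where the split is disjoint.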

NEURAL NETWORK
• Convolutional networks work well with images
• Hyperparameter tuning is the pain point, but it can
be automated (see the new algorithm)
• Training is not trivial, update the model is easy to
complain

WHAT MORE?
• Feature engineering
• Hyperparameter tuning
• Parallel optimizations
• Persist/update steps
• Ensemble models
[Diagram components: DATA, Normalizer, Combiner, pipelineModel, Cross Validator, PREDICTION]
https://github.com/gianvi
Thanks!
