Marko Velic PhD
Data Science Department
Styria Medijski Servisi d.o.o.
marko.velic@styria.hr
UNSUPERVISED LEARNING
(WITH SPARK)
CONTENTS
▪ Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
▪ Clustering
• K-Means
• Example (Spark)
▪ Examples from Styria practice (not Spark – for now)
10.03.2016 2
MACHINE LEARNING
UNSUPERVISED LEARNING
▪ Observations are not assigned to classes
▪ The computer program is not ‘supervised’ during the learning process
▪ Usually the task is to find ‘meaningful’ groups within the data
▪ Decisions are based on distances (or, conversely, similarities) between data points
DISTANCES
• To decide on the groups we need a similarity measure or, conversely, a distance measure
• Pythagorean theorem – Euclidean distance
• dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²) = √(4² + (-3)²) = √(16 + 9) = √25 = 5
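The worked example above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `euclidean` is chosen here for illustration and does not come from the slides.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square the coordinate-wise differences, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((2, -1), (-2, 2)))  # 5.0, matching the slide
```

The same function works in any number of dimensions, which is what matters once the points are feature vectors rather than 2-D coordinates.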
DISTANCES & APPROACHES
Source: http://en.wikipedia.org/wiki/Manhattan_distance
▪ Manhattan/Cityblock/Taxicab
• dist((x, y), (a, b)) = |x - a| + |y - b|
▪ Normalization! Rescale features first so that no single feature dominates the distance
▪ Mahalanobis – accounts for the variance (and covariance) of the features
• “multidimensional z-score”
▪ Cosine similarity
▪ Autoencoders – ‘unsupervised’ neural nets
▪ Supervised methods that are also based on distances
• ReliefF measure, KNN classifier, etc.
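The measures listed above can be sketched in plain Python. All function names are illustrative and not from the slides. Note one loud assumption: `mahalanobis_diag` treats the features as uncorrelated (diagonal covariance), which reduces Mahalanobis distance to the "multidimensional z-score" reading; the full measure uses the inverse covariance matrix.

```python
import math
from statistics import mean, pstdev

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_p * norm_q)

def zscore_columns(rows):
    """Normalization: rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has standard deviation 0; divide by 1 instead.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def mahalanobis_diag(p, center, variances):
    """Simplified Mahalanobis distance, ASSUMING uncorrelated features."""
    return math.sqrt(sum((a - m) ** 2 / v
                         for a, m, v in zip(p, center, variances)))

print(manhattan((2, -1), (-2, 2)))  # 7
```

For the two points used in the Euclidean example, the Manhattan distance is 7 rather than 5, because it walks along the axes instead of the diagonal.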
K-MEANS
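Simplified:
1. Randomly place the centroids
2. Assign each point to its closest centroid
3. Move each centroid to the middle (mean) of its assigned points
4. Go to step 2 and repeat until the assignments stop changing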
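The four steps above can be sketched as a short, single-machine Python function. This is an illustration of the simplified loop, not the Spark implementation; the name `kmeans`, the `max_iter` and `seed` parameters, and the choice to start from randomly sampled data points are assumptions made here.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: randomly place centroids (here: on k randomly chosen points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the loop has converged
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
        # Step 4: go back to step 2 (next loop iteration).
    return centroids, assignments

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On larger or less cleanly separated data the result depends on the random start, which is why production implementations usually use smarter initialization and several restarts.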
Image source: http://www.javabeat.net/2011/05/k-means-clustering-algorithms-in-mahout/
DEMO (SPARK!)
▪ K-means clustering of photos (i.e. their vector representations)
▪ A convolutional neural network serves as the supervised model; its outputs are used as features for unsupervised models
▪ Vector representations taken after the pooling layers, after every convolutional layer (Caffe)
▪ Clustering in Spark
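The demo code itself is not part of the slides. As a rough sketch of what the clustering step might look like today, the following uses the DataFrame-based `KMeans` from `pyspark.ml.clustering`. Everything here is an assumption: the function names, the input format (a dict from photo ID to an already extracted CNN feature vector), the column names, and the local Spark session. The original 2016 demo may well have used a different Spark API.

```python
from collections import defaultdict

def group_by_cluster(labelled):
    """Turn (photo_id, cluster) pairs into {cluster: [photo_id, ...]}."""
    groups = defaultdict(list)
    for photo_id, cluster in labelled:
        groups[cluster].append(photo_id)
    return dict(groups)

def cluster_photos(features_by_photo, k, seed=1):
    """Cluster photo feature vectors with Spark ML k-means.

    features_by_photo is a hypothetical dict {photo_id: list of floats},
    e.g. activations exported from a CNN layer.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = (SparkSession.builder
             .master("local[*]")          # assumption: local test run
             .appName("photo-kmeans")
             .getOrCreate())
    try:
        df = spark.createDataFrame(
            [(pid, Vectors.dense(vec)) for pid, vec in features_by_photo.items()],
            ["photo_id", "features"],
        )
        model = KMeans(k=k, seed=seed, featuresCol="features").fit(df)
        # transform() adds a "prediction" column holding the cluster index.
        rows = model.transform(df).select("photo_id", "prediction").collect()
        return group_by_cluster((r["photo_id"], r["prediction"]) for r in rows)
    finally:
        spark.stop()
```

With PySpark installed, `cluster_photos({"a.jpg": [0.0, 0.1], "b.jpg": [5.0, 5.2], "c.jpg": [0.2, 0.0]}, k=2)` would return a dict mapping each cluster index to the photo IDs assigned to it.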
T-SNE CLUSTER VISUALIZATION
SEMI-MANUAL CLUSTERING OF PHOTOS
Grouping photos based on visual features, Enes Deumić, Styria Data Science Team
NATURAL LANGUAGE PROCESSING
t-SNE concept visualization; vecernji.hr, Styria Data Science Team
AUTOMATIC (LEARNED) HIERARCHIES
Hierarchical clustering, Florijan Stamenković, Styria Data Science Team
VISUAL SEARCH EXAMPLE
CONCLUSION
▪ Distances
• Euclidean
• Manhattan
• Mahalanobis
• Cosine Similarity
▪ Clustering
• K-Means
▪ Supervised and unsupervised methods combine nicely: features from a supervised model can feed unsupervised clustering
▪ SparkNet: Training Deep Networks in Spark http://arxiv.org/pdf/1511.06051v4.pdf
▪ https://news.developer.nvidia.com/caffe-on-spark-for-deep-learning-from-yahoo/
THANK YOU!
