Machine Learning
with Apache Mahout
  Domingo Suarez Torres
Machine Learning (ML)
        Introduction
Definition

     • Machine learning, a branch of artificial
        intelligence, is a scientific discipline
        concerned with the design and
        development of algorithms that allow
        computers to evolve behaviors based on
        empirical data (1)


(1) http://en.wikipedia.org/wiki/Machine_learning
• “Machine Learning is programming
  computers to optimize a performance
  criterion using example data or past
  experience”
 • Introduction to Machine Learning, by E. Alpaydin
Applications
•   Recommend friends/dates/products
•   Classify content into predefined groups
•   Find similar content based on object properties
•   Find associations/patterns in actions/behaviors
•   Identify key topics in large collections of text
•   Detect anomalies in machine output
•   Rank search results
•   Fraud detection
•   Spam detection
•   Medical diagnostics
•   Translators
•   Much more!
Math

• Statistics
• Discrete Math
• Linear algebra
• Probability
Machine Learning & Apache Mahout
Starting with ML
•   Get your data
•   Decide on your features per your algorithm
•   Prep the data
    •   Different approaches for different algorithms
•   Run your algorithm(s)
    •   Lather, rinse, repeat
•   Validate your results
    •   Smell test, A/B testing
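The steps above can be sketched end to end on toy data. This is a minimal illustration in plain Python, not Mahout code: the data set, the one-parameter linear model, and the helper names `train_test_split` and `mean_absolute_error` are all assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# "Get your data": toy (feature, target) pairs where target = 2 * feature
data = [(float(x), 2.0 * x) for x in range(20)]

# "Prep the data": hold part of it back for validation
train, test = train_test_split(data)

# "Run your algorithm": least-squares slope of a line through the origin
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# "Validate your results": measure the error on the held-out rows
error = mean_absolute_error([y for _, y in test], [slope * x for x, _ in test])
print(f"slope={slope:.2f} held-out MAE={error:.4f}")
```

Holding data back before fitting is what makes the validation step meaningful; the "lather, rinse, repeat" loop would change features or algorithm and compare the held-out error again.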
Apache Mahout

• Machine Learning library. Platform?
• Extensible, we can use our own algorithm.
• Hadoop support
• 2005. Taste framework (collaborative filtering)
• 2008. Started as a subproject of Apache Lucene
Scalability
•   Huge amounts of data, growing every second!
•   Be as fast and efficient as possible given the intrinsic design of
    the algorithm
    •   Some algorithms won’t scale to massive machine clusters
    •   Others fit logically on a MapReduce framework like
        Apache Hadoop
    •   Still others will need alternative distributed programming
        models
    •   Be pragmatic
•   Most Mahout implementations are MapReduce enabled
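To make "fits logically on a MapReduce framework" concrete, here is a single-process sketch of the map, shuffle, and reduce phases, counting how many preferences each item received. It is plain Python rather than Hadoop or Mahout code; the function names and the sample records are assumptions for illustration only.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (item_id, 1) for one 'user,item,value' record."""
    _user, item, _value = record.split(",")
    yield item, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)

records = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "3,101,4.5", "3,103,4.0"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each `map_phase` call depends on one record and each `reduce_phase` call on one key, which is why this shape of computation can be spread across many machines.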
Who uses Mahout?
Components

• Recommender Engines (collaborative
  filtering, content-based)
• Clustering
• Classification
When to use?
• Recommendation
 • Rank large datasets
• Clustering
 • Group your data
• Classification
 • Train me to think like you
Recommenders
•   Given a data set, make a recommendation.
    •   Item recommendation (book, movie, etc.)
•   Ranking based
•   Recommendations
    •   User based
    •   Item based
•   Requires knowledge of users' relationships to items (user
    preferences)
Collaborative filtering
• User based
• Item based
• Both techniques require no knowledge of
  the properties of the items themselves.
• Item Type is irrelevant. Apache Mahout is
  happy
Content based
• Domain-specific approaches
• Hard to meaningfully codify into a
  framework
• We are responsible for choosing which
  item attributes to use.
• Apache Mahout can't handle this out of
  the box, but it can be built on top.
Making recommendations

 • What do we need?
  • Input data
  • Neighborhood
  • Similarity
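The three ingredients listed above (input data, neighborhood, similarity) can be combined into a small user-based recommender. This is a sketch in plain Python, not Mahout's API. The preference data is made up, Pearson correlation is one possible similarity measure chosen for the example, and `similarity` and `recommend` are hypothetical names.

```python
from math import sqrt

# Input data: {user_id: {item_id: preference_value}} (made-up sample)
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 5.0, 102: 3.5, 103: 2.0, 104: 4.5, 105: 4.0},
    4: {101: 4.5, 102: 3.0, 104: 4.0, 105: 3.5},
}

def similarity(a, b):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def recommend(user, prefs, n_neighbors=2, top=3):
    """Estimate preferences for unseen items from the most similar users."""
    sims = {o: similarity(prefs[user], prefs[o]) for o in prefs if o != user}
    # Neighborhood: the n most similar users
    neighbors = sorted(sims, key=sims.get, reverse=True)[:n_neighbors]
    scores, weights = {}, {}
    for other in neighbors:
        if sims[other] <= 0:
            continue  # ignore users who are dissimilar or uncorrelated
        for item, value in prefs[other].items():
            if item in prefs[user]:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + sims[other] * value
            weights[item] = weights.get(item, 0.0) + sims[other]
    # Similarity-weighted average of the neighbors' preference values
    estimates = {item: scores[item] / weights[item] for item in scores}
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:top]

recommendations = recommend(1, prefs)
print(recommendations)
```

Note that nothing in the code looks at what items 101 to 105 actually are; only the users' preference values matter.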
Input Data
•   In Mahout terms: Preferences
•   A preference contains:
    •   User ID
    •   Item ID
    •   Preference value
    •   Example:
        •   1,101,5.0
        •   USER ID: 1, ITEM ID: 101, PrefValue: 5.0
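A preference file in the `userID,itemID,value` form shown above can be loaded into a nested dictionary. The sketch below is plain Python for illustration, not Mahout's own data model; `parse_preferences` is a hypothetical helper, and the second and third sample lines are invented.

```python
from collections import defaultdict

def parse_preferences(lines):
    """Parse 'userID,itemID,value' lines into {user: {item: value}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, value = line.split(",")
        prefs[int(user)][int(item)] = float(value)
    return dict(prefs)

sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
prefs = parse_preferences(sample)
print(prefs[1][101])
```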
Neighborhood
• Nearest N users
• Threshold
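The two neighborhood strategies named above, nearest N users and threshold, differ only in how they cut the list of similar users. A minimal sketch in plain Python, assuming similarities to the target user have already been computed; the values and function names are made up for the example.

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity, whatever its value."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def threshold(similarities, minimum):
    """Keep every user whose similarity is at least `minimum`."""
    return [user for user, sim in similarities.items() if sim >= minimum]

# Hypothetical similarities between user 1 and the other users
sims_to_user_1 = {2: 0.9, 3: 0.4, 4: 0.75, 5: -0.2}

print(nearest_n(sims_to_user_1, 2))
print(threshold(sims_to_user_1, 0.3))
```

Nearest N always returns a fixed-size neighborhood, even if some members are poor matches; a threshold guarantees quality but may return very few users, or none.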
Similarity
Clustering

• Surface naturally occurring groups of data
• A notion of similarity (and dissimilarity)
• Algorithms do not require training
• Stopping condition - iterate until close
  enough
Clustering
•   Document level
    •   Group documents based on a notion of similarity
    •   K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    •   Distance measures
        •   Manhattan, Euclidean, others
•   Topic Modeling
    •   Cluster words across documents to identify topics
    •   Latent Dirichlet Allocation
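As an illustration of K-Means with a pluggable distance measure and an "iterate until close enough" stopping condition, here is a small sketch in plain Python. It is not Mahout's implementation; the points, the starting centroids, and the `tolerance` parameter are assumptions chosen for the example.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, centroids, distance=euclidean,
            max_iterations=100, tolerance=1e-6):
    """Alternate assignment and centroid update until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        shift = max(distance(old, new)
                    for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tolerance:  # stopping condition: close enough
            break
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
```

Passing `distance=manhattan` swaps the distance measure without touching the algorithm. No labels are supplied anywhere, which is the sense in which clustering needs no training.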
Classification

• Require training (supervised)
• Make a single decision with a very limited
  set of outcomes
• Typical answers naturally fit into categories
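The contrast with clustering is that a classifier is first trained on labeled examples and then picks one of a small set of outcomes. The sketch below uses a nearest-centroid classifier in plain Python; it is a deliberately simple stand-in, not one of Mahout's classifiers, and the features and labels are invented.

```python
from collections import defaultdict

def train(examples):
    """Learn one centroid per label from (features, label) pairs."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def classify(model, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical labeled training data: two numeric features per example
examples = [
    ((1.0, 1.0), "ok"), ((1.5, 0.5), "ok"),
    ((8.0, 9.0), "fraud"), ((9.0, 8.5), "fraud"),
]
model = train(examples)
print(classify(model, (1.2, 0.9)), classify(model, (8.5, 8.0)))
```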
Classification samples

• Credit card fraud prediction
• Customer attrition
• Diabetes detector
• Search Engine
Mahout/Hadoop
• For large data sets
• Online
• Offline (Hadoop preferred)
• You can build your solution with Mahout
• Take a look into Weka
 • http://www.cs.waikato.ac.nz/ml/weka/
Resources
Join us!
• GIAMA.
 • An initiative by Agustin Ramos
