Apache Mahout
The Elephant Driver

Presenters:
Antonio Loureiro Severien
Emmanouil Dimogerontakis
Muhammad Anis uddin Nasir
What is Apache Mahout?
● Machine learning and data mining framework for
  classification, clustering and recommendation

● Apache Mahout is a free machine learning library whose
  goal is to build scalable machine learning tools for
  analysing big data in a distributed manner
Machine Learning
"Machine Learning is programming computers to optimize a
performance criterion using example data or past
experience" - Alpaydin, 2004

Machine learning is concerned with the design and
development of algorithms that allow machines to make
decisions or even evolve behaviors based on a collection of
empirical data.
Data Mining
Data mining, also called knowledge discovery in
databases (KDD), is the process of discovering interesting
and useful patterns and relationships in large volumes of
data.
Combines tools from:
    ● statistics
    ● artificial intelligence (such as neural networks and
       machine learning)
with database management to analyze large data sets.
-Britannica Online Encyclopedia
Why Machine Learning and Data Mining?

● Data, Data, DATA!!!


● Tasks too Hard to Program


● Customizing software
Available Machine Learning Tools


●   WEKA
●   R
●   KEEL
●   Others...


Not enough?
Apache Mahout vs others?
Many open source Machine Learning
libraries either:
● Lack Community
● Lack Documentation and Examples
● Lack the Apache License
    (business opportunity)
● Are research-oriented
    (not fit for production yet)
● Lack Scalability
Mahout = Elephant Driver?
Why do we need scalability?
● Big Data
Applications
● Recommendation features
● Clustering of information
● Classification

Examples: movie recommendations, stock analysis,
fraud detection, ad-sense recommendation, etc.

How do we do this?
Supported Algorithms
●   Classification
●   Clustering
●   Recommender / Collaborative Filtering
●   Evolutionary Algorithms
●   Pattern Mining
●   Regression
●   Dimension reduction
●   Similarity Vectors
Classification
(learn to assign categories to documents)

Fully functional
 ● Logistic Regression (SGD)
 ● Bayesian

Integrated into Mahout development
 ● Random Forests (integrated)
 ● Online Passive Aggressive (integrated)
 ● Boosting (awaiting patch commit)

Open to be worked on...
 ● Hidden Markov Models (HMM) - Training is done in Map-Reduce
 ● Support Vector Machines (SVM) (open)
 ● Perceptron and Winnow (open)
 ● Neural Network (open)
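
To make "Logistic Regression (SGD)" concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use the Mahout API. The function names, the toy data and the hyperparameters (epochs, learning rate) are assumptions made up for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights (bias first) by stochastic gradient descent on log loss."""
    rng = random.Random(seed)
    weights = [0.0] * (len(samples[0]) + 1)  # weights[0] is the bias term
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for i in order:
            x, y = samples[i], labels[i]
            z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
            error = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            weights[0] -= learning_rate * error
            for j, v in enumerate(x):
                weights[j + 1] -= learning_rate * error * v
    return weights


def predict(weights, x):
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy, made-up data: one feature, class 1 when the value is large.
samples = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights = train_logistic_sgd(samples, labels)
print(predict(weights, [0.0]), predict(weights, [5.0]))
```

Each update looks at a single example, which is why this style of training suits data sets too large to hold in memory at once.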
Clustering
(group items that are topically related)

Fully functional
 ● Expectation Maximization (EM)
 ● Hierarchical Clustering

Integrated into Mahout development
 ● Canopy Clustering
 ● K-Means Clustering
 ● Fuzzy K-Means
 ● Mean Shift Clustering
 ● Dirichlet Process Clustering
 ● Latent Dirichlet Allocation
 ● Spectral Clustering
 ● Minhash Clustering
 ● Top Down Clustering
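
As an illustration of the K-Means entry in the list above, the following is a small single-machine sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not the Mahout implementation. The function name, the sample points and the choice of initial centroids are assumptions for the example.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) after iterating until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


# Made-up 2D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, [points[0], points[3]])
print(assignments)
```

The assignment step is independent per point, which is what makes the algorithm a natural fit for a distributed, map-reduce style of execution.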
Recommenders / Collaborative Filtering
(find items a user might like / find items that appear together)

Integrated into Mahout development
●   Non-distributed recommenders ("Taste") (integrated)
●   Distributed Item-Based Collaborative Filtering (integrated)
●   Collaborative Filtering using a parallel matrix factorization (integrated)
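
The idea behind item-based collaborative filtering can be sketched in a few lines of plain Python. This is a non-distributed toy, not the Taste or Mahout API. The ratings, the function names and the scoring rule (sum of similarity times the user's rating) are assumptions chosen to keep the example short.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """Cosine similarity between items, treating missing ratings as zero."""
    item_users = defaultdict(dict)
    for user, rated in ratings.items():
        for item, value in rated.items():
            item_users[item][user] = value

    norms = {
        item: math.sqrt(sum(v * v for v in users.values()))
        for item, users in item_users.items()
    }

    similarities = defaultdict(dict)
    items = sorted(item_users)
    for index, a in enumerate(items):
        for b in items[index + 1:]:
            common = item_users[a].keys() & item_users[b].keys()
            if not common:
                continue  # no user rated both items
            dot = sum(item_users[a][u] * item_users[b][u] for u in common)
            similarity = dot / (norms[a] * norms[b])
            similarities[a][b] = similarity
            similarities[b][a] = similarity
    return similarities


def recommend(ratings, user, top_n=3):
    """Rank unseen items by similarity to the items the user already rated."""
    similarities = item_similarities(ratings)
    seen = ratings[user]
    scores = defaultdict(float)
    for item, value in seen.items():
        for other, similarity in similarities.get(item, {}).items():
            if other not in seen:
                scores[other] += similarity * value
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:top_n]


# Made-up ratings on a 1 to 5 scale.
ratings = {
    "alice": {"Alien": 5, "Blade Runner": 4},
    "bob": {"Alien": 5, "Blade Runner": 5, "Dune": 5},
    "carol": {"Alien": 4, "Dune": 4, "Amelie": 2},
    "dave": {"Amelie": 5, "Notting Hill": 4},
}
print(recommend(ratings, "alice"))
```

Items that nobody has co-rated with the user's items (here "Notting Hill" for alice) never receive a score, which is the usual sparsity limitation of this approach.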
Who is using it?
Opportunities
●   Developers
●   Researchers
●   Small Business
●   Large Business
●   Consultancy...
    ○ on Mahout
    ○ on specific data analysis
● Open data
● etc...
Apache Mahout
Business?

Ideas?

Suggestions?

Questions?
Where to start?
● Wikipedia Bayes Example
   ○   https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html


● What does it do?
   ○ Classifies a Wikipedia data dump by country.
   ○ Objective: Predict what country an unseen article
     should be categorized into.
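
To show what "predict what country an unseen article should be categorized into" means in practice, here is a tiny multinomial naive Bayes sketch in plain Python. It is not the Mahout Wikipedia example and uses no Mahout classes. The class name, the whitespace tokenization, the smoothing value and the four training sentences are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + self.smoothing * len(self.vocabulary)
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + self.smoothing) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training "articles", labelled by country.
classifier = NaiveBayesTextClassifier()
classifier.train("paris is the capital and the louvre is in paris", "France")
classifier.train("the eiffel tower stands in paris", "France")
classifier.train("tokyo is the capital and kyoto has many temples", "Japan")
classifier.train("mount fuji is near tokyo", "Japan")

print(classifier.classify("a museum in paris"))
print(classifier.classify("temples near tokyo"))
```

Training only needs word counts per label, so the counting can be split across many machines and merged afterwards.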
References
General
http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
http://www.slideshare.net/aneeshabakharia/lca2011-mahout
Hands-on
http://www.slideshare.net/OReillyOSCON/hands-on-mahout
Who is using it?
https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
Apache Mahout
http://mahout.apache.org/
Quickstart
https://cwiki.apache.org/MAHOUT/quickstart.html
