Website classification
using Apache Spark
Amith Nambiar
Demo of the WebCat app
Business problem
Automatically classify new websites into one or more
predefined categories.
Why?
Web logs collected from data providers contain new websites
every day. These need to be categorized before they are
presented to customers in the daily reports.
Website classification using
Apache Spark's MLlib.
Training Data
Starting point was already categorised data in the
form:
URL, category_id
www.linux.com, 10 -> (Computers and Internet)
www.coles.com.au, 20 -> (Shopping and Classifieds)
Training Data
Developed a crawler to crawl each of the categorised websites.
2,550 websites were picked for the initial training and test data.
URL, Category_Id -> URL, Category_Id, Features
www.coles.com.au, 20 ->
www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first
spend onlin liquorland cole card cole insur apparel cole credit card locat hour
look hervey hervey today normal store hour monday friday 8am special store hour
saturday decemb sunday decemb store store search suburb postcod search
suburb postcod select locat suburb locat found pleas store store state recip
inspir recip tast cole partner tast weekli plan easier visit tast cook month cole
magazin everyday ingredi sensat meal famili friend latest cole cole handi video
recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh
fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit
corpor respons corpor respons supplier commit work …
The text of each website was crawled, stemmed, and stripped of
stop words.
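
The cleaning step could look roughly like the sketch below. It is an illustration under stated assumptions, not the crawler's actual code: the function name `clean_text` and the tiny stop-word list are hypothetical, and stemming (which produces tokens such as "groceri" and "deliv") is left out because it needs a stemmer, such as Porter's, that the Python standard library does not provide.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "your", "our", "is"}

def clean_text(raw_html):
    """Strip tags, lowercase, tokenize, then drop stop words and very short tokens."""
    text = re.sub(r"<[^>]+>", " ", raw_html)          # crude tag removal
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # alphanumeric tokens only
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

page = "<h1>Groceries delivered to your kitchen bench</h1><p>Store hours and recipes</p>"
tokens = clean_text(page)
print(tokens)
```

A stemmer would then be applied to each remaining token before the features are stored next to the URL and category id.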
Bayes's theorem
Website classification using
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features.
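
For reference, Bayes' theorem and the naive independence assumption can be written as below. The symbols are chosen here for illustration and do not come from the slides: C is a category and x_1, ..., x_n are the features of a website. The last expression is the resulting decision rule; the denominator is dropped because it is the same for every category.

```latex
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
\qquad
P(x_1, \dots, x_n \mid C) \approx \prod_{i=1}^{n} P(x_i \mid C)
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```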
tf-idf for weighting
In information retrieval, tf–idf, short for term frequency–inverse document frequency,
is a numerical statistic that is intended to reflect how important a word is to a document
in a collection or corpus.
https://en.wikipedia.org/wiki/Tf-idf
tf-idf
The tf-idf value increases proportionally to the number of times a word appears in the
document, but is offset by the frequency of the word in the corpus, which helps to adjust for
the fact that some words appear more frequently in general.
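
A small worked example may make the weighting concrete. The sketch below is plain Python rather than Spark code; it assumes raw term counts for tf and the smoothed idf log((N + 1) / (df + 1)), and the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is the raw count of the term in the document;
    idf is log((N + 1) / (df + 1)), where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log((n_docs + 1) / (df[term] + 1))
                        for term, count in tf.items()})
    return weights

docs = [
    ["store", "recip", "store"],     # hypothetical stemmed tokens for three websites
    ["linux", "kernel", "store"],
    ["linux", "distro"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document "store" appears twice but also occurs in two of the three documents, so it ends up with a lower weight than "recip", which appears once and only in that document.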
Training Data from Database/HDFS -> TermDoc RDDs -> tf-idfs -> Array of LabeledPoint(classId, vector)
1. Calculate the tf-idf values on the features.
2. Create a LabeledPoint for each training data row.
3. Train the Naive Bayes model: model = NaiveBayes.train(labeledPoints)
4. Predict the class of new data: model.predict(feature_vector), e.g. “Automotive”
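
To show what the train and predict calls do conceptually, here is a minimal multinomial Naive Bayes in plain Python. It is a stand-in for illustration, not Spark MLlib's implementation: the function names, the dict-based feature format, and the additive smoothing value of 1.0 are assumptions, and the class ids 10.0 and 20.0 simply reuse the category ids from the training data example.

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_points, smoothing=1.0):
    """labeled_points: list of (class_id, {term: weight}) pairs."""
    class_counts = defaultdict(int)
    term_totals = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for class_id, features in labeled_points:
        class_counts[class_id] += 1
        for term, weight in features.items():
            term_totals[class_id][term] += weight
            vocab.add(term)
    n = len(labeled_points)
    model = {}
    for class_id, count in class_counts.items():
        # Additive smoothing keeps unseen (class, term) pairs from getting probability 0.
        total = sum(term_totals[class_id].values()) + smoothing * len(vocab)
        log_likelihood = {
            t: math.log((term_totals[class_id].get(t, 0.0) + smoothing) / total)
            for t in vocab
        }
        model[class_id] = (math.log(count / n), log_likelihood)
    return model

def predict(model, features):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    def score(class_id):
        log_prior, log_likelihood = model[class_id]
        return log_prior + sum(w * log_likelihood[t]
                               for t, w in features.items() if t in log_likelihood)
    return max(model, key=score)

training = [
    (20.0, {"groceri": 2.0, "store": 1.0}),
    (20.0, {"store": 1.0, "recip": 1.0}),
    (10.0, {"linux": 2.0, "kernel": 1.0}),
]
model = train_naive_bayes(training)
print(predict(model, {"store": 1.0, "groceri": 1.0}))
print(predict(model, {"linux": 1.0}))
```

The scores are sums of logarithms rather than products of probabilities, which avoids numeric underflow when a website has many features.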
Each row of training data (website) is turned into the form (ClassId, Sparse Vector):
5.0, [100,(1,44,..),(0.3,0.12,…)]
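
Assuming the notation follows the usual (size, indices, values) layout of a sparse vector, the example reads as class id 5.0 and a vector of length 100 whose non-zero entries sit at the listed indices. The sketch below shows that layout on a short, made-up vector; `to_sparse` and `to_dense` are hypothetical helper names.

```python
def to_sparse(dense):
    """Convert a dense list into (size, indices, values), keeping only non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(sparse):
    """Expand (size, indices, values) back into a dense list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# (class id, sparse tf-idf vector) for one hypothetical website
row = (5.0, to_sparse([0.0, 0.3, 0.0, 0.0, 0.12, 0.0]))
print(row)
```

Storing only the non-zero entries matters here because each website uses a small fraction of the whole vocabulary.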
API first for Data science
http://engineering.pivotal.io/post/api-first-for-data-science/
High level architecture of WebCat
[Architecture diagram. Components: WebCat app, queues/topics, Link Collector Service, Link Crawler Service, Classification Service, training data database, Apache Spark.]
Example request: Categorize www.coles.com.au
Response: Category is “Shopping and Classifieds”
Scale the Crawler service independently of the rest of the services.
WebCat dashboard on PWS - Pivotal Web Services
Note that the crawler service is scaled up to 6 instances for better performance.
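
The idea of scaling the crawler on its own can be sketched with in-process queues and threads. This is a toy model, not the deployed services: `fake_crawl`, `crawler_worker` and `run_pipeline` are hypothetical names, the queues stand in for the message broker, and the default of 6 workers only mirrors the instance count mentioned above.

```python
import queue
import threading

def fake_crawl(url):
    # Hypothetical stand-in for fetching and cleaning a page.
    return f"text of {url}"

def crawler_worker(crawl_q, classify_q):
    """Take URLs off the crawl queue and hand crawled text to the classification queue."""
    while True:
        url = crawl_q.get()
        if url is None:              # sentinel: shut this worker down
            break
        classify_q.put((url, fake_crawl(url)))

def run_pipeline(urls, crawler_instances=6):
    crawl_q, classify_q = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=crawler_worker, args=(crawl_q, classify_q))
               for _ in range(crawler_instances)]
    for w in workers:
        w.start()
    for url in urls:
        crawl_q.put(url)
    for _ in workers:
        crawl_q.put(None)            # one sentinel per worker
    for w in workers:
        w.join()
    crawled = {}
    while not classify_q.empty():    # safe: all producers have finished
        url, text = classify_q.get()
        crawled[url] = text
    return crawled

result = run_pipeline(["www.coles.com.au", "www.linux.com", "www.bmw.com"], crawler_instances=2)
print(sorted(result))
```

Because the crawler only talks to the queues, changing `crawler_instances` does not require any change to whatever consumes the classification queue.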
Ideas for improving WebCat?
User feedback loop to update the model on incorrect predictions
[Diagram: the WebCat architecture (WebCat app, queue with topics, Link Collector Service, Link Crawler Service, Classification Service, training data database, Apache Spark) with a feedback prompt.]
Categorise www.bmw.com: We think it is “Electronics” - did we get it right?
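
One possible shape for such a feedback loop is sketched below. Everything in it is an assumption made for illustration: the function `apply_feedback`, the dict used as training data, and the corrected label "Automotive" are not taken from the WebCat code.

```python
def apply_feedback(training_data, url, predicted, confirmed, corrected=None):
    """Record a user's verdict on a prediction; return True if the training data changed."""
    label = predicted if confirmed else corrected
    if label is None:
        return False      # prediction rejected, but no replacement category supplied
    if training_data.get(url) == label:
        return False      # nothing new to learn
    training_data[url] = label
    return True

training_data = {"www.coles.com.au": "Shopping and Classifieds"}
changed = apply_feedback(training_data, "www.bmw.com", "Electronics",
                         confirmed=False, corrected="Automotive")
print(changed, training_data["www.bmw.com"])
```

A return value of True would be the signal to retrain the model on the updated training data.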
Upload your own data - (website, category) pairs
[Diagram: the same WebCat architecture, with a user submitting training data.]
User: I know kogan.com.au belongs to category “Shopping and Classifieds” - add it to the training data please.
More data = better predictions?
User defined categories
e.g. realestate.com.au -> “Real Estate”
[Diagram: the same WebCat architecture, with a user creating the new category “Real Estate”.]
Provide a publicly available API for categorised websites
[Diagram: the same WebCat architecture, exposing the endpoints below.]
GET /websites/{id}/category
GET /websites/{id}/features
…
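
A minimal sketch of how the two listed endpoints might be routed is shown below. It is framework-free and purely illustrative: the in-memory `WEBSITES` store, the numeric id, the response shape and the function `handle_get` are all assumptions, since the slides only name the paths.

```python
import re

# Hypothetical in-memory store keyed by website id.
WEBSITES = {
    1: {"url": "www.coles.com.au",
        "category": "Shopping and Classifieds",
        "features": ["groceri", "deliv", "kitchen"]},
}

ROUTE = re.compile(r"^/websites/(\d+)/(category|features)$")

def handle_get(path, store=WEBSITES):
    """Return (status, body) for GET /websites/{id}/category or /websites/{id}/features."""
    match = ROUTE.match(path)
    if not match:
        return 404, {"error": "unknown route"}
    website = store.get(int(match.group(1)))
    if website is None:
        return 404, {"error": "unknown website"}
    field = match.group(2)
    return 200, {field: website[field]}

print(handle_get("/websites/1/category"))
```

Keeping the handler a pure function of the path and the store makes it easy to test before wiring it into a real web framework.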
WebCat on Apache MADlib
http://madlib.incubator.apache.org/

Slides from Apache Spark Meetup in Sydney - November 2016
