Slides from Apache Spark Meetup in Sydney - November 2016

Website classification
using Apache Spark
Amith Nambiar

Business problem
Automatically classify new websites into one or more
predefined categories.

Why?
Web logs collected from data providers show new websites
popping up every day. These need to be categorized daily,
before they are presented to customers in reports.

Website classification using
Apache Spark's MLlib.

Training Data
The starting point was already-categorised data in the form:
URL, category_id
www.linux.com, 10 -> (Computers and Internet)
www.coles.com.au, 20 -> (Shopping and Classifieds)

Training Data
Developed a crawler to crawl each of the categorised websites.
2,550 websites were picked for the initial training and test data.
URL, Category_Id -> URL, Category_Id, Features
www.coles.com.au, 20 ->
www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first
spend onlin liquorland cole card cole insur apparel cole credit card locat hour
look hervey hervey today normal store hour monday friday 8am special store hour
saturday decemb sunday decemb store store search suburb postcod search
suburb postcod select locat suburb locat found pleas store store state recip
inspir recip tast cole partner tast weekli plan easier visit tast cook month cole
magazin everyday ingredi sensat meal famili friend latest cole cole handi video
recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh
fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit
corpor respons corpor respons supplier commit work …
The text of each website was crawled, stemmed, and stripped of stop words.
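
The stop-word removal and stemming step above can be sketched roughly as follows. This is a minimal sketch only: the slides do not name the stop-word list or the stemmer that was used, so `STOP_WORDS`, `SUFFIXES`, `crude_stem` and `extract_features` are hypothetical stand-ins, and the crude suffix stripper will not reproduce the exact tokens shown on the slide.

```python
import re

# Hypothetical stop-word list; the slides do not say which list was used.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with", "your"}

# Crude suffix stripper standing in for a real stemmer (an assumption).
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(page_text):
    """Lowercase, tokenise, drop stop words and very short tokens, then stem."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 2]


features = extract_features("Groceries delivered to your kitchen bench")
```

A production pipeline would more likely swap `crude_stem` for an established stemmer and a curated stop-word list.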

Website classification using
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features.

tf-idf for weighting
In information retrieval, tf–idf, short for term frequency–inverse document frequency,
is a numerical statistic that is intended to reflect how important a word is to a document
in a collection or corpus.
https://en.wikipedia.org/wiki/Tf-idf

tf-idf
The tf-idf value increases proportionally to the number of times a word appears in the
document, but is offset by the frequency of the word in the corpus, which helps to adjust for
the fact that some words appear more frequently in general.
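
The weighting described above can be illustrated with a small pure-Python sketch. It assumes raw term counts for tf and one common smoothed form of idf, log((N + 1) / (df + 1)); the slides do not state which variant was used, and the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # raw count of each term in this document
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights


docs = [
    ["cole", "store", "recip", "store"],
    ["linux", "kernel", "store"],
]
weights = tf_idf(docs)
```

In this toy corpus "store" appears in every document, so its weight drops to zero, while "cole" keeps a positive weight because it appears in only one document.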

Training Data from Database/HDFS
-> TermDoc RDDs
-> tf-idfs: calculate tf-idf values on the features.
-> Array of LabeledPoint(classId, vector): create a LabeledPoint for each training data row.
-> model = NaiveBayes.train(labelPoints): train the Naive Bayes model.
-> model.predict(feature_vector): predict the class of new data, e.g. “Automotive”.
Each row of training data (a website) is turned into the form (ClassId, Sparse Vector):
5.0, [100,(1,44,..),(0.3,0.12,…)]
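
To show what the train and predict steps compute, here is a pure-Python stand-in for a multinomial Naive Bayes classifier over sparse vectors. It is a sketch under stated assumptions, not Spark's implementation: the functions `train` and `predict`, the additive smoothing value of 1.0, and the toy vectors are all illustrative, and a sparse vector is represented as a plain {index: weight} dict.

```python
import math
from collections import defaultdict


def train(labeled_points, num_features, smoothing=1.0):
    """Fit multinomial Naive Bayes on (class_id, {index: weight}) pairs."""
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: defaultdict(float))
    for class_id, vector in labeled_points:
        class_counts[class_id] += 1
        for index, weight in vector.items():
            feature_sums[class_id][index] += weight
    total = sum(class_counts.values())
    model = {}
    for class_id, count in class_counts.items():
        denom = sum(feature_sums[class_id].values()) + smoothing * num_features
        log_prior = math.log(count / total)
        log_default = math.log(smoothing / denom)  # features unseen in this class
        log_likelihood = {
            i: math.log((s + smoothing) / denom)
            for i, s in feature_sums[class_id].items()
        }
        model[class_id] = (log_prior, log_likelihood, log_default)
    return model


def predict(model, vector):
    """Return the class id with the highest log posterior score."""
    def score(params):
        log_prior, log_likelihood, log_default = params
        return log_prior + sum(
            w * log_likelihood.get(i, log_default) for i, w in vector.items()
        )
    return max(model, key=lambda class_id: score(model[class_id]))


# Toy training rows: (class_id, sparse tf-idf vector); values are made up.
labeled_points = [
    (10.0, {0: 0.4, 1: 0.3}),  # e.g. terms seen on a computing site
    (20.0, {2: 0.5, 3: 0.2}),  # e.g. terms seen on a shopping site
]
model = train(labeled_points, num_features=4)
predicted = predict(model, {2: 0.3})
```

With these toy rows, a new vector that only contains feature 2 is assigned to class 20.0, because that feature was only seen in the shopping example.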

API first for data science
http://engineering.pivotal.io/post/api-first-for-data-science/

High-level architecture of WebCat
Components: WebCat App, Queues/Topics, Link Collector Service, Link Crawler Service, Classification Service, Training Data Database, Apache Spark
User: Categorize www.coles.com.au
WebCat: Category is “Shopping and Classifieds”
Scale the crawler service independently of the rest of the services.
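
The flow between the services above can be sketched with in-process queues. This is only an illustration of the decoupling idea: the slides do not name the message broker or the service interfaces, so the queue names, `fake_fetch` and `fake_classify` are hypothetical stand-ins for the real crawler and the Spark model.

```python
import queue

# Hypothetical in-process queues standing in for the broker's queues/topics.
links_to_crawl = queue.Queue()
pages_to_classify = queue.Queue()


def link_collector(url):
    """Publish a URL for crawling."""
    links_to_crawl.put(url)


def link_crawler(fetch):
    """Consume one URL, fetch its text, and publish it for classification."""
    url = links_to_crawl.get()
    pages_to_classify.put((url, fetch(url)))


def classification_service(classify):
    """Consume one crawled page and return (url, category)."""
    url, text = pages_to_classify.get()
    return url, classify(text)


def fake_fetch(url):
    """Stand-in for the real crawler (assumption)."""
    return "groceri deliv store"


def fake_classify(text):
    """Stand-in for the Spark model (assumption)."""
    return "Shopping and Classifieds" if "groceri" in text else "Unknown"


link_collector("www.coles.com.au")
link_crawler(fake_fetch)
result = classification_service(fake_classify)
```

Because each stage only reads from and writes to a queue, more crawler workers can be added without touching the collector or the classifier.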

WebCat dashboard on PWS - Pivotal Web Services
Note that the crawler service is scaled up to 6 instances for better performance.

User feedback loop to update the model on incorrect predictions
Components: WebCat App, Queue with topics, Training Data Database, Apache Spark
User: Categorise www.bmw.com.
WebCat: We think it is “Electronics”. Did we get it right?

Upload your own data - (website, category) pairs
Components: WebCat App, Queue with topics, Training Data Database, Apache Spark
User: I know kogan.com.au belongs to category “Shopping and Classifieds”. Add it to the training data please.
More data = Better predictions?

User-defined categories
e.g. realestate.com.au -> “Real Estate”
Components: WebCat App, Queue with topics, Training Data Database, Apache Spark
User: Create New Category “Real Estate”

Provide a publicly available API for categorised websites
Components: WebCat App, Queue with topics, Training Data Database, Apache Spark
GET /websites/{id}/category
GET /websites/{id}/features
…
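
A rough sketch of how the two listed GET routes might be resolved is shown below. Everything beyond the two paths is an assumption: the in-memory `WEBSITES` store, the id "1", the response shapes and the `handle_get` function are hypothetical and only illustrate the routing idea.

```python
import re

# Hypothetical in-memory store; the id and record are illustrative only.
WEBSITES = {
    "1": {
        "url": "www.coles.com.au",
        "category": "Shopping and Classifieds",
        "features": ["groceri", "deliv", "store"],
    },
}

# Matches /websites/{id}/category and /websites/{id}/features.
ROUTE = re.compile(r"^/websites/(?P<id>[^/]+)/(?P<field>category|features)$")


def handle_get(path):
    """Return (status_code, body) for the two GET routes listed on the slide."""
    match = ROUTE.match(path)
    if match is None:
        return 404, {"error": "unknown route"}
    site = WEBSITES.get(match.group("id"))
    if site is None:
        return 404, {"error": "unknown website"}
    field = match.group("field")
    return 200, {field: site[field]}


status, body = handle_get("/websites/1/category")
```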

WebCat on Apache MADlib
http://madlib.incubator.apache.org/
