Classifying Ephemeral vs. Long
Lasting Content on the Web
Akif Khan Yusufzai 11CSS07
Moonis Javed 11CSS40
Introduction
● Web classification is a very important machine learning problem with
wide applicability in tasks such as news classification, content
prioritization, focused crawling and sentiment analysis of web content.
● On the Web, classification of page content is essential to focused
crawling, to the assisted development of web directories, to topic
specific Web link analysis and to contextual advertising.
● Our GOAL:
Study a specific instance of this broad and vital web classification
problem and develop a successful prediction system.
Goal Of Project:
● Building a crawler to scrape data from different websites across
the internet, such as bloomberg.com, blogs from blogger.com, and the
Times of India news website (a minimal crawler sketch follows below).
● Building a classifier to categorize web pages as evergreen or
ephemeral.
● Binary classes of classification:
o Ephemeral (short lived)
o Evergreen (Long lived)
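A minimal sketch of the crawling step referenced above, assuming the requests and beautifulsoup4 packages; the placeholder URL and the paragraph-based body extraction are illustrative choices only, not the project's actual crawler.

import requests
from bs4 import BeautifulSoup

def fetch_page_text(url):
    # Download the page and pull out its title and visible body text.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return title, body

# Example: scrape one article page (placeholder URL).
title, body = fetch_page_text("https://timesofindia.indiatimes.com/")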
Ephemeral Content
● Short lived, i.e., it loses its relevance after
a certain period of time.
● Based on current happenings or
interests.
● Fades out after some time, and its
viewership or hit count becomes negligible.
● Examples
o A news topic
o A viral video
Long Lasting Content
● Content doesn't lose its relevance even
after a very long time.
● Usually based on everlasting topics
● Example
o A cooking recipe
o Information about a monument such as
the Taj Mahal, or about history.
Technical Details
● Dataset from Kaggle’s ‘StumbleUpon Challenge’
● Contains approximately 10,000 HTML documents for
training and testing purposes.
● To improve our model we will scrape recent data from other
websites.
● Fields
o URL
o Boilerplate text: contains the title and body of the HTML page.
o Number of characters
o Number of links
o Number of words in the URL
o User-determined label (training set only), etc. (see the loading sketch below)
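A sketch of loading these fields, assuming the Kaggle download's tab-separated train.tsv whose boilerplate column holds JSON with title and body keys; the exact column names are assumptions based on the challenge data.

import json
import pandas as pd

# Load the training split and unpack the boilerplate JSON into text columns.
train = pd.read_csv("train.tsv", sep="\t")
boiler = train["boilerplate"].apply(json.loads)
train["title"] = boiler.apply(lambda d: d.get("title") or "")
train["body"] = boiler.apply(lambda d: d.get("body") or "")
y = train["label"]  # user-determined label: 0 = ephemeral, 1 = evergreen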
Preprocessing
● Bag of Words Model:
o Create a dictionary of all the words with their frequency of
occurrence.
o Remove the highest frequency words (filler words like the, is,
etc.), as their presence doesn't help the prediction model.
o Remove the lowest frequency words, which do not occur often
enough to be useful for prediction (see the sketch below).
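The sketch referenced above, using scikit-learn's CountVectorizer as one possible implementation of this bag-of-words filtering and reusing the train frame from the loading sketch earlier; the max_df and min_df thresholds are illustrative values, not the project's tuned settings.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words="english",  # drops filler words like "the", "is"
    max_df=0.95,           # drop words appearing in more than 95% of documents
    min_df=5)              # drop words appearing in fewer than 5 documents
X_counts = vectorizer.fit_transform(train["body"])  # document-term count matrix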
Preprocessing
Another approach: use the term frequency-inverse document frequency (TF-IDF) of
each word.
● TF-IDF is the product of the term frequency (the number of times a word
appears in a given document) and the inverse document frequency.
● Inverse document frequency (IDF): based on how commonly the word appears
across all documents; rarer words receive a higher weight.
Formula for calculating IDF
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
where:
D is the set of training examples (documents)
|D| is the number of training examples
|{d ∈ D : t ∈ d}| is the number of documents where the
word t appears [6].
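As a sketch, scikit-learn's TfidfVectorizer computes tf(t, d) · idf(t, D) for every word directly (its smoothed IDF differs slightly from the plain formula above); it reuses train["body"] from the loading sketch, and the cut-off parameters are again illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=5)
X_tfidf = tfidf.fit_transform(train["body"])  # rows = documents, columns = TF-IDF weights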
Classifier Models used
1. Naive Bayes Model
2. Logistic regression Analysis
3. Support Vector Machine (SVM)
4. Decision tree model - Random Forest
Naïve Bayes Model
● family of simple probabilistic classifiers
● based on applying Bayes' theorem
● strong (naive) independence assumptions between
the features
● popular method for text categorization
● used with word frequencies as features
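A minimal Naive Bayes sketch on the word-count features from the bag-of-words sketch, using scikit-learn's MultinomialNB; the Gaussian variant reported in the results later would instead need dense inputs (e.g. X_counts.toarray()).

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()             # class prior times per-word likelihoods, assuming independence
nb.fit(X_counts, y)              # word-count features and labels from the earlier sketches
nb_pred = nb.predict(X_counts)   # 0 = ephemeral, 1 = evergreen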
Logistic Regression
● used for predicting the outcome of a categorical dependent
variable (i.e., a class label) based on one or more predictor
variables (features)
● makes use of one or more predictor variables that may be
either continuous or categorical data
● Binomial logistic regression used as final predictions are
binary
(0 - ephemeral
1 - long lasting)
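A sketch of the binomial logistic regression step with scikit-learn, assuming the TF-IDF features from the preprocessing sketch; the model estimates P(evergreen | features) and thresholds it at 0.5.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tfidf, y)
prob_evergreen = logreg.predict_proba(X_tfidf)[:, 1]   # P(label = 1)
lr_pred = (prob_evergreen >= 0.5).astype(int)          # 0 = ephemeral, 1 = long lasting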
Support Vector Machine (SVM)
● builds a model that assigns new examples into one
category or the other, making it a non-probabilistic
binary linear classifier
● used to analyze data and recognize patterns
● A set of features that describes one case (i.e., a row of predictor
values) is called a vector.
● creates an optimal hyperplane in the N-dimensional feature space which
separates the data into two categories
Decision Tree Model - Random Forest
● ensemble learning method for classification (and regression).
● operates by constructing a multitude of decision trees at training
time.
● outputs the class that is the mode of the classes output by the
individual trees.
● applies the general technique of bootstrap aggregating, or
bagging.
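A sketch with scikit-learn's RandomForestClassifier on the TF-IDF features from earlier: each of the n_estimators trees is fit on a bootstrap sample and the forest outputs the majority (mode) class; 100 trees is an illustrative size, not the project's setting.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, bootstrap=True, n_jobs=-1)
forest.fit(X_tfidf, y)              # bagged decision trees over the TF-IDF features
rf_pred = forest.predict(X_tfidf)   # majority vote across the trees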
Outliers
Results of Different Models
SVM
● Linear SVM on TF-IDF vectorized body
20-fold CV score: 86.8915%
● Linear SVM on TF-IDF vectorized body after outlier
removal
20-fold CV score: 87.2765%
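A sketch of how a 20-fold cross-validation score of this kind can be computed with a linear SVM on the TF-IDF body features from earlier; the exact preprocessing and outlier-removal steps behind the reported numbers are not reproduced here.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

svm = LinearSVC()
scores = cross_val_score(svm, X_tfidf, y, cv=20)   # 20-fold cross-validation
print("20-fold CV accuracy: %.4f%%" % (100 * scores.mean()))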
Results of Different Models
Gaussian NB:
● 20 Fold CV Score
69.825%
● 20 Fold CV Score after
Outlier Removal
70.379%
Results of Different Models
Random Forest:
● 20 Fold CV Score
79.95%
● 20 Fold CV Score after
Outlier Removal
80.1174%
Results of Different Models
Word Cloud of highest frequency words
Work To Be Done
● Verify outliers using clustering methods such as
K-means clustering and DBSCAN
● Build an ensemble method by combining
multiple models (see the sketch below)
● Apply AdaBoost to improve ensemble
accuracy
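A sketch of the planned ensemble direction under the assumption that scikit-learn's VotingClassifier and AdaBoostClassifier are used, reusing the features and labels from the earlier sketches; the component models, voting scheme, and estimator counts are illustrative choices, not final decisions.

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Soft voting averages the predicted probabilities of the individual models.
ensemble = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="soft")
ensemble.fit(X_counts, y)

# AdaBoost builds a boosted ensemble of shallow trees as a separate comparison.
boosted = AdaBoostClassifier(n_estimators=200)
boosted.fit(X_counts, y)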
Area of Application
● useful for recommenders attempting to classify different
news stories based on type.
● useful for archival projects to determine what web content
merits inclusion.
● useful for content sites interested in capacity planning for
hosting different pages based on expected longevity.
● useful for ad placement: companies can bid more on ads
displayed on evergreen pages, since those pages remain
visible for a longer time.
Future Scope
● Apply distributed or cloud computing to further improve accuracy
and provide real time classification.
● Predict a page's lifetime (the time for which it receives a fair
amount of hits)
● Sentiment analysis of long lasting web pages
References
[1] Kaggle, “StumbleUpon Evergreen Classification Challenge”,
http://www.kaggle.com/c/stumbleupon
[2] T. Fawcett, “An Introduction to ROC Analysis”, Pattern Recognition Letters,
Issue 27, 2006, pp 861-874
[3] J. Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries”,
ICML, 2003
Thank You
Any Questions ?
