Classifying Ephemeral vs. Long
Lasting Content on the Web
Akif Khan Yusufzai 11CSS07
Moonis Javed 11CSS40
Introduction
● Web classification is a very important machine learning problem with
wide applicability in tasks such as news classification, content
prioritization, focused crawling and sentiment analysis of web content.
● On the Web, classification of page content is essential to focused
crawling, to the assisted development of web directories, to topic
specific Web link analysis and to contextual advertising.
● Our GOAL:
Study a specific instance of this broad and vital web classification
problem and develop a successful prediction system.
Goal Of Project:
● Building a crawler to scrape data from different websites across
the internet, such as bloomberg.com, blogs from blogger.com, and the
Times of India news website (a minimal crawler sketch follows below).
● Building a classifier to categorize web pages as evergreen or
ephemeral.
● Binary classes of classification:
o Ephemeral (short lived)
o Evergreen (Long lived)
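A minimal sketch of the crawling step referenced above, assuming the requests and beautifulsoup4 packages; the placeholder URL and the paragraph-based body extraction are illustrative choices only, not the project's actual crawler.

import requests
from bs4 import BeautifulSoup

def fetch_page_text(url):
    # Download the page and pull out its title and visible body text.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return title, body

# Example: scrape one article page (placeholder URL).
title, body = fetch_page_text("https://timesofindia.indiatimes.com/")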
Ephemeral Content
● Short lived, i.e., it loses its relevance after
a certain period of time.
● Based on current happenings or
interests.
● Fades out after some time, and its
viewership or hit count becomes negligible.
● Examples
o A news topic
o A viral video
Long Lasting Content
● Content doesn't lose its relevance even
after a very long time.
● Usually based on everlasting topics
● Example
o A cooking recipe
o Information about a monument such as
the Taj Mahal, or about history.
Technical Details
● Dataset from Kaggle’s ‘StumbleUpon Challenge’
● Contains approximately 10,000 HTML documents for
training and testing purposes.
● To improve our model we will scrape recent data from other
websites.
● Fields
o URL
o Boilerplate text: contains the title and body of the HTML page.
o Number of characters
o Number of links
o Number of words in the URL
o User-determined label (training set only), etc. (see the loading sketch below)
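A sketch of loading these fields, assuming the Kaggle download's tab-separated train.tsv whose boilerplate column holds JSON with title and body keys; the exact column names are assumptions based on the challenge data.

import json
import pandas as pd

# Load the training split and unpack the boilerplate JSON into text columns.
train = pd.read_csv("train.tsv", sep="\t")
boiler = train["boilerplate"].apply(json.loads)
train["title"] = boiler.apply(lambda d: d.get("title") or "")
train["body"] = boiler.apply(lambda d: d.get("body") or "")
y = train["label"]  # user-determined label: 0 = ephemeral, 1 = evergreen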
Preprocessing
● Bag of Words Model:
o Create a dictionary of all the words with their frequency of
occurrence.
o Remove the highest frequency words (filler words like the, is,
etc.), as their presence doesn't help the prediction model.
o Remove the lowest frequency words, which do not occur often
enough to be useful for prediction (see the sketch below).
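The sketch referenced above, using scikit-learn's CountVectorizer as one possible implementation of this bag-of-words filtering and reusing the train frame from the loading sketch earlier; the max_df and min_df thresholds are illustrative values, not the project's tuned settings.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words="english",  # drops filler words like "the", "is"
    max_df=0.95,           # drop words appearing in more than 95% of documents
    min_df=5)              # drop words appearing in fewer than 5 documents
X_counts = vectorizer.fit_transform(train["body"])  # document-term count matrix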
Preprocessing
Another approach: use the term frequency-inverse document frequency (TF-IDF) of
each word.
● TF-IDF is the product of the term frequency (the number of times a word
appears in a given document) and the inverse document frequency.
● Inverse document frequency (IDF): based on how commonly the word appears
across all documents; rarer words receive a higher weight.
Formula for calculating IDF
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
where:
D is the set of training examples (documents)
|D| is the number of training examples
|{d ∈ D : t ∈ d}| is the number of documents where the
word t appears [6].
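As a sketch, scikit-learn's TfidfVectorizer computes tf(t, d) · idf(t, D) for every word directly (its smoothed IDF differs slightly from the plain formula above); it reuses train["body"] from the loading sketch, and the cut-off parameters are again illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=5)
X_tfidf = tfidf.fit_transform(train["body"])  # rows = documents, columns = TF-IDF weights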
Classifier Models used
1. Naive Bayes Model
2. Logistic regression Analysis
3. Support Vector Machine (SVM)
4. Decision tree model - Random Forest
Naïve Bayes Model
● family of simple probabilistic classifiers
● based on applying Bayes' theorem
● strong (naive) independence assumptions between
the features
● popular method for text categorization
● used with word frequencies as features
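A minimal Naive Bayes sketch on the word-count features from the bag-of-words sketch, using scikit-learn's MultinomialNB; the Gaussian variant reported in the results later would instead need dense inputs (e.g. X_counts.toarray()).

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()             # class prior times per-word likelihoods, assuming independence
nb.fit(X_counts, y)              # word-count features and labels from the earlier sketches
nb_pred = nb.predict(X_counts)   # 0 = ephemeral, 1 = evergreen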
Logistic Regression
● used for predicting the outcome of a categorical dependent
variable (i.e., a class label) based on one or more predictor
variables (features)
● makes use of one or more predictor variables that may be
either continuous or categorical data
● Binomial logistic regression used as final predictions are
binary
(0 - ephemeral
1 - long lasting)
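A sketch of the binomial logistic regression step with scikit-learn, assuming the TF-IDF features from the preprocessing sketch; the model estimates P(evergreen | features) and thresholds it at 0.5.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tfidf, y)
prob_evergreen = logreg.predict_proba(X_tfidf)[:, 1]   # P(label = 1)
lr_pred = (prob_evergreen >= 0.5).astype(int)          # 0 = ephemeral, 1 = long lasting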
Support Vector Machine (SVM)
● builds a model that assigns new examples into one
category or the other, making it a non-probabilistic
binary linear classifier
● used to analyze data and recognize patterns
● A set of features that describes one case (i.e., a row of predictor
values) is called a vector.
● creates an optimal hyperplane in the N-dimensional feature space which
separates the data into two categories
Decision Tree Model - Random Forest
● ensemble learning method for classification (and regression).
● operates by constructing a multitude of decision trees at training
time.
● outputs the class that is the mode of the classes output by the
individual trees.
● applies the general technique of bootstrap aggregating, or
bagging.
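A sketch with scikit-learn's RandomForestClassifier on the TF-IDF features from earlier: each of the n_estimators trees is fit on a bootstrap sample and the forest outputs the majority (mode) class; 100 trees is an illustrative size, not the project's setting.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, bootstrap=True, n_jobs=-1)
forest.fit(X_tfidf, y)              # bagged decision trees over the TF-IDF features
rf_pred = forest.predict(X_tfidf)   # majority vote across the trees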
Outliers
Results of Different Models
SVM
● Linear SVM on TF-IDF vectorized body
20-fold CV score: 86.8915%
● Linear SVM on TF-IDF vectorized body after outlier
removal
20-fold CV score: 87.2765%
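A sketch of how a 20-fold cross-validation score of this kind can be computed with a linear SVM on the TF-IDF body features from earlier; the exact preprocessing and outlier-removal steps behind the reported numbers are not reproduced here.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

svm = LinearSVC()
scores = cross_val_score(svm, X_tfidf, y, cv=20)   # 20-fold cross-validation
print("20-fold CV accuracy: %.4f%%" % (100 * scores.mean()))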
Results of Different Models
Gaussian NB:
● 20 Fold CV Score
69.825%
● 20 Fold CV Score after
Outlier Removal
70.379%
Results of Different Models
Random Forest:
● 20 Fold CV Score
79.95%
● 20 Fold CV Score after
Outlier Removal
80.1174%
Results of Different Models
Word Cloud of highest frequency words
Work To Be Done
● Verify outliers using clustering methods such as
K-means clustering and DBSCAN
● Build an ensemble method by combining
multiple models (see the sketch below)
● Apply AdaBoost to improve ensemble
accuracy
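A sketch of the planned ensemble direction under the assumption that scikit-learn's VotingClassifier and AdaBoostClassifier are used, reusing the features and labels from the earlier sketches; the component models, voting scheme, and estimator counts are illustrative choices, not final decisions.

from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Soft voting averages the predicted probabilities of the individual models.
ensemble = VotingClassifier(
    estimators=[("nb", MultinomialNB()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="soft")
ensemble.fit(X_counts, y)

# AdaBoost builds a boosted ensemble of shallow trees as a separate comparison.
boosted = AdaBoostClassifier(n_estimators=200)
boosted.fit(X_counts, y)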
Area of Application
● useful for recommenders attempting to classify different
news stories based on type.
● useful for archival projects to determine what web content
merits inclusion.
● useful for content sites interested in capacity planning for
hosting different pages based on expected longevity.
● useful for ad placement: companies can bid more on ads
displayed on evergreen pages, since those pages remain
visible for a longer time.
Future Scope
● Apply distributed or cloud computing to further improve accuracy
and provide real time classification.
● Predict a page's lifetime (the time for which it receives a fair
amount of hits)
● Sentiment analysis of long lasting web pages
References
[1] Kaggle, “StumbleUpon Evergreen Classification Challenge”,
http://www.kaggle.com/c/stumbleupon
[2] T. Fawcett, “An Introduction to ROC Analysis”, Pattern Recognition Letters,
Issue 27, 2006, pp 861-874
[3] J. Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries”,
ICML, 2003
Thank You
Any Questions ?
