Adaptive Web-page Content Identification
Authors: John Gibson, Ben Wellner, Susan Lubar
Publication: WIDM 2007
Presenter: Jhih-Ming Chen
Outline
- Introduction
- Content Identification as Sequence Labeling
- Models
  - Conditional Random Fields
  - Maximum Entropy Classifier
  - Maximum Entropy Markov Models
- Data
  - Harvesting and Annotation
  - Dividing into Blocks
- Experimental Setup
- Results and Analysis
- Conclusion and Future Work
Introduction
- Web pages containing news stories also include many other pieces of extraneous information, such as navigation bars, JavaScript, images, and advertisements.
- This paper's goal:
  - Detect identical and near-duplicate articles within a set of web pages from a conglomerate of websites.
- Why do this work?
  - Provide input for an application such as a Natural Language tool, or for a search-engine index.
  - Re-display content on a small screen such as a cell phone or PDA.
Introduction
- Typically, content extraction is done via a hand-crafted tool targeted at a single web-page format.
- Shortcomings:
  - When the page format changes, the extractor is likely to break.
  - It is labor intensive: web-page formats change fairly quickly, and custom extractors often become obsolete a short time after they are written.
  - Some websites use multiple formats concurrently; identifying each one and handling them properly makes this a complex task.
- In general, site-specific extractors are unworkable as a long-term solution. The approach described in this paper is meant to overcome these issues.
Introduction
- The data set for this work consisted of web pages from 27 different news sites.
- Identifying portions of relevant content in web pages can be construed as a sequence labeling problem: each document is broken into a sequence of blocks, and the task is to label each block as Content or NotContent.
- The best system, based on Conditional Random Fields, can correctly identify individual Content blocks with recall above 99.5% and precision above 97.9%.
Content Identification as Sequence Labeling
- Problem description: identify the portions of news-source web pages that contain relevant content, i.e. the news article itself.
- Two general ways to approach this problem:
  - Boundary detection: identify positions where content begins and ends.
  - Sequence labeling: divide the original web document into a sequence of appropriately sized units or blocks, then categorize each block as Content or NotContent.
Content Identification as Sequence Labeling
- In this work, the authors focus largely on the sequence labeling method.
- Shortcomings of boundary detection methods:
  - A number of web pages contain 'noncontiguous' content: the paragraphs of the article body have other page content, such as advertisements, interspersed.
  - Boundary detection methods cannot nicely model transitions from Content to NotContent and back to Content. If a boundary is not identified at all, an entire section of content can be missed.
  - When developing a statistical classifier to identify boundaries, there are many more negative examples of boundaries than positive ones. It may be possible to sub-sample negative examples, or to identify some reasonable set of candidate boundaries and train a classifier on those, but this complicates matters greatly.
Models
- This section describes the three statistical sequence labeling models employed in the experiments:
  - Conditional Random Fields (CRF)
  - Maximum Entropy Classifiers (MaxEnt)
  - Maximum Entropy Markov Models (MEMM)
Conditional Random Fields
- Let x = x_1, x_2, ..., x_n be a sequence of observations, such as a sequence of words, paragraphs, or, as in this setting, a sequence of HTML "segments".
- Given a set of possible output values (i.e., labels), sequence CRFs define the conditional probability of a label sequence y = y_1, y_2, ..., y_n as shown below, where Z_x is a normalization term over all possible label sequences, i is the current position, and y_i and y_{i-1} are the current and previous labels.
- Often, the range of the feature functions f_k is {0, 1}.
- Associated with each feature function f_k is a learned parameter λ_k that captures the strength and polarity of the correlation associated with the label transition from y_{i-1} to y_i.
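The slide's equation was an image that did not survive extraction; the standard linear-chain CRF definition consistent with the notation above (a reconstruction, not copied from the slide) is:

```latex
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z_{\mathbf{x}}}
  \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, \mathbf{x}, i) \right),
\qquad
Z_{\mathbf{x}} =
  \sum_{\mathbf{y}'} \exp\!\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, \mathbf{x}, i) \right)
```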
Conditional Random Fields
- Training
  - D = {(y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(m), x^(m))}: a set of training data consisting of pairs of label and observation sequences.
  - The model parameters are learned by maximizing the conditional log-likelihood of the training data, which is simply the sum of the log-probabilities assigned by the model to each label-observation sequence pair (reconstructed below).
  - The second term in the objective is a Gaussian prior over the parameters (a regularization term), which helps the model overcome over-fitting.
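The objective on the slide was also an image; a standard form of the regularized conditional log-likelihood it describes (a reconstruction, with σ² the variance of the Gaussian prior) is:

```latex
\mathcal{L}(\Lambda) =
  \sum_{j=1}^{m} \log p\!\left(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)}\right)
  \;-\; \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```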
Conditional Random Fields
- Decoding (testing)
  - Given a trained model, decoding in sequence CRFs involves finding the most likely label sequence for a given observation sequence.
  - There are N^M possible label sequences, where N is the number of labels and M is the length of the sequence.
  - Dynamic programming, specifically a variation on the Viterbi algorithm, can find the optimal sequence in time linear in the length of the sequence and quadratic in the number of possible labels.
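A minimal NumPy sketch of that Viterbi recursion for a linear chain, assuming per-position label scores and transition scores have already been computed in log space (the arrays and scores below are illustrative, not from the paper):

```python
import numpy as np

def viterbi(unary, trans):
    """Most likely label sequence for a linear-chain model.

    unary: (M, N) array of per-position label scores (log-space)
    trans: (N, N) array of transition scores, trans[a, b] = score of a -> b
    Runs in O(M * N^2) time instead of enumerating all N^M sequences.
    """
    M, N = unary.shape
    score = unary[0].copy()            # best score ending in each label at position 0
    backptr = np.zeros((M, N), dtype=int)
    for i in range(1, M):
        cand = score[:, None] + trans  # cand[a, b]: best path ending in a, then a -> b
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[i]
    # Follow back-pointers from the best final label.
    path = [int(score.argmax())]
    for i in range(M - 1, 0, -1):
        path.append(int(backptr[i][path[-1]]))
    return path[::-1]

# Toy usage: 4 blocks, labels 0 = NotContent, 1 = Content.
unary = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2]]))
trans = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
print(viterbi(unary, trans))  # -> [0, 1, 1, 0]
```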
Maximum Entropy Classifier
- MaxEnt classifiers are conditional models that, given a set of parameters, produce a conditional multinomial distribution over the labels (reconstructed below).
- In contrast with CRFs, MaxEnt models (and classifiers generally) are "state-less" and do not model any dependence between different positions in the sequence.
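The distribution referenced on the slide was an image; the standard MaxEnt (multinomial logistic) form it refers to (a reconstruction, not copied from the slide) is:

```latex
p(y \mid \mathbf{x}) =
  \frac{\exp\!\left( \sum_k \lambda_k f_k(y, \mathbf{x}) \right)}
       {\sum_{y'} \exp\!\left( \sum_k \lambda_k f_k(y', \mathbf{x}) \right)}
```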
Maximum Entropy Markov Models
- MEMMs model a state sequence just as CRFs do, but use a "local" training method rather than performing global inference over the sequence at each training iteration.
- Viterbi decoding is employed, as with CRFs, to find the best label sequence.
- MEMMs are prone to various biases that can reduce their accuracy, because they do not normalize over the entire sequence as CRFs do.
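Concretely, an MEMM factors the sequence probability into locally normalized per-position distributions (a standard formulation, not shown on the slide); this per-position normalization is the source of the bias mentioned above:

```latex
p(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, \mathbf{x}, i),
\quad
p(y_i \mid y_{i-1}, \mathbf{x}, i) =
  \frac{\exp\!\left( \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i) \right)}
       {\sum_{y'} \exp\!\left( \sum_k \lambda_k f_k(y_{i-1}, y', \mathbf{x}, i) \right)}
```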
Data
- Harvesting and Annotation
  - 1620 labeled documents from 27 of the sites.
  - 388 distinct articles within the 1620 documents.
  - This large number of duplicate and near-duplicate documents introduced bias into the training set.
Data
- Dividing into Blocks (a sketch of this procedure follows below)
  - Sanitize the raw HTML and transform it into XHTML.
  - Exclude all words inside style and script tags.
  - Tokenize the document.
  - Divide the entire sequence of <lex> elements into smaller sequences called blocks, breaking at every tag except the following: <a>, <span>, <strong>, etc.
  - Wrap each block as tightly as possible with a <span>.
  - Create features based on the material from the <span> tags in each block.
- There is a large skew of NotContent vs. Content blocks: 234,436 NotContent blocks vs. 24,388 Content blocks.
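A minimal Python sketch of the block-division step, assuming blocks break at every tag except a small inline set (the exact exception list, sanitization, and <lex> tokenization in the paper are richer than shown here):

```python
from html.parser import HTMLParser

# Tags that do NOT start a new block, per the slide's exception list;
# the paper's full set is longer ("...etc.").
INLINE_TAGS = {"a", "span", "strong", "em", "b", "i"}
SKIP_TAGS = {"style", "script"}  # words inside these are excluded

class BlockSplitter(HTMLParser):
    """Illustrative sketch: split a page's token stream into blocks,
    starting a new block at every non-inline tag boundary."""
    def __init__(self):
        super().__init__()
        self.blocks = [[]]
        self._skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag not in INLINE_TAGS and self.blocks[-1]:
            self.blocks.append([])          # block boundary
    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag not in INLINE_TAGS and self.blocks[-1]:
            self.blocks.append([])          # block boundary
    def handle_data(self, data):
        if not self._skip_depth:
            self.blocks[-1].extend(data.split())

splitter = BlockSplitter()
splitter.feed("<p>Article text <a href='#'>link</a> more.</p><div>Nav bar</div>")
print([b for b in splitter.blocks if b])
# [['Article', 'text', 'link', 'more.'], ['Nav', 'bar']]
```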
Experimental Setup
Experimental Setup
- Data Set Creation: the authors ran four separate cross-validation experiments to measure the bias introduced by duplicate articles and mixed sources (a bundling sketch follows below).
  - Duplicates, Mixed Sources: split documents with 75% in the training set and 25% in the testing set.
  - No Duplicates, Mixed Sources: this split prevented duplicate documents from spanning the training/testing boundary.
  - Duplicates Allowed, Separate Sources: create four separate bundles, each containing 6 or 7 sources; fill the bundles in round-robin fashion by selecting a source at random; three bundles were assigned to training and one to testing.
  - No Duplicates, Separate Sources: additionally ensure that no duplicates crossed the training/test set boundary.
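A small sketch of the round-robin bundling described above, under the assumption that "selecting a source at random" amounts to shuffling the sources before dealing them out (function and variable names here are hypothetical):

```python
import random

def bundle_sources(sources, n_bundles=4, seed=0):
    """Shuffle the sources, then deal them into bundles round-robin."""
    rng = random.Random(seed)
    pool = list(sources)
    rng.shuffle(pool)                       # random source selection
    bundles = [[] for _ in range(n_bundles)]
    for i, src in enumerate(pool):
        bundles[i % n_bundles].append(src)  # round-robin deal
    return bundles

# 27 news sites into 4 bundles gives the slide's "6 or 7 sources" each.
sources = [f"site{j}" for j in range(27)]
print([len(b) for b in bundle_sources(sources)])  # [7, 7, 7, 6]
```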
Results and Analysis
- Each of the three different models, CRF, MEMM, and MaxEnt, is evaluated on each of the four different data setups.
- All results are a weighted average of four-fold cross validation.
Results and Analysis
- Feature Type Analysis (results table on the slide): results for the CRF when individually removing each class of feature, using the "No Duplicates; Separate Sources" data set.
Results and Analysis
- Contiguous vs. Non-Contiguous performance (results table on the slide): document-level results.
Results and Analysis
- Amount of Training Data (plot on the slide): results for the CRF with varying quantities of training data, using the "No Duplicates; Separate Sources" experimental setup.
Results and Analysis
- Error Analysis
  - NotContent blocks interspersed within portions of Content were falsely labeled as Content.
  - Sometimes section headers were incorrectly singled out as NotContent.
  - At the beginning and end of articles, the first or last block would sometimes incorrectly be labeled NotContent.
Conclusion and Future Work
- Sequence labeling emerged as the clear winner, with CRF edging out MEMM.
- MaxEnt was only competitive on the easiest of the four data sets and is not a viable alternative to site-specific wrappers.
- Future work includes applying the techniques to additional data: more sources, different languages, and other types of data such as weblogs.
- Another interesting avenue: semi-Markov CRFs.
