Proceedings of 2014 RAECS UIET Panjab University Chandigarh, 06 – 08 March, 2014
978-1-4799-2291-8/14/$31.00 ©2014 IEEE
A Novel Approach for Content Extraction from Web
Pages
Aanshi Bhardwaj
UIET, Panjab University
Chandigarh-160014, India
bhardwajaanshi@gmail.com
Veenu Mangat
UIET, Panjab University
Chandigarh-160014, India
veenumangat@yahoo.com
Abstract—The rapid development of the internet and of web
publishing techniques has created numerous information sources
published as HTML pages on the World Wide Web. However, web
pages also carry a lot of redundant and irrelevant information:
navigation panels, tables of content (TOC), advertisements,
copyright statements, service catalogs, privacy policies, etc. are
considered irrelevant content. Such information makes various web
mining tasks, such as web page crawling, web page classification,
link-based ranking and topic distillation, more complex. This paper
discusses various approaches for extracting informative content
from web pages and presents a new approach for content extraction
using the word to leaf ratio and the density of links.
Keywords—Content extraction; Entropy; Document Object
Model; hub and authority; ontology generation; template; Content
Structure Tree; web page segmentation; Vision Based Page
Segmentation; clustering; anchor text.
I. INTRODUCTION
The WWW is now a popular medium through which people all
around the world spread and gather information of all kinds.
However, dynamically generated web pages of various sites also
contain undesired information, called noisy or irrelevant content.
Mainly, advertisements, copyright statements, privacy statements,
logos, tables of content, navigation panels, footers and headers
come under noisy content. Tables of content and navigation panels
are provided to make it easier for users to navigate through web
pages. These blocks are also called redundant blocks because they
are present on almost every web page. It has been estimated that
almost 40-50% of the content of a web page can be considered
irrelevant [1].
A user is primarily interested in the main content of a web
page. The process of identifying the main content blocks of a web
page is called content extraction, a term coined by Rahman [2].
Content extraction has many applications. It lets users access
information in a timely and efficient manner, since irrelevant and
redundant information is removed. The performance of search
engines also improves, because they no longer waste time and
memory indexing and storing irrelevant and redundant content;
content extraction therefore acts as a preprocessor of web pages
for search engines. It also helps users who access the internet
through small-screen devices, because they can easily locate the
relevant content, which would otherwise be difficult to find on a
small screen. It makes several web mining tasks, such as web page
crawling, web page classification, link-based ranking and topic
distillation, simpler. It can help in automatically generating
Rich Site Summary (RSS) feeds from blogs or articles. Content
extraction is also used in applications like ontology generation.
Many methods have been developed to extract content
blocks from web pages. Lin and Ho [3] proposed a method
named InfoDiscoverer, which uses the <table> tag to divide a
web page into blocks, extracts features from the blocks and
calculates the entropy of these features. The entropy value is
then used to determine whether a block is informative. A
limitation of this method is that it cannot segment pages built
with tags other than <table>, such as <div>. Experiments were
performed only on Chinese-language news websites.
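As a toy illustration of the entropy idea (this is not Lin and Ho's exact formulation; the logarithm base and the averaging over features are our assumptions), a feature that occurs uniformly across the pages of a site has entropy close to 1, and a block made of such features is likely redundant:

```python
import math

def term_entropy(term, pages):
    # Distribution of the term over all pages of the site.
    counts = [page.count(term) for page in pages]
    total = sum(counts)
    if total == 0 or len(pages) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    # Log base = number of pages, so 1.0 means "uniformly spread".
    return -sum(p * math.log(p, len(pages)) for p in probs)

def block_entropy(block_terms, pages):
    # A block whose features all recur across the site (entropy near 1)
    # is probably template material rather than informative content.
    return sum(term_entropy(t, pages) for t in block_terms) / len(block_terms)
```

Kao and Lin [4]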
proposed a method that uses the HITS (Hyperlink Induced Topic
Search) algorithm to obtain a concise structure of a web site by
removing irrelevant structures, and then applies the
InfoDiscoverer method on the filtered structure. This method
improves on InfoDiscoverer because it operates on the filtered
structure instead of the whole web page. The HITS algorithm works
by finding hub and authority pages, but it has difficulty
identifying hub pages that are linked to only a few authority
pages. Kao [5] proposed the
WISDOM (Web Intrapage Informative Structure Mining
Based on Document Object Model) method, which evaluates the
amount of information contained in each node of the DOM
(Document Object Model) tree with the help of information
theory. It first divides the original DOM tree into subtrees and
chooses candidate subtrees using an assigned threshold. A
top-down greedy algorithm is then applied to select the
informative blocks and a skeleton set, which consists of
candidate informative structures. Merging and expanding
operations are applied to the skeleton set to obtain the required
informative blocks; pseudo-informative nodes are removed during
merging.
Debnath [6] proposed four algorithms, ContentExtractor,
FeatureExtractor, K-FeatureExtractor and L-Extractor, for
separating content blocks from irrelevant content.
ContentExtractor finds redundant blocks based on the occurrence
of the same block across multiple web pages. FeatureExtractor
identifies the content block with the help of a particular
feature. K-FeatureExtractor applies K-means clustering to
retrieve multiple blocks, whereas FeatureExtractor selects a
single block. L-Extractor combines the VIPS block-partitioning
algorithm (Vision-based Page Segmentation) [18] with a support
vector machine to identify content blocks in a web page.
ContentExtractor and FeatureExtractor were evaluated on different
websites, and both performed better than InfoDiscoverer in nearly
all cases.
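The redundancy test at the heart of ContentExtractor can be illustrated roughly as follows; fingerprinting blocks by their normalized text and the 50% frequency cut-off are our simplifications, not the authors' exact similarity measure:

```python
from collections import Counter

def fingerprint(block_text):
    # Normalize whitespace and case so identical template blocks collide.
    return " ".join(block_text.split()).lower()

def redundant_blocks(pages_blocks, min_fraction=0.5):
    # pages_blocks: one list of block texts per page of the same site.
    seen = Counter()
    for blocks in pages_blocks:
        seen.update({fingerprint(b) for b in blocks})
    n_pages = len(pages_blocks)
    return {fp for fp, count in seen.items()
            if count / n_pages >= min_fraction}
```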
Wang [7] proposed a method based on the fundamental
information of web pages. In the first step, the method extracts
information from each web page and combines it to obtain
site-level information. The extracted information includes text
nodes (the data inside tags), word lengths, menu subtrees
(subtrees whose text nodes have length less than five), menu item
information and menu instance information. In the second step, an
entropy estimation method is applied to discover the actual
information required by users. The information extracted by this
method helps in classifying web pages and in domain ontology
generation. Tseng and Kao [8] proposed a
method based on three novel features, similarity, density and
diverseness, for identifying information on web pages. The block
with the maximum value of these three features is considered the
informative block. The similarity feature identifies similar sets
of objects in a group, the density feature indicates the degree
of similar objects in a particular area, and diverseness measures
the distribution of features among various objects. Huang [9]
proposed a method employing block
preclustering technology, which consists of two phases: a
matching phase and a modeling phase. In the matching phase, the
web page is first partitioned into blocks based on VIPS
(Vision-based Page Segmentation) [18]. A nearest neighbor
clustering algorithm then clusters the partitioned blocks by
structural similarity. An importance degree is associated with
each cluster, and the clusters with their importance degrees are
stored in a clustered pattern database. In the modeling phase,
when a new web page arrives, it is first partitioned into blocks,
and these blocks are matched against the clustered pattern
database to obtain their importance degrees. Entropy evaluation
is then performed on the blocks to determine whether they are
informative.
Kang and Choi [10] proposed the RIPB (Recognizing Informative
Page Blocks) algorithm, which uses visual block segmentation.
This method also partitions the web page into blocks based on
VIPS, and blocks with similar structure are grouped into
clusters. A linear weighted function, based on the tokens (bits
of text) and the area of the cluster, is applied to determine
whether a block is informative. Li [11] proposed a novel
algorithm for extracting informative blocks based on a new tree,
the Content Structure Tree (CST), which is a good source for
examining the structure and content of a web page. After the CST
is created, the content weight of each block of the web page is
calculated. The Extract Informative Content Blocks algorithm
proposed by the authors is then used to extract blocks from the
CST.
Fei [12] used another type of tree, the Semantic DOM (SDOM),
for content extraction in the e-commerce domain. This tree
combines structural information with semantic meaning for
efficient extraction; different wrappers convert tags into the
corresponding information.
Gottron [13] proposed the content code vector (CCV) approach,
in which the characters of a web page are represented as 1 for
content and 0 for tags; the resulting sequence is called the
content code vector. A content code ratio (CCR) is calculated,
which determines the amount of content relative to code around
each position of the CCV. A high CCR value indicates more text
and fewer tags.
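A rough sketch of this idea follows; the regular-expression tokenization, window size and threshold are our assumptions rather than the paper's exact content code blurring procedure:

```python
import re

def content_code_vector(html):
    # 1 for content characters, 0 for markup characters.
    vector = []
    for token in re.split(r'(<[^>]*>)', html):  # capture group keeps the tags
        bit = 0 if token.startswith('<') else 1
        vector.extend([bit] * len(token))
    return vector

def content_regions(vector, window=40, threshold=0.7):
    # CCR around each position: share of content bits in a sliding window.
    half = window // 2
    flags = []
    for i in range(len(vector)):
        lo, hi = max(0, i - half), min(len(vector), i + half)
        ccr = sum(vector[lo:hi]) / max(1, hi - lo)
        flags.append(ccr >= threshold)          # high CCR = mostly text
    return flags
```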
Weninger [14] developed another approach, CETR, that makes use
of tag ratios. The tag ratio of a line is computed as the number
of non-HTML-tag characters divided by the number of HTML tags; a
threshold on these values then separates content from non-content
blocks. A problem with this approach is that the code of a web
page may be indented or unindented, which changes how the code is
distributed across lines and thus yields different tag ratio
values.
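A hedged sketch of per-line tag ratios in this spirit follows; the smoothing and clustering steps of the actual CETR method are omitted, and the threshold value is our assumption:

```python
import re

TAG = re.compile(r'<[^>]*>')

def tag_ratio(line):
    tags = TAG.findall(line)
    text_chars = len(TAG.sub('', line))
    return text_chars / max(1, len(tags))   # non-tag characters per tag

def content_lines(html, threshold=10.0):
    # Lines whose tag ratio clears the threshold are kept as content.
    return [line for line in html.splitlines()
            if tag_ratio(line) >= threshold]
```

Nguyen [15] creates a template which consists of paths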
to content blocks, along with the non-content blocks that share a
path with a content block. Any new web page is compared with the
stored template to determine whether it contains content blocks.
However, this approach works only for particular types of web
pages. Yang [16] used three parameters, node link
text density, non-anchor text density and punctuation mark
density, together in the extraction process. The main idea behind
these three densities is that non-informative blocks contain
fewer punctuation marks, more anchor text and less plain text.
Uzun [17] proposed a hybrid approach that
combines automatic and manual techniques for the extraction
process. Machine learning methods are used to derive rules for
extraction, and the authors found decision tree learning to be
the best method for creating these rules. Informative content is
then extracted using the rules and simple string manipulation
functions. If the rules fail to retrieve informative blocks, the
machine learning method derives new rules.
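As a rough illustration of such a rule-learning loop, the sketch below uses scikit-learn's decision tree as the learner; the block features named in the comments are assumed for illustration and are not the feature set of [17]:

```python
from sklearn.tree import DecisionTreeClassifier

def learn_rules(block_features, labels):
    # block_features: numeric features per labeled block (e.g. link
    # density, text length); labels: 1 = informative, 0 = noise.
    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(block_features, labels)
    return tree

def extract_informative(tree, blocks, block_features):
    # Keep the blocks the learned rules classify as informative.
    predictions = tree.predict(block_features)
    return [b for b, p in zip(blocks, predictions) if p == 1]
```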
II. PROPOSED APPROACH
Our approach combines the word to leaf ratio (WLR) [19] with
the link attributes [20] of nodes for content extraction.
Previous techniques counted characters instead of words, which
needlessly gives more importance to long words, so words are used
instead. Leaves are examined in this ratio because they are the
only nodes that contain textual information. The approach in [19]
does not consider that a block containing more links is less
informative than a block containing fewer links. Adding the text
link and link text ratios to the word to leaf ratio therefore
yields a new, more effective approach.
Our method performs the following steps (an illustrative
sketch of the whole pipeline follows the list):
1. Construct DOM (Document Object Model) tree of
web page.
2. Remove noisy nodes: title, script, head, meta, style,
noscript, link, select and #comment nodes, as well as
nodes that contain no words or are not visible.
3. Compute the word to leaf ratio (WLR) as in Eq. 1:
WLR(n) = tw(n) / l(n) (1)
where tw(n) is the number of words in node n and
l(n) is the number of leaves in the subtree of node n.
4. Obtain the initial node set, which contains the nodes
with a higher density of text. The initial node set I
is defined as the set of nodes satisfying Eq. 2:
WLR(n) ≥ √(maxWLR × WLR(root)) (2)
5. Compute the text link ratio (TLR), i.e. the ratio of
the length of the text to the number of links, of each
node n in I.
6. Compute the text to link text ratio (TLTR), i.e. the
ratio of the length of the text to the length of the
link text, of each node n in I.
7. Calculate the weight W(n) of each node as in Eq. 3:
W(n) = (TLR(n) + TLTR(n)) / a (3)
where a is a normalizing factor.
8. The relevance R(n) of node n is defined by Eq. 4,
Eq. 5, Eq. 6 and Eq. 7:
R(n) = WLR(n) × max( w(n), Σ R(n1) ), n1 ∈ Children(n) (4)
where
w(n) = rpos(n) × rWLR(n) if n ∈ I, and w(n) = 0 if n ∉ I (5)
rpos(n) = 1 − (id(n) − minid) / (maxid − minid) (6)
rWLR(n) = (WLR(n) − minWLR) / (maxWLR − minWLR) (7)
where minid and maxid are the minimum and maximum values of the
identifiers in I, and minWLR and maxWLR are the minimum and
maximum WLR values from step 3.
Relevance favours nodes with a higher density of text. If two
nodes have the same R(n), the node with the lower identifier is
selected.
9. Select the node with the highest value of the
combined score 0.7 × R(n) + 0.3 × W(n).
10. The selected node contains the required textual
information.
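The following is a minimal sketch of steps 1-10, assuming the lxml library for DOM construction; the helper names, the guards against division by zero, and the choice a = 2 for the normalizing factor are our own assumptions, not part of the method as published:

```python
import math
import lxml.html

NOISE_TAGS = {"title", "script", "head", "meta", "style", "noscript",
              "link", "select"}

def build_tree(html):
    # Step 1: construct the DOM tree of the web page.
    return lxml.html.fromstring(html)

def remove_noise(root):
    # Step 2: drop noisy tags, comments, and nodes without words.
    for node in list(root.iter()):
        if node is root:
            continue
        is_comment = not isinstance(node.tag, str)  # comments have a non-str tag
        if (is_comment or node.tag.lower() in NOISE_TAGS
                or not node.text_content().split()):
            node.drop_tree()

def wlr(node):
    # Step 3, Eq. 1: words in the node over leaves in its subtree.
    words = len(node.text_content().split())
    leaves = sum(1 for d in node.iter() if len(d) == 0) or 1
    return words / leaves

def initial_set(root):
    # Step 4, Eq. 2: keep nodes whose WLR clears the threshold.
    ratios = {n: wlr(n) for n in root.iter()}
    threshold = math.sqrt(max(ratios.values()) * ratios[root])
    return {n for n, r in ratios.items() if r >= threshold}, ratios

def weight(node, a=2.0):
    # Steps 5-7, Eq. 3: TLR and TLTR combined into the node weight.
    text_len = len(node.text_content())
    anchors = node.findall(".//a")
    link_text_len = sum(len(x.text_content()) for x in anchors)
    tlr = text_len / max(1, len(anchors))      # guard: node with no links
    tltr = text_len / max(1, link_text_len)    # guard: links with no text
    return (tlr + tltr) / a

def relevance(root, members, ratios):
    # Step 8, Eqs. 4-7: bottom-up relevance over the DOM tree.
    order = {n: i for i, n in enumerate(root.iter())}  # document-order ids
    ids = [order[n] for n in members]
    wlrs = [ratios[n] for n in members]
    min_id, max_id = min(ids), max(ids)
    min_w, max_w = min(wlrs), max(wlrs)

    def w(n):
        if n not in members:                           # Eq. 5: n not in I
            return 0.0
        rpos = 1 - (order[n] - min_id) / ((max_id - min_id) or 1)  # Eq. 6
        rwlr = (ratios[n] - min_w) / ((max_w - min_w) or 1)        # Eq. 7
        return rpos * rwlr

    scores = {}
    def compute(n):
        child_sum = sum(compute(c) for c in n if isinstance(c.tag, str))
        scores[n] = ratios[n] * max(w(n), child_sum)   # Eq. 4
        return scores[n]
    compute(root)
    return scores

def extract(html):
    # Steps 9-10: the best-scoring node holds the main textual content.
    root = build_tree(html)
    remove_noise(root)
    members, ratios = initial_set(root)
    scores = relevance(root, members, ratios)
    best = max(scores, key=lambda n: 0.7 * scores[n] + 0.3 * weight(n))
    return best.text_content()
```

For instance, extract(html_source) would return the text of the best-scoring node; in practice R(n) and W(n) may need to be normalized to comparable ranges before being mixed with the 0.7/0.3 weights.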
III. CONCLUSION
Informative content extraction from web pages is very
important because web pages are unstructured and their number is
growing at a very fast rate. Content extraction is useful for
human users, who obtain the required information in a
time-efficient manner, and it also serves as a preprocessing
stage for systems such as robots, indexers and crawlers that need
the main content of a web page and must avoid processing noisy,
irrelevant and useless information. We have presented traditional
approaches for extracting the main content from web pages, as
well as a new approach for content extraction based on the word
to leaf ratio and link density.
IV. FUTURE SCOPE
Automatic content extraction is an emerging research field,
as the amount and type of information on the web is increasing
and changing every day. We will evaluate our method on different
types of websites, such as news, shopping, business, e-commerce
and education sites, and compare its precision and recall with
existing methods. We will also try to incorporate hypertext
information into the method and work on event information
generation.
ACKNOWLEDGMENT
I sincerely thank all those who helped me in completing this
task.
REFERENCES
[1] D.Gibson, K.Punera and A.Tomkins, “The volume and evolution of web
page templates”, proceedings of WWW '05 Special interest tracks and
posters of the 14th International Conference on World Wide Web, pp.
830-839, 2005.
[2] A.F.R.Rahman, H.Alam and R.Hartono, “Content extraction from
HTML documents”, International workshop on Web Document
Analysis, pp. 7-10, 2001.
[3] S.Lin and J.Ho, “Discovering informative content blocks from web
documents”, Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 450-453, 2002.
[4] H.Kao and S.Lin, “Mining web informative structures and
contents based on entropy analysis”, IEEE Trans. Knowledge and Data
Eng., vol. 16, no. 1, pp. 41-55, 2004.
[5] H.Kao and J.Ho, “WISDOM: web intrapage informative structure
mining based on document object model”, IEEE Trans. Knowledge and
Data Eng., vol. 17, no. 5, pp. 614-627, 2005.
[6] S.Debnath, P.Mitra, N.Pal and C.Giles, “Automatic identification of
informative sections of web pages”, IEEE Trans. Knowledge and Data
Eng., vol. 17, no. 9, pp. 1233-1246, 2005.
[7] C.Wang, J.Lu and G.Zhang, “Mining key information of web pages: a
method and its application”, Expert Systems with Applications, vol. 33,
no. 2, pp. 425-433, 2006.
[8] Y.Tseng and H.Kao, “The mining and extraction of primary informative
blocks and data objects from systematic web pages”, Proceedings of the
IEEE/WIC/ACM International Conference on Web Intelligence, pp.
370-373, 2006.
[9] C.Huang, P.Yen, Y.Hung, T.Chuang and H.Lee, “Enhancing entropy-
based informative block identification using block preclustering
technology”, IEEE International Conference on Systems, Man, and
Cybernetics, pp. 2640-2645, 2006.
[10] J.Kang and J.Choi, “Detecting informative web page blocks for efficient
information extraction using visual block segmentation”, International
Symposium on Information Technology Convergence, pp. 306-310,
2007.
[11] Y.Li and J.Yang, “A novel method to extract informative blocks from
web pages”, International Joint Conference on Artificial Intelligence, pp.
536-539, 2009.
[12] Y.Fei, Z.Luo, Y.Xu and W.Zhang, “A semantic dom approach for
webpage information extraction”, International Conference on
Management and Service Science, pp. 1-5, 2009.
[13] T.Gottron, “Content code blurring: a new approach to content
extraction”, 19th International Conference on Database and Expert
Systems Application, pp. 29-33, 2008.
[14] T.Weninger, W.-H.Hsu and J.Han, “CETR: content extraction via tag
ratios”, Proceedings of the 19th International Conference on World Wide
Web (WWW '10), pp. 971-980, 2010.
[15] D.Nguyen, D.Nguyen, S.Pham and T.Bui, “A fast template-based
approach to automatically identify primary text content of a web page”,
International Conference on Knowledge and Systems Engineering, pp.
232-236, 2009.
[16] D.Yang and J.Song, “Web content information extraction approach
based on removing noise and content-features”, 2010 International
Conference on Web Information Systems and Mining, pp. 246-249,
2010.
[17] E.Uzun, H.Agun and T.Yerlikaya, “A hybrid approach for extracting
informative content from web pages”, Information Processing and
Management, vol. 49, no. 4, pp. 1248-1257, 2013.
[18] D.Cai, S.Yu, J.-R.Wen and W.-Y.Ma, “Extracting content structure for
web pages based on visual representation”, In Proceedings of Fifth Asia
Pacific Web Conference, pp. 406-417, 2003.
[19] D.Insa, J.Silva and S.Tamarit, “Using the words/leafs ratio in the DOM
tree for content extraction”, The Journal of Logic and Algebraic
Programming, vol. 82, no. 8, pp. 311-325, 2013.
[20] S.Shen and H.Zhang, “Block-level links based content extraction”,
Fourth International Symposium on Parallel Architectures, Algorithms
and Programming, pp. 330-333, 2011.
