Proceedings of 2014 RAECS UIET Panjab University Chandigarh, 06 – 08 March, 2014
978-1-4799-2291-8/14/$31.00 ©2014 IEEE
A Novel Approach for Content Extraction from Web
Pages
Aanshi Bhardwaj
UIET, Panjab University
Chandigarh-160014, India
bhardwajaanshi@gmail.com
Veenu Mangat
UIET, Panjab University
Chandigarh-160014, India
veenumangat@yahoo.com
Abstract—The rapid development of the internet and of web
publishing techniques has created numerous information sources
published as HTML pages on the World Wide Web. However, web
pages also carry a lot of redundant and irrelevant information:
navigation panels, tables of content (TOC), advertisements,
copyright statements, service catalogs, privacy policies, etc. are
considered irrelevant content. Such information makes various web
mining tasks, such as web page crawling, web page classification,
link-based ranking and topic distillation, more complex. This paper
discusses various approaches for extracting informative content
from web pages and presents a new approach for content extraction
using the word to leaf ratio and the density of links.
Keywords—Content extraction; Entropy; Document Object
Model; hub and authority; ontology generation; template; Content
Structure Tree; web page segmentation; Vision Based Page
Segmentation; clustering; anchor text.
I. INTRODUCTION
The WWW is now a popular medium through which people all
around the world spread and gather information of all kinds.
However, dynamically generated web pages of various sites also
contain undesired information, called noisy or irrelevant content.
Mainly, advertisements, copyright statements, privacy statements,
logos, tables of content, navigation panels, footers and headers
come under noisy content. Tables of content and navigation panels
are provided to make it easier for users to navigate through web
pages. These blocks are also called redundant blocks because they
are present on almost every web page. It has been estimated that
almost 40-50% of the content of a web page can be considered
irrelevant [1].
A user is primarily interested in the main content of a web
page. The process of identifying the main content blocks of a web
page is called content extraction, a term coined by Rahman [2].
Content extraction has many applications. It lets users access
information in a timely and efficient manner, since irrelevant and
redundant information is removed. The performance of search
engines also improves, because they no longer waste time and
memory indexing and storing irrelevant and redundant content;
content extraction therefore acts as a preprocessor of web pages
for search engines. It also helps users who access the internet
through small-screen devices, because they can easily locate the
relevant content, which would otherwise be difficult to find on a
small screen. It makes several web mining tasks, such as web page
crawling, web page classification, link-based ranking and topic
distillation, simpler. It can help in automatically generating
Rich Site Summary (RSS) feeds from blogs or articles. Content
extraction is also used in applications like ontology generation.
Many methods have been developed to extract content
blocks from web pages. Lin and Ho [3] proposed a method
named InfoDiscoverer, which uses the <table> tag to divide a
web page into blocks, extracts features from the blocks and
calculates the entropy of these features. The entropy value is
then used to determine whether a block is informative. A
limitation of this method is that it cannot segment pages built
with tags other than <table>, such as <div>. Experiments were
performed only on Chinese-language news websites.
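As a toy illustration of the entropy idea (this is not Lin and Ho's exact formulation; the logarithm base and the averaging over features are our assumptions), a feature that occurs uniformly across the pages of a site has entropy close to 1, and a block made of such features is likely redundant:

```python
import math

def term_entropy(term, pages):
    # Distribution of the term over all pages of the site.
    counts = [page.count(term) for page in pages]
    total = sum(counts)
    if total == 0 or len(pages) < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    # Log base = number of pages, so 1.0 means "uniformly spread".
    return -sum(p * math.log(p, len(pages)) for p in probs)

def block_entropy(block_terms, pages):
    # A block whose features all recur across the site (entropy near 1)
    # is probably template material rather than informative content.
    return sum(term_entropy(t, pages) for t in block_terms) / len(block_terms)
```

Kao and Lin [4]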
proposed a method that uses the HITS (Hyperlink Induced Topic
Search) algorithm to obtain a concise structure of a web site by
removing irrelevant structures, and then applies the
InfoDiscoverer method on the filtered structure. This method
improves on InfoDiscoverer because it operates on the filtered
structure instead of the whole web page. The HITS algorithm works
by finding hub and authority pages, but it has difficulty
identifying hub pages that are linked to only a few authority
pages. Kao [5] proposed the
WISDOM (Web Intrapage Informative Structure Mining
Based on Document Object Model) method, which evaluates the
amount of information contained in each node of the DOM
(Document Object Model) tree with the help of information
theory. It first divides the original DOM tree into subtrees and
chooses candidate subtrees using an assigned threshold. A
top-down greedy algorithm is then applied to select the
informative blocks and a skeleton set, which consists of
candidate informative structures. Merging and expanding
operations are applied to the skeleton set to obtain the required
informative blocks; pseudo-informative nodes are removed during
merging.
Debnath [6] proposed four algorithms, ContentExtractor,
FeatureExtractor, K-FeatureExtractor and L-Extractor, for
separating content blocks from irrelevant content.
ContentExtractor finds redundant blocks based on the occurrence
of the same block across multiple web pages. FeatureExtractor
identifies the content block with the help of a particular
feature. K-FeatureExtractor applies K-means clustering to
retrieve multiple blocks, whereas FeatureExtractor selects a
single block. L-Extractor combines the VIPS block-partitioning
algorithm (Vision-based Page Segmentation) [18] with a support
vector machine to identify content blocks in a web page.
ContentExtractor and FeatureExtractor were evaluated on different
websites, and both performed better than InfoDiscoverer in nearly
all cases.
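The redundancy test at the heart of ContentExtractor can be illustrated roughly as follows; fingerprinting blocks by their normalized text and the 50% frequency cut-off are our simplifications, not the authors' exact similarity measure:

```python
from collections import Counter

def fingerprint(block_text):
    # Normalize whitespace and case so identical template blocks collide.
    return " ".join(block_text.split()).lower()

def redundant_blocks(pages_blocks, min_fraction=0.5):
    # pages_blocks: one list of block texts per page of the same site.
    seen = Counter()
    for blocks in pages_blocks:
        seen.update({fingerprint(b) for b in blocks})
    n_pages = len(pages_blocks)
    return {fp for fp, count in seen.items()
            if count / n_pages >= min_fraction}
```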
Wang [7] proposed a method based on the fundamental
information of web pages. In the first step, the method extracts
information from each web page and combines it to obtain
site-level information. The extracted information includes text
nodes (the data inside tags), word lengths, menu subtrees
(subtrees whose text nodes have length less than five), menu item
information and menu instance information. In the second step, an
entropy estimation method is applied to discover the actual
information required by users. The information extracted by this
method helps in classifying web pages and in domain ontology
generation. Tseng and Kao [8] proposed a
method based on three novel features, similarity, density and
diverseness, for identifying information on web pages. The block
with the maximum value of these three features is considered the
informative block. The similarity feature identifies similar sets
of objects in a group, the density feature indicates the degree
of similar objects in a particular area, and diverseness measures
the distribution of features among various objects. Huang [9]
proposed a method employing block
preclustering technology, which consists of two phases: a
matching phase and a modeling phase. In the matching phase, the
web page is first partitioned into blocks based on VIPS
(Vision-based Page Segmentation) [18]. A nearest neighbor
clustering algorithm then clusters the partitioned blocks by
structural similarity. An importance degree is associated with
each cluster, and the clusters with their importance degrees are
stored in a clustered pattern database. In the modeling phase,
when a new web page arrives, it is first partitioned into blocks,
and these blocks are matched against the clustered pattern
database to obtain their importance degrees. Entropy evaluation
is then performed on the blocks to determine whether they are
informative.
Kang and Choi [10] proposed the RIPB (Recognizing Informative
Page Blocks) algorithm, which uses visual block segmentation.
This method also partitions the web page into blocks based on
VIPS, and blocks with similar structure are grouped into
clusters. A linear weighted function, based on the tokens (bits
of text) and the area of the cluster, is applied to determine
whether a block is informative. Li [11] proposed a novel
algorithm for extracting informative blocks based on a new tree,
the Content Structure Tree (CST), which is a good source for
examining the structure and content of a web page. After the CST
is created, the content weight of each block of the web page is
calculated. The Extract Informative Content Blocks algorithm
proposed by the authors is then used to extract blocks from the
CST.
Fei [12] used another type of tree, the Semantic DOM (SDOM),
for content extraction in the e-commerce domain. This tree
combines structural information with semantic meaning for
efficient extraction; different wrappers convert tags into the
corresponding information.
Gottron [13] proposed the content code vector (CCV) approach,
in which the characters of a web page are represented as 1 for
content and 0 for tags; the resulting sequence is called the
content code vector. A content code ratio (CCR) is calculated,
which determines the amount of content relative to code around
each position of the CCV. A high CCR value indicates more text
and fewer tags.
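A rough sketch of this idea follows; the regular-expression tokenization, window size and threshold are our assumptions rather than the paper's exact content code blurring procedure:

```python
import re

def content_code_vector(html):
    # 1 for content characters, 0 for markup characters.
    vector = []
    for token in re.split(r'(<[^>]*>)', html):  # capture group keeps the tags
        bit = 0 if token.startswith('<') else 1
        vector.extend([bit] * len(token))
    return vector

def content_regions(vector, window=40, threshold=0.7):
    # CCR around each position: share of content bits in a sliding window.
    half = window // 2
    flags = []
    for i in range(len(vector)):
        lo, hi = max(0, i - half), min(len(vector), i + half)
        ccr = sum(vector[lo:hi]) / max(1, hi - lo)
        flags.append(ccr >= threshold)          # high CCR = mostly text
    return flags
```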
Weninger [14] developed another approach, CETR, that makes use
of tag ratios. The tag ratio of a line is computed as the number
of non-HTML-tag characters divided by the number of HTML tags; a
threshold on these values then separates content from non-content
blocks. A problem with this approach is that the code of a web
page may be indented or unindented, which changes how the code is
distributed across lines and thus yields different tag ratio
values.
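A hedged sketch of per-line tag ratios in this spirit follows; the smoothing and clustering steps of the actual CETR method are omitted, and the threshold value is our assumption:

```python
import re

TAG = re.compile(r'<[^>]*>')

def tag_ratio(line):
    tags = TAG.findall(line)
    text_chars = len(TAG.sub('', line))
    return text_chars / max(1, len(tags))   # non-tag characters per tag

def content_lines(html, threshold=10.0):
    # Lines whose tag ratio clears the threshold are kept as content.
    return [line for line in html.splitlines()
            if tag_ratio(line) >= threshold]
```

Nguyen [15] creates a template which consists of paths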
to content blocks, along with the non-content blocks that share a
path with a content block. Any new web page is compared with the
stored template to determine whether it contains content blocks.
However, this approach works only for particular types of web
pages. Yang [16] used three parameters, node link
text density, non-anchor text density and punctuation mark
density, together in the extraction process. The main idea behind
these three densities is that non-informative blocks contain
fewer punctuation marks, more anchor text and less plain text.
Uzun [17] proposed a hybrid approach that
combines automatic and manual techniques for the extraction
process. Machine learning methods are used to derive rules for
extraction, and the authors found decision tree learning to be
the best method for creating these rules. Informative content is
then extracted using the rules and simple string manipulation
functions. If the rules fail to retrieve informative blocks, the
machine learning method derives new rules.
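As a rough illustration of such a rule-learning loop, the sketch below uses scikit-learn's decision tree as the learner; the block features named in the comments are assumed for illustration and are not the feature set of [17]:

```python
from sklearn.tree import DecisionTreeClassifier

def learn_rules(block_features, labels):
    # block_features: numeric features per labeled block (e.g. link
    # density, text length); labels: 1 = informative, 0 = noise.
    tree = DecisionTreeClassifier(max_depth=5)
    tree.fit(block_features, labels)
    return tree

def extract_informative(tree, blocks, block_features):
    # Keep the blocks the learned rules classify as informative.
    predictions = tree.predict(block_features)
    return [b for b, p in zip(blocks, predictions) if p == 1]
```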
II. PROPOSED APPROACH
Our approach combines the word to leaf ratio (WLR) [19] with
the link attributes [20] of nodes for content extraction.
Previous techniques counted characters instead of words, which
needlessly gives more importance to long words, so words are used
instead. Leaves are examined in this ratio because they are the
only nodes that contain textual information. The approach in [19]
does not consider that a block containing more links is less
informative than a block containing fewer links. Adding the text
link and link text ratios to the word to leaf ratio therefore
yields a new, more effective approach.
Our method performs the following steps (an illustrative
sketch of the whole pipeline follows the list):
1. Construct DOM (Document Object Model) tree of
web page.
2. Remove noisy nodes: title, script, head, meta, style,
noscript, link, select and #comment nodes, as well as
nodes that contain no words or are not visible.
3. Compute the word to leaf ratio (WLR) as in Eq. 1:
WLR(n) = tw(n) / l(n) (1)
where tw(n) is the number of words in node n and
l(n) is the number of leaves in the subtree of node n.
4. Obtain the initial node set, which contains the nodes
with a higher density of text. The initial node set I
is defined as the set of nodes satisfying Eq. 2:
WLR(n) ≥ √(maxWLR × WLR(root)) (2)
5. Compute the text link ratio (TLR), i.e. the ratio of
the length of the text to the number of links, of each
node n in I.
6. Compute the text to link text ratio (TLTR), i.e. the
ratio of the length of the text to the length of the
link text, of each node n in I.
7. Calculate the weight W(n) of each node as in Eq. 3:
W(n) = (TLR(n) + TLTR(n)) / a (3)
where a is a normalizing factor.
8. The relevance R(n) of node n is defined by Eq. 4,
Eq. 5, Eq. 6 and Eq. 7:
R(n) = WLR(n) × max( w(n), Σ R(n1) ), n1 ∈ Children(n) (4)
where
w(n) = rpos(n) × rWLR(n) if n ∈ I, and w(n) = 0 if n ∉ I (5)
rpos(n) = 1 − (id(n) − minid) / (maxid − minid) (6)
rWLR(n) = (WLR(n) − minWLR) / (maxWLR − minWLR) (7)
where minid and maxid are the minimum and maximum values of the
identifiers in I, and minWLR and maxWLR are the minimum and
maximum WLR values from step 3.
Relevance favours nodes with a higher density of text. If two
nodes have the same R(n), the node with the lower identifier is
selected.
9. Select the node with the highest value of the
combined score 0.7 × R(n) + 0.3 × W(n).
10. The selected node contains the required textual
information.
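The following is a minimal sketch of steps 1-10, assuming the lxml library for DOM construction; the helper names, the guards against division by zero, and the choice a = 2 for the normalizing factor are our own assumptions, not part of the method as published:

```python
import math
import lxml.html

NOISE_TAGS = {"title", "script", "head", "meta", "style", "noscript",
              "link", "select"}

def build_tree(html):
    # Step 1: construct the DOM tree of the web page.
    return lxml.html.fromstring(html)

def remove_noise(root):
    # Step 2: drop noisy tags, comments, and nodes without words.
    for node in list(root.iter()):
        if node is root:
            continue
        is_comment = not isinstance(node.tag, str)  # comments have a non-str tag
        if (is_comment or node.tag.lower() in NOISE_TAGS
                or not node.text_content().split()):
            node.drop_tree()

def wlr(node):
    # Step 3, Eq. 1: words in the node over leaves in its subtree.
    words = len(node.text_content().split())
    leaves = sum(1 for d in node.iter() if len(d) == 0) or 1
    return words / leaves

def initial_set(root):
    # Step 4, Eq. 2: keep nodes whose WLR clears the threshold.
    ratios = {n: wlr(n) for n in root.iter()}
    threshold = math.sqrt(max(ratios.values()) * ratios[root])
    return {n for n, r in ratios.items() if r >= threshold}, ratios

def weight(node, a=2.0):
    # Steps 5-7, Eq. 3: TLR and TLTR combined into the node weight.
    text_len = len(node.text_content())
    anchors = node.findall(".//a")
    link_text_len = sum(len(x.text_content()) for x in anchors)
    tlr = text_len / max(1, len(anchors))      # guard: node with no links
    tltr = text_len / max(1, link_text_len)    # guard: links with no text
    return (tlr + tltr) / a

def relevance(root, members, ratios):
    # Step 8, Eqs. 4-7: bottom-up relevance over the DOM tree.
    order = {n: i for i, n in enumerate(root.iter())}  # document-order ids
    ids = [order[n] for n in members]
    wlrs = [ratios[n] for n in members]
    min_id, max_id = min(ids), max(ids)
    min_w, max_w = min(wlrs), max(wlrs)

    def w(n):
        if n not in members:                           # Eq. 5: n not in I
            return 0.0
        rpos = 1 - (order[n] - min_id) / ((max_id - min_id) or 1)  # Eq. 6
        rwlr = (ratios[n] - min_w) / ((max_w - min_w) or 1)        # Eq. 7
        return rpos * rwlr

    scores = {}
    def compute(n):
        child_sum = sum(compute(c) for c in n if isinstance(c.tag, str))
        scores[n] = ratios[n] * max(w(n), child_sum)   # Eq. 4
        return scores[n]
    compute(root)
    return scores

def extract(html):
    # Steps 9-10: the best-scoring node holds the main textual content.
    root = build_tree(html)
    remove_noise(root)
    members, ratios = initial_set(root)
    scores = relevance(root, members, ratios)
    best = max(scores, key=lambda n: 0.7 * scores[n] + 0.3 * weight(n))
    return best.text_content()
```

For instance, extract(html_source) would return the text of the best-scoring node; in practice R(n) and W(n) may need to be normalized to comparable ranges before being mixed with the 0.7/0.3 weights.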
III. CONCLUSION
Informative content extraction from web pages is very
important because web pages are unstructured and their number is
growing at a very fast rate. Content extraction is useful for
human users, who obtain the required information in a
time-efficient manner, and it also serves as a preprocessing
stage for systems such as robots, indexers and crawlers that need
the main content of a web page and must avoid processing noisy,
irrelevant and useless information. We have presented traditional
approaches for extracting the main content from web pages, as
well as a new approach for content extraction based on the word
to leaf ratio and link density.
IV. FUTURE SCOPE
Automatic content extraction is an emerging research field,
as the amount and type of information on the web is increasing
and changing every day. We will evaluate our method on different
types of websites, such as news, shopping, business, e-commerce
and education sites, and compare its precision and recall with
existing methods. We will also try to incorporate hypertext
information into the method and work on event information
generation.
ACKNOWLEDGMENT
I sincerely thank all those who helped me in completing this
task.
REFERENCES
[1] D.Gibson, K.Punera and A.Tomkins, “The volume and evolution of web
page templates”, proceedings of WWW '05 Special interest tracks and
posters of the 14th International Conference on World Wide Web, pp.
830-839, 2005.
[2] A.F.R.Rahman, H.Alam and R.Hartono, “Content extraction from
HTML documents”, International workshop on Web Document
Analysis, pp. 7-10, 2001.
[3] S.Lin and J.Ho, “Discovering informative content blocks from web
documents”, Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 450-453, 2002.
[4] H.Kao and S.Lin, “Mining web informative structures and
contents based on entropy analysis”, IEEE Trans. Knowledge and Data
Eng., vol. 16, no. 1, pp. 41-55, 2004.
[5] H.Kao and J.Ho, “WISDOM: web intrapage informative structure
mining based on document object model”, IEEE Trans. Knowledge and
Data Eng., vol. 17, no. 5, pp. 614-627, 2005.
[6] S.Debnath, P.Mitra, N.Pal and C.Giles, “Automatic identification of
informative sections of web pages”, IEEE Trans. Knowledge and Data
Eng., vol. 17, no. 9, pp. 1233-1246, 2005.
[7] C.Wang, J.Lu and G.Zhang, “Mining key information of web pages: a
method and its application”, Expert Systems with Applications, vol. 33,
no. 2, pp. 425-433, 2006.
[8] Y.Tseng and H.Kao, “The mining and extraction of primary informative
blocks and data objects from systematic web pages”, Proceedings of the
IEEE/WIC/ACM International Conference on Web Intelligence, pp.
370-373, 2006.
[9] C.Huang, P.Yen, Y.Hung, T.Chuang and H.Lee, “Enhancing entropy-
based informative block identification using block preclustering
technology”, IEEE International Conference on Systems, Man, and
Cybernetics, pp. 2640-2645, 2006.
[10] J.Kang and J.Choi, “Detecting informative web page blocks for efficient
information extraction using visual block segmentation”, International
Symposium on Information Technology Convergence, pp. 306-310,
2007.
[11] Y.Li and J.Yang, “A novel method to extract informative blocks from
web pages”, International Joint Conference on Artificial Intelligence, pp.
536-539, 2009.
[12] Y.Fei, Z.Luo, Y.Xu and W.Zhang, “A semantic dom approach for
webpage information extraction”, International Conference on
Management and Service Science, pp. 1-5, 2009.
[13] T.Gottron, “Content code blurring: a new approach to content
extraction”, 19th International Conference on Database and Expert
Systems Application, pp. 29-33, 2008.
[14] T.Weninger, W.-H.Hsu and J.Han, “CETR: content extraction via tag
ratios”, Proceedings of the 19th International Conference on World Wide
Web (WWW '10), pp. 971-980, 2010.
[15] D.Nguyen, D.Nguyen, S.Pham and T.Bui, “A fast template-based
approach to automatically identify primary text content of a web page”,
International Conference on Knowledge and Systems Engineering, pp.
232-236, 2009.
[16] D.Yang and J.Song, “Web content information extraction approach
based on removing noise and content-features”, 2010 International
Conference on Web Information Systems and Mining, pp. 246-249,
2010.
[17] E.Uzun, H.Agun and T.Yerlikaya, “A hybrid approach for extracting
informative content from web pages”, Information Processing and
Management, vol. 49, no. 4, pp. 1248-1257, 2013.
[18] D.Cai, S.Yu, J.-R.Wen and W.-Y.Ma, “Extracting content structure for
web pages based on visual representation”, In Proceedings of Fifth Asia
Pacific Web Conference, pp. 406-417, 2003.
[19] D.Insa, J.Silva and S.Tamarit, “Using the words/leafs ratio in the DOM
tree for content extraction”, The Journal of Logic and Algebraic
Programming, vol. 82, no. 8, pp. 311-325, 2013.
[20] S.Shen and H.Zhang, “Block-level links based content extraction”,
Fourth International Symposium on Parallel Architectures, Algorithms
and Programming, pp. 330-333, 2011.
