An automatic text summarization using lexical cohesion and correlation of sentences

IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 06 | Jun-2014, Available @ http://www.ijret.org 285
AN AUTOMATIC TEXT SUMMARIZATION USING LEXICAL
COHESION AND CORRELATION OF SENTENCES
A.R.Kulkarni¹, S.S.Apte²
¹Computer Science & Engineering Department, Walchand Institute of Technology, Solapur – 413006, India
²Head, Computer Science & Engineering Department, Walchand Institute of Technology, Solapur – 413006, India
Abstract
Due to the substantial increase in the amount of information on the Internet, it has become extremely difficult to search for the relevant
documents needed by users. To address this problem, text summarization is used to produce a summary of a document
such that the summary contains the important content of the document. This paper proposes an improved approach to text summarization
using lexical chaining and correlation of sentences. Lexical chains are created using WordNet. The score of each lexical chain is
calculated based on keyword strength, tf-idf and other features. Lexical chains help to analyze the document
semantically, and correlation of sentences takes into account the relation of a sentence with the preceding or succeeding
sentence. This improves the quality of the summary generated.
In this paper we discuss a summarization method that combines lexical chaining with correlation of sentences, in which the relation
of a sentence with the preceding sentence is considered. Our experiments show that the inclusion of both these features improves
the quality of the summary generated.
Keywords— Text summarization, WordNet, Correlation of sentences, Lexical chains
--------------------------------------------------------------------***-----------------------------------------------------------------
1. INTRODUCTION
1.1 Motivation
These days, the number of Web pages on the Internet almost
doubles every year, as information is now available from
a variety of sources. It takes a considerable amount of time to
find the relevant information. Automatic text
summarization helps users to find the relevant
information rapidly. It generates a summary of the
document, so that one can read the summary and
decide whether the document is relevant to the information
needed.
1.2 Background Research:
Text summarization is the process of producing a condensed
version of the original document. This condensed version
should retain the important content of the original document.
Research has been carried out for many years to generate
coherent and indicative summaries using different
techniques. According to (Jones, 1993), text
summarization can be described as a two-step process:
i) Building a source representation from the original
document.
ii) Generating a summary from the source representation.
Text summarization can be broadly classified into two types:
single-document summarization and multi-document
summarization. This paper focuses on single-document
summarization, which generates a summary of a single document.
Text summarization can also be categorized as extractive
or abstractive, based on the nature of the text representation in
the summary.
Many methods have been proposed for generating a
coherent summary. The earliest methods were purely statistical
and focused on term frequency [1] for choosing
important sentences. These methods were not found to be
effective, as they did not consider all the contexts of a word
or identify semantically related terms, a property known as cohesion.
Then came methods that used a semantic representation of
the original document supported by a domain-specific
knowledge base. Nowadays text summarization is
considered a natural language processing task. Lexical
chains, the simplest form of lexical cohesion, were introduced
by Morris & Hirst [2], but it was found that not all possible
senses of a word were taken into account.
Barzilay & Elhadad [3] presented a better algorithm that
constructs all possible interpretations of the source text
using lexical chains. It is an efficient method for text
summarization, as lexical chains identify and capture the
important concepts of the document without going into
deep semantic analysis. Lexical chains are constructed
using a knowledge base that contains nouns and their
various associations.
Our algorithm is based on the method described above. We
have used WordNet to generate a domain-specific extractive
summary using lexical chains for the nouns in the
document. The algorithm segments the given content into
sentences and then into tokens. These tokens are tagged using
a POS tagger. The nouns are selected and, for each noun in the
segment, we consider its senses using WordNet. We then
attempt to merge these senses into all of the existing chains
in all possible ways, hence building every possible
interpretation of the segment. Next, we merge chains between
segments that contain a word in the same sense.
The algorithm then calculates the score of each lexical chain,
determines the strongest chain and uses it to generate a
summary. We have also used the concept of correlation of
sentences to generate a good-quality summary. The terms
that occur in the strongest lexical chains are considered
key terms, and the score of each sentence is calculated based
on the presence of key terms in it. All the sentences are
ranked by score and the top n sentences are selected
for inclusion in the summary. Then the correlation of
sentences is checked: if a selected sentence is correlated
with the previous sentence, the previous sentence
is also included in the summary, subject to the condition
given in the algorithm in Section 3.
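The chain-building idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in this paper: the small `RELATED` set is a hypothetical stand-in for WordNet lookups, each noun is treated as having a single sense (so only one interpretation is built), and the strongest chain is simply taken to be the one with the most word occurrences rather than being scored with keyword strength or tf-idf.

```python
# Hypothetical stand-in for WordNet: pairs of nouns treated as related
# (synonym, hypernym, hyponym or meronym). A real system would query WordNet.
RELATED = {
    frozenset({"car", "vehicle"}),
    frozenset({"car", "wheel"}),
    frozenset({"vehicle", "truck"}),
    frozenset({"bank", "money"}),
}


def related(a, b):
    """Two nouns are related if they are identical or linked in RELATED."""
    return a == b or frozenset({a, b}) in RELATED


def build_chains(nouns):
    """Group nouns into chains of related words, in order of occurrence.

    A noun related to members of several chains merges those chains into one.
    """
    chains = []  # each chain is a list of nouns, repetitions included
    for noun in nouns:
        hits = [c for c in chains if any(related(noun, w) for w in c)]
        rest = [c for c in chains if not any(related(noun, w) for w in c)]
        merged = [w for c in hits for w in c] + [noun]
        chains = rest + [merged]
    return chains


def strongest_chain(chains):
    """Assumed scoring: the chain with the most word occurrences wins."""
    return max(chains, key=len)


nouns = ["car", "bank", "wheel", "truck", "vehicle", "money", "car"]
best = strongest_chain(build_chains(nouns))
print(sorted(set(best)))  # ['car', 'truck', 'vehicle', 'wheel']
```

The distinct members of the strongest chain play the role of the key terms used for sentence scoring.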
2. ARCHITECTURE OF TEXT SUMMARIZATION
Preprocessing includes:
• Segmentation
• Tokenization
• POS (part-of-speech) tagging at the lexical level
• Stemming
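As a rough illustration of these preprocessing steps, the following Python sketch uses only the standard library. The function names are hypothetical, the regular expressions are simplistic assumptions, and the suffix-stripping `stem` function is only a crude stand-in for a real stemmer. POS tagging is omitted here because it needs a trained tagger.

```python
import re


def segment(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return lower-cased word tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())


def stem(token):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "Lexical chains help. They capture topics! Do they?"
for sentence in segment(text):
    print([stem(t) for t in tokenize(sentence)])
```

Keeping each step as a separate function mirrors the pipeline order listed above, so a proper tagger or stemmer can be swapped in without touching the other steps.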
3. LEXICAL CHAIN COMPUTING ALGORITHM
1. Input the original document for which the summary is to
be generated (.txt file).
2. Divide the document into sentences using
segmentation.
3. Divide each sentence into tokens using a
tokenizer.
4. Tag the tokens using a POS tagger.
5. For each noun, build the synsets.
6. For each sentence, generate a map using four
relations: synonym, hypernym, hyponym and
meronym.
7. Calculate the distance of each word from the other related
words.
8. Build lexical chains using the generated map.
9. Calculate the weight of each chain using the
distance values of its words.
10. Select the longest chain, i.e. the best chain, which has the highest
chain weight.
11. From the original document, select the sentences that
contain words in the best chain, retaining their order
of occurrence in the original document.
12. Pick the top n sentences as the summary, where n is based on the
percentage of the original document to be used for
the summary.
13. If a selected sentence starts with one of the words
although, however, moreover, also, this, those or
that, it is related to the preceding
sentence.
14. If the rank of the preceding sentence is equal to or
greater than 70% of the rank of the selected
sentence, the preceding sentence is included in the summary. In this
way correlation between sentences is maintained.
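Steps 11 to 14 can be sketched as follows. This is a minimal illustration with several assumptions that are not stated in the paper: a sentence's rank is taken to be the number of its tokens that are key terms, ties are broken by position, the `ratio` parameter stands for the percentage of the document to keep, and the correlation check is applied only to the originally selected sentences. The function and parameter names are hypothetical.

```python
import math
import re

# Cue words from step 13 that link a sentence to the preceding one.
CUE_WORDS = {"although", "however", "moreover", "also", "this", "those", "that"}


def words(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def score(sentence, key_terms):
    """Assumed rank: number of tokens in the sentence that are key terms."""
    return sum(1 for w in words(sentence) if w in key_terms)


def summarize(sentences, key_terms, ratio=0.3, threshold=0.7):
    scores = [score(s, key_terms) for s in sentences]
    n = max(1, math.ceil(ratio * len(sentences)))
    # Rank by descending score; ties are broken by original position.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = set(ranked[:n])
    for i in sorted(chosen):  # snapshot, so added sentences are not re-checked
        tokens = words(sentences[i])
        starts_with_cue = bool(tokens) and tokens[0] in CUE_WORDS
        if i > 0 and starts_with_cue and scores[i - 1] >= threshold * scores[i]:
            chosen.add(i - 1)  # step 14: keep the correlated predecessor
    return [sentences[i] for i in sorted(chosen)]  # original order


key_terms = {"car", "truck", "vehicle", "wheel"}
sentences = [
    "Car and truck makers expanded vehicle output.",
    "However, truck and car sales drove most vehicle and wheel revenue.",
    "Analysts met in Paris.",
    "Weather was mild.",
]
print(summarize(sentences, key_terms, ratio=0.25))
```

In this example only the second sentence is selected by rank, but it starts with "However" and its predecessor scores 3 against its own 4, which meets the 70% condition, so both sentences are returned.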
4. EVALUATION
Evaluation is the most important part of any research work.
It allows various techniques to be compared using evaluation
metrics.
This paper uses precision and recall [4,5,6] for
evaluation, which are statistical measures. Precision
is the proportion of sentences in
the summary that are relevant, whereas recall is the
proportion of relevant sentences that are included in the summary.
4.1 Precision
$$\text{Precision} = \frac{|\{\text{retrieved sentences}\} \cap \{\text{relevant sentences}\}|}{|\{\text{retrieved sentences}\}|}$$
The higher the precision value, the better the system is
at excluding irrelevant sentences.
4.2 Recall
$$\text{Recall} = \frac{|\{\text{retrieved sentences}\} \cap \{\text{relevant sentences}\}|}{|\{\text{relevant sentences}\}|}$$
The higher the recall value, the better the approach is
at selecting the relevant sentences.
4.3 F-Measure
The weighted harmonic mean of precision and recall, with both
weighted equally, is called the F-measure:
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
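The three measures can be computed directly from sets of sentence indices. The sketch below is illustrative only: the function name is hypothetical, and the index sets are made-up values, not results from this paper.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for two collections of sentence ids.

    retrieved: sentences chosen by the summarizer
    relevant:  sentences in the ideal (manual) summary
    """
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: 4 sentences retrieved, 5 relevant, 3 in common.
p, r, f = precision_recall_f({0, 2, 5, 7}, {0, 2, 3, 5, 8})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.75 0.6 0.667
```

The guards against empty sets avoid division by zero when a summarizer returns nothing or the ideal summary is empty.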
5. EXPERIMENTAL RESULTS
Three documents are taken from the news domain. The original
documents, the manually generated summaries and the summaries
generated by the above approach are shown below. The
precision, recall and F-measure are calculated for these three
documents and compared with those of two other
summarizers.
[Figure: Original Document 1]
[Figure: Ideal Summary of Document 1]
[Figure: Summary of Document 1 generated by our summarizer]
[Figure: Original Document 2]
[Figure: Ideal Summary of Document 2]
[Figure: Summary of Document 2 generated by our summarizer]
[Figure: Original Document 3]
[Figure: Ideal Summary of Document 3]
[Figure: Summary of Document 3 generated by our summarizer]
6. COMPARISON
This paper compares the online summarizer from freesummarizer.com [7], Copernic Summarizer and our summarizer, which uses lexical
chains and correlation of sentences. The above three documents are used as input to all three summarizers. Precision,
recall and F-measure are used as performance measures for the summaries generated.

[Figure: Document 1. Bar chart of precision, recall and F-measure for Copernic Summarizer, Online Summarizer and Lexical Chain Summarizer]
[Figure: Document 2. Bar chart of precision, recall and F-measure for the three summarizers]
[Figure: Document 3. Bar chart of precision, recall and F-measure for the three summarizers]
7. CONCLUSIONS
For document 1 and document 3, our summarizer
performs better than both Copernic Summarizer and the online
summarizer. For document 2, it performs as well as the online
summarizer but is less efficient than Copernic Summarizer.
Our summarizer performs better because it also takes into account the semantic
analysis of the document and the correlation of sentences when
generating the summary.
REFERENCES
[1] Canasai Kruengkrai and Chuleerat Jaruskulchai,
"Generic Text Summarization Using Local and Global
Properties of Sentences", Proceedings of the IEEE/WIC
International Conference on Web Intelligence (WI'03),
2003.
[2] J. Morris and G. Hirst, "Lexical cohesion computed by
thesaural relations as an indicator of the structure of the
text", Computational Linguistics, 18(1), pp. 21-45,
1991.
[3] Regina Barzilay and Michael Elhadad, "Using Lexical
Chains for Text Summarization", in Proceedings of the
Intelligent Scalable Text Summarization
Workshop (ISTS'97), ACL, Madrid, 1997.
[4] Rene Arnulfo Garcia-Hernandez and Yulia Ledeneva,
"Word Sequence Models for Single Text
Summarization", IEEE, pp. 44-48, 2009.
[5] Jade Goldstein, Mark Kantrowitz, Vibhu Mittal and Jaime
Carbonell, "Summarizing Text Documents: Sentence
Selection and Evaluation Metrics", Language
Technologies Institute, Carnegie Mellon University.
[6] Khosrow Kaikhah, "Automatic Text Summarization
with Neural Networks", in Proceedings of the Second
International Conference on Intelligent Systems, IEEE,
pp. 40-44, Texas, USA, June 2004.
[7] www.freesummarizer.com/summarize/