IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
_______________________________________________________________________________________
Volume: 03 Issue: 12 | Dec-2014, Available @ http://www.ijret.org 53
ENHANCING THE PERFORMANCE OF CLUSTER BASED TEXT
SUMMARIZATION USING SUPPORT VECTOR MACHINE
M. S. Patil1, M. S. Bewoor2, S. H. Patil3
1 Research Scholar, Department of Computer Engineering, BVUCOEP, Maharashtra, India
2 Associate Professor, Department of Computer Engineering, BVUCOEP, Maharashtra, India
3 Professor, Department of Computer Engineering, BVUCOEP, Maharashtra, India
Abstract
Technology evolves continuously, largely in an effort to reduce human work and make systems as automatic as possible. The
same holds for digital information. With the enormous growth in internet use, the volume of digital information has increased
strikingly. This information appears in many forms: the same content is often presented in different forms, much of it is
unrelated to the user's need, and a great deal of it is redundant. Moreover, most of the time the information we require is
textual. To find even a small piece of information, one has to go through thousands of documents and read everything that is
retrieved, whether or not it contains anything useful. Reading all the retrieved documents and preparing an exact summary
from them in limited time is very difficult, and the same information is often repeated across many documents. This
motivates research in the area of text mining, in which text summarization is one of the challenging tasks.
Keywords: Entropy, FCM, Purity, SVM
--------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
Text summarization is the process of presenting the
information in a document concisely without losing its
essential content. The best approach is one that conveys the
information contained in the document faithfully, in correct
form, and without changing its meaning. The generated
summary must therefore retain both the data and the central
idea of the document. Text summarization techniques can be
classified by the following characteristics:
1. Based on number of documents: single-document
and multi-document summarization.
2. Based on summary generated: extractive and
abstractive.
a. Extractive: Sentences in the summary are the same
as those in the document.
b. Abstractive: Sentences in the summary are newly
constructed from the information in the
document. This approach is more difficult than
the extractive one.
3. Based on technique used: supervised and
unsupervised.
4. Based on usage of summary: query based and query
independent.
a. Query based: The summary of the document is
constructed with respect to the query given by
the user.
b. Query independent: Sentences are selected from
the document irrespective of any query, so the
summary remains the same throughout the process.
Irrespective of the type of summarization technique used,
text summarization is carried out in the following three stages:
1. Preprocessing
2. Processing
3. Summary generation
In the preprocessing stage, NLP phases such as tokenization,
parsing, stop word removal, stemming, and case folding are
carried out. This stage eliminates unnecessary words and
retains only the important ones.
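The preprocessing stage can be illustrated with a minimal Python sketch. It covers only sentence splitting, case folding, tokenization, and stop word removal; parsing and stemming are left out. The function names and the tiny stop word list are assumptions made for illustration and are not taken from the described system.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "on"}

def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def preprocess(sentence):
    """Case-fold, tokenize into alphanumeric words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = "The cat sat on the mat. Dogs are loyal!"
sentences = split_sentences(text)
tokens_per_sentence = [preprocess(s) for s in sentences]
```

Keeping the original sentences alongside their token lists matters for an extractive summarizer, because the summary is built from the unmodified sentences while scoring uses the cleaned tokens.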
In the processing stage, summarization algorithms are applied
to extract the sentences required for generating the summary.
In the last stage, the final summary is generated from the
given document or documents.
This paper presents an extractive text summarization
approach. Two algorithms are used: Fuzzy C Means (FCM),
a clustering algorithm, and Support Vector Machine (SVM).
Finally, the summary generated by the combined approach is
compared with the summary generated by the pure
clustering algorithm.
Fig-1: Summarization Process
2. RELATED WORK
Paper [2] proposes a system for generating summaries
using a clustering algorithm cascaded with a Support Vector
Machine (SVM). It also proposes a set of metrics for
evaluating the performance of the proposed system against
that of the pure clustering algorithm. In paper [3], the
authors compare various text summarization techniques.
Paper [7] describes the Fuzzy C Means (FCM) clustering
technique in detail and clearly states the FCM algorithm.
In paper [6], the performance of Fuzzy C Means (FCM) is
compared with other techniques such as Support Vector
Machine (SVM), Artificial Neural Network (ANN) and BC.
Its results show that, of these three techniques, SVM
performs best. The use of SVM is also explained in that
paper.
3. SYSTEM DESCRIPTION
The proposed system generates a summary of a text file using
Fuzzy C Means (FCM), a clustering algorithm, cascaded with
Support Vector Machine (SVM), a machine learning
algorithm. First, the input text file undergoes a
preprocessing step, which carries out NLP phases such as
tokenization and stop word removal. Next is the processing
step, in which FCM is applied and the cluster centers are
calculated; word counts and word frequencies are then
calculated using FCM. In the following step, SVM is applied
and the word frequencies for SVM are calculated. In the last
step, summaries are generated from the two different sets of
sentence scores produced by the two algorithms, namely FCM
and FCM cascaded with SVM. The summaries thus created are
compared with respect to the set of metrics.
Figure 2 shows the architecture of the system.
Fig-2: System Architecture
4. METHODOLOGY
Fig-3: Workflow of the System
This paper proposes an algorithm for text summarization
using FCM, a clustering algorithm, and SVM. Traditional
clustering techniques such as k-means and nearest neighbor
clustering generate clusters in which each item belongs
to exactly one cluster. These are termed hard clustering
techniques. Unlike these, FCM is a soft clustering technique:
it allows one item to belong to all the generated clusters.
Each item is related to every cluster through a relationship
(membership) function. The higher the value of the function,
the stronger the relation of the item to that cluster.
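The idea of soft membership can be made concrete with a short Python sketch of the standard FCM membership update, in which the membership of an item in cluster i is 1 / sum over k of (d_i / d_k)^(2/(m-1)), with d_i the distance from the item to center i. The function name, the use of Euclidean distance, and the default fuzziness exponent m = 2 are assumptions made for illustration; the paper does not state which distance or exponent its implementation uses.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Return the membership of `point` in each cluster center.

    Standard FCM update: u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)),
    where d_i is the Euclidean distance from `point` to center i (assumed metric).
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to those centers.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# One-dimensional example: the point is three times closer to the first center.
memberships = fcm_memberships((1.0,), [(0.0,), (4.0,)])
```

In this example the memberships are approximately 0.9 and 0.1. They sum to 1, and the nearer center receives the higher value, which is the behaviour described above.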
The workflow of the system is as follows:
1) Preprocessing performs the NLP phases such as
tokenization and stop word removal.
2) Next, the word frequency is calculated as follows:
wf = (wordCount / TotalWords) * 100
where wordCount is the total number of times the
word occurs in the document and
TotalWords is the total number of words in the
document.
This frequency is normalized.
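3) Then the pure FCM algorithm is applied.
The membership function used in minimizing the FCM objective is as follows:
\[ U_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{dist(center_i, x_j)}{dist(center_k, x_j)} \right)^{2/(m-1)}} \]
where C here denotes the number of clusters and m is the fuzziness exponent.
After this, SVM is applied. In this phase the SVM kernel
function is applied to calculate the sentence scores.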
4) Sentence scores are calculated as follows:
Score = (X * Y) + C
where
C is a constant; here it is the word frequency.
For FCM, X and Y are the cluster centers.
For SVM, X and Y are calculated using the cluster
centers and their related cluster values.
5) According to the limit on the number of sentences in the
summary, the highest-scoring sentences are selected and
the summary is generated.
6) The generated summaries are compared using the following
metrics:
a. Purity: It is an external clustering evaluation
metric. The higher the value of purity, the more
accurate the summary obtained.
\[ Purity = \frac{1}{N} (X + Y) \]
b. Clustering Entropy: It is also an external
clustering evaluation metric.
\[ H(X) = - \sum_{i=0}^{n-1} p(X_i) \log_2 p(X_i) \]
Here a value closer to 0 indicates a more
accurate result.
c. Semantic Gap: This is shown using the sentence
summaries, which expose the semantic gap between
the two algorithms.
d. Classification Cost: It is calculated using the
following formula:
Cost = Frequency + Overheads (transactions or
iterations) + number of clusters
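Steps 2, 4 and 5 of the workflow above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The paper does not spell out how X and Y are derived for a given sentence, so the sketch takes them as plain numeric inputs. Returning the selected sentences in document order is likewise an assumption, and all function names are hypothetical.

```python
from collections import Counter

def word_frequencies(tokens):
    """Step 2: wf = (wordCount / TotalWords) * 100 for every distinct word."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: (count / total) * 100 for word, count in counts.items()}

def sentence_score(x, y, c):
    """Step 4: Score = (X * Y) + C. How X and Y are obtained is left to the caller."""
    return (x * y) + c

def top_sentences(sentences, scores, limit):
    """Step 5: keep the `limit` highest-scoring sentences.

    Assumption: the chosen sentences are returned in their original order.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:limit])
    return [sentences[i] for i in chosen]

frequencies = word_frequencies(["text", "mining", "text", "summary"])
summary = top_sentences(["s1", "s2", "s3"], [0.2, 0.9, 0.5], limit=2)
```

Because the same selection routine works on any list of scores, the FCM summary and the FCM cascaded with SVM summary can be produced by calling `top_sentences` twice with the two score lists, which keeps the comparison between the algorithms limited to the scoring step.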
5. EXPERIMENTAL AND PERFORMANCE
ANALYSIS
Experiments were performed by giving various text
files as input to the system and calculating the values of
the above metrics for both algorithms. A graph was
generated for each metric, comparing the values obtained
with the two algorithms.
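As an illustration of how one of these metric values might be computed, the clustering entropy H(X) = -sum of p(X_i) * log2 p(X_i) can be sketched in Python. The paper does not specify how the probabilities p(X_i) are obtained from the clusters, so the sketch assumes they are already available as a list that sums to 1; the function name is hypothetical.

```python
import math

def clustering_entropy(probabilities):
    """H(X) = -sum_i p(X_i) * log2 p(X_i); zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Evenly spread probabilities give the highest entropy for a given number of outcomes;
# a single certain outcome gives entropy 0.
even_split = clustering_entropy([0.5, 0.5])
four_way = clustering_entropy([0.25, 0.25, 0.25, 0.25])
certain = clustering_entropy([1.0])
```

Here `even_split` is 1.0, `four_way` is 2.0, and `certain` is 0, which matches the reading that values closer to 0 indicate a more accurate result.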
Fig-4: Comparison Graph of Purity in FCM and proposed algorithm
Fig-5: Comparison Graph of Clustering Entropy in FCM and proposed algorithm
Fig-6: Comparison Graph of Semantic Gap in FCM and proposed algorithm
Fig-7: Comparison Graph of Purity in FCM and proposed algorithm
6. CONCLUSION
This paper concentrates on improving the quality of the
summary generated by a clustering technique. The
performance of the proposed system is compared with that of
the pure clustering technique by comparing the generated
summaries on the basis of the performance evaluation
metrics given above. The results show that the proposed
algorithm performs better than the pure clustering
algorithm. In future, this approach can be applied to
multi-document summarization.
REFERENCES
[1]. A. K. Jain, M. N. Murty, P. J. Flynn, “Data Clustering: A
Review”, ACM, 2000.
[2]. M. S. Patil, M. S. Bewoor, Dr. S. H. Patil, “A Hybrid
Approach for Extractive Document Summarization Using
Machine Learning and Clustering Technique”, IJCSIT, 2014.
[3]. Roma V J, M S Bewoor, Dr. S. H. Patil, “The Quantity
of NLP Based Text Summarization and Clustering
Techniques By Quantitative and Qualitative Metrics”,
International Journal of Scientific & Engineering Research,
2013.
[4]. Ronen Feldman, James Sanger, “The Text Mining
Handbook”, www.cambridge.org.
[5]. T. J. Ross, “Fuzzy Logic with Engineering
Applications”, Third Edition, John Wiley & Sons, Ltd,
Chichester, UK, 2010.
[6]. Srinivasa K G, Venugopal K R and L M Patnaik,
“Feature Extraction using Fuzzy C-Means Clustering for
Data Mining Systems”, IJCSNS International Journal of
Computer Science and Network Security, Vol. 6, No. 3A,
March 2006.
[7]. Sumit Goswami, Mayank Singh Shishodia, “A Fuzzy
Based Approach To Text Mining And Document
Clustering”.
[8]. Tsutomu Hirao, Hideki Isozaki, Eisaku Maeda,
“Extracting Important Sentences with Support Vector
Machines”, ACM, 2002.
[9]. Vishal Gupta, Gurpreet S. Lehal, “A Survey of Text
Mining Techniques and Applications”, Journal of Emerging
Technologies in Web Intelligence, Vol. 1, No. 1, August
2009.
BIOGRAPHIES
Patil M S is currently completing an M.Tech (Computer) at
Bharati Vidyapeeth College of Engineering, Pune.
E-mail: madhsp.patil@gmail.com
M S Bewoor (M.E. Computer) is currently working as an
Associate Professor in the Department of Computer
Engineering at Bharati Vidyapeeth College of Engineering,
Pune.
E-mail: msbewoor@bvucoep.edu.in
Dr. S H Patil is currently working as a Professor in the
Department of Computer Engineering at Bharati Vidyapeeth
College of Engineering, Pune. His areas of specialization
are Operating Systems and Distributed Systems.
E-mail: shpatil@bvucoep.edu.in
