TELKOMNIKA, Vol.17, No.6, December 2019, pp.3050~3056
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v17i6.12494 ◼ 3050
Received February 6, 2019; Revised June 18, 2019; Accepted July 2, 2019
Regression model focused on query
for multi documents summarization
based on significance of the sentence position
Aris Fanani*¹, Yuniar Farida², Putra Prima Arhandi³, M. Mahaputra Hidayat⁴, Abdul Muhid⁵, Billy Montolalu⁶
¹˒²˒⁵ UIN Sunan Ampel Surabaya, 117, A Yani St., Surabaya, Indonesia
³ State Polytechnic of Malang, 9, Soekarno Hatta St., Malang, Indonesia
⁴ Bhayangkara University, 114, A Yani St., Surabaya, Indonesia
⁶ IT Telkom, Surabaya, Indonesia
*Corresponding author, e-mail: arisfa@uinsby.ac.id¹, yuniar_farida@uinsby.ac.id², putraprima@polinema.ac.id³,
mahaputra@ubhara.ac.id⁴, abdulmuhid@uinsby.ac.id⁵, billy@ittelkom-sby.ac.id⁶
Abstract
Document summarization is needed to obtain information effectively and efficiently. One way to
produce a summary is to apply machine learning techniques. This paper proposes the application of
regression models to query-focused multi-document summarization based on the significance of
the sentence position. The method used is Support Vector Regression (SVR), which estimates
the weight of each sentence in a set of documents to be summarized, based on previously defined
sentence features. A series of evaluations was performed on the DUC 2005 data set. The resulting
summaries have average precision and recall values of 0.0580 and 0.0590 when measured with ROUGE-2,
and 0.0997 and 0.1019 when measured with ROUGE-SU4. The proposed regression model can measure
the significance of the position of a sentence in the document well.
Keywords: multi-document summarization, sentence position, support vector regression
Copyright © 2019 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
As internet usage increases, information becomes easier to obtain and more abundant. For
a single topic, many documents are returned with different narratives even though the core
information is the same. Document summarization is needed to obtain this information
effectively and efficiently. When documents on web pages are searched, keyword searches are
generally carried out on the entire contents of each document, so information retrieval takes
a long time, whereas users expect accurate results in a short time. Therefore, it is recommended
that keyword matching be carried out on the core of each document, which is shorter than
the full content. Summarization is needed to obtain this condensed content. A summary is
a concise expression of the main content of an article, which aims to tell the reader the core of
its main idea [1-4]. The basic concept of summarization is to take the important parts of
an article and present them again in a more concise form for its users [5]. A good summary
should retain the most important contents of the original document or a cluster of related
documents, while being coherent, non-redundant and grammatically readable [6].
A summary can be produced from one document or from several documents.
Multi-document summarization differs from single-document summarization in that it involves
many sources of information that overlap and complement each other. The main task is therefore
not only to identify and remove redundancy across all documents, but also to ensure that the final
summary is coherent and complete [3, 5, 7]. This motivates the need for automatic
summarization systems. An automatic text summarization system is a computer-based
tool that produces a text shorter than the original while still retaining the main points of
the summarized text [8-11].
Automatic summarization techniques are divided into two groups: extractive
summarization and abstractive summarization [4]. An extractive summary is produced by arranging
a few sentences that are selected exactly as they appear in the original document.
Abstractive summarization, on the other hand, is a more difficult task because it paraphrases
the source documents. J. V. Tohalino and D. R. Amancio [12] used dynamical measurements
based on complex networks for extractive multi-document summarization, extracting the most
central sentences from several textual sources. Meanwhile, G. Di Fabbrizio, A. Stent
and R. Gaizauskas [13] presented STARLET-H, a hybrid abstractive/extractive
summarizer that produces summaries of opinion reviews by combining natural language generation
with salient sentence selection techniques.
Another study states that document summarization methods can also be
divided into generic summarization and query-based summarization [9, 14]. The same study
explains that generic summarization is further divided into supervised and
unsupervised methods. Supervised methods need human-produced training data to
summarize a document, so different kinds of documents require different training data,
and a supervised method can only be applied to the data it was modeled on.
Unsupervised methods, in contrast, do not require training data.
M. T. Nayeem, T. A. Fuad and Y. Chali [15] developed an unsupervised abstractive
summarization system for multi-document settings. They designed a paraphrastic sentence fusion
model that jointly performs sentence fusion and paraphrasing using a skip-gram word embedding
model at the sentence level. Their results showed a significant improvement in
multi-document abstractive summarization.
Several other studies address multi-document summarization. Lin Zhao et al. [16]
presented query-focused multi-document summarization using extractive methods and
proposed a query expansion algorithm within a graph-based ranking approach. Ercan Canhasi
et al. [17] also studied query-focused multi-document summarization, using a graph
representation based on weighted archetypal analysis. Amini et al. [18] investigated
a learning-to-rank model for query-focused single-document summarization and compared
the proposed ranking algorithm with a logistic classifier; the ranking algorithm outperformed
the logistic classifier.
You Ouyang et al. [19] developed a regression model for summarizing multiple
documents with respect to a user query. That study concludes that, for multi-document
summarization, regression models perform better than classification or ranking models.
Its sentence position feature is assessed from the global position of a sentence in
a document, so a sentence at the beginning of the document always has a greater weight than
the sentences that follow. This is considered inappropriate because not all documents have
their important sentences at the beginning. To overcome this, this paper assumes that
the sentences with a high level of significance are those located at the beginning and at
the end of the document.
Other studies that apply regression to multi-document summarization were conducted
in [20-22]. The authors of [20] present a fast query-based multi-document summarizer called
FastSum, based solely on word-frequency features of clusters, documents and topics.
The authors of [21] use Integer Linear Programming to jointly maximize the importance and
the diversity of the sentences included in the summary, without exceeding the maximum
summary length allowed. To obtain an importance score for each sentence, they use a Support
Vector Regression (SVR) model trained on summaries written by humans. The authors of [22]
use SVM as a supervised learning algorithm for ranking sentences based on similarity scores
between candidate sentences and benchmark summaries. Based on these studies,
the authors are interested in applying a regression model to multi-document summarization
because it is simple yet reliable. This paper therefore proposes a regression model to rank
sentences in query-focused multi-document summarization based on the significance of
sentence positions.
2. Research Method
The proposed summarization approach is based on a feature-based extractive
framework, in which sentence ranking and extraction rely on a set of pre-defined
sentence features and a scoring function that combines them.
2.1. Feature Design
Each sentence in the document is assessed based on the values of its features, so
the features play an important role in scoring and ranking sentences. The features used
in this paper are as follows:
a. Word matching feature
Counts the words that the query and a sentence in the document have in common:
$$f_{word}(s) = \sum_{w_i \in q} \sum_{w_j \in s} same(w_i, w_j) \tag{1}$$
where $f_{word}(s)$ is the feature value of sentence $s$ and $q$ is the query. $same(w_i, w_j)$ is 1 if
query word $w_i$ and sentence word $w_j$ are the same, and 0 otherwise.
b. Semantic matching feature
Compares similar words between the query and a sentence in a document:
$$f_{wordnet}(s) = \sum_{w_i \in q} \sum_{w_j \in s} similarity(w_i, w_j) \tag{2}$$
where $f_{wordnet}(s)$ is the value of the similarity between the query and the sentence, and $q$ is
the query. If a word in the query is the same as a word in the sentence, the pair is given
a value of 1; otherwise it is given a value of 0.
c. Named entity matching feature (query-dependent)
The number of named entities that the sentence shares with the query, i.e. the size of
the intersection of the two named-entity sets:
$$f_{entity}(s) = |entity(s) \cap entity(q)| \tag{3}$$
d. Named entity feature
$$f_{entityno} = |entity(s)| \tag{4}$$
where $f_{entityno}$ is the number of named entities in the sentence.
e. Stop-word penalty feature
Sentences with many stop-words are assumed to be less informative:
$$f_{stopword} = |stopword(s)| \tag{5}$$
where $|stopword(s)|$ is the number of stop-words in the sentence.
f. Sentence position feature
Sentences at the beginning and at the end of the document are assumed to carry more
important information, so they receive a higher weight than the other sentences:
$$f_{pos} = \begin{cases} 1 - \left( \dfrac{i - 1}{n} \right), & 1 < i_p < \dfrac{i_p + j}{2} \\[2ex] \left( \dfrac{i - 1}{n} \right), & \dfrac{i_p + j}{2} \le i_p \le j \end{cases} \tag{6}$$
where $i$ is the position of the sentence in the document, $n$ is the number of sentences in
the document, and $i_p$ is the sentence index.
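As an illustration, three of the features above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the stop-word list `STOP_WORDS` is a hypothetical toy list, sentences and queries are assumed to be already tokenized, and, because Eq. (6) does not define $j$, the position feature assumes that the split between its two branches is the midpoint of the document (sentences 1 to n/2 use the first branch, the rest use the second).

```python
from typing import List, Set

# Hypothetical toy stop-word list, for illustration only.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "in", "is", "to", "and"}


def word_matching(query: List[str], sentence: List[str]) -> int:
    """Eq. (1): count exact matches over all (query word, sentence word) pairs."""
    return sum(1 for wi in query for wj in sentence if wi == wj)


def stopword_count(sentence: List[str]) -> int:
    """Eq. (5): number of stop-words in the sentence."""
    return sum(1 for w in sentence if w in STOP_WORDS)


def position_feature(i: int, n: int) -> float:
    """Eq. (6) with an ASSUMED midpoint split between the two branches.

    i is the 1-based position of the sentence, n the number of sentences.
    """
    if not 1 <= i <= n:
        raise ValueError("i must satisfy 1 <= i <= n")
    ratio = (i - 1) / n
    # ASSUMPTION: the first half of the document uses the first branch of
    # Eq. (6), the second half uses the second branch.
    return 1.0 - ratio if i <= n / 2 else ratio
```

With this split, the first and last sentences of a 10-sentence document receive 1.0 and 0.9, while the middle sentences 5 and 6 receive 0.6 and 0.5, which matches the stated intent that both ends of the document are weighted more heavily than the middle.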
2.2. Support Vector Regression
SVR is the application of the Support Vector Machine (SVM) to regression problems,
in which the output is a real (continuous) number. SVR is able to limit overfitting, which
gives it good performance [23-25]. Suppose we have $\lambda$ training pairs $(x_i, y_i)$,
$i = 1, 2, \ldots, \lambda$, with inputs $x = \{x_1, x_2, x_3, \ldots\} \subseteq \Re^N$ and outputs
$y = \{y_1, \ldots, y_\lambda\} \subseteq \Re$. With SVR, we want to find a function $f(x)$ that deviates from
the actual target $y_i$ by at most $\varepsilon$ for all training data. When $\varepsilon$ is equal to zero, we get
a perfect regression [23]. Suppose the regression line is the following function:
$$f(x) = w^T \varphi(x) + b \tag{7}$$
where $\varphi(x)$ is the point in feature space $F$ to which $x$ in the input space is mapped.
The coefficients $w$ and $b$ are estimated by minimizing the risk function defined in (8):
$$\min \; \frac{1}{2} \|w\|^2 + C \, \frac{1}{\lambda} \sum_{i=1}^{\lambda} L_\varepsilon(y_i, f(x_i)) \tag{8}$$
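To make the risk function in (8) concrete, the sketch below evaluates it for a linear model, i.e. with the identity mapping in place of φ. It assumes that the loss is the standard ε-insensitive loss used by SVR, which is zero whenever the prediction lies within ε of the target; the function names are illustrative only.

```python
from typing import Sequence


def eps_insensitive_loss(y: float, fx: float, eps: float) -> float:
    """Standard epsilon-insensitive loss: errors smaller than eps cost nothing."""
    return max(0.0, abs(y - fx) - eps)


def regularized_risk(
    w: Sequence[float],
    b: float,
    xs: Sequence[Sequence[float]],
    ys: Sequence[float],
    C: float,
    eps: float,
) -> float:
    """Value of the objective in Eq. (8) for a linear f(x) = w . x + b.

    ASSUMPTION: phi(x) = x, so no kernel mapping is applied.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    penalty = 0.5 * sum(wk * wk for wk in w)
    losses = [
        eps_insensitive_loss(y, sum(wk * xk for wk, xk in zip(w, x)) + b, eps)
        for x, y in zip(xs, ys)
    ]
    return penalty + C * sum(losses) / len(losses)
```

For example, with w = (1, 0), b = 0, C = 2 and ε = 0.1, a sample predicted within 0.05 of its target contributes no loss, while a sample that is off by 1.0 contributes 0.9, so only sufficiently large errors push the objective up.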
2.3. Sentence Ranking Method with Regression
The defined features are combined into a single function that calculates the importance
score of a sentence. In this paper, Support Vector Regression (SVR) is adopted to learn
this scoring function from the previously defined features. The regression model is trained
on a set of topics taken from the DUC dataset, each containing a query and a set of relevant
documents. For a topic $D$, each sentence $s$ is given an importance score $score(s)$ and
a corresponding feature vector $F(s)$. The training data are built by pairing the scores and
the features of the sentences, that is $\{(score(s), F(s)) \mid s \in D\}$. The goal is to predict
the unknown score of a new sentence $s'$ in a topic $D'$ from its feature vector $F(s')$. This
task can be treated as a typical linear regression problem: the training data
$\{(score(s), F(s)) \mid s \in D\}$ are used to learn the optimal regression function $f: F(s) \to R$
from a set of candidate functions $\{f(x) = w \cdot x + b \mid w \in R^n, b \in R\}$. Linear SVR selects
the optimum function $f_0(x) = w_0 \cdot x + b_0$ by minimizing the structural risk function
$$\Phi(w, b) = \frac{1}{2} \|w\|^2 + C \left( \frac{1}{|D|} \sum_{s_i \in D} L\big(score(s_i) - (w \cdot F(s_i) + b)\big) \right) \tag{9}$$
where $L(x)$ is a loss function, $C$ is a weight that balances the two terms and $|D|$ is
the number of sentences in $D$. After the regression function $f_0$ is learned, it is used to
estimate the importance of a new sentence $s'$:
$$score(s') = f_0(F(s')) = w_0 \cdot F(s') + b_0 \tag{10}$$
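Once the weights have been learned, applying Eq. (10) is a dot product followed by a sort. The sketch below assumes that $w_0$ and $b_0$ are already available (for example from any linear SVR trainer) and that each sentence is represented by its feature vector; the numbers in the usage note are made up for illustration.

```python
from typing import List, Sequence


def predict_score(w0: Sequence[float], b0: float, features: Sequence[float]) -> float:
    """Eq. (10): score(s') = w0 . F(s') + b0."""
    if len(w0) != len(features):
        raise ValueError("weight and feature vectors must have the same length")
    return sum(wk * fk for wk, fk in zip(w0, features)) + b0


def rank_sentences(
    w0: Sequence[float], b0: float, feature_vectors: Sequence[Sequence[float]]
) -> List[int]:
    """Return sentence indices ordered from highest to lowest predicted score."""
    scores = [predict_score(w0, b0, f) for f in feature_vectors]
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

With hypothetical weights w0 = (0.5, -0.2, 1.0) and b0 = 0.1, the feature vectors (2, 3, 0.9), (0, 1, 0.1) and (4, 0, 1.0) score 1.4, 0.0 and 3.1, so the ranking is third, first, second. The negative second weight plays the role of a penalty feature such as the stop-word count.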
2.4. Establishment of Training Data
The training data are built from the DUC (Document Understanding Conference) 2005 dataset.
In this dataset there are 50 documents with 25 topics; each topic has a topic-specific query
and 4 human expert summaries written for that query. Our initial hypothesis is that the more
similar a sentence in the document is to the sentences in the human expert summaries,
the higher the weight it receives from the N-gram scoring when the training data are built.
For the document set $D$ and the set of human expert summaries $H = \{H_1, \ldots, H_m\}$, each
sentence $s$ in $D$ is given an importance score $score(s|H)$. The score is calculated from
the unigram probabilities that $s$ is recognized as a summary sentence given the human
summaries. Using a bag-of-words model, the probability of unigram $t$ in the $i$-th human
summary $H_i$ is calculated by:
$$p(t|H_i) = freq(t)/|H_i| \tag{11}$$
where $freq(t)$ is the frequency of $t$ in $H_i$ and $|H_i|$ is the number of words in $H_i$.
The probability of $t$ over all human summaries is obtained with the maximum strategy:
$$p_{\max}(t|H) = \max_{H_i \in H} p(t|H_i) \tag{12}$$
The overall score of sentence $s$ is calculated by summing the unigram probabilities of its terms:
$$score(s|H) = \sum_{t_j \in s} p(t_j|H) \tag{13}$$
or, with the maximum strategy of (12) written out:
$$score_{\max}(s|H) = \sum_{t_j \in s} \max_{H_i \in H} p(t_j|H_i) \tag{14}$$
To calculate the score of a new sentence, a combined function of the features described
above is used, with Support Vector Regression (SVR) as the learning tool.
The general process of the system is shown in Figure 1.
Figure 1. General system diagram
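The training target of Eqs. (11) to (14) can be sketched as follows. This is a minimal sketch that assumes the human summaries and the sentence are already tokenized (and, if desired, stemmed and stop-word filtered); the toy summaries in the usage note are invented for illustration.

```python
from collections import Counter
from typing import List, Sequence


def unigram_prob(term: str, summary: Sequence[str]) -> float:
    """Eq. (11): relative frequency of a term in one human summary."""
    if not summary:
        return 0.0
    return Counter(summary)[term] / len(summary)


def p_max(term: str, summaries: Sequence[Sequence[str]]) -> float:
    """Eq. (12): maximum of the term probability over all human summaries."""
    return max((unigram_prob(term, h) for h in summaries), default=0.0)


def score_max(sentence: Sequence[str], summaries: Sequence[Sequence[str]]) -> float:
    """Eq. (14): sum over the sentence terms of their maximum probability."""
    return sum(p_max(t, summaries) for t in sentence)


if __name__ == "__main__":
    human: List[List[str]] = [["storm", "hit", "coast", "storm"], ["coast", "flood"]]
    print(score_max(["storm", "coast", "rain"], human))
```

In the toy example, "storm" has probability 0.5 in the first summary, "coast" has 0.5 in the second, and "rain" appears in neither, so the sentence receives a score of 1.0. Sentences whose terms are frequent in at least one human summary therefore receive higher regression targets.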
3. Results and Analysis
A series of trials was conducted to produce query-focused multi-document
summaries based on the significance of sentence position. The dataset
used is DUC (Document Understanding Conference) 2005. This dataset is used because
it consists of 10 topics, with each topic consisting of 30-50 news documents and 4 human
reference summaries. The dataset can be downloaded from
http://www-nlpir.nist.gov/projects/duc/duc2005/.
In all trials, queries and documents are preprocessed by stop-word removal and
stemming. The system is limited to producing summaries of at most 250 words. After
the sentences are ranked, the sentences with the highest scores are chosen from
the original documents and added to the summary until the 250-word limit is
reached.
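The selection step can be sketched as a greedy loop over the ranked sentences. The text does not say what happens to a sentence that would exceed the limit, so this sketch assumes that such a sentence is skipped and shorter, lower-ranked sentences may still be added; truncating the last sentence would be an equally plausible reading.

```python
from typing import List, Sequence


def select_summary(
    sentences: Sequence[str], scores: Sequence[float], max_words: int = 250
) -> List[str]:
    """Greedily pick the highest-scoring sentences within a word budget."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    order = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    chosen: List[str] = []
    used = 0
    for k in order:
        length = len(sentences[k].split())
        # ASSUMPTION: a sentence that would overflow the budget is skipped,
        # and the loop continues with lower-ranked sentences.
        if used + length > max_words:
            continue
        chosen.append(sentences[k])
        used += length
    return chosen
```

For instance, with a budget of 5 words, sentences of 3, 2 and 4 words scored 0.9, 0.5 and 0.7 yield the 3-word and the 2-word sentence: the 4-word sentence ranks second but no longer fits.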
In this paper, two DUC automatic evaluation criteria, ROUGE-2 and ROUGE-SU4, are
used to compare the summaries produced by the system with summaries made by
humans. ROUGE-2 and ROUGE-SU4 are used because these two criteria are the official
ROUGE evaluation measures. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [26]
is an automatic summarization evaluation method based on N-gram overlap ratios. For example,
ROUGE-2 evaluates a system summary by matching its bigrams with those of the human
summaries, i.e.:
$$R_n(S) = \frac{\sum_{j=1}^{h} \sum_{t_i \in S} Count(t_i|S, H_j)}{\sum_{j=1}^{h} \sum_{t_i \in S} Count(t_i|H_j)} \tag{15}$$
where $S$ is the summary to be evaluated, $H_j$ ($j = 1, 2, \ldots, h$) are the human summaries
regarded as the standard summaries, $t_i$ is a bigram in summary $S$, $Count(t_i|H_j)$ is the number
of occurrences of bigram $t_i$ in the $j$-th human summary $H_j$, and $Count(t_i|S, H_j)$ is
the number of occurrences of $t_i$ in both $S$ and $H_j$. ROUGE-SU4 is computed in the same way
as ROUGE-2, except that it matches unigrams and skip-bigrams (pairs of words in sentence
order with at most four words between them) instead of contiguous bigrams.
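A simplified bigram-overlap computation in the spirit of ROUGE-2 is sketched below. It is not the official ROUGE toolkit: it assumes tokenized input, uses clipped co-occurrence counts, and follows the usual recall-oriented definition in which the denominator counts the bigrams of the human summaries.

```python
from collections import Counter
from typing import Sequence, Tuple


def bigrams(tokens: Sequence[str]) -> Counter:
    """Multiset of contiguous bigrams in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(system: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Clipped bigram matches divided by the number of reference bigrams."""
    sys_bi = bigrams(system)
    matched = 0
    total = 0
    for ref in references:
        ref_bi = bigrams(ref)
        # A reference bigram counts as matched at most as often as it
        # occurs in the system summary.
        matched += sum(min(count, sys_bi[gram]) for gram, count in ref_bi.items())
        total += sum(ref_bi.values())
    return matched / total if total else 0.0
```

For the system summary "the cat sat on the mat" and the single reference "the cat lay on the mat", three of the five reference bigrams ("the cat", "on the", "the mat") are matched, giving a recall of 0.6.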
In this study, two experiments were conducted to measure the reliability of the regression
model in multi-document summarization based on the significance of sentence position.
The first experiment used all the features defined in section 2.1,
while the second experiment left out the sentence position feature.
The two experiments were carried out to find out how effective the summarization system
is when it takes into account the position of the important sentences in
the document. Table 1 shows the average ROUGE-2 and ROUGE-SU4 results with their 95%
confidence intervals (CI):
Table 1. The Results of the Evaluation of the Application of Different Features
in the Dataset DUC 2005 (CI = 95%)
Evaluation   Feature        Precision (CI)            Recall (CI)
ROUGE-2      All            0.0580 (0.0347-0.1005)    0.0590 (0.0344-0.1034)
ROUGE-2      Without fpos   0.0576 (0.0328-0.1005)    0.0585 (0.0344-0.1034)
ROUGE-SU4    All            0.0997 (0.0636-0.1414)    0.1019 (0.0684-0.1384)
ROUGE-SU4    Without fpos   0.0994 (0.0683-0.1414)    0.1015 (0.0689-0.1384)
4. Conclusion
In this paper, we applied regression models to query-focused
multi-document summarization based on the significance of the sentence position. The method
uses Support Vector Regression (SVR), which estimates the weight of each sentence in a set of
documents to be summarized, based on previously defined sentence features.
A series of evaluations was performed on the DUC 2005 data set. The resulting summaries
have average precision and recall values of 0.0580 and 0.0590 when measured with ROUGE-2,
and 0.0997 and 0.1019 when measured with ROUGE-SU4. The proposed regression model can
measure the significance of the position of a sentence in the document well. The results
also show that the summarization system with the sentence position feature has slightly
higher precision and recall values than the system without it.
References
[1] Sartuni, Rasjid, et al. Indonesian for Higher Education (in Indonesia: Bahasa Indonesia untuk
Perguruan Tinggi). Jakarta: Nina Dinamika. 1984.
[2] Wang L, Raghavan H, Castelli V, Florian R, Cardie C. A Sentence Compression Based Framework
to Query-Focused Multi-Document Summarization. 2016.
[3] Kumar YJ, Salim N. Automatic Multi Document Summarization Approaches. Journal of Computer
Sciences. 2012; 8(1): 133-140.
[4] Haghighi A, Vanderwende L. Exploring Content Models for Multi-Document Summarization. Human
Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL.
2009: 362-370.
[5] Mani I, Maybury MT. Advances in Automatic Text Summarization. Cambridge: The MIT Press.
[6] Nayeem MT, Fuad TA, Chali Y. Abstractive Unsupervised Multi-Document Summarization using
Paraphrastic Sentence Fusion. Proceedings of the 27th International Conference on Computational
Linguistics. Santa Fe. 2018.
[7] Cao Z, Li W, Li S, Wei F. Improving Multi-Document Summarization via Text Classification.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 2017: AAAI-17.
[8] Dallianis H. GSLT: Natural Language Generation Spring. 2005.
[9] Lukmana I, Swanjaya D, Kurniawardhani A, Arifin AZ, Purwitasari D. Multi-Document Summarization
Based on Sentence Clustering Improved Using Topic Words. JUTI: Jurnal Ilmiah Teknologi Informasi.
2014; 12(2) :1-8.
[10] Yih WT, Goodman J, Vanderwende L, Suzuki H. Multi-Document Summarization by Maximizing
Informative Content-Words. Proceedings of the 20th International Joint Conference on Artificial
Intelligence. 2007: 1776-1782.
[11] Bysani P, Reddy VB, Varma V. Modeling Novelty and Feature Combination using Support Vector
Regression for Update Summarization. Proceedings of ICON-2009: 7th International Conference on
Natural Language Processing. 2009: 41.
[12] Tohalino JV, Amancio DR. Extractive Multi Document Summarization using Dynamical
Measurements of Complex Networks. 2017 Brazilian Conference on Intelligent Systems (BRACIS).
2017: 366-371.
[13] Di Fabbrizio G, Stent A, Gaizauskas R. A Hybrid Approach to Multi-document Summarization of
Opinions in Reviews. Proceedings of the 8th International Natural Language Generation Conference.
2014: 54-63.
[14] Lee JH, Park S, Ahn CM, Kim D. Automatic Generic Document Summarization based on
Non-Negative Matrix Factorization. Information Processing and Management. 2009; 45(1): 20-34.
[15] Nayeem MT, Fuad TA, Chali Y. Abstractive Unsupervised Multi-Document Summarization using
Paraphrastic Sentence Fusion. Proceedings of the 27th International Conference on Computational
Linguistics. 2018: 1191-1204.
[16] Lin CY. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches
Out-Proceedings of the ACL Workshop. 2004: 74-81.
[17] Canhasi E, Kononenko I. Weighted archetypal analysis of the multi-element graph for query-focused
multi-document summarization. Expert Systems with Applications. 2014; 41(2): 535-43.
[18] Amini MR, Usunier N, Gallinari P. Automatic Text Summarization based on Word-Clusters and
Ranking Algorithms. ECIR. In D. E. Losada & J.M. 2005; 3408: 142-156.
[19] Ouyang Y, et al. Applying Regression Models to Query-Focused Multi-Document Summarization.
Information Processing and Management. 2011; 47(2): 227-37.
[20] Schilder F, Kondadadi R. Fast and accurate query-based multi-document summarization.
Proceedings of ACL-08: HLT, Short Papers (Companion Volume). 2008: 205–208.
[21] Galanis D, Lampouras G, Androutsopoulos I. Extractive Multi-Document Summarization with Integer
Linear Programming and Support Vector Regression. Proceedings of COLING 2012: Technical
Papers. 2012: 911–926.
[22] Dlikman A, Last M. Using Machine Learning Methods and Linguistic Features in
Single-Document Extractive Summarization. Proceedings of DMNLP, Workshop at ECML/PKDD.
Riva del Garda. 2016: 1-8.
[23] Santosa B. Applied Data Mining using Matlab (in Indonesia: Data Mining Terapan dengan Matlab).
Yogyakarta: Graha Ilmu. 2011.
[24] Alkaff M, Khatimi H, Puspita W, Sari Y. Modelling and predicting wetland rice production using
support vector regression. TELKOMNIKA Telecommunication Computing Electronics and Control.
2019; 17(6): 819-825.
[25] Harabagiu S, Lacatusu F. Using Topic Themes for Multi-Document Summarization. ACM
Transactions on Information Systems. 2010; 28(3): 13.
[26] Lin CY, Hovy E. Manual and Automatic Evaluation of Summaries. Proceedings of the ACL-02
Workshop on Automatic Summarization. 2002; 4: 45-51.

PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
Sustainable Sites - Green Building Construction
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Construction Project Organization Group 2.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Mechanical Engineering MATERIALS Selection
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Geodesy 1.pptx...............................................
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Embodied AI: Ushering in the Next Era of Intelligent Systems
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Project quality management in manufacturing
composite construction of structures.pdf
web development for engineering and engineering
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
OOP with Java - Java Introduction (Basics)
Sustainable Sites - Green Building Construction

Regression model focused on query for multi documents summarization based on significance of the sentence position

  • 1. TELKOMNIKA, Vol. 17, No. 6, December 2019, pp. 3050~3056. ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018. DOI: 10.12928/TELKOMNIKA.v17i6.12494. Received February 6, 2019; Revised June 18, 2019; Accepted July 2, 2019.

Regression model focused on query for multi documents summarization based on significance of the sentence position

Aris Fanani*1, Yuniar Farida2, Putra Prima Arhandi3, M. Mahaputra Hidayat4, Abdul Muhid5, Billy Montolalu6
1,2,5 UIN Sunan Ampel Surabaya, 117, A Yani St., Surabaya, Indonesia
3 State Polytechnic of Malang, 9, Soekarno Hatta St., Malang, Indonesia
4 Bhayangkara University, 114, A Yani St., Surabaya, Indonesia
6 IT Telkom, Surabaya, Indonesia
*Corresponding author, e-mail: arisfa@uinsby.ac.id1, yuniar_farida@uinsby.ac.id2, putraprima@polinema.ac.id3, mahaputra@ubhara.ac.id4, abdulmuhid@uinsby.ac.id5, billy@ittelkom-sby.ac.id6

Abstract
Document summarization is needed to get information effectively and efficiently. One way to obtain a document summary is to apply machine learning techniques. This paper proposes the application of regression models to query-focused multi-document summarization based on the significance of the sentence position. The method used is Support Vector Regression (SVR), which estimates the weight of each sentence in a set of documents to be summarized, based on previously defined sentence features. A series of evaluations was performed on the DUC 2005 data set. From the test results, the summaries obtained have average precision and recall values of 0.0580 and 0.0590 for measurements using ROUGE-2, and 0.0997 and 0.1019 for measurements using ROUGE-SU4. The proposed regression model can measure the significance of the position of the sentence in the document well.

Keywords: multi-document summarization, sentence position, support vector regression

Copyright © 2019 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
As internet usage increases, information becomes easier to obtain and more abundant. For a single topic, many documents are available with different narratives even though the core information is the same. Document summarization is needed to get that information effectively and efficiently. When documents are searched on web pages, keyword searches are generally carried out over the entire contents of each document, so information retrieval takes a long time, whereas users expect correct results in a short time. It is therefore recommended that keyword matching be carried out on the core of the documents, which is shorter.

A summary is a concise expression of the main content of an article that tells the reader the core of its main idea [1-4]. The simple concept of a summary is to take the important parts of the entire article and present them again in a more concise form for its users [5]. A good summary should retain the most important contents of the original document or a cluster of related documents, while being coherent, non-redundant and grammatically readable [6].

A summary can be made from one document or from several documents. Multi-document summarization differs from single-document summarization in that it involves many sources of information that overlap and complement each other. The main task is therefore not only to identify and remove redundancy across all documents, but also to ensure that the final summary is coherent and complete [3, 5, 7]. This is the background to the need for an automatic summarization system.

Automatic text summarization is a computer-based technique that produces a text shorter than the original while still holding the main points of the summarized text [8-11].
  • 2. Automatic summarization techniques are divided into two groups: extractive summarization and abstractive summarization [4]. An extractive summary is produced by arranging a few sentences that are selected exactly as they appear in the original document. Abstractive summarization is a more difficult task because it is carried out by paraphrasing the source documents. Tohalino and Amancio [12] used dynamical measurements of complex networks for extractive multi-document summarization, extracting the most central sentences from several textual sources. Di Fabbrizio, Stent and Gaizauskas [13] present the STARLET-H hybrid method, an abstractive/extractive summarizer that produces summaries of opinion reviews by combining natural language generation with salient sentence selection techniques.

Other studies state that document summarization methods can also be divided into generic summarization and query-based summarization [9, 14]. They also explain that generic summarization is divided into supervised and unsupervised methods. The supervised method needs training data prepared by a group of people in order to produce a summary, so different documents require different training data, and the method can only be applied to certain data models. The unsupervised method does not require such training data. Nayeem, Fuad and Chali [15] developed an unsupervised abstractive summarization system in multi-document settings. They designed a paraphrastic sentence fusion model which jointly performs sentence fusion and paraphrasing using a skip-gram word embedding model at the sentence level. The results showed that this method provides a significant improvement in multi-document abstractive summarization.

Several other studies address query-focused multi-document summarization. Lin Zhao, et al. [16] presented multi-document summarization using extractive methods on a query, and proposed a query expansion algorithm in a graph-based ranking approach. Ercan Canhasi et al. [17] studied query-focused multi-document summarization using a graph representation based on weighted archetypal analysis. Amini [18] investigated how to use a ranking learning model for query-focused single-document summarization and compared the proposed ranking algorithms with a logistic classifier; the ranking algorithm outperformed the logistic classifier.

You Ouyang [19] developed a regression model for query-focused multi-document summarization. That study concludes that, for multi-document summarization, the regression model performs better than the classification or ranking model. In that study, the sentence position feature is assessed by the global position of the sentence in a document, so a sentence at the beginning of the document always has a greater weight than the sentences that follow. This is considered inappropriate because not all documents have their important sentences at the beginning. To overcome this, we assume that the sentences with a high level of significance are those located at the beginning and at the end of the document.

Other studies that apply regression to multi-document summarization were conducted by [20-22]. Researchers [20] present a fast query-based multi-document summarizer called FastSum, based solely on word-frequency features of clusters, documents and topics. Researchers [21] use Integer Linear Programming to jointly maximize the importance and the diversity of the sentences included in the summary without exceeding the maximum summary length allowed. To get an importance score for each sentence, they use a Support Vector Regression (SVR) model trained on summaries written by humans. Researchers [22] use SVM as a supervised learning algorithm for ranking sentences based on similarity scores between candidate sentences and benchmark summaries.

Considering the methods above, the authors are interested in applying a regression model to multi-document summarization because it is simple yet reliable. This paper therefore proposes a regression model to rank sentences in query-focused multi-document summarization based on the significance of sentence positions.
  • 3. 2. Research Method
The proposed summarization approach is based on a feature-based extractive framework, in which sentence ranking and extraction are based on a set of pre-defined sentence features and a combined scoring function.

2.1. Feature Design
Each sentence in the document is assessed by the values of its features, so the features play an important role in scoring and ranking sentences. The features used in this paper are as follows:

a. Word matching feature
Compares the words of the query with the words of a sentence in the document:
\[ f_{\mathrm{word}}(s) = \sum_{w_i \in q} \sum_{w_j \in s} \mathrm{same}(w_i, w_j) \tag{1} \]
where $f$ is the feature value and $q$ is the query. If a word in the query is the same as a word in the sentence, the pair is given a value of 1; otherwise it is given a value of 0.

b. Semantic matching feature
Compares similar words between the query and a sentence in the document:
\[ f_{\mathrm{wordnet}}(s) = \sum_{w_i \in q} \sum_{w_j \in s} \mathrm{similarity}(w_i, w_j) \tag{2} \]
where $f$ is the value of the similarity between the query and the sentence, and $q$ is the query. If a word in the query is the same as a word in the sentence, the pair is given a value of 1; otherwise it is given a value of 0.

c. Named entity matching feature (query-dependent)
The intersection of the named entities in the query with the named entities in the sentence:
\[ f_{\mathrm{entity}}(s) = |\mathrm{entity}(s) \cap \mathrm{entity}(q)| \tag{3} \]

d. Named entity feature
\[ f_{\mathrm{entityno}} = |\mathrm{entity}(s)| \tag{4} \]
where $f_{\mathrm{entityno}}$ is the number of named entities in the sentence.

e. Stop-word penalty feature
Assumes that sentences with many stop-words are less informative:
\[ f_{\mathrm{stopword}} = |\mathrm{stopword}(s)| \tag{5} \]
where $|\mathrm{stopword}(s)|$ is the number of stop-words in the sentence.

f. Sentence position feature
Assuming that the sentences at the beginning and end of the document carry more important information, these sentences are given a higher weight than the other sentences:
\[ f_{\mathrm{pos}} = \begin{cases} 1 - \left(\dfrac{i-1}{n}\right), & 1 < i_p < \dfrac{i_p + j}{2} \\[2ex] \left(\dfrac{i-1}{n}\right), & \dfrac{i_p + j}{2} \le i_p \le j \end{cases} \tag{6} \]
where $i$ is the sentence position in the document, $n$ is the number of sentences in the document, and $i_p$ is the sentence index.
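The three features that need no external resources (word matching, stop-word penalty, and sentence position) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word list is a tiny hypothetical one, sentences are assumed to be already tokenized, and because (6) does not define $j$, the split between the two branches is assumed here to be the midpoint of the document, $(1+n)/2$.

```python
from typing import List, Set

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS: Set[str] = {"the", "a", "an", "of", "in", "is", "to", "and"}


def f_word(query: List[str], sentence: List[str]) -> int:
    """Eq. (1): count the (query word, sentence word) pairs that are identical."""
    return sum(1 for wi in query for wj in sentence if wi == wj)


def f_stopword(sentence: List[str]) -> int:
    """Eq. (5): number of stop-words in the sentence."""
    return sum(1 for w in sentence if w in STOP_WORDS)


def f_pos(i: int, n: int) -> float:
    """Eq. (6) with an ASSUMED split point at the document midpoint.

    i is the 1-based position of the sentence, n the number of sentences.
    First half: 1 - (i - 1) / n (decreasing from the start).
    Second half: (i - 1) / n (increasing towards the end).
    """
    if n <= 0 or not 1 <= i <= n:
        raise ValueError("i must satisfy 1 <= i <= n")
    relative = (i - 1) / n
    midpoint = (1 + n) / 2  # assumption: the source leaves the bound j undefined
    return 1 - relative if i < midpoint else relative


if __name__ == "__main__":
    query = ["oil", "price"]
    sentence = ["the", "price", "of", "oil"]
    print(f_word(query, sentence), f_stopword(sentence))
    print([round(f_pos(i, 10), 2) for i in range(1, 11)])
```

With this reading, the first and last sentences of a ten-sentence document receive the largest weights and the sentences around the middle receive the smallest, which matches the assumption stated for feature f.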
  • 4. 2.2. Support Vector Regression
SVR is the application of the Support Vector Machine (SVM) to regression, where the output is a real, continuous number. SVR is a method that can overcome overfitting, so it produces good performance [23-25]. Suppose we have $\lambda$ training pairs $(x_j, y_j)$, $j = 1, 2, \ldots, \lambda$, with inputs $x = \{x_1, x_2, x_3, \ldots\} \subseteq \mathbb{R}^N$ and outputs $y = \{y_1, \ldots, y_\lambda\} \subseteq \mathbb{R}$. With SVR, we want to find a function $f(x)$ that deviates from the actual targets $y_i$ by at most $\varepsilon$ for all training data. When $\varepsilon$ is equal to zero, we get a perfect regression [23]. Suppose the regression line is the following function:
\[ f(x) = w^{T} \varphi(x) + b \tag{7} \]
where $\varphi(x)$ is the point in the feature space $F$ obtained by mapping $x$ from the input space. The coefficients $w$ and $b$ are estimated by minimizing the risk function defined in (8):
\[ \min \; \frac{1}{2}\|w\|^{2} + C \, \frac{1}{\lambda} \sum_{i=1}^{\lambda} L_{\varepsilon}\big(y_i, f(x_i)\big) \tag{8} \]

2.3. Sentence Ranking Method with Regression
The features defined above are combined by a scoring function to calculate the importance score of a sentence. In this paper, Support Vector Regression (SVR) is adopted to learn this scoring function from the previously defined features. The regression model is trained on a set of topics in which every sentence has been given an importance score. The topics come from the DUC dataset, and each contains a query and a set of relevant documents. Each sentence $s$ in a document set $D$ is given an importance score $\mathrm{score}(s)$ and a corresponding feature vector $F(s)$. The training data is built by pairing the sentence scores with the feature vectors, that is $\{(\mathrm{score}(s), F(s)) \mid s \in D\}$. The goal is to predict the unknown score of a new sentence $s'$ in a topic $D'$ from its feature vector $F(s')$.
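To make the risk function in (8) concrete, the sketch below evaluates it for a linear model, i.e. with $\varphi(x) = x$. The paper does not spell out $L_{\varepsilon}$; the code assumes the standard $\varepsilon$-insensitive loss, $\max(0, |y - f(x)| - \varepsilon)$, which is the usual choice for SVR. All function names are illustrative, and the code only evaluates the objective for given $w$ and $b$; it does not perform the minimization.

```python
from typing import Sequence


def eps_insensitive_loss(y: float, fx: float, eps: float) -> float:
    """Assumed form of L_eps: deviations of at most eps cost nothing."""
    return max(0.0, abs(y - fx) - eps)


def linear_predict(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Eq. (7) with the identity feature map: f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svr_risk(
    w: Sequence[float],
    b: float,
    xs: Sequence[Sequence[float]],
    ys: Sequence[float],
    C: float,
    eps: float,
) -> float:
    """Eq. (8): 0.5 * ||w||^2 + C * (mean eps-insensitive loss over the data)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    empirical = sum(
        eps_insensitive_loss(y, linear_predict(w, x, b), eps)
        for x, y in zip(xs, ys)
    ) / len(xs)
    return regularizer + C * empirical


if __name__ == "__main__":
    xs = [[1.0, 2.0], [3.0, 4.0]]
    ys = [1.05, 5.0]
    # First point lies inside the eps-tube (zero loss), the second lies outside it.
    print(svr_risk(w=[1.0, 0.0], b=0.0, xs=xs, ys=ys, C=2.0, eps=0.1))
```

A training procedure would search for the $w$ and $b$ that make this value smallest; the constant $C$ controls how strongly errors outside the $\varepsilon$-tube are penalized relative to the size of $w$.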
This task can be considered a typical linear regression problem: the training data $\{(\mathrm{score}(s), F(s)) \mid s \in D\}$ is used to learn the optimal regression function $f: F(s) \to \mathbb{R}$ from a set of candidate functions $\{f(x) = w \cdot x + b \mid w \in \mathbb{R}^n, b \in \mathbb{R}\}$. Linear SVR selects the optimum function $f_0(x) = w_0 \cdot x + b_0$ by minimizing the structural risk function:
\[ \Phi(w, b) = \frac{1}{2}\|w\|^{2} + C \left( \frac{1}{|D|} \sum_{s_i \in D} L\big(\mathrm{score}(s_i) - (w \cdot F(s_i) + b)\big) \right) \tag{9} \]
where $L(x)$ is a loss function, $C$ is a weight that balances the two terms, and $|D|$ is the number of sentences in $D$. After the regression function $f_0$ is learned, it is used to estimate the importance of a new sentence $s'$:
\[ \mathrm{score}(s') = f_0(F(s')) = w_0 \cdot F(s') + b_0 \tag{10} \]

2.4. Establishment of Training Data
To establish the training data, the DUC (Document Understanding Conference) 2005 dataset is used. In this dataset there are 50 documents with 25 topics; each topic has a query specific to the topic and 4 summaries written by human experts in response to that query. Our initial hypothesis is that the more similar a sentence in the document is to the sentences in the human expert summaries, the higher the weight it should receive from N-gram matching in the training data formation process. For the document set $D$ and the set of human expert summaries $H = \{H_1, \ldots, H_m\}$, each sentence $s$ in $D$ is given an importance score $\mathrm{score}(s|H)$. The score is calculated from the probability that the unigrams of $s$ are recognized as belonging to a summary sentence, given the human summaries. Using a bag-of-words model, the probability of a unigram $t$ in the $i$-th human summary $H_i$ can be calculated by:
\[ p(t|H_i) = \mathrm{freq}(t) / |H_i| \tag{11} \]
where $\mathrm{freq}(t)$ is the frequency of $t$ in $H_i$ and $|H_i|$ is the number of words in $H_i$.

  • 5. To get the probability of $t$ across all human summaries, the maximum strategy is used:
\[ p_{\max}(t|H) = \max_{H_i \in H} \, p(t|H_i) \tag{12} \]
The overall score of sentence $s$ is calculated by summing the probabilities of its unigrams:
\[ \mathrm{score}(s|H) = \sum_{t_j \in s} p(t_j|H) \tag{13} \]
or, by analogy, the unigram-based scoring method is as follows:
\[ \mathrm{score}_{\max}(s|H) = \sum_{t_j \in s} \max_{H_i \in H} \, p(t_j|H_i) \tag{14} \]
To calculate the score of a sentence, a combined function over the features described above is used. In this study, Support Vector Regression (SVR) is used as the learning tool. The general process of the system is shown in Figure 1.

Figure 1. General system diagram

3. Results and Analysis
A series of trials was conducted to obtain query-focused multi-document summaries based on the significance of sentence position. The dataset used is DUC (Document Understanding Conference) 2005. This dataset is used because it consists of 10 topics, with each topic consisting of 30-50 news documents and 4 human summaries. The dataset can be downloaded from http://www-nlpir.nist.gov/projects/duc/duc2005/. In all trials, queries and documents are preprocessed by removing stop-words and stemming. The system is limited to producing summaries of 250 words. After the sentences are ranked, the sentences with the highest scores are chosen from the original documents and added to the summary until the 250-word limit is reached.

In this paper, two DUC automatic evaluation criteria, ROUGE-2 and ROUGE-SU4, are used to compare the summaries produced by the system with the summaries made by humans. ROUGE-2 and ROUGE-SU4 are used because these two criteria are the official ROUGE evaluation measures. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [26] is an automatic summarization evaluation method that uses the N-gram ratio.
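The selection step described above (take the highest-scoring sentences until the 250-word limit is reached) can be sketched as a simple greedy loop. This is a minimal illustration, not the authors' code: the function name is hypothetical, words are counted by whitespace splitting, and a sentence that would exceed the limit is skipped rather than truncated, which is an assumption the paper does not confirm.

```python
from typing import List


def select_summary(
    sentences: List[str], scores: List[float], max_words: int = 250
) -> List[str]:
    """Greedily pick the highest-scoring sentences within a word budget.

    Assumptions (not stated in the paper): a sentence that does not fit in
    the remaining budget is skipped, and the output keeps the ranking order.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")

    # Sentence indices sorted from highest to lowest predicted score.
    ranked = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)

    chosen: List[int] = []
    used = 0
    for k in ranked:
        length = len(sentences[k].split())
        if used + length > max_words:
            continue  # would exceed the budget, try the next-ranked sentence
        chosen.append(k)
        used += length
    return [sentences[k] for k in chosen]


if __name__ == "__main__":
    demo_sentences = ["a b c", "d e", "f g h i", "j"]
    demo_scores = [0.9, 0.1, 0.8, 0.5]
    print(select_summary(demo_sentences, demo_scores, max_words=5))
```

In the full system the scores would be the SVR predictions from (10); here they are supplied directly so that the selection logic can be read in isolation.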
For example, ROUGE-2 evaluates the summary produced by the system by matching its bi-grams against the human summaries:
\[ R_n(S) = \frac{\sum_{j=1}^{h} \sum_{t_i \in S} \mathrm{Count}(t_i \mid S, H_j)}{\sum_{j=1}^{h} \sum_{t_i \in S} \mathrm{Count}(t_i \mid H_j)} \tag{15} \]
where $S$ is the summary to be evaluated, $H_j$ ($j = 1, 2, \ldots, h$) is a human summary that is considered a standard summary, $t_i$ is a bi-gram in summary $S$, $\mathrm{Count}(t_i \mid H_j)$ is the number of occurrences of bi-gram $t_i$ in the $j$-th human summary $H_j$, and $\mathrm{Count}(t_i \mid S, H_j)$ is the number of occurrences of $t_i$ in both $S$ and $H_j$.

[Figure 1 flowchart blocks: Start, Preprocessing, Training Data, Training, Feature Extraction, Testing, Result, End]

  • 6. ROUGE-SU4 is similar to ROUGE-2, except that it matches uni-grams and skip-bi-grams (with a maximum gap of four words) of the summary against the human summaries instead of bi-grams.

In this study, two experiments were conducted to measure the reliability of regression models in multi-document summarization based on the significance of sentence position. The first experiment used all the features defined in section 2.1, while the second experiment left out the sentence position feature. The two experiments were carried out to find out how effective the summarization system is when it takes into account the significance of the position of important sentences in the document. Table 1 shows the average ROUGE-2 and ROUGE-SU4 results with 95% confidence intervals (CI):

Table 1. The Results of the Evaluation of the Application of Different Features in the Dataset DUC 2005 (CI = 95%)

Evaluation   Feature        Precision (CI)            Recall (CI)
ROUGE-2      All            0.0580 (0.0347-0.1005)    0.0590 (0.0344-0.1034)
ROUGE-2      Without fpos   0.0576 (0.0328-0.1005)    0.0585 (0.0344-0.1034)
ROUGE-SU4    All            0.0997 (0.0636-0.1414)    0.1019 (0.0684-0.1384)
ROUGE-SU4    Without fpos   0.0994 (0.0683-0.1414)    0.1015 (0.0689-0.1384)

4. Conclusion
In this paper, we design the application of regression models to query-focused multi-document summarization based on the significance of the sentence position. The method uses Support Vector Regression (SVR), which estimates the weight of each sentence in a set of documents to be summarized, based on previously defined sentence features. A series of evaluations was performed on the DUC 2005 data set.
From the test results, the summaries obtained have average precision and recall values of 0.0580 and 0.0590 for measurements using ROUGE-2, and 0.0997 and 0.1019 for measurements using ROUGE-SU4. The proposed regression model can measure the significance of the position of the sentence in the document well. This also shows that the proposed summarization system has better precision and recall values than the same system without the sentence position feature (Table 1).

References
[1] Sartuni, Rasjid, et al. Indonesian for Higher Education (in Indonesian: Bahasa Indonesia untuk Perguruan Tinggi). Jakarta: Nina Dinamika. 1984.
[2] Wang L, Raghavan H, Castelli V, Florian R, Cardie C. A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization. 2016.
[3] Kumar YJ, Salim N. Automatic Multi Document Summarization Approaches. Journal of Computer Sciences. 2012; 8(1): 133-140.
[4] Haghighi A, Vanderwende L. Exploring Content Models for Multi-Document Summarization. Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL. 2009: 362-370.
[5] Mani I, Maybury MT. Advances in Automatic Text Summarization. Cambridge: The MIT Press.
[6] Nayeem MT, Fuad TA, Chali Y. Abstractive Unsupervised Multi-Document Summarization using Paraphrastic Sentence Fusion. Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe. 2018.
[7] Cao Z, Li W, Li S, Wei F. Improving Multi-Document Summarization via Text Classification. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. 2017: AAAI-17.
[8] Dallianis H. GSLT: Natural Language Generation Spring. 2005.
[9] Lukmana I, Swanjaya D, Kurniawardhani A, Arifin AZ, Purwitasari D. Multi-Document Summarization Based on Sentence Clustering Improved Using Topic Words. JUTI: Jurnal Ilmiah Teknologi Informasi. 2014; 12(2): 1-8.
  • 7. [10] Yih WT, Goodman J, Vanderwende L, Suzuki H. Multi-Document Summarization by Maximizing Informative Content-Words. Proceedings of The 20th International Joint Conference on Artificial Intelligence. 2007: 1776-1782.
[11] Bysani P, Reddy VB, Varma V. Modeling Novelty and Feature Combination using Support Vector Regression for Update Summarization. Proceedings of ICON-2009: 7th International Conference on Natural Language Processing. 2009: 41.
[12] Tohalino JV, Amancio DR. Extractive Multi Document Summarization using Dynamical Measurements of Complex Networks. 2017 Brazilian Conference on Intelligent Systems (BRACIS). 2017: 366-371.
[13] Di Fabbrizio G, Stent A, Gaizauskas R. A Hybrid Approach to Multi-document Summarization of Opinions in Reviews. Proceedings of the 8th International Natural Language Generation Conference. 2014: 54-63.
[14] Lee JH, Park S, Ahn CM, Kim D. Automatic Generic Document Summarization based on Non-Negative Matrix Factorization. Information Processing and Management. 2009; 45(1): 20-34.
[15] Nayeem MT, Fuad TA, Chali Y. Abstractive Unsupervised Multi-Document Summarization using Paraphrastic Sentence Fusion. Proceedings of the 27th International Conference on Computational Linguistics. 2018: 1191-1204.
[16] Lin CY. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out-Proceedings of the ACL Workshop. 2004: 74-81.
[17] Canhasi E, Kononenko I. Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization. Expert Systems with Applications. 2014; 41(2): 535-43.
[18] Amini MR, Usunier N, Gallinari P. Automatic Text Summarization based on Word-Clusters and Ranking Algorithms. ECIR. In D. E. Losada & J.M. 2005; 3408: 142-156.
[19] Ouyang Y, et al. Applying Regression Models to Query-Focused Multi-Document Summarization. Information Processing and Management. 2011; 47(2): 227-37.
[20] Schilder F, Kondadadi R. Fast and accurate query-based multi-document summarization. Proceedings of ACL-08: HLT, Short Papers (Companion Volume). 2008: 205-208.
[21] Galanis D, Lampouras G, Androutsopoulos I. Extractive Multi-Document Summarization with Integer Linear Programming and Support Vector Regression. Proceedings of COLING 2012: Technical Papers. 2012: 911-926.
[22] Dlikman A, Last M. Using Machine Learning Methods and Linguistic Features in Single-Document Extractive Summarization. Proceedings of DMNLP, Workshop at ECML/PKDD. Riva del Garda. 2016: 1-8.
[23] Santosa B. Applied Data Mining using Matlab (in Indonesian: Data Mining Terapan dengan Matlab). Yogyakarta: Graha Ilmu. 2011.
[24] Alkaff M, Khatimi H, Puspita W, Sari Y. Modelling and predicting wetland rice production using support vector regression. TELKOMNIKA Telecommunication Computing Electronics and Control. 2019; 17(6): 819-825.
[25] Harabagiu S, Lacatusu F. Using Topic Themes for Multi-Document Summarization. ACM Transactions on Information Systems. 2010; 28(3): 13.
[26] Lin CY, Hovy E. Manual and Automatic Evaluation of Summaries. Proceedings of the ACL-02 Workshop on Automatic Summarization. 2002; 4: 45-51.