Centroid-based Text Summarization through
Compositionality of Word Embeddings
Gaetano Rossiello, Pierpaolo Basile, Giovanni Semeraro
gaetano.rossiello@uniba.it
Department of Computer Science
University of Bari - Aldo Moro, Italy
MultiLing 2017 Workshop at EACL 2017
Summarization and summary evaluation across source types and genres
3 April 2017 - Valencia, Spain
Introduction
Extractive Text Summarization
The generated summary is a selection of relevant sentences from the
source text in a copy-paste fashion.
A good extractive summarization method must satisfy:
Coverage: the selected sentences should cover a sufficient
amount of topics from the original source text
Diversity: avoid the redundancy of information in the summary
Gaetano Rossiello, Pierpaolo Basile, Giovanni Semeraro Centroid-based Text Summarization using Word Embeddings
Extractive Text Summarization
An extractive text summarization method should define:
Representation model: a paradigm to represent the sentences
Scoring method: a technique for assigning a score to each sentence
Ranking module: a method to properly select the relevant sentences

Bag-of-Words
Several summarization methods proposed in the literature use the
bag-of-words as representation model for the sentence scoring and
selection modules.
Limit of the Bag-of-Words
The textual similarity is a crucial aspect for many extractive text
summarization methods.
For example, consider two semantically related sentences:
S1 Syd leaves Pink Floyd
S2 Barrett abandons the band
abandons band Barrett Floyd leaves Pink Syd the
S1 0 0 0 1 1 1 1 0
S2 1 1 1 0 0 0 0 1
S1 ⊥ S2 =⇒ cosine(S1, S2) = 0
In the BOW model their sparse vector representations are
orthogonal, since the sentences have no words in common.
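The orthogonality above can be checked directly. This is a minimal sketch of the binary bag-of-words model applied to the two example sentences; since they share no words, the dot product, and hence the cosine, is zero:

```python
from math import sqrt

def bow_vector(sentence, vocabulary):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = "Syd leaves Pink Floyd"
s2 = "Barrett abandons the band"
vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))

v1, v2 = bow_vector(s1, vocab), bow_vector(s2, vocab)
print(cosine(v1, v2))  # 0.0 -- orthogonal, despite the semantic relatedness
```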
The Idea
Humans draw on broad background knowledge when writing a summary.
What Representation Learning
How Distributional Semantics
Why Transfer Learning
Neural Language Model - Word2Vec
A word embedding is a continuous vector representation that
captures syntactic and semantic information about a word.
vec(“Gilmour”) = [0.1, 0.3, 0.2, ...]
vec(“Barrett”) = [0.3, 0.1, 0.6, ...]
...
Figure: Continuous bag-of-words and Skip-gram [Mikolov et al., 2013]
vec(Barrett) − vec(singer) + vec(guitarist) ≈ vec(Gilmour)
Centroid-based Text Summarization: Overview
Centroid-based Extractive Text Summarization [Radev et al., 2004]
The centroid represents a pseudo-document which condenses
the meaningful information of a document (tfidf (w) > t)
The main idea is to project both the centroid and each sentence
of a document into the same vector space
The sentences closer to the centroid are selected
Sentence Scoring using Word Embeddings
Word Embeddings Lookup Table
Given a corpus of documents [D1, D2, . . . ] and its vocabulary V
with size N = |V|, we define a matrix E ∈ R^(N×k), the so-called
lookup table, where the i-th row is the word embedding of size k
(k ≪ N) of the i-th word in V. (The model can be trained on the
collection of documents to be summarized or on a larger corpus.)
Given a document D to be summarized:
Preprocessing: split into sentences, remove stopwords, no stemming
Centroid Embedding: C = Σ_{w ∈ D, tfidf(w) > t} E[idx(w)]
Sentence Embedding: Sj = Σ_{w ∈ Sj} E[idx(w)]
Sentence Score: sim(C, Sj) = (C^T · Sj) / (||C|| · ||Sj||)
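The three steps above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the lookup table E, the index map, and the tf-idf values are made-up stand-ins, and the topic threshold t=0.3 is arbitrary.

```python
import numpy as np

def centroid_embedding(doc_words, tfidf, E, idx, t=0.3):
    """Sum the embeddings of words whose tf-idf exceeds the topic threshold t."""
    return sum(E[idx[w]] for w in doc_words if tfidf.get(w, 0.0) > t)

def sentence_embedding(sentence_words, E, idx):
    """Compositional sentence vector: sum of its word embeddings."""
    return sum(E[idx[w]] for w in sentence_words if w in idx)

def score(C, S):
    """Cosine similarity between centroid and sentence embeddings."""
    return float(np.dot(C, S) / (np.linalg.norm(C) * np.linalg.norm(S)))

# Toy example: 3-word vocabulary, 2-dimensional embeddings.
vocab = ["nintendo", "game", "horse"]
idx = {w: i for i, w in enumerate(vocab)}
E = np.array([[1.0, 0.2], [0.9, 0.1], [0.0, 1.0]])
tfidf = {"nintendo": 0.8, "game": 0.6, "horse": 0.1}

C = centroid_embedding(vocab, tfidf, E, idx)        # "horse" filtered out by t
s = sentence_embedding(["nintendo", "game"], E, idx)
print(round(score(C, s), 3))
```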
Sentence Selection
Input: S, Scores, st, limit
Output: Summary
1: S ← sortDesc(S,Scores)
2: k ← 1
3: for i ← 1 to length(S) do
4: length ← length(Summary)
5: if length > limit then return Summary
6: SV ← sumVectors(S[i])
7: include ← True
8: for j ← 1 to k do
9: SV 2 ← sumVectors(Summary[j])
10: sim ← similarity(SV ,SV 2)
11: if sim > st then
12: include ← False
13: if include then
14: Summary[k] ← S[i]
15: k ← k + 1
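The selection loop above can be made runnable as follows. This is a sketch that assumes precomputed sentence vectors and word-count lengths; the names mirror the pseudocode (st = similarity threshold, limit = summary length budget), but the helpers are illustrative, not the released implementation.

```python
import numpy as np

def similarity(u, v):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_sentences(sentences, vectors, scores, st=0.95, limit=100):
    """Greedy selection: take high-scoring sentences, skip near-duplicates."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    summary, summary_vecs, length = [], [], 0
    for i in order:
        if length >= limit:
            break
        # Diversity check: reject a sentence too similar to one already chosen.
        if any(similarity(vectors[i], v) > st for v in summary_vecs):
            continue
        summary.append(sentences[i])
        summary_vecs.append(vectors[i])
        length += len(sentences[i].split())
    return summary
```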
Text Summarization using Word Embeddings: Example
Figure: Embeddings visualization using t-SNE [van der Maaten et al., 2008]
Text Summarization using Word Embeddings: Example
arcade  | donkey | kong   | game        | nintendo | coleco        | centroid embedding
arcades | goat   | hong   | gameplay    | mario    | intellivision | nes
pac-man | pig    | macao  | multiplayer | wii      | atari         | gamecube
console | monkey | fung   | videogame   | console  | nes           | konami
famicom | horse  | taiwan | rpg         | nes      | msx           | wii
sega    | cow    | wong   | gamespot    | gamecube | 3do           | famicom
Table: Centroid words of the Donkey Kong (video game) article having the
tf-idf values greater than a topic threshold
Sent ID | Sentence | Score
136 | The original arcade version of the game appears in the Nintendo 64 game Donkey Kong 64. | 0.9533
131 | The game was ported to Nintendo’s Family Computer (Famicom) console in 1983 as one of the system’s three launch titles; the same version was a launch title for the Famicom’s North American version, the Nintendo Entertainment System (NES). | 0.9375
186 | In 2004, Nintendo released Mario vs. Donkey Kong, a sequel to the Game Boy title. | 0.9366
192 | In 2007, Donkey Kong Barrel Blast was released for the Nintendo Wii. | 0.9362
135 | The NES version was re-released as an unlockable game in Animal Crossing for the GameCube and as an item for purchase on the Wii’s Virtual Console. | 0.9308
Table: The most relevant sentences
Experiments
Research Question
Can word embeddings improve the effectiveness of the centroid-based
text summarization method?
Python implementation on GitHub:
github.com/gaetangate/text-summarizer/
DUC-2004 Multi-document Summarization task 2
Tuning Grid search on DUC-2003 dataset
Word2Vec CBOW and Skip-gram trained on DUC-03/04;
pre-trained model on Google News
MSS 2015 Multilingual Single Document task 2015
Tuning Grid search on MSS 2015 training set
Word2Vec Skip-gram on Wikipedia (en, it, de, es, fr)
MSS 2017 Multilingual Single Document task 2017
Tuning Grid search on MSS 2017 training set
Word2Vec Skip-gram on Wikipedia (en, it, de, es, fr)
DUC-2004 task 2: Multi-document Summarization
ROUGE-1 ROUGE-2 tt st size
LEAD 32.42 6.42
SumBasic 37.27 8.58
Peer65 38.22 9.18
NMF 31.60 6.31
LexRank 37.58 8.78
RNN 38.78 9.86
C BOW 37.76 8.08 0.1 0.6
C GNEWS 37.91 8.45 0.2 0.9 300
C CBOW 38.68 8.93 0.3 0.93 200
C SKIP 38.81 9.97 0.3 0.94 400
Table: ROUGE scores (%) on DUC-2004 dataset. tt and st are the topic
and similarity thresholds respectively. size is the dimension of embeddings
DUC-2004 task 2: Multi-document Summarization
Although the different methods achieve similar ROUGE scores, they
do not necessarily generate similar summaries.
GNEWS CBOW SKIP BOW
GNEWS 1 0.109 0.171 0.075
CBOW 1 0.460 0.072
SKIP 1 0.105
BOW 1
Table: Sentence overlap using the Jaccard coefficient
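The overlap measure in the table above is the Jaccard coefficient between the sets of sentences selected by two systems. A minimal sketch, using hypothetical sentence IDs rather than the actual DUC-2004 selections:

```python
def jaccard(summary_a, summary_b):
    """Jaccard coefficient: |intersection| / |union| of two sentence sets."""
    a, b = set(summary_a), set(summary_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy example: two systems agree on 2 of the 6 distinct sentences selected.
print(jaccard({1, 2, 3, 4}, {3, 4, 5, 6}))  # 0.333...
```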
MSS 2015: Multilingual Single Document Summarization
English Italian German Spanish French
R1 R2 R1 R2 R1 R2 R1 R2 R1 R2
LEAD 44.33 11.68 30.46 4.38 29.13 3.21 43.02 9.17 42.73 8.07
WORST 37.17 9.93 39.68 10.01 33.02 4.88 45.20 13.04 46.68 12.96
BEST 50.38 15.10 43.87 12.50 40.58 8.80 53.23 17.86 51.39 15.38
C BOW 49.06 13.43 33.44 4.82 35.28 4.93 48.38 12.88 46.13 10.45
C W2V 50.43‡ 13.34† 35.12 6.81 35.38† 5.39† 49.25† 12.99 47.82† 12.15
ORACLE 61.91 22.42 53.31 17.51 54.34 13.32 62.55 22.36 58.68 17.18
Table: ROUGE-1, -2 scores (%) on MultiLing MSS 2015 dataset for five
different languages
Improvements for MSS 2017
SWAP system in MSS 2017 task:
Always retain the first sentence in the summary
Subtract from each word embedding the centroid vector of the
whole embedding space
Combination of four scores:
sc1 word2vec centroid similarity
sc2 bag-of-words centroid similarity
sc3 normalized sentence length
sc4 normalized sentence position
score(Sj ) = λ1 ∗ sc1 + λ2 ∗ sc2 + λ3 ∗ sc3 + λ4 ∗ sc4
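The combined score is a weighted linear combination of the four components. A sketch, assuming each component score is already normalized to [0, 1]; the lambda weights below are illustrative placeholders, not the values tuned by the authors:

```python
def combined_score(sc1, sc2, sc3, sc4, lambdas=(0.4, 0.3, 0.15, 0.15)):
    """score(Sj) = λ1·sc1 + λ2·sc2 + λ3·sc3 + λ4·sc4
    sc1: word2vec centroid similarity, sc2: BOW centroid similarity,
    sc3: normalized sentence length, sc4: normalized sentence position."""
    l1, l2, l3, l4 = lambdas
    return l1 * sc1 + l2 * sc2 + l3 * sc3 + l4 * sc4
```

With weights summing to 1, a sentence that maximizes every component gets the maximum combined score of 1.0.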
MSS 2017: Multilingual Single Document Summarization
English Italian German Spanish French
R1 MM R1 MM R1 MM R1 MM R1 MM
CIST 45.06 16.83 30.07 20.22 32.32 16.52 45.31 16.94 41.67 17.66
TeamMD 43.08 16.35 30.22 20.79 32.91 15.76 44.95 16.54 42.81 17.18
SWAP 45.62 17.05 32.66 18.45 35.15 18.27 46.67 20.58 43.68 20.06
ORACLE 55.52 20.98 41.25 26.89 41.58 23.75 52.20 23.16 51.41 25.44
Table: ROUGE-1, MeMog scores (%) on MultiLing MSS 2017 dataset for
five different languages
Future Work
Try word embeddings with other summarization methods
Sentence embeddings using Deep Learning
Doc2Vec
Autoencoders
Convolutional Neural Networks
Recurrent Neural Networks (LSTM, GRU)
Attention and Memory Networks
Joint Learning of Distributional and Relational Semantics
WordNet
DBpedia
Wikidata
...
Infuse prior knowledge in sentence embeddings
Compositionality of Concepts
Transfer Learning
Moral of the Story
Math > Magic