EXTRACTIVE TEXT
SUMMARIZATION AND
TOPIC MODELING
OVER REDDIT POSTS
University of Milano-Bicocca
Master's Degree in Data Science
Text Mining and Search
Academic Year 2022-2023
Authors:
Giorgio CARBONE, student ID 811974
Marco SCATASSI, student ID 883823
Gianluca SCURI, student ID 886725
reddit: the front page of the internet
❑ Social news aggregation & discussion website
❑ 100k+ communities & 430M+ posts in 2022
❑ Redditors post, comment and rate content
❑ Subreddit: domain-specific community
❑ TL;DR = "Too Long; Didn't Read" → a summary for lengthy posts
❑ The fraction of posts containing TL;DR is decreasing
Project objectives
1. To perform Extractive Text Summarization on
reddit posts
• to obtain very short summaries
resembling TL;DRs
2. To perform Topic Modeling on reddit posts
• to extract the main topics in a collection
of posts
POST:
We are students and we go to college together, we have three
lessons a week together. At school he normally sits at the
front and I sit at the back, but recently the person I sit next to
has been struggling with mental health and hasn't been in, so
I moved and sit next to her most classes. For a while now we
've texted each other a few times, but outside of that, we
don't really hang out at all. I see a lot of theatre, and about a
week ago she said she wanted to come see a show with me.
When we find our seats, mine has a pole in the way so I can't
see a section of the stage unless I lean away from her. About
half an hour in, this girl leans on my shoulder and starts
hugging my arm, while still leaning on my shoulder. She was
kind of cuddling all day, we went to an arcade earlier as well.
She doesn't seem like the cuddling type of friend, and I'm
very worried she has a crush on me. I don't want to ruin our
friendship, I don't like her back. Should I just ignore it until
she asks me? What if she thinks that was a date?
TL;DR:
I took my friend to see a show, she leant on my shoulder the
whole time. I 'm not into her but I think she has a crush on
me?
TOPICS:
• Topic 7: school, class, college, student, ...
• Topic 1: relationship, friend, girl, dating, ...
• ...
DATASET &
DATA EXPLORATION
Dataset: TLDRHQ
❑ Released in 2021
❑ Posts published in 2005-2021
❑ 1,671,099 reddit posts and their TL;DRs
• Training set → 1,590,132 instances
• Validation set → 40,486 instances
• Test set → 40,481 instances
❑ Attributes
• id → ID of the post
• document → Text of the user’s post
• summary → Text of the user-written TL;DR
• ext_labels → Extractive labels of the post's sentences
• rg_labels → Rouge scores of the post’s sentences
id train-TLDR_RS_2012-02-4890.json
document i 'm looking for a new pair of headphones that i
will carry around with me when i travel
.</s><s> i do n't want to spend more than $ 50
.</s><s> i do n't like earbuds because they do
n't stay in very well .</s><s> i wear glasses so
the headphones ca n't be too tight . </s><s> i
'm not an audiophile but i do appreciate quality
.</s><s> i prefer over-ear style . </s><s> i 've
tried [ skullcandy ] (
https://www.amazon.com/stores/page/E0223B)
, sony , and some other weird brand a while a
back and so far the sony 's have the least
amount of pressure but also the least amount of
volume . </s><s> i ca n't turn them up because
they do n't cover the ear and i 'm not that guy
who walks around and forces people to listen to
distorted music from headphones .
summary want new headphones - prefer over-ear . i wear
glasses so ca n't be too tight . around $ 50 .
thanks !
ext_labels [0, 0, 0, 1, 0, 1, 0, 0]
rg_labels [0.10165, 0.11729, 0.07898, 0.36880, 0.03765,
0.15032, 0.04066, 0.10461]
Data Exploration
❑ Most of the posts were published after 2013
❑ 53.8% submissions / 46.2% comments
❑ No missing values
❑ document and summary → 38K and 67K duplicates, respectively
• announcements, bot messages, spam
❑ compression rate = document word count (avg) / summary word count (avg) = 12.1
• TLDRs heavily shorten the post’s text
                               document    summary
words count (tot)              ~468M       ~38M
words count (avg)              291         24
sentences count (tot)          ~24M        ~3.8M
sentences count (avg)          15          2
words count / sentence (avg)   20          11
unique words                   ~738K       ~254K
compression rate               12.1
Data & Text Pre-processing
1. Data Cleaning → duplicate removal
2. Sentences Splitting
3. Text Cleaning
4. Text Normalization → Words and punctuation
5. Tokenization → unigrams
6. Stop-words and 1-character words removal
7. Lemmatization
8. POS Tagging
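A compact sketch of steps 2-8. The slides do not name a library, so the NLTK calls and the regex-based cleaning below are assumptions rather than the exact implementation.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time NLTK resource downloads.
nltk.download("punkt"); nltk.download("stopwords")
nltk.download("wordnet"); nltk.download("averaged_perceptron_tagger")

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(document: str):
    sentences = nltk.sent_tokenize(document)               # 2. sentence splitting
    processed = []
    for sent in sentences:
        sent = re.sub(r"http\S+", " ", sent.lower())       # 3-4. cleaning + normalization
        sent = re.sub(r"[^a-z\s]", " ", sent)
        tokens = nltk.word_tokenize(sent)                  # 5. unigram tokenization
        tokens = [t for t in tokens
                  if t not in STOP and len(t) > 1]         # 6. stop-words / 1-char words
        lemmas = [lemmatizer.lemmatize(t) for t in tokens] # 7. lemmatization
        processed.append(nltk.pos_tag(lemmas))             # 8. POS tagging
    return processed
```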
                               Text Cleaning &   + Lemmatization &
                               Tokenization      Stop-Words Removal
words count (tot)              ~468M             ~213M
words count (avg)              291               133
words count / sentence (avg)   20                9
unique words                   ~738K             ~715K
TEXT
SUMMARIZATION
Text Summarization
❑ NLP task aimed at identifying and extracting the most
important information within a text
❑ Many ways to undertake this task
❑ Characteristics of our approach:
❑ single-document
❑ generic
❑ extractive
❑ Moreover, extreme summarization: very short, TL;DR-like target summaries
Features Matrix
❑ 8 features (a sketch of their computation follows this list):
1. Sentence relative position
2. Word count in sentence (relative)
3. NOUN tag ratio
4. VERB tag ratio
5. ADJ tag ratio
6. ADV tag ratio
7. TS-ISF
8. Sentence similarity score
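A sketch of how one row of the features matrix can be assembled from the POS-tagged sentences produced by a pipeline like preprocess above. The slides do not spell out the TF-ISF and similarity formulas, so the textbook TF-ISF definition and a Jaccard-overlap similarity are used here as stand-ins.

```python
import math
import numpy as np
from collections import Counter

def features_matrix(tagged_sentences):
    """One 8-feature row per (word, POS-tag) sentence of a post."""
    n = len(tagged_sentences)
    token_sets = [{w for w, _ in s} for s in tagged_sentences]
    max_len = max(len(s) for s in tagged_sentences)

    def isf(term):  # inverse sentence frequency
        sf = sum(term in ts for ts in token_sets)
        return math.log(n / sf) if sf else 0.0

    rows = []
    for i, sent in enumerate(tagged_sentences):
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        k = max(len(sent), 1)
        tf = Counter(words)
        ts_isf = sum(tf[w] / k * isf(w) for w in set(words))
        # Average Jaccard overlap with the other sentences of the post.
        overlap = (np.mean([len(set(words) & other) / max(len(set(words) | other), 1)
                            for j, other in enumerate(token_sets) if j != i])
                   if n > 1 else 0.0)
        rows.append([
            (i + 1) / n,                                # 1. relative position
            k / max_len,                                # 2. relative word count
            sum(t.startswith("NN") for t in tags) / k,  # 3. NOUN ratio
            sum(t.startswith("VB") for t in tags) / k,  # 4. VERB ratio
            sum(t.startswith("JJ") for t in tags) / k,  # 5. ADJ ratio
            sum(t.startswith("RB") for t in tags) / k,  # 6. ADV ratio
            ts_isf,                                     # 7. TF-ISF
            overlap,                                    # 8. similarity score
        ])
    return np.array(rows)
```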
Model Selection
❑ 3-Fold Cross Validation
❑ Two models:
❑ Random Forest
❑ Hist Gradient Boost
❑ Final Model
❑ Hist Gradient Boost:
❑ 500 max iterations
❑ 255 bins
❑ 0.05 as learning rate
❑ {0 : 1, 1 : 3} as class weights
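The selected configuration maps directly onto scikit-learn's HistGradientBoostingClassifier (class_weight requires scikit-learn >= 1.2). X_train and y_train are assumed to be the per-sentence feature rows and the 0/1 extractive labels from the steps above.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=500,               # 500 max iterations
    max_bins=255,               # 255 bins
    learning_rate=0.05,
    class_weight={0: 1, 1: 3},  # up-weight the rare "selected" class
)
model.fit(X_train, y_train)

# Per-sentence selection scores used later to build the summaries.
sentence_scores = model.predict_proba(X_test)[:, 1]
```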
Selected Model

Test-set performance of the selected configuration (no. 6):

       F1              Recall          Precision
       Train   Test    Train   Test    Train   Test
6      0.59    0.24    0.73    0.72    0.50    0.16

Cross-validation results for the ten candidate configurations:

       F1              Recall          Precision
       Train   Val     Train   Val     Train   Val
1      0.70    0.59    0.91    0.76    0.57    0.48
2      0.83    0.60    0.88    0.61    0.78    0.60
3      0.61    0.59    0.74    0.72    0.51    0.50
4      0.57    0.56    0.86    0.84    0.43    0.42
5      0.73    0.60    0.81    0.66    0.66    0.56
6      0.59    0.59    0.73    0.73    0.50    0.49
7      0.61    0.59    0.74    0.72    0.51    0.50
8      0.56    0.56    0.85    0.85    0.41    0.41
9      0.73    0.60    0.81    0.66    0.66    0.55
10     0.78    0.60    0.94    0.69    0.67    0.52
Summary Selection
❑ Three summary lengths
❑ One-sentence
❑ extreme summarization
❑ Two-sentences
❑ Sqrt-sentences → as many sentences as the square root of the post's sentence count
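A minimal sketch of the three selection policies, assuming scores holds the model's per-sentence positive-class probabilities: rank the sentences, keep the top 1, top 2, or top sqrt(n), and re-order the kept ones by their position in the post.

```python
import math

def select_summary(sentences, scores, policy="one"):
    """Build a one-, two-, or sqrt-sentence extractive summary."""
    n = len(sentences)
    k = {"one": 1, "two": 2, "sqrt": max(1, round(math.sqrt(n)))}[policy]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))  # restore post order
```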
Summary Evaluation
❑ Performance comparable to BertSumExt, despite our model's far lower complexity
❑ Selecting more sentences yields a slight increase in RG scores
Model                                 RG-1 (%)   RG-2 (%)   RG-L (%)
BertSumExt                            28.40      11.35      21.38
One-sentence Hist Gradient Boost      21.99       7.06      17.15
Two-sentences Hist Gradient Boost     24.34       7.83      17.44
Sqrt-sentences Hist Gradient Boost    24.83       7.99      17.83
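One way to reproduce RG scores like those above; the slides do not name a ROUGE implementation, so Google's rouge-score package is an assumption.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

def evaluate(predictions, references):
    """Average ROUGE F1 scores (in %) over prediction/TL;DR pairs."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for pred, ref in zip(predictions, references):
        result = scorer.score(ref, pred)  # score(target, prediction)
        for key in totals:
            totals[key] += result[key].fmeasure
    return {k: 100 * v / len(references) for k, v in totals.items()}
```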
Example 1
❑ An example in which the model failed to capture the
crucial information in any of the three summaries
Example 2
❑ The one-sentence summary is not representative of the TL;DR.
❑ The two-sentence summary captures the most relevant
information in the post
Example 3
❑ A good one-sentence summary.
❑ The Two- and Sqrt-sentence summaries add irrelevant
information
❑ The extractive reference summary is wrong.
❑ Examples like this one negatively affected the measured
performance of our model
Example 4
❑ The one-sentence summary captures quite well all
the information in the TL;DR
❑ it is identical to the reference extractive summary
❑ The Two- and Sqrt-sentence summaries add irrelevant
information
TOPIC MODELING
Topic Modeling
❑ Groups words into clusters called topics
❑ Results:
❑ list of topics
❑ probability of words for each topic
❑ distribution of topics for each document
❑ Latent Semantic Analysis (LSA) and Latent Dirichlet
Allocation (LDA)
Exploration and pre-processing
❑ Documents must contain only words with high
semantic value
❑ Actions performed:
❑ Verb and noun filtering based on POS tagging
(90,444 → 75,879 terms)
❑ Common (>50% of docs) and rare (<5 docs)
words removal
(75,879 → 13,554 terms)
❑ Custom stop-word removal: ['time', 'something', 'going',
'year', 'week', 'month', 'day', 'get', 'got', …]
(13,554 → 13,536 terms)
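In Gensim this filtering chain can be expressed as below, assuming docs is the list of lemmatized noun/verb token lists produced by the POS filter; the stop-word list is the truncated one from the slide.

```python
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
# Common (>50% of docs) and rare (<5 docs) words removal.
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Custom stop-words (truncated list from the slide).
custom = ["time", "something", "going", "year", "week",
          "month", "day", "get", "got"]
dictionary.filter_tokens(bad_ids=[dictionary.token2id[w]
                                  for w in custom if w in dictionary.token2id])
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
```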
LDA: introduction
❑ Two key assumptions:
❑ documents are a mixture of topics
❑ topics are a mixture of words
❑ Both mixtures are assumed to follow Dirichlet distributions
❑ Hyper-parameters: α, β and k
[Figure: Dirichlet densities for p = 1, p > 1 and p < 1]
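A quick numpy experiment illustrates the role of the concentration parameter shown in the figure: values below 1 concentrate mass on a few components (sparse topic mixtures), 1 is uniform, and values above 1 spread mass evenly.

```python
import numpy as np

# Draws from a symmetric 5-dimensional Dirichlet for three concentration
# values; p < 1 yields sparse mixtures, p > 1 dense, near-uniform ones.
rng = np.random.default_rng(0)
for p in (0.1, 1.0, 10.0):
    print(f"p = {p:>4}:", rng.dirichlet([p] * 5).round(2))
```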
LDA: hyper-parameters tuning
❑ Steps:
❑ BOW representation
❑ Gensim's LdaMulticore()
❑ CoherenceModel()
❑ Grid search maximizing coherence
        α       β      k
UMass   >1      >1     20
C_V     0.01    0.2    {20, 50}

[Figure: coherence plots over the hyper-parameter grid, UMass and C_V]
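A sketch of the grid search with Gensim; the grids below are illustrative, not the exact ranges we searched. bow_corpus, dictionary and docs come from the pre-processing step above (coherence="u_mass" would use the BOW corpus instead of the texts).

```python
from gensim.models import LdaMulticore, CoherenceModel

best = (None, -float("inf"))
for k in (10, 20, 50):
    for alpha in (0.01, 0.1, 1):
        for beta in (0.01, 0.2, 1):
            lda = LdaMulticore(bow_corpus, id2word=dictionary, num_topics=k,
                               alpha=alpha, eta=beta, passes=5, random_state=0)
            cv = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                                coherence="c_v").get_coherence()
            if cv > best[1]:
                best = ((k, alpha, beta), cv)
```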
LDA: results
❑ The topics obtained are not well separated, and many
of them share overlapping terms
❑ Attempts at improvement:
❑ Grid search over expanded hyper-parameter ranges
❑ Different coherence measures
❑ Additional pre-processing operations
❑ Exclusion of short documents
Doc Topic_0 Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6 Topic_7 Topic_8
0 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008 0.0008
1 0.4687 0.0000 0.0993 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0005 0.0005 0.0005 0.0005 0.7440 0.0005 0.0005 0.2300 0.0005
3 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
4 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004
5 0.0001 0.0001 0.0001 0.0001 0.0001 0.1694 0.0001 0.0001 0.0001
6 0.0001 0.0001 0.0001 0.0001 0.0001 0.9930 0.0001 0.0001 0.0001
7 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
8 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003
9 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004
Top10Words Topic_0 Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6 Topic_7 Topic_8
0 people make think work think feel way make make
1 make think feel way lot way think work see
2 think feel people need way think make think think
3 way lot make people make let people people way
4 work friend anything think people make see point lot
5 need point need take feel anything friend friend friend
6 take work point point see lot everything way made
7 see way see make friend friend getting life people
8 friend life way getting anything see lot anything take
9 feel see take help need people anything help let
Topic Top10Word Score
0 people 0.0065
0 make 0.0065
0 think 0.0059
0 way 0.0051
0 work 0.0049
0 need 0.0045
0 take 0.0044
0 see 0.0043
0 friend 0.0042
0 feel 0.0041
LSA: introduction
❑ Similar documents contain approximately the same
distribution of word frequencies for certain words
❑ Truncated SVD of the Doc-Term matrix, yielding
Doc-Topic and Topic-Term matrices (A = UΣVᵀ)
❑ Hyper-parameter: K = 20
❑ Steps:
❑ TF-IDF representation
❑ randomized_svd() from scikit-learn
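A sketch of the whole LSA pipeline under the same assumptions about docs: build the TF-IDF Doc-Term matrix, truncate the SVD at K = 20, and read the top words of each topic off the Topic-Term matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.extmath import randomized_svd

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(" ".join(doc) for doc in docs)  # Doc-Term matrix
U, Sigma, VT = randomized_svd(A, n_components=20, random_state=0)

doc_topic = U * Sigma                       # document-topic scores
terms = vectorizer.get_feature_names_out()
for t in range(3):                          # top-10 words of the first topics
    top = np.argsort(-np.abs(VT[t]))[:10]
    print(f"Topic_{t}:", [terms[i] for i in top])
```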
LSA: results
❑ These results are much more meaningful than those of
the LDA analysis
❑ Topics are now well defined and clearly separated
Doc Topic_0 Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6 Topic_7 Topic_8
0 0.0003 -0.0008 0.0001 0.0001 -0.0011 0.0002 0.0006 0.0004 0.0000
1 0.0104 0.0074 0.0009 0.0006 -0.0025 0.0045 -0.0075 -0.0049 -0.0048
2 0.0031 -0.0006 -0.0025 -0.0048 0.0009 0.0047 0.0034 0.0121 -0.0060
3 0.0062 -0.0034 -0.0095 -0.0028 -0.0081 0.0021 0.0006 0.0022 0.0059
4 0.0031 -0.0003 0.0012 -0.0044 -0.0003 0.0030 -0.0055 -0.0012 0.0003
5 0.0064 -0.0032 0.0026 -0.0011 0.0006 0.0026 -0.0027 0.0021 0.0018
6 0.0041 -0.0073 -0.0001 0.0087 0.0014 0.0036 0.0028 -0.0044 0.0009
7 0.0042 -0.0039 0.0073 -0.0054 0.0091 -0.0140 0.0054 -0.0137 -0.0015
8 0.0034 0.0002 0.0019 -0.0041 0.0038 0.0085 -0.0018 0.0001 0.0051
9 0.0017 -0.0010 0.0016 -0.0003 -0.0052 -0.0012 -0.0003 -0.0006 -0.0041
Top10Word Topic_0 Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6 Topic_7 Topic_8
0 feel relationship went game game game school school guy
1 friend friend home play hundred family people class girl
2 make girl house playing money parent class college people
3 think dating car player play life friend student hundred
4 people told room played work play college experience money
5 see boyfriend came friend buy playing game started car
6 way feeling minute guy pay mom student looking make
7 lot met took team hour kid everyone girl pay
8 relationship talk door girl playing love girl grade buy
9 work asked hour fun friend dad kid help date
Topic Top10Word Score
0 feel 0.12
0 friend 0.12
0 make 0.12
0 think 0.12
0 people 0.11
0 see 0.11
0 way 0.11
0 lot 0.1
0 relationship 0.1
0 work 0.1
LSA: example
❑ Topic 7: school
❑ Topic 3: games
❑ Topic 13: advice
Top10Words Topic_3 Topic_7 Topic_13
0 game school post
1 play class advice
2 playing college question
3 player student help
4 played experience anyone
5 friend started twenty
6 guy looking hundred
7 team girl looking
8 girl grade team
9 fun help thirty
Document Top1 Topic
i can not for the life of me find a school for wing chun , and i am very eager
to learn . i know that bad habits can come from learning online but i am
getting restless . so if any one would like to help a ( hopefully ) soon to be
chunner out , find a school near pleasanton , ca . lineage is n't a big concern
of mine right now
Topic_7
i have noticed lately that while i may not be interested in a game for its
gameplay i want to dig into the story and lore as much as possible . as an
example i cant stand the puzzle platforming or multiplayer aspects of
splatoon but the lore and characters are super interesting to me ... i want to
enjoy those parts of it but have no desire to actually play the game .
Topic_3
since im using a ipad pro with a apple pencil for my work and my studies i
had this cool idea of sketching little things and put them later on as
wallpaper on my desktop but i ve tested some free apps like adobe sketch
but sadly in all these apps , the screen resolution / format did nt fit my
dektops one . so my sketch gets either cut off or i have like two big black
bars on my desktop . so maybe can anyone of you advice me a sketching /
drawing app where i can change the resolution or where the format fits my
desktop . im using a 1080p monitor btw . i appreciate every suggestion :d
Topic_13
/ Bibliography
1. Sajad Sotudeh et al. "TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts". In: (Nov. 2021), pp. 142–151. DOI: 10.18653/v1/2021.newsum-1.15. URL: https://aclanthology.org/2021.newsum-1.15.
2. Reddit - Dive into anything. URL: https://www.reddit.com.
3. Sahil Patel. Reddit Claims 52 Million Daily Users, Revealing a Key Figure for Social-Media Platforms. Ed. by The Wall Street Journal. URL: https://www.wsj.com/articles/reddit-claims-52-million-daily-users-revealing-a-key-figure-for-social-media-platforms-11606822200. (posted: Dec. 1, 2020).
4. Reddit Staff Announcements. Revealing This Year's (2022) Reddit Recap. Ed. by upvoted: The Official Reddit Blog. URL: https://www.redditinc.com/blog/reddit-recap-2022-global. (posted: Dec. 8, 2022).
5. ir@Georgetown - Home, ed. The Georgetown University Information Retrieval Lab. URL: https://ir.cs.georgetown.edu/. (accessed: January 16, 2023).
6. Sajastu. sajastu/reddit_collector: Reddit Collector and Text Processor. URL: https://github.com/sajastu/reddit_collector.
7. Vinicius Camargo da Silva, João Paulo Papa, and Kelton Augusto Pontara da Costa. Extractive Text Summarization Using Generalized Additive Models with Interactions for Sentence Selection. 2022. arXiv: 2212.10707 [cs.CL].
8. Kam-Fai Wong, Mingli Wu, and Wenjie Li. "Extractive Summarization Using Supervised and Semi-Supervised Learning". In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK: Coling 2008 Organizing Committee, Aug. 2008, pp. 985–992. URL: https://aclanthology.org/C08-1124.
9. Alexander Dlikman and Mark Last. "Using Machine Learning Methods and Linguistic Features in Single-Document Extractive Summarization". In: DMNLP@PKDD/ECML. 2016.
10. Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning". In: Journal of Machine Learning Research 18.17 (2017), pp. 1–5. URL: http://jmlr.org/papers/v18/16-365.
11. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation". In: Journal of Machine Learning Research 3 (Jan. 2003), pp. 993–1022.
12. Scott C. Deerwester et al. Computer information retrieval using latent semantic structure. US Patent 4,839,853. June 1989.