3/22/24, 9:19 AM 6 Tips for Interpretable Topic Models | by Nicha Ruchirawat | Towards Data Science
https://towardsdatascience.com/6-tips-to-optimize-an-nlp-topic-model-for-interpretability-20742f3047e2 1/17
6 Tips for Interpretable Topic Models
Hands-on Python tutorial on tuning LDA topic models for easy-to-understand
outputs.
Nicha Ruchirawat
Published in Towards Data Science
9 min read · Nov 1, 2020
With so much text produced on digital platforms, the ability to automatically
identify key topic trends can reveal valuable insight. For example, businesses
can benefit from understanding customer conversation trends around their brand
and products. A common method for picking up key topics is Latent Dirichlet Allocation
(LDA). However, its outputs are often difficult to interpret. We will
explore techniques to enhance interpretability.
What is Latent Dirichlet Allocation (LDA)?
Latent Dirichlet Allocation (LDA) is a generative statistical model that picks up
similarities across a collection of data parts. In topic modeling, each data
part is a document (e.g. a single review on a product page) and the collection
of documents is a corpus (e.g. all users’ reviews for a product page). Sets of
words that repeatedly occur together likely indicate topics.
LDA assumes that each document is a distribution over a fixed
number of topics, and that each topic is a distribution over words.
The algorithm’s key high-level steps to approximate these distributions:
1. The user selects K, the number of topics present, tuned to fit each dataset.
2. Go through each document and randomly assign each word to one of the K topics.
From this, we have a starting point for calculating the document’s distribution over
topics, p(topic t|document d): the proportion of words in document d that are
assigned to topic t. We can also calculate each topic’s distribution over words, p(word
w|topic t): across all documents, the proportion of words assigned to topic t that are
word w. These will be poor approximations because the assignments are random.
3. To improve the approximations, we iterate through each document. For each
document, go through each word and reassign it to a topic, choosing
topic t with probability proportional to p(topic t|document d) ∗ p(word w|topic t), based on the last
round’s distributions. This is essentially the probability that topic t generated
word w. Recalculate p(topic t|document d) and p(word w|topic t) from these new
assignments.
4. Keep iterating until the topic/word assignments reach a steady state and no longer
change much (i.e. converge). Use the final assignments to estimate the topic mixture
of each document (% of words assigned to each topic within that document) and
the words associated with each topic (% of times each word is assigned to that topic
overall).
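The four steps above can be sketched as a toy sampler in plain Python. This is only an illustration under stated assumptions: the function name toy_lda, the smoothing constants alpha and beta, the fixed number of sweeps (instead of a convergence check), and the tiny corpus are all made up for this sketch. It also removes a word's current assignment before resampling it, a common refinement that the steps above do not spell out. It is not how gensim trains its model.

```python
import random

def toy_lda(docs, num_topics, num_sweeps=200, alpha=0.1, beta=0.01, seed=0):
    """Toy sampler for tokenized docs (lists of strings). Illustration only."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Step 2: randomly assign every word occurrence to one of the K topics.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                # counts behind p(topic | document)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # counts behind p(word | topic)
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assign[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    # Steps 3-4: sweep over every word and resample its topic.
    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]
                # Take the word's current assignment out of the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Weight of topic t is proportional to p(t | d) * p(w | t), with smoothing.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + beta * len(vocab))
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    # Final estimates: topic mixture per document and word distribution per topic.
    theta = [[(c + alpha) / (len(doc) + alpha * num_topics) for c in doc_topic[d]]
             for d, doc in enumerate(docs)]
    phi = [{w: (topic_word[t][w] + beta) / (topic_total[t] + beta * len(vocab)) for w in vocab}
           for t in range(num_topics)]
    return theta, phi

# Made-up mini corpus: two printer-like reviews and two stationery-like reviews.
docs = [
    ["printer", "ink", "paper", "printer"],
    ["ink", "printer", "cartridge"],
    ["pen", "notebook", "pen", "paper"],
    ["notebook", "pen", "pencil"],
]
theta, phi = toy_lda(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of theta is one document's topic mixture, and each entry of phi is one topic's distribution over the vocabulary.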
Data Preparation
We will explore techniques to optimize interpretability using LDA on Amazon Office
Product reviews. To prepare the reviews data, we clean the reviews text with typical
text cleaning steps:
1. Remove non-ascii characters, such as À µ ∅ ©
2. ‘Lemmatize’ words, which transforms each word to its base form, such as
‘running’ and ‘ran’ to ‘run’, so that they are recognized as the same word
3. Remove punctuation
4. Remove non-English comments if present
All code in the tutorial can be found here, where the functions for cleaning are
located in clean_text.py. The main notebook for the whole process is
topic_model.ipynb.
Steps to Optimize Interpretability
Tip #1: Identify phrases through n-grams and filter noun-type structures
We want to identify phrases so the topic model can recognize them. Bigrams are
phrases containing 2 words e.g. ‘social media’. Likewise, trigrams are phrases
containing 3 words e.g. ‘Procter and Gamble’. There are many ways to detect
n-grams, explained here. In this example, we will use the Pointwise Mutual Information
(PMI) score. This measures how much more likely the words are to co-occur than if they
were independent. The metric is sensitive to rare combinations of words, so it is used
with an occurrence frequency filter to ensure phrase relevance. Bigram example
below (trigram code included in the Jupyter Notebook):
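# Example for detecting bigrams
import nltk
import pandas as pd

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_documents(
    [comment.split() for comment in clean_reviews.reviewText])
# Filter only those that occur at least 50 times
finder.apply_freq_filter(50)
bigram_scores = finder.score_ngrams(bigram_measures.pmi)
# Assumed step (not shown in the article): put the (bigram, score) pairs into the
# bigram_pmi DataFrame that the filtering code further down expects
bigram_pmi = pd.DataFrame(bigram_scores, columns=['bigram', 'pmi'])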
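To make the PMI score and the role of the frequency filter more concrete, here is a minimal hand-rolled sketch using only the standard library. The function name pmi_scores and the tiny corpus are made up for illustration; the formula is the usual log2 of the observed bigram probability over the product of the two word probabilities, which should be close to what NLTK computes but is not NLTK's code.

```python
import math
from collections import Counter

def pmi_scores(docs, min_count=1):
    """PMI for adjacent word pairs in tokenized docs (illustrative helper)."""
    word_counts = Counter()
    bigram_counts = Counter()
    for tokens in docs:
        word_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))  # pairs never cross documents
    total = sum(word_counts.values())
    scores = {}
    for (w1, w2), n in bigram_counts.items():
        if n < min_count:  # occurrence frequency filter
            continue
        # log2( P(w1, w2) / (P(w1) * P(w2)) ), with counts normalized by total words
        scores[(w1, w2)] = math.log2(n * total / (word_counts[w1] * word_counts[w2]))
    return scores

# Made-up corpus: "sticky note" is a real phrase, "zebra quill" is a one-off pair.
docs = [
    ["sticky", "note", "pack"],
    ["sticky", "note", "pad"],
    ["sticky", "note", "pack"],
    ["zebra", "quill"],
]
all_scores = pmi_scores(docs)
filtered_scores = pmi_scores(docs, min_count=2)
print(max(all_scores, key=all_scores.get))  # the one-off pair gets the top score
print(sorted(filtered_scores))              # the one-off pair is gone after filtering
```

Without the filter, the pair that occurs only once scores highest simply because both of its words are rare, which is the sensitivity the frequency filter guards against.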
Additionally, we keep only bigrams and trigrams with noun-type structures. This helps the LDA
model cluster topics better, as nouns are better indicators of the topic being talked
about. We use the NLTK package to tag parts of speech and filter for these structures.
# Example filter for noun-type structures bigrams
# stop_word_list is assumed to be defined beforehand (not shown in this article)
def bigram_filter(bigram):
    tag = nltk.pos_tag(bigram)
    # Keep only (adjective or noun, noun) pairs
    if tag[0][1] not in ['JJ', 'NN'] or tag[1][1] not in ['NN']:
        return False
    if bigram[0] in stop_word_list or bigram[1] in stop_word_list:
        return False
    # Drop bigrams containing stray 'n' / 't' tokens or 'PRON' placeholders
    if 'n' in bigram or 't' in bigram:
        return False
    if 'PRON' in bigram:
        return False
    return True

# Can eyeball list and choose PMI threshold where n-grams stop making sense
# In this case, get top 500 bigrams/trigrams with highest PMI score
filtered_bigram = bigram_pmi[bigram_pmi.apply(
    lambda bigram: bigram_filter(bigram['bigram']) and bigram.pmi > 5,
    axis=1)][:500]
bigrams = [' '.join(x) for x in filtered_bigram.bigram.values
           if len(x[0]) > 2 or len(x[1]) > 2]
Lastly, we concatenate these phrases together into one word.
def replace_ngram(x):
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x

reviews_w_ngrams = clean_reviews.copy()
reviews_w_ngrams.reviewText = reviews_w_ngrams.reviewText.map(replace_ngram)
Tip #2: Filter remaining words for nouns
In the sentence ‘The store is nice’, we know the sentence is talking about ‘store’. The
other words in the sentence provide more context and explanation about the topic
(‘store’) itself. Therefore, filtering for nouns extracts words that are more
interpretable for the topic model. An alternative is to filter for both nouns and
verbs.
# Tokenize reviews + remove stop words + remove names + remove words
# with 2 or fewer characters
reviews_w_ngrams = reviews_w_ngrams.reviewText.map(
    lambda x: [word for word in x.split()
               if word not in stop_word_list
               and word not in english_names
               and len(word) > 2])

# Filter for only nouns
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    return filtered

final_reviews = reviews_w_ngrams.map(noun_only)
Tip #3: Optimize choice for number of topics through coherence measure
LDA requires specifying the number of topics. We can tune this through
optimization of measures such as predictive likelihood, perplexity, and coherence.
Much literature has indicated that maximizing a coherence measure, named Cv [1],
leads to better human interpretability. We can test out a number of topics and assess
the Cv measure:
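import gensim

# Assumed setup (not shown in the article): build the vocabulary and the
# bag-of-words corpus from the noun-filtered reviews
dictionary = gensim.corpora.Dictionary(final_reviews)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in final_reviews]

Lda = gensim.models.ldamodel.LdaModel
coherence = []
for k in range(5, 25):
    print('Round: ' + str(k))
    ldamodel = Lda(doc_term_matrix, num_topics=k,
                   id2word=dictionary, passes=40,
                   iterations=200, chunksize=10000, eval_every=None)
    cm = gensim.models.coherencemodel.CoherenceModel(
        model=ldamodel, texts=final_reviews,
        dictionary=dictionary, coherence='c_v')
    coherence.append((k, cm.get_coherence()))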
Plotting this shows:
Coherence stops improving significantly after 15 topics. The highest Cv is not always
the best choice, so it is worth trying several values to find the best result. We tried 15
and 23 here, and 23 yielded clearer results. Adding topics can help reveal further
sub-topics. Nonetheless, if the same words start to appear across multiple topics, the
number of topics is too high.
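Since the highest Cv is not automatically the best choice, one simple way to shortlist values of k worth inspecting by hand is to keep every k whose coherence is within a small tolerance of the best score. The sketch below is only a heuristic of our own, not part of gensim: the helper name candidate_topic_counts, the tolerance, and the example scores are all made up. It expects the same (k, score) pairs that the loop above collects.

```python
def candidate_topic_counts(coherence, tolerance=0.02):
    """Return every k whose coherence is within `tolerance` of the best score."""
    best = max(score for _, score in coherence)
    return sorted(k for k, score in coherence if score >= best - tolerance)

# Made-up (k, Cv) pairs in the same shape as the `coherence` list above.
example_coherence = [(5, 0.42), (10, 0.50), (15, 0.545), (20, 0.552), (23, 0.56)]
candidates = candidate_topic_counts(example_coherence)
print(candidates)  # [15, 20, 23]
```

Each shortlisted k still needs a manual read of its topics, for example to check whether the same words repeat across topics.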
Tip #4: Adjust LDA hyperparameters
Lda = gensim.models.ldamodel.LdaModel
ldamodel2 = Lda(doc_term_matrix, num_topics=23, id2word=dictionary,
                passes=40, iterations=200, chunksize=10000,
                eval_every=None, random_state=0)
If your topics still do not make sense, try increasing passes and iterations, while
increasing chunksize to the extent your memory can handle.
chunksize is the number of documents loaded into memory at a time for
training. passes is the number of full training passes through the entire corpus.
iterations is the maximum number of iterations over each document to reach convergence;
limiting this means that some documents may not converge in time. If the training
corpus has 200 documents, chunksize is 100, passes is 2, and iterations is 10, the
algorithm goes through these rounds:
Round #1: documents 0–99
Round #2: documents 100–199
Round #3: documents 0–99
Round #4: documents 100–199
Each round updates each document’s topic assignments for a
maximum of 10 iterations, moving on to the next document early if it has already
converged. This is essentially the algorithm’s key steps 2–4 explained earlier,
repeated for the number of passes, while step 3 is repeated for 10 iterations or fewer.
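The round schedule described above can be reproduced with a few lines of plain Python. This sketch only mirrors the description in this section (the helper name training_rounds is made up); it is not gensim's internal code.

```python
def training_rounds(num_docs, chunksize, passes):
    """List the (first_doc, last_doc) range of every training round, in order."""
    rounds = []
    for _ in range(passes):                         # one full sweep of the corpus per pass
        for start in range(0, num_docs, chunksize):
            end = min(start + chunksize, num_docs) - 1
            rounds.append((start, end))
    return rounds

rounds = training_rounds(num_docs=200, chunksize=100, passes=2)
for number, (start, end) in enumerate(rounds, start=1):
    print(f"Round #{number}: documents {start}-{end}")
```

With the same corpus size, a larger chunksize means fewer rounds per pass, which is why more passes are needed to get the same number of updates.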
The topic distributions for the entire corpus are updated after each chunk and after
each pass. Increasing chunksize to the extent your memory can handle will
increase speed, as the topic distribution update is expensive. However, increasing
chunksize requires increasing the number of passes to ensure sufficient corpus topic
distribution updates, especially in small corpora. iterations also needs to be high
enough that most documents reach convergence before moving
on. We can try increasing these parameters when topics still don’t make sense, but
logging can also help debug:
import logging
# Level DEBUG, since the convergence line shown below is logged at DEBUG level
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.DEBUG)
Look for lines like this in the log, which will repeat for the number of
passes that you set:
2020-07-21 06:44:16,300 - gensim.models.ldamodel - DEBUG - 100/35600
documents converged within 200 iterations
By the end of the passes, most of the documents should have converged. If not,
increase passes and iterations.
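Rather than reading the log by eye, the convergence lines can be pulled out with a regular expression. This is a minimal sketch that assumes the message wording matches the sample line above; the helper name convergence_ratios is made up, and the pattern may need adjusting if your gensim version words the message differently.

```python
import re

# Matches e.g. "100/35600 documents converged within 200 iterations"
CONVERGED = re.compile(r"(\d+)/(\d+) documents converged within (\d+) iterations")

def convergence_ratios(log_lines):
    """Return the fraction of converged documents for each matching log line."""
    ratios = []
    for line in log_lines:
        match = CONVERGED.search(line)
        if match:
            converged, total = int(match.group(1)), int(match.group(2))
            if total:
                ratios.append(converged / total)
    return ratios

sample_lines = [
    "2020-07-21 06:44:16,300 - gensim.models.ldamodel - DEBUG - "
    "100/35600 documents converged within 200 iterations",
    "some unrelated log line",
]
ratios = convergence_ratios(sample_lines)
print(ratios)

# On a real run, pass the open log file instead of the sample lines:
# with open('gensim.log') as log_file:
#     ratios = convergence_ratios(log_file)
```

If the ratios stay low in the final passes, that is the signal to raise passes and iterations.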
Tip #5: Use pyLDAvis to visualize topic relationships
The pyLDAvis [2] package in Python gives two important pieces of information. The
circles represent each topic. The distance between the circles visualizes topic
relatedness. These are mapped through dimensionality reduction (PCA/t-sne) on
distances between each topic’s probability distributions into 2D space. This shows
whether our model developed distinct topics. We want to tune model parameters
and number of topics to minimize circle overlap.
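To give a feel for what a distance between two topics' probability distributions means, here is a small sketch of the Jensen-Shannon divergence, which is, to our understanding, the distance pyLDAvis uses by default before projecting to 2D. Treat that as an assumption rather than a statement about the library's internals. The helper name js_divergence, the four-word vocabulary, and the topic distributions are made up.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Made-up topic-word distributions over the vocabulary [printer, ink, scanner, pen]
topic_printer = [0.5, 0.3, 0.1, 0.1]
topic_scanner = [0.2, 0.2, 0.5, 0.1]
topic_pen = [0.05, 0.05, 0.05, 0.85]

near = js_divergence(topic_printer, topic_scanner)
far = js_divergence(topic_printer, topic_pen)
print(round(near, 3), round(far, 3))
```

Topics that share much of their probability mass get a small divergence and end up close together on the map, while topics with little shared vocabulary end up far apart.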
Topic distance also shows how related topics are. Topics 1, 2, and 13, clustered together,
talk about electronics (printers, scanners, phone/fax). Topics in quadrant 3, such as
6, 14, and 19, are about office stationery (packaging materials, post-its, file organizers).
Additionally, circle size represents topic prevalence. For example, topic 1 makes up
the biggest portion of topics being talked about amongst documents, constituting
17.1% of the tokens.
Tip #6: Tune relevancy score to prioritize terms more exclusive to a topic
Words representing a given topic may be ranked high because they are globally
frequent across the corpus. The relevance score helps prioritize terms that belong more
exclusively to a given topic, making the topic more obvious. The relevance of term w
to topic k, given a weight λ between 0 and 1, is defined as:
r(w, k | λ) = λ · log(ϕ_kw) + (1 − λ) · log(ϕ_kw / p_w)
where ϕ_kw is the probability of word w in topic k and ϕ_kw/p_w is the lift: the ratio of the
term’s probability within the topic to its marginal probability p_w across the entire corpus
(this helps discard globally frequent terms). A lower λ gives more importance to the
second term (ϕ_kw/p_w), which gives more importance to topic exclusivity. We can
again use pyLDAvis for this. For instance, when lowering λ to 0.6, we can see that
topic 13 ranks terms that are even more relevant to the topic of phones.
Dial the lambda around to get the result that makes the most sense and apply the
optimal lambda value to obtain the output:
# topic_data is assumed to be the prepared-data object returned by pyLDAvis
# (its creation is not shown in the article)
all_topics = {}
lambd = 0.6  # Adjust this accordingly
for i in range(1, 22):  # Adjust number of topics in final model
    topic = topic_data.topic_info[
        topic_data.topic_info.Category == 'Topic' + str(i)].copy()
    topic['relevance'] = (topic['loglift'] * (1 - lambd)
                          + topic['logprob'] * lambd)
    all_topics['Topic ' + str(i)] = topic.sort_values(
        by='relevance', ascending=False).Term[:10].values
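To see how lowering λ reorders terms, here is a small worked example of the relevance calculation on its own, independent of pyLDAvis. The helper names and all probabilities are made up for illustration: "product" stands in for a globally frequent word, while "handset" and "cordless" are rarer words that are concentrated in one topic.

```python
import math

def relevance(phi_kw, p_w, lam):
    """lam * log(P(word | topic)) + (1 - lam) * log(lift)."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda term: relevance(*terms[term], lam), reverse=True)

# term: (probability within the topic, marginal probability in the corpus); made-up values
terms = {
    "product": (0.10, 0.08),    # frequent everywhere, lift of only 1.25
    "handset": (0.04, 0.005),   # rarer overall, lift of 8
    "cordless": (0.02, 0.002),  # rarest, lift of 10
}

by_probability = rank_terms(terms, lam=1.0)  # lambda = 1 ranks purely by in-topic probability
by_relevance = rank_terms(terms, lam=0.6)
print(by_probability)  # ['product', 'handset', 'cordless']
print(by_relevance)    # ['handset', 'product', 'cordless']
```

At λ = 1 the globally frequent word leads, and at λ = 0.6 the more topic-exclusive "handset" overtakes it, which is the effect described above for the phone topic.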
Final Results
From here, we can further analyze sentiment around these topic keywords (e.g.
search for associated adjectives or review star ratings). In business applications,
this provides insight into which topics customers deem important, as well as how
they feel about them. This enables targeted product development and customer
experience improvements. This example covers a variety of products, but a
separate topic model for each product may reveal the aspects that customers care
about. For example, this analysis has already started to reveal important aspects of
calculators (topic 21), such as display, easy-to-press buttons, battery, and weight. Sellers
can then make sure to highlight these features in their product descriptions or
improve upon these aspects for competitiveness.
Sources:
[1] Michael Röder, Andreas Both, Alexander Hinneburg, Exploring the Space of
Topic Coherence Measures
[2] Carson Sievert, Kenneth E. Shirley, LDAvis: A method for visualizing and
interpreting topics
  • 1. 3/22/24, 9:19 AM 6 Tips for Interpretable Topic Models | by Nicha Ruchirawat | Towards Data Science https://towardsdatascience.com/6-tips-to-optimize-an-nlp-topic-model-for-interpretability-20742f3047e2 1/17 6 Tips for Interpretable Topic Models Hands-on Python tutorial on tuning LDA topic models for easy-to-understand outputs. Nicha Ruchirawat · Follow Published in Towards Data Science 9 min read · Nov 1, 2020 Listen Share With so much text outputted on digital platforms, the ability to automatically understand key topic trends can reveal tremendous insight. For example, businesses can benefit from understanding customer conversation trends around their brand and products. A common method to pick up key topics is Latent Dirichlet Allocation (LDA). However, outputs are often difficult to interpret for useful insights. We will explore techniques to enhance interpretability. What is Latent Dirichlet Allocation (LDA)? Latent Dirichlet Allocation (LDA) is a generative statistical model that helps pick up similarities across a collection of different data parts. In topic modeling, each data part is a word document (e.g. a single review on a product page) and the collection of documents is a corpus (e.g. all users’ reviews for a product page). Similar sets of words occurring repeatedly may likely indicate topics. LDA assumes that each document is represented by a distribution of a fixed number of topics, and each topic is a distribution of words. Algorithm’s high level key steps to approximate these distributions: 1. User select K, the number of topics present, tuned to fit each dataset. 2. Go through each document, and randomly assign each word to one of K topics. From this, we have a starting point for calculating document distribution of topics p(topic t|document d), proportion of words in document d that are Open in app Sign up Sign in Search
  • 2. 3/22/24, 9:19 AM 6 Tips for Interpretable Topic Models | by Nicha Ruchirawat | Towards Data Science https://towardsdatascience.com/6-tips-to-optimize-an-nlp-topic-model-for-interpretability-20742f3047e2 2/17 assigned to topic t. We can also calculate topic distribution of words p(word w|topic t), proportion of word w in all documents’ words that are assigned to topic t. These will be poor approximations due to randomness. 3. To improve approximations, we iterate through each document. For each document, go through each word and reassign a new topic, where we choose topic t with a probability p(topic t|document d) ∗p(word w|topic t) based on last round’s distribution. This is essentially the probability that topic t generated word w. Recalculate p(topic t|document d) and p(word w|topic t) from these new assignments. 4. Keep iterating until topic/word assignments reach a steady state and no longer change much, (i.e. converge). Use final assignments to estimate topic mixtures of each document (% words assigned to each topic within that document) and word associated to each topic (% times that word is assigned to each topic overall). Data Preparation We will explore techniques to optimize interpretability using LDA on Amazon Office Product reviews. To prepare the reviews data, we clean the reviews text with typical text cleaning steps: 1. Remove non-ascii characters, such as À µ ∅ © 2. ‘Lemmatize’ words, which transform words to its most basic form, such as ‘running’ and ‘ran’ to ‘run’ so that they are recognized as the same word 3. Remove punctuation 4. Remove non-English comments if present All code in the tutorial can be found here, where the functions for cleaning are located in clean_text.py. The main notebook for the whole process is topic_model.ipynb. Steps to Optimize Interpretability Tip #1: Identify phrases through n-grams and filter noun-type structures We want to identify phrases so the topic model can recognize them. 
Bigrams are phrases containing 2 words e.g. ‘social media’. Likewise, trigrams are phrases
  • 3. 3/22/24, 9:19 AM 6 Tips for Interpretable Topic Models | by Nicha Ruchirawat | Towards Data Science https://towardsdatascience.com/6-tips-to-optimize-an-nlp-topic-model-for-interpretability-20742f3047e2 3/17 containing 3 words e.g. ‘Proctor and Gamble’. There are many ways to detect n- grams, explained here. In this example, we will use Pointwise Mutual Information (PMI) score. This measures how much more likely the words co-occur than if they were independent. The metric is sensitive to rare combination of words, so it is used with an occurrence frequency filter to ensure phrase relevance. Bigram example below (trigram code included in Jupyter Notebook): # Example for detecting bigrams bigram_measures = nltk.collocations.BigramAssocMeasures() finder =nltk.collocations.BigramCollocationFinder .from_documents([comment.split() for comment in clean_reviews.reviewText]) # Filter only those that occur at least 50 times finder.apply_freq_filter(50) bigram_scores = finder.score_ngrams(bigram_measures.pmi) Additionally, we filter bigrams or trigrams with noun structures. This helps the LDA model better cluster topics, as nouns are better indicators of a topic being talked about. We use NLTK package to tag part of speech and filter these structures. 
# Example filter for noun-type structures bigrams def bigram_filter(bigram): tag = nltk.pos_tag(bigram) if tag[0][1] not in ['JJ', 'NN'] and tag[1][1] not in ['NN']: return False if bigram[0] in stop_word_list or bigram[1] in stop_word_list: return False if 'n' in bigram or 't' in bigram: return False if 'PRON' in bigram: return False return True # Can eyeball list and choose PMI threshold where n-grams stop making sense # In this case, get top 500 bigrams/trigrams with highest PMI score filtered_bigram = bigram_pmi[bigram_pmi.apply(lambda bigram: bigram_filter(bigram['bigram']) and bigram.pmi > 5, axis = 1)][:500] bigrams = [' '.join(x) for x in filtered_bigram.bigram.values if len(x[0]) > 2 or len(x[1]) > 2]
Lastly, we concatenate these phrases together into one word.

def replace_ngram(x):
    # Replace trigrams first: once a bigram has been joined with '_',
    # a trigram that contains it would no longer match
    for gram in trigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    for gram in bigrams:
        x = x.replace(gram, '_'.join(gram.split()))
    return x

reviews_w_ngrams = clean_reviews.copy()
reviews_w_ngrams.reviewText = reviews_w_ngrams.reviewText.map(lambda x: replace_ngram(x))

Tip #2: Filter remaining words for nouns

In the sentence 'The store is nice', we know the sentence is talking about 'store'. The other words provide context and explanation about the topic ('store') itself. Filtering for nouns therefore extracts the words that are most interpretable for the topic model. An alternative is to filter for both nouns and verbs.

# Tokenize reviews + remove stop words + remove names + remove words with 2 characters or fewer
reviews_w_ngrams = reviews_w_ngrams.reviewText.map(
    lambda x: [word for word in x.split()
               if word not in stop_word_list
               and word not in english_names
               and len(word) > 2])

# Filter for only nouns ('NN' is the Penn Treebank tag for singular common nouns)
def noun_only(x):
    pos_comment = nltk.pos_tag(x)
    filtered = [word[0] for word in pos_comment if word[1] in ['NN']]
    return filtered

final_reviews = reviews_w_ngrams.map(noun_only)

Tip #3: Optimize choice for number of topics through coherence measure

LDA requires specifying the number of topics. We can tune this by optimizing measures such as predictive likelihood, perplexity, and coherence. Much of the literature indicates that maximizing a coherence measure named Cv [1] leads to better human interpretability. We can test a range of topic counts and assess the Cv measure for each:

coherence = []
for k in range(5, 25):
    print('Round: ' + str(k))
    Lda = gensim.models.ldamodel.LdaModel
    ldamodel = Lda(doc_term_matrix, num_topics=k, id2word=dictionary,
                   passes=40, iterations=200, chunksize=10000,
                   eval_every=None)
    cm = gensim.models.coherencemodel.CoherenceModel(
        model=ldamodel, texts=final_reviews,
        dictionary=dictionary, coherence='c_v')
    coherence.append((k, cm.get_coherence()))

Plotting this shows that coherence stops improving significantly after 15 topics. The highest Cv does not always give the best model, so we can try several values to find the best result. We tried 15 and 23 here, and 23 yielded clearer results. Adding topics can help reveal further subtopics. Nonetheless, if the same words start to appear across multiple topics, the number of topics is too high.

Tip #4: Adjust LDA hyperparameters

Lda2 = gensim.models.ldamodel.LdaModel
ldamodel2 = Lda2(doc_term_matrix, num_topics=23, id2word=dictionary,
                 passes=40, iterations=200, chunksize=10000,
                 eval_every=None, random_state=0)

If your topics still do not make sense, try increasing passes and iterations, while increasing chunksize to the extent your memory can handle.

chunksize is the number of documents loaded into memory at a time for training. passes is the number of training passes through the entire corpus. iterations is the maximum number of iterations over each document to reach convergence; limiting this means that some documents may not converge in time. If the training corpus has 200 documents, chunksize is 100, passes is 2, and iterations is 10, the algorithm goes through these rounds:

Round #1: documents 0–99
Round #2: documents 100–199
Round #3: documents 0–99
Round #4: documents 100–199

Each round updates each document's probability distribution assignments at most 10 times, moving on to the next document sooner if it has already converged. This is essentially the algorithm's key steps 2–4 explained earlier, repeated for the number of passes, while step 3 is repeated for 10 iterations or fewer.

The topic distributions for the entire corpus are updated after each chunk, and after each pass. Increasing chunksize to the extent your memory can handle will increase speed, as the topic distribution update is expensive. However, increasing chunksize requires increasing the number of passes to ensure sufficient corpus topic distribution updates, especially in small corpora. iterations also needs to be high enough to ensure that a good number of documents reach convergence before moving on.
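The round schedule described above can be reproduced with a few lines of plain Python. This is only a sketch of the bookkeeping (which documents are visited in which round), not of gensim's training itself; the helper name `training_rounds` is hypothetical.

```python
def training_rounds(num_docs, chunksize, passes):
    """List (pass number, first doc index, last doc index) for each chunk visited."""
    rounds = []
    for pass_no in range(1, passes + 1):
        for start in range(0, num_docs, chunksize):
            end = min(start + chunksize, num_docs) - 1  # last chunk may be shorter
            rounds.append((pass_no, start, end))
    return rounds

# The example from the text: 200 documents, chunksize 100, passes 2
schedule = training_rounds(num_docs=200, chunksize=100, passes=2)
for round_no, (pass_no, start, end) in enumerate(schedule, start=1):
    print(f"Round #{round_no} (pass {pass_no}): documents {start}-{end}")
```

Under this reading, the number of rounds is the number of chunks per pass times the number of passes, so doubling chunksize roughly halves the number of corpus-level updates unless passes is raised to compensate.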
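We can try increasing these parameters when topics still don't make sense, but logging can also help debug:

import logging

# gensim reports document convergence at DEBUG level,
# so the log level must be DEBUG for those lines to be written
logging.basicConfig(filename='gensim.log',
                    format="%(asctime)s:%(levelname)s:%(message)s",
                    level=logging.DEBUG)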
Look for lines that look like this in the log, which will repeat for the number of passes that you set:

2020-07-21 06:44:16,300 - gensim.models.ldamodel - DEBUG - 100/35600 documents converged within 200 iterations

By the end of the passes, most of the documents should have converged. If not, increase passes and iterations.

Tip #5: Use pyLDAvis to visualize topic relationships

The pyLDAvis [2] package in Python gives two important pieces of information. The circles represent each topic. The distance between the circles visualizes topic relatedness. These are mapped through dimensionality reduction (PCA/t-SNE) on distances between each topic's probability distributions into 2D space. This shows whether our model developed distinct topics. We want to tune model parameters and number of topics to minimize circle overlap.

Topic distance also shows how related topics are. Topics 1, 2, and 13, clustered together, talk about electronics (printers, scanners, phone/fax). Topics in quadrant 3, such as 6, 14, and 19, are about office stationery (packaging materials, post-its, file organizers).

Additionally, circle size represents topic prevalence. For example, topic 1 makes up the biggest portion of topics being talked about amongst documents, constituting 17.1% of the tokens.
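The idea of a distance between two topics' probability distributions, which underlies the circle layout in this tip, can be sketched in plain Python. Jensen-Shannon divergence is assumed here as the distance purely for illustration (the article does not name the metric), and the three toy topics over a five-word vocabulary are made up.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and between 0 and 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions over the vocabulary
# [printer, scanner, ink, tape, envelope]
topic_printer = [0.5, 0.3, 0.2, 0.0, 0.0]
topic_scanner = [0.3, 0.5, 0.2, 0.0, 0.0]
topic_packing = [0.0, 0.0, 0.0, 0.6, 0.4]

near = js_divergence(topic_printer, topic_scanner)
far = js_divergence(topic_printer, topic_packing)
print(f"printer vs scanner: {near:.3f}")
print(f"printer vs packing: {far:.3f}")
```

Topics that share most of their probability mass end up close together, while topics with no words in common reach the maximum distance. A dimensionality reduction step then places the topics in 2D so that these pairwise distances are preserved as well as possible.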
Tip #6: Tune relevancy score to prioritize terms more exclusive to a topic

Words representing a given topic may be ranked high because they are globally frequent across a corpus. The relevancy score helps prioritize terms that belong more exclusively to a given topic, making the topic more obvious. The relevance of term w to topic k is defined as:

r(w, k | λ) = λ log(ϕ_kw) + (1 − λ) log(ϕ_kw / p_w)

where ϕ_kw is the probability of word w in topic k and ϕ_kw/p_w is the lift of the term's probability within a topic over its marginal probability p_w across the entire corpus (this helps discard globally frequent terms). A lower λ gives more importance to the second term (ϕ_kw/p_w), which gives more importance to topic exclusivity. We can again use pyLDAvis for this. For instance, when lowering λ to 0.6, we can see that topic 13 ranked terms that are even more relevant to the topic of phones.

Dial the lambda around to get the result that makes the most sense, and apply the optimal lambda value to obtain the output:

all_topics = {}
lambd = 0.6  # Adjust this accordingly
for i in range(1, 22):  # Adjust the upper bound to the number of topics in the final model + 1
    # .copy() avoids modifying a slice of topic_info when adding the new column
    topic = topic_data.topic_info[topic_data.topic_info.Category == 'Topic' + str(i)].copy()
    topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
    all_topics['Topic ' + str(i)] = topic.sort_values(
        by='relevance', ascending=False).Term[:10].values
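To see how λ changes a ranking, the relevance score from Sievert and Shirley [2], λ log(ϕ_kw) + (1 − λ) log(ϕ_kw / p_w), can be evaluated by hand on two made-up terms. The probabilities below are hypothetical and chosen only to show the effect; p_w denotes the term's marginal probability across the corpus.

```python
import math

def relevance(phi_kw, p_w, lam):
    """Relevance of a term: lam * log(phi_kw) + (1 - lam) * log(phi_kw / p_w)."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# Hypothetical (phi_kw, p_w) pairs for one topic:
# 'phone' is likely in the topic but also common across the corpus,
# 'handset' is less likely in the topic but almost exclusive to it.
terms = {
    "phone": (0.05, 0.04),
    "handset": (0.03, 0.004),
}

def rank_terms(terms, lam):
    """Sort term names from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

for lam in (1.0, 0.6):
    print(f"lambda={lam}: {rank_terms(terms, lam)}")
```

With λ = 1 the ranking depends only on within-topic probability, so the globally frequent term comes first. Lowering λ to 0.6 gives the lift term enough weight to move the more exclusive term to the top.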
Final Results

From here, we can further analyze sentiment around these topic keywords (e.g. search for associated adjectives or review star ratings). In business applications, this provides insight into which topics customers deem important, as well as how they feel about them. This enables targeted product development and customer experience improvements. This example contains a variety of products, but a separate topic model for each product may reveal the aspects that customers care about. For example, this analysis already started to reveal important aspects of calculators (topic 21) such as display, easy-to-press buttons, battery, and weight. Sellers then need to make sure to highlight these features in their product descriptions or improve upon these aspects for competitiveness.

Sources:
[1] Michael Röder, Andreas Both, Alexander Hinneburg, Exploring the Space of Topic Coherence Measures
[2] Carson Sievert, Kenneth E. Shirley, LDAvis: A method for visualizing and interpreting topics

Written by Nicha Ruchirawat, Data Scientist @ Visa. Github: https://github.com/nicharuc/ LinkedIn: https://www.linkedin.com/in/nicharuchirawat/