Topic Modeling
By
LDA
Latent Dirichlet Allocation
Topic Modeling
Topic modeling is a technique for uncovering the underlying topics in a
document. In simple words, it helps identify what the document is
talking about: the important topics in the article.
Types of Topic Models
1) Latent Semantic Indexing (LSI)
2) Latent Dirichlet Allocation (LDA)
3) Probabilistic Latent Semantic Indexing (PLSI)
Document → topic → words
Rupak Roy
Topic Modeling - LDA
Topics                      | Technology                     | Healthcare          | Business
% of topics in the document | 30%                            | 60%                 | 17%
Bag of words                | Google, Dell, Apple, Microsoft | Radiology, Diagnose | Transactions, Bank, Cost
DOCUMENT
Behind LDA
Topic 1: Technology: Google, Dell, Apple, Microsoft
Topic 2: Healthcare: Radiology, Diagnose, CT Scan
Topic 3: Business: Transactions, Banks, Cost
Topic Modeling
How often does ‘Diagnose’ appear in the topic Healthcare?
If the word ‘Diagnose’ often occurs in the topic Healthcare, then this
instance of ‘Diagnose’ might belong to the topic Healthcare.
Next, how common is the topic Healthcare in the rest of the document?
This reasoning is similar to Bayes' theorem.
To find the probability of a possible topic T for a word W:
multiply the frequency of the word type W in T by the number of other
words in document D that already belong to T.
The output is a score proportional to the probability that this word
came from topic T (normalize over all topics to get probabilities):
P(T | W, D) ∝ (words W in the topic T / words in the document) * (words
in D that belong to T)
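The scoring rule above can be tried on toy numbers. The following is a minimal sketch in base R; all counts are made up for illustration and the variable names (`word_in_topic`, `doc_in_topic`, `doc_length`) are hypothetical, not taken from the slides. It computes the score for each topic and normalizes the scores so they sum to one.

```r
# Hypothetical counts, for illustration only.
# How often the word "diagnose" is currently assigned to each topic
word_in_topic <- c(Technology = 1, Healthcare = 8, Business = 1)
# How many other words in document D are currently assigned to each topic
doc_in_topic <- c(Technology = 3, Healthcare = 6, Business = 1)
# Total number of words in document D (10 other words plus this one)
doc_length <- 11

# Score per topic: (words W in T / words in the document) * words in D that belong to T
score <- (word_in_topic / doc_length) * doc_in_topic

# Normalize the scores so they can be read as probabilities
prob <- score / sum(score)

# The topic with the highest probability for this instance of the word
best_topic <- names(which.max(prob))
print(round(prob, 3))
print(best_topic)
```

With these made-up counts, Healthcare dominates because the word is frequent in that topic and the topic is also common in the document.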
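library(RTextTools)
library(topicmodels)
#Load the tweets data (choose the CSV file interactively)
tweets <- read.csv(file.choose(), stringsAsFactors = FALSE)
View(tweets)
names(tweets)
#Keep columns 6 and 11 (the airline and the tweet text)
tweets1 <- tweets[, c(6, 11)]
names(tweets1)
dim(tweets1)
#Rename the text column to "tweets"
names(tweets1)[2] <- "tweets"
View(tweets1)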
#Create a Document Term Matrix
matrix <- create_matrix(cbind(as.vector(tweets1$airline), as.vector(tweets1$tweets)),
                        language="english", removeNumbers=TRUE, removePunctuation=TRUE,
                        removeSparseTerms=0,
                        removeStopwords=TRUE, stripWhitespace=TRUE, toLower=TRUE)
#Inspect the first few documents and terms
tm::inspect(matrix[1:5, 1:5])
#Choose the number of topics
k <- 15
#Split the data into training and testing
#We will take a small subset of the data
train <- matrix[1:500,]
test <- matrix[501:750,]
#For the full data set (upper bound assumed to be 14640 rows):
#train <- matrix[1:10248,]
#test <- matrix[10249:14640,]
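To make the idea of a document term matrix concrete, here is a minimal sketch in base R that builds one by hand. The three documents are made up, and this is only an illustration of the data structure: `create_matrix` additionally handles stop words, punctuation, and sparse storage.

```r
# Three made-up documents, for illustration only
docs <- c("flight delayed again",
          "great flight great crew",
          "bag lost again")

# Lowercase and split on whitespace to get the tokens of each document
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct tokens
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
dtm <- t(vapply(tokens,
                function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

# Rows are documents, columns are terms, cells are term counts
print(dtm)

# Splitting by rows gives train and test sets, as in the slides
train_toy <- dtm[1:2, , drop = FALSE]
test_toy  <- dtm[3, , drop = FALSE]
```

Each row sum equals the number of tokens in that document, which is a quick sanity check on the matrix.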
#Build the model on train data
train.lda <- LDA(train, k)
#Top 5 most likely topics for each document
#(without the second argument only the single most likely topic is returned)
top.topics <- get_topics(train.lda, 5)
View(top.topics)
#Top 5 most probable terms for each topic
#(without the second argument only the single most probable term is returned)
top.terms <- get_terms(train.lda, 5)
View(top.terms)
#Get the most likely topic for each training document
train.topics <- topics(train.lda)
#Test the model: posterior topic probabilities for the test documents
test.topics <- posterior(train.lda, test)
#Rows are documents, columns are the topics (up to 15, the value of k)
test.topics$topics[1:10, 1:15]
#For each test document, pick the topic with the highest probability
test.topics <- apply(test.topics$topics, 1, which.max)
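The last step, picking the most probable topic per document, can be illustrated without fitting a model. This is a minimal base R sketch; the matrix `post` is a made-up stand-in for the `topics` element returned by `posterior()`, with three documents and three topics.

```r
# Hypothetical posterior topic probabilities: 3 documents x 3 topics
post <- matrix(c(0.70, 0.20, 0.10,
                 0.10, 0.30, 0.60,
                 0.25, 0.50, 0.25),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2", "3")))

# Each row is a probability distribution over topics
row_sums <- rowSums(post)

# which.max returns the column index of the largest value in each row
pred_topic <- apply(post, 1, which.max)

# Probability of the winning topic, useful for spotting uncertain documents
pred_prob <- apply(post, 1, max)

print(data.frame(pred_topic, pred_prob))
```

Keeping the winning probability next to the predicted topic makes it easier to see which assignments are close calls, such as `doc3` in this toy example.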
#Join the predicted topic number to the original test data
test1 <- tweets[501:750,]
final <- data.frame(Title=test1$airline, Subject=test1$text,
                    Pred.topic=test.topics)
View(final)
#Number of test tweets assigned to each topic
table(final$Pred.topic)
#View the tweets of one topic, e.g. topic 10
View(final[final$Pred.topic==10,])
#---------------Another method to get the optimal number of topics ---------#
library(topicmodels)
#Fit one model per candidate k; seq(2, 20) is the range of k values
best.model <- lapply(seq(2, 20, by=1), function(k){LDA(matrix, k)})
#Log-likelihood is one way to measure how well each model fits the data
best_model <- sapply(best.model, function(m) as.numeric(logLik(m)))
final_best_model <- data.frame(topics=seq(2, 20, by=1),
                               log_likelihood=best_model)
#The higher the log-likelihood, the better the model fits.
#It is computed on the fitting data and can keep rising as k grows,
#so treat the result as a guide rather than a final answer.
head(final_best_model)
library(ggplot2)
#Plot the log-likelihood against the number of topics
ggplot(final_best_model, aes(topics, log_likelihood)) + geom_point(colour="red")
#Get the k with the highest log-likelihood
k <- final_best_model[which.max(final_best_model$log_likelihood), 1]
cat("Best topic number k =", k)
Steps in Topic Modeling
1) Load the data
2) Create the document term matrix (DTM)
3) Choose the number of topics (K)
4) Divide the data into train & test
5) Build the model on the train data
6) Get the topics
7) Test the model
8) Join the predicted topic number to the original dataset
9) Analyze