Jan Zizka et al. (Eds) : CCSEIT, AIAP, DMDB, MoWiN, CoSIT, CRIS, SIGL, ICBB, CNSA-2016
pp. 201–210, 2016. © CS & IT-CSCP 2016 DOI : 10.5121/csit.2016.60616
A TEXT MINING RESEARCH BASED ON
LDA TOPIC MODELLING
Zhou Tong and Haiyi Zhang
Jodrey School of Computer Science, Acadia University, Wolfville, NS, Canada
zhoutong@acadiau.ca, haiyi.zhang@acadiau.ca
ABSTRACT
A large amount of digital text information is generated every day. Effectively searching,
managing and exploring this text data has become a main challenge. In this paper, we first
present an introduction to text mining and to a probabilistic topic model, Latent Dirichlet
allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of
users' tweets. The former builds a document topic model, aiming at a topic-perspective solution
for searching, exploring and recommending articles. The latter sets up a user topic model,
providing a full analysis of Twitter users' interests. The experiment process, including data
collection, data pre-processing and model training, is fully documented and commented.
Furthermore, the conclusions and applications of this paper could serve as a useful
computational tool for social and business research.
KEYWORDS
topic model, LDA, text mining, probabilistic model
1. INTRODUCTION
As computers and the Internet are widely used in almost every area, more and more information is
digitized and stored online in the form of news, blogs, and social networks. Since the amount of
information has exploded to astronomical figures, searching and exploring the data has become
a main problem. Our research is intended to design a new computational tool based on topic
models, using text mining techniques to organize, search and analyse vast amounts of data,
providing a better way of understanding and mining the information.
2. BACKGROUND
2.1. Text Mining
Text mining is the process of deriving high-quality information from text [1]. Text mining
usually involves structuring the input text, finding patterns within the structured data, and
finally evaluating and interpreting the output. Typical text mining tasks include text
categorization, text clustering, document summarization and keyword extraction. In this
research, statistical and machine learning techniques are used to mine meaningful information
and support exploratory data analysis.
2.2. Topic Modelling
In machine learning and natural language processing, topic models are generative models which
provide a probabilistic framework [2]. Topic modelling methods are generally used for
automatically organizing, understanding, searching, and summarizing large electronic archives.
The "topics" are the hidden, to-be-estimated variables that link the words in a vocabulary to
their occurrence in documents. A document is seen as a mixture of topics. Topic models discover
the hidden themes throughout the collection and annotate the documents according to those
themes, and each word is seen as drawn from one of those topics. Finally, a document coverage
distribution over topics is generated, which provides a new way to explore the data from the
perspective of topics.
2.3. Latent Dirichlet Allocation
Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be
explained by unobserved groups that explain why some parts of the data are similar [3]. LDA has
made a big impact in the fields of natural language processing and statistical machine learning
and has quickly become one of the most popular probabilistic text modelling techniques in
machine learning.
Intuitively, in LDA documents exhibit multiple topics [4]. In text pre-processing, we exclude
punctuation and stop words (such as "if", "the", or "on", which carry little topical content).
Each document is then regarded as a mixture of corpus-wide topics. A topic is a distribution
over a fixed vocabulary, and these topics are generated from the collection of documents [5]. For
example, the sports topic has words such as "football" and "hockey" with high probability, and
the computer topic has words such as "data" and "network" with high probability. Each document
in the collection then has a probability distribution over topics, where each word is regarded as
drawn from one of those topics. With this per-document distribution over topics, we know how much
each topic is involved in a document, that is, which topics a document is mainly talking about.
A graphical model for LDA is shown in Figure 1:
Figure 1. Graphic model for Latent Dirichlet allocation
As the figure illustrates, we can describe LDA more formally with the following notation. First,
α and η are the proportion parameter and the topic parameter, respectively. The topics are β1:K,
where each βk is a distribution over the vocabulary. The topic proportions for the dth document
are θd, where θd,k is the topic proportion for topic k in document d. The topic assignments for
the dth document are zd, where zd,n is the topic assignment for the nth word in document d.
Finally, the observed words for document d are wd, where wd,n is the nth word in document d,
which is an element of the fixed vocabulary.
With this notation, the generative process for LDA corresponds to the following joint distribution
of the hidden and observed variables:
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
Notice that this distribution specifies a number of dependencies. The topic assignment zd,n
depends on the per-document topic proportions θd, and the word wd,n depends on all of the
topics β1:K and on the topic assignment zd,n.
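To make this generative process concrete, the following minimal R sketch samples a toy corpus from the model; the vocabulary, corpus sizes and hyper-parameter values are illustrative assumptions, not the settings used in the experiments.

set.seed(1)
vocab <- c("football", "hockey", "goal", "data", "network", "code")  # toy vocabulary
K <- 2; D <- 3; N <- 8                 # topics, documents, words per document
alpha <- 0.5; eta <- 0.1               # Dirichlet hyper-parameters

# Dirichlet sampling via normalised gamma draws (base R has no rdirichlet)
rdirichlet1 <- function(a) { x <- rgamma(length(a), a); x / sum(x) }

# topic-word distributions beta (one row per topic)
beta <- t(replicate(K, rdirichlet1(rep(eta, length(vocab)))))

docs <- lapply(1:D, function(d) {
  theta <- rdirichlet1(rep(alpha, K))                  # per-document topic proportions
  z <- sample(1:K, N, replace = TRUE, prob = theta)    # topic assignment for each word
  sapply(z, function(k) sample(vocab, 1, prob = beta[k, ]))  # word drawn from its topic
})
docs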
2.4. Jensen-Shannon Divergence
In probability theory and statistics, the Jensen-Shannon divergence is a popular method of
measuring the similarity between two probability distributions. It is also known as information
radius or total divergence to the average. It is based on the Kullback-Leibler divergence. The
square root of the Jensen-Shannon divergence is a metric often referred to as Jensen-Shannon
distance [6].
For discrete probability distributions P and Q, the Kullback-Leibler divergence of Q from P is
defined to be:

D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}
So, the Jensen-Shannon divergence of Q from P is defined by:

JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M), \quad \text{where } M = \frac{1}{2}(P + Q)
The Jensen-Shannon divergence measures the similarity between two distributions. By applying it
to the per-document topic distributions, we can measure the distance and similarity between any
pair of documents.
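A minimal R implementation of these formulas is sketched below; it assumes the two inputs are probability vectors over the same topics (for example, two rows of the document-topic matrix) and is not the exact code used in the experiments.

# Kullback-Leibler divergence of q from p, for discrete distributions
kl_div <- function(p, q) {
  keep <- p > 0                      # terms with p(i) = 0 contribute nothing
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: average KL to the midpoint distribution m
js_div <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}

# example with two toy topic distributions
js_div(c(0.7, 0.2, 0.1), c(0.1, 0.3, 0.6))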
3. DESIGNS AND EXPERIMENTS
In order to apply topic modelling and propose a new topic-based text mining solution, we have
designed two experiments. The first uses topic modelling to manage and explore Wikipedia, and the
second is a Twitter application of topic modelling. The former is a new solution to a typical
problem, and the latter builds a new model for Twitter data analysis.
3.1. Wikipedia on Topic Modelling
3.1.1. Experiment Overview
Wikipedia is a free-access, free-content Internet encyclopaedia supported by the non-profit
Wikimedia Foundation. It has millions of articles for people to search, explore or even edit. In
this experiment, the text data comes from the Simple English Wikipedia, with over 200,000
articles. By applying Latent Dirichlet allocation (LDA) and topic modelling, a topic-based
searching, exploring and recommending system is built.
3.1.2. Data Pre-processing
The Simple English Wikipedia is free to download from the Wikimedia Foundation database backup
dumps. The backup is in XML format. The first step of data pre-processing is to parse the XML
file and extract the text data. The R package XML provides a series of functions for parsing XML
files. Using those functions, we obtain relatively clean data for all the articles in a data
frame.
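As a rough sketch of this step, the code below parses a Wikipedia XML dump with the XML package; the file name and the page/revision/text element layout reflect the usual structure of Wikimedia dumps and should be treated as assumptions to verify against the actual dump.

library(XML)

# parse the downloaded dump (the file name is an assumption)
doc <- xmlParse("simplewiki-latest-pages-articles.xml")

# Wikimedia dumps declare a default XML namespace, so local-name() is used here
# to match elements regardless of that namespace
titles <- xpathSApply(doc, "//*[local-name()='page']/*[local-name()='title']", xmlValue)
texts  <- xpathSApply(doc, "//*[local-name()='revision']/*[local-name()='text']", xmlValue)

articles <- data.frame(title = titles, text = texts, stringsAsFactors = FALSE)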
The next step is the text-cleaning process. The purpose of text cleaning is to simplify the text
data, eliminating language-dependent factors as much as possible. Articles are written in natural
language for humans to understand, but in text mining such data are not always easy for computers
to process. In this experiment, there are three steps in text cleaning (a sketch of the pipeline
in R follows the list):

• Tokenization: a document is treated as a string; all punctuation is removed and the string is
then partitioned into a list of tokens.

• Removing stop words: stop words such as "the", "if" and "and" occur frequently but carry no
significant meaning and need to be removed.

• Stemming: stemming converts different word forms into a common canonical form, for example
"computing" to "compute" and "happiness" to "happy". This reduces data redundancy and simplifies
the later computation [7].
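The following sketch applies these three steps with the tm and SnowballC packages, assuming the articles are already in the articles data frame from the parsing step; it is a simplified version of the cleaning used in the experiment, not the exact code.

library(tm)
library(SnowballC)   # provides the stemmer used by stemDocument

corpus <- VCorpus(VectorSource(articles$text))

corpus <- tm_map(corpus, content_transformer(tolower))      # normalise case
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation before tokenizing
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop stop words
corpus <- tm_map(corpus, stemDocument)                      # reduce words to canonical stems
corpus <- tm_map(corpus, stripWhitespace)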
3.1.3. Model Training
The training process requires the R package topicmodels, with its package dependencies (tm and
others), to be loaded. An LDA model of the Simple English Wikipedia is trained on a sample of
1,000 articles with more than 1,000 characters each; it is returned after 2,000 iterations of
Gibbs sampling, with K = 50 topics and Dirichlet hyper-parameters β = 0.1 and α = 50/K.
Meanwhile, a topic coverage distribution for each document is generated. This distribution
represents how much each document is related to each topic, and a new way to search and explore
documents by topic can be implemented on top of it.
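A hedged sketch of this training step with the topicmodels package is shown below; the control settings mirror the parameters reported above (2,000 Gibbs iterations, K = 50, α = 50/K, topic-word prior 0.1, which topicmodels calls delta), but the exact code used in the experiment may differ.

library(topicmodels)

# build the document-term matrix from the cleaned corpus and drop empty documents
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[slam::row_sums(dtm) > 0, ]

k <- 50
lda_model <- LDA(dtm, k = k, method = "Gibbs",
                 control = list(iter = 2000,        # Gibbs iterations
                                alpha = 50 / k,     # document-topic prior
                                delta = 0.1,        # topic-word prior (beta in the text)
                                seed = 2016))

doc_topics <- posterior(lda_model)$topics   # per-document topic distribution
terms(lda_model, 10)                        # top 10 terms for each topic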
Table 1. A few selected topics generated from Wikipedia topic distribution

Topic 2     Topic 13    Topic 22      Topic 31     Topic 46
athlete     album       universal     movie        hurricane
olympia     song        college       categories   major
field       music       school        fiction      season
summer      record      categories    solstice     minor
track       band        new           alien        key
men         release     economic      star         storm
metre       single      institut      sun          tropic
image       rock        educate       direct       chord
women       singer      science       drama        verse
medal       pop         work          southern     end

Table 1 shows five selected topics after the model is trained, with the top ten terms listed for
each topic. With LDA training, the terms in the same topic tend to be similar; formally speaking,
they are highly associated. For example, topic 13 is about music, topic 22 is about education and
topic 46 is about weather. This topic distribution provides a way to search for topics and to
explore between topics in order to find the document the user is looking for.

After the model is built, the Jensen-Shannon divergence is applied to calculate the similarity
between the documents' topic distributions. By sorting the similarities of one document against
every other document, a topic recommender system can be implemented.

3.1.4. Results

Here is an example from the experiment, the article Light. The original article is shown in the
left part of Figure 2 and can also be accessed in the Simple English Wikipedia online (the data
for this experiment is retrieved from a backup version of 11/1/2016). After the model is trained,
we obtain a topic distribution for each article. The right part of Figure 2 is the bar plot of
the topic distribution of the article Light. Among the 50 topics, we can easily find three topics
with obviously high probabilities: topic 47, topic 45 and topic 16. Table 2 shows the five topics
with the highest probabilities. These probabilities indicate how tightly this article is
associated with each topic. Table 3 shows the terms in these 5 topics, with 10 terms each.

Figure 2. Topics Distribution of Article Light
Table 2. Top 5 Topic Probabilities of Article Light
Topics Topic 47 Topic 45 Topic 16 Topic 24 Topic 30
Probabilities 0.05692600 0.45920304 0.28462998 0.01518027 0.01328273
Table 3. Top 5 Topic Terms of Article Light
Topic 47 Topic 45 Topic 16 Topic 24 Topic 30
light use color computable cleaner
beamline one style equation people
beam also background fluid use
radiated can hex categories clean
don people rgb image make
synchrotron mania magenta program chemical
wavelength call ffffff protocol thing
physical time fuchsia mathematical paint
station two red function made
carlo like pink design put
As in this example, every article in the collection is represented as a vector of probabilities
over the 50 topics. This is the core data of our model, on which all sorts of applications can be
built. Here is an example of finding the most related article to the article Light. We use the
Jensen-Shannon divergence to calculate the distance between Light and every other article; the
shortest distance identifies the most related article. After calculating the distances, Table 4
shows the top 10 closest articles. Note that the Jensen-Shannon distance is properly the square
root of the values below, but to simplify the calculation and preserve precision we keep the
squared values, since this makes no difference when comparing which distance is closest.
Table 4. Top 10 Shortest Distance of Article Light

Articles    Article 855  Article 820  Article 837  Article 299  Article 911
Distance²   0.0000000    0.2271741    0.3404084    0.3881467    0.4583745
Articles    Article 341  Article 287  Article 328  Article 544  Article 606
Distance²   0.4671845    0.4728383    0.4802241    0.4803357    0.4874499
As shown in Table 4, these 10 articles are the closest to the article Light. The distance of
Article 855 is 0 because it is the article Light itself. So Article 820 is the closest to Light,
meaning their contents are the most related. Moreover, what we get is a sorted list of distances,
so the method also works if we need more than one related article. This method is based on the
probabilities of every meaningful word in the model, so its accuracy is much better than matching
only keywords or titles, which is the approach widely used in many document management systems.
Lastly, if we look up the title of Article 820, it is Beamline. Based on this dataset, it is a
convincing answer for the most related article to Light.
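As a sketch of this recommendation step, the code below reuses the js_div function from the Section 2.4 sketch and the doc_topics matrix from the training sketch to rank every article by its (squared, square-root-free) Jensen-Shannon divergence from a query article; the function name and the use of row 855 for Light are illustrative assumptions based on the run reported above.

# rank all articles by squared Jensen-Shannon distance from a query article
recommend <- function(doc_topics, query, n = 10) {
  dists <- apply(doc_topics, 1, function(row) js_div(doc_topics[query, ], row))
  sort(dists)[1:n]      # smallest divergences first; the query itself has distance 0
}

# ten articles whose topic distributions are closest to article 855 (Light in this run)
recommend(doc_topics, query = 855, n = 10)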
3.2. Twitter Data Analysis
3.2.1. Experiment Overview
Twitter is an online social networking service that enables users to send and read short 140-
character messages called "tweets". Currently, Twitter has more than 332 million active users
posting 340 million tweets a day, and it has a big impact on everyday life. Twitter mining is not
only a big research task in computer science and statistics, but also a key factor in social and
business research. In this experiment, by applying LDA and topic modelling, a deep analysis of
Twitter users is proposed. A detailed model of Twitter users' personalities and preferences will
be inferred.
3.2.2. Data Pre-processing
Generally, a tweet from any Twitter user is public and free for everyone to read. Registering as
a Twitter application developer allows access to the tweets and to a set of APIs for manipulating
the data [8]. The first step is to collect tweets from the users. Using these APIs and the R
package twitteR, a sample of 10,000 valid users is gathered into a data frame (a sketch of the
collection step follows the list below). To improve data quality, a standard is set up for a
Twitter user to count as a "valid" user:

• The user profile is unprotected, which means the user's information and tweets are public to
everyone. If the user sets the profile to protected, his or her information cannot be gathered by
developers.

• The user has at least 100 tweets. According to a statistical study from Twitter in January
2012, the average length of a tweet is 67.9 characters. Therefore, before pre-processing, there
are roughly 6,790 characters or more per user in the sample. Fewer than 100 tweets per user would
lower the calculation quality.

• The user must use English as the major language of the tweets. The odd non-English word will
not affect the model, but a number of entirely non-English users would lengthen the pre-
processing time, add confusion to the topics, or even corrupt the result.
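Under these criteria, the collection step can be sketched with the twitteR package as follows; the credential placeholders and the screen name are assumptions, and the filtering of "valid" users is only indicated in outline rather than reproduced exactly.

library(twitteR)

# placeholder credentials from a registered Twitter application (assumptions)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")

# fetch recent tweets for one screen name (the API caps the history per user)
get_user_tweets <- function(screen_name, n = 3200) {
  tweets <- userTimeline(screen_name, n = n, includeRts = FALSE)
  sapply(tweets, function(t) t$getText())
}

# keep only users with at least 100 tweets, storing one row per user
raw <- get_user_tweets("tim_cook")
if (length(raw) >= 100) {
  user_docs <- data.frame(user = "tim_cook",
                          text = paste(raw, collapse = " "),
                          stringsAsFactors = FALSE)
}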
The text cleaning process is basically the same as in the Wikipedia experiment: tokenization,
stop word removal and word stemming are all required. However, since most tweets are written in
informal, spoken-style language, a few extra details need attention (a sketch of these steps
follows the list):

• Some tweets contain URLs, hashtags marking a topic, or @ mentions of other users. In the text
cleaning process, these need dedicated functions to remove or parse.

• Some Internet terms, such as "LOL", "OMG" and "BTW", are abbreviations of phrases. Those terms
can be treated as stop words and deleted.

• Some words are written as shorthand, such as "ppl" (people), "thx" (thanks) and "fab"
(fabulous). Those words need to be stemmed back to their original form.
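A minimal sketch of these tweet-specific steps is shown below, to be run before the usual tm pipeline; the slang and shorthand lists are small illustrative examples, not the full lists used in the experiment.

# extra cleaning for tweet text; word lists are illustrative assumptions
clean_tweet <- function(x) {
  x <- gsub("http\\S+", " ", x)              # drop URLs
  x <- gsub("@\\w+", " ", x)                 # drop @mentions
  x <- gsub("#", " ", x)                     # keep hashtag words, drop the # sign
  x <- gsub("\\b(lol|omg|btw)\\b", " ", x, ignore.case = TRUE)  # slang treated as stop words
  shorthand <- c("ppl" = "people", "thx" = "thanks", "fab" = "fabulous")
  for (s in names(shorthand)) {
    x <- gsub(paste0("\\b", s, "\\b"), shorthand[[s]], x, ignore.case = TRUE)
  }
  x
}

clean_tweet("Thx @friend, fab photo! LOL http://t.co/abc #coffee")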
3.2.3. Model Training
The training process is also similar to the Wikipedia experiment. However, Twitter data consists
of natural daily language, which has a narrower topic range than Wikipedia; the number of topics
is smaller and the boundaries between them are less clear. Nevertheless, the topic model still
performs well, and the coverage distribution can easily illustrate a user's personal interests.
Here is an experiment on a sample of 100 Twitter users with more than 100 tweets each, returned
after 2,000 iterations of Gibbs sampling with 30 topics.
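A hedged sketch of the user-level training is shown below: each user's cleaned tweets are concatenated into one document and the same topicmodels call is reused with 30 topics. The object names continue from the earlier sketches (user_docs, clean_tweet) and are assumptions, not the exact code of the experiment.

library(tm)
library(topicmodels)

# one document per user: all of a user's cleaned tweets pasted together
tweet_corpus <- VCorpus(VectorSource(clean_tweet(user_docs$text)))
tweet_corpus <- tm_map(tweet_corpus, content_transformer(tolower))
tweet_corpus <- tm_map(tweet_corpus, removePunctuation)
tweet_corpus <- tm_map(tweet_corpus, removeWords, stopwords("english"))
tweet_corpus <- tm_map(tweet_corpus, stemDocument)

tweet_dtm <- DocumentTermMatrix(tweet_corpus)
user_lda  <- LDA(tweet_dtm, k = 30, method = "Gibbs",
                 control = list(iter = 2000, seed = 2016))

user_topics <- posterior(user_lda)$topics   # per-user topic distribution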
Table 5. A few selected topics generated from Twitter topic distribution

Topic 1     Topic 5      Topic 9        Topic 17    Topic 28
halifax     check        stream         follow      weight
nova        reward       live           win         loss
scotia      one          league         enter       diet
man         kangaroo     communities    canada      news
refuge      new          check          retweet     lose
media       facebook     ea             card        natural
say         get          design         sunday      tip
woman       post         weekend        away        plan
central     photo        chat           chance      health
student     coffee       gold           donate      techno

Table 5 shows five selected topics after the Twitter topic model is trained. Similar to the
previous experiment, the LDA model performs well at dividing topics. But, as expected, Twitter
data is limited by everyday language, which leads to less clear boundaries than the Wikipedia
topics; for example, topic 5 is hard to label with a particular category. Nevertheless, the
Twitter application is successful at building topic models over users, which will benefit
statistical analysis and could have a big impact on social and business research.

3.2.4. Results

This experiment has a similar structure to the Wikipedia one. Instead of individual articles, we
treat all of a user's tweets as one article. With the topic model, we can again calculate each
user's distribution over the topics. Here is an example using Tim Cook's Twitter. As shown in
Figure 3, the left part is the Twitter user and the right part is the bar plot of the topic
distribution. This distribution represents which topics the user talks about most and is most
interested in. Table 6 shows the top 5 topics Tim Cook talks most about. By applying the
Jensen-Shannon divergence to calculate distances between users, we can find the people who talk
about the most similar topics or who even have similar personalities.

Figure 3. Topics Distribution of Twitter of Tim Cook
Table 6. Top 5 Topic Terms of Tim Cook’s Tweets
Topic 19 Topic 12 Topic 20 Topic 15 Topic 22
day new apple custom summer
today great iphone service student
thank canada trump expect job
will join deal job american
great congrat say product wed
time congratulations app donet still
get communities product photo campus
ea forward vs price fair
can proud court easier act
new event camera facebook hall
4. CONCLUSIONS AND FUTURE WORK
In this paper, an introduction to text mining and the LDA topic model is presented. We proposed
two experiments, which built topic models on Wikipedia articles and on Twitter users' tweets. A
brief description of each experiment, including the overview, pre-processing and model training,
is given and analysed.
With this data and model foundation, a number of future works can be done for further research
and experimentation.
• Due to limited computation power, this research is based on a relatively small sample at the
time of writing. However, the results are quite convincing even with this small size. Applying
the method to a larger dataset will more likely achieve better results.
• An application on topic modelling to manage, search and explore offline Wikipedia articles
could be implemented.
• A full study of Twitter users' interests could be carried out. Furthermore, this application
could be a useful tool for social and business research.
• In the Twitter application of topic modelling, we ignore the pictures users posted. Could
combining image processing with the topic model provide better performance?
REFERENCES
[1] Martin Ponweiser (2012) Latent Dirichlet Allocation in R, Vienna University of Economics and
Business.
[2] Bettina Grün, Kurt Hornik (2011) “topicmodels: An R Package for Fitting Topic Models”,
Journal of Statistical Software, Vol. 40, No. 13.
[3] Qi Jing (2015) Searching for Economic Effects of User Specified Event Based on Topic
Modelling and Event Reference, Jodrey School of Computer Science, Acadia University.
[4] David M. Blei (2012) “Probabilistic Topic Models”, Communications of the ACM, Vol. 55, No. 4,
pp. 77-84.
[5] David M. Blei, John D. Lafferty (2006) “A Correlated Topic Model of Science”, Annals of
Applied Statistics, Vol. 1, No. 1, pp. 17-35.
[6] Jianhua Lin (1991) “Divergence Measures Based on the Shannon Entropy”, IEEE Transactions on
Information Theory, Vol. 37, No. 1, pp. 145-151.
[7] Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee, Khairullah Khan (2010) “A Review of
Machine Learning Algorithms for Text-Document Classification”, Journal of Advances in
Information Technology, Vol. 1, No. 1, pp. 4-20.
[8] Yanchang Zhao (2015) R and Data Mining, http://www.rdatamining.com.
AUTHORS
Zhou Tong is currently a master's student in computer science at Acadia University, Canada. His
research focuses on text mining.
Haiyi Zhang received his MS degree in 1990 from the Computer Science Department of the New
Jersey Institute of Technology, USA, and his Ph.D. in 1996 from Harbin Institute of Technology,
China. He was a post-doctoral researcher in the information department of ABO, Finland, in 2000.
His research interests are machine learning and data mining. He has more than 50 academic papers
published. Currently he is an associate professor at Acadia University, Canada.
