Text Mining
Independent Course Study
Under the guidance of
Prof. Ranganathan Chandraskaren
Aadish Chopra

Text Mining
Problem Statement: Classify tweets into categories such as Health and General Information
Data Set: Collected through web scraping
• The Twitter API is widely used, but it only lets you collect up to a few thousand tweets.
• Another limitation of the Twitter API is that you can't extract all the information that is shown on the website.
Tweets were collected only for health institutions. There are 790 health institutions (handles).

What does the Twitter website look like?
[Figure: screenshot of the Twitter website with the Twitter name, Twitter handle, and tweet labeled]

Web scraper
The web scraper was built in R using the RSelenium package.*
https://drive.google.com/drive/folders/0Bzq4aAyBiuFQdHd0Y2JobmhjY1k
RSelenium is built on the Selenium server: it takes control of a web browser and automates it. You then specify how to interact with the browser by naming the action, e.g. scrolling, pressing the End key, or clicking a specific button on the website.
*I did the web scraping in R. Earlier teams did it in Java and Python.
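The scroll-then-collect idea can be sketched as below. This is only an illustration, not the scraper used for this study: collect_tweets is a hypothetical helper, remDr is assumed to be an already opened RSelenium remote driver, and tweet_css is a placeholder CSS selector that depends on the page being scraped.

```r
# Sketch only: scroll a page by pressing the End key, then collect text.
# remDr is assumed to be an open RSelenium remoteDriver, for example
#   remDr <- RSelenium::rsDriver(browser = "firefox")$client
# tweet_css is a hypothetical CSS selector for the tweet text elements.
collect_tweets <- function(remDr, url, tweet_css, n_scrolls = 5, pause = 2) {
  remDr$navigate(url)
  body <- remDr$findElement(using = "css selector", value = "body")
  for (i in seq_len(n_scrolls)) {
    body$sendKeysToElement(list(key = "end"))  # jump to the bottom of the page
    Sys.sleep(pause)                           # give new tweets time to load
  }
  elems <- remDr$findElements(using = "css selector", value = tweet_css)
  # getElementText() returns a list holding one string per element
  vapply(elems, function(e) e$getElementText()[[1]], character(1))
}
```

The number of scrolls and the pause are assumptions to tune: more scrolls load more tweets, and a longer pause gives a slow page time to respond.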

What does the collected data look like?
https://drive.google.com/drive/folders/0B-x4pbFFhIICWmdvWnBwX1VwZzA

Text Cleaning
Example Tweet :
Don't forget to register for the American Heart Associationâ€™s HeartsaverÂ® First Aid
Training offered by the...http://fb.me/6PjnRWaMAÂ
Steps taken to clean a tweet
• Tweets can be cleaned using the "tm" package in R. The functions are as follows:
1. removeNumbers
2. removePunctuation
3. stopwords
4. removeWords
5. stripWhitespace
6. stemDocument
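The functions above can be chained with tm_map. This is a minimal sketch, assuming the tm and SnowballC packages are installed; clean_with_tm is a hypothetical helper name and the lower-casing step is an addition, so it is not necessarily the exact pipeline used here.

```r
# Sketch: apply the tm cleaning functions listed above to a character vector.
# clean_with_tm is a hypothetical helper name, not part of tm.
clean_with_tm <- function(tweets) {
  corpus <- tm::VCorpus(tm::VectorSource(tweets))
  # Lower-case first so stopword matching ignores case (an added step).
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  corpus <- tm::tm_map(corpus, tm::removeNumbers)
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  corpus <- tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)
  corpus <- tm::tm_map(corpus, tm::stemDocument)  # stemming needs SnowballC
  # Return the cleaned text as a plain character vector.
  unname(vapply(corpus, function(doc) paste(as.character(doc), collapse = " "),
                character(1)))
}
```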

Text Cleaning
If you want more control over the cleaning of the data, regular expressions can be used instead:
1. Punctuation replacement: LaSalletwe<-gsub(pattern = "\\W",replacement = " ",LaSalletwe)
2. Digits replacement: LaSalletwe<-gsub(pattern="\\d",replacement = "",LaSalletwe)
3. To remove a single letter: LaSalletwe<-gsub(pattern="\\b[A-Za-z]\\b",replacement="",LaSalletwe)
4. The list of stopwords* can be modified by editing the English.dat file. Stopword lists can be created for any language in R. This can be helpful, for instance, in cities like Chicago where people also speak languages such as Spanish.
*The English.dat file was changed retrospectively.
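The three regex steps can be combined into one base R function. This is a minimal sketch: clean_tweet is a hypothetical helper name, the final whitespace collapse is an added step, and the input sentence is made up for illustration.

```r
# Sketch: regex cleaning with base R only. clean_tweet is a hypothetical helper.
clean_tweet <- function(x) {
  x <- gsub("\\W", " ", x, perl = TRUE)            # non-word characters -> space
  x <- gsub("\\d", "", x, perl = TRUE)             # remove digits
  x <- gsub("\\b[A-Za-z]\\b", "", x, perl = TRUE)  # remove single letters
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse repeated spaces (added step)
  trimws(x)
}

clean_tweet("Don't forget to register for 2 First Aid classes!")
# "Don forget to register for First Aid classes"
```

The single-letter step matters because replacing the apostrophe in "Don't" leaves a stray "t" behind.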

Text Cleaning
The English.dat file can be changed for stopwords.

Topic Modelling
Topic Modelling:
• Classify documents using topic models.
• Get themes from documents
• Get topic distributions from corpus
• Get word distributions within topics
A popular technique, Latent Dirichlet Allocation (LDA), was used to find the topic distributions within documents.
Variants of LDA such as supervised LDA (sLDA), and different inference methods such as Gibbs sampling, are also available; which one to use depends on the goals.

Topic Modelling
α is the parameter of the Dirichlet prior on the per-document topic distributions,
β is the parameter of the Dirichlet prior on the per-topic word distribution,
θ is the topic distribution for document m,
ψ is the word distribution for topic k,
z is the topic for the n-th word in document m,
w is the specific word,
N is the number of words in a document, and
M is the number of documents.
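With these symbols, the joint distribution of the standard LDA model can be written as below. K, the number of topics, is not named on the slide and is an assumption of this sketch; the subscripts m and n index documents and word positions.

```latex
P(W, Z, \theta, \psi \mid \alpha, \beta)
  = \prod_{k=1}^{K} P(\psi_k \mid \beta)
    \prod_{m=1}^{M} P(\theta_m \mid \alpha)
    \prod_{n=1}^{N} P(z_{m,n} \mid \theta_m)\, P(w_{m,n} \mid \psi_{z_{m,n}})
```

Each document m draws its topic distribution from the prior α; each word position n then draws a topic z from that distribution and a word w from the word distribution of that topic.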

Topic Modelling
LDA gives the themes of a document or of the corpus. It can be applied
at several levels:
1. On a single tweet
2. On a single handle
3. On the entire corpus
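Whichever level is chosen, the input is a set of documents (one tweet, all tweets of one handle, or the whole corpus). A minimal sketch of fitting LDA with Gibbs sampling, assuming the tm and topicmodels packages are installed; fit_lda is a hypothetical helper and k = 2 is an arbitrary choice, not the setting used in this study.

```r
# Sketch: fit LDA on a set of cleaned tweets with tm + topicmodels.
# fit_lda is a hypothetical helper; k (number of topics) is an assumed value.
fit_lda <- function(tweets, k = 2, seed = 1) {
  corpus <- tm::VCorpus(tm::VectorSource(tweets))
  dtm <- tm::DocumentTermMatrix(corpus)
  # LDA needs every document to contain at least one term.
  dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]
  fit <- topicmodels::LDA(dtm, k = k, method = "Gibbs",
                          control = list(seed = seed))
  list(
    top_words  = topicmodels::terms(fit, 5),         # top 5 words per topic
    doc_topics = topicmodels::posterior(fit)$topics  # topic distribution per document
  )
}
```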

Word Frequency - On a single handle
We illustrate with the example of LaSalleCoHealth.
Total number of observations: 691
Total number of words after data cleaning: 1126
[Figure: the 16 words that occur most often in their tweets]

Word Frequency - On the corpus
Total number of observations: 16449
Total number of words after data cleaning: 13665
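Word frequencies like these can be computed from the cleaned text with base R. This is a minimal sketch: top_words is a hypothetical helper and the sample tweets are made up for illustration.

```r
# Sketch: count word frequencies in cleaned text with base R.
# top_words is a hypothetical helper; sample_tweets is made-up sample text.
top_words <- function(cleaned, n = 16) {
  words <- unlist(strsplit(tolower(cleaned), "\\s+"))
  words <- words[nzchar(words)]                  # drop empty strings
  freq <- sort(table(words), decreasing = TRUE)  # most frequent first
  head(freq, n)
}

sample_tweets <- c("flu shot clinic today", "free flu shot", "health fair today")
top_words(sample_tweets, n = 3)
```

The same function works for a single handle or the whole corpus; only the input vector changes.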
