Text Mining
Independent Course Study
Under the guidance of
Prof. Ranganathan Chandraskaren
Aadish Chopra
Text Mining
Problem statement: Classify tweets into categories such as Health and General Information.
Data set: Collected through web scraping.
• The Twitter API is widely used, but it limits collection to a few thousand tweets.
• Another limitation of the Twitter API is that it cannot extract all the information shown on the website.
Tweets were collected only for health institutions. There are 790 health institutions (handles).
What the Twitter website looks like
[Figure: screenshot of a Twitter page with the Twitter name, Twitter handle, and tweet labelled.]
Web scraper
The web scraper was built in R using the RSelenium package.*
https://drive.google.com/drive/folders/0Bzq4aAyBiuFQdHd0Y2JobmhjY1k
RSelenium is based on Selenium Server and automates the web browser by taking control of it. You can then specify how to interact with the browser by naming the action, e.g. scrolling, pressing the End key, or clicking a specific button on the website.
*I did the web scraping in R. Earlier teams did it in Java and Python.
What the collected data looks like
https://drive.google.com/drive/folders/0B-x4pbFFhIICWmdvWnBwX1VwZzA
Text Cleaning
Example tweet:
Don't forget to register for the American Heart Association’s Heartsaver® First Aid Training offered by the...http://fb.me/6PjnRWaMA
Steps taken to clean the tweet
• Tweets can be cleaned using the “tm” package in R. The relevant functions are:
1. removeNumbers
2. removePunctuation
3. stopwords
4. removeWords
5. stripWhitespace
6. stemDocument
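The tm functions listed above also accept plain character vectors, so the cleaning steps can be sketched without building a corpus. This is only an illustration, not the pipeline used in the study: the tweet is a simplified ASCII copy of the example, `clean_tweet` is a hypothetical helper name, the URL-stripping regex and the order of the steps are assumptions, and stemming is applied only if the SnowballC package (which `stemDocument` relies on) is installed.

```r
library(tm)  # assumes the tm package is installed

# Simplified ASCII copy of the example tweet (trademark sign dropped).
tweet <- paste("Don't forget to register for the American Heart Association's",
               "Heartsaver First Aid Training offered by the...http://fb.me/6PjnRWaMA")

# Hypothetical helper: applies the listed tm functions to a character vector.
clean_tweet <- function(x) {
  x <- gsub("http\\S+", " ", x)              # assumption: strip URLs first (not a tm function)
  x <- tolower(x)                            # the English stop-word list is lower-case
  x <- removeWords(x, stopwords("english"))  # done before punctuation removal so "don't" matches
  x <- removeNumbers(x)
  x <- removePunctuation(x)
  x <- stripWhitespace(x)
  if (requireNamespace("SnowballC", quietly = TRUE)) {
    x <- stemDocument(x)                     # stemming needs SnowballC
  }
  trimws(x)
}

cleaned <- clean_tweet(tweet)
print(cleaned)
```

Removing stop words before punctuation is a design choice here: contractions such as "don't" appear in the stop-word list with their apostrophe, so they no longer match once punctuation has been stripped.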
Text Cleaning
If more control over the cleaning is needed, regular expressions can be used instead:
1. Punctuation replacement: LaSalletwe <- gsub(pattern = "\\W", replacement = " ", LaSalletwe)
2. Digit removal: LaSalletwe <- gsub(pattern = "\\d", replacement = "", LaSalletwe)
3. Single-letter removal: LaSalletwe <- gsub(pattern = "\\b[A-Za-z]\\b", replacement = "", LaSalletwe)
4. The list of stopwords* can be modified by opening and editing the English.dat file. Stopword lists can be created for any language in R, which can be helpful, for instance, in cities like Chicago where people also speak languages such as Spanish.
*The English.dat file was changed retrospectively.
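The three regex steps above can be tried in base R on a small vector. The sample strings are hypothetical stand-ins for the LaSalletwe data, and the final whitespace clean-up is an added step, not part of the slide.

```r
# Hypothetical sample input standing in for the LaSalletwe vector.
LaSalletwe <- c("Flu shots available 9am-5pm at our clinic!",
                "Walk-in hours: M-F, 8 to 4 p m")

# 1. Replace every non-word character (punctuation, symbols) with a space.
LaSalletwe <- gsub(pattern = "\\W", replacement = " ", LaSalletwe)
# 2. Remove digits.
LaSalletwe <- gsub(pattern = "\\d", replacement = "", LaSalletwe)
# 3. Remove letters that stand alone as a word.
LaSalletwe <- gsub(pattern = "\\b[A-Za-z]\\b", replacement = "", LaSalletwe)
# Added step (assumption): collapse the leftover runs of spaces and trim the ends.
LaSalletwe <- trimws(gsub(pattern = "\\s+", replacement = " ", LaSalletwe))

print(LaSalletwe)
```

Note that the backslashes must be doubled inside R string literals, and that the character class `[A-Za-z]` avoids the punctuation characters that the range `[A-z]` would also match.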
Text Cleaning
[Figure: the English.dat file, which can be changed to modify the stopwords.]
Topic Modelling
Topic modelling can be used to:
• Classify documents using topic models
• Get themes from documents
• Get topic distributions from the corpus
• Get word distributions within topics
A popular technique, Latent Dirichlet Allocation (LDA), was used to find the topic distributions within documents.
Variants of LDA such as supervised LDA (sLDA) also exist, and the model can be fitted with different inference methods such as Gibbs sampling; which one to use depends on the goals.
Topic Modelling
• α is the parameter of the Dirichlet prior on the per-document topic distributions
• β is the parameter of the Dirichlet prior on the per-topic word distribution
• θ is the topic distribution for document m
• ψ is the word distribution for topic k
• z is the topic for the n-th word in document m
• w is the specific word
• N is the number of words in a document
• M is the number of documents
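Using the symbols defined above, the standard LDA generative process can be written out as follows. The number of topics K and the subscripts on θ, ψ, z and w are notational assumptions added here for clarity; they do not appear on the slide.

```latex
\begin{align*}
\theta_m &\sim \operatorname{Dirichlet}(\alpha), & m &= 1,\dots,M \\
\psi_k   &\sim \operatorname{Dirichlet}(\beta),  & k &= 1,\dots,K \\
z_{m,n}  &\sim \operatorname{Categorical}(\theta_m), & n &= 1,\dots,N \\
w_{m,n}  &\sim \operatorname{Categorical}(\psi_{z_{m,n}})
\end{align*}
```

In words: each document draws its own topic mixture, each topic draws its own word distribution, and every word is generated by first picking a topic from the document's mixture and then picking a word from that topic.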
Topic Modelling
LDA gives the themes of a document or of the corpus. It can be applied at several levels:
1. On a single tweet
2. On a single handle
3. On the entire corpus
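One possible reading of these three levels is that they differ in what counts as a "document" handed to LDA. The base R sketch below builds the document sets under that assumption; the data frame, its column names, and the handles are hypothetical, and no LDA model is fitted here.

```r
# Hypothetical scraped data: one row per tweet.
tweets <- data.frame(
  handle = c("HandleA", "HandleA", "HandleB"),
  text   = c("flu shot clinic", "free health screening", "heart health walk"),
  stringsAsFactors = FALSE
)

# 1. Single tweet: each tweet is its own document.
docs_per_tweet <- tweets$text

# 2. Single handle: all tweets of one handle are pasted into one document.
docs_per_handle <- tapply(tweets$text, tweets$handle, paste, collapse = " ")

# 3. Entire corpus: the tweets of all handles together form the collection.
docs_corpus <- tweets$text

print(docs_per_handle)
```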
Word Frequency - On a single handle
The handle LaSalleCoHealth is used as an example.
Total number of observations: 691
Total number of words after data cleaning: 1126
[Figure: the 16 words that occur most often in their tweets.]
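A word-frequency count of this kind can be sketched in base R with `strsplit` and `table`. The cleaned tweets below are hypothetical, not the LaSalleCoHealth data, and the top-n cut-off is reduced to fit the tiny sample.

```r
# Hypothetical cleaned tweets for one handle.
cleaned <- c("flu shot clinic", "flu shot today", "flu clinic")

# Split each tweet on whitespace and pool all words.
words <- unlist(strsplit(cleaned, "\\s+"))

# Count occurrences of each word and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Keep the most frequent words (the slide uses the top 16; 3 is used here).
top_words <- head(freq, 3)
print(top_words)
```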
Word Frequency - On the corpus
Total number of observations: 16449
Total number of words after data cleaning: 13665
