HANDS ON:
TEXT MINING WITH R
Jahnab Kumar Deka
Introduction
• Text mining: learning from collections of text documents such as books, newspapers, and emails.
Important Terms:
• Tokenization
• Tagging (Noun/Verb/…)
• Chunking (Noun Phrase)
• Stemming (-ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots (installed from Bioconductor).
• library(wordcloud) # Word clouds (wordcloud() is used later in this deck).
Corpus
• Collection of text
• Each corpus will have separate articles, stories, volumes,
each treated as a separate entity or record.
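• Any file format can be converted to a text file for the corpus, e.g.:
• PDF to text file
• system('for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk "$f"; done')
• Word document to text file
• system('for f in *.doc; do antiword "$f" > "${f%.doc}.txt"; done')
** antiword writes to standard output, so its output is redirected to a .txt file
** quoting "$f" keeps file names that contain spaces intact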
Corpus
• Consider the folder corpus/txt
• List some of the file names
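The slide's code for this step did not survive, so here is a minimal sketch of what it might look like. The variable name cname matches the later slides; the folder location relative to the working directory is an assumption.

```r
# Assumed location: a folder "corpus/txt" under the current working directory.
cname <- file.path(".", "corpus", "txt")

# dir() lists the files in the folder; it returns character(0)
# if the folder is empty or does not exist.
files <- dir(cname)

length(files)  # number of documents found
head(files)    # the first few file names
```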
Loading Corpus
• Loading the corpus
** DirSource() creates a source object, which is passed to Corpus() to load the documents.
• For PDF documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
** the xpdf application needs to be installed when readPDF() uses the xpdf engine
• For Word documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s")))
** readDOC() passes these options to antiword, which must be installed
** -r requests that removed text be included in the output
** -s requests that text hidden by Word be included
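For plain text files no special reader is needed. The sketch below is one way to try the loading step without any real data: it writes two small files into a temporary folder (a stand-in for corpus/txt, not part of the original deck) and loads them. It uses VCorpus() rather than Corpus() as an assumption, so that every document is held in memory as a full text document.

```r
library(tm)

# Stand-in for corpus/txt: a temporary folder with two tiny text files.
cname <- file.path(tempdir(), "corpus_txt")
dir.create(cname, showWarnings = FALSE)
writeLines("Text mining with R is fun.", file.path(cname, "doc1.txt"))
writeLines("R has many packages for mining data.", file.path(cname, "doc2.txt"))

# DirSource() describes where the files live; VCorpus() reads them in.
docs <- VCorpus(DirSource(cname))

length(docs)             # one document per file
as.character(docs[[1]])  # text of the first document
```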
Exploration of Corpus
• inspect()
• Preparing the corpus
• Transformation types: getTransformations() lists the predefined ones
• tm_map() is used to apply one of these transformations
• Other transformations can be implemented using R functions wrapped within content_transformer()
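A short sketch of the exploration step, assuming a toy two-document corpus built with VectorSource() in place of the real files (the sample sentences are made up for illustration):

```r
library(tm)

# Toy corpus standing in for the documents loaded from disk.
docs <- VCorpus(VectorSource(c("Data mining / text mining",
                               "Email: user@example.org | 2 files")))

# Summary of the corpus and of each document.
inspect(docs)

# Text of a single document.
as.character(docs[[1]])

# Names of the predefined transformations that tm_map() can apply.
available <- getTransformations()
available
```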
Transformation Example
• replace “/”, “@” and “|” with a space
• Alternate method
• Conversion to lower case
• Remove Numbers
• Remove Punctuation
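The code for these transformations is missing from the transcript. A minimal sketch follows, assuming a toy corpus; the helper name toSpace is an illustrative choice, not necessarily the one used on the slide.

```r
library(tm)

# Toy corpus (made-up text) containing the characters to be cleaned.
docs <- VCorpus(VectorSource(c("Data/Text mining @ 2 PM",
                               "Stats|Graphs, in 2015!")))

# Custom transformation: replace a pattern with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Alternate method, one regular expression for all three characters:
# docs <- tm_map(docs, toSpace, "/|@|\\|")

# Base R functions must be wrapped in content_transformer().
docs <- tm_map(docs, content_transformer(tolower))

# Predefined tm transformations can be passed directly.
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

result <- unname(sapply(docs, as.character))
result
```

The removed characters leave extra spaces behind; the whitespace-stripping step on the next slide cleans those up.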
Contd...
• Remove English Stop Words
• Remove Own Stop Words
• Strip Whitespace
• Specific Transformations
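These four steps can be sketched as below. The sample sentences, the custom stop words, and the helper name replaceText are all assumptions chosen for illustration.

```r
library(tm)

# Toy corpus (made-up text).
docs <- VCorpus(VectorSource(c(
  "the department of statistics   and the email list",
  "this is   a report on the harbin institute technology")))

# Remove the standard English stop words shipped with tm.
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove your own stop words (hypothetical choices).
docs <- tm_map(docs, removeWords, c("department", "email"))

# Collapse runs of whitespace into a single space.
docs <- tm_map(docs, stripWhitespace)

# Specific transformation: replace one phrase with another.
replaceText <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replaceText, "harbin institute technology", "HIT")

result <- unname(sapply(docs, as.character))
trimws(result)
```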
Contd...
• Stemming
• Creating a Document Term Matrix
A matrix with documents as the rows, terms as the columns, and the frequency count of each term as the cells.
• Term frequency
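A compact sketch of stemming, building the document term matrix, and computing term frequencies, assuming a toy corpus. The variable names docs, dtm, and freq follow the later slides.

```r
library(tm)
library(SnowballC)  # stemming backend used by stemDocument()

# Toy corpus (made-up text) with several word forms.
docs <- VCorpus(VectorSource(c("mining mines mined text",
                               "text texts analysis")))

# Stemming: reduce each word to its stem.
docs <- tm_map(docs, stemDocument)

# Document term matrix: one row per document, one column per term.
dtm <- DocumentTermMatrix(docs)
dim(dtm)

# Term frequency: total count of each term over all documents.
freq <- colSums(as.matrix(dtm))
freq
```

Note that as.matrix() builds a dense matrix, which can use a lot of memory on a large corpus.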
Contd...
• Frequency order of item
• ord <- order(freq)
• Least Frequent item
• freq[head(ord)]
• Most frequent item
• freq[tail(ord)]
• Document Term matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")
Contd...
• Removing Sparse Terms
• dtms <- removeSparseTerms(dtm, 0.1) # 0.1 is the sparse factor
• The resulting matrix contains only terms whose sparsity (the fraction of documents in which the term does not appear) is below the sparse factor.
• Frequent items and association
** findFreqTerms() with lowfreq = 1000 lists the terms that occur at least 1000 times
• Association with a word, subject to a correlation limit (findAssocs())
• # association of “data” with other words
• # if two words always appear together, the correlation is 1.0
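The calls for this slide might look like the sketch below. The toy corpus is made up, and the thresholds (a sparse factor of 0.5, lowfreq = 2, corlimit = 0.6) are assumptions scaled down to fit three tiny documents; the slide uses 0.1 and 1000 on a real corpus.

```r
library(tm)

# Toy corpus (made-up text).
docs <- VCorpus(VectorSource(c("data mining with data tools",
                               "data mining and text mining",
                               "text analysis report")))
dtm <- DocumentTermMatrix(docs)

# Keep only terms that are missing from fewer than 50% of the documents.
dtms <- removeSparseTerms(dtm, 0.5)
Terms(dtms)

# Terms that occur at least 2 times in total.
frequent <- findFreqTerms(dtm, lowfreq = 2)
frequent

# Terms whose correlation with "data" is at least 0.6.
assoc <- findAssocs(dtm, "data", corlimit = 0.6)
assoc
```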
Correlation
• Plot of 50 of the more frequent words (those occurring at least 100 times)
• With a minimum correlation of 0.5
• By default, the plot uses
• 20 random terms
• With a minimum correlation of 0.7
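The plotting code is not in the transcript. Based on the bullet points, the call on the slide was presumably close to plot(dtm, terms=findFreqTerms(dtm, lowfreq=100)[1:50], corThreshold=0.5); that reconstruction is an assumption. The sketch below uses a toy corpus with a lower frequency cut-off and skips the plot when Rgraphviz is not installed.

```r
library(tm)

# Toy corpus (made-up text).
docs <- VCorpus(VectorSource(c("data mining with data tools",
                               "data mining and text mining",
                               "text analysis report")))
dtm <- DocumentTermMatrix(docs)

# Terms to plot: those that occur at least 2 times
# (the slide uses at least 100 occurrences and keeps 50 terms).
terms <- findFreqTerms(dtm, lowfreq = 2)

# The plot method in tm needs Rgraphviz, so it is guarded here.
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  # Edges are drawn between terms whose correlation is at least 0.5.
  plot(dtm, terms = terms, corThreshold = 0.5)
}
```

The terms are passed explicitly because the default picks 20 random terms, which is more than this toy corpus contains.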
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # plot the words that occur at least 500 times in the corpus
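The plotting call itself is missing from the transcript. One plausible version is a ggplot2 bar chart, sketched here on a toy corpus; the threshold variable min_count is a hypothetical name and is set to 2 instead of the slide's 500.

```r
library(tm)
library(ggplot2)

# Toy corpus (made-up text).
docs <- VCorpus(VectorSource(c("data mining with data tools",
                               "data mining and text mining",
                               "text analysis report")))
dtm <- DocumentTermMatrix(docs)

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

# Keep only the words that reach the minimum count (500 on the slide).
min_count <- 2
wf_top <- subset(wf, freq >= min_count)

# Bar chart of word frequencies with tilted axis labels.
p <- ggplot(wf_top, aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```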
Word cloud
Size of Word & Frequency
• Limiting the number of words
• wordcloud(names(freq), freq, max.words=100)
• Limiting by minimum term frequency
• wordcloud(names(freq), freq, min.freq=100)
• Adding colour
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters
• Generate frequencies and percentages
Contd...
• Word Length Counts
** vertical line = Mean length of words
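The qdap code for these slides is not in the transcript. The sketch below reproduces the same steps with base R and ggplot2 so that it runs without qdap; that substitution is an assumption, and with qdap loaded, dist_tab(nchar(words)) should give a comparable frequency and percentage table.

```r
library(tm)
library(ggplot2)

# Toy corpus (made-up text).
docs <- VCorpus(VectorSource(c("data mining with data tools",
                               "data mining and text mining",
                               "text analysis report")))
dtm <- DocumentTermMatrix(docs)

# Column names are the terms; keep those shorter than 20 characters.
words <- colnames(as.matrix(dtm))
words <- words[nchar(words) < 20]

# Frequencies and percentages of word lengths.
lens <- nchar(words)
tab <- table(lens)
dist <- data.frame(length  = as.integer(names(tab)),
                   freq    = as.vector(tab),
                   percent = round(100 * as.vector(tab) / length(lens), 2))
dist

# Word length counts; the vertical line marks the mean word length.
p <- ggplot(data.frame(nletters = lens), aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(lens), colour = "green") +
  labs(x = "Number of letters", y = "Number of words")
print(p)
```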
Letter and Position Heatmap
