Text Analytics with Python and R
(w/ examples from Tobacco Control)
@BenHealey
The Process
Bright Idea → Gather → Clean → Standardise → De-dup and select → Look intensely (Frequencies, Classification)
http://scrapy.org
Spiders → Items → Pipelines (a minimal spider sketch follows below)
- R: readLines, XML / RCurl / scrapeR packages
- R: tm package (Factiva plugin), twitteR
- Python: Beautiful Soup
- Python: Pandas (eg, financial data)
http://blog.siliconstraits.vn/building-web-crawler-scrapy/
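Since the slide lists tools only, here is a minimal, hypothetical Scrapy spider to illustrate the Spiders → Items → Pipelines flow (recent Scrapy API; the spider name, start URL and CSS selectors are invented for illustration):

import scrapy

class PressReleaseSpider(scrapy.Spider):
    # Hypothetical example: name, start_urls and selectors are illustrative only
    name = "press_releases"
    start_urls = ["http://www.beehive.govt.nz/releases"]

    def parse(self, response):
        # Follow each release link found on the listing page
        for href in response.css("h3 a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_release)

    def parse_release(self, response):
        # Yield an item dict; an item pipeline can then clean, de-dup and store it
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "html_content": response.text,
        }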
(Image-only slides: Gather stage examples)
• Translating text to consistent form
– Scrapy returns unicode strings
– Māori → Maori
• SWAPSET =
[[ u"Ā", "A"], [ u"ā", "a"], [ u"ä", "a"]]
• translation_table =
dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
• cleaned_content =
html_content.translate(translation_table)
– Or…
• test = u'Māori' (you already have unicode)
• unidecode(test) (returns 'Maori')
• Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped HTML will be in Latin-1 (mismatched with UTF-8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole
• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or… use re (see the sketch below)
• Don’t forget the metadata!
– Define a common data structure early if you have multiple sources
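Putting the cleaning steps from the last two slides together, a minimal Python sketch (assumes the unidecode package; SWAPSET is from the slide, the rest is illustrative):

import re
from unidecode import unidecode

SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
translation_table = {ord(k): v for k, v in SWAPSET}

def clean_text(html_content):
    text = html_content.translate(translation_table)  # map known macrons/umlauts first
    text = unidecode(text)                            # catch anything left over
    return re.sub(r"\s+", " ", text).strip()          # collapse runs of whitespace

print(clean_text(u"  Māori   health  "))              # -> 'Maori health'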
Text Standardisation
• Stopwords
– "a, about, above, across, ... yourself, yourselves, you've, z”
• Stemmers
– "some sample stemmed words"  "some sampl stem word“
• Tokenisers (eg, for bigrams)
– BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
– tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
– ‘and said’, ‘and security’
Natural Language Toolkittm package
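On the Python side, the NLTK equivalents of these steps look roughly like this (a sketch; assumes the punkt and stopwords data have already been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.util import ngrams

# One-off setup: nltk.download('punkt'); nltk.download('stopwords')
tokens = nltk.word_tokenize("Some sample stemmed words about smoking".lower())
stops = set(stopwords.words("english"))
kept = [t for t in tokens if t.isalpha() and t not in stops]
stemmed = [PorterStemmer().stem(t) for t in kept]  # ['sampl', 'stem', 'word', 'smoke']
bigrams = list(ngrams(kept, 2))                    # eg ('sample', 'stemmed')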
Text Standardisation
libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
…
cleanCorpus = function(corpus) {
corpus.tmp = tm_map(corpus, tolower) # lower-case all text (newer tm versions need content_transformer(tolower))
corpus.tmp = tm_map(corpus.tmp, removePunctuation)
corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
return(corpus.tmp)
}
posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
Text Standardisation
• Using dictionaries for stem completion
politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)
# get word counts in decreasing order, put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))
politi.corpus_final = tm_map(politi.corpus_stemmed,
stemCompletion, dictionary=smalldict, type="first")
Deduplication
• Python sets
– shingles1 = set(get_shingles(record1['standardised_content']))
• Shingling and Jaccard similarity (see the sketch below)
– (a,rose,is,a,rose,is,a,rose)
– {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
• {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
– http://infolab.stanford.edu/~ullman/mmds/ch3.pdf → a free text
– http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
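A minimal Python sketch of shingling plus Jaccard similarity (the slide's get_shingles isn't shown, so this version is assumed; 4-word shingles reproduce the rose example above):

def get_shingles(tokens, k=4):
    # k-word shingles; a set drops the repeated shingles automatically
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    # |intersection| / |union|; near-duplicate documents score close to 1
    return len(a & b) / float(len(a | b)) if (a | b) else 0.0

shingles1 = get_shingles("a rose is a rose is a rose".split())
shingles2 = get_shingles("a rose is a rose".split())
print(jaccard(shingles1, shingles2))  # ~0.67, so these two are likely near-duplicates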
Frequency Analysis
• Document-Term Matrix
– politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed,
control = list(wordLengths=c(4,Inf)))
• Frequent and co-occurring terms
– findFreqTerms(politi.dtm, 5000)
[1] "2011" "also" "announc" "area" "around"
[6] "auckland" "better" "bill" "build" "busi"
– findAssocs(politi.dtm, "smoke", 0.5)
smoke tobacco quit smokefre smoker 2025 cigarett
1.00 0.74 0.68 0.62 0.62 0.58 0.57
(Image-only slide: Analysis stage)
Mentions of the 2025 goal
Mentions of the 2025 goal
Top 100 terms: Tariana Turia
Note: Documents from Aug 2011 – July 2012. (Wordcloud package)
Top 100 terms: Tony Ryall
Note: Documents from Aug 2011 – July 2012
• Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol’ SQL
– Native or package functions for length of strings, sna, etc.
• Unsupervised
– nltk.cluster
– tm, topicmodels, as.matrix(dtm) → kmeans, etc.
• Supervised
– First hurdle: the training set
– nltk.classify (see the sketch below)
– tm, e1071, others…
Classification
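For the supervised route, a toy nltk.classify sketch (the labels and training texts here are invented; in practice the hard part is building the hand-labelled training set mentioned above):

from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words presence features; string length, source, dates etc. could be added as extra keys
    return {word: True for word in text.lower().split()}

train = [
    (features("really keen to quit smoking this week"), "quit_intent"),
    (features("day three on the patch and feeling good"), "quit_intent"),
    (features("great weather for the long weekend"), "other"),
    (features("anyone watching the rugby tonight"), "other"),
]
classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("trying to quit again")))
classifier.show_most_informative_features(5)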
        2 posts or fewer    more than 750 posts
Users   846 (41.0%)         23 (1.1%)
Posts   1,157 (1.3%)        45,499 (50.1%)
Cohort: New users (posters) in Q1 2012
• LDA (topicmodels)
– New users
– Highly active users

New users:
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
good smoke just smoke feel
day time day quit day
thank week get can dont
well patch realli one like
will start think will still

Highly active users:
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
quit good day like feel
smoke one well day thing
can take great your just
will stay done now get
luck strong awesom get time
• LDA (topicmodels) (a rough Python analogue is sketched below)
– Highly active users (HAU)
– HAU1 (F, 38, PI)
– HAU2 (F, 33, NZE)
– HAU3 (M, 48, NZE)

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
quit good day like feel
smoke one well day thing
can take great your just
will stay done now get
luck strong awesom get time

Topic proportions per user:
HAU1: 18% 14% 40% 8% 20%
HAU2: 31% 21% 27% 6% 16%
HAU3: 16% 9% 21% 49% 5%
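The topic fits above use R's topicmodels package; as a rough Python analogue only (recent scikit-learn, with a toy corpus and assumed settings, not the code used in the talk):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["day three smoke free and feeling good",
        "the patch really helps with the cravings",
        "good luck with the quit, stay strong"]  # toy stand-in for forum posts

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(dtm)

print(lda.transform(dtm).round(2))  # document-topic proportions (cf. the % rows above)
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print("Topic", k + 1, top)      # top terms per topic (cf. the Topic 1-5 tables)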
Recap
• Your text will probably be messy
– Python, R-based tools reduce the pain
• Simple analyses can generate useful insight
• Combine with data of other types for context
– source, quantities, dates, network position, history
• May surface useful features for classification
Slides, Code: message2ben@gmail.com
Editor's Notes
  • #5: Gather stage.
  • #6: Gather stage.
  • #7: Clean stage
  • #8: Clean stage
  • #9: Clean stage
  • #10: Standardise stage
  • #11: Standardise stage
  • #12: Standardise stage. 0.99 is generous; a lower value would remove more terms. removeSparseTerms keeps only terms with a sparse factor of less than sparse, i.e. it drops terms that are empty (occur 0 times) in at least that proportion of documents. TermDocumentMatrix: terms along the side (rows), documents along the top (columns).
  • #13: Dedup and select stage
  • #14: Analysis stage
  • #15: Analysis stage
  • #16: Analysis stage
  • #17: Analysis stage
  • #18: Analysis stage
  • #19: Analysis stage
  • #20: Analysis stage. Dragonfly talk by Marcus Frean on Latent Dirichlet Allocation.
  • #21: Analysis stage (exploratory)
  • #22: Analysis stage (Exploratory)
  • #23: Analysis stage (Unsupervised classification)
  • #24: Analysis stage (Unsupervised classification)