Visualizing Words and Topics with Scattertext
Jason S. Kessler*
June 14, 2018
Code for all visualizations is available at:
https://github.com/JasonKessler/PyDataSeattle2018
$ pip3 install scattertext
@jasonkessler
*No, not that Jason Kessler
Lexicon speculation
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002. (ACL 2018 Test of Time Award Winner)
@jasonkessler
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002.
Lexicon mining ≈ lexicon speculation
@jasonkessler
One is made from positive reviews, the other from negative reviews (Mueller’s wordcloud)
@jasonkessler
Motivation
• What language can be used to better market a product?
• Words characteristic of effective (vs. ineffective) marketing messages
• Uncover the framing of political issues
• How do Republican and Democratic politicians talk differently about abortion?
• Cultural anthropology
• What topics or language are characteristic of groups of people?
• Psycholinguistics
• How language use is associated with personality and other personal characteristics
• Writing better headlines
Language and Demographics
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Selected terms: hobos, almond butter, 100 Years of Solitude, Bikram yoga
@jasonkessler
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
OKCupid: Words and phrases that distinguish white men.
@jasonkessler
Explanation
OKCupid: Words and phrases that distinguish Latin men.
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
@jasonkessler
Ranking with everyone else
The smaller the distance from the top left, the higher the association with white men.
Source: Christian Rudder. Dataclysm. 2014.
Phish is highly associated with white men; K-pop is not.
@jasonkessler
@jasonkessler
my blue eyes
Source: Christian Rudder. Dataclysm. 2014.
Scattertext
pip install scattertext
github.com/JasonKessler/scattertext
@jasonkessler
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.
- Interactive, d3-based scatterplot
- Concise Python API
- Automatically displays non-overlapping labels
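For reference, a minimal usage sketch is below. The DataFrame, CSV file name, column names, and parameter values are illustrative assumptions; the talk's actual notebooks are in the repository linked above.

```python
# Minimal Scattertext sketch: build a corpus from a pandas DataFrame and
# render the interactive HTML scatterplot. Column names ('text', 'category')
# and the input file are hypothetical.
import spacy
import pandas as pd
import scattertext as st

nlp = spacy.load('en_core_web_sm')
df = pd.read_csv('movie_reviews.csv')          # columns: 'text', 'category'

corpus = st.CorpusFromPandas(df,
                             category_col='category',
                             text_col='text',
                             nlp=nlp).build()

html = st.produce_scattertext_explorer(corpus,
                                       category='positive',
                                       category_name='Positive',
                                       not_category_name='Negative',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000)
open('scattertext_demo.html', 'w').write(html)
```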
Scaled F-Score
• Term-class* associations:
• “Good” is associated with the “positive” class
• “Bad” with the “negative” class
• Core intuition: association relies on two necessary factors
• Frequency: how often a term occurs in a class
• Precision: P(class|document contains term)
• F-Score:
• Information retrieval evaluation metric
• Harmonic mean of precision and recall
• Requires both metrics to be high
• *A term is defined to be a word, phrase, or other discrete linguistic element
@jasonkessler
@jasonkessler
Naïve approach
(Scatterplot of terms: precision on the y-axis, frequency on the x-axis)
Y-axis: precision, i.e. P(class|term); roughly normally distributed (mean ≅ 0.5, sd ≅ 0.4)
X-axis: frequency, i.e. P(term|class); roughly a power-law distribution (mean ≅ 0.00008, sd ≅ 0.008)
Color: harmonic mean of precision and frequency (blue = high, red = low)
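Written out (notation is mine, following the definitions above), the naïve score for a term t and class c is the harmonic mean of precision and frequency:

$$\mathrm{prec}(t,c) = P(c \mid t), \qquad \mathrm{freq}(t,c) = P(t \mid c), \qquad H(t,c) = \frac{2\,\mathrm{prec}(t,c)\,\mathrm{freq}(t,c)}{\mathrm{prec}(t,c) + \mathrm{freq}(t,c)}$$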
@jasonkessler
Problem:
• Top words are just stop words.
• Why?
• The harmonic mean is taken over two very differently distributed quantities.
• Most words have a precision of ~0.5, which leads the harmonic mean to rely almost entirely on frequency.
Fix: Normalize Precision and Frequency
• Task: make precision and frequency similarly distributed
• How: take the normal CDF of each term’s precision and frequency
• Mean and standard deviation are computed from the data
• Right: log-normal CDF
@jasonkessler
This area is the log-normal CDF of the term “beauty” (0.938 ∈ [0,1]).
Each tick mark is the log-frequency of a term.
*The log-normal CDF isn’t used in these charts.
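As a concrete illustration of this normalization step, here is a minimal sketch with toy counts; scattertext's built-in Scaled F-Score implementation may differ in details (for example, in how frequencies are scaled).

```python
# Sketch of the fix described above: fit a normal CDF to the observed
# precisions and frequencies, then take the harmonic mean of the two
# normalized values. Counts are toy values, not from the talk.
import numpy as np
from scipy.stats import norm, hmean

pos_counts = np.array([108, 58, 73, 45])   # term counts in positive reviews
neg_counts = np.array([36, 13, 26, 11])    # term counts in negative reviews

precision = pos_counts / (pos_counts + neg_counts)   # P(positive | term)
frequency = pos_counts / pos_counts.sum()            # P(term | positive)

def normcdf(x):
    # Normal CDF with mean and standard deviation estimated from the data itself
    return norm.cdf(x, loc=x.mean(), scale=x.std())

sfs_positive = hmean([normcdf(precision), normcdf(frequency)])
```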
Scaled F-Score
(Scatterplot: NormCDF of precision vs. NormCDF of frequency)
@jasonkessler
Positive Scaled F-Score
Good: the positive terms make sense!
Still some function words, but that’s okay.
Note: these frequent terms are all very close to 1 on the x-axis, but are still ordered.
(Axes: NormCDF of precision vs. NormCDF of frequency)
@jasonkessler
Top Scaled F-Score Terms

Term          Pos Freq.  Neg Freq.  Prec.    Freq %   Raw Hmean  Prec. CDF  Freq. CDF  Scaled F-Score
best          108        36         75.00%   0.22%    0.44%      71.95%     99.50%     83.51%
entertaining  58         13         81.69%   0.12%    0.24%      77.07%     90.94%     83.43%
fun           73         26         73.74%   0.15%    0.30%      70.92%     95.63%     81.44%
heart         45         11         80.36%   0.09%    0.18%      76.09%     84.49%     80.07%

Note: normalized precision and frequency are on comparable scales, allowing the harmonic mean to take both into account.
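A quick arithmetic check on the first row: precision for “best” is 108/(108+36) = 75.00%, the raw harmonic mean of 0.7500 and 0.0022 is 2·0.7500·0.0022/(0.7500+0.0022) ≈ 0.44%, and the Scaled F-Score is the harmonic mean of the two CDF-normalized values, 2·0.7195·0.9950/(0.7195+0.9950) ≈ 83.51%, matching the table.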
Problem: highly negative terms are all low frequency
(Axes: NormCDF of precision vs. NormCDF of frequency)
@jasonkessler
Solution:
• Compute Scaled F-Score association scores for the negative reviews as well.
• For each term, use whichever of the two scores is higher.
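Continuing the sketch above (and reusing its normcdf helper, hmean import, and toy counts), the two-sided score might look like the following; the sign convention for negative-class terms is my own choice, not from the talk.

```python
# Wrap the earlier computation in a function, score both directions, and keep
# whichever association is stronger per term (negated for negative-class terms).
def scaled_f_score(cat_counts, other_counts):
    precision = cat_counts / (cat_counts + other_counts)   # P(category | term)
    frequency = cat_counts / cat_counts.sum()              # P(term | category)
    return hmean([normcdf(precision), normcdf(frequency)])

pos_sfs = scaled_f_score(pos_counts, neg_counts)   # association with positive reviews
neg_sfs = scaled_f_score(neg_counts, pos_counts)   # association with negative reviews
combined = np.where(pos_sfs >= neg_sfs, pos_sfs, -neg_sfs)
```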
Scaled F-Score
Positive Scaled F-Score
@jasonkessler
Negative Scaled F-Score
Note: only one obviously negative term
@jasonkessler
Scaled F-Score by log frequency
The score can be overly sensitive to very frequent terms, but still doesn’t score them very highly.
(Axes: Scaled F-Score vs. log frequency)
This chart over-emphasizes stop words and has a lot of white space.
Characteristic Term Detection
• General idea
• Characteristic terms are more likely to occur in the corpus than in general English
• “Normal” English:
• Peter Norvig’s list of word frequencies from the web in the late ’00s
• Algorithm (sketched below):
• Dense-rank the terms that appear in the corpus by their corpus frequencies and by their “standard” English frequencies.
• Scale the term ranks by the number of distinct ranks in the corpus (fewer) or the background (greater).
• Take the rank difference.
@jasonkessler
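A hedged sketch of that dense-rank scoring; the term counts below are illustrative stand-ins for the corpus counts and the Norvig background frequencies.

```python
# Dense-rank each frequency list, scale by the number of distinct ranks so the
# two rankings are comparable, and take the difference. Counts are made up.
import pandas as pd

corpus_freq = pd.Series({'film': 900, 'movie': 850, 'the': 12000, 'zombie': 40})
background_freq = pd.Series({'film': 2.1e6, 'movie': 3.0e6, 'the': 1.2e9, 'zombie': 8.0e4})

df = pd.DataFrame({'corpus': corpus_freq, 'background': background_freq}).fillna(0)

corpus_rank = df['corpus'].rank(method='dense')
background_rank = df['background'].rank(method='dense')

# Divide by the maximum dense rank, i.e. the number of distinct ranks in each list
df['characteristic_score'] = (corpus_rank / corpus_rank.max()
                              - background_rank / background_rank.max())
print(df.sort_values('characteristic_score', ascending=False))
```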
• Left: most frequent terms in the corpus
• Positive rank difference between “film” and “movie”
@jasonkessler
Top 10 Characteristic Words by Rank
@jasonkessler
Y-axis: Scaled F-Score
X-axis: Characteristic Rank Delta
Some non-movie-like words do affect sentiment.
@jasonkessler
Why not use TF-IDF?
• Drastically favors low-frequency terms
• A term appearing in all classes -> idf = log(1) = 0 -> score = 0
(Chart: TF-IDF(Positive) − TF-IDF(Negative) vs. log frequency)
@jasonkessler
Burt Monroe, Michael Colaresi and Kevin Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis. 2008.
@jasonkessler
Monroe et al. (2008) approach
• Bayesian approach to term association
• Likelihood: z-score of the log-odds-ratio
• Prior: term frequency in a background corpus
• Posterior: z-score of the log-odds-ratio, with the background counts used as smoothing values
Popular, but requires much more tweaking to get working than Scaled F-Score.
@jasonkessler
Scattertext reimplementation of Monroe et al. See
http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for code.
Scattertext implementation (with prior weighting modifications)
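For orientation, here is a hedged sketch of the z-scored log-odds-ratio with an informative Dirichlet prior, following my reading of Monroe et al.; it is not Scattertext's exact implementation, and how the prior counts are rescaled is one of the tunable choices mentioned above.

```python
# Log-odds-ratio with informative Dirichlet prior, z-scored (sketch).
import numpy as np

def log_odds_ratio_z(y_i, y_j, prior):
    """y_i, y_j: term-count arrays for the two classes.
    prior: pseudo-count array, e.g. background-corpus counts rescaled to a chosen total."""
    n_i, n_j, prior_0 = y_i.sum(), y_j.sum(), prior.sum()
    # Smoothed log-odds of each term in each class, then their difference
    delta = (np.log((y_i + prior) / (n_i + prior_0 - y_i - prior))
             - np.log((y_j + prior) / (n_j + prior_0 - y_j - prior)))
    # Approximate variance of the log-odds-ratio
    variance = 1.0 / (y_i + prior) + 1.0 / (y_j + prior)
    return delta / np.sqrt(variance)
```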
In defense of stop words
Cindy K. Chung and James W. Pennebaker. Counting Little Words in Big Data: The Psychology of Communities, Culture, and History. EASP. 2012.
In times of shared crisis, “we” use increases, while “I” use decreases.
I/we: age, social integration
I: lying, social rank
@jasonkessler
Function words and gender
Newman, ML; Groom, CJ; Handelman, LD; Pennebaker, JW. Gender Differences in Language Use: An Analysis of 14,000 Text Samples. 2008.

Bold: entirely stop words

LIWC Dimension                      Effect Size (Cohen’s d) (>0 F, <0 M; MANOVA p<.001)
All pronouns (esp. 3rd person)       0.36
Present-tense verbs (walk, is, be)   0.18
Feeling (touch, hold, feel)          0.17
Certainty (always, never)            0.14
Word count                           NS
Numbers                             -0.15
Prepositions                        -0.17
Words >6 letters                    -0.24
Swear words                         -0.22
Articles                            -0.24

• Performed on a variety of language categories, including speech.
• Other studies have found that function words are the best predictors of gender.
@jasonkessler
Clickbait: what works?
@jasonkessler
Clickbait corpus
• Facebook posts from BuzzFeed, the NY Times, etc., from the 2010s.
• Includes the headline and the number of Facebook likes.
• Scraped by researcher Max Woolf at github.com/minimaxir/clickbait-cluster.
• We’ll separate articles from 2016 into the upper third and lower third of likes (see the sketch below).
• Identify words and phrases that predict likes.
• Begin with noun phrases identified by Phrase Machine (Handler et al. 2016).
• Filter out redundant NPs.
Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016.
@jasonkessler
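A sketch of that upper/lower-third split; the file name and column names ('headline', 'likes', 'year') are illustrative assumptions, not from the talk's notebooks.

```python
# Keep only 2016 headlines, drop the middle third of likes, and label the rest.
import pandas as pd

df = pd.read_csv('clickbait_headlines.csv')       # hypothetical export of the scraped corpus
df_2016 = df[df['year'] == 2016]

lo, hi = df_2016['likes'].quantile([1/3, 2/3])
df_2016 = df_2016[(df_2016['likes'] <= lo) | (df_2016['likes'] >= hi)].copy()
df_2016['engagement'] = (df_2016['likes'] >= hi).map({True: 'high', False: 'low'})
```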
@jasonkessler
Scaled F-Score of engagement by noun phrase
@jasonkessler
Scaled F-Score of engagement by unigram
Psycholinguistic information:
3rd-person pronouns -> high engagement (indicative of female authorship)
2nd-person pronouns -> low engagement (male)
“dies”: obituaries
Can, guess, how: questions.
Clickbait corpus
• How do terms with similar meanings differ in their engagement rates?
• Use Gensim (https://radimrehurek.com/gensim/) to find word embeddings.
• Use UMAP (McInnes and Healy 2018) to project them into two dimensions, and explore them with Scattertext (see the sketch below).
• Locally groups words with similar embeddings together.
• Better alternative to t-SNE; allows cosine rather than Euclidean distance as the distance criterion.
Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv. 2018.
@jasonkessler
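A hedged sketch of the embedding-plus-projection step, reusing the df_2016 DataFrame from the earlier sketch; the tokenization and hyperparameters are illustrative, not necessarily those used in the talk.

```python
# Train Word2Vec on the headlines with gensim, then project the term vectors
# to 2-D with UMAP using cosine distance.
import numpy as np
from gensim.models import Word2Vec
import umap

sentences = [headline.lower().split() for headline in df_2016['headline']]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

terms = w2v.wv.index_to_key                       # vocabulary, most frequent first
vectors = np.array([w2v.wv[t] for t in terms])

xy = umap.UMAP(n_components=2, metric='cosine').fit_transform(vectors)
```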
@jasonkessler
This island is mostly food-related.
“Chocolate” and “cake” are highly engaging, but “breakfast” is predictive of low engagement.
Term positions are determined by UMAP; color by Scaled F-Score for engagement.
Clickbait corpus
• How do the Times and BuzzFeed differ in what they talk about, and in how their content engages their readers?
• Scattertext can easily create visualizations to help answer these questions.
• First, we’ll look at how what engages BuzzFeed readers contrasts with what engages Times readers, and vice versa.
Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv. 2018.
@jasonkessler
Oddly, NY Times readers distinctly like articles about sex and death, and articles written in a smug tone.
This chart doesn’t give a good sense of which language is more associated with one site or the other.
@jasonkessler
This chart lets you know how BuzzFeed and the Times are distinct, while still distinguishing engaging content.
@jasonkessler
Thank you! Questions?
@jasonkessler
Jason S. Kessler
Global AI Conference
April 27, 2018
https://github.com/JasonKessler/GlobalAI2018



Editor's Notes

  • #7: Selected terms. Which words and phrases statistically distinguish ethnic groups and genders?
  • #11: Selected terms
  • #12: Selected terms