Modern text mining – understanding a million comments in 60 minutes
How to derive data-driven insights
… from user-generated content
https://datanizing.com/dts
Text analysis: automatically, regularly & based on a large amount of data.
1) Automatically gather relevant content
2) Cleaning & linguistics
3) Relevance ranking of gathered content
4) Data-driven calculation of main insights
5) Visualization in dashboard & reports
What is a trending hotel, and what about bars?
Are there unmet needs I could address?
What do people care about in NYC?
Products for new target groups
Marketing campaigns in new places
Data-driven category management
…
Founded in 2017
Big Data text analytics experts
Located near Nuremberg
Christian Winkler
Our Big Data market research approach, explained using a common situation:
Visiting foreign places.
The problem:
Too little time, too much going on.
Opinion of 1 author
Too many opinions
14 years
of NYC TripAdvisor Forum
Jan 2005 – Apr 2019
Making sense of …
Much content!
1.6 million posts
89,815 users
Nobody can read that.
There is one problem with user-generated content:
Too much.
Solution?
Text processing pipeline
1) Automatically gather relevant content
2) Cleaning & linguistics
3) Relevance ranking of gathered content
4) Data-driven calculation of main insights
5) Display in dashboard & reports
Pipeline components:
• Language detection
• Synonyms
• Outlier detection
• Feature extraction
• Clustering
• Word combinations
• Categories
• Word frequencies
• Uniqueness
• Domain similarity
• Inverted domain frequency
Perform quality assurance with the whole content
Get an overview of the texts and the data quality, and recognize possible bias in the data
Typical questions answered with statistics
• Do frequent authors write shorter posts?
• How does the number of articles change over time?
• How does the article length change over time?
• What is the length distribution of articles?
• Which are the most frequent words?
• How are keywords used over time?
Spidering, Database, Pandas, Jupyter
• Content spidering
• Data extraction
• SQL database
• Python (pandas)
• Jupyter Notebook
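The statistical questions listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library rather than the pandas and SQL stack named on the slide. The sample records in `POSTS` and all function names are hypothetical and are not taken from the forum data.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical sample records: (author, year, text). Not real forum data.
POSTS = [
    ("kim", 2005, "Which hotel near Times Square is good"),
    ("hank", 2005, "Take the train from JFK"),
    ("kim", 2006, "Thanks for the hotel advice"),
    ("kim", 2006, "The hotel was great"),
    ("olivia", 2006, "Is the subway safe at night and how do I buy a ticket for it"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def posts_per_year(posts):
    """How does the number of articles change over time?"""
    return dict(Counter(year for _, year, _ in posts))


def mean_length_per_year(posts):
    """How does the article length (in words) change over time?"""
    lengths = defaultdict(list)
    for _, year, text in posts:
        lengths[year].append(len(tokenize(text)))
    return {year: mean(values) for year, values in lengths.items()}


def author_stats(posts):
    """Do frequent authors write shorter posts? Maps author -> (post count, mean length)."""
    lengths = defaultdict(list)
    for author, _, text in posts:
        lengths[author].append(len(tokenize(text)))
    return {author: (len(values), mean(values)) for author, values in lengths.items()}


def top_words(posts, n=10):
    """Which are the most frequent words?"""
    counts = Counter(word for _, _, text in posts for word in tokenize(text))
    return counts.most_common(n)


if __name__ == "__main__":
    print(posts_per_year(POSTS))
    print(mean_length_per_year(POSTS))
    print(author_stats(POSTS))
    print(top_words(POSTS, 3))
```

On a real corpus the same aggregations would more likely be group-by operations in pandas; the point here is only that each question maps to one small aggregation.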
Question Answering
Translation / Dialogue
Summarization / Topic Mining
Classification / Retrieval
strong
weak
"Shallow NLP"
• Simple language models with many simplifications (Bag-of-Words, n-grams)
• Keywords, phrases
• Robust algorithms
"Deep NLP"
• Complex language models necessary for deep understanding
• Statements spanning sentences
• Fragile algorithms
Text preparation with NLP
Typical questions answered with statistics (one chart per question)
• How does the number of articles change over time?
• How does the article length change over time?
• Do frequent authors write shorter posts?
• Which are the most frequent words?
• How are keywords used over time?
How would you rate the quality?
Summary quality assurance
Be sure that the mass of text on which future data-driven decisions will be based is of good-enough quality
Recognize possible bias in the data
Get an overview of the texts & data quality
Take-away: value of text statistics
Create data-driven insights from posts of NYC travelers
Data-driven focus points for digital marketing, category management, product design, personalization
How to create data-driven personas
[Figure: users Kim (User #1), Hank (User #2), Marty (# ...) and Olivia (#68,641) shown against Topic 1 to Topic 6, with shares of 80% and 20%]
Basic idea of topic modelling
Looking for hidden / latent structure
1) Which could be candidates for topics?
2) How are they distributed in document space?
[Figure: topics (Topic 1, Topic 2, Topic 3, ..., Topic k) against documents (doc 1, doc 2, ..., doc n)]
Only use word frequencies
• Term frequency (TF)
• Very simple, but robust
• Basis for many algorithms (retrieval,
classification)
Disadvantages
• Very simplified model of language
• No syntactical or relational information kept
Improvements
• TF/IDF, n-grams
Need to vectorize data (BoW)
Documents
D1: "Steffi likes London."
D2: "Steffi does not like London."
D3: "Steffi likes London, but not Paris."
      Steffi  likes  London  does  not  like  but  Paris
D1      1       1      1      0     0    0     0     0
D2      1       0      1      1     1    1     0     0
D3      1       1      1      0     1    0     1     1
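The bag-of-words vectorization of the three example documents can be reproduced with a short sketch. The tokenizer (lowercase, letters only) and the column order (order of first appearance) are assumptions made for this illustration; the slide does not specify either.

```python
import re
from collections import Counter

# The three example documents from the slide.
DOCS = {
    "D1": "Steffi likes London.",
    "D2": "Steffi does not like London.",
    "D3": "Steffi likes London, but not Paris.",
}


def tokenize(text):
    """Lowercase the text and keep runs of letters only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Collect every distinct word, in order of first appearance."""
    vocab = []
    for text in docs.values():
        for token in tokenize(text):
            if token not in vocab:
                vocab.append(token)
    return vocab


def vectorize(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


VOCAB = build_vocabulary(DOCS)
MATRIX = {name: vectorize(text, VOCAB) for name, text in DOCS.items()}

if __name__ == "__main__":
    print(VOCAB)
    for name, row in MATRIX.items():
        print(name, row)
```

Note that "likes" and "like" end up in different columns, and that D1 and D2 share most of their words although they say opposite things. Both effects illustrate the simplified language model that the slide lists as a disadvantage.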
Most ML is boring maths
m documents with n features (words)
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$
(rows: documents, columns: features)
• Use a matrix representation
• m x n matrix can become very large
• 1.3 million rows, 500,000 columns
• Matrix is sparse: most documents contain only a few words
Matrix can be simplified
• Only keep certain number of features
• Only keep features which occur more than x times
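The two simplifications (a minimum number of occurrences and a cap on the number of features) and the sparse storage can be sketched as follows. Counting total occurrences across the corpus, the tie-breaking by alphabetical order, and the dict-per-row sparse format are assumptions of this sketch; real projects typically rely on a library's sparse matrix type.

```python
from collections import Counter

# Tokenized toy corpus (the three example sentences, lowercased).
TOKENIZED = [
    ["steffi", "likes", "london"],
    ["steffi", "does", "not", "like", "london"],
    ["steffi", "likes", "london", "but", "not", "paris"],
]


def prune_vocabulary(tokenized_docs, x=1, max_features=None):
    """Keep words that occur more than x times, optionally capped at max_features."""
    totals = Counter(token for tokens in tokenized_docs for token in tokens)
    kept = [word for word, count in totals.items() if count > x]
    # Most frequent first; ties broken alphabetically to stay deterministic.
    kept.sort(key=lambda word: (-totals[word], word))
    if max_features is not None:
        kept = kept[:max_features]
    return kept


def to_sparse_rows(tokenized_docs, vocab):
    """Store each document as {column index: count}, omitting all zeros."""
    index = {word: j for j, word in enumerate(vocab)}
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(token for token in tokens if token in index)
        rows.append({index[word]: count for word, count in counts.items()})
    return rows


if __name__ == "__main__":
    vocab = prune_vocabulary(TOKENIZED, x=1)
    print(vocab)
    print(to_sparse_rows(TOKENIZED, vocab))
```

Because only non-zero entries are stored, memory grows with the number of words actually used per document rather than with m x n.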
How Topic Modeling works
Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html
Topic modelling transforms the matrix
• Re-arrange features (words) and documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to topic
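One common way to find such blocks is non-negative matrix factorization (NMF), which approximates the document-word matrix X by a product W·H: W holds topic weights per document, H holds word weights per topic. The slides do not name the algorithm behind the talk, so NMF with multiplicative updates is swapped in here purely as an illustration. The toy matrix, the word list and all function names are hypothetical.

```python
import random

# Hypothetical toy counts: 4 documents x 6 words, built with two visible blocks.
WORDS = ["jfk", "train", "taxi", "hotel", "room", "bed"]
X = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(column) for column in zip(*a)]


def frobenius_error(x, w, h):
    """Squared reconstruction error between x and w*h."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factorize x (m x n) into w (m x k) and h (k x n) with non-negative entries."""
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
    return w, h


def top_terms(h, words, n=3):
    """For each topic (row of h), list the n words with the largest weight."""
    return [[words[j] for j in sorted(range(len(words)), key=lambda j: -row[j])[:n]]
            for row in h]


if __name__ == "__main__":
    w, h = nmf(X, k=2)
    print(top_terms(h, WORDS))
    print([max(range(2), key=lambda a: row[a]) for row in w])  # dominant topic per doc
```

The largest entries in each row of H correspond to "words in blocks constitute topics", and the dominant column per row of W corresponds to "documents in blocks belong to topic".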
Result: 6 data-driven topics
Topic 1: Transportation
jfk train airport flight car taxi bus way fly arrive
Topic 2: Newbie
thank nyc hi look help good suggestion appreciate advice visit
Topic 3: Accommodation
hotel stay room night look area square bed price times
Topic 4: Happiness
just great good time like place think really people food
Topic 5: Organization
day tour walk museum brooklyn square island central park park plan
Topic 6: Discounts
ticket buy game seat book purchase discount just sell website
Summary topic modeling
Decisions backed by data for digital marketing, category management, product design, personalization, …
Detect distinct topics users talk about
Detect hidden customer segments based on interest
Take-away: business value of data-driven topics
Can TripAdvisor help in predicting popular posts?
Example: Predict popularity
TripAdvisor posts → Training → Predictions
Number of replies as metric for popularity
<4: unpopular
>15: popular
• Use the “labels”
• Training with 40,000 posts
• Classify text of possible offers
• Find out which content appeals to travelers
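The labeling rule and the training step can be sketched as follows. The thresholds (<4 and >15 replies) come from the slide. Everything else is an assumption for illustration: the slide does not say which classifier was used, so a small multinomial Naive Bayes with add-one smoothing stands in, and the sample posts and reply counts are invented.

```python
import math
from collections import Counter, defaultdict


def popularity_label(replies):
    """Slide rule: <4 replies is unpopular, >15 is popular, the rest is left unlabeled."""
    if replies < 4:
        return "unpopular"
    if replies > 15:
        return "popular"
    return None


def train_naive_bayes(examples):
    """examples: list of (tokens, label). Returns the counts the classifier needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented (text, number of replies) pairs, for illustration only.
SAMPLE_POSTS = [
    ("best rooftop bars with a view", 22),
    ("rooftop view at night", 18),
    ("lost my metrocard", 1),
    ("metrocard refund question", 2),
    ("hotel near airport", 8),  # between the thresholds, so not used for training
]

EXAMPLES = [(text.split(), popularity_label(replies))
            for text, replies in SAMPLE_POSTS
            if popularity_label(replies) is not None]
MODEL = train_naive_bayes(EXAMPLES)

if __name__ == "__main__":
    print(predict(MODEL, "rooftop bars at night".split()))
```

Leaving out the posts between the two thresholds keeps the two classes clearly separated, at the cost of discarding part of the data.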
Example: Predict popular posts
• Meet the locals at Times Square
• Brooklyn at night
• Walk the High Line
• Visit the One World Observatory
• Take a boat trip on the Hudson River
Summary classification
Classify text in customer service, find hate speech, find news categories, separate English from German text, …
Classification and prediction with categorized texts
Take-away: construct labels from unstructured text and use as categories
Detect what people are
interested in
when talking about NYC
Align your messages to what people actually like about you.
Detect changing interests in real-time.
Search result for “airport”
in the TripAdvisor forum
Analysis of words in text
• Order not used
• Relations between words neglected
• Lost semantics
Analyze n-grams
• Order taken into account
• Static relations via tuples
• Abstraction to semantics missing
So far in text analytics…
Each word is a
single entity
Context decides
about semantics!
Training word vectors
Aim: Find context information of words
CBOW model
• Predict word from context
Skip-gram model
• Determine context from word
• Slower, more precise with infrequent words
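The difference between the two models shows up in how training examples are cut out of a sentence. The sketch below only generates those examples and does not train a network. The window size, the sample sentence and the function names are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center word, context word) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (list of context words, center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[start:i] + tokens[i + 1:end]
        if context:
            examples.append((context, center))
    return examples


SENTENCE = ["train", "from", "jfk", "to", "manhattan"]

if __name__ == "__main__":
    print(skipgram_pairs(SENTENCE, window=1))
    print(cbow_examples(SENTENCE, window=1))
```

Skip-gram produces one training pair per neighbor, so it yields more examples per word than CBOW, which is one plausible reason it is slower and gives rare words more updates.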
Word2vec similarities in detail
Search result for “airport”
after training word embeddings
There are three airports in New York:
• JFK: John F. Kennedy
• EWR: Newark Liberty International Airport
• LGA: LaGuardia
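A similarity search on trained word vectors usually ranks words by cosine similarity to the query vector. The sketch below shows that ranking step only. The two-dimensional vectors are hand-made for illustration and are not output of a trained model; real embeddings have hundreds of dimensions.

```python
import math

# Hand-made toy vectors, NOT trained embeddings.
VECTORS = {
    "airport": [1.0, 0.1],
    "jfk": [0.9, 0.2],
    "lga": [0.8, 0.3],
    "hotel": [0.1, 1.0],
    "room": [0.2, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


if __name__ == "__main__":
    for other, score in most_similar(VECTORS, "airport", topn=2):
        print(other, round(score, 3))
```

With these toy vectors the query "airport" ranks the airport codes ahead of the hotel-related words, which mirrors the kind of result the slide describes for the trained model.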
Brooklyn Bridge goes to Brooklyn
Where does Lincoln Tunnel go to?
Lincoln Tunnel goes to:
Hoboken, Queens, Jersey City
Summary word embeddings
• Benefit from changing trends
• Create semantically aware search results
Detect changing interests in real-time
Detect relevant context of a topic
Take-away: business value of semantic context
To wrap it up
Insights & business value from UGC analysis
Insights: data-driven personas, semantic context, changing interests
Business value:
• Decisions backed by relevant data for marketing, category management, product design, …
• Align your messages to what people really like about you & adjust over time
Looking beyond UGC
Derive business value from other text sources
Data-driven approach to deriving diverse, unbiased insight from…
• Technical documentation
• Company wikis
• Change requests
• Scientific publications
• …
Insights:
• Future cost drivers
• Knowledge bottlenecks
• Emerging competing technologies
• Technical debt
Data-driven travel recommendation from 1.6 million NYC TripAdvisor posts
Dr. Christian Winkler
datanizing GmbH
https://datanizing.com/dts

