Modern text mining – understanding a million comments in 60 minutes

How to derive data-driven insights
… from user-generated content
https://datanizing.com/dts

Text analysis: automatically, regularly, and based on large amounts of data.
1) Automatically gather relevant content
2) Cleaning & linguistics
3) Relevance ranking of gathered content
4) Data-driven calculation of main insights
5) Visualization in dashboard & reports

What is a trending
hotel and what
about bars?
Are there unmet needs I
could address?
What do people care
about in NYC?
Products for new
target groups
Marketing campaigns
in new places
Data-driven category
management
…

Founded in 2017
Big Data text analytics experts
Located near Nuremberg
Christian
Winkler

Our Big Data market research approach, explained using a common situation:
Visiting foreign places.

The problem:
Too little time, too much going on.

Making sense of …
14 years of NYC TripAdvisor Forum
Jan 2005 – Apr 2019

Much content!
1.6 million posts
89,815 users
Nobody can read that.

There is one problem with user-generated content:
Too much.

Text processing pipeline
1) Automatically gather relevant content
2) Cleaning & linguistics
3) Relevance ranking of gathered content
4) Data-driven calculation of main insights
5) Display in dashboard & reports
Language detection
Synonyms
Outlier detection
Feature extraction
Clustering
Word combinations
Categories
Word frequencies
Uniqueness
Domain similarity
Inverted domain frequency

Perform quality assurance
with the whole content
Get an overview of texts and data quality, and recognize possible bias in data

Typical questions answered with statistics
• Do frequent authors write shorter posts?
• How does the number of articles change over time?
• How does the article length change over time?
• What is the length distribution of articles?
• Which are the most frequent words?
• How are keywords used over time?
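
These questions boil down to simple aggregations. A minimal sketch in plain Python, assuming each post is a record with the hypothetical fields `author`, `date`, and `text` (the toy posts are made up for illustration; in a pandas setup the same steps would typically become `groupby` aggregations):

```python
from collections import Counter, defaultdict
from statistics import mean

# Toy posts; field names and contents are illustrative assumptions.
posts = [
    {"author": "kim", "date": "2018-05-02", "text": "Which airport is closest to Midtown?"},
    {"author": "kim", "date": "2018-06-11", "text": "Thanks, booked the hotel."},
    {"author": "hank", "date": "2019-01-20", "text": "Is the subway from JFK safe at night with luggage?"},
]

def tokens(text):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

# How does the number of articles change over time? (posts per year)
posts_per_year = Counter(p["date"][:4] for p in posts)

# How does the article length change over time? (mean word count per year)
lengths_by_year = defaultdict(list)
for p in posts:
    lengths_by_year[p["date"][:4]].append(len(tokens(p["text"])))
mean_length_per_year = {year: mean(v) for year, v in lengths_by_year.items()}

# Do frequent authors write shorter posts? (post count and mean length per author)
lengths_by_author = defaultdict(list)
for p in posts:
    lengths_by_author[p["author"]].append(len(tokens(p["text"])))
author_stats = {a: (len(v), mean(v)) for a, v in lengths_by_author.items()}

# Which are the most frequent words?
word_counts = Counter(w for p in posts for w in tokens(p["text"]))

print(posts_per_year, mean_length_per_year, author_stats, word_counts.most_common(3))
```

On real data, plotting `posts_per_year` and `mean_length_per_year` gives a quick first impression of gaps or bias in the collection.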

Python
(pandas)
Content
spidering
Data
extraction
SQL
database
Jupyter
Notebook
Spidering, Database, Pandas, Jupyter

Question Answering
Translation / Dialogue
Summarization / Topic Mining
Classification / Retrieval
strong
weak
"Shallow NLP"
• Simple language models with many simplifications (Bag-of-Words, n-grams)
• Keywords, phrases
• Robust algorithms
"Deep NLP"
• Complex language models necessary for deep understanding
• Statements spanning sentences
• Fragile algorithms
Text preparation with NLP

How does the number of articles change over time?

How does the article length change over time?

Do frequent authors write shorter posts?

Which are the most frequent words?

How would you rate the quality?
How are keywords used over time?

Summary quality assurance
Make sure the body of text that future data-driven decisions will be based on is of good-enough quality
Recognize possible bias in data
Get an overview of texts & data quality
take-away value of
text statistics

Create data-driven insights
from posts of NYC travelers
Data-driven focus points for digital marketing, category management, product design, personalization

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
80% 20%
Kim
User #1
Hank
User #2
Marty
# ...
Olivia
#68,641
How to create data-driven
personas

Looking for hidden / latent structure
1) Which could be candidates for topics?
2) How are they distributed in document space?
Basic idea of topic modelling
[Figure: matrix relating topics (Topic 1 … Topic k) to documents (doc 1 … doc n)]

Only use word frequencies
• Term frequency (TF)
• Very simple, but robust
• Basis for many algorithms (retrieval,
classification)
Disadvantages
• Very simplified model of language
• No syntactical or relational information kept
Improvements
• TF/IDF, n-grams
Need to vectorize data (BoW)
Documents
D1: "Steffi likes London."
D2: "Steffi does not like London."
D3: "Steffi likes London, but not Paris."
     Steffi  likes  London  does  not  like  but  Paris
D1   1       1      1       0     0    0     0    0
D2   1       0      1       1     1    1     0    0
D3   1       1      1       0     1    0     1    1
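
The vectorization of D1 to D3 can be reproduced in a few lines. This is a minimal sketch in plain Python, not the tooling used in the talk; the helper `tokenize` and the choice to order the vocabulary by first appearance are assumptions made for illustration:

```python
import re

docs = {
    "D1": "Steffi likes London.",
    "D2": "Steffi does not like London.",
    "D3": "Steffi likes London, but not Paris.",
}

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary in order of first appearance across the documents.
vocabulary = []
for text in docs.values():
    for word in tokenize(text):
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per document, aligned with the vocabulary.
bow = {
    name: [tokenize(text).count(word) for word in vocabulary]
    for name, text in docs.items()
}

print(vocabulary)
for name, vector in bow.items():
    print(name, vector)
```

Note that "like" and "likes" end up as two separate features; this is the kind of issue the linguistics step of the pipeline is meant to address.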

Most ML is boring maths
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}$$
m documents with n features (words)
• Use a matrix representation
• m x n Matrix can become very large
• 1.3 million rows, 500,000 columns
• Matrix is sparse:
Most documents contain only a few words
Matrix can be simplified
• Only keep certain number of features
• Only keep features which occur more than
x times
features
documents
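
Both ideas, storing only the non-zero entries and dropping rare features, can be sketched in plain Python. The function `prune` and its parameters `min_count` and `max_features` are hypothetical names chosen for this illustration, and the toy documents are made up:

```python
from collections import Counter

# Tokenized toy documents (illustrative only).
documents = [
    ["hotel", "times", "square", "hotel"],
    ["jfk", "taxi", "hotel"],
    ["taxi", "jfk", "airport"],
]

# Sparse representation: each row stores only its non-zero word counts.
sparse_rows = [Counter(doc) for doc in documents]

# Total number of occurrences of each word across all documents.
total_counts = Counter()
for row in sparse_rows:
    total_counts.update(row)

def prune(rows, total_counts, min_count=1, max_features=None):
    """Keep words occurring more than min_count times overall,
    optionally restricted to the max_features most frequent words."""
    kept = {w for w, c in total_counts.most_common(max_features) if c > min_count}
    pruned = [{w: c for w, c in row.items() if w in kept} for row in rows]
    return sorted(kept), pruned

features, pruned_rows = prune(sparse_rows, total_counts, min_count=1)
print(features)
print(pruned_rows)
```

Pruning shrinks the number of columns considerably on real corpora, because most words occur only a handful of times.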

How Topic Modeling works
Adapted from http://topicmodels.west.uni-koblenz.de/ckling/tmt/svd_ap.html
Topic modelling transforms the matrix
• Re-arrange features (words) and
documents
• Find blocks
• Words in blocks constitute topics
• Documents in blocks belong to topic
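
The slides do not name the algorithm behind the topics. As one possible illustration of "finding blocks", the sketch below uses non-negative matrix factorization (NMF) with multiplicative updates, a common topic modelling technique that is swapped in here as an assumption. It factorizes a small made-up document-word matrix `v` into document-topic weights `w` and topic-word weights `h`; all names are chosen for this example:

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]

def error(v, w, h):
    """Squared reconstruction error between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorize V (documents x words) into W (documents x topics)
    and H (topics x words) using multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(m)]
    return w, h

# Toy counts: columns are jfk, taxi, airport, hotel, room, bed.
# Documents 0-1 use transport words, documents 2-3 accommodation words.
v = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
w, h = nmf(v, k=2)

# Dominant topic per document: index of the largest weight in each row of W.
dominant_topics = [row.index(max(row)) for row in w]
print(dominant_topics)
```

The large entries in each row of `h` are the words that characterize a topic, and the rows of `w` indicate how strongly each document belongs to it.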

Topic 1: Transportation
jfk train airport flight car taxi bus way fly arrive
Topic 2: Newbie
thank nyc hi look help good suggestion appreciate advice visit
Topic 3: Accommodation
hotel stay room night look area square bed price times
Topic 4: Happiness
just great good time like place think really people food
Topic 5: Organization
day tour walk museum brooklyn square island central park park plan
Topic 6: Discounts
ticket buy game seat book purchase discount just sell website
Result: 6 data-driven topics

Summary topic modeling
Decisions backed by data for digital marketing, category management, product design, personalization, …
Detect distinct topics users talk about
Detect hidden customer segments based on interest
take-away business value of
data-driven topics

Can TripAdvisor help in
predicting popular posts?

TripAdvisor posts
• Number of replies as metric for popularity
• <4: unpopular
• >15: popular
Training
• Use the “labels”
• Training with 40,000 posts
Predictions
• Classify text of possible offers
• Find out which content appeals to travelers
Example: Predict popularity
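
Constructing the labels from reply counts can be sketched as follows. The thresholds come from the slide; leaving posts with 4 to 15 replies out of the training set is an assumption, as are the function name `popularity_label` and the toy posts:

```python
def popularity_label(num_replies):
    """Map a reply count to a label using the thresholds from the slide."""
    if num_replies < 4:
        return "unpopular"
    if num_replies > 15:
        return "popular"
    return None  # middle range: assumed to be excluded from training

# (text, number of replies) pairs; toy values for illustration only.
posts = [
    ("Best pizza near Times Square?", 22),
    ("Thanks all", 1),
    ("Subway or taxi from JFK?", 9),
]

# Keep only posts with a clear label as training examples.
training_set = [
    (text, popularity_label(replies))
    for text, replies in posts
    if popularity_label(replies) is not None
]
print(training_set)
```

Dropping the middle range gives the classifier two clearly separated classes, at the cost of discarding some data.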

Example: Predict popular posts
Meet the
locals at
Times Square
Brooklyn at
night
Walk the
high line
Visit the One
World
Observatory
Take a boat
trip on the
Hudson river

Summary classification
Classify text in customer service, find hate speech, find news categories, separate English from German text, …
Classification and prediction with categorized texts
take-away: construct labels from unstructured text and
use as categories

Detect what people are
interested in
when talking about NYC
Align your messages to what people actually like about you.
Detect changing interests in real-time.

Search result for “airport”
in the TripAdvisor forum

Analysis of words in text
• Order not used
• Relations between words neglected
• Lost semantics
Analyze n-grams
• Order taken into account
• Static relations via tuples
• Abstraction to semantics missing
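
A minimal sketch of n-gram extraction, assuming whitespace-tokenized input; the function name `ngrams` is chosen for this example:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "steffi does not like london".split()
unigrams = ngrams(tokens, 1)  # word order is lost once these are counted
bigrams = ngrams(tokens, 2)   # neighbouring words stay together

print(bigrams)
```

The bigram ("not", "like") preserves the negation that single-word counts lose, but it is still a fixed tuple and not a representation of meaning.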
So far in text analytics…
Each word is a
single entity
Context decides
about semantics!

Aim: Find context information of words
CBOW model
• Predict word from context
Skip-gram model
• Predict context from word
• Slower, more precise with infrequent words
Training word vectors
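
The difference between the two models is easiest to see in the training examples they are built from. The sketch below only generates those examples and does not train a network; the function names and the `window` parameter are assumptions for illustration:

```python
def context_windows(tokens, window=2):
    """Yield (center word, surrounding words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [
        (center, word)
        for center, context in context_windows(tokens, window)
        for word in context
    ]

tokens = "take taxi from jfk".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```

Skip-gram produces one example per (word, neighbour) pair, so it yields more training examples than CBOW for the same text.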

Word2vec similarities in detail
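
Similarity between word vectors is commonly measured with the cosine of the angle between them. A minimal sketch, using made-up three-dimensional toy vectors (real embeddings have far more dimensions, and these values are not taken from the talk):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to the given word."""
    scores = [(other, cosine(vectors[word], vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only.
vectors = {
    "airport": [0.9, 0.1, 0.0],
    "jfk": [0.8, 0.2, 0.1],
    "lga": [0.85, 0.15, 0.05],
    "hotel": [0.1, 0.9, 0.2],
}

ranking = most_similar("airport", vectors)
print(ranking)
```

Words used in similar contexts end up with similar vectors, which is what allows a search for "airport" to surface related terms.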

Search result for “airport”
after training word embeddings
There are three airports serving New York:
JFK: John F. Kennedy
EWR: Newark Liberty International Airport
LGA: LaGuardia

Brooklyn Bridge goes to Brooklyn
Where does the Lincoln Tunnel go?
Lincoln Tunnel goes to:
Hoboken, Queens, Jersey City

Summary word embeddings
• Benefit from changing trends
• Create semantically aware search results
Detect changing interests in real-time
Detect relevant context of a topic
take-away business value of
semantic context

To wrap it up
insights & business value from UGC analysis
Insights
• Data-driven personas
• Semantic context
• Changing interests
Business value
• Decisions backed by relevant data for marketing, category management, product design, …
• Align your messages to what people really like about you & adjust over time

Looking beyond UGC
derive business value from other text sources
Technical
documentation
Data-driven approach to deriving diverse, unbiased insight from…
Company
wikis
Change requests
Scientific
publications
…
Future
cost-drivers
Knowledge
bottlenecks
Emerging
competing
technologies
Technical
debts

Data-driven travel recommendation
from 1.6 million NYC TripAdvisor posts

Dr. Christian
Winkler
datanizing
GmbH
https://datanizing.com/dts
