Topic Modelling: for news recommendation, user behaviour modelling, and many more

Copyright 2017 Aaron Li (aaron@potatos.io)
Modelling
Aaron Li
aaron@potatos.io
for news recommendation, user behaviour modelling, and many more

About me
• Working on a stealth startup
• Former lead inference engineer at Scaled Inference
• Did AI / Machine Learning at Google Research,
NICTA, CMU, ANU, etc.
• https://www.linkedin.com/in/aaronqli/
Copyright 2017 Aaron Li (aaron@potatos.io)2
Copyright 2017 Aaron Li
(aaron@potatos.io)

Overview
• Theory (2 classes, 2h each)
• work out the problem & solutions & why
• discuss the math & models & NLP fundamentals
• Industry use cases & systems & applications
• Practice (2 classes, 2h each)
• live demo + coding + debugging
• data sets, open source tools, Q & A
(aaron@potatos.io)

Overview
• Background Knowledge
• Linear Algebra
• Probability Theory
• Calculus
• Scala / Go / Node / C++ (please vote)
(aaron@potatos.io)

Theory 1
What is news recommendation?
What is topic modeling? Why?
Basic architecture
NLP foundamentals
Basic model: LDA
Practice 1
LDA live demo
NLP tools introduction
Preprocessed Datasets
Code LDA + Experiments
Open source tools for industry
Theory 2
LDA Inference
Gibbs sampling
SparseLDA, AliasLDA, LightLDA
Applications & Industrial use cases
Practice 2
Set up NLP pipeline
SparseLDA, AliasLDA, LightLDA
Train & use the model 
News recommendation demo
Schedule
(aaron@potatos.io)

News Recommendation
(aaron@potatos.io)

7
• A lot of people read news every day
• Flipboard, CNN, Facebook, WeChat … 
• How do we make people more engaged?
• Personalise & Recommendation
• learn preference and show relevant content
• recommend articles based on the current one
News Recommendation
(aaron@potatos.io)

8
• Top websites / apps already doing this
News Recommendation
(aaron@potatos.io)

News Recommendation
(aaron@potatos.io)
Flipboard

Yahoo! News (now “Oath” News)
News Recommendation
(aaron@potatos.io)

11
• Many websites don’t do it (e.g CNN)
• Why not? It’s not a easy problem
• Challenges
• News article vocabulary is large (100k ~ 1M)
• Documents are represented by high-dimensional
vector, based on count of vocabulary
• Traditional similarity measures don’t work
News Recommendation
(aaron@potatos.io)

Example
In 1996 Linus Torvalds, the Finnish creator of the Open Source
operating system Linux, visited the National Zoo and Aquarium with
members of the Canberra Linux Users Group, and was captivated
by one of the Zoo's little Penguins. Legend has it that Linus was
infected with a mythical disease called Penguinitis. Penguinitis
makes you stay awake at night thinking about Penguins and feeling
great love towards them.
Not long after this event the Open Source Software community
decided they needed a logo for Linux. They were looking for
something fun and after Linus mentioned his fondness of penguins,
a slightly overweighted penguin sitting down after having a great
meal seemed to ﬁt the bill perfectly. Hence, Tux the penguin was
created and now when people think of Linux they think of Tux.
(aaron@potatos.io)

Example
• Word count = 132, unique words = 91
• Very hard to measure its distance to other articles in our
database talking about Linux, Linus Torvalds, and the
creation of Tux
• Distance for low-dimensional space aren’t effective
• e.g. cosine similarity won’t make sense
• Need to represent things in low-dimensional vectors
• Capture semantics / topics efﬁciently
(aaron@potatos.io)

Solutions
(aaron@potatos.io)
Step 1. Get text data
 
Step 2. ??? (Machine can’t read text)
 
Step 3. Model & Train
Step 4. Deploy & Predict
News articles
Emails
Legal docs
Resume
…
(i.e. documents)

15
Step 2: NLP Preprocessing - common pipeline
Solutions
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
There are a lot more…
(used in advanced NLP tasks)
Chunking
Named Entity Recognition
Sentiment Analysis
Syntactic Analysis
Dependency Parsing
Coreference Resolution
Entity Relationship Extraction
Semantic Analysis
…
(aaron@potatos.io)

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Mostly rules (by regex or FST)
• Look for sentence splitter
• For English: , . ! ? etc.
• Checkout Wikipedia article
• Open source code is good
• Also checkout this article

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Find boundaries for words
• Easy for English (look for space)
• Hard for Chinese etc.
• Solution: FST, CRF, etc.
• Difﬁculties: see Wikipedia article
• Try making one by yourself using
FST! (CMU 11711 homework)

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Stop words:
• occurs frequently
• semantically not meaningful
• i.e. am, is, who, what, etc. 
• Small set of words 
• Easy to implement
• e.g. in-memory hashset

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Reduce word to root (stem)
• Usually used in IR system
• Root can be a non-word. e.g.
• fishing, fished, fisher => fish
• cats, catty => cat
• argument, arguing => argu 
• Rule based implementation 
• e.g. Porter’s Snowball stemmer 
Also see Wikipedia article

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• POS = Part of Speech
• Find grammar role of each word.  
I ate a ﬁsh
PRP VBD DT NN 
• Disambiguate same words used
in different context. e.g:
• “Train” as in “train a model”
• “Train” as in “catch a train”
• Techniques: HMM, CRF, etc.
• See this article for more details

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Find base form of a word
• More complex than stemming
• Use POS tag information
• Different rules for different POS 
• Base form is a valid word. e.g.
• walks, walking, walked =>walk
• am, are, is => be
• argument (NN) => argument
• arguing (VBG) => argue 
• See Wikipedia article for details

NLP Preprocessing
(aaron@potatos.io)
Sentence splitting
Tokenisation
Stop words removal
Stemming (optional)
POS Tagging
Lemmatisation
Form bag of words
• Index pre-processed documents
and words with id and frequency 
• e.g:
• id:1 word:(train, VBG) freq: 5
• id:2 word:(model, NN) freq: 2
• id:3 word:(train, NN) freq: 3
• …
See UCI Bag of Words dataset

Solutions
• Modelling & Training
• Naive Bayes
• Latent Semantic Analysis
• word2vec, doc2vec, …
• Topic Modelling
(aaron@potatos.io)

24
• Naive Bayes (very old technique)
• Use only key words to get probability for K labels
• Good for spam detection
• Poor performance for news recommendation
• Does not capture semantics / topics
• https://web.stanford.edu/class/cs124/lec/
naivebayes.pdf
Solutions
(aaron@potatos.io)

25
• Latent Semantic Analysis (~1990 - 2000)
• SVD on a TF-IDF frequency matrix with documents as columns and
words as rows
• Gives a low-rank approximation of the matrix and represent documents
in low dimension vectors
• Problem: hard to interpret vectors / documents, probability distribution
is wrong (Gaussian)
• https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-
indexing-1.html
• Thomas Hofmann. Probabilistic Latent Semantic Analysis. In Kathryn B.
Laskey and Henri Prade, editors, UAI,
Solutions
(aaron@potatos.io)

26
• word2vec, doc2vec (2013~)
• Convert words to dense, low-dimensional, compositional
vectors (e.g. king - man + woman = queen)
• Good for classiﬁcation problems
• Slow to train, hard to interpret (because of neural network),
yet to be tested in industrial use cases
• Mikolov, Tomas; et al. "Efﬁcient Estimation of Word
Representations in Vector Space" ICLR 2013.
• Getting started with word2vec
Solutions
(aaron@potatos.io)

27
• Topic Models (LDA etc., 2003~)
• Deﬁne a generative structure involving latent variables (e.g topics)
using well-structured distributions and infer the parameters
• Represent documents / words using low-dimensional, highly
interpretable distributions
• Extensively used in industry. Many open source tools
• Extensive research on speeding up / scaling up
• D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation.
Journal of Machine Learning Research, 3:993–1022, 2003
• Tutorial: Parameter estimation for text analysis, Gregor Heinrich 2008
Solutions
(aaron@potatos.io)

(aaron@potatos.io)

Topic Models
(aaron@potatos.io)
Latent Dirichlet Allocation (LDA)
Image from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]

30
• LDA (Latent Dirichlet Allocation)
• Arguably the most popular topic model since 2013
• Created by David Blei, Andrew Ng, Michael Jordan
• To be practical we use this topic model in class
Topic Models
(aaron@potatos.io)

LDA
(aaron@potatos.io)
Extracted from [Li 2012, Multi-GPU Distributed Parallel Bayesian Differential Topic Modelling]

LDA
(aaron@potatos.io)

Example
(aaron@potatos.io)

Example
(aaron@potatos.io)
Extracted from [BleiNgJordan2003, Latent Dirichlet Allocation]

LDA
• Task: infer parameters
• each document’s representation by topic vector
• with this we can compute document similarity!
• each topic’s representation by words (counts)
• with this we can look at each topic manually, and
interpret the meaning of them!
(aaron@potatos.io)

Theory 1
(aaron@potatos.io)
End of class
Questions?

Industrial Applications &
Use cases
• Yi Wang et al. Peacock: Learning Long-Tail Topic Features for
Industrial Applications (TIST 2014)
• Advertising system in production
• Aaron Li et al. High Performance Latent Variable Models (arxiv,
2014)
• User preference learning from search data
• Arnab Bhadury, Clustering Similar Stories Using LDA
• News Recommendation
• And many more… search “AliasLDA” or “LightLDA” on Google
(aaron@potatos.io)

LDA Inference
(aaron@potatos.io)
Bayes rule:
Where denotes all latent variables

41
In LDA, the topic assignment for each word is latent
LDA Inference
Intractable: KL terms in denominator
(aaron@potatos.io)

42
What can we do to address intractability?
• Gibbs sampling
• Variational inference (not discussed in class)
LDA Inference
(aaron@potatos.io)

43
Estimate by sampling:
LDA Inference
Gibbs sampling:
Sample if is known
(aaron@potatos.io)

LDA Inference
(aaron@potatos.io)
We can compute using Bayes rules
Above equation is called “predictive probability”. It
can be applied to the latent variable which assigns a
topic to each word
i.e. compute the probability of a word is assigned
with a particular topic, given other topic assignments
and the data (docs, words)

LDA Inference
(aaron@potatos.io)
Derive predictive probability

LDA Inference
(aaron@potatos.io)

Putting everything together
LDA Inference
(aaron@potatos.io)

LDA Inference
(aaron@potatos.io)

Terms on the right are known all the time!
We can compute the predictive probability (left
term) by normalising over all k’s
LDA Inference
(aaron@potatos.io)

50
• Algorithm (Gibbs sampling):
• Randomly assign a topic to each word & doc
• For T iterations (a large number to ensure convergence)
• For each doc
• For each word
• For each topic, compute predictive prob
• Sample topic by normalising over all predictive prob
• Repeat for T’ iterations (a small number) and compute topic count per
word and per doc. Use them to estimate and
LDA Inference
(aaron@potatos.io)

Speed up LDA
(Switch to my KDD 2014 slides)
https://www.slideshare.net/AaronLi11/
kdd-2014-presentation-best-research-
paper-award-alias-topic-modelling-
reducing-the-sampling-complexity-of-
topic-models
(aaron@potatos.io)

Theory 2
(aaron@potatos.io)
End of class
Questions?

Topic Modelling: for news recommendation, user behaviour modelling, and many more

More Related Content

Similar to Topic Modelling: for news recommendation, user behaviour modelling, and many more (20)

Recently uploaded (20)

Topic Modelling: for news recommendation, user behaviour modelling, and many more