Parallel Non-blocking Deterministic
Algorithm for Online Topic Modeling
Murat Apishev
great-mel@yandex.ru
Oleksandr Frei
oleksandr.frei@gmail.com
HSE, MSU, MIPT
April 8, 2016
Contents
1 Introduction
Topic modeling
ARTM
BigARTM
2 Parallel implementation
Synchronous algorithms
Asynchronous algorithms
Comparison
3 Applications
The RSF project
Conclusions
Introduction
Parallel implementation
Applications
Topic modeling
ARTM
BigARTM
Topic modeling
Topic modeling — an application of machine learning to statistical
text analysis.
Topic — a specific terminology of the subject area, the set of terms
(unigrams or n−grams) frequently appearing together in
documents.
Topic model uncovers latent semantic structure of a text collection:
topic t is a probability distribution p(w|t) over terms w
document d is a probability distribution p(t|d) over topics t
Applications — information retrieval for long-text queries,
classification, categorization, summarization of texts.
Murat Apishev great-mel@yandex.ru AIST 2016 3 / 33
Topic modeling task
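Given: $W$, the set (vocabulary) of terms (unigrams or n-grams),
$D$, the set (collection) of text documents $d \subset W$,
$n_{dw}$, the number of times term $w$ appears in document $d$.
Find: the model $p(w|d) = \sum_{t \in T} \varphi_{wt} \theta_{td}$ with parameters $\Phi_{W \times T}$ and $\Theta_{T \times D}$:
$\varphi_{wt} = p(w|t)$, the probability of term $w$ in topic $t$,
$\theta_{td} = p(t|d)$, the probability of topic $t$ in document $d$.
Criterion: log-likelihood maximization:
$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \to \max_{\Phi, \Theta};$$
$$\varphi_{wt} \geq 0; \quad \sum_{w} \varphi_{wt} = 1; \quad \theta_{td} \geq 0; \quad \sum_{t} \theta_{td} = 1.$$
Issue: the stochastic matrix factorization problem is ill-posed, because the solution is not unique:
$$\Phi\Theta = (\Phi S)(S^{-1}\Theta) = \Phi'\Theta'.$$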
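The non-uniqueness can be checked numerically. The sketch below is only an illustration: the matrices `phi`, `theta`, and the mixing matrix `s` are made-up toy values, chosen so that both transformed factors stay non-negative with columns summing to 1.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy values (assumptions): phi is W x T, theta is T x D, columns sum to 1.
phi = [[0.6, 0.1],
       [0.3, 0.2],
       [0.1, 0.7]]
theta = [[0.5, 0.3],
         [0.5, 0.7]]

# S mixes the two topics; its columns sum to 1, so phi * S stays stochastic.
s = [[0.9, 0.2],
     [0.1, 0.8]]
det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
s_inv = [[s[1][1] / det, -s[0][1] / det],
         [-s[1][0] / det, s[0][0] / det]]

phi2 = matmul(phi, s)            # Phi' = Phi S
theta2 = matmul(s_inv, theta)    # Theta' = S^{-1} Theta
product = matmul(phi, theta)     # p(w|d) under the original factors
product2 = matmul(phi2, theta2)  # p(w|d) under the transformed factors
```

Both factorizations give the same $p(w|d)$ and therefore the same likelihood, although `phi2` differs from `phi`. This is the gap that regularization is meant to close.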
PLSA and EM-algorithm
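Log-likelihood maximization:
$$\sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname{norm}_{t \in T} \left( \varphi_{wt} \theta_{td} \right)$$
M-step:
$$\varphi_{wt} = \operatorname{norm}_{w \in W} \left( n_{wt} \right), \quad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname{norm}_{t \in T} \left( n_{td} \right), \quad n_{td} = \sum_{w \in d} n_{dw} p_{tdw}$$
where
$$\operatorname{norm}_{i \in I} x_i = \frac{\max\{x_i, 0\}}{\sum_{j \in I} \max\{x_j, 0\}}$$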
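The E-step and M-step above can be sketched in plain Python as follows. This is a minimal, unoptimized illustration and not the BigARTM implementation; the toy collection `docs`, the random start, and all function names are assumptions made for the example.

```python
import math
import random

def normalize(values):
    """norm operator: clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped

def log_likelihood(n_dw, phi, theta):
    """Sum over documents and terms of n_dw * ln(sum_t phi_wt * theta_td)."""
    return sum(count * math.log(sum(phi[w][t] * theta[t][d] for t in range(len(theta))))
               for d, doc in enumerate(n_dw) for w, count in doc.items())

def em_iteration(n_dw, phi, theta):
    """One EM pass; returns new (phi, theta) without modifying the inputs."""
    num_terms, num_topics, num_docs = len(phi), len(theta), len(n_dw)
    n_wt = [[0.0] * num_topics for _ in range(num_terms)]
    n_td = [[0.0] * num_docs for _ in range(num_topics)]
    for d, doc in enumerate(n_dw):
        for w, count in doc.items():
            # E-step: p_tdw = norm over t of phi_wt * theta_td.
            p_tdw = normalize([phi[w][t] * theta[t][d] for t in range(num_topics)])
            for t in range(num_topics):
                n_wt[w][t] += count * p_tdw[t]
                n_td[t][d] += count * p_tdw[t]
    # M-step: phi_wt = norm over w of n_wt, theta_td = norm over t of n_td.
    phi_cols = [normalize([n_wt[w][t] for w in range(num_terms)]) for t in range(num_topics)]
    theta_cols = [normalize([n_td[t][d] for t in range(num_topics)]) for d in range(num_docs)]
    new_phi = [[phi_cols[t][w] for t in range(num_topics)] for w in range(num_terms)]
    new_theta = [[theta_cols[d][t] for d in range(num_docs)] for t in range(num_topics)]
    return new_phi, new_theta

# Toy collection (assumption): each document maps a term index to its count n_dw.
docs = [{0: 3, 1: 2}, {1: 1, 2: 4}, {0: 1, 2: 2, 3: 3}]
rng = random.Random(0)
phi = [[rng.random() + 0.1 for _ in range(2)] for _ in range(4)]    # 4 terms, 2 topics
theta = [[rng.random() + 0.1 for _ in range(3)] for _ in range(2)]  # 2 topics, 3 documents
phi, theta = em_iteration(docs, phi, theta)  # the first pass also normalizes the random start

history = []
for _ in range(20):
    history.append(log_likelihood(docs, phi, theta))
    phi, theta = em_iteration(docs, phi, theta)
```

The values collected in `history` should never decrease, which is the usual sanity check for an EM implementation.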
ARTM and regularized EM-algorithm
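Log-likelihood maximization with an additive regularization criterion $R$:
$$\sum_{d \in D} \sum_{w \in W} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname{norm}_{t \in T} \left( \varphi_{wt} \theta_{td} \right)$$
M-step:
$$\varphi_{wt} = \operatorname{norm}_{w \in W} \left( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \right), \quad n_{wt} = \sum_{d \in D} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname{norm}_{t \in T} \left( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \right), \quad n_{td} = \sum_{w \in d} n_{dw} p_{tdw}$$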
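To see what the extra term does, assume for illustration the simple regularizer $R(\Phi) = \tau \sum_{w,t} \ln \varphi_{wt}$ (a choice made for this sketch, not necessarily the exact form used in BigARTM). Then $\varphi_{wt} \, \partial R / \partial \varphi_{wt} = \tau$, so the M-step just adds a constant to every counter before normalizing: $\tau > 0$ smooths the distribution, $\tau < 0$ sparses it.

```python
def normalize(values):
    """norm operator: clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped

def regularized_phi_column(n_wt_column, tau):
    """phi_wt = norm over w of (n_wt + phi_wt * dR/dphi_wt).

    With the assumed regularizer R = tau * sum(ln phi_wt), the added term is tau.
    """
    return normalize([n + tau for n in n_wt_column])

counts = [4.0, 0.3, 0.0, 1.7]  # hypothetical n_wt for one topic over four terms
smoothed = regularized_phi_column(counts, tau=0.5)   # zero counts become positive
sparsed = regularized_phi_column(counts, tau=-0.5)   # small counts are clipped to zero
```

The clipping inside `normalize` is what turns a negative $\tau$ into exact zeros rather than negative probabilities.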
Examples of regularizers
Many Bayesian models can be reinterpreted as regularizers
in ARTM.
Some examples of regularizers:
1 Smoothing Φ / Θ (leads to the popular LDA model)
2 Sparsing Φ / Θ
3 Decorrelation of topics in Φ
4 Semi-supervised learning
5 Topic coherence maximization
6 Topic selection
7 . . .
Multimodal Topic Model
Multimodal Topic Model finds topical distributions for terms
p(w|t), authors p(a|t), time p(y|t), objects of images p(o|t),
linked documents p(d′|t), advertising banners p(b|t), users p(u|t),
and binds all these modalities into a single topic model.
[Figure: text documents (doc1, doc2, doc3, doc4, ...) with their metadata (authors, date and time, conference, organization, URL, etc.) and the additional modalities Ads, Images, Links, and Users pass through Topic Modeling, which produces the topics of documents and the words and keyphrases of topics.]
M-ARTM and multimodal regularized EM-algorithm
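$W^m$ is the vocabulary of terms of the $m$-th modality, $m \in M$;
$W = \bigsqcup_{m \in M} W^m$ is the joint vocabulary of all modalities.
Multimodal log-likelihood maximization with an additive regularization criterion $R$:
$$\sum_{m \in M} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw} \ln \sum_{t} \varphi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}$$
EM-algorithm: the simple iteration method for the system of equations
E-step:
$$p_{tdw} = \operatorname{norm}_{t \in T} \left( \varphi_{wt} \theta_{td} \right)$$
M-step:
$$\varphi_{wt} = \operatorname{norm}_{w \in W^m} \left( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \right), \quad n_{wt} = \sum_{d \in D} \lambda_{m(w)} n_{dw} p_{tdw}$$
$$\theta_{td} = \operatorname{norm}_{t \in T} \left( n_{td} + \theta_{td} \frac{\partial R}{\partial \theta_{td}} \right), \quad n_{td} = \sum_{w \in d} \lambda_{m(w)} n_{dw} p_{tdw}$$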
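The two places where modalities enter the M-step, the weight $\lambda_{m(w)}$ in the counters and the per-modality normalization of $\Phi$, can be sketched as below. The tokens, the modality names, the weights, and the $p_{tdw}$ values are all hypothetical, and the regularizer is taken to be $R = 0$ to keep the example short.

```python
def normalize(values):
    """norm operator: clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped

# Hypothetical toy setup: two modalities with illustrative weights lambda_m.
modality_of = {"topic": "text", "model": "text", "frei": "author", "apishev": "author"}
weights = {"text": 1.0, "author": 5.0}

def add_counts(n_wt, term, n_dw, p_tdw):
    """Accumulate lambda_{m(w)} * n_dw * p_tdw into the row of n_wt for this term."""
    lam = weights[modality_of[term]]
    n_wt[term] = [old + lam * n_dw * p for old, p in zip(n_wt[term], p_tdw)]

def m_step_phi(n_wt):
    """phi_wt = norm over w in W^m of n_wt: each modality is normalized on its own."""
    num_topics = len(next(iter(n_wt.values())))
    phi = {term: [0.0] * num_topics for term in n_wt}
    for modality in set(modality_of.values()):
        terms = [term for term in n_wt if modality_of[term] == modality]
        for t in range(num_topics):
            column = normalize([n_wt[term][t] for term in terms])
            for term, value in zip(terms, column):
                phi[term][t] = value
    return phi

n_wt = {term: [0.0, 0.0] for term in modality_of}
add_counts(n_wt, "topic", 3, [0.9, 0.1])    # made-up counts and p_tdw values
add_counts(n_wt, "model", 2, [0.4, 0.6])
add_counts(n_wt, "frei", 1, [0.8, 0.2])
add_counts(n_wt, "apishev", 1, [0.3, 0.7])
phi = m_step_phi(n_wt)
```

Within a single modality the weight cancels in the normalization of $\Phi$ (when $R = 0$). It matters for $n_{td}$, where counts from different modalities are summed for the same document.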
BigARTM project
BigARTM features:
Fast¹ parallel and online processing of Big Data;
Multimodal and regularized topic modeling;
Built-in library of regularizers and quality measures;
BigARTM community:
Open-source https://github.com/bigartm
Documentation http://bigartm.org
BigARTM license and programming environment:
Freely available for commercial usage (BSD 3-Clause license)
Cross-platform: Windows, Linux, Mac OS X (32 bit, 64 bit)
Programming APIs: command line, C++, Python
¹ Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. In: Analysis of Images, Social Networks and Texts. 2015
BigARTM vs. Gensim vs. Vowpal Wabbit LDA
3.7M articles from Wikipedia, 100K unique words
Framework      procs   train     inference   perplexity
BigARTM        1       35 min    72 sec      4000
LdaModel       1       369 min   395 sec     4161
VW.LDA         1       73 min    120 sec     4108
BigARTM        4       9 min     20 sec      4061
LdaMulticore   4       60 min    222 sec     4111
BigARTM        8       4.5 min   14 sec      4304
LdaMulticore   8       57 min    224 sec     4455
procs = number of parallel threads
inference = time to infer $\theta_d$ for 100K held-out documents
perplexity $P$ is calculated on held-out documents:
$$P(D) = \exp\left( -\frac{1}{n} \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \right), \quad n = \sum_{d} n_d.$$
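The perplexity formula translates directly into code. The sketch below uses a made-up uniform model as a sanity check: if every term has probability 1/4 in every topic, the perplexity equals the vocabulary size, 4, whatever $\Theta$ is. The data layout (documents as dictionaries from term index to count) is an assumption of this example.

```python
import math

def perplexity(n_dw, phi, theta):
    """P(D) = exp(-(1/n) * sum_d sum_w n_dw * ln(sum_t phi_wt * theta_td))."""
    log_likelihood = 0.0
    total_count = 0  # n = sum of document lengths n_d
    for d, doc in enumerate(n_dw):
        for w, count in doc.items():
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(len(theta)))
            log_likelihood += count * math.log(p_wd)
            total_count += count
    return math.exp(-log_likelihood / total_count)

# Hypothetical held-out documents: term index -> count.
held_out = [{0: 2, 3: 1}, {1: 4, 2: 2}]
uniform_phi = [[0.25, 0.25] for _ in range(4)]  # 4 terms, 2 topics
theta = [[0.3, 0.8],
         [0.7, 0.2]]                            # 2 topics, 2 documents
value = perplexity(held_out, uniform_phi, theta)
```

Lower perplexity means the model assigns higher probability to the held-out text, so it is the smaller values in the table that are better.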
Offline algorithm
The collection is split into batches.
The offline algorithm performs repeated scans over the collection.
Each thread processes one batch at a time, inferring $n_{wt}$ and $\theta_{td}$ (using Θ regularization).
After each scan, the algorithm recalculates the Φ matrix and applies the Φ regularizers according to the equation
$$\varphi_{wt} = \operatorname{norm}_{w \in W} \left( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \right).$$
The implementation never stores the entire Θ matrix at any given time.
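The structure of the offline algorithm can be sketched with a thread pool. This is a simplified illustration and not the BigARTM code: there are no regularizers ($R = 0$), the number of inner iterations for $\theta_d$ is an arbitrary choice, and the toy batches and initial `phi0` are made up. Python threads are used only to show the control flow; they will not give the speedups reported in the slides.

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(values):
    """norm operator: clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped

def process_batch(batch, phi, inner_iters=5):
    """Infer theta_d for each document of the batch and return the n_wt increment."""
    num_terms, num_topics = len(phi), len(phi[0])
    n_wt = [[0.0] * num_topics for _ in range(num_terms)]
    for doc in batch:
        theta_d = [1.0 / num_topics] * num_topics
        for _ in range(inner_iters):
            n_td = [0.0] * num_topics
            for w, count in doc.items():
                p = normalize([phi[w][t] * theta_d[t] for t in range(num_topics)])
                for t in range(num_topics):
                    n_td[t] += count * p[t]
            theta_d = normalize(n_td)
        for w, count in doc.items():
            p = normalize([phi[w][t] * theta_d[t] for t in range(num_topics)])
            for t in range(num_topics):
                n_wt[w][t] += count * p[t]
        # theta_d is dropped here, so the full Theta matrix is never stored.
    return n_wt

def fit_offline(batches, phi, num_scans=3, num_threads=2):
    """Scan the collection num_scans times; phi is recalculated after each scan."""
    num_terms, num_topics = len(phi), len(phi[0])
    for _ in range(num_scans):
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            increments = list(pool.map(lambda batch: process_batch(batch, phi), batches))
        n_wt = [[sum(inc[w][t] for inc in increments) for t in range(num_topics)]
                for w in range(num_terms)]
        # With R = 0 the Phi update is a plain normalization of n_wt over terms.
        columns = [normalize([n_wt[w][t] for w in range(num_terms)]) for t in range(num_topics)]
        phi = [[columns[t][w] for t in range(num_topics)] for w in range(num_terms)]
    return phi

# Toy data (assumptions): three batches of one document each, 4 terms, 2 topics.
batches = [[{0: 3, 1: 2}], [{2: 1, 3: 4}], [{0: 1, 3: 2}]]
phi0 = [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]]
phi = fit_offline(batches, phi0)
```

All threads must finish the scan before the normalization starts, which is the synchronization point visible in the Gantt chart of this algorithm.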
Offline algorithm: Gantt chart
[Figure: Gantt chart, 0 s to 36 s, for the Main thread and processor threads Proc-1 to Proc-6. Legend: Batch processing, Norm.]
This and the following Gantt charts were created using the NYTimes dataset:
https://archive.ics.uci.edu/ml/datasets/Bag+of+Words.
The dataset contains ≈ 300k documents, but each algorithm was run on
a subset (from 70% to 100%) to achieve a working time of ≈ 36 sec.
Online algorithm
The algorithm is a generalization of the Online variational Bayes
algorithm for the LDA model.
Online ARTM improves the convergence rate of
Offline ARTM by recalculating the matrix Φ after every $\eta$
batches.
It is better suited for large and heterogeneous text collections.
A weighted sum of $n_{wt}$ from the previous and the current $\eta$ batches
controls the importance of new information.
Issue: all threads have no useful work to do while the Φ matrix
is being updated.
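The weighted sum of counters can be sketched as follows. The slides do not state how the weights are chosen; the schedule $\rho_i = (\tau_0 + i)^{-\kappa}$ below is borrowed from online variational Bayes for LDA and is an assumption of this example, as are the function names and the toy matrices.

```python
def merge_counts(n_wt_old, n_wt_new, decay_weight, apply_weight):
    """Weighted sum of the accumulated counts and the counts from the newest eta batches."""
    return [[decay_weight * old + apply_weight * new
             for old, new in zip(row_old, row_new)]
            for row_old, row_new in zip(n_wt_old, n_wt_new)]

def update_weights(update_index, tau0=64.0, kappa=0.7):
    """Hypothetical schedule: rho = (tau0 + i)^(-kappa); returns (decay, apply)."""
    rho = (tau0 + update_index) ** (-kappa)
    return 1.0 - rho, rho

# Made-up counters: two terms, two topics.
n_wt_old = [[10.0, 0.0], [0.0, 10.0]]   # accumulated over earlier batches
n_wt_new = [[0.0, 4.0], [4.0, 0.0]]     # collected from the latest eta batches
merged = merge_counts(n_wt_old, n_wt_new, decay_weight=0.5, apply_weight=0.5)
decay, apply_ = update_weights(update_index=0)
```

A larger `apply_weight` makes the model follow recent batches more closely, while a larger `decay_weight` keeps it closer to what was learned earlier.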
Online algorithm: Gantt chart
[Figure: Gantt chart, 0 s to 36 s, for the Main thread and processor threads Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge.]
Async: Asynchronous online algorithm
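[Diagram: a queue of batches $\{D_b\}$ feeds the Processor threads, which run ProcessBatch($D_b$, $\varphi_{wt}$) and push the increments $\tilde{n}_{wt}$ into a second queue $\{\tilde{n}_{wt}\}$; the Merger thread accumulates $\tilde{n}_{wt}$ and recalculates $\varphi_{wt}$; the diagram also shows a Sync() call.]
Faster asynchronous implementation (this is the version that was compared with
Gensim and VW LDA).
Issue: the Merger and DataLoader threads can become a bottleneck.
Issue: the result of such an algorithm is non-deterministic.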
Async: Gantt chart in normal case
[Figure: Gantt chart, 0 s to 36 s, for the Merger thread and processor threads Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
Async: Gantt chart in bad case
[Figure: Gantt chart, 0 s to 36 s, for the Merger thread and processor threads Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
DetAsync: Deterministic asynchronous online algorithm
To avoid non-deterministic behavior, replace the update after
the first $\eta$ processed batches with an update after a given set of $\eta$ batches.
Remove the Merger and DataLoader threads. Each Processor
thread reads batches and writes its results into the $n_{wt}$ matrix by
itself.
Processor threads get a set of batches to process, start
processing, and immediately return a future object to the main
thread.
The main thread can process the updates of the Φ matrix while
the Processor threads work, and then get the result by passing
the received future object to the Await function.
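The future-based control flow can be sketched with Python's `concurrent.futures`. This is an illustration under simplifying assumptions and not the BigARTM implementation: `process_batch` is a stand-in that performs a single E-step with a uniform $\theta_d$, counters are accumulated without decay, there are no regularizers, and `batches` must be non-empty. Results are awaited in submission order, so the outcome does not depend on thread timing.

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(values):
    """norm operator: clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped

def process_batch(batch, phi):
    """Simplified stand-in: one E-step per document with a uniform theta_d."""
    num_topics = len(phi[0])
    n_wt = [[0.0] * num_topics for _ in phi]
    for doc in batch:
        for w, count in doc.items():
            p = normalize(phi[w])  # phi_wt * theta_td with uniform theta_td
            for t in range(num_topics):
                n_wt[w][t] += count * p[t]
    return n_wt

def recalculate_phi(n_wt):
    """phi_wt = norm over w of n_wt (no regularizers in this sketch)."""
    num_terms, num_topics = len(n_wt), len(n_wt[0])
    columns = [normalize([n_wt[w][t] for w in range(num_terms)]) for t in range(num_topics)]
    return [[columns[t][w] for t in range(num_topics)] for w in range(num_terms)]

def fit_det_async(batches, phi, eta=2, num_threads=4):
    n_wt = [[0.0] * len(phi[0]) for _ in phi]
    chunks = [batches[i:i + eta] for i in range(0, len(batches), eta)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # submit() returns future objects immediately.
        pending = [pool.submit(process_batch, batch, phi) for batch in chunks[0]]
        for next_chunk in chunks[1:] + [[]]:
            # Start the next chunk with the current phi before merging the previous one.
            upcoming = [pool.submit(process_batch, batch, phi) for batch in next_chunk]
            # Await the futures in submission order: the sum has a fixed order.
            for future in pending:
                for w, row in enumerate(future.result()):
                    for t, value in enumerate(row):
                        n_wt[w][t] += value
            # The main thread updates phi while the `upcoming` batches are running.
            phi = recalculate_phi(n_wt)
            pending = upcoming
    return phi

# Toy data (assumptions): five batches of one document each, 4 terms, 2 topics.
batches = [[{0: 2, 1: 1}], [{1: 3, 2: 1}], [{2: 2, 3: 2}], [{0: 1, 3: 4}], [{1: 1, 2: 1}]]
phi0 = [[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]]
result_one_thread = fit_det_async(batches, phi0, eta=2, num_threads=1)
result_four_threads = fit_det_async(batches, phi0, eta=2, num_threads=4)
```

Because every chunk is processed with a `phi` that was fixed before it started and the increments are summed in a fixed order, the two results above are identical regardless of the number of threads. The price is that each chunk sees a Φ matrix that lags one update behind.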
DetAsync: schema
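[Diagram: MasterModel exposes Transform($\{D_b\}$), FitOffline($\{D_b\}$), and FitOnline($\{D_b\}$). Processor threads run $D_b$ = LoadBatch($b$) and ProcessBatch($D_b$, $\varphi_{wt}$); the main thread recalculates $\varphi_{wt}$. Labeled matrices: $n_{wt}$, $\varphi_{wt}$.]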
DetAsync: Gantt chart
[Figure: Gantt chart, 0 s to 36 s, for the Main thread and processor threads Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge.]
Experiments
Datasets: Wikipedia (|D| = 3.7M articles, |W| = 100K words), Pubmed
(|D| = 8.2M abstracts, |W| = 141K words).
Node: Intel Xeon CPU E5-2650 v2 system with 2 processors, 16 physical
cores in total (32 with hyper-threading).
Metric: perplexity P achieved in the allotted time.
Time: each algorithm was time-boxed to run for 30 minutes.
Peak memory usage (GB):
Dataset   |T|    Offline   Online   DetAsync   Async (v0.6)
Pubmed    1000   5.17      4.68     8.18       13.4
Pubmed    100    1.86      1.62     2.17       3.71
Wiki      1000   1.74      2.44     3.93       7.9
Wiki      100    0.54      0.53     0.83       1.28
Reached perplexity value
[Figure: two panels plotting perplexity against time (min) for the Offline, Online, Async, and DetAsync algorithms.]
Wikipedia (left), Pubmed (right).
DetAsync achieves the best perplexity in the given time-box.
Mining ethnic-related content from blogosphere
Development of concept and methodology for multi-level
monitoring of the state of inter-ethnic relations with the data from
social media.
The objectives of Topic Modeling in this project:
1 Identify ethnic topics in social media big data
2 Identify event and permanent ethnic topics
3 Identify spatio-temporal patterns of the ethnic discourse
4 Estimate the sentiment of the ethnic discourse
5 Develop the monitoring system of inter-ethnic discourse
The Russian Science Foundation grant 15-18-00091 (2015–2017)
(Higher School of Economics, St. Petersburg School of Social Sciences and
Humanities, Internet Studies Laboratory LINIS)
Example ethnonyms for semi-supervised topic modeling
османский русич
восточноевропейский сингапурец
эвенк перуанский
швейцарская словенский
аланский вепсский
саамский ниггер
латыш адыги
литовец сомалиец
цыганка абхаз
ханты-мансийский темнокожий
карачаевский нигериец
кубинка лягушатник
гагаузский камбоджиец
Regularization for finding ethnic topics
smoothing ethnonyms in ethnic topics
sparsing ethnonyms in background topics
smoothing non-ethnonyms in background topics
decorrelating ethnic topics
adding ethnonyms modality and decorrelating their topics
Experiment
LiveJournal collection: 1.58M documents
860K words in the raw vocabulary after lemmatization
90K words after filtering out:
short words with length 2,
rare words with nw < 20 including:
non-Russian words
250 ethnonyms
Semi-supervised ARTM for ethnic topic modeling
The number of ethnic topics found by the model:
model ethnic |S| background |B| ++ +− −+ coh20² tfidf20
PLSA 400 12 15 17 -1447 -1012
LDA 400 12 15 17 -1540 -1121
ARTM-4 250 150 21 27 20 -1651 -1296
ARTM-5 250 150 38 42 30 -1342 -908
ARTM-4:
ethnic topics: sparsing and decorrelating, ethnonyms smoothing
background topics: smoothing, ethnonyms sparsing
ARTM-5:
ARTM-4 + ethnonyms as additional modality
² Coherence and TF-IDF coherence are metrics that match human
judgment of topic quality. A topic is better if it has a higher coherence value.
Ethnic topics examples
(русские): русский, князь, россия, татарин, великий, царить, царь, иван,
император, империя, грозить, государь, век, московская, екатерина, москва,
(русские): акция, организация, митинг, движение, активный, мероприятие,
совет, русский, участник, москва, оппозиция, россия, пикет, протест, проведение,
националист, поддержка, общественный, проводить, участие,
(славяне, византийцы): славянский, святослав, жрец, древние, письменность,
рюрик, летопись, византия, мефодий, хазарский, русский, азбука,
(сирийцы): сирийский, асад, боевик, район, террорист, уничтожать, группировка,
дамаск, оружие, алесио, оппозиция, операция, селение, сша, нусра, турция,
(турки): турция, турецкий, курдский, эрдоган, стамбул, страна, кавказ, горин,
полиция, премьер-министр, регион, курдистан, ататюрк, партия,
(иранцы): иран, иранский, сша, россия, ядерный, президент, тегеран, сирия, оон,
израиль, переговоры, обама, санкция, исламский,
(палестинцы): террорист, израиль, терять, палестинский, палестинец,
террористический, палестина, взрыв, территория, страна, государство,
безопасность, арабский, организация, иерусалим, военный, полиция, газ,
(ливанцы): ливанский, боевик, район, ливан, армия, террорист, али, военный,
хизбалла, раненый, уничтожать, сирия, подразделение, квартал, армейский,
(ливийцы): ливан, демократия, страна, ливийский, каддафи, государство,
алжир, война, правительство, сша, арабский, али, муаммар, сирия,
(евреи): израиль, израильский, страна, израил, война, нетаньяху, тель-авив,
время, сша, сирия, египет, случай, самолет, еврейский, военный, ближний,
Conclusions
BigARTM is an open-source library supporting multimodal
ARTM theory.
The fast implementation of the underlying online EM-algorithm
was improved further, and memory usage was reduced.
A combination of 8 regularizers in the task of ethnic topic
extraction showed the superiority of the ARTM approach.
BigARTM is used to process more than 20 collections in
several different projects.
Join our community!
Contacts: bigartm.org, great-mel@yandex.ru
Murat Apishev great-mel@yandex.ru AIST 2016 33 / 33

More Related Content

PDF
Vladimir Milov and Andrey Savchenko - Classification of Dangerous Situations...
PDF
Speaker Diarization
PDF
010_20160216_Variational Gaussian Process
PDF
A calculus of mobile Real-Time processes
PDF
Training and Inference for Deep Gaussian Processes
PDF
Tutorial of topological data analysis part 3(Mapper algorithm)
PDF
A short and naive introduction to using network in prediction models
PDF
Parallel Optimization in Machine Learning
Vladimir Milov and Andrey Savchenko - Classification of Dangerous Situations...
Speaker Diarization
010_20160216_Variational Gaussian Process
A calculus of mobile Real-Time processes
Training and Inference for Deep Gaussian Processes
Tutorial of topological data analysis part 3(Mapper algorithm)
A short and naive introduction to using network in prediction models
Parallel Optimization in Machine Learning

What's hot (19)

PDF
VAE-type Deep Generative Models
PDF
D143136
PDF
Hyperparameter optimization with approximate gradient
PDF
Safe and Efficient Off-Policy Reinforcement Learning
PPTX
Machine learning applications in aerospace domain
PDF
Data-Driven Recommender Systems
PDF
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
PPTX
Differential privacy without sensitivity [NIPS2016読み会資料]
PDF
Neural Networks: Radial Bases Functions (RBF)
PDF
Matrix and Tensor Tools for Computer Vision
PDF
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PDF
Improving Variational Inference with Inverse Autoregressive Flow
PPTX
Introduction of "TrailBlazer" algorithm
PDF
Lecture 6: Convolutional Neural Networks
PDF
CSC446: Pattern Recognition (LN8)
PDF
safe and efficient off policy reinforcement learning
PDF
The Gaussian Process Latent Variable Model (GPLVM)
PPTX
Bidirectional graph search techniques for finding shortest path in image base...
PDF
Dictionary Learning for Massive Matrix Factorization
VAE-type Deep Generative Models
D143136
Hyperparameter optimization with approximate gradient
Safe and Efficient Off-Policy Reinforcement Learning
Machine learning applications in aerospace domain
Data-Driven Recommender Systems
Time Series Forecasting Using Recurrent Neural Network and Vector Autoregress...
Differential privacy without sensitivity [NIPS2016読み会資料]
Neural Networks: Radial Bases Functions (RBF)
Matrix and Tensor Tools for Computer Vision
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
Improving Variational Inference with Inverse Autoregressive Flow
Introduction of "TrailBlazer" algorithm
Lecture 6: Convolutional Neural Networks
CSC446: Pattern Recognition (LN8)
safe and efficient off policy reinforcement learning
The Gaussian Process Latent Variable Model (GPLVM)
Bidirectional graph search techniques for finding shortest path in image base...
Dictionary Learning for Massive Matrix Factorization
Ad

Viewers also liked (6)

ODP
Topic Modeling
PDF
Topic model
PDF
Topic Models, LDA and all that
POTX
LDA Beginner's Tutorial
PPTX
20151221 public
PPT
Topic Models
Topic Modeling
Topic model
Topic Models, LDA and all that
LDA Beginner's Tutorial
20151221 public
Topic Models
Ad

Similar to Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling (20)

PPT
Parallel algorithms
PPT
Parallel algorithms
PPT
Parallel algorithms
PDF
Gk3611601162
PDF
Massive Matrix Factorization : Applications to collaborative filtering
PPTX
autoTVM
PDF
cis97003
PDF
Safety Verification of Deep Neural Networks_.pdf
PPTX
A Tale of Data Pattern Discovery in Parallel
PPT
introegthnhhdfhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhppt
PPTX
Chapter two
PDF
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
PPT
Research Away Day Jun 2009
PDF
Second order traffic flow models on networks
PDF
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
PDF
Does PostgreSQL respond to the challenge of analytical queries?
PPT
Stacksqueueslists
PPT
Stacks queues lists
PPT
Stacks queues lists
PPT
Stacks queues lists
Parallel algorithms
Parallel algorithms
Parallel algorithms
Gk3611601162
Massive Matrix Factorization : Applications to collaborative filtering
autoTVM
cis97003
Safety Verification of Deep Neural Networks_.pdf
A Tale of Data Pattern Discovery in Parallel
introegthnhhdfhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhppt
Chapter two
Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large C...
Research Away Day Jun 2009
Second order traffic flow models on networks
Semantics2018 Zhang,Petrak,Maynard: Adapted TextRank for Term Extraction: A G...
Does PostgreSQL respond to the challenge of analytical queries?
Stacksqueueslists
Stacks queues lists
Stacks queues lists
Stacks queues lists

More from AIST (20)

PDF
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
PDF
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
PDF
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
PDF
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
PDF
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
PDF
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
PDF
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
PPTX
Иосиф Иткин, Exactpro - TBA
PPTX
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
PDF
George Moiseev - Classification of E-commerce Websites by Product Categories
PDF
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
PDF
Marina Danshina - The methodology of automated decryption of znamenny chants
PDF
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
PPTX
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
PDF
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
PPTX
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
PPTX
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
PDF
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
PPTX
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
PPT
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...
Alexey Mikhaylichenko - Automatic Detection of Bone Contours in X-Ray Images
Алена Ильина и Иван Бибилов, GoTo - GoTo школы, конкурсы и хакатоны
Станислав Кралин, Сайтсофт - Связанные открытые данные федеральных органов ис...
Павел Браславский,Velpas - Velpas: мобильный визуальный поиск
Евгений Цымбалов, Webgames - Методы машинного обучения для задач игровой анал...
Александр Москвичев, EveResearch - Алгоритмы анализа данных в маркетинговых и...
Петр Ермаков, HeadHunter - Модерация резюме: от людей к роботам. Машинное обу...
Иосиф Иткин, Exactpro - TBA
Nikolay Karpov - Evolvable Semantic Platform for Facilitating Knowledge Exchange
George Moiseev - Classification of E-commerce Websites by Product Categories
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Marina Danshina - The methodology of automated decryption of znamenny chants
Edward Klyshinsky - The Corpus of Syntactic Co-occurences: the First Glance
Galina Lavrentyeva - Anti-spoofing Methods for Automatic Speaker Verification...
Kaytoue Mehdi - Finding duplicate labels in behavioral data: an application f...
Valeri Labunets - The bichromatic excitable Schrodinger metamedium
Valeri Labunets - Fast multiparametric wavelet transforms and packets for ima...
Alexander Karkishchenko - Threefold Symmetry Detection in Hexagonal Images Ba...
Artyom Makovetskii - An Efficient Algorithm for Total Variation Denoising
Olesia Kushnir - Reflection Symmetry of Shapes Based on Skeleton Primitive Ch...

Recently uploaded (20)

PPTX
Introduction to Knowledge Engineering Part 1
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Introduction to machine learning and Linear Models
PPT
Quality review (1)_presentation of this 21
PDF
Foundation of Data Science unit number two notes
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
Lecture1 pattern recognition............
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
Database Infoormation System (DBIS).pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
Clinical guidelines as a resource for EBP(1).pdf
Introduction to Knowledge Engineering Part 1
Introduction-to-Cloud-ComputingFinal.pptx
Introduction to machine learning and Linear Models
Quality review (1)_presentation of this 21
Foundation of Data Science unit number two notes
Qualitative Qantitative and Mixed Methods.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Lecture1 pattern recognition............
Galatica Smart Energy Infrastructure Startup Pitch Deck
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Business Ppt On Nestle.pptx huunnnhhgfvu
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Miokarditis (Inflamasi pada Otot Jantung)
Database Infoormation System (DBIS).pptx
climate analysis of Dhaka ,Banglades.pptx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Clinical guidelines as a resource for EBP(1).pdf

Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling

  • 1. Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling Murat Apishev great-mel@yandex.ru Oleksandr Frei oleksandr.frei@gmail.com HSE, MSU, MIPT April 8, 2016
  • 2. Contents 1 Introduction Topic modeling ARTM BigARTM 2 Parallel implementation Synchronous algorithms Asynchronous algorithms Comparison 3 Applications The RSF project Conclusions
  • 3. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Topic modeling Topic modeling — an application of machine learning to statistical text analysis. Topic — a specific terminology of the subject area, the set of terms (unigrams or n−grams) frequently appearing together in documents. Topic model uncovers latent semantic structure of a text collection: topic t is a probability distribution p(w|t) over terms w document d is a probability distribution p(t|d) over topics t Applications — information retrieval for long-text queries, classification, categorization, summarization of texts. Murat Apishev great-mel@yandex.ru AIST 2016 3 / 33
  • 4. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Topic modeling task Given: W — set (vocabulary) of terms (unigrams or n−grams), D — set (collection) of text documents d ⊂ W , ndw — how many times term w appears in document d. Find: model p(w|d) = ∑︀ t∈T 𝜑wt 𝜃td with parameters Φ W×T и Θ T×D : 𝜑wt =p(w|t) — term probabilities w in each topic t, 𝜃td =p(t|d) — topic probabilities t in each document d. Criteria log-likelihood maximization: ∑︁ d∈D ∑︁ w∈d ndw ln ∑︁ t∈T 𝜑wt 𝜃td → max 𝜑,𝜃 ; 𝜑wt 0; ∑︀ w 𝜑wt = 1; 𝜃td 0; ∑︀ t 𝜃td = 1. Issue: the problem of stochastic matrix factorization is ill-posed: ΦΘ = (ΦS)(S−1Θ) = Φ′Θ′. Murat Apishev great-mel@yandex.ru AIST 2016 4 / 33
  • 5. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM PLSA and EM-algorithm Log-likelihood maximization: ∑︁ d∈D ∑︁ w∈W ndw ln ∑︁ t 𝜑wt 𝜃td → max Φ,Θ EM-algorithm: the simple iteration method for the set of equations E-шаг: M-шаг: ⎧ ⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎩ ptdw = norm t∈T (︀ 𝜑wt 𝜃td )︀ 𝜑wt = norm w∈W (︀ nwt )︀ , nwt = ∑︀ d∈D ndw ptdw 𝜃td = norm t∈T (︀ ntd )︀ , ntd = ∑︀ w∈d ndw ptdw where norm i∈I xi = max{xi ,0}∑︀ j∈I max{xj ,0} Murat Apishev great-mel@yandex.ru AIST 2016 5 / 33
  • 6. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM ARTM and regularized EM-algorithm Log-likelihood maximization with additive regularization criterion R: ∑︁ d∈D ∑︁ w∈W ndw ln ∑︁ t 𝜑wt 𝜃td + R(Φ, Θ) → max Φ,Θ EM-algorithm: the simple iteration method for the set of equations E-шаг: M-шаг: ⎧ ⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎩ ptdw = norm t∈T (︀ 𝜑wt 𝜃td )︀ 𝜑wt = norm w∈W (︁ nwt + 𝜑wt 𝜕R 𝜕𝜑wt )︁ , nwt = ∑︀ d∈D ndw ptdw 𝜃td = norm t∈T (︁ ntd + 𝜃td 𝜕R 𝜕𝜃td )︁ , ntd = ∑︀ w∈d ndw ptdw Murat Apishev great-mel@yandex.ru AIST 2016 6 / 33
  • 7. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Examples of regularizers Many Bayesian models can be reinterpreted as regularizers in ARTM. Some examples of regularizes: 1 Smoothing Φ / Θ (leads to popular LDA model) 2 Sparsing Φ / Θ 3 Decorrelation of topics in Φ 4 Semi-supervised learning 5 Topic coherence maximization 6 Topic selection 7 . . . Murat Apishev great-mel@yandex.ru AIST 2016 7 / 33
  • 8. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM Multimodal Topic Model Multimodal Topic Model finds topical distributions for terms p(w|t), authors p(a|t), time p(y|t), objects of images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t), and binds all these modalities into a single topic model. Topics of documents Words and keyphrases of topics doc1: doc2: doc3: doc4: ... Text documents Topic Modeling D o c u m e n t s T o p i c s Metadata: Authors Data Time Conference Organization URL etc. Ads Images Links Users Murat Apishev great-mel@yandex.ru AIST 2016 8 / 33
  • 9. Introduction Parallel implementation Applications Topic modeling ARTM BigARTM M-ARTM and multimodal regularized EM-algorithm W m is a vocabulary of terms of m-th modality, m ∈ M, W = W 1 ⊔ W m as a joint vocabulary of all modalities Multimodal log-likelihood maximization with additive regularization criterion R: ∑︁ m∈M 𝜆m ∑︁ d∈D ∑︁ w∈W m ndw ln ∑︁ t 𝜑wt 𝜃td + R(Φ, Θ) → max Φ,Θ EM-algorithm: the simple iteration method for the set of equations E-шаг: M-шаг: ⎧ ⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎩ ptdw = norm t∈T (︀ 𝜑wt 𝜃td )︀ 𝜑wt = norm w∈W m (︁ nwt + 𝜑wt 𝜕R 𝜕𝜑wt )︁ , nwt = ∑︀ d∈D 𝜆m(w)ndw ptdw 𝜃td = norm t∈T (︁ ntd + 𝜃td 𝜕R 𝜕𝜃td )︁ , ntd = ∑︀ w∈d 𝜆m(w)ndw ptdw Murat Apishev great-mel@yandex.ru AIST 2016 9 / 33
  • 10. BigARTM project
    BigARTM features:
    - Fast [1] parallel and online processing of Big Data
    - Multimodal and regularized topic modeling
    - Built-in library of regularizers and quality measures
    BigARTM community:
    - Open-source: https://github.com/bigartm
    - Documentation: http://bigartm.org
    BigARTM license and programming environment:
    - Freely available for commercial usage (BSD 3-Clause license)
    - Cross-platform: Windows, Linux, Mac OS X (32 bit, 64 bit)
    - Programming APIs: command line, C++, Python
    [1] Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. Analysis of Images, Social Networks and Texts. 2015.
  • 11. BigARTM vs. Gensim vs. Vowpal Wabbit LDA
    3.7M articles from Wikipedia, 100K unique words

    Framework      procs   train     inference   perplexity
    BigARTM        1       35 min    72 sec      4000
    LdaModel       1       369 min   395 sec     4161
    VW.LDA         1       73 min    120 sec     4108
    BigARTM        4       9 min     20 sec      4061
    LdaMulticore   4       60 min    222 sec     4111
    BigARTM        8       4.5 min   14 sec      4304
    LdaMulticore   8       57 min    224 sec     4455

    procs = number of parallel threads
    inference = time to infer $\theta_d$ for 100K held-out documents
    perplexity P is calculated on held-out documents:
    $$P(D) = \exp \left( - \frac{1}{n} \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt} \theta_{td} \right), \quad n = \sum_{d} n_d$$
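The perplexity formula above translates directly into code. The sketch below assumes the same illustrative data layout as elsewhere in these notes (`n_dw` as a list of `{word_id: count}` dictionaries, `phi[w][t]`, `theta[t][d]`); it is not taken from BigARTM.

```python
import math


def perplexity(n_dw, phi, theta):
    """P(D) = exp(-(1/n) * sum_d sum_w n_dw * ln sum_t phi_wt * theta_td)."""
    T = len(theta)
    log_likelihood = 0.0
    n = 0  # total number of term occurrences, n = sum_d n_d
    for d, doc in enumerate(n_dw):
        for w, count in doc.items():
            p_wd = sum(phi[w][t] * theta[t][d] for t in range(T))
            log_likelihood += count * math.log(p_wd)
            n += count
    return math.exp(-log_likelihood / n)
```

A useful sanity check: a model that assigns every one of |W| words the same probability has perplexity |W|, so lower values mean the model predicts held-out text better than uniform guessing.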
  • 12. Offline algorithm
    - The collection is split into batches.
    - The offline algorithm performs several scans over the collection.
    - Each thread processes one batch at a time, inferring $n_{wt}$ and $\theta_{td}$ (using Θ regularization).
    - After each scan the algorithm recalculates the Φ matrix and applies Φ regularizers according to the equation
      $$\varphi_{wt} = \operatorname*{norm}_{w \in W} \left( n_{wt} + \varphi_{wt} \frac{\partial R}{\partial \varphi_{wt}} \right)$$
    - The implementation never stores the entire Θ matrix at any given time.
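The offline scheme can be sketched as follows. This is an illustration under stated assumptions, not BigARTM code: `process_batch`, `fit_offline`, the number of inner iterations, and the use of Python's `ThreadPoolExecutor` are choices made for the example, and regularizers are omitted. Note that `theta_d` exists only while its document is being processed, which mirrors the point that the full Θ matrix is never stored.

```python
from concurrent.futures import ThreadPoolExecutor


def norm(values):
    """Clip negatives to zero, then scale so the values sum to 1."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped


def process_batch(batch, phi, inner_iters=5):
    """Infer theta_d per document and return this batch's n_wt counters."""
    W, T = len(phi), len(phi[0])
    n_wt = [[0.0] * T for _ in range(W)]
    for doc in batch:  # doc is a dict {word_id: count}
        theta_d = [1.0 / T] * T  # discarded once the document is done
        for _ in range(inner_iters):
            n_td = [0.0] * T
            for w, count in doc.items():
                p = norm([phi[w][t] * theta_d[t] for t in range(T)])
                for t in range(T):
                    n_td[t] += count * p[t]
            theta_d = norm(n_td)
        for w, count in doc.items():
            p = norm([phi[w][t] * theta_d[t] for t in range(T)])
            for t in range(T):
                n_wt[w][t] += count * p[t]
    return n_wt


def fit_offline(batches, phi, scans=3, workers=2):
    """Several scans; Phi is recalculated once after each full scan."""
    W, T = len(phi), len(phi[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(scans):
            n_wt = [[0.0] * T for _ in range(W)]
            # map() yields results in batch order, so the sum is reproducible
            for part in pool.map(lambda b, p=phi: process_batch(b, p), batches):
                for w in range(W):
                    for t in range(T):
                        n_wt[w][t] += part[w][t]
            columns = [norm([n_wt[w][t] for w in range(W)]) for t in range(T)]
            phi = [[columns[t][w] for t in range(T)] for w in range(W)]
    return phi
```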
  • 13. Offline algorithm: Gantt chart
    [Gantt chart, 0 s to 36 s. Rows: Main, Proc-1 to Proc-6. Legend: Batch processing, Norm.]
    This and further Gantt charts were created using the NYTimes dataset: https://archive.ics.uci.edu/ml/datasets/Bag+of+Words. The dataset contains ≈ 300k documents, but each algorithm was run on a subset (from 70% to 100%) to achieve a working time of ≈ 36 sec.
  • 14. Online algorithm
    - The algorithm is a generalization of the Online variational Bayes algorithm for the LDA model.
    - Online ARTM improves the convergence rate of Offline ARTM by recalculating the matrix Φ after every 𝜂 batches.
    - It is better suited for large and heterogeneous text collections.
    - A weighted sum of $n_{wt}$ from the previous and the current 𝜂 batches controls the importance of new information.
    - Issue: no thread has useful work to do while the Φ matrix is being updated.
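A minimal sketch of the synchronous online update follows. The names `fit_online`, `norm_columns`, and the decay weight `rho` are hypothetical, and the exact weighting used in BigARTM may differ; the slide only states that old and new counters are combined as a weighted sum. `process_batch(batch, phi)` is passed in as a parameter and is assumed to return the batch's $n_{wt}$ increment as a W x T list of lists.

```python
def norm_columns(n_wt):
    """Turn each topic column of n_wt into a distribution over words."""
    W, T = len(n_wt), len(n_wt[0])
    phi = [[0.0] * T for _ in range(W)]
    for t in range(T):
        total = sum(max(n_wt[w][t], 0.0) for w in range(W))
        for w in range(W):
            phi[w][t] = max(n_wt[w][t], 0.0) / total if total > 0 else 0.0
    return phi


def fit_online(batches, phi, process_batch, eta=2, rho=0.5):
    """One pass over the batches; Phi is recalculated after every eta batches."""
    W, T = len(phi), len(phi[0])
    n_wt = [[0.0] * T for _ in range(W)]
    for start in range(0, len(batches), eta):
        fresh = [[0.0] * T for _ in range(W)]
        for batch in batches[start:start + eta]:
            part = process_batch(batch, phi)
            for w in range(W):
                for t in range(T):
                    fresh[w][t] += part[w][t]
        # Weighted sum of the previous counters and the current eta batches
        # (rho is an assumed decay weight, not a documented parameter).
        for w in range(W):
            for t in range(T):
                n_wt[w][t] = rho * n_wt[w][t] + (1.0 - rho) * fresh[w][t]
        # Synchronous update: in a threaded version, workers wait here.
        phi = norm_columns(n_wt)
    return phi
```

The update of Φ sits between two groups of batches, so a multi-threaded version of this loop leaves every worker idle during `norm_columns`, which is the issue named in the last bullet.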
  • 15. Online algorithm: Gantt chart
    [Gantt chart, 0 s to 36 s. Rows: Main, Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge.]
  • 16. Async: Asynchronous online algorithm
    [Diagram: a queue of batches $\{D_b\}$ feeds the Processor threads, which run ProcessBatch($D_b$, $\varphi_{wt}$) and emit increments $\tilde{n}_{wt}$; a queue $\{\tilde{n}_{wt}\}$ feeds the Merger thread, which accumulates $\tilde{n}_{wt}$ and recalculates $\varphi_{wt}$; Sync().]
    - Faster asynchronous implementation (it was compared with Gensim and VW LDA).
    - Issue: Merger and DataLoader can become a bottleneck.
    - Issue: the result of such an algorithm is non-deterministic.
  • 17. Async: Gantt chart in normal case
    [Gantt chart, 0 s to 36 s. Rows: Merger, Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
  • 18. Async: Gantt chart in bad case
    [Gantt chart, 0 s to 36 s. Rows: Merger, Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge matrix, Merge increments.]
  • 19. DetAsync: Deterministic asynchronous online algorithm
    - To avoid non-deterministic behavior, replace the update after the first 𝜂 finished batches with an update after a given, predetermined set of 𝜂 batches.
    - Remove the Merger and DataLoader threads. Each Processor thread reads batches and writes results into the $n_{wt}$ matrix by itself.
    - Processor threads get a set of batches to process, start processing, and immediately return a future object to the main thread.
    - The main thread can process the updates of the Φ matrix while the Processor threads work, and then get the result by passing the received future object to the Await function.
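The future-based scheme described above can be sketched with Python's `concurrent.futures`. This is one reading of the slide, not the BigARTM source: the names `async_process_batches`, `await_result`, `fit_det_async`, and the weight `rho` are hypothetical, and `process_batch(batch, phi)` is assumed to return a W x T list of counter increments. Two choices make the result reproducible: the set of batches behind each update is fixed in advance, and increments are summed in submission order rather than completion order.

```python
from concurrent.futures import ThreadPoolExecutor


def norm_columns(n_wt):
    """Turn each topic column of n_wt into a distribution over words."""
    W, T = len(n_wt), len(n_wt[0])
    phi = [[0.0] * T for _ in range(W)]
    for t in range(T):
        total = sum(max(n_wt[w][t], 0.0) for w in range(W))
        for w in range(W):
            phi[w][t] = max(n_wt[w][t], 0.0) / total if total > 0 else 0.0
    return phi


def async_process_batches(pool, batches, phi, process_batch):
    """Start work on a fixed set of batches and return futures at once."""
    return [pool.submit(process_batch, batch, phi) for batch in batches]


def await_result(futures, W, T):
    """Wait for every future; sum increments in submission order."""
    total = [[0.0] * T for _ in range(W)]
    for future in futures:  # fixed order, independent of finish times
        part = future.result()
        for w in range(W):
            for t in range(T):
                total[w][t] += part[w][t]
    return total


def fit_det_async(batches, phi, process_batch, eta=2, rho=0.5, workers=4):
    W, T = len(phi), len(phi[0])
    n_wt = [[0.0] * T for _ in range(W)]
    groups = [batches[i:i + eta] for i in range(0, len(batches), eta)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = async_process_batches(pool, groups[0], phi, process_batch)
        for i in range(len(groups)):
            upcoming = None
            if i + 1 < len(groups):
                # The next group starts before this update, so it sees a Phi
                # that is one update old; workers stay busy during the merge.
                upcoming = async_process_batches(
                    pool, groups[i + 1], phi, process_batch)
            fresh = await_result(pending, W, T)
            for w in range(W):
                for t in range(T):
                    n_wt[w][t] = rho * n_wt[w][t] + (1.0 - rho) * fresh[w][t]
            phi = norm_columns(n_wt)  # new list; workers keep the old one
            pending = upcoming
    return phi
```

Because `norm_columns` builds a new matrix instead of mutating the old one, worker threads that are still reading the previous Φ are not disturbed by the update on the main thread.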
  • 20. DetAsync: schema
    [Diagram: MasterModel exposes Transform($\{D_b\}$), FitOffline($\{D_b\}$), and FitOnline($\{D_b\}$). Processor threads run $D_b$ = LoadBatch(b) and ProcessBatch($D_b$, $\varphi_{wt}$). The Main thread recalculates $\varphi_{wt}$. The matrices $n_{wt}$ and $\varphi_{wt}$ sit between the two.]
  • 21. DetAsync: Gantt chart
    [Gantt chart, 0 s to 36 s. Rows: Main, Proc-1 to Proc-6. Legend: Odd batch, Even batch, Norm, Merge.]
  • 22. Experiments
    Datasets: Wikipedia (|D| = 3.7M articles, |W| = 100K words), Pubmed (|D| = 8.2M abstracts, |W| = 141K words).
    Node: Intel Xeon CPU E5-2650 v2 system with 2 processors, 16 physical cores in total (32 with hyper-threading).
    Metric: perplexity P value achieved in the allotted time.
    Time: each algorithm was time-boxed to run for 30 minutes.
    Peak memory usage (GB):

    Collection   |T|    Offline   Online   DetAsync   Async (v0.6)
    Pubmed       1000   5.17      4.68     8.18       13.4
    Pubmed       100    1.86      1.62     2.17       3.71
    Wiki         1000   1.74      2.44     3.93       7.9
    Wiki         100    0.54      0.53     0.83       1.28
  • 23. Reached perplexity value
    [Two plots of perplexity against time (min) for Offline, Online, Async, and DetAsync: Wikipedia (left), Pubmed (right).]
    DetAsync achieves the best perplexity in the given time-box.
  • 24. Mining ethnic-related content from blogosphere
    Development of a concept and methodology for multi-level monitoring of the state of inter-ethnic relations with data from social media.
    The objectives of Topic Modeling in this project:
    1. Identify ethnic topics in social media big data
    2. Identify event and permanent ethnic topics
    3. Identify spatio-temporal patterns of the ethnic discourse
    4. Estimate the sentiment of the ethnic discourse
    5. Develop the monitoring system of inter-ethnic discourse
    The Russian Science Foundation grant 15-18-00091 (2015–2017) (Higher School of Economics, St. Petersburg School of Social Sciences and Humanities, Internet Studies Laboratory LINIS)
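  • 25. Example ethnonyms for semi-supervised topic modeling
    османский (Ottoman), русич (Rusich, an archaic word for a Russian), восточноевропейский (East European), сингапурец (Singaporean), эвенк (Evenk), перуанский (Peruvian), швейцарская (Swiss, fem.), словенский (Slovenian), аланский (Alanian), вепсский (Vepsian), саамский (Sami), ниггер (a racial slur), латыш (Latvian), адыги (Adyghe people), литовец (Lithuanian), сомалиец (Somali), цыганка (Romani woman), абхаз (Abkhaz), ханты-мансийский (Khanty-Mansi, adj.), темнокожий (dark-skinned), карачаевский (Karachay, adj.), нигериец (Nigerian), кубинка (Cuban woman), лягушатник (a derogatory word for a French person), гагаузский (Gagauz, adj.), камбоджиец (Cambodian)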
  • 29. Regularization for finding ethnic topics
    - smoothing ethnonyms in ethnic topics
    - sparsing ethnonyms in background topics
    - smoothing non-ethnonyms in background topics
    - decorrelating ethnic topics
    - adding ethnonyms modality and decorrelating their topics
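The first three items in this list can be read as one table of per-cell offsets added to the counters $n_{wt}$ before normalization: a positive offset smooths a word into a topic, a negative one sparses it out. The sketch below builds such a table. The function name, the coefficient values, and the dictionary layout are all hypothetical; decorrelation and the extra ethnonym modality are not covered.

```python
def regularizer_offsets(vocab, ethnonyms, ethnic_topics, background_topics,
                        smooth=0.1, sparse=0.1):
    """Return {(word, topic): offset} to add to n_wt before normalization.

    smooth and sparse are illustrative coefficients, not values from the talk.
    """
    offsets = {}
    for word in vocab:
        is_ethnonym = word in ethnonyms
        for topic in ethnic_topics:
            # Ethnonyms are smoothed into ethnic topics; other words untouched.
            offsets[word, topic] = smooth if is_ethnonym else 0.0
        for topic in background_topics:
            # Ethnonyms are pushed out of background topics;
            # ordinary words are smoothed into them.
            offsets[word, topic] = -sparse if is_ethnonym else smooth
    return offsets
```

For example, with `vocab = ["эвенк", "митинг"]` and `ethnonyms = {"эвенк"}`, the ethnonym gets `+smooth` in every ethnic topic and `-sparse` in every background topic, while "митинг" gets `+smooth` only in background topics.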
  • 30. Experiment
    LiveJournal collection:
    - 1.58M documents
    - 860K words in the raw vocabulary after lemmatization
    - 90K words after filtering out:
      - short words with length ≤ 2,
      - rare words with $n_w < 20$, including non-Russian words
    - 250 ethnonyms
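The vocabulary filtering step can be sketched as below. The function name and the input layout (each document as a list of lemmas) are assumptions for the example, and `n_w` is taken to be the total count of a word over the whole collection. The defaults follow the thresholds on the slide, read as "drop words of length 2 or less and words seen fewer than 20 times".

```python
from collections import Counter


def filter_vocabulary(documents, min_length=3, min_count=20):
    """Keep words that are long enough and frequent enough in the collection."""
    counts = Counter(token for doc in documents for token in doc)
    return {
        word
        for word, n_w in counts.items()
        if len(word) >= min_length and n_w >= min_count
    }
```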
  • 31. Semi-supervised ARTM for ethnic topic modeling
    The number of ethnic topics found by the model:

    model    topics (ethnic |S| / background |B|)   ++   +−   −+   coh20 [2]   tfidf20
    PLSA     400                                    12   15   17   -1447       -1012
    LDA      400                                    12   15   17   -1540       -1121
    ARTM-4   250 / 150                              21   27   20   -1651       -1296
    ARTM-5   250 / 150                              38   42   30   -1342       -908

    ARTM-4:
    - ethnic topics: sparsing and decorrelating, ethnonyms smoothing
    - background topics: smoothing, ethnonyms sparsing
    ARTM-5: ARTM-4 + ethnonyms as an additional modality
    [2] Coherence and TF-IDF coherence are metrics that match the human judgment of topic quality. A topic is better if it has a higher coherence value.
  • 32. Ethnic topics examples (top words, translated from Russian)
    (Russians): Russian, prince, Russia, Tatar, great, reign, tsar, Ivan, emperor, empire, threaten, sovereign, century, Moscow (adj.), Catherine, Moscow
    (Russians): campaign, organization, rally, movement, active, event, council, Russian, participant, Moscow, opposition, Russia, picket, protest, holding, nationalist, support, public, hold, participation
    (Slavs, Byzantines): Slavic, Sviatoslav, pagan priest, ancient, writing system, Rurik, chronicle, Byzantium, Methodius, Khazar, Russian, alphabet
    (Syrians): Syrian, Assad, militant, district, terrorist, destroy, faction, Damascus, weapon, Alesio, opposition, operation, village, USA, Nusra, Turkey
    (Turks): Turkey, Turkish, Kurdish, Erdogan, Istanbul, country, Caucasus, Gorin, police, prime minister, region, Kurdistan, Ataturk, party
    (Iranians): Iran, Iranian, USA, Russia, nuclear, president, Tehran, Syria, UN, Israel, negotiations, Obama, sanction, Islamic
    (Palestinians): terrorist, Israel, lose, Palestinian (adj.), Palestinian (noun), terrorist (adj.), Palestine, explosion, territory, country, state, security, Arab, organization, Jerusalem, military, police, gas
    (Lebanese): Lebanese, militant, district, Lebanon, army, terrorist, Ali, military, Hezbollah, wounded, destroy, Syria, unit, quarter, army (adj.)
    (Libyans): Lebanon, democracy, country, Libyan, Gaddafi, state, Algeria, war, government, USA, Arab, Ali, Muammar, Syria
    (Jews): Israel, Israeli, country, Izrail, war, Netanyahu, Tel Aviv, time, USA, Syria, Egypt, incident, airplane, Jewish, military, near
  • 33. Conclusions
    - BigARTM is an open-source library supporting multimodal ARTM theory.
    - The fast implementation of the underlying online EM-algorithm was improved further.
    - Memory usage was reduced.
    - A combination of 8 regularizers in the task of ethnic topic extraction showed the superiority of the ARTM approach.
    - BigARTM is used to process more than 20 collections in several different projects.
    Join our community!
    Contacts: bigartm.org, great-mel@yandex.ru