Tensors for topic modeling and deep learning on AWS SageMaker

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Tensors for Large-scale Topic Modeling and Deep Learning
Anima Anandkumar, Principal Scientist, Amazon AI
MCL337
November 29, 2017

Machine learning in many domains…
• Image Understanding: Object Classification
• Text Understanding: Topic Detection

Topic Detection
Topics: Government, Information Technology, Politics

Trinity in Machine Learning
• Algorithms
• Compute
• Data

AWS ML Stack
Application Services
• Vision: Rekognition Image, Rekognition Video
• Speech: Polly, Transcribe
• Language: Lex, Translate, Comprehend
Platform Services
• Amazon Machine Learning
• Mechanical Turk
• Spark & EMR
• Amazon SageMaker
• AWS DeepLens
Frameworks & Infrastructure
• AWS Deep Learning AMI
• Frameworks: Apache MXNet, PyTorch, Cognitive Toolkit, Keras, Caffe2 & Caffe, TensorFlow, Gluon
• Infrastructure: GPU (P3 Instances), CPU (C5 Instances), Mobile, IoT (Greengrass)

Amazon Comprehend for Text

ML Algorithms in SageMaker

Introducing Amazon SageMaker
The quickest and easiest way to get ML models from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second

More than just general purpose algorithms
• XGBoost, FM, and Linear for classification and regression
• K-means and PCA for clustering and dimensionality reduction
• Image classification with convolutional neural networks
• LDA and NTM for topic modeling, seq2seq for translation

LDA topic model on AWS SageMaker

LDA Topic Models

Topic Models for Document Categorization
Topics: Government, Information Technology, Politics

• Labeled sample documents hard to obtain
• How do we discover topics automatically?

ML Algorithms
• Unsupervised Learning
• Supervised Learning

Warm-up: Clustering
• Each data point is part of a cluster
• Data point = document
• Cluster = topic
But documents have multiple topics!

LDA Topic Model: Beyond Clustering
Topics: Justice, Education, Sports
Words: brain, comput, data, evolve, gene, neuron

Training and Inference in SageMaker LDA
• Training using spectralLDA algorithm
• Inference using stochastic gradient descent (SGD)
Diagram: document corpus → learning → LDA model (topic-word matrix) → inference
Example words: brain, comput, data, evolve, gene, neuron
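
To make the inference step concrete, here is a minimal sketch of estimating topic proportions for one document when the topic-word matrix is already known. It is an illustration only: the slide says SageMaker LDA uses SGD for inference, whereas this sketch swaps in a simple EM-style fixed-point update because it is short and stays on the probability simplex. The function name, the toy matrix, and the iteration count are all assumptions.

```python
def infer_topic_proportions(counts, topic_word, num_iters=200):
    """Estimate P[topic = j | document] from word counts.

    counts[i] is how often word i occurs in the document and
    topic_word[j][i] is P[word = i | topic = j]. Uses an EM-style
    update (an assumption, not the SGD procedure named on the slide).
    """
    num_topics = len(topic_word)
    total = sum(counts)
    theta = [1.0 / num_topics] * num_topics  # start from uniform proportions
    for _ in range(num_iters):
        new_theta = [0.0] * num_topics
        for i, c in enumerate(counts):
            if c == 0:
                continue
            # Probability of word i under the current mixture of topics.
            p_word = sum(theta[j] * topic_word[j][i] for j in range(num_topics))
            for j in range(num_topics):
                # Share of word i's occurrences attributed to topic j.
                new_theta[j] += c * theta[j] * topic_word[j][i] / p_word
        theta = [t / total for t in new_theta]
    return theta


# Hypothetical toy model: 2 topics over a 6-word vocabulary.
TOPIC_WORD = [
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.4, 0.2, 0.2],
]
theta = infer_topic_proportions([20, 20, 10, 0, 0, 0], TOPIC_WORD)
```

Because the topic-word matrix is fixed here, the same function can be run on any new document without retraining, which matches the decoupling of training and inference described in the talk.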

Notebook Demo
https://github.com/awslabs/amazon-sagemaker-examples

LDA synthetic data generation
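
As a rough illustration of how synthetic LDA data can be generated, the sketch below follows the standard LDA generative process: draw topic proportions for each document from a Dirichlet prior, then draw a topic and a word for every position. The toy topic-word matrix, the prior, and the function names are hypothetical and are not taken from the demo notebook.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, topic_word, alpha, seed=0):
    """Generate bag-of-words documents from the LDA generative process.

    topic_word[j][i] is P[word = i | topic = j]; alpha is the Dirichlet
    prior over topic proportions. Returns (count vectors, proportions).
    """
    rng = random.Random(seed)
    num_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    docs, proportions = [], []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, rng)  # P[topic = j | document]
        counts = [0] * vocab_size
        for _ in range(doc_length):
            topic = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(range(vocab_size), weights=topic_word[topic])[0]
            counts[word] += 1
        docs.append(counts)
        proportions.append(theta)
    return docs, proportions


# Hypothetical toy model: 2 topics over a 6-word vocabulary.
TOPIC_WORD = [
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.4, 0.2, 0.2],
]
docs, proportions = generate_corpus(5, 50, TOPIC_WORD, alpha=[0.5, 0.5])
```

Since the true topic-word matrix and proportions are known for synthetic data, a learned model can be compared against them directly.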

Performance Analysis

Qualitative Analysis

New York Times topics
Topics: Lifestyle, Politics, Sports, Business

PubMed Topics
Topics: Blood, Clinical trials, Treatment, Public health, Cancer/genetics

Example Document in NYTimes
Topics: Government, Information Technology, Politics

Example Document in NYTimes
Topics: Business, Information Technology

Performance Benchmarks

SageMaker LDA training is faster
Figure: Training time for NYTimes (300000 documents). Time in minutes vs. number of topics, Spectral vs. Mallet. 22x faster on average.
Figure: Training time for PubMed (8 million documents). Time in minutes vs. number of topics, Spectral vs. Mallet. 12x faster on average.
• Mallet is an open-source framework for topic modeling
• Mallet does training and inference together
• Benchmarks on AWS SageMaker Platform

SageMaker LDA is cheaper on AWS
Figure: Training cost for NYTimes (300000 documents). Cost ($) vs. number of topics, Spectral vs. Mallet. 22x cheaper on average.
Figure: Training cost for PubMed (1 million documents). Cost ($) vs. number of topics, Spectral vs. Mallet. 12x cheaper on average.
• Faster training translates to lower costs on AWS
• Benchmarks on C4.8x

SageMaker LDA inference is faster
Figure: Inference time for NYTimes (300000 documents). Inference time in minutes vs. number of topics, SpectralLDA vs. Mallet. 13x faster on average.
Figure: Inference time for PubMed (1 million documents). Inference time in minutes vs. number of topics, SpectralLDA vs. Mallet. 3.5x faster on average.

SageMaker LDA training + inference faster
Figure: Total time (training + inference) for NYTimes (300000 documents). Total time in minutes vs. number of topics, SpectralLDA vs. Mallet. 7x faster on average.
Figure: Total time (training + inference) for PubMed (1 million documents). Total time in minutes vs. number of topics, SpectralLDA vs. Mallet. 2.5x faster on average.

SageMaker LDA has better topic coherence
Figure: Topic coherence for NYTimes (300000 documents). PMI vs. number of topics, Mallet PMI vs. Spectral PMI.
• Topic coherence = Pointwise Mutual Information (PMI)
• PMI: co-occurrence of top words in a topic
• Higher PMI represents better topic quality and is a better representative of human judgement
• Human judgement not highly correlated to log likelihood of topic model
Faster algorithm with competitive topic quality
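
To illustrate the coherence measure, the sketch below scores a topic by averaging the pointwise mutual information over all pairs of its top words, with probabilities estimated as document frequencies. The exact PMI variant and reference corpus used for the benchmark are not stated on the slide, so the smoothing constant `eps`, the function name, and the toy documents are assumptions.

```python
import math
from itertools import combinations


def topic_pmi(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    PMI(a, b) = log(P(a, b) / (P(a) * P(b))), where each probability is
    the fraction of documents containing the word (or both words).
    eps avoids taking the log of zero when a pair never co-occurs.
    """
    doc_sets = [set(doc) for doc in documents]
    num_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / num_docs

    scores = []
    for a, b in combinations(top_words, 2):
        joint = prob(a, b)
        scores.append(math.log((joint + eps) / (prob(a) * prob(b) + eps)))
    return sum(scores) / len(scores)


# Hypothetical toy corpus; each document is a list of tokens.
DOCS = [
    ["gene", "neuron", "brain"],
    ["gene", "brain", "evolve"],
    ["court", "justice", "law"],
    ["justice", "law", "gene"],
]
coherent = topic_pmi(["gene", "brain"], DOCS)  # words that co-occur
mixed = topic_pmi(["brain", "law"], DOCS)      # words that never co-occur
```

Words that tend to appear in the same documents give a positive score, while a topic whose top words never appear together is penalized heavily.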

Neural Topic Modeling on SageMaker
Architecture: input term counts vector → encoder (feedforward net) → document posterior → sampled document representation → decoder (softmax) → output term counts vector
Figure: Perplexity vs. number of topics, NTM vs. Other.
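
For readers unfamiliar with the metric on this slide, perplexity is the exponential of the negative average log-likelihood per word, so lower is better. The sketch below computes it for a single document under a given word distribution. How the talk's benchmark computed perplexity is not stated, so this is the textbook definition with hypothetical names.

```python
import math


def perplexity(counts, word_probs):
    """Perplexity of a bag-of-words document under a word distribution.

    counts[i] is the number of occurrences of word i and word_probs[i]
    is the model's probability for word i.
    """
    total = sum(counts)
    log_lik = sum(c * math.log(p) for c, p in zip(counts, word_probs) if c > 0)
    return math.exp(-log_lik / total)


# A model that is uniform over a 4-word vocabulary is exactly as
# uncertain as a fair 4-way choice for every word.
uniform = perplexity([1, 2, 3, 4], [0.25] * 4)
```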

Tensor Methods for LDA Topic Models

Tensors in ML Algorithms

LDA Topic Model
Topics: Justice, Education, Sports
Words: brain, comput, data, evolve, gene, neuron

Learning LDA Model
• Topic-word matrix: P[word = i | topic = j]
• Topic proportions: P[topic = j | document]
• Moment tensor: co-occurrence of word triplets
Figure: moment tensor = sum of rank-one components, one per topic (crime, sports, education)
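
A minimal sketch of what "co-occurrence of word triplets" means in practice: for every document, count each ordered triple of words taken from three distinct positions, then average over documents to get an empirical third-order tensor. This shows only the raw co-occurrence statistic. The spectral method for LDA applies further correction terms to these moments (see the JMLR 2014 paper cited in this deck), which are omitted here. The function name and toy documents are hypothetical.

```python
from itertools import permutations


def triple_cooccurrence(documents, vocab):
    """Empirical third-order word co-occurrence tensor.

    Entry [a][b][c] estimates the probability that three distinct
    positions in a document hold words a, b and c, averaged over all
    documents with at least three words.
    """
    index = {w: i for i, w in enumerate(vocab)}
    dim = len(vocab)
    tensor = [[[0.0] * dim for _ in range(dim)] for _ in range(dim)]
    num_docs = 0
    for doc in documents:
        ids = [index[w] for w in doc]
        triples = list(permutations(ids, 3))  # ordered, distinct positions
        if not triples:
            continue  # fewer than three words: no triplets to count
        num_docs += 1
        for a, b, c in triples:
            tensor[a][b][c] += 1.0 / len(triples)
    if num_docs == 0:
        raise ValueError("need at least one document with three or more words")
    return [[[x / num_docs for x in row] for row in mat] for mat in tensor]


VOCAB = ["brain", "comput", "data", "evolve", "gene", "neuron"]
DOCS = [
    ["gene", "evolve", "gene", "data"],
    ["brain", "neuron", "comput", "data"],
    ["neuron", "brain", "gene"],
]
tensor = triple_cooccurrence(DOCS, VOCAB)
```

The dense nested-list layout is only practical for a tiny vocabulary. It is used here to keep the example dependency-free.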

Tensor Decompositions
Spectral Decomposition

Why Tensors?
Statistical reasons:
• Incorporate higher order relationships in data
• Discover hidden topics (not possible with matrix methods)
Computational reasons:
• Tensor algebra is parallelizable like linear algebra.
• Faster than other algorithms for LDA
• Flexible: Training and inference decoupled
• Guaranteed in theory to converge to global optimum
A. Anandkumar et al., Tensor Decompositions for Learning Latent Variable Models, JMLR 2014.
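
To give a feel for how a tensor decomposition recovers hidden components, here is a small sketch of the tensor power method with deflation on a symmetric third-order tensor whose components are orthonormal. It is a toy illustration of the general technique and not the SageMaker implementation. A real pipeline would first whiten the empirical moments so that the components become orthogonal. The weights, vectors, and function names are assumptions.

```python
import math
import random


def apply_tensor(tensor, u):
    """Return the vector T(I, u, u) for a third-order tensor T."""
    dim = len(u)
    return [
        sum(tensor[a][b][c] * u[b] * u[c] for b in range(dim) for c in range(dim))
        for a in range(dim)
    ]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(tensor, num_iters=100, seed=0):
    """Find one (eigenvalue, eigenvector) pair by repeated u <- T(I, u, u)."""
    rng = random.Random(seed)
    u = normalize([rng.gauss(0, 1) for _ in range(len(tensor))])
    for _ in range(num_iters):
        u = normalize(apply_tensor(tensor, u))
    eigenvalue = sum(x * y for x, y in zip(u, apply_tensor(tensor, u)))
    return eigenvalue, u


def decompose(tensor, rank):
    """Recover `rank` components, subtracting each one once it is found."""
    dim = len(tensor)
    residual = [[row[:] for row in mat] for mat in tensor]
    components = []
    for k in range(rank):
        lam, u = power_iteration(residual, seed=k)
        components.append((lam, u))
        for a in range(dim):
            for b in range(dim):
                for c in range(dim):
                    residual[a][b][c] -= lam * u[a] * u[b] * u[c]
    return components


# Hypothetical ground truth: three orthonormal vectors with weights 3, 2, 1.
S = 1 / math.sqrt(2)
VECTORS = [[S, S, 0.0], [S, -S, 0.0], [0.0, 0.0, 1.0]]
WEIGHTS = [3.0, 2.0, 1.0]
tensor = [
    [[sum(w * v[a] * v[b] * v[c] for w, v in zip(WEIGHTS, VECTORS))
      for c in range(3)] for b in range(3)]
    for a in range(3)
]
components = decompose(tensor, rank=3)
```

Each recovered pair should match one of the planted weights and vectors, in whatever order the random starting points happen to find them.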

TENSORS IN DEEP LEARNING

Existing Deep Networks

Deep Tensorized Networks

Space Saving in Deep Tensorized Networks
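
The slide content is not preserved in this transcript, so the following is only a back-of-the-envelope sketch of why tensorized layers can save space. It compares the parameter count of a dense weight tensor with a Tucker-style low-rank factorization (a small core tensor plus one factor matrix per mode). The layer shape and ranks are hypothetical and are not figures from the talk.

```python
from math import prod


def dense_params(shape):
    """Parameters in a full weight tensor of the given shape."""
    return prod(shape)


def tucker_params(shape, ranks):
    """Parameters in a Tucker factorization: core plus one factor per mode."""
    core = prod(ranks)
    factors = sum(dim * rank for dim, rank in zip(shape, ranks))
    return core + factors


# Hypothetical layer mapping a 256 x 7 x 7 activation tensor to 1000 outputs.
SHAPE = (256, 7, 7, 1000)
RANKS = (64, 4, 4, 100)
full = dense_params(SHAPE)
compressed = tucker_params(SHAPE, RANKS)
ratio = full / compressed
```

The saving depends entirely on how small the ranks can be made without hurting accuracy, which has to be checked empirically for each network.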

RNN and LSTM for Sequence Modeling

Tensor RNN and Tensor LSTM

TLSTM for Long-term Forecasting
Datasets: climate dataset, traffic dataset

Visual Question & Answering
Tensors for multiple modalities

Visual Question & Answering
Tensor Sketching Algorithms

TensorLy: Framework for Tensor Algebra
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in repository

CONCLUSION

Conclusion
• AWS SageMaker: Serverless ML framework
• Algorithms on SageMaker: faster and cheaper
• LDA model for unsupervised document categorization
• SageMaker LDA is faster and yields good topic quality
• Tensors are extensions of matrices
• Multiple dimensions and modalities
• Can be combined with deep learning

THANK YOU!
