Compressing Neural Language Models by Sparse Word Representations
1. Compressing Neural Language Models by Sparse Word Representations
Korea University, Department of Computer Science & Radio Communication Engineering
2016010646 ๊น๋ฒ์
2016010636 ์ด์งํ
Korean Information Processing
Professor Haechang Lim
2. Contents
01. Language Model
1-1. What is a Language Model?
1-2. N-grams
1-3. Standard Neural LM
02. Proposed Model
2-1. Sparse Representation
2-2. Embedding Compression
2-3. Prediction Compression
2-4. ZRegression NCE
03. Evaluation
3-1. Dataset
3-2. Qualitative Analysis
3-3. Quantitative Analysis
3-4. Conclusion
3. Language Model
1-1. What is a Language Model?
1-2. N-grams
1-3. Standard Neural LM
4. 1. Language Model
1-1. What is a Language Model?
Unfortunately, I am an ____________
A language model gives the probability $P(w_t \mid h)$ of the next word $w_t$ given the context (history) $h$ seen so far.
Candidate continuations and their probabilities:
idiot 0.672
flower 0.115
psycho-pass 0.581
genius 0.336
…
walk 0.016
process 0.052
cancel 0.039
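For concreteness, these next-word probabilities compose into a whole-sentence probability via the standard chain rule (a textbook identity, not something specific to these slides):

    % Sentence probability as a product of next-word probabilities,
    % each conditioned on the history seen so far.
    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})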
6. 1. Language Model
1-1. What is a Language Model?
Why is a language model useful?
Machine translation: P(delicious fish) > P(ominous fish)
Spell correction: P(I love you) > P(I loev you)
Speech recognition: P(I saw a van) > P(eyes awe of an)
12. 1. Language Model
1-3. Standard Neural LM
Neural Probabilistic Language Model (Bengio, 2003)
[Diagram: three stacked stages — Word Embedding → Encoding → Prediction]
13. 1. Language Model
1-3. Standard Neural LM
The model consists of three stages:
Word Embedding: map the context words $(w_1, w_2, \dots, w_{t-1})$ to vectors
Encoding: combine them into a single context vector
Prediction: minimize the loss on the next word $w_t$
14. 1. Language Model
1-3. Standard Neural LM
Stage 1 — Word Embedding
Map each word to a dense vector; a neural model is trained so that it predicts the probability of each word in the sentence.
Method 1) Skip-gram
Method 2) CBOW (Continuous Bag of Words)
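As a rough sketch, both embedding methods are available in the gensim library (the library choice, toy corpus, and hyperparameter values here are our own illustration, not from the slides):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]  # toy corpus

    # sg=1 selects Skip-gram (predict context words from the center word);
    # sg=0 selects CBOW (predict the center word from its averaged context).
    skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
    cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

    vec = skipgram.wv["cat"]  # a 100-dimensional dense vector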
15. 1. Language Model
1-3. Standard Neural LM
Stage 2 — Encoding
Map the context to a dense vector.
Using an FFNN, RNN, etc. here overcomes the limitations of traditional count-based prediction.
RNNs are the common choice because they handle long-distance dependencies.
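A minimal PyTorch sketch of the first two stages (the paper's experiments use a 200d LSTM; the other sizes and values here are our assumptions):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 10_000, 200, 200

    embedding = nn.Embedding(vocab_size, embed_dim)              # stage 1: word embedding
    encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # stage 2: encoding

    context = torch.tensor([[12, 57, 3041]])   # indices of (w_1, ..., w_{t-1})
    vectors = embedding(context)               # (1, 3, 200) dense word vectors
    _, (h, _) = encoder(vectors)               # final hidden state of the LSTM
    h = h[-1]                                  # (1, 200) context vector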
16. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation:
$p(w_t = i \mid h) = \frac{\exp(s(h, w_i))}{\sum_{j=1}^{V} \exp(s(h, w_j))}$, with $s(h, w_i) = W_i^\top h + b_i$
∴ Train so that the log probability of the next word $w_t$ is maximized.
$s(h, w_i)$: scoring function (how well does context $h$ match target word $w_i$?)
$W_i$: output weight of the neural LM ($C$-dimensional; there are $V$ of them)
$b_i$: output bias of the neural LM (one per word; $V$-dimensional overall)
$h$: the vector encoding the context
17. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation
The paper describes the bias as $C$-dimensional, but the arithmetic did not work out, so we emailed the first author (Yunchuan Chen). He replied that $V$-dimensional is correct and that the arXiv version would be revised accordingly.
$b_i$: output bias of the neural LM ($V$-dimensional)
18. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation
∴ The score $s(h, w_j)$ must be computed for every single word $w_j$ in the vocabulary $V$, just to normalize the softmax.
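A numpy sketch of why this is the bottleneck (all shapes illustrative): the normalizer touches every row of the output weight matrix for each single prediction.

    import numpy as np

    V, C = 100_000, 200            # vocabulary size, context dimension
    W = np.random.randn(V, C)      # output weights: one C-dim row per word
    b = np.random.randn(V)         # output biases: one scalar per word
    h = np.random.randn(C)         # encoded context

    scores = W @ h + b             # O(V * C) work for ONE next-word prediction
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()           # the normalizer needs all V scores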
19. 1. Language Model
1-3. Standard Neural LM
What is the problem?
Time complexity: prediction time depends on the vocabulary size.
Memory complexity: the number of parameters grows with the vocabulary size.
Prior approaches:
Hierarchical softmax via a Bayesian network
Importance sampling (Bengio & Senécal)
Noise-Contrastive Estimation, etc.
Differentiated softmax (Chen et al.)
Limitation: these compress only the output weights W; the input embeddings are left as they are.
21. Proposed Model
2-1. Sparse Representation
2-2. Embedding Compression
2-3. Prediction Compression
2-4. ZRegression NCE
22. 2. Proposed Model
2-1. Sparse Representation
Why does a dictionary work?
[-] lung disease that is otherwise known as silicosis.
Rare, difficult, unknown words can be defined in terms of frequent, easy words!
23. 2. Proposed Model
2-1. Sparse Representation
Why does a dictionary work?
[-] lung disease that is otherwise known as silicosis.
For infrequent words, instead of blindly growing the vocabulary, let us define each one as a linear combination of frequent words!
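A toy numpy illustration of the idea (every number here is invented): a rare word's vector is approximated by a sparse, non-negative, roughly sum-to-one combination of a few common words' vectors.

    import numpy as np

    # Invented 3-d embeddings of three common words.
    disease = np.array([0.9, 0.1, 0.0])
    lung = np.array([0.2, 0.8, 0.1])
    illness = np.array([0.8, 0.2, 0.1])

    # "silicosis" via a sparse, non-negative code over the base words;
    # the three coefficients sum to 1.
    silicosis = 0.5 * disease + 0.4 * lung + 0.1 * illness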
24. 2. Proposed Model
2-1. Sparse Representation
Embedding structure
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
This prevents the complexity from growing in proportion to the vocabulary size,
reduces the dimensionality needed to map a word to a vector,
and, because the reduction uses sparse vectors, saves further complexity.
25. 2. Proposed Model
2-1. Sparse Representation
Training the word embeddings
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
Embed the entire vocabulary with Skip-gram.
Extract the embeddings of the common words in $B$.
26. 2. Proposed Model
2-1. Sparse Representation
Extracting the common base set
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
For the $i$-th word of $B$ the sparse code is simply the one-hot vector $e_i$; for the remaining words we need to learn a sparse representation $x$.
27. 2. Proposed Model
2-1. Sparse Representation
Learning the sparse vectors
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
$\min_x \; \|Ux - w\|_2^2 + \alpha\|x\|_1 + \beta\,|\mathbf{1}^\top x - 1| \quad \text{s.t. } x \ge 0$
$\|Ux - w\|_2^2$: fitting loss — the sparse-coding representation should approximate the "true" embedding $w$
$\alpha\|x\|_1$: $\ell_1$ regularizer, encouraging sparsity
$\beta\,|\mathbf{1}^\top x - 1|$: regularization term pushing the entries of $x$ to sum to 1
$x \ge 0$: non-negativity constraint
28. 2. Proposed Model
2-1. Sparse Representation
Optimization function
$V$: vocabulary
$B$: base set — 8k of common words
$L(x)$: the fitting loss above; $R_1(x)$, $R_2(x)$: the two regularizers
Clipping: after each update, values that leave the valid range are adjusted back into it, enforcing $x \ge 0$.
29. 2. Proposed Model
2-1. Sparse Representation
Objective function
$V$: vocabulary
$B$: base set — 8k of common words
During training, the update keeps the fitting loss and the regularization terms at a constant ratio.
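A minimal numpy sketch of this sparse-coding step (the learning rate, iteration count, and fixed coefficients are our assumptions; the slides instead keep the fitting/regularization ratio constant during training):

    import numpy as np

    B, d = 8000, 200                  # base-set size, embedding dimension
    U = np.random.randn(d, B)         # columns: embeddings of the common words
    w = np.random.randn(d)            # "true" embedding of one rare word
    alpha, beta, lr = 0.1, 0.1, 0.01

    x = np.full(B, 1.0 / B)           # start from a uniform code
    for _ in range(500):
        r = U @ x - w                             # fitting residual
        grad = 2 * U.T @ r                        # grad of ||Ux - w||^2
        grad += alpha                             # grad of alpha*||x||_1 for x >= 0
        grad += beta * np.sign(x.sum() - 1.0)     # grad of beta*|1'x - 1|
        x = np.clip(x - lr * grad, 0.0, None)     # gradient step, then clipping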
30. 2. Proposed Model
2-2. Embedding Compression
Representing uncommon words
$B$: base set — 8k of common words
$C$: uncommon words
An uncommon word's embedding is reconstructed from its non-negative sparse code $x \in \mathbb{R}^{|B|}$ as $w = Ux$.
Note that since $U$ is dense, $w$ itself is not sparse either.
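A back-of-the-envelope sketch of the resulting memory saving (the counts below are illustrative, not the paper's):

    n_rare, d, B, k = 92_000, 200, 8_000, 20   # k = nonzeros per sparse code

    # Dense storage: every rare word keeps a full d-dim vector.
    dense_floats = n_rare * d                  # 18.4M stored numbers

    # Compressed: 8k base embeddings plus k (index, value) pairs per rare word,
    # counting each pair as two stored numbers.
    compressed_floats = B * d + n_rare * k * 2  # about 5.3M stored numbers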
31. 2. Proposed Model
2-3. Prediction Compression
Flashback: prediction by maximum likelihood estimation,
$p(w_t = i \mid h) = \frac{\exp(s(h, w_i))}{\sum_{j=1}^{V} \exp(s(h, w_j))}$, with $s(h, w_i) = W_i^\top h + b_i$
∴ Train so that the log probability of the next word $w_t$ is maximized.
$s(h, w_i)$: scoring function (how well does context $h$ match target word $w_i$?)
$W_i$: output weight ($C$-dimensional; there are $V$ of them); $b_i$: output bias ($V$-dimensional overall); $h$: the vector encoding the context
32. 2. Proposed Model
2-3. Prediction Compression
Compressing the output of the neural LM
Let $W$ and $b$ be the output weights and biases over the whole vocabulary.
Express $W$ and $b$ through the weights $D$ and biases $c$ of the common words alone.
As in the word-embedding step, the already-computed sparse codes $x$ are reused for this representation.
33. 2. Proposed Model
2-3. Prediction Compression
Word Embedding → Prediction
We know the embeddings $U$ of the common words in advance.
→ So the sparse vector $x$ of each rare word can be fitted in advance.
For the output weights and biases, however, there is nothing to fit against — they are parameters still being learned — so a sparse code cannot be obtained the same way.
The goal: reuse the sparse code $x$ computed in the embedding step inside the prediction step, expressing an infrequent word's weight and bias through the frequent words' parameters and that code.
Then, how do we obtain the sparse vector $x$ for the output weights & biases?
34. 2. Proposed Model
2-3. Prediction Compression
Compressing the output of the neural LM
Since the context $h$ and the word $w$ play structurally similar roles, the output weights should share structure with the word embeddings.
Therefore, the sparse code $x$ used for a word's embedding can be reused for its output weight $W$ as well.
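A numpy sketch of this reuse (shapes and variable names are ours): the rare words' scores fall out of the common words' output parameters and the embedding-side sparse codes, with no V-by-C weight matrix.

    import numpy as np

    B, C, n_rare = 8000, 200, 2000
    D = np.random.randn(B, C)               # output weights of common words
    c = np.random.randn(B)                  # output biases of common words
    X = np.abs(np.random.randn(n_rare, B))  # sparse codes (kept dense for brevity)
    h = np.random.randn(C)                  # encoded context

    common_scores = D @ h + c               # s(h, w) for every common word
    rare_scores = X @ common_scores         # row i: x_i'(Dh + c) = (D'x_i)'h + c'x_i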
35. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation (Gutmann and Hyvärinen, 2012)
A non-linear logistic regression: discriminate real data from artificial data (= noise) using the model's log-density function, instead of maximizing the softmax log probability over the full vocabulary.
36. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
[Figure: Maximum Likelihood Estimation vs. Noise-Contrastive Estimation]
Source: https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html
37. 2. Proposed Model
2-4. ZRegression NCE
NCE vs. MLE
Source: https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html
Maximum Likelihood Estimation: runs the softmax estimation over every word in the vocabulary — each positive example requires the probabilities of all words.
Noise-Contrastive Estimation: runs the estimation only against k noise samples — for each positive example, k noise samples are generated.
38. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
Under MLE, the normalizer $Z_h = \sum_{j=1}^{V} \exp(s(h, w_j))$ is computed over the entire vocabulary.
Under NCE, however, $Z_h$ depends on $h$, which makes it hard to apply directly.
Following Mnih & Teh (2012), the assumption $Z_h = 1$ generally holds up in practice.
39. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
Following Mnih & Teh (2012), assume $Z_h = 1$.
$J(\theta) = \mathbb{E}_{w \sim \text{data}}\big[\log P(D{=}1 \mid w, h)\big] + k\,\mathbb{E}_{w \sim P_n}\big[\log P(D{=}0 \mid w, h)\big]$
(left term: the log probability that $w_t$ appears as real data; right term: the log probability that a sample is recognized as noise)
Training raises the probability that the desired word comes from the context while the remaining words come from the noise.
$P_n$: the distribution from which negative samples (noise) are drawn.
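A PyTorch sketch of this NCE objective under the $Z_h = 1$ assumption (the function shape and names are ours; the s_* arguments are model scores and the logpn_* arguments are log-probabilities under the noise distribution):

    import torch
    import torch.nn.functional as F

    def nce_loss(s_pos, s_noise, logpn_pos, logpn_noise, k):
        # P(D=1 | w, h) = sigmoid(log p_model(w|h) - log(k * P_n(w))),
        # with Z_h = 1 so the raw score plays the role of log p_model.
        log_k = torch.log(torch.tensor(float(k)))
        pos = F.logsigmoid(s_pos - logpn_pos - log_k)         # real word -> data
        neg = F.logsigmoid(-(s_noise - logpn_noise - log_k))  # samples -> noise
        return -(pos + neg.sum(dim=-1)).mean()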
40. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
ZRegression
Objective function: the NCE objective above.
Fixing $Z_h = 1$ can cause instability → ZRegression.
41. 2. Proposed Model
2-4. ZRegression NCE
ZRegression
A regression layer predicts the normalizer $Z_h$ from the context vector $h$, instead of fixing $Z_h = 1$.
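A PyTorch sketch of such a layer (sizes are ours): it predicts $\log Z_h$ from the context so scores can be normalized as $s(h, w) - \log Z_h$ rather than assuming $Z_h = 1$.

    import torch.nn as nn

    class ZRegression(nn.Module):
        """Predicts log Z_h from the encoded context h."""
        def __init__(self, hidden_dim=200):
            super().__init__()
            self.z = nn.Linear(hidden_dim, 1)

        def forward(self, h):
            return self.z(h).squeeze(-1)   # log Z_h, shape (batch,)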
43. 3. Evaluation
3-1. Dataset
2014 Wikipedia dump
Preprocessing yields 1.6 billion running words.
Train / Validation / Test split
All of it is used for the backoff n-gram models; 100M words are used for the neural LMs.
44. 3. Evaluation
3-2. Qualitative Analysis
8,000 common words → 2k–24k uncommon words
$B$: base set — 8k of common words
$C$: uncommon words — 2k to 24k
Setup: pre-trained word embeddings, the Adam optimizer, and small coefficients.
45. 3. Evaluation
3-2. Qualitative Analysis
8,000 common words → 2k–24k uncommon words
Simple words are mapped meaningfully onto common words: an uncommon (rare) word whose coefficient is 0.6 or higher is meaningfully mapped.
Complex words remain rare.
[Table: sparse representations of uncommon (rare) words — coefficients over common words]
46. 3. Evaluation
3-3. Quantitative Analysis
Perplexity measure
$PPL = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i)}$
$N$: the number of running words in the test corpus
Lower is better: a lower perplexity means better performance.
Encoding via an LSTM-RNN: 200d hidden layer, Adam optimization.
Measures how plausibly the model derives each word from its given context.
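A numpy sketch of the measure on a toy corpus (the probabilities are invented):

    import numpy as np

    p = np.array([0.2, 0.1, 0.05, 0.3])   # model probability of each running word
    ppl = 2 ** (-np.mean(np.log2(p)))     # lower is better; about 7.6 here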
47. 3. Evaluation
3-3. Quantitative Analysis
Perplexity of each model, and memory reduction compared to LSTM-z:
- KN3: Kneser-Ney smoothing on a 3-gram LM
- LBL5: log-bilinear model with 5 preceding words (Mnih and Hinton, 2007)
- LSTM-s: standard LSTM-RNN LM
- LSTM-z: enhanced with ZRegression
- LSTM-z,wb: compressing both weights and biases in prediction
- LSTM-z,w: compressing only weights in prediction