Compressing Neural Language Models by Sparse Word Representations
1. Compressing Neural Language Models by Sparse Word Representations
Korea University, Department of Computer Science & Radio Communication Engineering
2016010646 ๊น๋ฒ์
2016010636 ์ด์งํ
Korean Information Processing
Professor Haechang Lim
2. Contents
01. Language Model
1-1. What is a Language Model?
1-2. N-grams
1-3. Standard Neural LM
02. Proposed Model
2-1. Sparse Representation
2-2. Embedding Compression
2-3. Prediction Compression
2-4. ZRegression NCE
03. Evaluation
3-1. Dataset
3-2. Qualitative Analysis
3-3. Quantitative Analysis
3-4. Conclusion
3. Language Model
1-1. What is a Language Model?
1-2. N-grams
1-3. Standard Neural LM
4. 1. Language Model
1-1. What is a Language Model?
Unfortunately, I am an ____________
A language model gives the probability $P(w_t \mid h)$ of the next word $w_t$ given the context (history) $h$ seen so far.
Candidate continuations and their probabilities:
idiot 0.672
flower 0.115
psycho-pass 0.581
genius 0.336
…
walk 0.016
process 0.052
cancel 0.039
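For concreteness, these next-word probabilities compose into a whole-sentence probability via the standard chain rule (a textbook identity, not something specific to these slides):

    % Sentence probability as a product of next-word probabilities,
    % each conditioned on the history seen so far.
    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})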
6. 1. Language Model
1-1. What is a Language Model?
Why is a language model useful?
Machine translation: P(delicious fish) > P(ominous fish)
Spell correction: P(I love you) > P(I loev you)
Speech recognition: P(I saw a van) > P(eyes awe of an)
12. 1. Language Model
1-3. Standard Neural LM
Neural Probabilistic Language Model (Bengio, 2003)
[Diagram: three stacked stages — Word Embedding → Encoding → Prediction]
13. 1. Language Model
1-3. Standard Neural LM
The model consists of three stages:
Word Embedding: map the context words $(w_1, w_2, \dots, w_{t-1})$ to vectors
Encoding: combine them into a single context vector
Prediction: minimize the loss on the next word $w_t$
14. 1. Language Model
1-3. Standard Neural LM
Stage 1 — Word Embedding
Map each word to a dense vector; a neural model is trained so that it predicts the probability of each word in the sentence.
Method 1) Skip-gram
Method 2) CBOW (Continuous Bag of Words)
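As a rough sketch, both embedding methods are available in the gensim library (the library choice, toy corpus, and hyperparameter values here are our own illustration, not from the slides):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]  # toy corpus

    # sg=1 selects Skip-gram (predict context words from the center word);
    # sg=0 selects CBOW (predict the center word from its averaged context).
    skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
    cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

    vec = skipgram.wv["cat"]  # a 100-dimensional dense vector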
15. 1. Language Model
1-3. Standard Neural LM
Stage 2 — Encoding
Map the context to a dense vector.
Using an FFNN, RNN, etc. here overcomes the limitations of traditional count-based prediction.
RNNs are the common choice because they handle long-distance dependencies.
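A minimal PyTorch sketch of the first two stages (the paper's experiments use a 200d LSTM; the other sizes and values here are our assumptions):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 10_000, 200, 200

    embedding = nn.Embedding(vocab_size, embed_dim)              # stage 1: word embedding
    encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # stage 2: encoding

    context = torch.tensor([[12, 57, 3041]])   # indices of (w_1, ..., w_{t-1})
    vectors = embedding(context)               # (1, 3, 200) dense word vectors
    _, (h, _) = encoder(vectors)               # final hidden state of the LSTM
    h = h[-1]                                  # (1, 200) context vector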
16. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation:
$p(w_t = i \mid h) = \frac{\exp(s(h, w_i))}{\sum_{j=1}^{V} \exp(s(h, w_j))}$, with $s(h, w_i) = W_i^\top h + b_i$
∴ Train so that the log probability of the next word $w_t$ is maximized.
$s(h, w_i)$: scoring function (how well does context $h$ match target word $w_i$?)
$W_i$: output weight of the neural LM ($C$-dimensional; there are $V$ of them)
$b_i$: output bias of the neural LM (one per word; $V$-dimensional overall)
$h$: the vector encoding the context
17. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation
The paper describes the bias as $C$-dimensional, but the arithmetic did not work out, so we emailed the first author (Yunchuan Chen). He replied that $V$-dimensional is correct and that the arXiv version would be revised accordingly.
$b_i$: output bias of the neural LM ($V$-dimensional)
18. 1. Language Model
1-3. Standard Neural LM
Stage 3 — Prediction
Maximum likelihood estimation
∴ The score $s(h, w_j)$ must be computed for every single word $w_j$ in the vocabulary $V$, just to normalize the softmax.
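A numpy sketch of why this is the bottleneck (all shapes illustrative): the normalizer touches every row of the output weight matrix for each single prediction.

    import numpy as np

    V, C = 100_000, 200            # vocabulary size, context dimension
    W = np.random.randn(V, C)      # output weights: one C-dim row per word
    b = np.random.randn(V)         # output biases: one scalar per word
    h = np.random.randn(C)         # encoded context

    scores = W @ h + b             # O(V * C) work for ONE next-word prediction
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()           # the normalizer needs all V scores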
19. 1. Language Model
1-3. Standard Neural LM
What is the problem?
Time complexity: prediction time depends on the vocabulary size.
Memory complexity: the number of parameters grows with the vocabulary size.
Prior approaches:
Hierarchical softmax via a Bayesian network
Importance sampling (Bengio & Senécal)
Noise-Contrastive Estimation, etc.
Differentiated softmax (Chen et al.)
Limitation: these compress only the output weights W; the input embeddings are left as they are.
21. Proposed Model
2-1. Sparse Representation
2-2. Embedding Compression
2-3. Prediction Compression
2-4. ZRegression NCE
22. 2. Proposed Model
2-1. Sparse Representation
Why does a dictionary work?
[-] lung disease that is otherwise known as silicosis.
Rare, difficult, unknown words can be defined in terms of frequent, easy words!
23. 2. Proposed Model
2-1. Sparse Representation
Why does a dictionary work?
[-] lung disease that is otherwise known as silicosis.
For infrequent words, instead of blindly growing the vocabulary, let us define each one as a linear combination of frequent words!
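A toy numpy illustration of the idea (every number here is invented): a rare word's vector is approximated by a sparse, non-negative, roughly sum-to-one combination of a few common words' vectors.

    import numpy as np

    # Invented 3-d embeddings of three common words.
    disease = np.array([0.9, 0.1, 0.0])
    lung = np.array([0.2, 0.8, 0.1])
    illness = np.array([0.8, 0.2, 0.1])

    # "silicosis" via a sparse, non-negative code over the base words;
    # the three coefficients sum to 1.
    silicosis = 0.5 * disease + 0.4 * lung + 0.1 * illness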
24. 2. Proposed Model
2-1. Sparse Representation
Embedding structure
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
This prevents the complexity from growing in proportion to the vocabulary size,
reduces the dimensionality needed to map a word to a vector,
and, because the reduction uses sparse vectors, saves further complexity.
25. 2. Proposed Model
2-1. Sparse Representation
Training the word embeddings
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
Embed the entire vocabulary with Skip-gram.
Extract the embeddings of the common words in $B$.
26. 2. Proposed Model
2-1. Sparse Representation
Extracting the common base set
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
For the $i$-th word of $B$ the sparse code is simply the one-hot vector $e_i$; for the remaining words we need to learn a sparse representation $x$.
27. 2. Proposed Model
2-1. Sparse Representation
Learning the sparse vectors
$V$: vocabulary
$B$: base set — 8k of common words
$C$: uncommon words
$\min_x \; \|Ux - w\|_2^2 + \alpha\|x\|_1 + \beta\,|\mathbf{1}^\top x - 1| \quad \text{s.t. } x \ge 0$
$\|Ux - w\|_2^2$: fitting loss — the sparse-coding representation should approximate the "true" embedding $w$
$\alpha\|x\|_1$: $\ell_1$ regularizer, encouraging sparsity
$\beta\,|\mathbf{1}^\top x - 1|$: regularization term pushing the entries of $x$ to sum to 1
$x \ge 0$: non-negativity constraint
28. 2. Proposed Model
2-1. Sparse Representation
Optimization function
$V$: vocabulary
$B$: base set — 8k of common words
$L(x)$: the fitting loss above; $R_1(x)$, $R_2(x)$: the two regularizers
Clipping: after each update, values that leave the valid range are adjusted back into it, enforcing $x \ge 0$.
29. 2. Proposed Model
2-1. Sparse Representation
Objective function
$V$: vocabulary
$B$: base set — 8k of common words
During training, the update keeps the fitting loss and the regularization terms at a constant ratio.
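A minimal numpy sketch of this sparse-coding step (the learning rate, iteration count, and fixed coefficients are our assumptions; the slides instead keep the fitting/regularization ratio constant during training):

    import numpy as np

    B, d = 8000, 200                  # base-set size, embedding dimension
    U = np.random.randn(d, B)         # columns: embeddings of the common words
    w = np.random.randn(d)            # "true" embedding of one rare word
    alpha, beta, lr = 0.1, 0.1, 0.01

    x = np.full(B, 1.0 / B)           # start from a uniform code
    for _ in range(500):
        r = U @ x - w                             # fitting residual
        grad = 2 * U.T @ r                        # grad of ||Ux - w||^2
        grad += alpha                             # grad of alpha*||x||_1 for x >= 0
        grad += beta * np.sign(x.sum() - 1.0)     # grad of beta*|1'x - 1|
        x = np.clip(x - lr * grad, 0.0, None)     # gradient step, then clipping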
30. 2. Proposed Model
2-2. Embedding Compression
Representing uncommon words
$B$: base set — 8k of common words
$C$: uncommon words
An uncommon word's embedding is reconstructed from its non-negative sparse code $x \in \mathbb{R}^{|B|}$ as $w = Ux$.
Note that since $U$ is dense, $w$ itself is not sparse either.
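A back-of-the-envelope sketch of the resulting memory saving (the counts below are illustrative, not the paper's):

    n_rare, d, B, k = 92_000, 200, 8_000, 20   # k = nonzeros per sparse code

    # Dense storage: every rare word keeps a full d-dim vector.
    dense_floats = n_rare * d                  # 18.4M stored numbers

    # Compressed: 8k base embeddings plus k (index, value) pairs per rare word,
    # counting each pair as two stored numbers.
    compressed_floats = B * d + n_rare * k * 2  # about 5.3M stored numbers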
31. 2. Proposed Model
2-3. Prediction Compression
Flashback: prediction by maximum likelihood estimation,
$p(w_t = i \mid h) = \frac{\exp(s(h, w_i))}{\sum_{j=1}^{V} \exp(s(h, w_j))}$, with $s(h, w_i) = W_i^\top h + b_i$
∴ Train so that the log probability of the next word $w_t$ is maximized.
$s(h, w_i)$: scoring function (how well does context $h$ match target word $w_i$?)
$W_i$: output weight ($C$-dimensional; there are $V$ of them); $b_i$: output bias ($V$-dimensional overall); $h$: the vector encoding the context
32. 2. Proposed Model
2-3. Prediction Compression
Compressing the output of the neural LM
Let $W$ and $b$ be the output weights and biases over the whole vocabulary.
Express $W$ and $b$ through the weights $D$ and biases $c$ of the common words alone.
As in the word-embedding step, the already-computed sparse codes $x$ are reused for this representation.
33. 2. Proposed Model
2-3. Prediction Compression
Word Embedding → Prediction
We know the embeddings $U$ of the common words in advance.
→ So the sparse vector $x$ of each rare word can be fitted in advance.
For the output weights and biases, however, there is nothing to fit against — they are parameters still being learned — so a sparse code cannot be obtained the same way.
The goal: reuse the sparse code $x$ computed in the embedding step inside the prediction step, expressing an infrequent word's weight and bias through the frequent words' parameters and that code.
Then, how do we obtain the sparse vector $x$ for the output weights & biases?
34. 2. Proposed Model
2-3. Prediction Compression
Compressing the output of the neural LM
Since the context $h$ and the word $w$ play structurally similar roles, the output weights should share structure with the word embeddings.
Therefore, the sparse code $x$ used for a word's embedding can be reused for its output weight $W$ as well.
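A numpy sketch of this reuse (shapes and variable names are ours): the rare words' scores fall out of the common words' output parameters and the embedding-side sparse codes, with no V-by-C weight matrix.

    import numpy as np

    B, C, n_rare = 8000, 200, 2000
    D = np.random.randn(B, C)               # output weights of common words
    c = np.random.randn(B)                  # output biases of common words
    X = np.abs(np.random.randn(n_rare, B))  # sparse codes (kept dense for brevity)
    h = np.random.randn(C)                  # encoded context

    common_scores = D @ h + c               # s(h, w) for every common word
    rare_scores = X @ common_scores         # row i: x_i'(Dh + c) = (D'x_i)'h + c'x_i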
35. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation (Gutmann and Hyvärinen, 2012)
A non-linear logistic regression: discriminate real data from artificial data (= noise) using the model's log-density function, instead of maximizing the softmax log probability over the full vocabulary.
36. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
[Figure: Maximum Likelihood Estimation vs. Noise-Contrastive Estimation]
Source: https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html
37. 2. Proposed Model
2-4. ZRegression NCE
NCE vs. MLE
Source: https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html
Maximum Likelihood Estimation: runs the softmax estimation over every word in the vocabulary — each positive example requires the probabilities of all words.
Noise-Contrastive Estimation: runs the estimation only against k noise samples — for each positive example, k noise samples are generated.
38. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
Under MLE, the normalizer $Z_h = \sum_{j=1}^{V} \exp(s(h, w_j))$ is computed over the entire vocabulary.
Under NCE, however, $Z_h$ depends on $h$, which makes it hard to apply directly.
Following Mnih & Teh (2012), the assumption $Z_h = 1$ generally holds up in practice.
39. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
Following Mnih & Teh (2012), assume $Z_h = 1$.
$J(\theta) = \mathbb{E}_{w \sim \text{data}}\big[\log P(D{=}1 \mid w, h)\big] + k\,\mathbb{E}_{w \sim P_n}\big[\log P(D{=}0 \mid w, h)\big]$
(left term: the log probability that $w_t$ appears as real data; right term: the log probability that a sample is recognized as noise)
Training raises the probability that the desired word comes from the context while the remaining words come from the noise.
$P_n$: the distribution from which negative samples (noise) are drawn.
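A PyTorch sketch of this NCE objective under the $Z_h = 1$ assumption (the function shape and names are ours; the s_* arguments are model scores and the logpn_* arguments are log-probabilities under the noise distribution):

    import torch
    import torch.nn.functional as F

    def nce_loss(s_pos, s_noise, logpn_pos, logpn_noise, k):
        # P(D=1 | w, h) = sigmoid(log p_model(w|h) - log(k * P_n(w))),
        # with Z_h = 1 so the raw score plays the role of log p_model.
        log_k = torch.log(torch.tensor(float(k)))
        pos = F.logsigmoid(s_pos - logpn_pos - log_k)         # real word -> data
        neg = F.logsigmoid(-(s_noise - logpn_noise - log_k))  # samples -> noise
        return -(pos + neg.sum(dim=-1)).mean()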
40. 2. Proposed Model
2-4. ZRegression NCE
Noise-Contrastive Estimation
ZRegression
Objective function: the NCE objective above.
Fixing $Z_h = 1$ can cause instability → ZRegression.
41. 2. Proposed Model
2-4. ZRegression NCE
ZRegression
A regression layer predicts the normalizer $Z_h$ from the context vector $h$, instead of fixing $Z_h = 1$.
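A PyTorch sketch of such a layer (sizes are ours): it predicts $\log Z_h$ from the context so scores can be normalized as $s(h, w) - \log Z_h$ rather than assuming $Z_h = 1$.

    import torch.nn as nn

    class ZRegression(nn.Module):
        """Predicts log Z_h from the encoded context h."""
        def __init__(self, hidden_dim=200):
            super().__init__()
            self.z = nn.Linear(hidden_dim, 1)

        def forward(self, h):
            return self.z(h).squeeze(-1)   # log Z_h, shape (batch,)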
43. 3. Evaluation
3-1. Dataset
2014 Wikipedia dump
Preprocessing yields 1.6 billion running words.
Train / Validation / Test split
All of it is used for the backoff n-gram models; 100M words are used for the neural LMs.
44. 3. Evaluation
3-2. Qualitative Analysis
8,000 common words → 2k–24k uncommon words
$B$: base set — 8k of common words
$C$: uncommon words — 2k to 24k
Setup: pre-trained word embeddings, the Adam optimizer, and small coefficients.
45. 3. Evaluation
3-2. Qualitative Analysis
8,000 common words → 2k–24k uncommon words
Simple words are mapped meaningfully onto common words: an uncommon (rare) word whose coefficient is 0.6 or higher is meaningfully mapped.
Complex words remain rare.
[Table: sparse representations of uncommon (rare) words — coefficients over common words]
46. 3. Evaluation
3-3. Quantitative Analysis
Perplexity measure
$PPL = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i)}$
$N$: the number of running words in the test corpus
Lower is better: a lower perplexity means better performance.
Encoding via an LSTM-RNN: 200d hidden layer, Adam optimization.
Measures how plausibly the model derives each word from its given context.
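A numpy sketch of the measure on a toy corpus (the probabilities are invented):

    import numpy as np

    p = np.array([0.2, 0.1, 0.05, 0.3])   # model probability of each running word
    ppl = 2 ** (-np.mean(np.log2(p)))     # lower is better; about 7.6 here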
47. 3. Evaluation
3-3. Quantitative Analysis
Perplexity of each model, and memory reduction compared to LSTM-z:
- KN3: Kneser-Ney smoothing on a 3-gram LM
- LBL5: log-bilinear model with 5 preceding words (Mnih and Hinton, 2007)
- LSTM-s: standard LSTM-RNN LM
- LSTM-z: enhanced with ZRegression
- LSTM-z,wb: compressing both weights and biases in prediction
- LSTM-z,w: compressing only weights in prediction