Text2Action: Generative Adversarial Synthesis from Language to Action

2017.11.17
Presenter : Hyemin Ahn

Introducing Myself
2017-11-16 CPSLAB (EECS) 2
Interested in human-robot interaction based on machine learning,
and in human nonverbal communication.

Today’s Seminar: Text2Action
Text2Action: Generative Adversarial Synthesis from Language to Action
• A neural network that, given a sentence (Language) describing a human action, generates the human action (Action) that the sentence describes.
Man is dancing to music

If the goal is to build such a network, what specifically needs to be done?
1. How should the input natural language be processed?
• What is a sentence?
• A sequence of characters / words.
• How should the features that capture what the input sentence says about the action be encoded?
2. How should an action be generated from the processed natural language?
• What is an action?
• A sequence of poses in time.
• To generate the pose at each moment, how should the features encoded from the input sentence be passed along?
Relevant background: Word2Vec, RNN, Sequence to Sequence

Backgrounds : Word2Vec
• Vector representations of words (word embeddings).
• Learns how words are distributed in a vector space, based on the assumption (the Distributional Hypothesis) that words appearing close to each other in text have similar meanings.
• More effective than representing each word as a one-hot vector!
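The contrast between one-hot vectors and word embeddings can be illustrated with a small sketch. The three-word vocabulary and the 2-D embedding values below are hypothetical and hand-picked for illustration; they are not trained Word2Vec vectors. The point is only that one-hot vectors cannot express similarity between words, while dense vectors can.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["man", "woman", "music"]

# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical 2-D embeddings, hand-picked for illustration (not trained).
embedding = {
    "man": [0.9, 0.1],
    "woman": [0.8, 0.2],
    "music": [0.1, 0.9],
}

one_hot_sim = cosine(one_hot["man"], one_hot["woman"])
dense_sim_close = cosine(embedding["man"], embedding["woman"])
dense_sim_far = cosine(embedding["man"], embedding["music"])
```

With one-hot vectors every pair of different words has similarity zero, so "man" is no closer to "woman" than to "music"; the dense vectors can place related words near each other.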


Backgrounds : Recurrent Neural Networks(RNN)
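• People remember and use the patterns of events that happen in sequence.
• Easy: ‘가 나 다 라 마 바 사…’ (the Korean syllables recited in their usual order)
• But what about reciting them backwards?: ‘하 파 카 타 차 자 아…’ ?
• “Let’s make use of the information contained in such sequences!” is the idea that gave birth to the RNN.
• Learn the patterns in a sequence, and use them to estimate what will happen next or to generate a new sequence!
• But HOW?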

Backgrounds : Recurrent Neural Networks(RNN)
[Figure: an RNN cell with input $x_t$, hidden state $h_t$, and output $y_t$; weights $U$, $W$, $V$; the hidden state is fed back through a one-step delay.]
• An RNN is called “recurrent” because it performs the same operation repeatedly, once for each element of the sequence it receives as input.
• Its output also depends on what was computed in the previous steps.
• An RNN has a “memory” that stores what has been computed so far.
• The hidden state $h_t$, which serves as this “memory”, stores information about the input sequence.
• If $f = \tanh$, the vanishing/exploding gradient problem can arise.
• To overcome this, an LSTM or GRU is commonly used as $f$.
$h_t = f(U x_t + W h_{t-1} + b)$
$y_t = V h_t + c$
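The recurrence $h_t = f(U x_t + W h_{t-1} + b)$, $y_t = V h_t + c$ can be sketched in a few lines. This is a minimal illustration with $f = \tanh$; the toy weight values and the helper names (`matvec`, `rnn_step`, `run_rnn`) are assumptions made for this example, not part of the slides.

```python
import math

def matvec(M, v):
    """Multiply a list-of-rows matrix M by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One recurrent step: h_t = tanh(U x_t + W h_{t-1} + b), y_t = V h_t + c."""
    pre = [ux + wh + b_k
           for ux, wh, b_k in zip(matvec(U, x_t), matvec(W, h_prev), b)]
    h_t = [math.tanh(p) for p in pre]
    y_t = [vh + c_k for vh, c_k in zip(matvec(V, h_t), c)]
    return h_t, y_t

# Hypothetical toy parameters: 2-D input, 2-D hidden state, 1-D output.
U = [[0.5, 0.0], [0.0, 0.5]]
W = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, -1.0]]
b = [0.0, 0.0]
c = [0.0]

def run_rnn(xs):
    """Apply the same step to every element, carrying the hidden state along."""
    h = [0.0, 0.0]
    ys = []
    for x_t in xs:
        h, y = rnn_step(x_t, h, U, W, V, b, c)
        ys.append(y)
    return h, ys

h_forward, ys_forward = run_rnn([[1.0, 0.0], [0.0, 1.0]])
h_reversed, ys_reversed = run_rnn([[0.0, 1.0], [1.0, 0.0]])
```

Feeding the same two inputs in the opposite order leaves a different hidden state, which is the sense in which $h_t$ acts as a memory of the sequence rather than of a set of inputs.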

Backgrounds : Long Short Term Memory (LSTM)
• Imagine a machine that guesses what tonight’s dinner menu will be from the items in a shopping bag.
Hmm…
Carbonara?

Cell state $C_t$: the internal memory unit, like a conveyor belt!
[Figure: LSTM cell with cell state $C_t$, hidden state $h_t$, and input $x_t$.]

Forget some memories!
• Given the previous hidden state $h_{t-1}$ and the new input $x_t$, the LSTM decides (1) which parts of the memory to erase and (2) how to add new memory.

Insert some memories!
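The forgetting and inserting of memories can be sketched as a single LSTM step using the standard LSTM gate equations. The parameter dictionary, the helper names, and the one-unit toy parameters (whose gates are driven by the biases alone) are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p is a dict of weights and biases (names are illustrative)."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], concat, p["b_f"])]  # what to keep
    i = [sigmoid(a) for a in affine(p["W_i"], concat, p["b_i"])]  # what to insert
    o = [sigmoid(a) for a in affine(p["W_o"], concat, p["b_o"])]  # what to expose
    C_tilde = [math.tanh(a) for a in affine(p["W_C"], concat, p["b_C"])]
    C_t = [f_k * c_k + i_k * n_k
           for f_k, c_k, i_k, n_k in zip(f, C_prev, i, C_tilde)]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, C_t)]
    return h_t, C_t

def toy_params(b_f, b_i):
    """Hypothetical 1-unit parameters whose gates depend on the biases only."""
    zero = [[0.0, 0.0]]
    return {
        "W_f": zero, "b_f": [b_f],
        "W_i": zero, "b_i": [b_i],
        "W_o": zero, "b_o": [0.0],
        "W_C": [[0.0, 1.0]], "b_C": [0.0],  # candidate memory = tanh(x_t)
    }

# Forget gate near 0, input gate near 1: old memory is replaced by the candidate.
_, C_replaced = lstm_step([1.0], [0.0], [5.0], toy_params(b_f=-100.0, b_i=100.0))
# Forget gate near 1, input gate near 0: old memory passes through unchanged.
_, C_kept = lstm_step([1.0], [0.0], [5.0], toy_params(b_f=100.0, b_i=-100.0))
```

In the second call the cell state travels through the step untouched, which matches the conveyor-belt picture of $C_t$.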

[Figure: LSTM cell with cell state $C_t$, hidden state $h_t$, input $x_t$, and output $y_t$.]
Figures from http://colah.github.io/posts/2015-08-Understanding-LSTMs/


LSTM
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
$h_t = o_t * \tanh(C_t)$
It seems like this structure could be made simpler…!
GRU
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
$\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t] + b_h)$
$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$
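A GRU step can be sketched in the same style as the equations on this slide: an update gate $z_t$, a reset gate $r_t$, a candidate state $\tilde{h}_t$, and an interpolation between the old state and the candidate. The parameter names (including the candidate bias, called `b_h` here) and the one-unit toy values are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(W, v, b):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict of weights and biases (names are illustrative)."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(p["W_z"], concat, p["b_z"])]  # update gate
    r = [sigmoid(a) for a in affine(p["W_r"], concat, p["b_r"])]  # reset gate
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t  # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in affine(p["W_h"], gated, p["b_h"])]
    # Interpolate between the old state and the candidate state.
    return [(1.0 - z_k) * h_k + z_k * n_k
            for z_k, h_k, n_k in zip(z, h_prev, h_tilde)]

def toy_params(b_z):
    """Hypothetical 1-unit parameters; the update gate depends on b_z only."""
    zero = [[0.0, 0.0]]
    return {
        "W_z": zero, "b_z": [b_z],
        "W_r": zero, "b_r": [0.0],
        "W_h": [[0.0, 1.0]], "b_h": [0.0],  # candidate state = tanh(x_t)
    }

h_kept = gru_step([2.0], [0.3], toy_params(b_z=-100.0))    # z near 0: keep old state
h_updated = gru_step([2.0], [0.3], toy_params(b_z=100.0))  # z near 1: take candidate
```

Compared with the LSTM, there is no separate cell state and a single gate $z_t$ decides both how much to forget and how much to insert.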

Backgrounds : Sequence to Sequence
[Figure: an LSTM/GRU encoder with hidden states $h_e(1), \dots, h_e(5)$ and an LSTM/GRU decoder with hidden states $h_d(1), \dots, h_d(T_e)$. Example: Western food to Korean food transition.]

• What is the simplest way to implement a Sequence to Sequence model?
Pass the encoder’s last hidden state $h_T$ to the first cell of the decoder!
• However, this method becomes less effective the longer the sequence the decoder has to generate.
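The simplest scheme described above can be sketched as follows. To keep the example short, a scalar $\tanh$ cell stands in for the LSTM/GRU cells; the weights, the start-of-sequence input, and the function names are assumptions made for illustration.

```python
import math

def cell(x, h, w_x, w_h):
    """Scalar tanh recurrent cell, a stand-in for an LSTM/GRU cell."""
    return math.tanh(w_x * x + w_h * h)

def encode(xs, w_x=0.5, w_h=0.8):
    """Run the encoder over the input sequence and return its last hidden state."""
    h = 0.0
    for x in xs:
        h = cell(x, h, w_x, w_h)
    return h  # h_T

def decode(h_T, steps, w_x=0.3, w_h=0.9, v=1.0):
    """Generate `steps` outputs, starting from the encoder's last hidden state."""
    h = h_T  # the decoder's first cell starts from h_T
    y = 0.0  # hypothetical start-of-sequence input
    ys = []
    for _ in range(steps):
        h = cell(y, h, w_x, w_h)  # previous output is fed back as input
        y = v * h
        ys.append(y)
    return ys

outputs_a = decode(encode([1.0, 2.0, 3.0]), steps=4)
outputs_b = decode(encode([-1.0, -2.0, -3.0]), steps=4)
```

Everything the decoder knows about the input arrives through the single value `h_T`, however many steps it then has to generate.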

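[Figure: a bidirectional GRU encoder, an attention module, and a GRU decoder that receives a context vector $c_t$.]
• Let’s pass the encoder’s information to each GRU cell of the decoder in a different way!
$h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$
$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$
$s_i = f(s_{i-1}, y_{i-1}, c_i) = (1 - z_i) * s_{i-1} + z_i * \tilde{s}_i$
$z_i = \sigma(W_z y_{i-1} + U_z s_{i-1} + b_z)$
$r_i = \sigma(W_r y_{i-1} + U_r s_{i-1} + b_r)$
$\tilde{s}_i = \tanh(y_{i-1} + U [r_i * s_{i-1}] + C c_i + b)$
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$
$e_{ij} = v_a^T \tanh(W_a s_{i-1} + U_a h_j + b_a)$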
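The attention computation for one decoder step can be sketched as: score each encoder state $h_j$ against the previous decoder state $s_{i-1}$, normalize the scores with a softmax to get $\alpha_{ij}$, and take the weighted sum as the context vector $c_i$. The toy dimensions and parameter values below are assumptions made for illustration.

```python
import math

def matvec(M, v):
    """Multiply a list-of-rows matrix M by a vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention(s_prev, hs, W_a, U_a, b_a, v_a):
    """Return (alphas, context) for one decoder step.

    e_ij     = v_a^T tanh(W_a s_{i-1} + U_a h_j + b_a)
    alpha_ij = softmax over j of e_ij
    c_i      = sum_j alpha_ij h_j
    """
    ws = matvec(W_a, s_prev)
    scores = []
    for h_j in hs:
        pre = [w + u + b_k for w, u, b_k in zip(ws, matvec(U_a, h_j), b_a)]
        scores.append(sum(v * math.tanh(p) for v, p in zip(v_a, pre)))
    # Subtracting the max leaves the softmax unchanged but avoids overflow.
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hs[0])
    context = [sum(a * h_j[d] for a, h_j in zip(alphas, hs)) for d in range(dim)]
    return alphas, context

# Hypothetical toy setup: 2-D states, three encoder positions.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alphas, context = attention(
    s_prev=[1.0, 0.0], hs=encoder_states,
    W_a=identity, U_a=identity, b_a=[0.0, 0.0], v_a=[1.0, 1.0],
)
```

Because the weights are recomputed from $s_{i-1}$ at every step, each decoder cell can receive a different mix of the encoder states.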

Back to the Text2Action : Possible Structure?

But the result from Seq2Seq alone is…
Input Sentence: The girl is dancing to the music.

But the result from Seq2Seq alone is…
Input Sentence: The man is talking to the audience.

How can we generate more realistic actions?
Let’s take advantage of a Generative Adversarial Network (GAN)!
But HOW?

Generator and Discriminator
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, c), c))]$
<Warning>
Relying only on this value function can produce terrible results!
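To make the value function concrete, the sketch below estimates $V(D, G)$ from samples for a fixed condition $c$. It reads the second term in the usual conditional-GAN form, $\log(1 - D(G(z, c), c))$, which is an assumption about the slide's notation. The scalar toy discriminator and generators are hypothetical stand-ins, not the Text2Action networks.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def value_function(D, G, real_samples, noise_samples, c):
    """Sample-based estimate of V(D, G) for a fixed condition c."""
    real_term = sum(math.log(D(x, c)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(G(z, c), c))
                    for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

# Hypothetical scalar setup: real data lies near c + 1.
def discriminator(x, c):
    """Outputs a probability that x is real; high when x - c is above 0.5."""
    return sigmoid(4.0 * (x - c - 0.5))

def bad_generator(z, c):
    return c + 0.1 * z        # samples near c, far from the real data

def good_generator(z, c):
    return c + 1.0 + 0.1 * z  # samples near c + 1, like the real data

c = 0.0
real = [c + 0.9, c + 1.0, c + 1.1]
noise = [-1.0, 0.0, 1.0]

v_bad = value_function(discriminator, bad_generator, real, noise, c)
v_good = value_function(discriminator, good_generator, real, noise, c)
v_chance = value_function(lambda x, cond: 0.5, good_generator, real, noise, c)
```

The discriminator tries to push this value up and the generator tries to push it down: against the same discriminator, the generator whose samples resemble the real data yields the lower value, and a discriminator that always answers 0.5 gives $2 \log 0.5$.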

Text2Action: Overall Structure

Text2Action: Used Training Data
• Pose data was extracted from the MSR-VTT dataset, which includes YouTube videos and corresponding language descriptions.

Text2Action: Result
Input Sentence: The girl is dancing to the hip hop beat.

Text2Action: Result
Input Sentence: The girl is dancing

Text2Action: Result
Input Sentence: The girl is dancing

Text2Action: Result
Input Sentence: A chef is cooking a meal in the kitchen.

Text2Action: Result
Input Sentence: A man is throwing something to the front.
