CHATBOT A Generative Based
Approach
-MANISH MISHRA
WHAT IS A CHATBOT?
A chatbot is a program that communicates with us.
A chatbot is a service, powered by rules and sometimes artificial
intelligence, that we interact with via a chat interface.
Some chatterbots use sophisticated natural language processing
systems, but many simpler systems scan for keywords within the
input, then pull a reply with the most matching keywords, or the most
similar wording pattern, from a database.
Today, chatbots are part of virtual assistants such as Google
Assistant, and are accessed via many organizations' apps and websites,
and on instant messaging platforms such as Facebook Messenger.
WHY DO WE NEED CHATBOTS?
Trends show that users are investing more and more time in messaging apps.
Chatbots can handle numerous
conversations at once without
requiring a person on the other
end answering messages by
hand.
WHY DO WE NEED CHATBOTS? (CONTINUED)
Apps consume most of the memory on a device, so users do not want to
install a separate app for every purpose.
Trends show that over 90% of apps are uninstalled after their first use.
Developing a chatbot takes significantly less time, and a chatbot is also
easier to maintain and less expensive than an app.
TAXONOMY OF MODELS
TAXONOMY OF MODELS (CONTINUED)
Retrieval-based models (easier) use a repository of predefined
responses and some kind of heuristic to pick an appropriate response
based on the input and context.
I. Respond with rule-based expressions; they don't generate any new text.
II. The heuristic can be as sophisticated as an ensemble of machine-learning classifiers.
III. Just pick a response from a fixed set.
IV. Don't make any grammatical mistakes.
V. In an open domain, it is impossible to build a repository of
handcrafted responses.
TAXONOMY OF MODELS (CONTINUED)
Generative models (harder) don't rely on predefined responses; they
generate new responses from scratch. Generative models are typically
based on Machine Translation techniques, but instead of translating
from one language to another, we "translate" from an input to an
output (response).
I. A huge amount of data is needed to train the model.
II. On long text, these models make grammatical mistakes.
III. Even in a closed domain, generative models are harder to train than
retrieval-based models.
TAXONOMY OF MODELS (CONTINUED)
The encoder data will be the text from one side of the conversation; the
decoder data will be the responses.
Tokenize each sentence by chopping it into words and giving every word a
token ID so that lookups are fast, then train the model. A sketch of this step follows.
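A minimal sketch of this tokenization step (not the author's code; the special symbols anticipate the PAD/GO/EOS/UNK markers introduced later in the pre-processing slides):

```python
import re

def tokenize(sentence):
    # Naive split on words and punctuation; a real pipeline would use a proper tokenizer.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def build_vocab(sentences):
    # Reserve IDs for the special symbols, then assign an ID to every new word.
    vocab = {"PAD": 0, "GO": 1, "EOS": 2, "UNK": 3}
    for s in sentences:
        for w in tokenize(s):
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(sentence, vocab):
    # Map each word to its token ID, falling back to UNK for unseen words.
    return [vocab.get(w, vocab["UNK"]) for w in tokenize(sentence)]

corpus = ["How are you?", "I am fine."]
vocab = build_vocab(corpus)
print(encode("How are you?", vocab))   # e.g. [4, 5, 6, 7]
```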
RECURRENT NEURAL NETWORKS-
PROMISING IN NLP TASKS
Applications:
1. An RNN allows us to score arbitrary sentences based on how likely they
are to occur in the real world, which gives us a measure of grammatical
and semantic correctness (useful for machine translation).
2. It allows us to generate new text (useful for language modelling, i.e. a
chatbot).
IDEA BEHIND RNN
I. To make use of sequential information.
II. In a traditional neural network, we assume that all inputs (and outputs)
are independent of each other.
III. For NLP tasks this is a bad idea: if you want to predict the next word
in a sentence, you need to know which words came before it.
IV. RNNs have a "memory" which captures information about what
has been calculated so far.
WORKING PRINCIPLE OF RNN
 x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to
the second word of a sentence.
 s_t is the hidden state at time step t:
s_t = f(U x_t + W s_{t-1}),
where f is a non-linear function such as tanh or ReLU.
 o_t is the output at step t:
o_t = softmax(V s_t).
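A minimal numpy sketch of one forward pass through this recurrence (vocabulary size and hidden size are made-up toy values, not from the slides):

```python
import numpy as np

V, H = 8, 4                              # toy vocabulary size and hidden size
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (H, V))           # input-to-hidden weights
W = rng.normal(0, 0.1, (H, H))           # hidden-to-hidden weights (shared across steps)
Vout = rng.normal(0, 0.1, (V, H))        # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(token_ids):
    s = np.zeros(H)                      # s_0: initial hidden state
    outputs = []
    for t in token_ids:
        x = np.zeros(V); x[t] = 1.0      # one-hot input x_t
        s = np.tanh(U @ x + W @ s)       # s_t = f(U x_t + W s_{t-1})
        outputs.append(softmax(Vout @ s))  # o_t = softmax(V s_t)
    return outputs

probs = forward([0, 3, 5])               # per-step probabilities over the vocabulary
```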
IMPORTANT POINTS ON RNN
I. Unlike a traditional deep neural network, an RNN shares the same
parameters (U, V, W above) across all steps. This greatly reduces
the total number of parameters we need to learn.
II. In theory RNNs can make use of information in arbitrarily long
sequences, but in practice they are limited to looking back only a
few steps.
III. Certain types of RNNs (like LSTMs and GRUs, a simplified version of
the LSTM) were specifically designed to overcome the vanishing-gradient
problem (difficulty learning long-term dependencies).
LONG SHORT-TERM MEMORY NETWORKS (LSTM)
LSTMs are explicitly designed to avoid the long-term dependency
problem. Remembering information for long periods of time is
practically their default behavior, not something they struggle to
learn!
THE CORE IDEA BEHIND LSTMS
I. The cell state is kind of like a conveyor belt. It runs straight down the entire
chain, with only some minor linear interactions. It’s very easy for information
to just flow along it unchanged.
II. The LSTM does have the ability to remove or add information to the cell state,
carefully regulated by structures called gates.
III. The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through.
STEP-BY-STEP LSTM WALK-THROUGH
•First, decide what information we're going to throw away from the cell state.
This decision is made by a sigmoid layer called the "forget gate layer."
It looks at h(t-1) and x(t), and outputs a number between 0 and 1 for each
number in the cell state C(t-1).
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Next, decide what new information we're going to store in the cell state.
I. A sigmoid layer called the "input gate layer" decides which values
we'll update.
II. A tanh layer creates a vector of new candidate values, ~C(t), that
could be added to the state. In the next step, we'll combine these
two to create an update to the state.
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Update the old cell state C(t-1) into the new cell state C(t).
I. Multiply the old state by f(t), forgetting the things we decided to
forget earlier.
II. Then add i(t) * ~C(t): the new candidate values, scaled by
how much we decided to update each state value.
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Finally, decide what we're going to output.
1. First, run a sigmoid layer which decides what parts of the cell state we're
going to output.
2. Put the cell state through tanh (to push the values to be between -1 and 1)
and multiply it by the output of the sigmoid gate, so that we only output the
parts we decided to. The full set of gate equations is collected below.
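For reference, the four steps above correspond to the standard LSTM update equations (σ is the sigmoid, ⊙ is element-wise multiplication, and each W, b pair is a learned weight matrix and bias):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{new hidden state}
\end{aligned}
```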
GATED RECURRENT UNITS (GRU)
•A GRU has two gates; an LSTM has three gates.
•GRUs don't possess an internal memory (C(t)) that is different from the exposed hidden
state, and they don't have the output gate that is present in LSTMs.
•The input and forget gates are coupled by an update gate z, and the reset gate r is applied
directly to the previous hidden state. Thus, the responsibility of the reset gate in an LSTM
is really split up into both r and z.
•We don't apply a second non-linearity when computing the output. One common form of the GRU update is shown below.
ADDING A SECOND GRU LAYER
1. Adding a second layer to our network allows our model to
capture higher-level interactions.
2. We are likely to see diminishing returns after 2-3 layers, and
unless we have a huge amount of data (which we don't), more layers
are unlikely to make a big difference and may lead to overfitting.
GRU VS LSTM
 In many tasks both architectures yield comparable performance and
tuning hyperparameters like layer size is probably more important
than picking the ideal architecture.
 GRUs have fewer parameters (U and W are smaller) and thus may
train a bit faster or need less data to generalize.
 On the other hand, if you have enough data, the greater expressive
power of LSTMs may lead to better results.
PRE-PROCESSING THE DATA
•TOKENIZE TEXT
We want to make predictions on a per-word basis. This means we
must tokenize our text into sentences, and sentences into words.
The sentence "He left!" should be 3 tokens: "He", "left", "!".
•REMOVE INFREQUENT WORDS
Most words in our text will only appear once or twice. It's a good
idea to remove these infrequent words, as having a huge vocabulary
makes the model slow to train. A sketch of this step follows.
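A minimal sketch of pruning infrequent words (the cutoff value and special symbols are illustrative choices, not from the slides):

```python
from collections import Counter

MIN_COUNT = 2   # keep words that appear at least this often

def prune_vocab(tokenized_sentences):
    counts = Counter(w for s in tokenized_sentences for w in s)
    vocab = {"PAD": 0, "GO": 1, "EOS": 2, "UNK": 3}
    for w, c in counts.items():
        if c >= MIN_COUNT:
            vocab.setdefault(w, len(vocab))
    return vocab

sentences = [["he", "left", "!"], ["he", "came", "back", "!"]]
vocab = prune_vocab(sentences)
# "left", "came", "back" fall below the cutoff, so they will map to UNK at encoding time.
print(vocab)
```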
PRE-PROCESSING THE DATA (CONTINUED)
•PADDING
Before training, we convert the variable-length sequences in the dataset into fixed-length
sequences by padding. We use a few special symbols to fill in the sequence.
1. EOS : End of sentence
2. PAD : Filler
3. GO : Start decoding
4. UNK : Unknown; word not in vocabulary
Consider the following query-response pair:
Q : How are you?
A : I am fine.
Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be
converted to:
Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
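A minimal sketch of this padding scheme, assuming a fixed length of 10 and the special symbols above (note the query is reversed, as in the example):

```python
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_query(tokens, length=10):
    # Queries are reversed and left-padded.
    tokens = list(reversed(tokens))
    return [PAD] * (length - len(tokens)) + tokens

def pad_response(tokens, length=10):
    # Responses are wrapped in GO/EOS and right-padded.
    seq = [GO] + tokens + [EOS]
    return seq + [PAD] * (length - len(seq))

print(pad_query(["How", "are", "you", "?"]))
# ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'you', 'are', 'How']
print(pad_response(["I", "am", "fine", "."]))
# ['GO', 'I', 'am', 'fine', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD']
```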
PRE-PROCESSING THE DATA (CONTINUED)
•BUCKETING
•If the largest sentence in our dataset has length 100, we need to encode all our sentences
to length 100 in order not to lose any words. Now, what happens to "How are you?"?
There will be 97 PAD symbols in its encoded version, and they will
overshadow the actual information in the sentence.
•Bucketing kind of solves this problem, by putting sentences into buckets of different
sizes. Consider this list of buckets : [ (5,10), (10,15), (20,25), (40,50) ].
•If the length of a query is 4 and the length of its response is 4 (as in our previous
example), we put this sentence in the bucket (5,10). The query will be padded to length 5
and the response will be padded to length 10.
•If we are using the bucket (5,10), our sentences will be encoded to :
Q : [ PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
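A minimal sketch of picking a bucket (the exact rule for reserving space for GO and EOS is an assumption of this sketch):

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(query_len, response_len):
    # Pick the smallest bucket that fits the query, and the response plus GO and EOS.
    for q_size, r_size in BUCKETS:
        if query_len <= q_size and response_len + 2 <= r_size:
            return q_size, r_size
    raise ValueError("sequence too long for all buckets")

print(choose_bucket(4, 4))   # (5, 10), as in the "How are you?" example
```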
WORD EMBEDDING
•CO-OCCURRENCE MATRIX
Since deep learning loves math, we're going to represent each word as
a d-dimensional vector.
Here, the example sentence has 6 distinct words, so each word is represented by a 6-dimensional vector.
WORD EMBEDDING (CONTINUED)
Extracting the rows from this matrix gives us a simple initialization of our
word vectors.
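A minimal sketch of building such a co-occurrence matrix. The example sentence is an assumption (the slide's figure is not reproduced here); it is chosen to match the 6 distinct words and the words discussed on the next slide:

```python
import numpy as np

# Assumed example sentence, not shown in the extracted slides.
sentence = "I love NLP and I like dogs".split()
words = sorted(set(sentence))
index = {w: i for i, w in enumerate(words)}

window = 1                               # count immediate neighbours only
M = np.zeros((len(words), len(words)), dtype=int)
for i, w in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            M[index[w], index[sentence[j]]] += 1

print(words)       # 6 distinct words
print(M)           # each row of M is a simple 6-dimensional word vector
```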
INFERENCE FROM THE ABOVE
EXAMPLE
I. Notice that the words ‘love’ and ‘like’ both contain 1’s for their counts
with nouns (NLP and dogs).
II. They also have 1’s for the count with “I”, thus indicating that the words
must be some sort of verb.
III. With a larger dataset than just one sentence, this similarity will become
clearer, as 'like', 'love', and other synonyms begin to have similar word
vectors because they are used in similar contexts.
LIMITATION
I. The dimensionality of each word will increase linearly with the size of the
corpus.
II. If we had a million words (not really a lot by NLP standards), we'd have a
million-by-million matrix which would be extremely sparse (lots of
0's) and definitely not the best in terms of storage efficiency. The Word2Vec
approach, described next, is an alternative.
WORD2VEC APPROACH
•Word2Vec operates on the idea that we want to predict the surrounding words
of every word.
We're going to look at the first 3 words of the example sentence, with window size m = 3.
The goal is to take the center word, 'love', and predict the words that come before
and after it by maximizing an objective function: the log probability of any
context word given the current center word,
J(θ) = (1/T) Σ_t Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t).
This cost function is basically saying that we're going to add the log
probabilities of 'I' and 'love' as well as 'NLP' and 'love' (where 'love' is the center
word in both cases).
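The probability of an outer (context) word o given the center word c is the usual softmax over dot products of word vectors, in the notation of the next slide:

```latex
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```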
WORD2VEC
APPROACH(CONTINUED..)
Vc is the word vector of the center word. Every word has two vector
representations (Uo and Uw), one for when the word is used as the center word
and one for when it’s used as the outer word. The vectors are trained with
stochastic gradient descent.
Word2Vec seeks to find vector representations of different words by maximizing
the log probability of context words given a center word and modifying the
vectors through SGD.
The most interesting contribution of Word2Vec was the appearance of linear
relationships between different word vectors.
After training, the word vectors seemed to capture different grammatical and
semantic concepts (the classic example being vector('king') - vector('man') + vector('woman') ≈ vector('queen')).
It's pretty incredible how these linear relationships could be formed through such a
simple training objective and optimization procedure.
ALGORITHM OF WORD2VEC
•Two algorithms:
1. Skip-gram (SG): predict context words given the target word (position
independent).
2. Continuous Bag of Words (CBOW): predict the target word from its
bag-of-words context.
•Two (moderately efficient) training methods:
1. Hierarchical softmax
2. Negative sampling
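A minimal usage sketch with the gensim library (an assumption; the slides do not name a specific Word2Vec implementation):

```python
from gensim.models import Word2Vec

sentences = [["i", "love", "nlp"], ["i", "like", "dogs"]]   # toy tokenized corpus
model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality d of the word vectors (named `size` in gensim < 4.0)
    window=3,          # context window m
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling; set hs=1, negative=0 for hierarchical softmax
    min_count=1,
)
print(model.wv.most_similar("love"))
```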
SKIP-GRAM PREDICTION
TO TRAIN THE MODEL: COMPUTE ALL
VECTOR GRADIENTS!
•We often define the set of all parameters of a model in terms of one long
vector θ (theta).
•We then optimize these parameters using gradient descent, as shown below.
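Each gradient-descent step moves the parameters a small amount against the gradient of the cost function, with step size (learning rate) α:

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```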
SEQUENCE TO SEQUENCE MODEL
FOR CHATBOT
•The Sequence-to-Sequence model has become the go-to model for dialogue
systems and machine translation.
•It consists of two RNNs (each an LSTM or GRU):
I. An encoder
II. A decoder
Encoder
•The encoder takes a sequence (sentence) as input and processes one
symbol (word) at each timestep.
•Its objective is to convert the sequence of symbols into a fixed-size feature
vector that encodes only the important information in the sequence while
losing the unnecessary information.
•You can visualize the data flow in the encoder along the time axis as the flow of
local information from one end of the sequence to the other.
SEQUENCE TO SEQUENCE MODEL FOR
CHATBOT(CONTINUED..)
•Each hidden state influences the next hidden state, and the final hidden state can be
seen as the summary of the sequence. This state is called the context or thought
vector, as it represents the intention of the sequence.
•From the context, the decoder generates another sequence, one symbol (word) at a
time. Here, at each time step, the decoder is influenced by the context and the
previously generated symbols.
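A minimal encoder-decoder sketch in Keras GRU layers (the framework choice, layer sizes, and vocabulary size are assumptions, not from the slides):

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN = 8000, 128, 256

# Encoder: reads the padded query and summarises it into a thought vector.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(enc_inputs)
_, thought_vector = tf.keras.layers.GRU(HIDDEN, return_state=True)(enc_emb)

# Decoder: starts from the thought vector and predicts the response word by word.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(dec_inputs)
dec_outputs, _ = tf.keras.layers.GRU(HIDDEN, return_sequences=True, return_state=True)(
    dec_emb, initial_state=thought_vector)
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# During training, the decoder input is the GO-prefixed response and the target is
# the response shifted by one token (teacher forcing).
```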
DUAL ENCODER LSTM ALGORITHM FOR
SEQ2SEQ
1. Both the context and the response text are split into words, and
each word is embedded into a vector. The word embeddings are
initialized with Word2Vec skip-gram vectors and are fine-tuned
during training.
2. Both the embedded context and the embedded response are fed into the same
Recurrent Neural Network word by word. The RNN generates a
vector representation that, loosely speaking, captures the
"meaning" of the context and of the response (call them c and r). We
can choose how large these vectors should be, but let's say we
pick 256 dimensions.
3. We multiply c with a matrix M to “predict” a response r’. If c is a
256-dimensional vector, then M is a 256×256 dimensional matrix,
and the result is another 256-dimensional vector, which we can
interpret as a generated response. The matrix M is learned during
training.
DUAL ENCODER LSTM ALGORITHM FOR
SEQ2SEQ(CONT.)
•We measure the similarity of the predicted response r’ and the
actual response r by taking the dot product of these two
vectors.
•A large dot product means the vectors are similar and that the
response should receive a high score.
• We then apply a sigmoid function to convert that score into a
probability.
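A minimal numpy sketch of this scoring step, using the 256 dimensions mentioned above (in practice c and r come from the shared RNN encoder; here they are random stand-ins):

```python
import numpy as np

DIM = 256
rng = np.random.default_rng(0)
c = rng.normal(size=DIM)          # encoded context
r = rng.normal(size=DIM)          # encoded candidate response
M = rng.normal(size=(DIM, DIM))   # learned prediction matrix

r_pred = M @ c                         # "predicted" response vector r'
score = r_pred @ r                     # dot-product similarity between r' and r
prob = 1.0 / (1.0 + np.exp(-score))    # sigmoid turns the score into a probability
```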
REFERENCES
• https://github.com/Marsan-Ma/chat_corpus (source of conversation data for the trial chatbot)
• http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
• http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
• http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
• http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
REFERENCES…
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• http://cs231n.github.io/optimization-1/
• http://colah.github.io/posts/2015-08-Backprop/
• http://cs231n.github.io/optimization-2/
• http://neuralnetworksanddeeplearning.com/chap2.html
• http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf
• http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
• http://sebastianruder.com/word-embeddings-1/
• http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
• https://www.tensorflow.org/tutorials/seq2seq