CHATBOT A Generative Based
Approach
-MANISH MISHRA
WHAT IS A CHATBOT?
A chatbot is a program that communicates with us.
A chatbot is a service, powered by rules and sometimes artificial
intelligence, that we interact with via a chat interface.
Some chatterbots use sophisticated natural language processing
systems, but many simpler systems scan for keywords within the
input, then pull a reply with the most matching keywords, or the most
similar wording pattern, from a database.
Today, chatbots are part of virtual assistants such as Google
Assistant, and are accessed via many organizations' apps and websites,
and on instant messaging platforms such as Facebook Messenger.
WHY DO WE NEED CHATBOTS?
Trends show that users are investing more and more time in messaging apps.
Chatbots can handle numerous
conversations at once without
requiring a person on the other
end answering messages by
hand.
WHY DO WE NEED CHATBOTS? (CONTINUED)
Apps consume most of the memory on a device, so users do not want to
install a separate app for every purpose.
Trends show that over 90% of apps are uninstalled after their first use.
Developing a chatbot takes significantly less time, and a chatbot is also
easier to maintain and less expensive than an app.
TAXONOMY OF MODELS
TAXONOMY OF MODELS (CONTINUED)
Retrieval-based models (easier) use a repository of predefined
responses and some kind of heuristic to pick an appropriate response
based on the input and context.
I. Respond with rule-based expressions; they don't generate any new text.
II. The heuristic can be as sophisticated as an ensemble of machine-learning classifiers.
III. Just pick a response from a fixed set.
IV. Don't make any grammatical mistakes.
V. In an open domain, it is impossible to build a repository of
handcrafted responses.
TAXONOMY OF MODELS (CONTINUED)
Generative models (harder) don't rely on predefined responses; they
generate new responses from scratch. Generative models are typically
based on Machine Translation techniques, but instead of translating
from one language to another, we "translate" from an input to an
output (response).
I. A huge amount of data is needed to train the model.
II. On long text, these models make grammatical mistakes.
III. Even in a closed domain, generative models are harder to train than
retrieval-based models.
TAXONOMY OF MODELS (CONTINUED)
The encoder data will be the text from one side of the conversation; the
decoder data will be the responses.
Tokenize each sentence by chopping it into words and giving every word a
token ID so that lookups are fast, then train the model. A sketch of this step follows.
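A minimal sketch of this tokenization step (not the author's code; the special symbols anticipate the PAD/GO/EOS/UNK markers introduced later in the pre-processing slides):

```python
import re

def tokenize(sentence):
    # Naive split on words and punctuation; a real pipeline would use a proper tokenizer.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def build_vocab(sentences):
    # Reserve IDs for the special symbols, then assign an ID to every new word.
    vocab = {"PAD": 0, "GO": 1, "EOS": 2, "UNK": 3}
    for s in sentences:
        for w in tokenize(s):
            vocab.setdefault(w, len(vocab))
    return vocab

def encode(sentence, vocab):
    # Map each word to its token ID, falling back to UNK for unseen words.
    return [vocab.get(w, vocab["UNK"]) for w in tokenize(sentence)]

corpus = ["How are you?", "I am fine."]
vocab = build_vocab(corpus)
print(encode("How are you?", vocab))   # e.g. [4, 5, 6, 7]
```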
RECURRENT NEURAL NETWORKS-
PROMISING IN NLP TASKS
Applications:
1. An RNN allows us to score arbitrary sentences based on how likely they
are to occur in the real world, which gives us a measure of grammatical
and semantic correctness (useful for machine translation).
2. It allows us to generate new text (useful for language modelling, i.e. a
chatbot).
IDEA BEHIND RNN
I. To make use of sequential information.
II. In a traditional neural network, we assume that all inputs (and outputs)
are independent of each other.
III. For NLP tasks this is a bad idea: if you want to predict the next word
in a sentence, you need to know which words came before it.
IV. RNNs have a "memory" which captures information about what
has been calculated so far.
WORKING PRINCIPLE OF RNN
 x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding to
the second word of a sentence.
 s_t is the hidden state at time step t:
s_t = f(U x_t + W s_{t-1}),
where f is a non-linear function such as tanh or ReLU.
 o_t is the output at step t:
o_t = softmax(V s_t).
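A minimal numpy sketch of one forward pass through this recurrence (vocabulary size and hidden size are made-up toy values, not from the slides):

```python
import numpy as np

V, H = 8, 4                              # toy vocabulary size and hidden size
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (H, V))           # input-to-hidden weights
W = rng.normal(0, 0.1, (H, H))           # hidden-to-hidden weights (shared across steps)
Vout = rng.normal(0, 0.1, (V, H))        # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(token_ids):
    s = np.zeros(H)                      # s_0: initial hidden state
    outputs = []
    for t in token_ids:
        x = np.zeros(V); x[t] = 1.0      # one-hot input x_t
        s = np.tanh(U @ x + W @ s)       # s_t = f(U x_t + W s_{t-1})
        outputs.append(softmax(Vout @ s))  # o_t = softmax(V s_t)
    return outputs

probs = forward([0, 3, 5])               # per-step probabilities over the vocabulary
```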
IMPORTANT POINTS ON RNN
I. Unlike a traditional deep neural network, an RNN shares the same
parameters (U, V, W above) across all steps. This greatly reduces
the total number of parameters we need to learn.
II. In theory RNNs can make use of information in arbitrarily long
sequences, but in practice they are limited to looking back only a
few steps.
III. Certain types of RNNs (like LSTMs and GRUs, a simplified version of
the LSTM) were specifically designed to overcome the vanishing-gradient
problem (difficulty learning long-term dependencies).
LONG SHORT-TERM MEMORY NETWORKS (LSTM)
LSTMs are explicitly designed to avoid the long-term dependency
problem. Remembering information for long periods of time is
practically their default behavior, not something they struggle to
learn!
THE CORE IDEA BEHIND LSTMS
I. The cell state is kind of like a conveyor belt. It runs straight down the entire
chain, with only some minor linear interactions. It’s very easy for information
to just flow along it unchanged.
II. The LSTM does have the ability to remove or add information to the cell state,
carefully regulated by structures called gates.
III. The sigmoid layer outputs numbers between zero and one, describing how
much of each component should be let through.
STEP-BY-STEP LSTM WALK-THROUGH
•First, decide what information we're going to throw away from the cell state.
This decision is made by a sigmoid layer called the "forget gate layer."
It looks at h(t-1) and x(t), and outputs a number between 0 and 1 for each
number in the cell state C(t-1).
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Next, decide what new information we're going to store in the cell state.
I. A sigmoid layer called the "input gate layer" decides which values
we'll update.
II. A tanh layer creates a vector of new candidate values, ~C(t), that
could be added to the state. In the next step, we'll combine these
two to create an update to the state.
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Update the old cell state C(t-1) into the new cell state C(t).
I. Multiply the old state by f(t), forgetting the things we decided to
forget earlier.
II. Then add i(t) * ~C(t): the new candidate values, scaled by
how much we decided to update each state value.
STEP-BY-STEP LSTM WALK-THROUGH (CONTINUED)
•Finally, decide what we're going to output.
1. First, run a sigmoid layer which decides what parts of the cell state we're
going to output.
2. Put the cell state through tanh (to push the values to be between -1 and 1)
and multiply it by the output of the sigmoid gate, so that we only output the
parts we decided to. The full set of gate equations is collected below.
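For reference, the four steps above correspond to the standard LSTM update equations (σ is the sigmoid, ⊙ is element-wise multiplication, and each W, b pair is a learned weight matrix and bias):

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{new cell state} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{new hidden state}
\end{aligned}
```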
GATED RECURRENT UNITS (GRU)
•A GRU has two gates; an LSTM has three gates.
•GRUs don't possess an internal memory (C(t)) that is different from the exposed hidden
state, and they don't have the output gate that is present in LSTMs.
•The input and forget gates are coupled by an update gate z, and the reset gate r is applied
directly to the previous hidden state. Thus, the responsibility of the reset gate in an LSTM
is really split up into both r and z.
•We don't apply a second non-linearity when computing the output. One common form of the GRU update is shown below.
ADDING A SECOND GRU LAYER
1. Adding a second layer to our network allows our model to
capture higher-level interactions.
2. We are likely to see diminishing returns after 2-3 layers, and
unless we have a huge amount of data (which we don't), more layers
are unlikely to make a big difference and may lead to overfitting.
GRU VS LSTM
 In many tasks both architectures yield comparable performance and
tuning hyperparameters like layer size is probably more important
than picking the ideal architecture.
 GRUs have fewer parameters (U and W are smaller) and thus may
train a bit faster or need less data to generalize.
 On the other hand, if you have enough data, the greater expressive
power of LSTMs may lead to better results.
PRE-PROCESSING THE DATA
•TOKENIZE TEXT
We want to make predictions on a per-word basis. This means we
must tokenize our text into sentences, and sentences into words.
The sentence "He left!" should be 3 tokens: "He", "left", "!".
•REMOVE INFREQUENT WORDS
Most words in our text will only appear once or twice. It's a good
idea to remove these infrequent words, as having a huge vocabulary
makes the model slow to train. A sketch of this step follows.
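A minimal sketch of pruning infrequent words (the cutoff value and special symbols are illustrative choices, not from the slides):

```python
from collections import Counter

MIN_COUNT = 2   # keep words that appear at least this often

def prune_vocab(tokenized_sentences):
    counts = Counter(w for s in tokenized_sentences for w in s)
    vocab = {"PAD": 0, "GO": 1, "EOS": 2, "UNK": 3}
    for w, c in counts.items():
        if c >= MIN_COUNT:
            vocab.setdefault(w, len(vocab))
    return vocab

sentences = [["he", "left", "!"], ["he", "came", "back", "!"]]
vocab = prune_vocab(sentences)
# "left", "came", "back" fall below the cutoff, so they will map to UNK at encoding time.
print(vocab)
```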
PRE-PROCESSING THE DATA (CONTINUED)
•PADDING
Before training, we convert the variable-length sequences in the dataset into fixed-length
sequences by padding. We use a few special symbols to fill in the sequence.
1. EOS : End of sentence
2. PAD : Filler
3. GO : Start decoding
4. UNK : Unknown; word not in vocabulary
Consider the following query-response pair:
Q : How are you?
A : I am fine.
Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair will be
converted to:
Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
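A minimal sketch of this padding scheme, assuming a fixed length of 10 and the special symbols above (note the query is reversed, as in the example):

```python
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_query(tokens, length=10):
    # Queries are reversed and left-padded.
    tokens = list(reversed(tokens))
    return [PAD] * (length - len(tokens)) + tokens

def pad_response(tokens, length=10):
    # Responses are wrapped in GO/EOS and right-padded.
    seq = [GO] + tokens + [EOS]
    return seq + [PAD] * (length - len(seq))

print(pad_query(["How", "are", "you", "?"]))
# ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'you', 'are', 'How']
print(pad_response(["I", "am", "fine", "."]))
# ['GO', 'I', 'am', 'fine', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD']
```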
PRE-PROCESSING THE DATA (CONTINUED)
•BUCKETING
•If the largest sentence in our dataset has length 100, we need to encode all our sentences
to length 100 in order not to lose any words. Now, what happens to "How are you?"?
There will be 97 PAD symbols in its encoded version, and they will
overshadow the actual information in the sentence.
•Bucketing kind of solves this problem, by putting sentences into buckets of different
sizes. Consider this list of buckets : [ (5,10), (10,15), (20,25), (40,50) ].
•If the length of a query is 4 and the length of its response is 4 (as in our previous
example), we put this sentence in the bucket (5,10). The query will be padded to length 5
and the response will be padded to length 10.
•If we are using the bucket (5,10), our sentences will be encoded to :
Q : [ PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
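A minimal sketch of picking a bucket (the exact rule for reserving space for GO and EOS is an assumption of this sketch):

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(query_len, response_len):
    # Pick the smallest bucket that fits the query, and the response plus GO and EOS.
    for q_size, r_size in BUCKETS:
        if query_len <= q_size and response_len + 2 <= r_size:
            return q_size, r_size
    raise ValueError("sequence too long for all buckets")

print(choose_bucket(4, 4))   # (5, 10), as in the "How are you?" example
```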
WORD EMBEDDING
•CO-OCCURRENCE MATRIX
Since deep learning loves math, we're going to represent each word as
a d-dimensional vector.
Here, the example sentence has 6 distinct words, so each word is represented by a 6-dimensional vector.
WORD EMBEDDING (CONTINUED)
Extracting the rows from this matrix gives us a simple initialization of our
word vectors.
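A minimal sketch of building such a co-occurrence matrix. The example sentence is an assumption (the slide's figure is not reproduced here); it is chosen to match the 6 distinct words and the words discussed on the next slide:

```python
import numpy as np

# Assumed example sentence, not shown in the extracted slides.
sentence = "I love NLP and I like dogs".split()
words = sorted(set(sentence))
index = {w: i for i, w in enumerate(words)}

window = 1                               # count immediate neighbours only
M = np.zeros((len(words), len(words)), dtype=int)
for i, w in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if i != j:
            M[index[w], index[sentence[j]]] += 1

print(words)       # 6 distinct words
print(M)           # each row of M is a simple 6-dimensional word vector
```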
INFERENCE FROM THE ABOVE
EXAMPLE
I. Notice that the words ‘love’ and ‘like’ both contain 1’s for their counts
with nouns (NLP and dogs).
II. They also have 1’s for the count with “I”, thus indicating that the words
must be some sort of verb.
III. With a larger dataset than just one sentence, this similarity will become
clearer, as 'like', 'love', and other synonyms begin to have similar word
vectors because they are used in similar contexts.
LIMITATION
I. The dimensionality of each word will increase linearly with the size of the
corpus.
II. If we had a million words (not really a lot by NLP standards), we'd have a
million-by-million matrix which would be extremely sparse (lots of
0's) and definitely not the best in terms of storage efficiency. The Word2Vec
approach, described next, is an alternative.
WORD2VEC APPROACH
•Word2Vec operates on the idea that we want to predict the surrounding words
of every word.
We're going to look at the first 3 words of the example sentence, with window size m = 3.
The goal is to take the center word, 'love', and predict the words that come before
and after it by maximizing an objective function: the log probability of any
context word given the current center word,
J(θ) = (1/T) Σ_t Σ_{-m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t).
This cost function is basically saying that we're going to add the log
probabilities of 'I' and 'love' as well as 'NLP' and 'love' (where 'love' is the center
word in both cases).
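The probability of an outer (context) word o given the center word c is the usual softmax over dot products of word vectors, in the notation of the next slide:

```latex
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```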
WORD2VEC
APPROACH(CONTINUED..)
Vc is the word vector of the center word. Every word has two vector
representations (Uo and Uw), one for when the word is used as the center word
and one for when it’s used as the outer word. The vectors are trained with
stochastic gradient descent.
Word2Vec seeks to find vector representations of different words by maximizing
the log probability of context words given a center word and modifying the
vectors through SGD.
The most interesting contribution of Word2Vec was the appearance of linear
relationships between different word vectors.
After training, the word vectors seemed to capture different grammatical and
semantic concepts (the classic example being vector('king') - vector('man') + vector('woman') ≈ vector('queen')).
It's pretty incredible how these linear relationships could be formed through such a
simple training objective and optimization procedure.
ALGORITHM OF WORD2VEC
•Two algorithms:
1. Skip-gram (SG): predict context words given the target word (position
independent).
2. Continuous Bag of Words (CBOW): predict the target word from its
bag-of-words context.
•Two (moderately efficient) training methods:
1. Hierarchical softmax
2. Negative sampling
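A minimal usage sketch with the gensim library (an assumption; the slides do not name a specific Word2Vec implementation):

```python
from gensim.models import Word2Vec

sentences = [["i", "love", "nlp"], ["i", "like", "dogs"]]   # toy tokenized corpus
model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality d of the word vectors (named `size` in gensim < 4.0)
    window=3,          # context window m
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling; set hs=1, negative=0 for hierarchical softmax
    min_count=1,
)
print(model.wv.most_similar("love"))
```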
SKIP-GRAM PREDICTION
TO TRAIN THE MODEL: COMPUTE ALL
VECTOR GRADIENTS!
•We often define the set of all parameters of a model in terms of one long
vector θ (theta).
•We then optimize these parameters using gradient descent, as shown below.
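Each gradient-descent step moves the parameters a small amount against the gradient of the cost function, with step size (learning rate) α:

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```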
SEQUENCE TO SEQUENCE MODEL
FOR CHATBOT
•The Sequence-to-Sequence model has become the go-to model for dialogue
systems and machine translation.
•It consists of two RNNs (each an LSTM or GRU):
I. An encoder
II. A decoder
Encoder
•The encoder takes a sequence (sentence) as input and processes one
symbol (word) at each timestep.
•Its objective is to convert the sequence of symbols into a fixed-size feature
vector that encodes only the important information in the sequence while
losing the unnecessary information.
•You can visualize the data flow in the encoder along the time axis as the flow of
local information from one end of the sequence to the other.
SEQUENCE TO SEQUENCE MODEL FOR
CHATBOT(CONTINUED..)
•Each hidden state influences the next hidden state, and the final hidden state can be
seen as the summary of the sequence. This state is called the context or thought
vector, as it represents the intention of the sequence.
•From the context, the decoder generates another sequence, one symbol (word) at a
time. Here, at each time step, the decoder is influenced by the context and the
previously generated symbols.
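A minimal encoder-decoder sketch in Keras GRU layers (the framework choice, layer sizes, and vocabulary size are assumptions, not from the slides):

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN = 8000, 128, 256

# Encoder: reads the padded query and summarises it into a thought vector.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(enc_inputs)
_, thought_vector = tf.keras.layers.GRU(HIDDEN, return_state=True)(enc_emb)

# Decoder: starts from the thought vector and predicts the response word by word.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(dec_inputs)
dec_outputs, _ = tf.keras.layers.GRU(HIDDEN, return_sequences=True, return_state=True)(
    dec_emb, initial_state=thought_vector)
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# During training, the decoder input is the GO-prefixed response and the target is
# the response shifted by one token (teacher forcing).
```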
DUAL ENCODER LSTM ALGORITHM FOR
SEQ2SEQ
1. Both the context and the response text are split into words, and
each word is embedded into a vector. The word embeddings are
initialized with Word2Vec skip-gram vectors and are fine-tuned
during training.
2. Both the embedded context and the embedded response are fed into the same
Recurrent Neural Network word by word. The RNN generates a
vector representation that, loosely speaking, captures the
"meaning" of the context and of the response (call them c and r). We
can choose how large these vectors should be, but let's say we
pick 256 dimensions.
3. We multiply c with a matrix M to “predict” a response r’. If c is a
256-dimensional vector, then M is a 256×256 dimensional matrix,
and the result is another 256-dimensional vector, which we can
interpret as a generated response. The matrix M is learned during
training.
DUAL ENCODER LSTM ALGORITHM FOR
SEQ2SEQ(CONT.)
•We measure the similarity of the predicted response r’ and the
actual response r by taking the dot product of these two
vectors.
•A large dot product means the vectors are similar and that the
response should receive a high score.
• We then apply a sigmoid function to convert that score into a
probability.
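A minimal numpy sketch of this scoring step, using the 256 dimensions mentioned above (in practice c and r come from the shared RNN encoder; here they are random stand-ins):

```python
import numpy as np

DIM = 256
rng = np.random.default_rng(0)
c = rng.normal(size=DIM)          # encoded context
r = rng.normal(size=DIM)          # encoded candidate response
M = rng.normal(size=(DIM, DIM))   # learned prediction matrix

r_pred = M @ c                         # "predicted" response vector r'
score = r_pred @ r                     # dot-product similarity between r' and r
prob = 1.0 / (1.0 + np.exp(-score))    # sigmoid turns the score into a probability
```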
REFERENCES
• https://github.com/Marsan-Ma/chat_corpus (source of conversation data for the trial chatbot)
• http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
• http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/
• http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/
• http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
REFERENCES…
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• http://cs231n.github.io/optimization-1/
• http://colah.github.io/posts/2015-08-Backprop/
• http://cs231n.github.io/optimization-2/
• http://neuralnetworksanddeeplearning.com/chap2.html
• http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf
• http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
• http://sebastianruder.com/word-embeddings-1/
• http://suriyadeepan.github.io/2016-06-28-easy-seq2seq/
• https://www.tensorflow.org/tutorials/seq2seq