Sequencing and Attention in
Deep Learning
By Sohail Jabbar
What is a Time Step in a Sequence Model?
• In Recurrent Neural Networks, there is a concept of time steps. This
means that the recurrent cells or units take inputs from a sequence
one by one. Each step at which the cell picks up an input is called a
time step.
• For example, if we have a sequence of words that form a sentence, such
as “It’s a sunny day.”, our recurrent cell will take the word “It’s” as its input at
the first time step. Now it stores information about the word “It’s” in its
memory and updates its state.
• Next, it takes the word “a” as its second input at the second time step. Now it
incorporates information about the word “a” into its memory and updates its
state once again. It repeats the process until the last word.

• Therefore, the cell state at the 1st time step depends only on the 1st input, the cell state at the 2nd time step depends on the 1st and 2nd inputs, the cell state at the third time step depends on the 1st, 2nd and 3rd inputs, and so on.
• In this way the cell continuously updates its memory as time passes (similar to
a human brain).
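A tiny sketch of this token-by-token processing; the state update below is only a placeholder for the learned update a real recurrent cell performs:

sentence = ["It’s", "a", "sunny", "day."]

state = []                            # the cell's memory starts empty
for t, word in enumerate(sentence, start=1):
    # One time step: the cell reads one word and updates its state.
    state = state + [word]            # placeholder for the real (learned) update
    print(f"time step {t}: read {word!r}, state now summarizes {state}")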
Sequence to Sequence Architecture
• In deep learning, many complex problems can be solved by constructing better neural network architectures. The RNN (Recurrent Neural Network) and its variants are very useful in sequence-to-sequence learning.
• The RNN variant LSTM (Long Short-Term Memory) is the most widely used cell in seq2seq learning tasks.
• The encoder-decoder architecture for recurrent neural networks is
the standard neural machine translation method that rivals and in
some cases outperforms classical statistical machine translation
methods.
• Although this architecture is relatively new, having been pioneered in 2014, it has been adopted as the core technology inside Google’s Translate service.
Deep Sequence Model
• Deep sequence models are used when your model has to remember information across different time steps, understand the relationships between them, and then predict the next best move.
• A few applications of deep sequence models:
• Stock Prediction
• Medical diagnostics
• Climate change
• Autonomous driving etc.
 Types of Deep Sequence Models

• Many to One:
A model where your network is fed multiple inputs and is expected to output just one value, e.g. sentiment classification (you feed your model a continuous string of words and expect it to output one sentiment)
• One to Many:
A model where your network is fed only one input and is expected to output
values of variable lengths e.g. image captioning (you give your model one image
and your model will output a description of what the image is about)
• Many to Many:
A model where your network is fed multiple inputs and is expected to output
values of variable length e.g. machine translation (you give your model an
English sentence and your model will translate it to French)

• Designing Sequence Models:
• To design a sequence model, we need to:
• Handle variable-length sequences
• Track long-term dependencies
• Maintain information about the order
• Share parameters across the sequence
• Recurrent Neural Networks meet these sequence modeling design criteria
RNN – Recurrent Neural Networks

 Recurrent Neural Networks:
As daunting as the diagram might look, it is the easiest diagram and all you need to know to understand how RNNs work. Let me break it down.

• Step 1: Basic Neural Network:
• You can consider the inputs (x1, x2, x3) as (location, sq. ft and house type) and the outputs (y1, y2, y3, y4) as (price, safety status, appraisal value, # of people it can accommodate).
• Step 2: Simplifying basic NN:
• For simplicity purposes, let's combine all the inputs (x1,x2,x3..) into one
variable x(t), we can call it input at a timestamp t.
• Let's remove the complexity of drawing the hidden layers by substituting
them with one green box and let's combine all the outputs to one variable
y(t), we can call it output at the timestamp t. All we are doing here is just
combining inputs and outputs so that it looks simple.

• Step 3: Turn it upside down:
• Again, we still have our simplified NN(neural network) from Step 2 and here
we are only turning it upside down.
• Step 4: Multiple NNs:
• This is the fun part. Consider multiple NNs from Step 3 stacked side-by-side
and joined by one small string — that’s it, you have designed an RNN. As the
name goes, there is a recurrent flow of NNs that are joined by a string, what
we call a “state” (h).
• In the diagram shown above, consider y0 and y1 as intermediate output at a
particular timestamp t and y2 as the final output. RNNs can also be depicted
by the figure to the left of the vertical line in Step 4.

• Intuition: Consider a relay race where there is usually a team of 4–5 people, each standing at a different checkpoint. The first person to start the race carries a baton which they have to pass to the next team member when they meet at the checkpoint, ultimately completing the race with the baton.
• In the case of RNN, you can consider
• The participants as the neural networks(x)
• Baton that is being passed as the state (h)
• Stopping of one player and starting of another player after baton is passed as
intermediate output(y0,y1)
• Completing the race as predicting output (y2)
 Mathematics behind RNN
This is the same figure as Fig 2.1 but just with more notations:

• Input vector x(t)
• It's a normal vector that consists of your inputs — nothing fancy here
• Weights
• W(xh) -> Weights that are used to transform input in a way that is consumable by the hidden
state
• W(hh) -> Weights that define the relationship between the previous hidden state and the
current hidden state
• W(hy) -> Weights that are used to transform the hidden state into the predicted output.
Hidden State h(t)
• You remember the example of the baton being passed from player to player in a relay race —
that is what the hidden state does. As you move from one NN to another NN, the hidden
state captures all the information from the current NN and passes it along so that the next
NN is aware of what the context is when predicting something.

• Equation #2 in the diagram says that the hidden state (h(t)) is a
parameterized function of the input to that NN (x(t)) and previous
hidden state (h(t-1)) — here the input helps in understanding what to
predict and the previous hidden state gives your NN some context
regarding whatever is happening in your model. Parameterized just
means that they are defined using the weight matrices (explained
above).
• The function can be any NON-LINEAR function. In the above diagram, the non-linear function used is tanh, also called the hyperbolic tangent activation function.
Note: The same function and set of parameters are used at every
time step. RNNs have a state h(t), that is updated at each time step as
a sequence is processed.
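Written out with the weight matrices defined above, the two equations the slides refer to take the conventional form (the explicit formulas are inferred from the description, with tanh as the non-linearity):

h(t) = tanh( W(hh) · h(t-1) + W(xh) · x(t) )    (equation #2: hidden state update)
y(t) = W(hy) · h(t)                             (equation #1: output)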

• Output Vector y(t)
The output vector is computed from the current hidden state (h(t)) and W(hy), which transforms the hidden state into the predicted output values, as shown in equation #1.
• Loss function (L, L0, L1, L2 etc..)
As with any model, you need to minimize the loss in order for your
model to perform better which basically means learning from your
mistakes. The L0, L1, L2, etc. that you see in the diagram are the
losses calculated at each stage and summed up to calculate L and
then our task is to minimize that final loss function using
backpropagation.
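As a concrete illustration of these two equations, here is a minimal NumPy sketch of one forward pass through an unrolled RNN. The dimensions and random weights are assumptions for illustration only; a trained network would have learned these matrices.

import numpy as np

np.random.seed(0)
input_dim, hidden_dim, output_dim = 3, 4, 2             # assumed sizes

# Weight matrices, shared across all time steps
W_xh = np.random.randn(hidden_dim, input_dim) * 0.1     # input -> hidden
W_hh = np.random.randn(hidden_dim, hidden_dim) * 0.1    # hidden -> hidden
W_hy = np.random.randn(output_dim, hidden_dim) * 0.1    # hidden -> output

def rnn_forward(inputs):
    """inputs: list of input vectors x(t), one per time step."""
    h = np.zeros(hidden_dim)                  # initial hidden state h(0)
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)    # equation #2: update the hidden state
        outputs.append(W_hy @ h)              # equation #1: map hidden state to output
    return outputs, h

sequence = [np.random.randn(input_dim) for _ in range(5)]   # a 5-step toy sequence
ys, final_state = rnn_forward(sequence)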
 Why encode the input?
• We all know that machines do not understand language; they understand numbers. For example, even if I feed my model the string “This morning I took my cat for a” and the model outputs “walk”, beneath that lovely sentence is a mixture of numbers.
• That’s exactly what encoding is. Encoding is a process of converting a string
to a sequence of numbers understandable by the machine and decoding is
a process of converting the numbers to a human-readable string.

• How do we do that? It is easy —
• Step 1: We first take a set of unique words
• Step 2: Allocate each word a unique number
• Step 3: Represent our input in vector form using Steps 1 and 2 (see figure below) → This step is basically called embedding.
• The diagram above shows two types of embedding: One-hot encoding and
Learned embedding (just different ways to represent your inputs in a
vectorized form)
 Notes
• Feature encoding is the process of turning categorical data into a
numerical format that models can use.
• While there are several ways to do this, two of the most popular
methods are
• One Hot Encoding
• It works by converting each category in your dataset into a unique binary
code — a series of 0s and 1s.
• Here’s the magic: each category gets its very own slot in the vector, and we
mark that slot with a 1 to indicate the presence of that category.

• Say you have a categorical variable like color, with the possible values:
["Red", "Green", "Blue"].
• One Hot Encoding will create a separate binary column for each color.
For example:
• Red: [1, 0, 0]
• Green: [0, 1, 0]
• Blue: [0, 0, 1]
• Here’s what happens behind the scenes: each category gets its own
“column” in a new binary matrix.
• When a specific category is present, that column is marked with a 1,
while the others are left as 0. It’s simple, effective, and universally
understood by most machine learning algorithms.
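A minimal sketch of this one-hot scheme done by hand in Python, using the color example above (in practice you would typically use a library such as pandas or scikit-learn):

categories = ["Red", "Green", "Blue"]                # the unique categories
index = {c: i for i, c in enumerate(categories)}     # each category gets its own slot

def one_hot(value):
    vec = [0] * len(categories)
    vec[index[value]] = 1                            # mark the category's slot with a 1
    return vec

print(one_hot("Red"))    # [1, 0, 0]
print(one_hot("Green"))  # [0, 1, 0]
print(one_hot("Blue"))   # [0, 0, 1]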

• Embeddings are a way to represent categorical data in a dense,
lower-dimensional space.
• Unlike One Hot Encoding, which creates sparse binary vectors,
embeddings compress information into smaller, more compact
vectors that can capture relationships between categories.
• Let’s step out of NLP for a second and use an eCommerce example. Imagine
you’re building a recommendation system for an online store. You have
product categories like ["Electronics", "Clothing", "Furniture"].
• With embeddings, your model might learn that Electronics and Furniture are
more similar to each other than either one is to Clothing.
• This allows your model to make better recommendations — say, someone
browsing for TVs might also be shown home theater systems or speakers,
rather than completely unrelated products.

• In this case, the embeddings would look something like:
• Electronics: [0.9, 0.1, 0.3]
• Furniture: [0.8, 0.2, 0.4]
• Clothing: [0.1, 0.9, 0.2]
• Notice how Electronics and Furniture have more similar vectors? This
kind of relationship would be impossible to capture with One Hot
Encoding.
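A small sketch of why these dense vectors capture similarity, reusing the illustrative numbers above (the values are examples for exposition, not learned embeddings):

import numpy as np

embeddings = {
    "Electronics": np.array([0.9, 0.1, 0.3]),
    "Furniture":   np.array([0.8, 0.2, 0.4]),
    "Clothing":    np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["Electronics"], embeddings["Furniture"]))  # ~0.98, very similar
print(cosine_similarity(embeddings["Electronics"], embeddings["Clothing"]))   # ~0.27, dissimilar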
 Backpropagation again?
• Example:
• After an interview, you sit back and reflect on how things might have gone differently had you answered differently — understanding where you fell short and learning from it. Just apply the same concept here — backpropagation is the process by which a model learns from its mistakes (by updating weights) and tries to minimize the loss.
• The only difference, in this case, is that here the backpropagation is THROUGH TIME, meaning the loss is backpropagated within each time step and also across time steps, because it is a SEQUENCE model.

Problems with RNN?
• Holistically how a backpropagation algorithm works is by calculating
the gradient (i.e. derivative of final loss function w.r.t. each
parameter) and then shifting the parameters in order to minimize loss

• Here is a simplified version of the figure on the previous page.
• Imagine calculating the gradient (a derivative of loss function)
• w.r.t h0 -> this will involve many factors of W(hh) and repeated gradient
computation at each neural network

• Intuition:
• Remember playing a party game in which one person whispers a message to
the person next to them and the story is then passed progressively to several
others, with inaccuracies accumulating as the game goes on.
• The point of the game is the amusement obtained from the last player’s
announcement of the story they heard, which typically is nothing like the
original.
• That is exactly what happens with RNNs — calculating gradients while
propagating backward becomes difficult as there is a chance of losing
information as we go backward.
 Challenges faced by RNNs:
• Exploding gradients: When many gradient values are >1
• Occurs when large error gradients accumulate and result in very large updates
to neural network model weights during training.
• Gradients are used during training to update the network weights and it
works best when these updates are small and controlled. When the
magnitudes of the gradients accumulate, an unstable network is likely to
occur, which can cause poor prediction results or even a model that reports
nothing useful whatsoever.
• There are methods to fix exploding gradients, which include gradient
clipping and weight regularization, among others.
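As a brief sketch of one of these fixes, gradient clipping: Keras optimizers accept a clipnorm (or clipvalue) argument; the threshold of 1.0 below is an arbitrary illustrative choice.

import tensorflow as tf

# Clip the norm of each gradient to at most 1.0 before every weight update.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# The clipped optimizer is then passed to model.compile(...) as usual, e.g.:
# model.compile(optimizer=optimizer, loss="categorical_crossentropy")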

• Vanishing gradients: When many gradient values are <1
• Since the gradients control how much the network learns during training, if
the gradients are very small or zero, then little to no training can take place,
leading to poor predictive performance. This also leads to capturing short-
term dependencies instead of long-term dependencies.
• Potential Solutions:
• Activation Functions: Using ReLU prevents gradients from shrinking when x>0
• Parameter Initialization: Initialize weights to identity matrix and biases to
zero — prevents weights from shrinking to zero
• Gated Cells: In the green box that we have encountered in the previous
diagrams, use some logic inside them (i.e. gated cells) which will control what
information is passed through. Based on the logic used in the gated cells we
classify them as LSTMs, GRUs etc.
Variants of Recurrent Neural Networks (RNNs)
• There are several variations of RNNs, each designed to address
specific challenges or optimize for certain tasks:
• Vanilla RNN
• This simplest form of RNN consists of a single hidden layer, where weights are
shared across time steps. Vanilla RNNs are suitable for learning short-term
dependencies but are limited by the vanishing gradient problem, which
hampers long-sequence learning.
• Bidirectional RNNs
• Bidirectional RNNs process inputs in both forward and backward directions,
capturing both past and future context for each time step. This architecture is
ideal for tasks where the entire sequence is available, such as named entity
recognition and question answering.

• Long Short-Term Memory Networks (LSTMs)
• Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism
to overcome the vanishing gradient problem. Each LSTM cell has three gates:
• Input Gate: Controls how much new information should be added to the cell state.
• Forget Gate: Decides what past information should be discarded.
• Output Gate: Regulates what information should be output at the current step. This
selective memory enables LSTMs to handle long-term dependencies, making them ideal
for tasks where earlier context is critical.
• Gated Recurrent Units (GRUs)
• Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and
forget gates into a single update gate and streamlining the output
mechanism. This design is computationally efficient, often performing
similarly to LSTMs, and is useful in tasks where simplicity and faster training
are beneficial.
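A minimal Keras sketch contrasting these variants on a toy sequence-classification setup (the input shape, layer sizes and sigmoid head are assumptions for illustration):

import tensorflow as tf
from tensorflow.keras import layers, models

timesteps, features = 20, 8        # assumed input: 20 time steps, 8 features each

def build(recurrent_layer):
    return models.Sequential([
        layers.Input(shape=(timesteps, features)),
        recurrent_layer,
        layers.Dense(1, activation="sigmoid"),                 # e.g. binary sentiment
    ])

vanilla_rnn   = build(layers.SimpleRNN(32))                    # vanilla RNN
lstm_model    = build(layers.LSTM(32))                         # input/forget/output gates
gru_model     = build(layers.GRU(32))                          # single update gate
bidirectional = build(layers.Bidirectional(layers.LSTM(32)))   # reads the sequence both ways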
Implementing a Text Generator Using
Recurrent Neural Networks (RNNs)
• Step 1: Import Necessary Libraries
• We start by importing essential libraries for data handling and building the
neural network.
• Step 2: Define the Input Text and Prepare Character Set
• We define the input text and identify unique characters in the text, which
we’ll encode for our model.

• Step 3: Create Sequences and Labels
• To train the RNN, we need sequences of fixed length (seq_length) and the
character following each sequence as the label

• Step 4: Convert Sequences and Labels to One-Hot Encoding
• For training, we convert X and y into one-hot encoded tensors
• Step 5: Build the RNN Model
• We create a simple RNN model with a hidden layer of 50 units and a Dense
output layer with softmax activation.
 Output

• Step 6: Compile and Train the Model
• We compile the model using the categorical_crossentropy loss and train it for
100 epochs.
• Step 7: Generate New Text Using the Trained Model
• After training, we use a starting sequence to generate new text character-by-
character.
https://www.geeksforgeeks.org/introduction-to-recurrent-neural-network/
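Pulling the seven steps together, here is a hedged end-to-end sketch of such a character-level generator (the sample text, seq_length of 10 and generation length are placeholder choices; the worked example these slides follow is at the GeeksforGeeks link above):

# Step 1: import libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Step 2: define the input text and prepare the character set
text = "This is a small sample text used to demonstrate a character-level RNN."
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Step 3: create fixed-length sequences and the character following each one as the label
seq_length = 10
sequences, labels = [], []
for i in range(len(text) - seq_length):
    sequences.append([char_to_idx[c] for c in text[i:i + seq_length]])
    labels.append(char_to_idx[text[i + seq_length]])

# Step 4: convert sequences (X) and labels (y) to one-hot encoded tensors
X = tf.keras.utils.to_categorical(sequences, num_classes=len(chars))
y = tf.keras.utils.to_categorical(labels, num_classes=len(chars))

# Step 5: a simple RNN with a hidden layer of 50 units and a Dense softmax output
model = models.Sequential([
    layers.Input(shape=(seq_length, len(chars))),
    layers.SimpleRNN(50),
    layers.Dense(len(chars), activation="softmax"),
])

# Step 6: compile with categorical_crossentropy and train for 100 epochs
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(X, y, epochs=100, verbose=0)

# Step 7: generate new text character by character from a starting sequence
generated = text[:seq_length]
for _ in range(50):
    x = tf.keras.utils.to_categorical(
        [[char_to_idx[c] for c in generated[-seq_length:]]], num_classes=len(chars))
    next_idx = int(np.argmax(model.predict(x, verbose=0)))
    generated += idx_to_char[next_idx]
print(generated)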
 Encoder-Decoder Model
• Encoder-decoder models are used to handle sequential data,
specifically mapping input sequences to output sequences of
different lengths, such as neural machine translation, text
summarization, image captioning and speech recognition. In such
tasks, mapping a token in the input to one in the output is often
indirect.

• An encoder-decoder is a type of neural network architecture that is
used for sequence-to-sequence learning. It consists of two parts, the
encoder and the decoder. The encoder processes an input sequence
to produce a set of context vectors, which are then used by the
decoder to generate an output sequence.
• This architecture enables tasks such as machine translation, text summarization, and image captioning, among others. The idea behind
it is to be able to take in one form of data (such as text) and convert it
to another (such as images). By doing this, machines can learn how to
understand complex relationships between different types of data
and use them for more efficient processing.

• The encoder is the first part of an encoder-decoder architecture. It
takes in an input sequence and processes it to create a set of context
vectors, which are then used by the decoder. The way in which the
encoding process works depends on the type of application being
used.
• For example, for text applications such as machine translation or
summarization, the words in each sentence will be converted into
numerical values that represent them mathematically.
• Then, these numbers are fed through a series of layers that reduce
their dimensionality while preserving relevant information about how
they relate to one another within the sentence structure. This
“encoded” version of each sentence is then passed along to the
decoder for further processing.

• The decoder is responsible for taking this encoded representation and
reconstructing it back into its original form (or something similar). In order
to do this, there must be some kind of relationship between what was
encoded and what needs to be reconstructed; otherwise it would just be
guessing randomly.
• To establish this link, most modern architectures use attention mechanisms
that allow specific parts of an input sequence (such as individual words) to
influence how later parts are processed or interpreted by the model —
essentially giving greater weightage or importance to certain elements
over others when generating output sequences from encoding data inputs.
• By doing so, models become much more accurate at producing outputs
that accurately reflect their input data sources and can even learn different
patterns across various datasets without needing additional training cycles
or parameter tuning procedures afterwards.
How the Sequence to Sequence Model works?
• To fully understand the model’s underlying logic, we will go over the
below illustration:

• Encoder
• Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially.
• For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep X[t].
• After all the inputs are read by the encoder model, the final hidden state of the model represents the context/summary of the whole input sequence.
• Example:
• Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4 tokens) for the Encoder model. At each time step, the hidden state h will be updated using the previous hidden state and the current input.


• At the first timestep t1, the previous hidden state h0 will be considered as zero or randomly chosen. So the first RNN cell will update the current hidden state with the first input and h0. Each layer outputs two things — the updated hidden state and the output for that stage. The outputs at each stage are discarded and only the hidden states are propagated to the next layer.
• The hidden states h_i are computed using the formula h_t = f( W(hh) · h_(t-1) + W(xh) · x_t ), where f is the recurrent activation (e.g. tanh).

• At the second timestep t2, the hidden state h1 and the second input X[2] will be given as input, and the hidden state h2 will be updated according to both. This happens for all four stages with respect to the example taken.
• A stack of several recurrent units (LSTM or GRU cells for better
performance) where each accepts a single element of the input
sequence, collects information for that element, and propagates it
forward.
• In the question-answering problem, the input sequence is a collection
of all words from the question. Each word is represented as x_i where
i is the order of that word.

• This simple formula represents the result of an ordinary recurrent
neural network. As you can see, we just apply the appropriate
weights to the previous hidden state h_(t-1) and the input vector
x_t.
• Encoder Vector
• This is the final hidden state produced from the encoder part of the model. It
is calculated using the formula above.
• This vector aims to encapsulate the information for all input elements in
order to help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.

• Decoder
• The Decoder generates the output sequence by predicting the next output Yt
given the hidden state ht.
• The input for the decoder is the final hidden vector obtained at the end of
encoder model.
• Each step of the decoder has three inputs: the hidden vector from the previous step h(t-1), the previous output y(t-1), and the original encoder hidden vector h.
• At the first step, the output vector of the encoder, the special START symbol, and an empty hidden state h(t-1) are given as input; the outputs obtained are y1 and the updated hidden state h1 (the information of the output is, in effect, subtracted from the hidden vector).

• The second step takes the updated hidden state h1, the previous output y1 and the original hidden vector h as its current inputs, and produces the hidden vector h2 and output y2.
• The outputs produced at each timestep of the decoder form the actual output sequence. The model will keep predicting outputs until the END symbol occurs.
• A stack of several recurrent units where each predicts an output y_t at a
time step t.
• Each recurrent unit accepts a hidden state from the previous unit and
produces an output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of
all words from the answer. Each word is represented as y_i where i is the
order of that word.

• Any hidden state h_i is computed using the formula h_t = f( W(hh) · h_(t-1) ).
• As you can see, we are just using the previous hidden state to compute the next one.

• Output Layer
• We use Softmax activation function at the output layer.
• It is used to produce the probability distribution from a vector of values with
the target class of high probability.
• The output y_t at time step t is computed using the formula y_t = softmax( W(S) · h_t ).
• We calculate the outputs using the hidden state at the current time step together with the respective weight W(S). Softmax is used to create a probability vector that will help us determine the final output (e.g. a word in the question-answering problem).

• The power of this model lies in the fact that it can map sequences of
different lengths to each other. As you can see the inputs and outputs
are not correlated and their lengths can differ. This opens a whole
new range of problems that can now be solved using such
architecture.
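A minimal Keras sketch of this encoder-decoder idea, set up for training with teacher forcing (the vocabulary sizes, dimensions and the LSTM choice are assumptions for illustration; inference would additionally need a step-by-step decoding loop):

import tensorflow as tf
from tensorflow.keras import layers, models

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256    # assumed sizes

# Encoder: read the source sequence and keep only its final states (the "encoder vector").
encoder_inputs = layers.Input(shape=(None,), name="source_tokens")
enc_emb = layers.Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: start from the encoder states and predict the target sequence token by token.
decoder_inputs = layers.Input(shape=(None,), name="target_tokens")
dec_emb = layers.Embedding(tgt_vocab, latent_dim)(decoder_inputs)
dec_out, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=encoder_states)
outputs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = models.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")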
 Applications
• It possesses many applications such as
• Google’s Machine Translation
• Question answering chatbots
• Speech recognition
• Time Series Application etc.,

• Use Cases of the Sequence to Sequence Model
• A sequence to sequence model lies behind numerous systems which you face
on a daily basis. For instance, seq2seq model powers applications like Google
Translate, voice-enabled devices and online chatbots. Generally speaking,
these applications are composed of:
• Machine translation — a 2016 paper from Google shows how the seq2seq
model’s translation quality “approaches or surpasses all currently published
results”.
Screenshot of Google translation.
Applications of Encoder-Decoder
• Applications of encoder-decoders have also been explored in the field
of image captioning. Using an encoder-decoder architecture, the
model can take an input image and generate a caption that accurately
describes the contents of the image.
• This is achieved by first encoding each pixel within the image to
produce a set of context vectors which are then used by a decoder to
create output sequences (e. g., words). By utilizing attention
mechanisms between these two parts, models become much more
accurate at describing images based on their content rather than just
randomly generating sentences from scratch.

• Other applications for encoder-decoders include using them for tasks
such as machine translation or summarization, where they can be
used to translate text from one language into another while
preserving its meaning or summarize long documents into shorter
versions without losing important information.
• Additionally, researchers have also begun exploring how this type of
architecture could potentially be utilized in medical diagnosis and
natural language processing applications as well; although further
research is needed before its potential in these areas can be fully
realized.
Advantages of Encoder-Decoder
• One of the major advantages of using an encoder-decoder
architecture is its enhanced performance. This type of model can
learn complex relationships between different types of data and use
them to process information much faster than traditional methods. 
• Additionally, since it does not rely on manual feature engineering, it is
able to quickly adapt to changing input data without needing
additional training cycles or parameter tuning procedures afterwards.
As a result, this makes for faster training times compared to other
architectures and allows models to achieve better results with fewer
resources being utilized in the process.

• Another advantage that comes with using an encoder-decoder
architecture is its ability to generalize well across various datasets and
tasks. By utilizing attention mechanisms between its two parts, the
model can accurately pick up patterns from different datasets without
requiring extensive retraining after each new one has been
introduced; thus making it extremely useful for applications such as
machine translation where multiple languages need to be supported
by a single system.
• Additionally, this means that any changes made during development are easier and more efficient, as they only require minor adjustments rather than complete redesigns, due to how quickly the model is able to learn new concepts from existing data sources.

• Finally, encoder-decoders also have significant potential when applied
in medical diagnosis applications or natural language processing tasks
such as text summarization or image captioning; although further
research needs to be conducted before their full capabilities in these
areas can be realized.
• In conclusion, by combining enhanced performance with fast training times and strong generalization abilities across various datasets/tasks — it’s easy to see why encoder-decoders have become so popular amongst researchers over recent years.
Limitations of Encoder-Decoder
• One of the main limitations with encoder-decoder architectures is
their ability to handle natural language processing (NLP) tasks. This
type of model relies heavily on data pre-processing in order to
accurately understand and interpret text, which can be a difficult and
time-consuming process.
• For instance, when dealing with large datasets containing complex
sentences or phrases that contain multiple levels of grammar or
syntax, the encoding process alone can be quite challenging; making
it difficult for models to accurately capture all relevant information
within a given input sequence.

• Additionally, since attention mechanisms are used between the two
parts of an encoder-decoder architecture — it’s important that these
are configured correctly so as not to give too much weightage to
certain elements over others; otherwise this could lead to output
sequences being generated that don’t accurately reflect their inputs.
• Another limitation associated with using an encoder-decoder
architecture is data pre-processing issues. Since most modern
applications require input data sets in numerical formats rather than
raw text or images — this means additional steps must be taken
before any training can begin.

• For example, if you wanted your model to learn how to translate
English into Spanish — you would first have to convert each sentence
into numerical values representing words from both languages;
otherwise the system would just be guessing randomly without any
context as it tries to make sense out of the input sequence provided
by its user/programmer.
• As such, taking extra care during this step is essential for ensuring accurate results downstream once training begins — something which makes using encoder-decoders slightly more complicated compared to other AI systems available today.
Attention Mechanism in Deep Learning
• The brain instinctively focuses on specific parts of an image or particular words in a sentence that are most relevant to your task.
• This selective focus is what we refer to as attention, and it’s
a fundamental aspect of human cognition.
• Attention mechanisms in deep learning aim to mimic this selective
focus process in artificial neural networks.
Attention Mechanism in Deep Learning
• What exactly is the attention
mechanism?
• which Georgetown player, the guys
in white, is wearing the captaincy
band?
• When you were trying to figure out
answers to the questions above, did
your mind do this weird thing where
it focused on only part of the image?


• What happened?
• You were ‘focusing’ on a smaller part of the whole thing because you knew
the rest of the image/sentence was not useful to you then.
• So when you were trying to figure out the color of the soccer ball, your mind
was showing you the soccer ball in HD but the rest of the image was almost
blurred.
• Similarly, when reading the question, once you understood that the guys in
white were Georgetown players, you could blur out that part of the sentence
to simplify its meaning.

• In an attempt to borrow inspiration from how a human mind works,
researchers in Deep Learning have tried replicating this behavior using what is
known as the ‘attention mechanism’.
• Very simply put, the attention mechanism is just a way of focusing on only a
smaller part of the complete input while ignoring the rest.

• Attention can be simply represented as a 3 step mechanism.
• Since we are talking about attention in general, I will not go into
details of how this adapts to CV or NLP, which is very straightforward.
• Create a probability distribution that rates the importance of various input
elements. These input representations can be words, pixels, vectors etc.
Creating these probability distributions is actually a learnable task.
• Scale the original input using this probability distribution such that values that deserve more attention get enhanced while others get diluted. Kinda like blurring everything else that doesn’t need attention.
• Now use these newly scaled inputs and do further processing to get focused
outputs/results.
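A hedged NumPy sketch of these three steps on a toy set of input vectors; here the importance scores come from a dot product with a fixed "query" vector, whereas in a real model they would be produced by learned parameters:

import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Toy input: 4 elements (e.g. word or pixel vectors), each of dimension 3.
inputs = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.0])          # what we are currently "looking for"

# Step 1: a probability distribution rating the importance of each input element.
weights = softmax(inputs @ query)

# Step 2: scale the inputs, so relevant elements are enhanced and the rest diluted (blurred).
scaled = inputs * weights[:, None]

# Step 3: further processing on the focused representation (here, a simple sum).
context = scaled.sum(axis=0)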
 Why are Attention Mechanisms
Important?
• Attention mechanisms have become indispensable in various deep-
learning applications due to their ability to address some critical
challenges:
• Long Sequences:
• Traditional neural networks struggle with processing long sequences, such as translating
a paragraph from one language to another. Attention mechanisms allow models to focus
on the relevant parts of the input, making them more effective at handling lengthy data.
• Contextual Understanding:
• In tasks like language translation, understanding the context of a word is crucial for
accurate translation. Attention mechanisms enable models to consider the context by
assigning different attention weights to each word in the input sequence.
• Improved Performance:
• Models equipped with attention mechanisms often outperform their non-attention
counterparts. They achieve state-of-the-art results in tasks like machine translation,
image classification, and speech recognition.

• Self-Attention Mechanism
• Self-attention, also known as intra-attention, is commonly used in tasks
involving sequences, such as natural language processing. It allows the model
to weigh the importance of each element in the sequence concerning all the
other elements. The Transformer model, for instance, relies heavily on self-
attention.
• Scaled Dot-Product Attention
• Scaled Dot-Product Attention is a key component of the Transformer
architecture. It calculates attention scores by taking the dot product of a
query vector and the keys, followed by scaling and applying a softmax
function. This type of attention mechanism is highly efficient and has
contributed to the success of Transformers in various applications.

• Multi-Head Attention
• Multi-Head Attention extends the idea of attention by allowing the model to
focus on different parts of the input simultaneously. It achieves this by using
multiple sets of learnable parameters, each generating different attention
scores. This technique enhances the model’s ability to capture complex
relationships within the data.
• Location-Based Attention
• Location-based attention is often used in image-related tasks. It assigns
attention scores based on the spatial location of elements in the input. This
can be particularly useful for tasks like object detection and image captioning.

• Implementing Attention Mechanisms
• Now that we understand the importance of attention mechanisms, let’s
explore how to implement them in your deep-learning models. For this, we’ll
use Python and the popular deep learning library, TensorFlow.
In this example, we’ve added a simple
self-attention layer to your model.
Depending on your specific task, you
can experiment with different types of
attention mechanisms and
architectures.
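The code shown on the original slide is not reproduced here; below is a hedged sketch of what such a model might look like, using the built-in Keras MultiHeadAttention layer for self-attention (the vocabulary size, sequence length and layer sizes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, seq_len, embed_dim = 10000, 50, 64      # assumed sizes

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# Self-attention: queries, keys and values all come from the same sequence x.
attn = layers.MultiHeadAttention(num_heads=2, key_dim=embed_dim)(x, x)

x = layers.GlobalAveragePooling1D()(attn)
outputs = layers.Dense(1, activation="sigmoid")(x)  # e.g. a binary classification head

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])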
 Attention Mechanisms in Real-world
Applications
 Machine Translation
• Machine translation is an area where attention mechanisms have
revolutionized the game. Traditionally, translation models struggled
with handling long sentences or paragraphs.
• With attention mechanisms, these models can now focus on specific
words or phrases in the source language while generating the target
language, greatly improving translation accuracy.
• Google’s Transformer model, for instance, utilizes attention
mechanisms to provide more fluent and contextually accurate
translations.
 Sample of the Python code for
machine translation



• Image Captioning
• When it comes to describing the content of an image in natural language,
attention mechanisms are invaluable. Models equipped with these
mechanisms can focus on different regions of the image, generating captions
that not only describe the image accurately but also provide context (just
like GPT 4.0 can analyze an image).
• This technology is particularly useful in applications like autonomous vehicles,
where the vehicle needs to understand its surroundings and communicate
effectively.

• Speech Recognition
• In speech recognition, understanding context is essential for accurate
transcription. Attention mechanisms have played a crucial role in improving
speech recognition systems. By focusing on specific parts of the audio input,
these systems can transcribe spoken words more accurately, even in noisy
environments.
• Question Answering
• Question-answering systems, like those used in chatbots or virtual assistants,
benefit from attention mechanisms as well. These mechanisms help the
model focus on relevant parts of the input text while generating responses,
leading to more contextually accurate and coherent answers.
 The Evolution of Attention Mechanisms
• As with any technology, attention mechanisms have evolved.
Researchers continue to explore new variants and improvements to
make these mechanisms even more effective. Some recent
developments include:
• Sparse Attention:
• This approach aims to make attention more efficient by allowing models to
focus on only a subset of the input data, rather than all elements. This can
significantly reduce computational requirements while maintaining
performance.

• Memory Augmented Networks:
• These models combine attention mechanisms with external memory,
allowing them to store and retrieve information efficiently. This is
particularly useful in tasks that involve reasoning and long-term
dependencies.
• Cross-modal Attention:
• In scenarios where data comes from multiple modalities, such as text
and images, cross-modal attention mechanisms enable models to
learn relationships between different types of data. This is valuable in
applications like image captioning.
 How Attention Mechanism Works?
• Here’s how they work:
• Breaking Down the Input: Let’s say you have a bunch of words (or
any kind of data) that you want the computer to understand. First, it
breaks down this input into smaller pieces, like individual words.
• Picking Out Important Bits: Then, it looks at these pieces and decides
which ones are the most important. It does this by comparing each
piece to a question or ‘query’ it has in mind.
• Assigning Importance: Each piece gets a score based on how well it
matches the question. The higher the score, the more important that
piece is.

• Focusing Attention: After scoring each piece, it figures out how much
attention to give to each one. Pieces with higher scores get more
attention, while less important ones get less attention.
• Putting It All Together: Finally, it adds up all the pieces, but gives
more weight to the important ones. This way, the computer gets a
clearer picture of what’s most important in the input.
How Attention Mechanism was Introduced in
Deep Learning?
• The attention mechanism emerged as an improvement over the
encoder decoder-based neural machine translation system in natural
language processing (NLP). Later, this mechanism, or its variants, was
used in other applications, including computer vision, speech
processing, etc.
• Before Bahdanau et al proposed the first Attention model in 2015,
neural machine translation was based on encoder-
decoder RNNs/LSTMs. Both encoder and decoder are stacks of
LSTM/RNN units. It works in the two following steps:

1. The encoder LSTM is used to process the entire input sentence and encode it into a context vector, which is the last hidden state of the LSTM/RNN. This is expected to be a good summary of the input sentence. All the intermediate states of the encoder are ignored, and the final state is supposed to be the initial hidden state of the decoder.
2. The decoder LSTM or RNN units produce the words in a sentence one after another.

• In short, there are two RNNs/LSTMs. One we call the encoder – this
reads the input sentence and tries to make sense of it, before
summarizing it. It passes the summary (context vector) to the
decoder which translates the input sentence by just seeing it.
• The main drawback of this approach is evident. If the encoder makes
a bad summary, the translation will also be bad. And indeed it has
been observed that the encoder creates a bad summary when it tries
to understand longer sentences. It is called the long-range
dependency problem of RNN/LSTMs.

• RNNs cannot remember longer sentences and sequences due to the
vanishing/exploding gradient problem. It can remember the parts
which it has just seen. Even Cho et al (2014), who proposed the
encoder-decoder network, demonstrated that the performance of the
encoder-decoder network degrades rapidly as the length of the input
sentence increases.
• Although an LSTM is supposed to capture the long-range dependency
better than the RNN, it tends to become forgetful in specific cases.
Another problem is that there is no way to give more importance to
some of the input words compared to others while translating the
sentence.

• Now, let’s say, we want to predict the next word in a sentence, and its
context is located a few words back. Here’s an example – “Despite
originally being from Uttar Pradesh, as he was brought up in Bengal,
he is more comfortable in Bengali”.
• In these groups of sentences, if we want to predict the
word “Bengali”, the phrase “brought up” and “Bengal”- these two
should be given more weight while predicting it. And
although Uttar Pradesh is another state’s name, it should be
“ignored”.

• So is there any way we can keep all the relevant information in the
input sentences intact while creating the context vector?
• So, whenever the proposed model generates a sentence, it searches
for a set of positions in the encoder hidden states where the most
relevant information is available. This idea is called ‘Attention’.
 Understanding the Attention
Mechanism
• This is the diagram of the Attention model shown in
Bahdanau’s paper.
• The Bidirectional LSTM used here generates a
sequence of annotations (h1, h2,….., hTx) for each
input sentence.
• All the vectors h1,h2.., etc., used in their work are
the concatenation of forward and backward hidden
states in the encoder.

• To put it in simple terms, all the vectors h1,h2,h3…., hTx are
representations of Tx number of words in the input sentence.
• In the simple encoder and decoder model, only the last state of the
encoder LSTM was used (hTx in this case) as the context vector.

• Now, the question is how should the weights be calculated? Well, the
weights are also learned by a feed-forward neural network and I’ve
mentioned their mathematical equation below.
• The context vector ci for the output word yi is generated using the weighted sum of the annotations: ci = Σj αij hj
• The weights αij are computed by a softmax function given by the following equation: αij = exp(eij) / Σk exp(eik)
• eij = a(si−1, hj) is the output score of a feedforward neural network described by the function a that attempts to capture the alignment between the input at j and the output at i.
• Basically, if the encoder produces Tx number of “annotations” (the hidden
state vectors) each having dimension d, then the input dimension of the
feedforward network is (Tx , 2d) (assuming the previous state of the
decoder also has d dimensions and these two vectors are
concatenated). This input is multiplied with a matrix Wa of (2d,
1) dimensions (of course followed by addition of the bias term) to get
scores eij (having a dimension (Tx , 1)).
• On top of these eij scores, a tanh (hyperbolic tangent) function is applied, followed by a softmax, to get the normalized alignment scores αij for output i.

Implementing a Simple Attention Model in
Python Using Keras
• We will discuss how a simple Attention model can be implemented in Keras. The purpose of this demo is to show how a simple Attention layer can be implemented in Python.
As an illustration, we have run this demo
on a simple sentence-level sentiment
analysis dataset collected from
the University of California Irvine Machine
Learning Repository. You can select any
other dataset if you prefer and can
implement a custom Attention layer to see
a more prominent result.

• Here, there are only two sentiment categories – ‘0’ means negative
sentiment, and ‘1’ means positive sentiment. You’ll notice that the
dataset has three files. Among them, two files have sentence-level
sentiments and the 3rd one has a paragraph level sentiment.
We are using the sentence level data
files (amazon_cells_labelled.txt,
yelp_labelled.txt) for simplicity.
We have read and merged the two
data files. This is what our data looks
like:
• Complete Code Available at
• https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-
attention-mechanism-deep-learning/
• Dataset Source:
• https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences
• Assignment: Implement it, run it, produce the result and submit
before next lecture.
Other tasks to Implement
• Autocorrect Feature using NLP in Python
• Complete Tutorial on NLP and its implementation using Deep
Learning
• https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-
understand-implement-natural-language-processing-codes-in-python/
• Good collection of Articles with Code
• https://paperswithcode.com/method/global-local-attention
 Self Attention
• The Transformer model has become a game-changer in natural
language processing (NLP). Its secret tool? A mechanism called self-
attention, or scaled dot-product attention. This innovative approach
allows the model to focus on relevant parts of the input sequence
when processing each word, unlike traditional models that treat all
words equally. In this article, we’ll break down how self-attention
works step-by-step, using a clear example to make the concepts
easier to grasp.
• In a Transformer model, each word is represented as a vector of
numbers, known as an embedding. These embeddings capture the
semantic meaning of the words. Let’s consider a simple example with
the following embeddings:
• Suppose our input sentence is “the cat sat on the mat”. The
corresponding embedded tokens would be:
 Embeddings: Representing Words
as Vectors
• This results in a matrix of embedded tokens (shown on the original slide).
 Self-Attention Mechanism
• The goal of the self-attention mechanism is to determine which
words in the input sequence are relevant to each word. This involves the following steps:
• Compute dot products between queries and keys.
• Scale the dot products.
• Apply softmax to obtain attention weights.
• Use the attention weights to compute a weighted sum of the values.
 Explanation
• Queries, Keys, and Values: In the simplest case, we use the same
embeddings for queries (Q), keys (K), and values (V):
• Matrix Multiplication (Dot Product): We compute the dot product of
the query matrix Q and the transpose of the key matrix K:
• Scaling the Dot Products: We scale the dot products by dividing by
the square root of the dimension of the key vectors (dk = 3):

• This results in the matrix of scaled attention logits (shown on the original slide).
• Applying Softmax: After computing the dot products and scaling
them, the next step in the attention mechanism is to apply the
softmax function to these scaled values to obtain the attention
weights. Let’s break down each term and the process in detail.

• What are Logits?
• In the context of neural networks, logits refer to the raw, unnormalized scores
output by a model. These scores are typically the result of a linear
transformation applied to the input features before applying an activation
function.
• In our case, the logits are the results of the dot products between the query
and key vectors. These raw scores indicate the similarity between the query
and each key, but they are not yet probabilities.

• Scaling the Logits
• Before applying the softmax function, we scale the logits. The reason for scaling is to prevent the softmax function from producing extremely small gradients, which can happen when the logits are too large. This scaling is done by dividing each logit by the square root of the dimension of the key vectors (denoted as dk), i.e. scaled logits = QKᵀ / √dk.
• This scaling helps stabilize the gradients during training.

• Applying the Softmax Function
• The softmax function is used to convert the logits into probabilities. It
takes a vector of raw scores (logits) and transforms them into a
probability distribution. The softmax function is defined as softmax(Zi) = exp(Zi) / Σj exp(Zj), where Zi is the i-th logit and the denominator is the sum of the exponentials of all logits.
• For our scaled attention logits, the softmax function normalizes these
scores, ensuring they sum to 1. This normalization helps us interpret
the values as probabilities, which we call attention weights.
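A hedged NumPy sketch of the whole pipeline described above. The 3-dimensional embeddings for "the cat sat on the mat" are made-up placeholders (the actual values from the original slides are not reproduced here), so the printed numbers will differ from those quoted in the text:

import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]
E = np.array([[0.1, 0.2, 0.3],      # placeholder embedding for "the"
              [0.4, 0.1, 0.5],      # "cat"
              [0.3, 0.6, 0.2],      # "sat"
              [0.5, 0.3, 0.1],      # "on"
              [0.1, 0.2, 0.3],      # "the"
              [0.6, 0.4, 0.7]])     # "mat"

Q = K = V = E                        # simplest case: queries, keys and values share the embeddings
d_k = K.shape[1]                     # dimension of the key vectors (3 here)

logits = Q @ K.T                     # dot products between queries and keys
scaled_logits = logits / np.sqrt(d_k)

# Softmax row by row turns the scaled logits into attention weights that sum to 1.
exp = np.exp(scaled_logits - scaled_logits.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)

output = attention_weights @ V       # weighted sum of the values
print(attention_weights[0])          # how much "the" attends to every token
print(output[0])                     # the new, context-aware representation of "the"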
Scaled attention logits - Further
Explained
• Let’s revisit the scaled attention logits from our example:
• Apply Softmax Function: We apply the softmax function to each row
of the scaled logits to get the attention weights. For the first row, this
would be:
• attention_weights[0]=softmax([0.081,0.185,0.289,0.392,0.081,0.496])
Computing this step-by-step
• Compute exponentials:

• This process is repeated for each row in the scaled attention logits to
get the full attention weight matrix.
• These weights show how much attention the first token “the” should
pay to each token in the sequence, including itself.
• The token “mat” has the highest weight, indicating it is the most
relevant for “the” in this context.

• Weighted Sum of Values: Finally, we compute the output by
multiplying the attention weights by the value matrix V:
• For the first token “the”, the output is:

• Interpreting Attention Weights
• The attention weights indicate how much focus each word should
give to every other word in the sequence. Higher weights mean
higher relevance. For example, in our case, the word “the” pays the
most attention to the word “mat” (with a weight of 0.223).
Here is the full program for
your reference
• https://medium.com/@saraswatp/understanding-scaled-dot-
product-attention-in-transformer-models-5fe02b0f150c
 Visualizing Attention Weights
• To understand the model’s attention mechanism better, we can
visualize the attention weights using a heatmap (see final part of
code). Here’s a simple example using Matplotlib:
• Here is the output:
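The plotting code itself is not reproduced on these slides; here is a hedged Matplotlib sketch of such a heatmap, reusing the attention_weights and tokens variables assumed in the earlier NumPy sketch:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(attention_weights, cmap="viridis")   # one row per query token

ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)                          # keys along the x-axis
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)                          # queries along the y-axis
ax.set_xlabel("Attended-to token (key)")
ax.set_ylabel("Attending token (query)")
ax.set_title("Self-attention weights")

fig.colorbar(im, ax=ax)
plt.show()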
 Final Words
• Overall, the scaled dot-product attention mechanism allows the
Transformer model to focus on the most relevant parts of the input
for each word.
• By examining the attention weights, we can understand which words
the model considers important, providing insights into its decision-
making process.
• This mechanism is a powerful tool for capturing long-range
dependencies and improving the model’s ability to process complex
sequences.

More Related Content

PDF
Recurrent Neural Networks, LSTM and GRU
PPT
deep learning UNIT-1 Introduction Part-1.ppt
PPTX
Deep learning (2)
PDF
Recurrent neural networks rnn
PPT
PPT
Advanced Machine Learning
PPT
lec10new.ppt
PPT
rnn BASICS
Recurrent Neural Networks, LSTM and GRU
deep learning UNIT-1 Introduction Part-1.ppt
Deep learning (2)
Recurrent neural networks rnn
Advanced Machine Learning
lec10new.ppt
rnn BASICS

Similar to Sequencing and Attention Models - 2nd Version (20)

PPTX
Unit 2 ml.pptx
PPTX
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
PPT
lec10newwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
PPT
Artificial neutral network cousre of AI.ppt
PDF
Deep learning
PPTX
Neural machine translation by jointly learning to align and translate.pptx
PPT
Neural Network in Depth presentation ,which provides in depth knowledge on n...
PPTX
Introduction_to_Deep_learning_Standford_university by Angelica Sun
PPTX
Reason To Switch to DNNDNNs excel in handling huge volumes of data (e.g., ima...
PPTX
Visualization of Deep Learning
PPTX
Introduction to deep learning
PPTX
Introduction to Deep learning and H2O for beginner's
PPTX
Introduction to Neural Netwoks
PPTX
Deep learning from scratch
PPTX
RNN JAN 2025 ppt fro scratch looking from basic.pptx
PDF
Recurrent Neural Networks
PDF
Convolutional_neural_network mechanism.pptx.pdf
PPTX
Deep Learning Sample Class (Jon Lederman)
PPTX
14_cnn complete.pptx
PPTX
Introduction to artificial neural network.pptx
Unit 2 ml.pptx
Demystifying NLP Transformers: Understanding the Power and Architecture behin...
lec10newwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
Artificial neutral network cousre of AI.ppt
Deep learning
Neural machine translation by jointly learning to align and translate.pptx
Neural Network in Depth presentation ,which provides in depth knowledge on n...
Introduction_to_Deep_learning_Standford_university by Angelica Sun
Reason To Switch to DNNDNNs excel in handling huge volumes of data (e.g., ima...
Visualization of Deep Learning
Introduction to deep learning
Introduction to Deep learning and H2O for beginner's
Introduction to Neural Netwoks
Deep learning from scratch
RNN JAN 2025 ppt fro scratch looking from basic.pptx
Recurrent Neural Networks
Convolutional_neural_network mechanism.pptx.pdf
Deep Learning Sample Class (Jon Lederman)
14_cnn complete.pptx
Introduction to artificial neural network.pptx
Ad

Recently uploaded (20)

PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PPTX
ai tools demonstartion for schools and inter college
PDF
System and Network Administration Chapter 2
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PPTX
CHAPTER 2 - PM Management and IT Context
PPTX
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
Wondershare Filmora 15 Crack With Activation Key [2025
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PPTX
Transform Your Business with a Software ERP System
PDF
top salesforce developer skills in 2025.pdf
PPTX
ISO 45001 Occupational Health and Safety Management System
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PPTX
L1 - Introduction to python Backend.pptx
Navsoft: AI-Powered Business Solutions & Custom Software Development
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
ai tools demonstartion for schools and inter college
System and Network Administration Chapter 2
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
How Creative Agencies Leverage Project Management Software.pdf
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
CHAPTER 2 - PM Management and IT Context
CHAPTER 12 - CYBER SECURITY AND FUTURE SKILLS (1) (1).pptx
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
Wondershare Filmora 15 Crack With Activation Key [2025
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PTS Company Brochure 2025 (1).pdf.......
Design an Analysis of Algorithms II-SECS-1021-03
Transform Your Business with a Software ERP System
top salesforce developer skills in 2025.pdf
ISO 45001 Occupational Health and Safety Management System
Adobe Illustrator 28.6 Crack My Vision of Vector Design
L1 - Introduction to python Backend.pptx
Ad

Sequencing and Attention Models - 2nd Version

  • 15.  Recurrent Neural Networks: As daunting as the diagram above might look, it is the easiest diagram and all you need to know to understand how RNNs work. Let me break it down.
  • 16.  • Step 1: Basic Neural Network: • You can consider the inputs (x1, x2, x3) as (location, sq. ft and house type) and the outputs (y1, y2, y3, y4) as (price, safety status, appraisal value, # of people it can accommodate). • Step 2: Simplifying the basic NN: • For simplicity, let's combine all the inputs (x1, x2, x3, ...) into one variable x(t), which we can call the input at timestamp t. • Let's remove the complexity of drawing the hidden layers by substituting them with one green box, and let's combine all the outputs into one variable y(t), which we can call the output at timestamp t. All we are doing here is combining the inputs and outputs so that the diagram looks simple.
  • 17.  • Step 3: Turn it upside down: • Again, we still have our simplified NN (neural network) from Step 2; here we are only turning it upside down. • Step 4: Multiple NNs: • This is the fun part. Consider multiple NNs from Step 3 stacked side by side and joined by one small string — that’s it, you have designed an RNN. As the name suggests, there is a recurrent flow of NNs joined by a string, which we call a “state” (h). • In the diagram shown above, consider y0 and y1 as intermediate outputs at particular timestamps and y2 as the final output. RNNs can also be depicted by the figure to the left of the vertical line in Step 4.
  • 18.  • Intuition: Consider a relay race, where there is usually a team of 4–5 people, each standing at a different checkpoint. The first person to start the race carries a baton which he/she has to pass to the next team member when they meet at the checkpoint, and the team ultimately completes the race with the baton. • In the case of an RNN, you can consider • The participants as the neural networks (x) • The baton being passed as the state (h) • The stopping of one player and the starting of the next after the baton is passed as the intermediate outputs (y0, y1) • Completing the race as predicting the output (y2)
  • 19.  Mathematics behind RNN This is the same figure as Fig 2.1 but just with more notations:
  • 20.  • Input vector x(t) • It’s a normal vector that consists of your inputs — nothing fancy here. • Weights • W(xh) -> Weights that transform the input into a form consumable by the hidden state • W(hh) -> Weights that define the relationship between the previous hidden state and the current hidden state • W(hy) -> Weights that transform the hidden state into a prediction-based output • Hidden State h(t) • Remember the example of the baton being passed from player to player in a relay race — that is what the hidden state does. As you move from one NN to the next, the hidden state captures all the information from the current NN and passes it along, so that the next NN is aware of the context when predicting something.
  • 21.  • Equation #2 in the diagram says that the hidden state (h(t)) is a parameterized function of the input to that NN (x(t)) and the previous hidden state (h(t-1)) — the input helps in deciding what to predict, and the previous hidden state gives your NN some context about whatever is happening in the model. Parameterized just means that they are defined using the weight matrices (explained above). • The function can be any NON-LINEAR function. In the diagram above, the non-linear function used is tanh, also called the hyperbolic tangent activation function. Note: The same function and the same set of parameters are used at every time step. RNNs have a state h(t) that is updated at each time step as a sequence is processed.
  • 22.  • Output Vector y(t) • The output vector simply uses the current hidden state (h(t)) and W(hy), which transforms the hidden state into the predicted output values, as shown in equation #1. • Loss function (L, L0, L1, L2, etc.) As with any model, you need to minimize the loss in order for your model to perform better, which basically means learning from your mistakes. The L0, L1, L2, etc. that you see in the diagram are the losses calculated at each stage; they are summed up to give L, and our task is then to minimize that final loss function using backpropagation.
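To make equations #1 and #2 concrete, below is a bare-bones NumPy sketch of one forward pass through time: shared weight matrices, the tanh state update, a per-step output, and a summed loss. The dimensions, the random data, and the mean-squared-error loss are illustrative assumptions, not values from the slides.

```python
import numpy as np

input_dim, hidden_dim, output_dim, T = 3, 5, 4, 6
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(hidden_dim, input_dim))    # input  -> hidden
W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden -> hidden (shared across time)
W_hy = rng.normal(size=(output_dim, hidden_dim))   # hidden -> output

xs = rng.normal(size=(T, input_dim))               # x(1)..x(T)
targets = rng.normal(size=(T, output_dim))         # dummy targets for the loss

h = np.zeros(hidden_dim)                           # initial hidden state
total_loss = 0.0
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ xs[t])           # equation #2: state update
    y = W_hy @ h                                   # equation #1: output
    total_loss += np.mean((y - targets[t]) ** 2)   # L_t, accumulated into the final loss L
print(total_loss)
```

Training would then backpropagate this summed loss through time to update the three shared weight matrices.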
  • 23.  Why encode the input? • We all know that machines do not understand language; they understand numbers. For example, even if I feed my model the string “This morning I took my cat for a” and the model outputs “walk”, beneath that lovely sentence is a mixture of numbers. • That’s exactly what encoding is. Encoding is the process of converting a string into a sequence of numbers understandable by the machine, and decoding is the process of converting the numbers back into a human-readable string.
  • 24.  • How do we do that? It is easy — • Step 1: We first take the set of unique words • Step 2: Allocate each word a unique number • Step 3: Represent our input in vector form using Step 1 and Step 2 (see figure below) → this step is basically called embedding. • The diagram above shows two types of embedding: one-hot encoding and learned embedding (just different ways to represent your inputs in a vectorized form)
  • 25.  Notes • Feature encoding is the process of turning categorical data into a numerical format that models can use. • While there are several ways to do this, two of the most popular methods are • One Hot Encoding • It works by converting each category in your dataset into a unique binary code — a series of 0s and 1s. • Here’s the magic: each category gets its very own slot in the vector, and we mark that slot with a 1 to indicate the presence of that category.
  • 26.  • Say you have a categorical variable like color, with the possible values: ["Red", "Green", "Blue"]. • One Hot Encoding will create a separate binary column for each color. For example: • Red: [1, 0, 0] • Green: [0, 1, 0] • Blue: [0, 0, 1] • Here’s what happens behind the scenes: each category gets its own “column” in a new binary matrix. • When a specific category is present, that column is marked with a 1, while the others are left as 0. It’s simple, effective, and universally understood by most machine learning algorithms.
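A minimal sketch of the color example above in plain Python:

```python
# One-hot encode the categorical variable color = ["Red", "Green", "Blue"]
colors = ["Red", "Green", "Blue"]
index = {c: i for i, c in enumerate(colors)}

def one_hot(value):
    vec = [0] * len(colors)   # one slot per category
    vec[index[value]] = 1     # mark the slot for the present category
    return vec

print(one_hot("Red"))    # [1, 0, 0]
print(one_hot("Green"))  # [0, 1, 0]
print(one_hot("Blue"))   # [0, 0, 1]
```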
  • 27.  • Embeddings are a way to represent categorical data in a dense, lower-dimensional space. • Unlike One Hot Encoding, which creates sparse binary vectors, embeddings compress information into smaller, more compact vectors that can capture relationships between categories. • Let’s step out of NLP for a second and use an eCommerce example. Imagine you’re building a recommendation system for an online store. You have product categories like ["Electronics", "Clothing", "Furniture"]. • With embeddings, your model might learn that Electronics and Furniture are more similar to each other than either one is to Clothing. • This allows your model to make better recommendations — say, someone browsing for TVs might also be shown home theater systems or speakers, rather than completely unrelated products.
  • 28.  • In this case, the embeddings would look something like: • Electronics: [0.9, 0.1, 0.3] • Furniture: [0.8, 0.2, 0.4] • Clothing: [0.1, 0.9, 0.2] • Notice how Electronics and Furniture have more similar vectors? This kind of relationship would be impossible to capture with One Hot Encoding.
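As a quick sanity check of that claim, the similarity between the vectors listed above can be computed directly. Cosine similarity is a common choice here, though the slides do not prescribe a particular measure:

```python
import numpy as np

# The illustrative embedding vectors from the slide
emb = {
    "Electronics": np.array([0.9, 0.1, 0.3]),
    "Furniture":   np.array([0.8, 0.2, 0.4]),
    "Clothing":    np.array([0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["Electronics"], emb["Furniture"]))  # ~0.98, very similar
print(cosine(emb["Electronics"], emb["Clothing"]))   # ~0.27, much less similar
```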
  • 29.  Backpropagation again? • Example: • After an interview, we sit back and reflect on how things might have gone if we had answered differently — understanding where we fell short and learning from it. Apply the same concept here: backpropagation is the process by which a model learns from its mistakes (by updating weights) and tries to minimize the loss. • The only difference, in this case, is that here the backpropagation is THROUGH TIME, meaning the loss is backpropagated through each network and also across the networks, because it is a SEQUENCE model.
  • 31. Problems with RNNs? • At a high level, the backpropagation algorithm works by calculating the gradient (i.e., the derivative of the final loss function w.r.t. each parameter) and then shifting the parameters in the direction that minimizes the loss.
  • 32.  • Here is a simplified version of the figure on the previous page. • Imagine calculating the gradient (the derivative of the loss function) • w.r.t. h0 -> this involves many repeated factors of W(hh) and repeated gradient computations at each neural network
  • 33.  • Intuition: • Remember playing a party game in which one person whispers a message to the person next to them and the story is then passed progressively to several others, with inaccuracies accumulating as the game goes on. • The point of the game is the amusement obtained from the last player’s announcement of the story they heard, which typically is nothing like the original. • That is exactly what happens with RNNs — calculating gradients while propagating backward becomes difficult as there is a chance of losing information as we go backward.
  • 34.  Challenges faced by RNNs: • Exploding gradients: When many gradient values are >1 • Occurs when large error gradients accumulate and result in very large updates to neural network model weights during training. • Gradients are used during training to update the network weights and it works best when these updates are small and controlled. When the magnitudes of the gradients accumulate, an unstable network is likely to occur, which can cause poor prediction results or even a model that reports nothing useful whatsoever. • There are methods to fix exploding gradients, which include gradient clipping and weight regularization, among others.
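As one illustration of gradient clipping, Keras optimizers expose clipping arguments directly; the thresholds below are illustrative, not recommended values:

```python
import tensorflow as tf

# Clip each gradient's L2 norm to at most 1.0 before the weight update
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Alternatively, clip each gradient element-wise into [-0.5, 0.5]
optimizer_cv = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)
```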
  • 35.  • Vanishing gradients: When many gradient values are <1 • Since the gradients control how much the network learns during training, if the gradients are very small or zero, then little to no training can take place, leading to poor predictive performance. This also leads to capturing short-term dependencies instead of long-term dependencies. • Potential Solutions: • Activation Functions: Using ReLU prevents gradients from shrinking when x>0 • Parameter Initialization: Initialize weights to the identity matrix and biases to zero — prevents weights from shrinking to zero • Gated Cells: In the green box that we have encountered in the previous diagrams, use some logic (i.e. gated cells) that controls what information is passed through. Based on the logic used in the gated cells, we classify them as LSTMs, GRUs, etc.
  • 36. Variants of Recurrent Neural Networks (RNNs) • There are several variations of RNNs, each designed to address specific challenges or optimize for certain tasks: • Vanilla RNN • This simplest form of RNN consists of a single hidden layer, where weights are shared across time steps. Vanilla RNNs are suitable for learning short-term dependencies but are limited by the vanishing gradient problem, which hampers long-sequence learning. • Bidirectional RNNs • Bidirectional RNNs process inputs in both forward and backward directions, capturing both past and future context for each time step. This architecture is ideal for tasks where the entire sequence is available, such as named entity recognition and question answering.
  • 37.  • Long Short-Term Memory Networks (LSTMs) • Long Short-Term Memory Networks (LSTMs) introduce a memory mechanism to overcome the vanishing gradient problem. Each LSTM cell has three gates: • Input Gate: Controls how much new information should be added to the cell state. • Forget Gate: Decides what past information should be discarded. • Output Gate: Regulates what information should be output at the current step. This selective memory enables LSTMs to handle long-term dependencies, making them ideal for tasks where earlier context is critical. • Gated Recurrent Units (GRUs) • Gated Recurrent Units (GRUs) simplify LSTMs by combining the input and forget gates into a single update gate and streamlining the output mechanism. This design is computationally efficient, often performing similarly to LSTMs, and is useful in tasks where simplicity and faster training are beneficial.
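For reference, all of these variants are available as drop-in Keras layers; the 64-unit size and the input shape below are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(None, 32))                  # (time steps, features)
vanilla = layers.SimpleRNN(64)(inputs)                     # plain RNN cell
lstm    = layers.LSTM(64)(inputs)                          # gated cell: input/forget/output gates
gru     = layers.GRU(64)(inputs)                           # gated cell: single update gate
bi_lstm = layers.Bidirectional(layers.LSTM(64))(inputs)    # forward + backward processing
```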
  • 38. Implementing a Text Generator Using Recurrent Neural Networks (RNNs) • Step 1: Import Necessary Libraries • We start by importing essential libraries for data handling and building the neural network (a consolidated code sketch for Steps 1–7 appears after Step 7 below).
  • 39. • Step 2: Define the Input Text and Prepare Character Set • We define the input text and identify unique characters in the text, which we’ll encode for our model.
  • 40.  • Step 3: Create Sequences and Labels • To train the RNN, we need sequences of fixed length (seq_length) and the character following each sequence as the label
  • 41.  • Step 4: Convert Sequences and Labels to One-Hot Encoding • For training, we convert X and y into one-hot encoded tensors • Step 5: Build the RNN Model • We create a simple RNN model with a hidden layer of 50 units and a Dense output layer with softmax activation.
  • 43.  • Step 6: Compile and Train the Model • We compile the model using the categorical_crossentropy loss and train it for 100 epochs. • Step 7: Generate New Text Using the Trained Model • After training, we use a starting sequence to generate new text character-by-character.
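Here is a consolidated sketch of what Steps 1–7 might look like in practice. The sample text, the vocabulary handling, and seq_length = 3 are illustrative assumptions; the 50-unit SimpleRNN, softmax output layer, categorical_crossentropy loss, and 100 epochs follow the descriptions above, but this is not the author's original listing.

```python
# Step 1: import the libraries
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# Step 2: input text and character set (text is a placeholder corpus)
text = "hello world, this is a tiny corpus for a character-level rnn"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Step 3: fixed-length sequences and next-character labels
seq_length = 3
sequences, labels = [], []
for i in range(len(text) - seq_length):
    sequences.append([char_to_idx[c] for c in text[i:i + seq_length]])
    labels.append(char_to_idx[text[i + seq_length]])

# Step 4: one-hot encode inputs and labels
vocab_size = len(chars)
X = tf.keras.utils.to_categorical(sequences, num_classes=vocab_size)
y = tf.keras.utils.to_categorical(labels, num_classes=vocab_size)

# Step 5: simple RNN with 50 hidden units and a softmax output layer
model = Sequential([
    SimpleRNN(50, input_shape=(seq_length, vocab_size)),
    Dense(vocab_size, activation="softmax"),
])

# Step 6: compile and train
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=100, verbose=0)

# Step 7: generate new text character by character from a seed sequence
seed = "hel"
generated = seed
for _ in range(40):
    x_pred = tf.keras.utils.to_categorical(
        [[char_to_idx[c] for c in generated[-seq_length:]]],
        num_classes=vocab_size,
    )
    next_idx = int(np.argmax(model.predict(x_pred, verbose=0)))
    generated += idx_to_char[next_idx]
print(generated)
```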
  • 45.  Encoder-Decoder Model • Encoder-decoder models are used to handle sequential data, specifically mapping input sequences to output sequences of different lengths, such as neural machine translation, text summarization, image captioning and speech recognition. In such tasks, mapping a token in the input to one in the output is often indirect. •
  • 46.  • An encoder-decoder is a type of neural network architecture that is used for sequence-to-sequence learning. It consists of two parts, the encoder and the decoder. The encoder processes an input sequence to produce a set of context vectors, which are then used by the decoder to generate an output sequence. • This architecture enables tasks such as machine translation, text summarization, and image captioning, among others. The idea behind it is to be able to take in one form of data (such as text) and convert it to another (such as images). By doing this, machines can learn how to understand complex relationships between different types of data and use them for more efficient processing.
  • 47.  • The encoder is the first part of an encoder-decoder architecture. It takes in an input sequence and processes it to create a set of context vectors, which are then used by the decoder. The way in which the encoding process works depends on the type of application being used. • For example, for text applications such as machine translation or summarization, the words in each sentence will be converted into numerical values that represent them mathematically. • Then, these numbers are fed through a series of layers that reduce their dimensionality while preserving relevant information about how they relate to one another within the sentence structure. This “encoded” version of each sentence is then passed along to the decoder for further processing.
  • 48.  • The decoder is responsible for taking this encoded representation and reconstructing it back into its original form (or something similar). In order to do this, there must be some kind of relationship between what was encoded and what needs to be reconstructed; otherwise it would just be guessing randomly. • To establish this link, most modern architectures use attention mechanisms that allow specific parts of an input sequence (such as individual words) to influence how later parts are processed or interpreted by the model — essentially giving greater weightage or importance to certain elements over others when generating output sequences from encoding data inputs. • By doing so, models become much more accurate at producing outputs that accurately reflect their input data sources and can even learn different patterns across various datasets without needing additional training cycles or parameter tuning procedures afterwards.
  • 49. How the Sequence to Sequence Model works? • To fully understand the model’s underlying logic, we will go over the below illustration:
  • 50.  • Encoder • Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input sequentially. • For every timestep (each input) t, the hidden state (hidden vector) h is updated according to the input at that timestep X[i]. • After all the inputs have been read by the encoder model, the final hidden state represents the context/summary of the whole input sequence. • Example: • Consider the input sequence “I am a Student” to be encoded. There will be a total of 4 timesteps (4 tokens) for the encoder model. At each time step, the hidden state h will be updated using the previous hidden state and the current input.
  • 52.  • At the first timestep t1, the previous hidden state h0 is taken to be zero or randomly initialized. So the first RNN cell updates the current hidden state using the first input and h0. Each cell outputs two things — the updated hidden state and the output for that stage. The outputs at each stage are discarded and only the hidden states are propagated to the next step. • The hidden states h_i are computed using the formula:
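(Assuming the usual seq2seq formulation, the update described here is h_t = f(W^(hh) · h_(t-1) + W^(hx) · x_t), where f is a non-linear activation such as tanh.)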
  • 53.  • At the second timestep t2, the hidden state h1 and the second input X[2] are given as input, and the hidden state h2 is updated according to both. This happens for all four stages in our example. • A stack of several recurrent units (LSTM or GRU cells for better performance), where each accepts a single element of the input sequence, collects information for that element, and propagates it forward. • In the question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i, where i is the order of that word.
  • 54.  • This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t. • Encoder Vector • This is the final hidden state produced by the encoder part of the model. It is calculated using the formula above. • This vector aims to encapsulate the information from all input elements in order to help the decoder make accurate predictions. • It acts as the initial hidden state of the decoder part of the model.
  • 55.  • Decoder • The decoder generates the output sequence by predicting the next output Yt given the hidden state ht. • The input to the decoder is the final hidden vector obtained at the end of the encoder model. • Each step has three inputs: the hidden vector from the previous step h(t-1), the previous output y(t-1), and the original encoder vector h. • At the first step, the output vector of the encoder, the START symbol and an empty hidden state h(t-1) are given as input; the outputs obtained are y1 and the updated hidden state h1 (the information of the output is subtracted from the hidden vector).
  • 56.  • The second step takes the updated hidden state h1, the previous output y1 and the original hidden vector h as its current inputs, and produces the hidden vector h2 and output y2. • The outputs produced at each timestep of the decoder form the actual output sequence. The model keeps predicting outputs until the END symbol occurs. • A stack of several recurrent units, where each predicts an output y_t at a time step t. • Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state. • In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i, where i is the order of that word.
  • 57.  • Any hidden state h_i is computed using the formula: • As you can see, we are just using the previous hidden state to compute the next one.
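(Assuming the usual plain-decoder formulation without attention, this is h_t = f(W^(hh) · h_(t-1)).)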
  • 58.  • Output Layer • We use the Softmax activation function at the output layer. • It is used to produce a probability distribution from a vector of values, with the target class having the highest probability. • The output y_t at time step t is computed using the formula: • We calculate the outputs using the hidden state at the current time step together with the respective weight W(S). Softmax is used to create a probability vector that will help us determine the final output (e.g. the word, in the question-answering problem).
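(Assuming the usual formulation, the output step described here is y_t = softmax(W^(S) · h_t).)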
  • 59.  • The power of this model lies in the fact that it can map sequences of different lengths to each other. As you can see the inputs and outputs are not correlated and their lengths can differ. This opens a whole new range of problems that can now be solved using such architecture.
  • 60.  Applications • It has many applications, such as • Google’s machine translation • Question-answering chatbots • Speech recognition • Time-series applications, etc.
  • 61.  • Use Cases of the Sequence to Sequence Model • A sequence to sequence model lies behind numerous systems that you use on a daily basis. For instance, the seq2seq model powers applications like Google Translate, voice-enabled devices and online chatbots. Such applications include: • Machine translation — a 2016 paper from Google shows how the seq2seq model’s translation quality “approaches or surpasses all currently published results”. Screenshot of Google translation.
  • 62. Applications of Encoder-Decoder • Applications of encoder-decoders have also been explored in the field of image captioning. Using an encoder-decoder architecture, the model can take an input image and generate a caption that accurately describes the contents of the image. • This is achieved by first encoding each pixel within the image to produce a set of context vectors which are then used by a decoder to create output sequences (e.g., words). By utilizing attention mechanisms between these two parts, models become much more accurate at describing images based on their content rather than just randomly generating sentences from scratch.
  • 63.  • Other applications for encoder-decoders include using them for tasks such as machine translation or summarization, where they can be used to translate text from one language into another while preserving its meaning or summarize long documents into shorter versions without losing important information. • Additionally, researchers have also begun exploring how this type of architecture could potentially be utilized in medical diagnosis and natural language processing applications as well; although further research is needed before its potential in these areas can be fully realized.
  • 64. Advantages of Encoder-Decoder • One of the major advantages of using an encoder-decoder architecture is its enhanced performance. This type of model can learn complex relationships between different types of data and use them to process information much faster than traditional methods. • Additionally, since it does not rely on manual feature engineering, it is able to quickly adapt to changing input data without needing additional training cycles or parameter tuning procedures afterwards. As a result, this makes for faster training times compared to other architectures and allows models to achieve better results with fewer resources being utilized in the process.
  • 65.  • Another advantage that comes with using an encoder-decoder architecture is its ability to generalize well across various datasets and tasks. By utilizing attention mechanisms between its two parts, the model can accurately pick up patterns from different datasets without requiring extensive retraining after each new one is introduced, thus making it extremely useful for applications such as machine translation where multiple languages need to be supported by a single system. • Additionally, this means that any changes made during development are easier and more efficient, as they only require minor adjustments rather than complete redesigns, due to how quickly the model is able to learn new concepts from existing data sources.
  • 66.  • Finally, encoder-decoders also have significant potential when applied to medical diagnosis applications or natural language processing tasks such as text summarization or image captioning, although further research needs to be conducted before their full capabilities in these areas can be realized. • In conclusion, by combining enhanced performance with fast training times and strong generalization abilities across various datasets/tasks — it’s easy to see why encoder-decoders have become so popular amongst researchers in recent years.
  • 67. Limitations of Encoder-Decoder • One of the main limitations with encoder-decoder architectures is their ability to handle natural language processing (NLP) tasks. This type of model relies heavily on data pre-processing in order to accurately understand and interpret text, which can be a difficult and time-consuming process. • For instance, when dealing with large datasets containing complex sentences or phrases that contain multiple levels of grammar or syntax, the encoding process alone can be quite challenging; making it difficult for models to accurately capture all relevant information within a given input sequence.
  • 68.  • Additionally, since attention mechanisms are used between the two parts of an encoder-decoder architecture — it’s important that these are configured correctly so as not to give too much weightage to certain elements over others; otherwise this could lead to output sequences being generated that don’t accurately reflect their inputs. • Another limitation associated with using an encoder-decoder architecture is data pre-processing issues. Since most modern applications require input data sets in numerical formats rather than raw text or images — this means additional steps must be taken before any training can begin.
  • 69.  • For example, if you wanted your model to learn how to translate English into Spanish — you would first have to convert each sentence into numerical values representing words from both languages; otherwise the system would just be guessing randomly, without any context, as it tries to make sense of the input sequence provided by its user/programmer. • As such, taking extra care during this step is essential for ensuring accurate results downstream once training begins — something which makes using encoder-decoders slightly more complicated compared to other AI systems available today.
  • 70. Attention Mechanism in Deep Learning • The brain instinctively focuses on specific parts of an image or particular words in a sentence that are most relevant to your task. • This selective focus is what we refer to as attention, and it’s a fundamental aspect of human cognition. • Attention mechanisms in deep learning aim to mimic this selective focus in artificial neural networks.
  • 71. Attention Mechanism in Deep Learning • What exactly is the attention mechanism? • Which Georgetown player, among the guys in white, is wearing the captaincy band? • When you were trying to figure out answers to the questions above, did your mind do this weird thing where it focused on only part of the image?
  • 73.  • What happened? • You were ‘focusing’ on a smaller part of the whole thing because you knew the rest of the image/sentence was not useful to you then. • So when you were trying to figure out the color of the soccer ball, your mind was showing you the soccer ball in HD but the rest of the image was almost blurred. • Similarly, when reading the question, once you understood that the guys in white were Georgetown players, you could blur out that part of the sentence to simplify its meaning.
  • 74.  • In an attempt to borrow inspiration from how a human mind works, researchers in Deep Learning have tried replicating this behavior using what is known as the ‘attention mechanism’. • Very simply put, the attention mechanism is just a way of focusing on only a smaller part of the complete input while ignoring the rest.
  • 75.  • Attention can be simply represented as a three-step mechanism. • Since we are talking about attention in general, I will not go into the details of how this adapts to CV or NLP, which is very straightforward. • Create a probability distribution that rates the importance of the various input elements. These input representations can be words, pixels, vectors, etc. Creating these probability distributions is actually a learnable task. • Scale the original input using this probability distribution, so that values that deserve more attention get enhanced while others get diluted. Kinda like blurring everything else that doesn’t need attention. • Now use these newly scaled inputs and do further processing to get focused outputs/results.
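A toy NumPy sketch of these three steps, with made-up inputs and a made-up scoring vector standing in for the learnable part:

```python
import numpy as np

inputs = np.random.randn(5, 8)          # 5 input elements (words, pixels, ...), 8 features each
w = np.random.randn(8)                  # scoring parameters (learned in a real model)

scores = inputs @ w                                # step 1: raw importance score per element
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> probability distribution
scaled = inputs * weights[:, None]                 # step 2: enhance important elements, dilute the rest
focused = scaled.sum(axis=0)                       # step 3: further processing (here, a pooled summary)
```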
  • 76.  Why are Attention Mechanisms Important? • Attention mechanisms have become indispensable in various deep-learning applications due to their ability to address some critical challenges: • Long Sequences: • Traditional neural networks struggle with processing long sequences, such as translating a paragraph from one language to another. Attention mechanisms allow models to focus on the relevant parts of the input, making them more effective at handling lengthy data. • Contextual Understanding: • In tasks like language translation, understanding the context of a word is crucial for accurate translation. Attention mechanisms enable models to consider the context by assigning different attention weights to each word in the input sequence. • Improved Performance: • Models equipped with attention mechanisms often outperform their non-attention counterparts. They achieve state-of-the-art results in tasks like machine translation, image classification, and speech recognition.
  • 77.  • Self-Attention Mechanism • Self-attention, also known as intra-attention, is commonly used in tasks involving sequences, such as natural language processing. It allows the model to weigh the importance of each element in the sequence concerning all the other elements. The Transformer model, for instance, relies heavily on self- attention. • Scaled Dot-Product Attention • Scaled Dot-Product Attention is a key component of the Transformer architecture. It calculates attention scores by taking the dot product of a query vector and the keys, followed by scaling and applying a softmax function. This type of attention mechanism is highly efficient and has contributed to the success of Transformers in various applications.
  • 78.  • Multi-Head Attention • Multi-Head Attention extends the idea of attention by allowing the model to focus on different parts of the input simultaneously. It achieves this by using multiple sets of learnable parameters, each generating different attention scores. This technique enhances the model’s ability to capture complex relationships within the data. • Location-Based Attention • Location-based attention is often used in image-related tasks. It assigns attention scores based on the spatial location of elements in the input. This can be particularly useful for tasks like object detection and image captioning.
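For instance, multi-head attention is available as a built-in Keras layer; the tensor shapes and head count below are illustrative:

```python
import tensorflow as tf

x = tf.random.normal((2, 10, 64))                 # (batch, sequence, features)
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
out, scores = mha(query=x, value=x, return_attention_scores=True)
print(out.shape)      # (2, 10, 64)
print(scores.shape)   # (2, 4, 10, 10) -- one attention map per head
```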
  • 79.  • Implementing Attention Mechanisms • Now that we understand the importance of attention mechanisms, let’s explore how to implement them in your deep-learning models. For this, we’ll use Python and the popular deep learning library TensorFlow. In this example, we’ve added a simple self-attention layer to the model. Depending on your specific task, you can experiment with different types of attention mechanisms and architectures.
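A minimal sketch of what such a model might look like; the input shape, the LSTM width, the pooling step, and the binary output are assumptions for illustration, not the author's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

seq_len, feature_dim = 20, 64                     # illustrative input shape

inputs = layers.Input(shape=(seq_len, feature_dim))
x = layers.LSTM(64, return_sequences=True)(inputs)

# Simple self-attention: the sequence attends over itself (query = value = x)
attended = layers.Attention()([x, x])

x = layers.GlobalAveragePooling1D()(attended)     # pool the attended sequence
outputs = layers.Dense(1, activation="sigmoid")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```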
  • 80.  Attention Mechanisms in Real-world Applications •
  • 81.  Machine Translation • Machine translation is an area where attention mechanisms have revolutionized the game. Traditionally, translation models struggled with handling long sentences or paragraphs. • With attention mechanisms, these models can now focus on specific words or phrases in the source language while generating the target language, greatly improving translation accuracy. • Google’s Transformer model, for instance, utilizes attention mechanisms to provide more fluent and contextually accurate translations.
  • 82.  Sample of the Python code for machine translation
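A hypothetical minimal sketch of such a translator: an encoder-decoder with dot-product (Luong-style) attention in Keras. The vocabulary sizes and latent dimension are illustrative, and this is not the original code sample:

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, LSTM, Attention, Concatenate, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, latent_dim = 5000, 6000, 256   # assumed sizes

# Encoder: embeds source tokens and returns its full sequence of hidden states
enc_inputs = Input(shape=(None,), name="source_tokens")
enc_emb = Embedding(src_vocab, latent_dim)(enc_inputs)
enc_seq, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)

# Decoder: starts from the encoder's final states (teacher forcing at training time)
dec_inputs = Input(shape=(None,), name="target_tokens")
dec_emb = Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c]
)

# Dot-product attention: decoder states attend over all encoder states
context = Attention()([dec_seq, enc_seq])
dec_concat = Concatenate(axis=-1)([dec_seq, context])
outputs = Dense(tgt_vocab, activation="softmax")(dec_concat)

model = Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```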
  • 85.  • Image Captioning • When it comes to describing the content of an image in natural language, attention mechanisms are invaluable. Models equipped with these mechanisms can focus on different regions of the image, generating captions that not only describe the image accurately but also provide context (just like GPT 4.0 can analyze an image). • This technology is particularly useful in applications like autonomous vehicles, where the vehicle needs to understand its surroundings and communicate effectively.
  • 86.  • Speech Recognition • In speech recognition, understanding context is essential for accurate transcription. Attention mechanisms have played a crucial role in improving speech recognition systems. By focusing on specific parts of the audio input, these systems can transcribe spoken words more accurately, even in noisy environments. • Question Answering • Question-answering systems, like those used in chatbots or virtual assistants, benefit from attention mechanisms as well. These mechanisms help the model focus on relevant parts of the input text while generating responses, leading to more contextually accurate and coherent answers.
  • 87.  The Evolution of Attention Mechanisms • As with any technology, attention mechanisms have evolved. Researchers continue to explore new variants and improvements to make these mechanisms even more effective. Some recent developments include: • Sparse Attention: • This approach aims to make attention more efficient by allowing models to focus on only a subset of the input data, rather than all elements. This can significantly reduce computational requirements while maintaining performance.
  • 88.  • Memory Augmented Networks: • These models combine attention mechanisms with external memory, allowing them to store and retrieve information efficiently. This is particularly useful in tasks that involve reasoning and long-term dependencies. • Cross-modal Attention: • In scenarios where data comes from multiple modalities, such as text and images, cross-modal attention mechanisms enable models to learn relationships between different types of data. This is valuable in applications like image captioning.
  • 89.  How Attention Mechanism Works? • Here’s how they work: • Breaking Down the Input: Let’s say you have a bunch of words (or any kind of data) that you want the computer to understand. First, it breaks down this input into smaller pieces, like individual words. • Picking Out Important Bits: Then, it looks at these pieces and decides which ones are the most important. It does this by comparing each piece to a question or ‘query’ it has in mind. • Assigning Importance: Each piece gets a score based on how well it matches the question. The higher the score, the more important that piece is.
  • 90.  • Focusing Attention: After scoring each piece, it figures out how much attention to give to each one. Pieces with higher scores get more attention, while less important ones get less attention. • Putting It All Together: Finally, it adds up all the pieces, but gives more weight to the important ones. This way, the computer gets a clearer picture of what’s most important in the input.
  • 91. How Was the Attention Mechanism Introduced in Deep Learning? • The attention mechanism emerged as an improvement over the encoder-decoder-based neural machine translation system in natural language processing (NLP). Later, this mechanism, or its variants, was used in other applications, including computer vision, speech processing, etc. • Before Bahdanau et al. proposed the first attention model in 2015, neural machine translation was based on encoder-decoder RNNs/LSTMs. Both the encoder and the decoder are stacks of LSTM/RNN units. The system works in the following two steps:
  • 92.  1. The encoder LSTM is used to process the entire input sentence and encode it into a context vector, which is the last hidden state of the LSTM/RNN. This is expected to be a good summary of the input sentence. All the intermediate states of the encoder are ignored, and the final state is supposed to be the initial hidden state of the decoder. 2. The decoder LSTM or RNN units produce the words of the output sentence one after another.
  • 93.  • In short, there are two RNNs/LSTMs. One we call the encoder — this reads the input sentence and tries to make sense of it before summarizing it. It passes the summary (context vector) to the decoder, which translates the input sentence just by looking at that summary. • The main drawback of this approach is evident. If the encoder makes a bad summary, the translation will also be bad. And indeed it has been observed that the encoder creates a bad summary when it tries to understand longer sentences. This is called the long-range dependency problem of RNNs/LSTMs.
  • 94.  • RNNs cannot remember longer sentences and sequences due to the vanishing/exploding gradient problem. It can remember the parts which it has just seen. Even Cho et al (2014), who proposed the encoder-decoder network, demonstrated that the performance of the encoder-decoder network degrades rapidly as the length of the input sentence increases. • Although an LSTM is supposed to capture the long-range dependency better than the RNN, it tends to become forgetful in specific cases. Another problem is that there is no way to give more importance to some of the input words compared to others while translating the sentence.
  • 95.  • Now, let’s say we want to predict the next word in a sentence, and its context is located a few words back. Here’s an example — “Despite originally being from Uttar Pradesh, as he was brought up in Bengal, he is more comfortable in Bengali”. • In this sentence, if we want to predict the word “Bengali”, the phrases “brought up” and “Bengal” should be given more weight while predicting it. And although Uttar Pradesh is another state’s name, it should be “ignored”.
  • 96.  • So is there any way we can keep all the relevant information in the input sentence intact while creating the context vector? • Yes: whenever the proposed model generates a word in the output, it searches for a set of positions in the encoder hidden states where the most relevant information is available. This idea is called ‘Attention’.
  • 97.  Understanding the Attention Mechanism • This is the diagram of the Attention model shown in Bahdanau’s paper. • The Bidirectional LSTM used here generates a sequence of annotations (h1, h2,….., hTx) for each input sentence. • All the vectors h1,h2.., etc., used in their work are the concatenation of forward and backward hidden states in the encoder.
  • 98.  • To put it in simple terms, all the vectors h1,h2,h3…., hTx are representations of Tx number of words in the input sentence. • In the simple encoder and decoder model, only the last state of the encoder LSTM was used (hTx in this case) as the context vector.
  • 99.  • Now, the question is how should the weights be calculated? Well, the weights are also learned by a feed-forward neural network and I’ve mentioned their mathematical equation below. • The context vector ci for the output word yi is generated using the weighted sum of the annotations: • The weights αij are computed by a softmax function given by the following equation: • eij is the output score of a feedforward neural network described by the function a that attempts to capture the alignment between input at j and output at i.
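(Written out in the standard Bahdanau notation, which these sentences describe:
c_i = Σ_j α_ij · h_j,  α_ij = exp(e_ij) / Σ_k exp(e_ik),  e_ij = a(s_(i-1), h_j),
where s_(i-1) is the previous decoder state and a is the small feed-forward alignment network.)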
  • 100. • Basically, if the encoder produces Tx “annotations” (the hidden state vectors), each having dimension d, then the input dimension of the feedforward network is (Tx, 2d) (assuming the previous state of the decoder also has d dimensions and the two vectors are concatenated). This input is multiplied by a matrix Wa of dimensions (2d, 1) (followed, of course, by the addition of a bias term) to get the scores eij (of dimension (Tx, 1)). • On top of these eij scores, a hyperbolic tangent (tanh) function is applied, followed by a softmax, to get the normalized alignment scores for output j:
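A small NumPy sketch that follows this description literally, with illustrative sizes and random values standing in for the learned parameters:

```python
import numpy as np

Tx, d = 4, 8                      # 4 encoder annotations, each of dimension d
H = np.random.randn(Tx, d)        # encoder annotations h_1..h_Tx
s_prev = np.random.randn(d)       # previous decoder state s_(i-1)

# Concatenate s_(i-1) with every annotation -> shape (Tx, 2d)
concat = np.concatenate([np.tile(s_prev, (Tx, 1)), H], axis=1)

W_a = np.random.randn(2 * d, 1)   # alignment weights of shape (2d, 1)
b_a = np.zeros(1)                 # bias term

e = np.tanh(concat @ W_a + b_a).squeeze(-1)   # scores e_ij, shape (Tx,)
alpha = np.exp(e) / np.exp(e).sum()           # softmax -> attention weights
context = alpha @ H                           # context vector c_i, shape (d,)
```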
  • 102. Implementing a Simple Attention Model in Python Using Keras • Here we will discuss how a simple attention model can be implemented in Keras. The purpose of this demo is to show how a simple attention layer can be implemented in Python. As an illustration, we have run this demo on a simple sentence-level sentiment analysis dataset collected from the University of California Irvine Machine Learning Repository. You can select any other dataset if you prefer, and you can implement a custom attention layer to see a more prominent result.
  • 103.  • Here, there are only two sentiment categories – ‘0’ means negative sentiment, and ‘1’ means positive sentiment. You’ll notice that the dataset has three files. Among them, two files have sentence-level sentiments and the 3rd one has a paragraph level sentiment. We are using the sentence level data files (amazon_cells_labelled.txt, yelp_labelled.txt) for simplicity. We have read and merged the two data files. This is what our data looks like:
  • 104. • Complete Code Available at • https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/ • Dataset Source: • https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences • Assignment: Implement it, run it, produce the result and submit before the next lecture.
  • 105. Other tasks to Implement • Autocorrect Feature using NLP in Python • Complete Tutorial on NLP and its implementation using Deep Learning • https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/ • Good collection of Articles with Code • https://paperswithcode.com/method/global-local-attention
  • 106.  Self Attention • The Transformer model has become a game-changer in natural language processing (NLP). Its secret tool? A mechanism called self- attention, or scaled dot-product attention. This innovative approach allows the model to focus on relevant parts of the input sequence when processing each word, unlike traditional models that treat all words equally. In this article, we’ll break down how self-attention works step-by-step, using a clear example to make the concepts easier to grasp.
  • 107.  Embeddings: Representing Words as Vectors • In a Transformer model, each word is represented as a vector of numbers, known as an embedding. These embeddings capture the semantic meaning of the words. Let’s consider a simple example with the following embeddings: • Suppose our input sentence is “the cat sat on the mat”. The corresponding embedded tokens would be:
  • 109.  Self-Attention Mechanism • The goal of the self-attention mechanism is to determine which words in the input sequence are relevant to each word. This involves four steps: • Compute dot products between queries and keys. • Scale the dot products. • Apply softmax to obtain attention weights. • Use the attention weights to compute a weighted sum of the values.
  • 110.  Explanation • Queries, Keys, and Values: In the simplest case, we use the same embeddings for queries (Q), keys (K), and values (V): • Matrix Multiplication (Dot Product): We compute the dot product of the query matrix Q and the transpose of the key matrix K: • Scaling the Dot Products: We scale the dot products by dividing by the square root of the dimension of the key vectors (dk = 3):
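(Taken together, these steps implement the standard scaled dot-product attention formula: Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V.)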
  • 111.  • This results in • Applying Softmax: After computing the dot products and scaling them, the next step in the attention mechanism is to apply the softmax function to these scaled values to obtain the attention weights. Let’s break down each term and the process in detail.
  • 112.  • What are Logits? • In the context of neural networks, logits refer to the raw, unnormalized scores output by a model. These scores are typically the result of a linear transformation applied to the input features before applying an activation function. • In our case, the logits are the results of the dot products between the query and key vectors. These raw scores indicate the similarity between the query and each key, but they are not yet probabilities.
  • 113.  • Scaling the Logits • Before applying the softmax function, we scale the logits. The reason for scaling is to prevent the softmax function from producing extremely small gradients, which can happen when the logits are too large. This scaling is done by dividing each logit by the square root of the dimension of the key vectors (denoted as d_k): • This scaling helps stabilize the gradients during training.
  • 114.  • Applying the Softmax Function • The softmax function is used to convert the logits into probabilities. It takes a vector of raw scores (logits) and transforms them into a probability distribution. The softmax function is defined as: • where Zi is the i-th logit, and the denominator is the sum of the exponentials of all logits. • For our scaled attention logits, the softmax function normalizes these scores, ensuring they sum to 1. This normalization helps us interpret the values as probabilities, which we call attention weights.
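(Written out for logits z_1, …, z_n: softmax(z_i) = exp(z_i) / Σ_j exp(z_j).)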
  • 115. Scaled attention logits - Further Explained • Let’s revisit the scaled attention logits from our example: • Apply Softmax Function: We apply the softmax function to each row of the scaled logits to get the attention weights. For the first row, this would be: • attention_weights[0]=softmax([0.081,0.185,0.289,0.392,0.081,0.496])
  • 116. Computing this step-by-step • Compute exponentials:
  • 117.  • This process is repeated for each row in the scaled attention logits to get the full attention weight matrix. • These weights show how much attention the first token “the” should pay to each token in the sequence, including itself. • The token “mat” has the highest weight, indicating it is the most relevant for “the” in this context.
  • 118.  • Weighted Sum of Values: Finally, we compute the output by multiplying the attention weights by the value matrix V: • For the first token “the”, the output is:
  • 119.  • Interpreting Attention Weights • The attention weights indicate how much focus each word should give to every other word in the sequence. Higher weights mean higher relevance. For example, in our case, the word “the” pays the most attention to the word “mat” (with a weight of 0.223). •
  • 120. Here is the full program for your reference • https://medium.com/@saraswatp/understanding-scaled-dot-product-attention-in-transformer-models-5fe02b0f150c
  • 121.  Visualizing Attention Weights • To understand the model’s attention mechanism better, we can visualize the attention weights using a heatmap (see final part of code). Here’s a simple example using Matplotlib:
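A sketch of such a heatmap with Matplotlib; in practice you would pass in the attention-weight matrix computed earlier, while here a random softmax-normalized matrix stands in for it:

```python
import numpy as np
import matplotlib.pyplot as plt

tokens = ["the", "cat", "sat", "on", "the", "mat"]
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(tokens), len(tokens)))                     # placeholder scores
weights = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)    # row-wise softmax

fig, ax = plt.subplots()
im = ax.imshow(weights, cmap="viridis")   # one row per query token, one column per attended token
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Attended-to token")
ax.set_ylabel("Query token")
ax.set_title("Self-attention weights")
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```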
  • 122.  Here is the output: the attention-weight heatmap.
  • 123.  Final Words • Overall, the scaled dot-product attention mechanism allows the Transformer model to focus on the most relevant parts of the input for each word. • By examining the attention weights, we can understand which words the model considers important, providing insights into its decision- making process. • This mechanism is a powerful tool for capturing long-range dependencies and improving the model’s ability to process complex sequences.