Attention Mechanisms in NLP: From Bahdanau to Self-Attention

Not all tokens are created equal. Attention helps models decide where to look and when.

From powering neural machine translation to enabling massive language models like GPT and BERT, attention mechanisms lie at the heart of modern NLP. This in-depth guide walks through foundational and advanced attention concepts, from Bahdanau and Luong attention to Multi-Head Attention, Positional Encoding, Masking, Cross-Attention, and emerging techniques like Sparse and Linear Attention.

Whether you're an ML engineer, data scientist, or AI aspirant, this article will help you understand and apply attention effectively.


What is Attention in NLP?

Attention allows models to focus on the most relevant parts of input sequences when producing output. Instead of encoding an entire input into a single vector, attention lets the model weigh parts of the input dynamically.

Real-world Analogy: Think of reading comprehension. You don’t remember every word but focus on relevant phrases to answer questions. Models do the same with attention.


Why Is Attention Important?

Without attention:

  • Fixed-length vector bottleneck in encoder-decoder models
  • Long-distance dependencies get lost
  • Quality degrades on longer input sequences

With attention:

  • Dynamic context awareness
  • Better handling of long sequences
  • Enhanced model interpretability
  • Parallel computation (especially with self-attention)


Types of Attention Mechanisms

  1. Bahdanau (Additive) Attention
  2. Luong (Multiplicative) Attention
  3. Global vs Local Attention
  4. Self-Attention
  5. Multi-Head Attention
  6. Positional Encoding
  7. Masking
  8. Cross-Attention
  9. Sparse/Linear Attention (Emerging Trends)


1. Bahdanau Attention (Additive)

Proposed in 2014, Bahdanau Attention helps decoder steps focus on relevant encoder states using a learnable scoring mechanism.

  • Bahdanau Attention is also known as Additive Attention because it combines the encoder states and the decoder state additively, through a small feed-forward network, when computing the alignment score.
  • Its aim was to improve the seq2seq model for machine translation by letting the decoder align itself with the relevant parts of the input sentence at each step.


The entire process of applying attention is as follows:

  • The encoder produces a hidden state for each element of the input sequence.
  • Alignment scores are calculated between the previous decoder hidden state and each of the encoder hidden states.
  • The alignment scores are collected into a single vector and passed through a softmax.
  • Each encoder hidden state is weighted by its alignment score, and the weighted states are summed to form the context vector.
  • The context vector is concatenated with the previous decoder output and fed into the decoder for that time step, along with the previous decoder hidden state, to produce a new output.
  • The process repeats for each decoder time step until an end-of-sequence token is produced or the output exceeds the specified maximum length.

Formula:

score(s_t, h_i) = v^T tanh(W_1 h_i + W_2 s_t)        

Where:

  • h_i: encoder hidden state
  • s_t: decoder state
  • W_1, W_2, v: learned parameters

Code Snippet:

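Below is a minimal NumPy sketch of the additive scoring and context step. The parameter names W1, W2, v follow the formula above; the shapes and the toy usage are purely illustrative.

import numpy as np

def bahdanau_attention(enc_states, dec_state, W1, W2, v):
    """Additive attention: score(s_t, h_i) = v^T tanh(W1 h_i + W2 s_t)."""
    # enc_states: (src_len, enc_dim), dec_state: (dec_dim,)
    # W1: (attn_dim, enc_dim), W2: (attn_dim, dec_dim), v: (attn_dim,)
    scores = np.tanh(enc_states @ W1.T + dec_state @ W2.T) @ v   # (src_len,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                            # softmax over source positions
    context = weights @ enc_states                               # weighted sum -> (enc_dim,)
    return context, weights

# toy usage with random parameters
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))        # 6 source positions, encoder hidden size 8
dec = rng.normal(size=(10,))         # decoder hidden size 10
W1, W2, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 10)), rng.normal(size=(16,))
context, weights = bahdanau_attention(enc, dec, W1, W2, v)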

2. Luong Attention (Multiplicative)

Introduced in 2015, Luong's attention computes alignment via dot-product, making it computationally lighter.

  • Luong’s attention is also referred to as Multiplicative attention.
  • It reduces the encoder states and the decoder state to attention scores through simple matrix multiplications, which makes it faster and more memory-efficient.


Variants:

  • Dot: score = s_t^T h_i
  • General: score = s_t^T W h_i
  • Concat: score = v^T tanh(W [s_t; h_i])

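A quick NumPy sketch of the dot and general variants (function and parameter names are illustrative; for the dot variant the encoder and decoder dimensions must match):

import numpy as np

def luong_scores(dec_state, enc_states, W=None, variant="dot"):
    # dec_state: (d_dec,), enc_states: (src_len, d_enc)
    # "dot" requires d_dec == d_enc; "general" uses W with shape (d_dec, d_enc)
    if variant == "dot":
        scores = enc_states @ dec_state          # s_t^T h_i for every source position i
    elif variant == "general":
        scores = enc_states @ W.T @ dec_state    # s_t^T W h_i
    else:
        raise ValueError("unsupported variant")
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()               # attention distribution over the source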

3. Global vs Local Attention

Global: Considers all encoder hidden states (Bahdanau, Luong)


Local: Focuses on a subset (windowed) of encoder states


Analogy: Reading the whole book vs. scanning one paragraph
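As a toy illustration of the local idea (in the spirit of Luong's local attention), the decoder might only score a window of encoder states around a predicted alignment position p_t; the helper below is hypothetical:

import numpy as np

def local_attention_window(enc_states, p_t, D=2):
    # Only keep encoder states within +/- D positions of the predicted point p_t.
    lo, hi = max(0, p_t - D), min(len(enc_states), p_t + D + 1)
    return enc_states[lo:hi], np.arange(lo, hi)   # windowed states and their positions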


4. Self-Attention

Each token attends to all others in the same sequence. It's the engine behind Transformers.

Formula:

Attention(Q, K, V) = softmax(QK^T / √(d_k)) V        

Where:

  • Q: Query
  • K: Key
  • V: Value (all derived from input)

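A compact NumPy sketch of scaled dot-product self-attention, with Q, K, and V all projected from the same input X. The projection matrices here are random placeholders, not trained weights.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) pairwise scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# self-attention: Q, K, V all come from the same input X
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                     # 5 tokens, model dim 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)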

5. Multi-Head Attention

Multiple self-attention operations run in parallel with independent parameter sets.

Benefits:

  • Captures relationships in different subspaces
  • Adds model capacity without increasing per-head dimension

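A self-contained NumPy sketch of the split-attend-concatenate pattern; the dimensions are illustrative, and d_model is assumed to be divisible by num_heads.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads                           # assumes d_model % num_heads == 0
    # project, then split the model dimension into heads: (num_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    # scaled dot-product attention computed for all heads at once
    scores = np.einsum('hqd,hkd->hqk', Q, K) / np.sqrt(d_head)
    context = softmax(scores) @ V                           # (num_heads, seq_len, d_head)
    # concatenate the heads and apply the output projection
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# usage: out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads=4)
# where X is (seq_len, d_model) and each W* is (d_model, d_model)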

6. Positional Encoding

Self-attention on its own is order-agnostic: it treats the input as an unordered set of tokens. Positional encoding injects token-order information into the embeddings.

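A sketch of the sinusoidal positional encoding from the original Transformer, which is simply added to the token embeddings (assumes an even d_model; the usage line is illustrative):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even indices: sine
    pe[:, 1::2] = np.cos(angles)                             # odd indices: cosine
    return pe

# usage: X = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)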

7. Masking in Attention

Use Cases:

  • Padding Mask: Prevents attention to padding tokens.
  • Look-Ahead Mask: Prevents peeking at future tokens (used in GPT-like models).

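Both masks can be written as boolean matrices and passed to an attention function like the scaled_dot_product_attention sketch above (True means the position may be attended to; pad_id is an illustrative choice):

import numpy as np

def padding_mask(token_ids, pad_id=0):
    # True where the key position is a real token, False where it is padding
    return (token_ids != pad_id)[None, :]            # (1, seq_len), broadcasts over queries

def look_ahead_mask(seq_len):
    # lower-triangular: position i may attend only to positions <= i
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# decoder self-attention typically combines both:
# mask = look_ahead_mask(len(ids)) & padding_mask(ids)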

8. Cross-Attention

Used in encoder-decoder architectures where:

  • Query = Decoder output
  • Key/Value = Encoder output

This allows the decoder to focus on specific parts of the encoded input.

Example: Translation where English output attends to French input representations.
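Cross-attention reuses the same scaled dot-product machinery; only the sources of Q and K/V change. The shapes and random projections below are illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(7, 16))     # 7 source tokens (e.g., the French input)
decoder_hid = rng.normal(size=(4, 16))     # 4 target tokens generated so far

Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
Q = decoder_hid @ Wq                       # queries come from the decoder
K, V = encoder_out @ Wk, encoder_out @ Wv  # keys/values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(16))   # (4, 7): each target step over all source tokens
context = weights @ V                      # (4, 16): context fed back into the decoder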


9. Sparse and Linear Attention (Emerging Trends)

Problem:

  • Standard self-attention has O(n^2) time & memory complexity.

Solutions:

  • Sparse Attention (Longformer, BigBird): each token attends only to a limited set of positions, e.g., a local window plus a few global tokens
  • Linear Attention (Performer, Linformer): approximates full attention to bring complexity down to O(n)

These make transformers feasible for longer sequences (e.g., 10,000+ tokens).
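As a toy illustration of the sparse idea (not the actual Longformer/BigBird implementation), a local-window mask limits each token to a fixed number of neighbours, so the number of scored pairs grows linearly with sequence length:

import numpy as np

def local_window_mask(seq_len, window=2):
    # True if |i - j| <= window: each query sees at most 2*window + 1 keys
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# used with the scaled_dot_product_attention sketch above:
# out, attn = scaled_dot_product_attention(Q, K, V, mask=local_window_mask(len(Q)))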


Real-World Applications

  • Neural machine translation: encoder-decoder models where cross-attention aligns target words with source words
  • Summarization and question answering: attention highlights the input spans that matter for the output
  • Large language models (GPT, BERT): deep stacks of multi-head self-attention
  • Long-document processing: sparse and linear attention make very long contexts tractable

Final Thoughts

The evolution from additive attention to multi-head, masked, and sparse attention has powered the deep learning revolution. Mastering these mechanisms opens doors to custom architectures, model efficiency, and cutting-edge AI capabilities.

Stay curious. The attention landscape is still evolving.


What attention mechanism have you implemented in your models? Which challenges did you face while deploying attention-heavy models? Comment below or message me on LinkedIn!


Read 𝐩𝐫𝐞𝐯𝐢𝐨𝐮𝐬 article on Language Modeling & Seq2Seq with Keras Functional API: From Unigrams to Transformer Precursors @ https://www.linkedin.com/pulse/language-modeling-seq2seq-keras-functional-api-from-unigrams-kharche-lkydf/?trackingId=fesNPT1xRhqdvmJkJTkPfQ%3D%3D


Stay tuned for 𝐧𝐞𝐱𝐭 article on Large Language Models (LLMs): GPT, BERT, T5, LLaMA, Mistral, Claude


🔗 Follow my LinkedIn page: https://lnkd.in/dsJM8V4m

📰 Subscribe to my newsletter: https://lnkd.in/dF_EtDtg


#AttentionMechanism #SelfAttention #MultiHeadAttention #PositionalEncoding #TransformerModels #NLP #DeepLearning #SparseAttention #ExplainableAI #FromDataToDecisions #AmitKharche

