Encoder-Decoder to Transfer Learning: An Analysis of the Research Papers Behind the Journey of the Transformer Architecture (LLMs)
This article traces the journey of the transformer architecture, in which four groundbreaking research papers brought a revolution to the NLP domain. The journey can broadly be divided into five important stages that led to the evolution of the transformer architecture and large language models.
Stage 1: Encoder-Decoder Architecture: "Sequence to Sequence Learning with Neural Networks"
Stage 2: Attention Mechanism: "Neural Machine Translation by Jointly Learning to Align and Translate"
Stage 3: Transformers: "Attention Is All You Need."
Stage 4: Transfer Learning: "Universal Language Model Fine-tuning for Text Classification"
Stage 5: Large language models
Stage 1 - Encoder-Decoder Architecture: "Sequence to Sequence Learning with Neural Networks" (arXiv:1409.3215v3 [cs.CL] 14 Dec 2014)
First, let's talk about Stage One, where we will discuss the Encoder-Decoder Architecture. In 2014, a team at Google was led by Ilya Sutskever, who later co-founded OpenAI and became one of the main people behind ChatGPT. He, along with his colleagues, wrote a paper titled "Sequence to Sequence Learning with Neural Networks," which became very popular.
In this paper, they argued that sequence-to-sequence learning was a problem that had not yet been solved properly, and they proposed a new architecture to solve it, which they called the Encoder-Decoder network.
In a sequence-to-sequence task, you have an input sequence, and you want to create an output sequence from it. Take machine translation, for example, where we want to translate from English to Hindi. Suppose our English sentence is "I love India"; translated into Hindi, it would be "मुझे भारत से प्यार है" (Mujhe Bharat se pyar hai). What they proposed was a very simple and elegant solution: the architecture would have two parts, an encoder and a decoder.
What the encoder does is process your input sequence word by word and compress all that information. This compressed information is then sent to the decoder. Now, what the decoder does is take this compressed information and produce the output step by step. For example, the first word it might produce is "मुझे" (mujhe), then "भारत" (Bharat), then "से" (se), then "प्यार" (pyar), and finally "है" (hai). Now, you might be curious about what exactly is inside this encoder and decoder. It's actually quite simple: the answer is LSTM (Long Short-Term Memory).
According to the paper, both the encoder and the decoder are built from LSTM cells. So basically, what's happening is that, on a word-by-word basis, you send each word of your sentence into the encoder's LSTM cell, one word per time step. For a sentence like "Transformers are great," you send "Transformers" at the first time step. At every step, the internal states of the LSTM, the cell state and the hidden state, keep updating, summarizing all the information you've sent so far.
When you send the last word, the final output is a compressed representation of the entire input sequence. This compressed representation is then sent to the decoder. The decoder also has an LSTM cell inside, to which you pass this compressed representation. The decoder then produces the output step by step; in the paper's own illustration, for example, the output is in French. This is how the encoder-decoder modules work.
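To make this concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. The dimensions, names, and sample inputs are illustrative assumptions, not the paper's exact configuration (the original used deep, 4-layer LSTMs and reversed the input sequence):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder (illustrative, not the paper's exact setup)."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the source word by word; (h, c) is the compressed
        # context that summarizes the whole input sequence.
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # The decoder starts from that compressed state and produces the
        # output sequence step by step (teacher forcing during training).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq(src_vocab=10_000, tgt_vocab=10_000)
src = torch.randint(0, 10_000, (1, 3))  # e.g. token ids for "I love India"
tgt = torch.randint(0, 10_000, (1, 5))  # e.g. ids for "मुझे भारत से प्यार है"
print(model(src, tgt).shape)            # torch.Size([1, 5, 10000])
```

Notice that the only bridge between the two LSTMs is the single (h, c) pair; this is exactly the bottleneck discussed next.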
Now, this encoder-decoder module worked well when you gave it short sentences; it would translate them correctly. But as soon as you started giving it longer input, say an entire paragraph to translate from English to Hindi, the output wouldn't make much sense. The translation would lose its meaning.
People tried to understand the reason behind this, and it was found that the main problem was that all the information was being compressed into a single context vector. If the sentence was very long, the network would effectively forget the earlier words, leading to a memory-loss problem. So the main weakness of this architecture is that everything depends on the context vector produced at the last time step, and because of this, translation quality degraded for longer sentences.
To solve this problem, a new mechanism called the Attention Mechanism was developed. And that is what we will discuss next.
Stage 2: Attention Mechanism: "Neural Machine Translation by Jointly Learning to Align and Translate" (arXiv:1409.0473)
Now, to understand what the Attention Mechanism is, let's once again discuss what the problem was with the encoder-decoder architecture. As mentioned earlier, the encoder-decoder architecture handles the machine translation problem in a very simple manner. The input sequence goes into the encoder, and the decoder produces the output sequence step by step. So, if you input an English sentence word by word, you get a Hindi sentence word by word as output.
The problem with this architecture is that the entire sequence's information is summarized into a single context vector, which is then used by the decoder to decode. The issue arises when you have a very long input sequence.
Experiments showed that once a sentence grows beyond roughly 30 words, the translations produced by the decoder stop making sense.
For example, let's read this sentence: "Sadly, he mistook the offer for an incredible opportunity that led to significant personal and professional growth." Now, if you input this sentence into the encoder, the first word would be "sadly," then "he," then "mistook," and so on, with "growth" at the end. Due to recency bias, the initial words are not captured well in the context vector. And if the first two words are not captured well, the meaning of the sentence changes. That is the problem with long sentences in the encoder-decoder architecture: the translation quality deteriorates.
In fact, a research paper plotted a graph to explain this issue, where the x-axis represents the number of words in the input sequence and the y-axis represents the translation quality measured by a metric called BLEU score. The graph showed that once you cross 30 words, the translation quality starts to degrade. This was the biggest problem with the encoder-decoder architecture.
To solve this problem, the Attention Mechanism was introduced. In 2014, Dzmitry Bahdanau, together with Kyunghyun Cho and the very famous researcher Yoshua Bengio, wrote a paper titled "Neural Machine Translation by Jointly Learning to Align and Translate." This paper introduced the attention mechanism for the first time.
If you read just the abstract of this paper, you will find it written that a potential issue with this encoder-decoder approach is that the neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the network to cope with long sentences, especially those longer than the sentences in the training corpus.
Now, let me try to explain how the attention mechanism works. First, I'll explain the core idea behind the attention mechanism, and then I'll give you a step-by-step introduction to how it works. An attention-based encoder-decoder module is essentially an encoder-decoder, but with a significant difference in the decoder block.
In a traditional encoder-decoder architecture, the context vector is produced after the final step and then passed to the decoder, which translates step by step.
But in an attention-based encoder-decoder, there isn't a single context vector. At any step of the decoder, it has access to all the internal states of the encoder.
This means that to predict the second word, the decoder can access any internal state of the encoder. In a traditional model, the decoder would only have the final state, the compressed summary of the entire sequence. In an attention-based model, the decoder has information about every intermediate state, so the context for each word to be generated includes the entire sentence's context.
However, the challenge is identifying which hidden state is most useful for generating the current word. This is where the attention mechanism comes into play: it dynamically identifies the most helpful hidden states for the current decoding step.
For instance, if the second word is being translated, the attention layer will figure out which encoder hidden state is most useful for that step. The information from the useful hidden state is then converted into a context vector and fed into the network to produce the output. This process is repeated for each word, with the attention layer dynamically identifying the relevant hidden state.
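Here is a rough sketch of this idea in Python. It follows the spirit of Bahdanau-style attention, but the scoring function is simplified to a plain dot product (the paper used a small learned alignment network), and all shapes are made-up assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder hidden states: one 8-dim vector per input word.
encoder_states = np.random.randn(6, 8)   # 6 input words
decoder_state = np.random.randn(8)       # decoder state at the current step

# 1. Score every encoder state against the current decoder state
#    (a plain dot product here; the paper used a small learned network).
scores = encoder_states @ decoder_state          # shape: (6,)

# 2. Softmax turns the scores into attention weights that sum to 1.
weights = softmax(scores)                        # shape: (6,)

# 3. The context vector is a weighted sum of ALL encoder states,
#    with the most similar states contributing the most.
context = weights @ encoder_states               # shape: (8,)
print(weights.round(2), context.shape)
```

Note that this context vector is recomputed at every decoder step; that repeated computation is exactly the cost discussed next.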
However, the attention mechanism had a drawback. Looking at this architecture, you can easily see that because we calculate a separate context vector for each time step in the decoder, the computational cost, and therefore the training time, increases. Let's understand this point a bit better.
So here, you have every hidden state available, and you want to extract the most useful information for your current word. The way you do this is by calculating the similarity between your output word and all the input words, and then combining the encoder states, weighted by those similarity scores, into the context vector (the most similar states contribute the most). So, for each word in your output, you have to calculate similarity scores with all the words in your input.
So if you have 'n' words in your input and 'm' words in your output (the two need not be equal), you essentially have to perform n × m cross-computations, and you have to repeat this for every training example. This results in quadratic complexity. That was the biggest problem with the attention mechanism: the computation grows quadratically, and training slows down.
Different types of attention mechanisms were explored, but eventually, it was realized that the main problem was not with the attention mechanism itself but with using LSTM (Long Short-Term Memory), which essentially works sequentially.
This means it can only receive one word at a time, then the next word, then the third word. This whole process happens one step at a time, and that was the main problem. So researchers started figuring out whether there was a way to remove this sequential nature from the encoder-decoder architecture and bring parallel processing into the picture, because if parallel processing could be introduced, training time would improve drastically.
Stage 3: Transformers: "Attention Is All You Need."
This is the point, at Stage 3, where Transformers emerged, which completely changed the NLP landscape. In 2017, a groundbreaking research paper from Google Brain was published titled "Attention Is All You Need."
This paper became so popular that countless videos have been made about it, and today, there is probably a dedicated lecture on it in every deep learning course.
So the most significant change in the Transformer architecture was that the researchers completely ditched LSTM. They said that the architecture does not need LSTM or, for that matter, any type of RNN cell. Attention is all you need; attention will handle everything. In fact, they introduced a new kind of attention called self-attention.
In the architecture, there is still an encoder-decoder structure: you can see an encoder block and a decoder block. But LSTM is no longer used. Instead, attention is used within the encoder module as well as within the decoder, along with some fully connected dense layers.
Now, the best part of the Transformer architecture: the older encoder-decoder modules, with or without attention, shared one big bottleneck, namely that they could only read one word at a time. The biggest feature of Transformers is that they can view all the words in the input simultaneously, and this parallel processing is the main reason why Transformer training is much faster.
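The heart of that parallelism is self-attention: every word attends to every other word in a single matrix operation. Below is a minimal sketch; the shapes and random inputs are illustrative, and real Transformers add multiple heads, masking, and per-layer learned projections:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # every word scored vs every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # updated representation per word

d = 8
X = np.random.randn(5, d)                       # 5 words, fed in together
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): all 5 words updated simultaneously, no recurrence
```

Because there is no recurrence, all five words are processed in one pass, which is what makes the computation so easy to parallelize on GPUs.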
The Transformer model was easy to parallelize, and it was possible to train it in a fraction of the time and cost compared to previous encoder-decoder modules. Both training time and training cost were reduced.
So now, we move to the next stage to see how, after the advent of Transformers, the entire history of NLP began to evolve rapidly. Transformers themselves were a revolution; they were very significant.
The only problem is that training Transformers from scratch is a difficult task. Some reasons are outlined below:
The first reason is hardware: training a Transformer from scratch requires a lot of compute, typically many GPUs or TPUs.
The second important factor is time. Even though Transformers train faster than attention-based encoder-decoders or plain encoder-decoders, they still take significant time.
But the biggest problem in training Transformers from scratch is data. You need a lot of it. Say you want to solve a sentiment analysis task with a Transformer; you need a good amount of data, around a hundred thousand to a million rows.
The problem is that not everyone has that much data. We are not Google, which has an insane amount of data. Sometimes we have very little data, maybe only 100 rows or just 1,000 rows. Training a Transformer from scratch with so little data does not yield good results. So this became a significant restricting factor.
Despite the powerful technology, not everyone could use Transformers because of these restrictions. To solve this problem, we move to our next stage, which is Transfer Learning.
Stage 4: Transfer Learning: "Universal Language Model Fine-tuning for Text Classification"
So, in 2018, another very famous research paper, known as ULMFiT, came from Jeremy Howard and Sebastian Ruder.
This paper was again a landmark because it proposed that the concept of Transfer Learning could be used in the NLP domain. They explained that, until then, Transfer Learning had worked well only in the computer vision domain and had not been successfully applied to NLP tasks. In this paper, they provided a framework called ULMFiT, showing that Transfer Learning could also be applied to NLP.
So, let's understand this whole thing in two steps. First, we'll understand what Transfer Learning is, and then we'll explain why Transfer Learning didn't fit well in the NLP domain.
Transfer Learning is a technique in which knowledge learned from one task is reused to boost performance on a related task. For example, in classification, the knowledge gained while learning to recognize cars could be applied when trying to recognize trucks. It's simple; even human beings do this. If you have learned to ride a bicycle, it becomes easier to learn to ride a motorbike, because you have already gained a lot of knowledge from a similar task that you can transfer to the related one. That's why we call it Transfer Learning.
In Transfer Learning, there are two steps.
The first step is known as pre-training. In pre-training, you take your model and train it on a large, universal dataset with many samples; the goal is to learn the general features of the data. Once pre-training is done, you move to step two, called fine-tuning. In fine-tuning, you take the same trained model, keep the weights of its earlier layers, re-initialize the later layers, and then train it on your specific dataset. The model then gets adapted to your task and makes good predictions.
A very good example of this whole process is ImageNet. ImageNet is a dataset that contains millions of images of various things. So, what you can do is take any CNN architecture, and train it on this dataset.
In the pre-training stage, your model will learn some general features from this dataset, such as edges and basic shapes, which make up all the things in the world. Now, in the second stage, you can take this model and fine-tune it according to your task. Let's say you want to classify cats versus dogs, and you have only 100 images. You will take this ImageNet-trained model and fine-tune it, which means you will train the later stage or layer weights on your dataset, which is 100 images of cats and dogs. Since it has already learned to detect edges and shapes, it will also learn how a cat and a dog look and then make accurate predictions on new data. This is the basic concept behind Transfer Learning.
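As a sketch of what that fine-tuning step might look like in code, here is one way to do it with torchvision (the cats-vs-dogs data loading and training loop are assumed, not shown; requires torchvision 0.13+):

```python
import torch.nn as nn
from torchvision import models

# 1. Pre-training is already done for us: load ImageNet weights.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# 2. Freeze the earlier layers: they already know edges and basic shapes.
for param in model.parameters():
    param.requires_grad = False

# 3. Replace the final layer with a fresh head for our 2-class task
#    (cats vs dogs); only this layer will be trained on our 100 images.
model.fc = nn.Linear(model.fc.in_features, 2)
```

Training then updates only the new head, which is why a tiny dataset is enough.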
Now let's discuss why it is said that Transfer Learning was not as useful in the NLP domain. There are two major reasons why Transfer Learning could not be applied to NLP until 2018. The first reason is task specificity. It was perceived that tasks like sentiment analysis, named entity recognition, part-of-speech tagging, machine translation, question answering, and text summarization are very different from each other, each with its own requirements.
The second problem was the lack of data. For machine translation, you need a lot of labelled data. If you want to translate from English to Hindi, you need a lot of English sentences in one column and a lot of corresponding Hindi sentences in the other column. But unfortunately, that much data was not available to train a model on a machine translation task. So, because of these two reasons, Transfer Learning could never establish itself in the NLP domain. But all this changed in 2018 when the ULMFiT research paper came out.
In the ULMFiT research paper, they did not use machine translation for pre-training. Instead, they used a different task called language modelling.
Language modelling is an NLP task where you teach an NLP or deep learning model to predict the next word. For example, "I live in India, and the capital of India is ____." A trained model should be able to predict that the word should be "New Delhi."
Now, I will explain why language modelling as a pre-training task was so successful. There are two reasons, and we will discuss both one by one. The first advantage is rich feature learning. Even though the task seems simple, just predicting the next word, a model learns a lot from it: not only the grammar needed to form a sentence correctly, but also the semantics of the sentence, its meaning, and sometimes even common sense. For example, take the sentence, "The hotel was exceptionally clean, yet the service was ____." A well-trained model should predict a word with negative sentiment like "bad" or "pathetic," because the word "yet" indicates that something negative is coming.
So, the advantage of language modelling is that when you teach a model to predict the next word on a large dataset, it gains a very basic understanding of the language. Once it has this basic understanding, that knowledge can be transferred to a lot of tasks.
The second reason is the huge availability of data. The problem was that if you were doing a machine translation task, you needed labeled data, which means you needed a lot of English sentences in one column and their corresponding Hindi translations in another column. So, in a way, this task was supervised, requiring labeled data. But if you are using language modeling, think about it; you don't need labeled data. You can take any PDF and generate a dataset from it without any labeling. This means you can use all the data available on the internet. In that sense, you can say this is an unsupervised task. This is why we call this kind of pre-training unsupervised pre-training. So, these were the two main reasons why language modeling was chosen as the primary task for pre-training in this research paper.
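To see why no labelling is needed, notice that raw text already contains its own labels: every word is the "label" for the words that precede it. A tiny illustrative sketch:

```python
text = "I live in India and the capital of India is New Delhi".split()

# Every prefix of the raw text becomes a training pair for free:
# the context is the input and the next word is the label.
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in pairs[:3]:
    print(context, "->", target)
# ['I'] -> live
# ['I', 'live'] -> in
# ['I', 'live', 'in'] -> India
```

Any book, article, or PDF can be turned into millions of such pairs automatically, which is what makes this pre-training "unsupervised."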
Now, I will explain the setup the researchers used in the paper for Transfer Learning. They first took a model, a variant of LSTM called AWD-LSTM, and trained it on a dataset of Wikipedia text, basically taking a lot of Wikipedia articles and conducting unsupervised pre-training on them using the language modelling task of predicting the next word. Once the model was trained, they replaced its output layer with a classification layer and fine-tuned it on the target task. They showed that the accuracy of the model increased by 20-30%. This was the first time it was demonstrated that Transfer Learning could be used in the NLP domain.
This marked the beginning of Transfer Learning in NLP. After that, things improved even further. In the same year, 2018, a very famous research paper from OpenAI introduced GPT (Generative Pre-trained Transformer). This was the first time the Transformer architecture was combined with Transfer Learning: they took the language modelling task, did pre-training on a large dataset, and then fine-tuned the model on specific tasks.
After that, in the same year, 2018, BERT was introduced: a bidirectional encoder Transformer that again relied on language modelling, but replaced plain next-word prediction with a masked language modelling task, where randomly hidden words are predicted using context from both directions.
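You can see BERT's masked language modelling objective directly with the Hugging Face transformers library (model download assumed; the predictions shown are only illustrative):

```python
from transformers import pipeline

# "fill-mask" runs BERT's masked-language-modelling head: the model
# predicts the [MASK] token using BOTH left and right context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The hotel was clean, yet the service was [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
# Expect negative words such as "poor" or "bad", thanks to the word "yet".
```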
Stage 5 - LLMs
In 2018, two new Transformer-based language models were released: BERT from Google and GPT from OpenAI. Both were essentially doing word-prediction tasks (next-word prediction for GPT, masked-word prediction for BERT). The difference was that they were built on the Transformer architecture and trained on huge datasets. Because of this, both models were really good at transfer learning; in fact, they were so good that you could fine-tune them to perform almost any kind of task.
After this, the field of NLP was totally transformed. In fact, OpenAI didn't stop. Successive versions of GPT came out, which created a massive impact.
In summary, NLP history started with traditional models like bag-of-words and TF-IDF, then evolved to word embeddings, followed by RNNs and LSTMs, and finally, Transformers and Transfer Learning. Transfer Learning made it possible for us to use models pre-trained on large datasets and fine-tune them on our specific tasks with little data, enabling better performance and faster training.
Jitender Malik. For more information, please see: https://www.linkedin.com/feed/update/urn:li:activity:7226444381954662401