1.1 What is an LLM?
● A neural network that can understand, generate, and respond to text like a human
● Trained on large amounts of text data
● "Large" in "Large Language Model" refers to a) model size (number of parameters) and b) the size of the training dataset
● Utilizes the transformer architecture with an attention mechanism
● Also referred to as generative AI (Gen AI) because of its generative capabilities
[Diagram: nested hierarchy of Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning ⊃ Large Language Models / Gen AI]
1.2 Applications of LLMs
● Machine Translation
● Text Summarization
● Sentiment Analysis
● Content Creation
● Code Generation
● Conversational Agents (e.g., chatbots)
1.3 Stages of building and using LLMs
● Data preparation
● Pretraining the LLM on large amounts of unlabelled text data
○ The pretrained model has text-completion and few-shot capabilities
● Preparing labelled datasets for specific tasks
● Training the LLM on these task-specific datasets to obtain a fine-tuned LLM
○ Classification
○ Summarization
○ Translation
○ Personal assistant
● Fine-tuning comes in two types
○ Instruction fine-tuning
○ Classification fine-tuning
1.4 Introducing the Transformer Architecture
● Original Transformer
○ Developed for machine translation, English to German
● Encoder
○ Processes the input text and produces an embedding representation
● Decoder
○ Uses the encoder's output to generate the translated text one word at a time
● Self-attention mechanism (see the sketch after this list)
● BERT (encoder-based model)
○ Masked language modeling
○ X uses BERT
● GPT (decoder-only model)
○ Autoregressive model
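The self-attention bullet above can be illustrated with a minimal PyTorch sketch of single-head scaled dot-product self-attention; the tensor sizes and random inputs are purely illustrative and are not the original Transformer's dimensions:

```python
import torch

# Toy example: 4 tokens, each with a 3-dimensional embedding (hypothetical values)
x = torch.randn(4, 3)

# Trainable projections for queries, keys, and values (dimensions chosen for illustration)
d_k = 3
W_q = torch.nn.Linear(3, d_k, bias=False)
W_k = torch.nn.Linear(3, d_k, bias=False)
W_v = torch.nn.Linear(3, d_k, bias=False)

queries, keys, values = W_q(x), W_k(x), W_v(x)

# Attention scores: every token attends to every other token
scores = queries @ keys.T / d_k ** 0.5
weights = torch.softmax(scores, dim=-1)   # each row sums to 1
context = weights @ values                # weighted sum of value vectors
print(context.shape)                      # torch.Size([4, 3])
```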
1.5 Utilizing large datasets
● Huge corpus with billions of words
● Common datasets
○ CommonCrawl
○ WebText2
○ Books1
○ Books2
○ Wikipedia
● The exact GPT training datasets were not publicly released
● Dolma (an open pretraining corpus from the Allen Institute for AI)
1.6 A closer look at the GPT architecture
● Decoder-only architecture
● Autoregressive models
● GPT-3 has 96 transformer layers and 175 billion parameters
● Emergent behavior
1.7 Building an LLM
● Stage 1
○ Building an LLM
■ Data Preparation and Sampling
■ Attention mechanism
■ LLM architecture
● Stage 2
○ Foundational model
■ Training loop
■ Model evaluation
■ Load pretrained weights
● Stage 3
○ Fine tuning
■ Classifier
■ Personal assistant
Understanding Word Embeddings
● Embedding: Converting data into a vector format.
● Types of embeddings
○ Text, Audio, Video
● Types of text embeddings
○ Word, sentence, and paragraph embeddings (paragraph-level embeddings are common in RAG)
○ Whole documents
● Word2Vec (see the sketch below)
○ Words that appear in similar contexts get similar embeddings
● Models for word embeddings
○ Static models (Word2Vec, GloVe, FastText)
○ Contextual models (BERT, GPT, etc.)
● LLMs produce their own embeddings, which are updated during training.
● GPT-2: 768 dimensions; GPT-3: 12,288 dimensions
[Diagram: embeddings map discrete, nonnumeric objects into a continuous, machine-readable vector space]
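As a hedged illustration of a static embedding model, here is a minimal Word2Vec sketch using the gensim library; the toy corpus and hyperparameter values are made up for illustration:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of pre-tokenized sentences (illustrative only)
sentences = [
    ["i", "love", "reading", "books"],
    ["i", "love", "writing", "code"],
    ["books", "and", "code", "are", "fun"],
]

# Train a small Word2Vec model; words that appear in similar contexts
# end up with similar vectors
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["books"].shape)          # (50,): the static vector for "books"
print(model.wv.most_similar("books"))   # nearest words in the embedding space
```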
Tokenizing Text
● First step in creating embeddings
● Tokens
○ Individual words or special characters, including punctuation
● LLM Training
○ The-verdict, a short story by Edith Wharton
○ Goal: tokenize the 20,479-character short story (see the sketch below)
Example: input text "I love reading books." → tokenized text: I | love | reading | books | .
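A minimal tokenizer sketch in Python; the regex pattern (splitting on punctuation and whitespace) is one common approach, not necessarily the exact one used on The-verdict:

```python
import re

text = "I love reading books."

# Split on punctuation and whitespace, then drop empty strings
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

print(tokens)  # ['I', 'love', 'reading', 'books', '.']
```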
Converting Tokens into Token IDs
● Intermediate step before converting tokens into embeddings
● Vocabulary
○ Defines how we map each unique word and special character to a unique integer
○ The vocabulary size for The-verdict is 1,130 (see the sketch below)
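A minimal sketch of building a vocabulary and mapping tokens to token IDs; the toy token list is illustrative (the real vocabulary for The-verdict has 1,130 entries):

```python
# Assume `tokens` is the list produced by the tokenizer in the previous step
tokens = ['I', 'love', 'reading', 'books', '.']

# Vocabulary: each unique token gets a unique integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

print(vocab)      # {'.': 0, 'I': 1, 'books': 2, 'love': 3, 'reading': 4}
print(token_ids)  # [1, 3, 4, 2, 0]
```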
Adding Special Context Tokens
● Need for special tokens
○ To handle unknown words <|unk|>
○ To identify start and end of the text
○ To pad the shorter texts to match the length of longer texts
● Popular tokens used
○ [BOS] (beginning of sequence)
○ [EOS] (end of sequence)
○ [PAD] (padding)
● The tokenizer for GPT models uses only <|endoftext|>
● GPT models handle unknown words using BPE instead of an <|unk|> token (see the sketch below)
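A minimal sketch of extending a vocabulary with special tokens and falling back to <|unk|> for unseen words; the token IDs continue the toy vocabulary above, and GPT's own tokenizer avoids <|unk|> by using BPE:

```python
# Toy vocabulary from the previous sketch
vocab = {'.': 0, 'I': 1, 'books': 2, 'love': 3, 'reading': 4}

# Append special tokens to the end of the vocabulary
for special in ("<|endoftext|>", "<|unk|>"):
    vocab[special] = len(vocab)

def encode(tokens, vocab):
    # Unknown words map to the <|unk|> token ID
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

print(encode(["I", "love", "pizza", "."], vocab))  # [1, 3, 6, 0]
```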
Byte Pair Encoding
● A popular tokenization technique used for GPT-2, GPT-3, RoBERTa, BART, and DeBERTa
● Training phase
○ BPE learns a vocabulary of subwords by iteratively merging the most frequent character pairs.
● Tokenization Phase
○ Split text into characters.
○ Iteratively match the longest possible subwords from the vocabulary.
○ Replace matched subwords with their corresponding token IDs.
● tiktoken
○ An open-source Python library that implements BPE (see the sketch below)
○ The BPE tokenizer used for GPT-2 and GPT-3 has a vocabulary size of 50,257
● Handling unknown words
○ Unknown words are broken down into subwords or individual characters, which ensures the LLM can process any text.
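A minimal tiktoken sketch for the GPT-2 BPE tokenizer; the sample text is illustrative:

```python
import tiktoken

# Load the BPE tokenizer used for GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(ids)                    # list of token IDs
print(tokenizer.decode(ids))  # round-trips back to the original text
print(tokenizer.n_vocab)      # 50257
```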
Data Sampling With a Sliding Window
● The LLM's pretraining task is to predict the next word that follows the input block
● Input-target pairs need to be created
● To perform data sampling
○ We make use of PyTorch’s built-in Dataset and DataLoader classes.
○ Hyperparameters for DataLoader
■ batch_size, max_length, stride, num_workers (see the sketch below)
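A minimal sketch of sliding-window input-target sampling with PyTorch's Dataset and DataLoader; the class name GPTDataset and the hyperparameter values are illustrative assumptions:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    """Slides a window of max_length over the token IDs; the target is the input shifted by one."""
    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy token IDs; in practice these come from the BPE tokenizer
token_ids = list(range(100))
loader = DataLoader(GPTDataset(token_ids, max_length=4, stride=4),
                    batch_size=8, shuffle=True, num_workers=0)

x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```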
Creating Token Embeddings
● Last step in preparing input text for LLM training
● Token IDs are converted to embeddings
○ These embeddings are initialized with random values
○ This serves as a starting point for the LLM's learning process
● Using torch.nn.Embedding, create an embedding layer (see the sketch below)
○ The embedding layer performs a lookup operation that retrieves rows from its weight matrix
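A minimal sketch of the token embedding lookup with torch.nn.Embedding; the embedding dimension is an illustrative choice:

```python
import torch

vocab_size = 50257   # GPT-2 BPE vocabulary size
emb_dim = 256        # illustrative embedding dimension (GPT-2 small uses 768)

# Weight matrix of shape (vocab_size, emb_dim), initialized with random values
token_embedding = torch.nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([[1, 3, 4, 2]])   # batch of 1 sequence with 4 token IDs
embeddings = token_embedding(token_ids)    # lookup: retrieves rows of the weight matrix
print(embeddings.shape)                    # torch.Size([1, 4, 256])
```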
Encoding Word Positions
● Need
○ The self-attention mechanism has no notion of the position or order of tokens within a sequence
○ The embedding layer returns the same embedding for a given token ID regardless of its position
● So we inject positional embeddings to add positional information
● There are two types
○ Absolute Positional Embeddings
○ Relative Positional Embeddings
● OpenAI’s GPT models use absolute positional embeddings
● These embeddings are optimized during the training process
● The positional embedding matrix has shape context_length x embedding_dim; it is added to the token embeddings and broadcast across the batch (see the sketch below)
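A minimal sketch of adding learned absolute positional embeddings to the token embeddings; the dimensions are illustrative:

```python
import torch

context_length, emb_dim, vocab_size = 4, 256, 50257

token_embedding = torch.nn.Embedding(vocab_size, emb_dim)
pos_embedding = torch.nn.Embedding(context_length, emb_dim)   # one row per position

token_ids = torch.tensor([[1, 3, 4, 2]])                  # shape: (batch_size, context_length)
tok_embeds = token_embedding(token_ids)                   # (1, 4, 256)
pos_embeds = pos_embedding(torch.arange(context_length))  # (4, 256), same for every batch

input_embeddings = tok_embeds + pos_embeds                # broadcast over the batch dimension
print(input_embeddings.shape)                             # torch.Size([1, 4, 256])
```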