Build a Large Language Model
Understanding LLMs
1.1 What is an LLM?
● A neural network that can understand, generate, and respond to text in a human-like way
● Trained on massive amounts of text data
● “Large” in “Large Language Model” refers to a) model size (number of parameters) and b) dataset size
● Utilizes the transformer architecture with an attention mechanism
● Often called generative AI (Gen AI) because of their generative capabilities
(Diagram: Gen AI and Large Language Models shown as a subset of Deep Learning, within Machine Learning, within Artificial Intelligence)
1.2 Applications of LLMs
● Machine Translation
● Text Summarization
● Sentiment Analysis
● Content Creation
● Code generation
● Conversational agents like chatbots
1.3 Stages of building and using LLMs
● Data preparation
● Pretraining the LLM on a large corpus of unlabelled text
○ The pretrained model has text-completion and few-shot capabilities
● Preparing labelled datasets for specific tasks
● Train the LLM on these task-specific datasets to get a fine-tuned LLM
○ Classification
○ Summarization
○ Translation
○ Personal assistant
● There are two types of fine-tuning
○ Instruction fine-tuning
○ Classification fine-tuning
1.4 Introducing the Transformer Architecture
● Original Transformer
○ Developed for machine translation (English to German)
● Encoder
○ Processes the input text and produces an embedding representation
● Decoder
○ Uses the encoder’s output to generate the translated text one word at a time
● Self-attention mechanism
● BERT (encoder-based model)
○ Masked language modeling
○ X (formerly Twitter) uses BERT
● GPT (decoder-only model)
○ Autoregressive model
1.5 Utilizing large datasets
● Huge corpus with billions of words
● Common datasets
○ CommonCrawl
○ WebText2
○ Books1
○ Books2
○ Wikipedia
● The datasets used to train GPT models were not publicly released
● Dolma is an example of an openly released pretraining corpus
1.6 A closer look at the GPT architecture
● Decoder-only architecture
● Autoregressive model
● GPT-3 has 96 transformer layers and 175 billion parameters
● Emergent behavior
1.7 Building an LLM
● Stage 1
○ Building an LLM
■ Data Preparation and Sampling
■ Attention mechanism
■ LLM architecture
● Stage 2
○ Foundational model
■ Training loop
■ Model evaluation
■ Load pretrained weights
● Stage 3
○ Fine tuning
■ Classifier
■ Personal assistant
Build a Large Language Model
Working With Text Data
Understanding Word Embeddings
● Embedding: Converting data into a vector format.
● Types of embeddings
○ Text, Audio, Video
● Types of text embeddings
○ Word, Sentence, Paragraphs (RAG)
○ Whole documents
● Word2Vec
○ Words that appear in similar contexts get similar embeddings
● Models for word embeddings
○ Static Models (Word2Vec, GloVe, FastText)
○ Contextual Models ( BERT, GPT, etc)
● LLMs learn their own embeddings, which are updated during training
● GPT-2 (small) uses 768-dimensional embeddings; GPT-3 uses 12,288-dimensional embeddings
(Diagram: embeddings map discrete, nonnumeric objects into a continuous, machine-readable vector space)
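To make “similar context, similar embedding” concrete, here is a minimal Python sketch that compares word vectors with cosine similarity. The 3-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions (e.g. 768 for GPT-2 small).

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: close to 1.0 means "similar direction"
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hypothetical 3-dimensional word embeddings (illustration only)
    cat = np.array([0.8, 0.1, 0.3])
    dog = np.array([0.7, 0.2, 0.4])
    car = np.array([0.1, 0.9, 0.2])

    print(cosine_similarity(cat, dog))  # high: "cat" and "dog" occur in similar contexts
    print(cosine_similarity(cat, car))  # lower: different contexts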
Tokenizing Text
● First step in creating embeddings
● Tokens
○ Individual words or special characters, including punctuation
● LLM Training
○ The Verdict, a short story by Edith Wharton
○ Goal: tokenize the 20,479-character short story (see the tokenizer sketch below)
Example: the input text "I love reading books." becomes the tokenized text ["I", "love", "reading", "books", "."]
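A minimal sketch of a whitespace-and-punctuation tokenizer of the kind described above; the regular expression is an illustrative assumption, not necessarily the exact one used on the slides.

    import re

    def simple_tokenize(text):
        # Split on whitespace and common punctuation, keeping punctuation as separate tokens
        parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty strings and pure-whitespace entries
        return [p.strip() for p in parts if p.strip()]

    print(simple_tokenize("I love reading books."))
    # ['I', 'love', 'reading', 'books', '.']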
Converting Tokens into Token IDs
● Intermediate step before converting tokens into embeddings
● Vocabulary
○ Defines how we map each unique word and special character to a unique integer
○ The vocabulary size for The Verdict is 1,130 (see the sketch below)
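A minimal sketch of building the vocabulary and encoding tokens as IDs; the filename is illustrative, and simple_tokenize is the helper sketched earlier.

    # Assumes the story has been saved locally; the filename is illustrative
    with open("the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()

    tokens = simple_tokenize(raw_text)

    # Vocabulary: map each unique token to a unique integer ID
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
    inverse_vocab = {idx: tok for tok, idx in vocab.items()}

    token_ids = [vocab[tok] for tok in tokens]                 # encode: tokens -> IDs
    decoded = " ".join(inverse_vocab[i] for i in token_ids)    # decode: IDs -> tokens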
Adding Special Context Tokens
● Need for special tokens
○ To handle unknown words <|unk|>
○ To identify start and end of the text
○ To pad the shorter texts to match the length of longer texts
● Popular tokens used
○ [BOS] (beginning of sequence)
○ [EOS](end of sequence)
○ [PAD](padding)
● The tokenizer for GPT models uses only <|endoftext|>
● GPT models handle unknown words using byte pair encoding (BPE) instead of an <|unk|> token
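A minimal sketch of extending the vocabulary with special tokens and falling back to <|unk|> for out-of-vocabulary words; the helper names follow the earlier sketches and are otherwise illustrative.

    # Extend the vocabulary from the previous sketch with special tokens
    all_tokens = sorted(set(tokens))
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])
    vocab = {tok: idx for idx, tok in enumerate(all_tokens)}

    def encode_with_unk(text):
        # Unknown words map to the <|unk|> token ID instead of raising a KeyError
        return [vocab.get(tok, vocab["<|unk|>"]) for tok in simple_tokenize(text)]

    print(encode_with_unk("Hello, do you like tea?"))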
Byte Pair Encoding
● A popular tokenization technique used to train GPT-2, GPT-3, RoBERTa, BART, and DeBERTa
● Training phase
○ BPE learns a vocabulary of subwords by iteratively merging the most frequent character pairs.
● Tokenization Phase
○ Split text into characters.
○ Iteratively match the longest possible subwords from the vocabulary.
○ Replace matched subwords with their corresponding token IDs.
● tiktoken
○ An open-source Python library that implements BPE (see the usage sketch below)
○ The BPE tokenizer used for GPT-2 and GPT-3 has a vocabulary size of 50,257
● Handling unknown words
○ Unknown words are broken down into subword units or individual characters, which ensures the LLM can process any text
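A minimal sketch of BPE tokenization with the tiktoken library; get_encoding, encode, and decode are part of tiktoken’s public API, and the sample strings are arbitrary.

    import tiktoken  # pip install tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2/GPT-3 BPE vocabulary of 50,257 tokens

    text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    print(ids)
    print(tokenizer.decode(ids))

    # Unknown words are split into subword units rather than mapped to an <unk> token
    print(tokenizer.encode("Akwirw ier"))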
Data Sampling With a Sliding Window
● The LLM’s pretraining task is to predict the next word that follows the input block
● Input-target pairs need to be created
● To perform data sampling
○ We use PyTorch’s built-in Dataset and DataLoader classes (see the sketch below)
○ Hyperparameters for the DataLoader
■ batch_size, max_length, stride, num_workers
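A minimal sketch of sliding-window data sampling built on PyTorch’s Dataset and DataLoader; the class name and hyperparameter values are illustrative assumptions.

    import torch
    from torch.utils.data import Dataset, DataLoader

    class SlidingWindowDataset(Dataset):
        def __init__(self, token_ids, max_length, stride):
            self.inputs, self.targets = [], []
            # Slide a window over the token IDs; the target is the input shifted right by one token
            for i in range(0, len(token_ids) - max_length, stride):
                self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
                self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

        def __len__(self):
            return len(self.inputs)

        def __getitem__(self, idx):
            return self.inputs[idx], self.targets[idx]

    # Assumes token_ids is the list of IDs produced by the tokenizer above
    dataset = SlidingWindowDataset(token_ids, max_length=4, stride=4)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)
    inputs, targets = next(iter(dataloader))  # shapes: (8, 4) and (8, 4)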
Creating Token Embeddings
● Last step in preparing input text for LLM training
● Token IDs are converted to embeddings
○ These embeddings are initialized with random values
○ This serves as the starting point for the LLM’s learning process
● Use torch.nn.Embedding to create an embedding layer (see the sketch below)
○ The embedding layer performs a lookup operation that retrieves rows from its weight matrix
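A minimal sketch of the embedding lookup described above; the vocabulary size of 50,257 matches the GPT-2 BPE tokenizer, and the 256-dimensional output is an arbitrary illustrative choice (GPT-2 small uses 768).

    import torch

    vocab_size = 50257   # GPT-2 BPE vocabulary size
    output_dim = 256     # illustrative embedding dimension

    torch.manual_seed(123)
    token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)  # weights start random

    # The layer is a lookup table: each token ID selects one row of the weight matrix
    token_ids = torch.tensor([[40, 1107, 588, 262]])     # batch of 1 sequence with 4 token IDs
    token_embeddings = token_embedding_layer(token_ids)
    print(token_embeddings.shape)                         # torch.Size([1, 4, 256])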
Encoding Word Positions
● Need
○ The self-attention mechanism has no notion of position or order for the tokens within a sequence
○ The embedding layer returns the same embedding for a given token ID regardless of its position
● So we inject positional embeddings to add positional information
● There are two types
○ Absolute positional embeddings
○ Relative positional embeddings
● OpenAI’s GPT models use absolute positional embeddings
● These embeddings are optimized during the training process
● The positional embedding matrix has dimensions context_length x embedding_dim, one learnable vector per position (see the sketch below)
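A minimal sketch of adding absolute positional embeddings to the token embeddings; the batch size, context length, and embedding dimension are illustrative values.

    import torch

    batch_size, context_length, output_dim = 8, 4, 256

    # Token embeddings for a batch (random IDs here, just to show the shapes)
    token_embedding_layer = torch.nn.Embedding(50257, output_dim)
    token_ids = torch.randint(0, 50257, (batch_size, context_length))
    token_embeddings = token_embedding_layer(token_ids)                 # (8, 4, 256)

    # One learnable embedding per position in the context window
    pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
    pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # (4, 256)

    # Broadcasting adds the same positional vectors to every sequence in the batch
    input_embeddings = token_embeddings + pos_embeddings                # (8, 4, 256)
    print(input_embeddings.shape)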
