Running LLMs in Kubernetes
Volodymyr Tsap
CTO @ SHALB
What is an LLM?
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language
generation and understanding.
LLMs acquire these abilities by learning statistical relationships from text documents during a computationally
intensive self-supervised and semi-supervised training process.
LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based
architecture.
— Wikipedia
What are Transformers?
• Transformers are a type of deep learning model that has revolutionized the way natural language processing tasks are approached.
• Transformers use an architecture built on self-attention mechanisms that weigh the significance of different words in a sentence. This lets the model capture the context of each word more effectively than previous models, leading to better understanding and generation of text (a minimal sketch of the attention computation follows).
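To make self-attention concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy; the shapes and names are illustrative, not from the talk:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query row attends over all key/value rows; weights sum to 1."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```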
Building an LLM. Data Collection and Preparation.
• Collect a large and diverse dataset from sources such as books, websites, and other texts.
• Clean and preprocess the data to remove irrelevant content, normalize the text (e.g., lowercasing, removing special characters), and ensure data quality (a sketch of such a cleaning step follows).
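A minimal sketch of the kind of cleaning step meant here, using only the Python standard library; the specific rules are examples, not the exact pipeline from the talk:

```python
import re

def clean_text(text: str) -> str:
    text = text.lower()                              # normalize case
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags (irrelevant content)
    text = re.sub(r"[^a-z0-9\s.,!?'-]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_text("Hello, <b>WORLD</b>!!!  Visit §example§ now."))
# -> "hello, world !!! visit example now."
```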
Building an LLM. Tokenization and Vocabulary Building.
• Tokenize the text into smaller units (tokens) such as words, subwords, or characters. This step may involve choosing a specific tokenization algorithm (e.g., BPE, WordPiece).
• Create a vocabulary of unique tokens and possibly generate embeddings for them, either by pre-training the embeddings or by reusing embeddings from an existing model (a BPE training sketch follows).
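As a sketch, training a BPE tokenizer with the Hugging Face tokenizers library could look like this; the corpus file name and vocabulary size are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-Pair Encoding: start from characters, merge frequent pairs into subwords
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                                    # final vocabulary size
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)    # hypothetical corpus file

print(tokenizer.encode("Running LLMs in Kubernetes").tokens)
```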
Building an LLM. Model Architecture Design.
• Choose a transformer architecture (e.g., GPT, BERT) that suits the goals of your LLM. This involves deciding on the number of layers, attention heads, and other hyperparameters.
• Implement or adapt an existing transformer model using deep learning libraries such as TensorFlow or PyTorch (see the sketch below).
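For instance, with the transformers library you can instantiate a GPT-style model from scratch by setting those hyperparameters in a config; the numbers below are illustrative (roughly GPT-2 small):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,   # must match the tokenizer's vocabulary
    n_layer=12,          # number of transformer blocks
    n_head=12,           # attention heads per block
    n_embd=768,          # hidden (embedding) size
    n_positions=1024,    # maximum context length
)
model = GPT2LMHeadModel(config)  # randomly initialized, ready for pre-training
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```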
Building an LLM. Training.
• Split the data into training, validation, and test sets.
• Pre-train the model on the collected data by updating its weights over multiple epochs. This step is computationally intensive and can take hours to weeks depending on model size and hardware.
• Use techniques such as gradient clipping, learning-rate scheduling, and regularization to improve training efficiency and model performance (all three appear in the sketch below).
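A minimal PyTorch training loop showing the three techniques named above; the optimizer settings and data loader are illustrative assumptions:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)     # regularization
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # LR scheduling

model.train()
for batch in train_loader:                       # assumed DataLoader of batches with input_ids and labels
    optimizer.zero_grad()
    loss = model(**batch).loss                   # causal-LM loss for the batch
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
```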
Building an LLM. Fine-Tuning (Optional).
• Fine-tune the pre-trained model on a smaller, task-specific dataset if the LLM will be used for specific applications (e.g., question answering, sentiment analysis).
• Adjust hyperparameters and training settings to optimize performance on the target task (a Trainer-based sketch follows).
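With the transformers Trainer API, fine-tuning could be sketched like this; the datasets and hyperparameters are placeholders:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=3,                # small dataset, so only a few epochs
    learning_rate=2e-5,                # much lower LR than in pre-training
    per_device_train_batch_size=8,
)
trainer = Trainer(
    model=model,                       # the pre-trained LLM from the previous step
    args=args,
    train_dataset=task_dataset,        # hypothetical task-specific dataset
    eval_dataset=task_eval_dataset,    # held out for tuning hyperparameters
)
trainer.train()
```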
Building an LLM. Evaluation and Testing.
• Evaluate the model on a test set using appropriate metrics (e.g., accuracy, F1 score, perplexity; a perplexity sketch follows).
• Perform error analysis and adjust the training process as necessary to improve model quality.
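Perplexity, for example, is the exponentiated average cross-entropy loss on held-out text; a rough sketch (ignoring per-batch token-count weighting):

```python
import math
import torch

model.eval()
losses = []
with torch.no_grad():
    for batch in test_loader:          # held-out test set
        losses.append(model(**batch).loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"perplexity: {perplexity:.2f}")  # lower is better
```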
Building an LLM. Saving and Deployment.
• Save the trained model weights and configuration to files.
• Deploy the model for inference, which can involve setting up serving infrastructure capable of handling real-time requests or batch processing (save/reload sketch below).
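With transformers, saving and reloading for inference is two calls each; the directory name is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Save weights, config, and tokenizer files to a directory
model.save_pretrained("my-llm")
tokenizer.save_pretrained("my-llm")

# Later, in the serving process, reload them for inference
model = AutoModelForCausalLM.from_pretrained("my-llm")
tokenizer = AutoTokenizer.from_pretrained("my-llm")
```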
TL;DR: Watch Andrej Karpathy's explanation.
Hugging Face - the GitHub for LLMs
LLM Files
How to run? Using Google Colab with a T4 GPU.
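On a free-tier T4 in Colab, a quick smoke test with the transformers pipeline might look like this; the model choice is just an example (larger models need more than the T4's 16 GB):

```python
from transformers import pipeline

# device=0 places the model on the Colab T4 GPU
generate = pipeline("text-generation", model="gpt2", device=0)
print(generate("Kubernetes is", max_new_tokens=30)[0]["generated_text"])
```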
How to run? Using a laptop with llama.cpp and quantization.
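On a laptop the usual route is llama.cpp with a quantized GGUF model; a sketch using the llama-cpp-python bindings, where the model file name is a placeholder and Q4_K_M is one common ~4-bit quantization:

```python
from llama_cpp import Llama

# A 7B model quantized to ~4 bits fits in a few GB of RAM
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=2048,        # context window
    n_gpu_layers=0,    # CPU-only; raise this if a GPU is available
)
out = llm("Q: Why run LLMs in Kubernetes? A:", max_tokens=64)
print(out["choices"][0]["text"])
```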
Using Managed Cloud Services.
• Amazon SageMaker
• Google Cloud AI Platform & Vertex AI
• Microsoft Azure Machine Learning
• NVIDIA AI Enterprise
• Hugging Face Endpoints
• AnyScale Endpoints
Why run them in Kubernetes?
1. We already know it :)
2. Scalability: resource efficiency, HPA, auto-scaling, API limits, etc.
3. Price: managed services add a 20-40% overhead; reserved instances cut costs further.
4. GPU sharing.
5. ML ecosystem: pipelines and artifacts (Kubeflow, Ray Framework).
6. No vendor lock-in; portable.
LLM Serving Frameworks
Options to run LLMs on K8s.
1. KServe from Kubeflow.
2. Ray Serve from the Ray Framework.
3. Flux AI controller.
4. Your own Kubernetes wrapper on top of these frameworks.
We chose TGI (Hugging Face's Text Generation Inference) and made it Kubernetes-ready.
We have Docker. Let's adapt it to Kubernetes.
Demo Time!
Let's bootstrap the infrastructure from a cluster.dev template
Then add model configuration
Apply and check that the model is running
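One quick check is to port-forward the service and hit TGI's /generate endpoint. The request/response shape is TGI's documented API; the service name and local port are assumptions about the demo setup:

```python
# Assumes: kubectl port-forward svc/tgi 8080:80
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Kubernetes?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```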
Changing models and infrastructure
Enabling HF chat-ui
Deploy Monitoring and Metrics with DCGM Exporter
Thank you! Questions?
