Running LLMs in Kubernetes
Volodymyr Tsap
CTO @ SHALB
What is an LLM?
A large language model (LLM) is a language model notable for its ability to achieve general-purpose language
generation and understanding.
LLMs acquire these abilities by learning statistical relationships from text documents during a computationally
intensive self-supervised and semi-supervised training process.
LLMs are artificial neural networks, the largest and most capable of which are built with a transformer-based
architecture.
— Wikipedia
What are Transformers?
• Transformers are a type of deep learning model that has revolutionized the way natural language processing tasks are approached.
• Transformers use an architecture built on self-attention mechanisms that weigh the significance of different words in a sentence. This lets the model capture the context of each word more effectively than previous models, leading to better understanding and generation of text (a minimal sketch of the attention computation follows).
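To make self-attention concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy; the shapes and names are illustrative, not from the talk:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query row attends over all key/value rows; weights sum to 1."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # weighted mix of value vectors

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```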
Building an LLM. Data Collection and Preparation.
• Collect a large and diverse dataset from sources such as books, websites, and other texts.
• Clean and preprocess the data to remove irrelevant content, normalize the text (e.g., lowercasing, removing special characters), and ensure data quality (a sketch of such a cleaning step follows).
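A minimal sketch of the kind of cleaning step meant here, using only the Python standard library; the specific rules are examples, not the exact pipeline from the talk:

```python
import re

def clean_text(text: str) -> str:
    text = text.lower()                              # normalize case
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags (irrelevant content)
    text = re.sub(r"[^a-z0-9\s.,!?'-]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(clean_text("Hello, <b>WORLD</b>!!!  Visit §example§ now."))
# -> "hello, world !!! visit example now."
```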
Building an LLM. Tokenization and Vocabulary Building.
• Tokenize the text into smaller units (tokens) such as words, subwords, or characters. This step may involve choosing a specific tokenization algorithm (e.g., BPE, WordPiece).
• Create a vocabulary of unique tokens and possibly generate embeddings for them, either by pre-training the embeddings or by reusing embeddings from an existing model (a BPE training sketch follows).
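As a sketch, training a BPE tokenizer with the Hugging Face tokenizers library could look like this; the corpus file name and vocabulary size are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-Pair Encoding: start from characters, merge frequent pairs into subwords
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                                    # final vocabulary size
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)    # hypothetical corpus file

print(tokenizer.encode("Running LLMs in Kubernetes").tokens)
```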
Building an LLM. Model Architecture Design.
• Choose a transformer architecture (e.g., GPT, BERT) that suits the goals of your LLM. This involves deciding on the number of layers, attention heads, and other hyperparameters.
• Implement or adapt an existing transformer model using deep learning libraries such as TensorFlow or PyTorch (see the sketch below).
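For instance, with the transformers library you can instantiate a GPT-style model from scratch by setting those hyperparameters in a config; the numbers below are illustrative (roughly GPT-2 small):

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,   # must match the tokenizer's vocabulary
    n_layer=12,          # number of transformer blocks
    n_head=12,           # attention heads per block
    n_embd=768,          # hidden (embedding) size
    n_positions=1024,    # maximum context length
)
model = GPT2LMHeadModel(config)  # randomly initialized, ready for pre-training
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```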
Building an LLM. Training.
• Split the data into training, validation, and test sets.
• Pre-train the model on the collected data by updating its weights over multiple epochs. This step is computationally intensive and can take hours to weeks depending on model size and hardware.
• Use techniques such as gradient clipping, learning-rate scheduling, and regularization to improve training efficiency and model performance (all three appear in the sketch below).
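A minimal PyTorch training loop showing the three techniques named above; the optimizer settings and data loader are illustrative assumptions:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)     # regularization
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # LR scheduling

model.train()
for batch in train_loader:                       # assumed DataLoader of batches with input_ids and labels
    optimizer.zero_grad()
    loss = model(**batch).loss                   # causal-LM loss for the batch
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
```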
Building an LLM. Fine-Tuning (Optional).
• Fine-tune the pre-trained model on a smaller, task-specific dataset if the LLM will be used for specific applications (e.g., question answering, sentiment analysis).
• Adjust hyperparameters and training settings to optimize performance on the target task (a Trainer-based sketch follows).
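With the transformers Trainer API, fine-tuning could be sketched like this; the datasets and hyperparameters are placeholders:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=3,                # small dataset, so only a few epochs
    learning_rate=2e-5,                # much lower LR than in pre-training
    per_device_train_batch_size=8,
)
trainer = Trainer(
    model=model,                       # the pre-trained LLM from the previous step
    args=args,
    train_dataset=task_dataset,        # hypothetical task-specific dataset
    eval_dataset=task_eval_dataset,    # held out for tuning hyperparameters
)
trainer.train()
```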
Building an LLM. Evaluation and Testing.
• Evaluate the model on a test set using appropriate metrics (e.g., accuracy, F1 score, perplexity; a perplexity sketch follows).
• Perform error analysis and adjust the training process as necessary to improve model quality.
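Perplexity, for example, is the exponentiated average cross-entropy loss on held-out text; a rough sketch (ignoring per-batch token-count weighting):

```python
import math
import torch

model.eval()
losses = []
with torch.no_grad():
    for batch in test_loader:          # held-out test set
        losses.append(model(**batch).loss.item())

perplexity = math.exp(sum(losses) / len(losses))
print(f"perplexity: {perplexity:.2f}")  # lower is better
```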
Building an LLM. Saving and Deployment.
• Save the trained model weights and configuration to files.
• Deploy the model for inference, which can involve setting up serving infrastructure capable of handling real-time requests or batch processing (save/reload sketch below).
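With transformers, saving and reloading for inference is two calls each; the directory name is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Save weights, config, and tokenizer files to a directory
model.save_pretrained("my-llm")
tokenizer.save_pretrained("my-llm")

# Later, in the serving process, reload them for inference
model = AutoModelForCausalLM.from_pretrained("my-llm")
tokenizer = AutoTokenizer.from_pretrained("my-llm")
```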
TL;DR: Watch Andrej Karpathy's explanation.
Hugging Face - the GitHub for LLMs
LLM Files
How to run? Using Google Colab with a T4 GPU.
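On a free-tier T4 in Colab, a quick smoke test with the transformers pipeline might look like this; the model choice is just an example (larger models need more than the T4's 16 GB):

```python
from transformers import pipeline

# device=0 places the model on the Colab T4 GPU
generate = pipeline("text-generation", model="gpt2", device=0)
print(generate("Kubernetes is", max_new_tokens=30)[0]["generated_text"])
```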
How to run? Using a laptop with llama.cpp and quantization.
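On a laptop the usual route is llama.cpp with a quantized GGUF model; a sketch using the llama-cpp-python bindings, where the model file name is a placeholder and Q4_K_M is one common ~4-bit quantization:

```python
from llama_cpp import Llama

# A 7B model quantized to ~4 bits fits in a few GB of RAM
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=2048,        # context window
    n_gpu_layers=0,    # CPU-only; raise this if a GPU is available
)
out = llm("Q: Why run LLMs in Kubernetes? A:", max_tokens=64)
print(out["choices"][0]["text"])
```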
Using Managed Cloud Services.
• Amazon SageMaker
• Google Cloud AI Platform & Vertex AI
• Microsoft Azure Machine Learning
• NVIDIA AI Enterprise
• Hugging Face Endpoints
• AnyScale Endpoints
Why run them in Kubernetes?
1. We already know it :)
2. Scalability: resource efficiency, HPA, auto-scaling, API limits, etc.
3. Price: managed services add a 20-40% overhead; reserved instances cut costs further.
4. GPU sharing.
5. ML ecosystem: pipelines and artifacts (Kubeflow, Ray Framework).
6. No vendor lock-in; portable.
LLM Serving Frameworks
Options to run LLMs on K8s.
1. KServe from Kubeflow.
2. Ray Serve from the Ray Framework.
3. Flux AI controller.
4. Your own Kubernetes wrapper on top of these frameworks.
We chose TGI (Hugging Face's Text Generation Inference) and made it Kubernetes-ready.
We have Docker. Let's adapt it to Kubernetes.
Demo Time!
Let's bootstrap the infrastructure from a cluster.dev template
Then add model configuration
Apply and check that the model is running
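One quick check is to port-forward the service and hit TGI's /generate endpoint. The request/response shape is TGI's documented API; the service name and local port are assumptions about the demo setup:

```python
# Assumes: kubectl port-forward svc/tgi 8080:80
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Kubernetes?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```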
Changing models and infrastructure
Enabling HF chat-ui
Deploy Monitoring and Metrics with DCGM Exporter
Thank you! Questions?
