How to test LLMs in production.pdf

1/14
How to test LLMs in production?
leewayhertz.com/how-to-test-llms-in-production
In today’s AI-driven era, large language model-based solutions like ChatGPT have become
integral in diverse scenarios, promising enhanced human-machine interactions. As the
proliferation of these models accelerates, so does the need to gauge their quality and
performance in real-world production environments. Testing LLMs in production poses
significant challenges, as ensuring their reliability, accuracy, and adaptability is no
straightforward task. Approaches such as executing unit tests with an extensive test bank,
selecting appropriate evaluation metrics, and implementing regression testing when
modifications are made to prompts in a production environment are indeed beneficial.
However, scaling these operations often necessitates substantial engineering resources and
the development of dedicated internal tools. This is a complex task that requires a significant
investment of both time and manpower. The absence of a standardized testing method for
these models complicates matters further.
This article delves into the nuts and bolts of testing LLMs, primarily focusing on assessing
them in a production environment. We will explore different testing methodologies, discuss
the role of user feedback, and highlight the importance of bias and anomaly detection. This
insight aims to provide a comprehensive understanding of how we can evaluate and ensure
the reliability of these AI-powered language models in real-world settings.
What is an LLM?

2/14
Large Language Models (LLMs) represent the pinnacle of current language modeling
technology, leveraging the power of deep learning algorithms and an immense quantity of
text data. Such models have the remarkable ability to emulate human-written text and
execute a multitude of natural language processing tasks.
To comprehend language models in general, we can think of them as systems that confer
probabilities to word sequences predicated on scrutinizing text corpora. Their complexity can
vary from straightforward n-gram models to more intricate neural network models.
Nevertheless, large language models commonly denote models harnessing deep learning
techniques and boasting an extensive array of parameters, potentially amounting from
millions to billions. They are adept at recognizing intricate language patterns and crafting text
that often mimics human composition.
Building a ” large language model,” an extensive transformer model, usually requires
resources beyond a single computer’s capabilities. Consequently, they are often offered as a
service via APIs or web interfaces. Their training involves extensive text data from diverse
sources like books, articles, websites, and other written content forms. This exhaustive
training allows the models to understand statistical correlations between words, phrases, and
sentences, enabling them to generate relevant and cohesive responses to prompts or
inquiries.
An example of such a model is ChatGPT’s GPT-3 model, which underwent training on an
enormous quantity of internet text data. This process enables it to comprehend various
languages and exhibit knowledge of a wide range of subjects.
Importance of testing LLMs in production
Testing large language models in production helps ensure their robustness, reliability, and
efficiency in serving real-world use cases, contributing to trustworthy and high-quality AI
systems. To delve deeper, we can broadly categorize the importance of testing LLMs in
production, as discussed below.
To avoid the threats associated with LLMs
There are a certain potential risks associated with LLMs that significantly make production
testing important for the optimum performance of the model:
Adversarial attacks: Proactive testing of models can help identify and defend against
potential adversarial attacks. To avoid such attacks in a live environment, models can
be scrutinized with adversarial examples to enhance their resilience before
deployment.

3/14
Data authenticity and inherent bias: Typically, data sourced from various platforms
can be unstructured and may inadvertently capture human biases, which can be
reflected in the trained models. These biases may discriminate against certain groups
based on attributes such as gender, race, religion, or sexual orientation, with
repercussions varying depending on the model’s application scope. Evaluations may
overlook such biases, as they primarily focus on performance rather than the model’s
behavior driven by the data’s role.
Identification of failure points: Potential failures can occur when integrating ML
systems like LLMs into a production setting. These may be attributed to biases in
performance, lack of robustness, or input model failures. Certain evaluations might not
detect these failures, even though they indicate underlying issues. For instance, a
model with 90% accuracy indicates challenges with the remaining 10% of the data,
suggesting difficulties in generalizing this portion. This insight can trigger a closer
examination of the data for errors, leading to a deeper understanding of how to address
them. As evaluations don’t capture everything, creating structured tests for conceivable
scenarios is vital, helping identify potential failure modes.
To overcome challenges involved in moving LLMs to enterprise-scale
production
Exorbitant operational and experimental expenses: Using really large models
always means spending a lot of money. These models need big computer systems to
work properly and spread their workload over many parts. On top of that, trying things
out and making changes can get expensive quickly, and you might run out of money
before the model is even ready for use. So, it is crucial to ensure the model performs
as expected.
Language misappropriation concerns: Large language models use lots of data from
different places. One big problem is that this data can have biases based on where it
comes from – things like culture and society. Plus, checking that so much information is
accurate can take a lot of work and time. If the model learns from data that is biased or
wrong, it can make these problems worse and give results that are unfair or
misleading. It’s also really hard to make these models understand human thinking and
the different meanings of the same information. The key is to make sure that the
models reflect the wide range of human beliefs and views.
Adaptation for specific tasks: Large language models are great at handling lots of
data, but making them work for specific tasks can be tricky. This means tweaking the
big models to create smaller ones that focus on certain jobs. These smaller models
keep the good performance of the original ones, but getting them just right can take
some time. You have to think carefully about what data to use, how to set up the model,
and what base models to adjust. Getting these things right is important for making sure
we can understand how the model works.

4/14
Hardware constraints: Even if you have a lot of money to spend on using large
models, figuring out the best way to set up and share out the computer systems they
need can be tough. There’s no one-size-fits-all solution for these models, so you need
to work out the best setup for your own model. Plus, you need to have good ways of
making sure your computer resources can handle the changes in your large model’s
size.
Given the scarcity of expertise in parallel and distributed computing resources, the onus falls
on your organization to acquire specialists adept at handling LLMs.
What sets testing LLMs in production apart from testing them in
earlier stages of the development process?
End-user feedback is the ultimate validation of model quality— it’s crucial to measure
whether users deem the responses as “good” or “bad,” and this feedback should guide your
improvement efforts. High-quality input/output pairs gathered in this way can further be
employed to fine-tune the large language models.
Explicit user feedback is gleaned when users respond with a clear indicator, like a thumbs up
or thumbs down, while interacting with the LLM output in your interface. However, actively
soliciting such feedback may not yield a large enough response volume to gauge overall
quality effectively. If the rate of explicit feedback collection is low, it may be advisable to use
implicit feedback, if feasible.
Implicit feedback, on the other hand, is inferred from the user’s reaction to the LLM output.
For instance, suppose an LLM produces the initial draft of an email for a user. If the user
dispatches the email without making any modifications, it likely indicates a satisfactory
response. Conversely, if they opt to regenerate the message or rewrite it entirely, that
probably signifies dissatisfaction. Implicit feedback may not be viable for all use-cases, but it
can be a potent tool for assessing quality.
The importance of feedback, particularly in the context of testing in a production
environment, is underscored by the real-world and dynamic interactions users have with the
LLM. In comparison, testing in other stages, such as development or staging, often involves
predefined datasets and scenarios that may not capture the full range of potential user
interactions or uncover all the possible model shortcomings. This difference highlights why
testing in production, bolstered by user feedback, is a crucial step in deploying and
maintaining high-quality LLMs.
Testing LLMs in production allows you to understand your model better and helps identify
and rectify bugs early. There are different approaches and stages of production testing for
LLMs. Let’s get an overview.

5/14
Enumerate use cases
The first step in testing LLMs is to identify the possible use cases for your application.
Consider both the objectives of the users (what they aim to accomplish) and the various
types of input your system might encounter. This step helps you understand the broad range
of interactions your users might have with the model and the diversity of data it needs to
handle.
Define behaviors and properties, and develop test cases
Once you have identified the use cases, contemplate the high-level behaviors and properties
that can be tested for each use case. Use these behaviors and properties to write specific
test cases. You can even use the LLM to generate ideas for test cases, refining the best
ones and then asking the LLM to generate more ideas based on your selection. However, for
practicality, choose a few easy use cases to test the fundamental properties. While some use
cases might need more comprehensive testing, starting with basic properties can provide
initial insights.
Investigate discovered bugs
Once you identify errors in the initial tests, delve deeper into these bugs. For example,
inspect these errors closely in a use case where the LLM is tasked with making a draft more
concise, and you notice an error rate of 8.3%. Often, you can identify patterns in these
errors, which can provide insights into the underlying issues. A prompt can be developed to
facilitate this process, mimicking the AdaTest approach where the prompt/UI optimization is
prioritized.
Unit testing
Unit testing involves testing of individual components of a software system or application. In
the context of LLMs, this could include various elements of the model, such as:
Input data quality checks: Testing to ensure that the inputs are correct and in the
right format and that the parameters used are accurate. This will involve validating the
format and content of the dataset used in the model.
Algorithms: Testing the underlying algorithms in the LLMs, such as sorting and
searching algorithms, machine learning algorithms, etc. This is done to verify the
accuracy of the output, given the input.
Architecture: Testing the architecture of the LLM to validate that it is working correctly.
This could involve the layers of a deep learning model, the features in a decision tree,
the weights in a neural network, etc.
Configuration: Validating the configuration settings of the model.
Model evaluation: The output of the models should be tested against known answers
to ensure accuracy.

6/14
Performance: The performance of the LLM model in terms of speed and efficiency
needs to be tested.
Memory: Memory usage of the model should be tested and optimized.
Parameters: Testing the parameters used in the LLM, such as the learning rate,
momentum, and weight decay in a neural network.
These components might be tested individually or in combinations, depending on the
requirements of the model and the results of previous tests. Each component may have a
different effect on the model’s overall performance, so it is important to examine them
individually to identify any issues that may impact the LLM’s performance.
Integration testing
After validating individual components, test how different parts of the LLM interact.
Integration testing involves testing the various parts of a system in an integrated manner to
assess whether they function together as intended. Here is how the process works for a
language model:
Data integrity: Check the flow of data in the system. For instance, if a language model
is fed data, check whether the right kind of data is being processed correctly and the
output is as expected.
Layer interaction: In the case of a deep learning model like a neural network, it’s
important to test how information is processed and passed from one layer to the next.
This involves checking the weight and bias values and ensuring data transfer is
happening correctly. This could be as simple as checking to see if the data from one
layer is correctly passed to the next layer without any loss or distortion.
Feature testing: Test the feature extraction capability of the model. Good features are
essential for good performance in a deep learning model. You might need to test
whether the features extracted by the model are appropriate and contribute to the
overall performance of the model.
Model performance: The performance of the model is critical. Once trained, you need
to test whether the model can correctly classify, regress, or do whatever it is designed
to do correctly. This involves a lot of testing to ensure that the model, once trained,
works correctly.
Output testing: This is about testing the output of the whole system. You have an
input, and you know what the output should be. Give the system the input and compare
the output to the expected result.

7/14
Interface testing: Here, you will look at how the different components of the system
work together. For instance, how well does the user interface work with the database?
Or how well does the front-end web interface work with the back-end processing
scripts?
Remember that most of these tests are about a single function or feature of the whole
system. Once you’ve ensured that each feature works correctly, you can move on to
testing how those features work together, which is the ultimate goal of integration
testing.
Regression testing
For an LLM, regression testing involves running a suite of tests to ensure that changes such
as those added through feature engineering, hyperparameter tuning, or changes in the input
data have not adversely affected performance. These can include re-running the model and
comparing the results to the original, checking for differences in the results, or running new
tests to verify that the model’s performance metrics have not changed.
As you can see, regression testing is an essential part of the model development process,
and its primary function is to catch any problems that may arise during the upgrade process.
This involves comparing the model’s current performance with the results obtained when the
model was first developed. Regression testing ensures that new updates, patches or
improvements do not cause problems with the existing functionality, and it can help detect
any problems that may arise in the future.
It’s important to note that regression testing can also be done after the model is deployed to
production. This can be achieved by re-running the same tests on the upgraded model to
see how it performs. Regression testing can also be done by comparing the model’s
performance metrics with those obtained from a suite of tests. If the metrics are not
significantly different, then the model is considered to be in good health.
While regression testing is a very important part of the model development process, it’s
important to note that it is not the only way to test a model. Other methods can be used to
check the performance of a model, such as unit testing, functional testing, and load testing.
However, regression testing is a very important part of the model development process, and
it is a process that can be done at any time during the model’s life cycle. It’s important to
ensure that your model is performing at its best and not introducing any new bugs or
problems.
Load testing
Load testing for LLMs involves the model processing a large amount of data. This can often
happen when a system is required to process a high volume of data in a short amount of
time.

8/14
Identify the key scenarios: Load testing should begin by identifying the scenarios
where the system may face high demand. These might be common situations that the
system will face or be worst-case scenarios. The load testing should consider how the
system will behave in these situations.
Design and implement the test: Once the scenarios are identified, tests should be
designed to simulate these scenarios. The tests may need to account for various
factors, such as the volume of data, the speed of data input, and the complexity of the
data.
Execute the test: The system should be monitored closely during the test to see how it
behaves. This might involve checking the server load, the response times, and the
error rates. It may also be necessary to perform the test multiple times to ensure
reliable results.
Analyze the results: Once the test is completed, the results should be analyzed to see
how the system behaves. This can involve looking at metrics such as the number of
users, the response time, the error rate, and the server load. These results can help to
identify any issues that need to be addressed.
Repeat the process: Load testing should be repeated regularly to ensure the system
can still handle the expected load. As the system evolves and the scenarios change,
the tests may need to be updated.
Load testing is crucial to ensuring that a system can handle the load it is expected to face.
By understanding how a system behaves under load, it is possible to design and build more
resilient systems that can handle high volumes of data. This can help to ensure that a
system can continue to provide a high level of service, even under heavy load.
Feedback loop
Implement a feedback loop system where users can provide explicit or implicit feedback on
the model’s responses. This allows you to collect real-world user feedback, which is
invaluable for improving the model’s performance.
User feedback is instrumental in the iterative process of model refinement, and it plays a
crucial role in the performance of machine learning models. This kind of feedback can be
considered as a direct communication channel with the users, and it is useful for the machine
learning model in the following ways:
User needs understanding: Feedback from users can provide critical information
about what users want, what they find useful, and the areas where the machine
learning model might improve. Understanding these requirements can help tailor the
machine learning model’s functionality more closely to users’ needs.

9/14
Model refinement: User feedback can guide the model refinement process, helping
developers understand where the model falls short and what improvements can be
made. This is especially true in the case of machine learning models, where user
feedback can directly impact the model’s ability to ‘learn.’
Model validation: User feedback can also play a key role in model validation. For
instance, if a user flags a certain response as inaccurate, this can be considered when
updating and training the model.
Detection of shortcomings: User feedback can also help to detect any shortcomings
or gaps in the model. These can be areas where the model is weak or does not meet
user needs. By identifying these gaps, developers can work to improve the model and
its outputs.
Improving accuracy: By using user feedback, developers can work to improve the
accuracy of the model’s responses. For instance, if a model consistently receives
negative feedback on a particular type of response, the developers can investigate this
and make adjustments to improve the accuracy.
A/B testing
If you have multiple versions of a model or different models, use A/B testing to compare their
performance in the production environment. This involves serving different model versions to
different user groups and comparing their performance metrics. A/B testing, also known as
split testing, is a technique used to compare two versions of a system to determine which
one performs better. In the context of large language models, A/B testing can compare
different versions of the same or entirely different models.
Here is a detailed description of how A/B testing can be employed for LLMs:
Model comparison: If you have two versions of a language model (for example, two
different training runs or the same model trained with two different sets of
hyperparameters), you can use A/B testing to determine which performs better in a
production environment.
Feature testing: You can use A/B testing to evaluate the impact of new features. For
instance, if you introduce a new preprocessing step or incorporate additional training
data, you can run an A/B test to compare the model’s performance with and without the
new feature.
Error analysis: A/B testing can also be used for error analysis. If users report an issue
with the LLM’s responses, you can run an A/B test with the fix in place to verify whether
the issue has been resolved.
User preference: A/B testing can help understand user preferences. By presenting a
group of users with responses generated by two different models or model versions,
you can gather feedback on which model’s responses are preferred.

10/14
Deployment decisions: The results of A/B testing can inform decisions about which
version of a model to deploy in a production environment. If one model version
consistently outperforms another in A/B tests, it is likely a good candidate for
deployment.
During A/B testing, it’s important to ensure that the test is fair and that any differences in
performance can be attributed to the differences between the models rather than to external
factors. This typically involves randomly assigning users or requests to the different models
and controlling for variables that could influence the results.
Bias and fairness testing
Conduct tests to identify and mitigate potential biases in the model’s outputs. This involves
using fairness metrics and bias evaluation tools to measure the model’s equity across
different demographic groups.
Bias and fairness are important considerations when testing and deploying LLMs. They are
crucial because biased responses or decisions the model makes can have serious
consequences, leading to unfair treatment or discrimination.
Bias and fairness testing for LLMs typically involves the following steps:
Data audit: The data used must be audited for potential biases before training an LLM.
This includes understanding the sources of the data, its demographics, and any
potential areas of bias it might contain. The model will often learn biases in the training
data, so it’s important to identify and address these upfront.
Bias metrics: Implement metrics to quantify bias in the model’s outputs. These could
include metrics that measure disparity in error rates or the model’s performance across
different demographic groups.
Test case generation: Generate test cases that help uncover biases. This could
involve creating synthetic examples covering a range of demographics and situations,
particularly those prone to bias.
Model evaluation: The LLM should be evaluated using the test cases and bias
metrics. If bias is found, the developers need to understand why it is happening. Is it
due to the training data or due to some aspect of the model’s architecture or learning
algorithm?
Model refinement: If biases are detected, the model may need to be refined or
retrained to minimize them. This could involve changes to the model or require
collecting more balanced or representative training data.
Iterative process: Bias and fairness testing is an iterative process. As new versions of
the model are developed, or the model is exposed to new data in a production
environment, the tests should be repeated to ensure that the model continues to
behave fairly and unbiasedly.

11/14
User feedback: Allow users to provide feedback about the model’s outputs. This can
help detect biases that the testing process may have missed. User feedback is
especially valuable as it provides real-world insights into how the model is performing.
Ensuring bias and fairness in LLMs is a challenging and ongoing task. However, it’s a crucial
part of the model’s development process, as it can significantly affect its performance and
impact on users. By systematically testing for bias and fairness, developers can work
towards creating fair and unbiased models, which leads to better, more equitable outcomes.
Anomaly detection
Implement anomaly detection systems to alert you when the model’s behavior deviates from
what is expected. This can help identify issues in real time, allowing you to respond quickly.
Anomaly detection, also known as outlier detection, identifies items, events, or observations
that differ significantly from most of the data. In the context of LLMs, anomaly detection can
be essential to ensuring the model’s responses are within expected parameters and
identifying any unusual or potentially problematic output.
Here’s a detailed breakdown of how anomaly detection can be performed in LLMs:
Define normal behavior: Anomaly detection starts with defining what is “normal” for
the LLM’s output. This could be based on past responses, training data, or defined
constraints. For example, the length of the generated text, the topic, the sentiment, or
the type of language used can be factors that define normal behavior.
Set thresholds: Once the normal behavior is defined, thresholds need to be set to
determine when a response is considered an anomaly. These thresholds could be
based on statistical methods (e.g., anything beyond three standard deviations from the
mean might be considered an outlier) or domain-specific rules (e.g., a response
containing explicit language might be considered an anomaly).
Monitor model outputs: As the model generates responses, these should be
monitored and compared to the defined thresholds. Any response that falls outside
these thresholds is flagged as a potential anomaly.
Investigate anomalies: Any identified anomalies should be investigated to understand
why they occurred. This can help in identifying whether the anomaly is due to an issue
with the model (e.g., bias in the training data, a bug in the model, or an unexpected
interaction between different parts of the model) or whether it’s an acceptable response
that just happens to be unusual.

12/14
Update model or thresholds: Depending on the findings of the investigation, you may
need to update the model or the thresholds. For example, if an anomaly is due to a bug
in the model, you would need to fix the bug. If the anomaly is due to bias in the training
data, you may need to retrain the model with more balanced data. Alternatively, if the
anomaly is an acceptable but unusual response, you may need to adjust your
thresholds to accommodate these responses.
Remember that anomaly detection is an ongoing process. As the LLM continues to learn and
adapt to new data, what is considered “normal” may change, and the thresholds may need to
be adjusted accordingly. By continuously monitoring the model’s outputs and investigating
any anomalies, you can ensure that the model continues performing as expected and
delivers high-quality responses.
Key metrics for evaluating LLMs in production
There are several key metrics to assess the performance of a large language model in
production.
Interaction and user engagement
This metric quantifies the model’s proficiency in maintaining user engagement throughout a
conversation. It explores the model’s propensity to ask pertinent follow-up questions, clarify
ambiguities, and foster a fluid dialogue. Established usage metrics gathered through user
surveys or other tools can be used to gauge engagement, including average query volume,
average query size, response feedback rating, and average session duration.
Response coherence
This metric focuses on the model’s capacity to generate coherent and contextually
appropriate responses. It verifies the model’s proficiency in producing relevant and
meaningful answers. Language scoring techniques such as Bilingual Evaluation Understudy
(BLEU) and Recall Oriented Understudy for Gisting Evaluation (ROUGE) can be utilized to
measure this aspect.
Fluency
Fluency evaluates the model’s responses’ structural integrity, grammatical correctness, and
linguistic coherence. It assesses the model’s competency in producing language that sounds
natural and fluid. The perplexity metric, the normalized inverse probability of the test set
normalized by the number of words, can be used to measure fluency.
Relevance

13/14
Relevance assesses the alignment of the model’s responses with the user’s input or query. It
checks whether the model accurately grasps the user’s intention and provides suitable, on-
topic responses. Metrics such as the F1-Score and techniques like BERT can measure
relevance.
Contextual awareness
This metric gauges the model’s capacity to understand the conversation’s context. It verifies
the model’s ability to reference prior messages, track dialogue history, and deliver consistent
responses. Cross Mutual Information (XMI) can help measure context awareness.
Sensibleness and specificity
This metric evaluates the sensibility and specificity of the model’s responses. It checks
whether the model provides sensible, detailed answers rather than generic or illogical
responses. To measure sensibleness and specificity, one could compute the average scores
given by evaluators for the model’s responses across the entire dataset. These average
scores will give an overall measurement of the sensibility and specificity of the model’s
responses.
Endnote
While the process of testing may be demanding, particularly when using large language
models, the alternatives present their own sets of challenges. Benchmarking tasks that
involve generation, where there are multiple correct answers, can be inherently complex,
leading to a lack of confidence in the results. Obtaining human evaluations of a model’s
output can be even more time-consuming and may lose relevance as the model evolves,
rendering the collected labels less useful.
Choosing not to test could result in a lack of understanding of the model’s behavior, a
situation that could pave the way for potential failures. On the other hand, a well-structured
testing approach can unearth bugs, provide deeper insights into the task at hand, and reveal
serious specification issues early in the process, thereby allowing time for course correction.
In weighing the pros and cons, it becomes evident that investing time in rigorous testing is a
judicious choice. This not only ensures a deep understanding of the model’s performance
and behavior but also guarantees that any potential issues are identified and addressed
promptly, contributing to the successful deployment of the LLM in a production environment.
For your large language models to excel, ongoing testing is indispensable, with a specific
focus on production testing. Partnering with LeewayHertz means gaining access to custom
models and solutions tailored to your business needs, all fortified with rigorous testing to
ensure resilience, security, and accuracy.

14/14
Start a conversation by filling the form

How to test LLMs in production.pdf

More Related Content

Similar to How to test LLMs in production.pdf (20)

More from AnastasiaSteele10 (11)

Recently uploaded (20)

How to test LLMs in production.pdf