LLM-as-a-Judge is a dangerous way to do evaluation. In most cases, it just doesn't work. It gives you a false impression of having a grasp on your system's performance, luring you with generic metrics such as correctness, faithfulness, or completeness. These metrics hide several complexities:

- What does "completeness" mean for your application? For a marketing AI assistant, what distinguishes a complete post from an incomplete one? If the score goes up, is the post actually better?
- These metrics are often scores from 1 to 5. But what is the difference between a 3 and a 4? Does a 5 mean the output is perfect? What is perfect, then?
- If you "calibrate" the LLM-as-a-judge against scores given by users during a test session, how do you ensure the LLM's scoring matches user expectations? If I arbitrarily set all scores to 4, will I outperform your model?

However, LLM-as-a-judge being limited doesn't mean it's impossible to evaluate your AI system. Here are some strategies you can adopt:

- Online evaluation is the new king in the GenAI era
Log and trace LLM outputs, retrieved chunks, routing… every step of the process. Link each trace to user feedback as a binary label: was the final output good or bad? Then look at the data. Yourself, no LLM. Take notes and try to find patterns among good and bad traces. At this stage, you can use an LLM to help you find clusters within your notes, not within the data. After taking this time, you'll already have some clues about how to improve the system. (A minimal trace-logging sketch is below.)

- Evaluate the deterministic steps that come before the final output
Was the retrieval system performant? Did the router correctly categorize the initial query? Those steps of the agentic system are deterministic, meaning you can evaluate them precisely:
Retriever: Hit Rate@k, Mean Reciprocal Rank, Precision, Recall
Router: Precision, Recall, F1-Score
Create a small benchmark, synthetic or not, to evaluate those steps offline. It lets you improve them individually later on (hybrid search instead of vector search, fine-tuning a small classifier instead of relying on LLMs…). (See the metrics sketch below.)

- Don't use tools that promise to externalize evaluation
Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well.

...

These are unequivocal ideas, repeated across the AI community. Yet I still see companies' AI projects relying on LLM-as-a-judge and generic metrics. Being able to evaluate your system gives you the power to improve it. So take the time to create the right evaluation for your use case.
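A minimal sketch of what "log and trace every step, linked to binary user feedback" could look like in plain Python. The `Trace` dataclass and its field names are illustrative, not any specific tracing library's API; adapt them to your own stack:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trace:
    """One request through the agentic pipeline, with everything you'd want to reread later."""
    trace_id: str
    user_query: str
    routed_intent: str                      # what the router decided
    retrieved_chunks: list[str]             # what the retriever returned
    final_output: str                       # what the user actually saw
    user_feedback: Optional[str] = None     # binary label: "good" or "bad", None if not given
    notes: str = ""                         # your own observations after reading the trace

def review_queue(traces: list[Trace]) -> list[Trace]:
    """Traces that received feedback, bad ones first, so you read the failures yourself."""
    labeled = [t for t in traces if t.user_feedback is not None]
    return sorted(labeled, key=lambda t: t.user_feedback == "good")
```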
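And a sketch of the offline metrics for the deterministic steps, assuming you have a small labeled benchmark. The retriever metrics are written from scratch; the router metrics use scikit-learn. The toy labels are made up for illustration:

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries where at least one relevant doc appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if nothing relevant is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# Router: standard classification metrics over the same small benchmark.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["search", "chitchat", "search", "order_status"]   # gold intents (toy example)
y_pred = ["search", "search",   "search", "order_status"]   # router predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```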
👏 Couldn’t agree more. LLM-as-a-judge gives you numbers but not signal.
Or not evaluating them at all...you forgot that option 😂 I am sure that many companies do it
What's the better way?
If you have a better evaluation system, use it; otherwise, LLM-as-a-judge gives you a strong baseline with room to improve, and it doesn't take too much effort. The performance of LLM-as-a-judge also depends on how you set it up: multiple LLMs with a voting mechanism or in-context examples can improve accuracy, along with how you choose the evaluation policy. I'm not saying it doesn't have weaknesses, but improving on those weaknesses makes everything better.
This is the most important part for AI PMs: "Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well." Thanks for sharing ✌️
One thing, which only applies in specific situations: if you have the ability to do neuro-symbolic reasoning, either to check the result or to check the LLM-as-a-judge, then you can do a bunch of cool stuff. Neuro-symbolic reasoning just means translating between different domains, i.e., if you asked an LLM to solve simple math, then took the input math and had a calculator solve it, you can, in a sense, check your work.
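A rough sketch of that check-the-work idea: ask the LLM for an arithmetic result, recompute it deterministically, and compare. The `calc` helper here is a minimal safe expression evaluator written for illustration; the LLM call itself is left out:

```python
import ast
import operator

# Supported arithmetic operators for the safe evaluator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Deterministically evaluate a simple arithmetic expression (no arbitrary code execution)."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def check_llm_math(expr: str, llm_answer: str, tol: float = 1e-9) -> bool:
    """Compare the LLM's claimed answer against the deterministic recomputation."""
    return abs(calc(expr) - float(llm_answer)) < tol

# check_llm_math("12 * (3 + 4)", "84")  -> True
```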
Let's complete that ridiculous train of thought… //Computers in the workplace are dangerous. Keep them in data centers and off desks.// //Learning from books is dangerous… learn only from personal experience with tangible things or trusted people. You don't know the authors of books, and you don't even know if what's in the book is really what they meant. Never trust printed books! Don't trust newspapers!//
Saying “LLM-as-judge doesn’t work” is too strong. It works in bounded settings when you: - use pairwise comparisons instead of raw Likert scores, - apply a tight rubric with checklists, - calibrate to human gold labels, and - guard against known biases. In those conditions, model judges often track human preferences well enough to rank models and catch regressions. They are weak as a final KPI, strong as a fast, cheap filter.
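For the pairwise-comparison point, a small sketch of the position-swap guard: judge each pair twice with the answer order flipped, and only keep verdicts that are consistent. `judge_prefers_first` stands in for whatever LLM judge call you use; it is a placeholder, not a real API:

```python
from typing import Callable, Optional

def pairwise_verdict(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_prefers_first: Callable[[str, str, str], bool],  # placeholder for your LLM judge call
) -> Optional[str]:
    """Return 'a', 'b', or None when the judge's preference flips with position (likely bias)."""
    first_pass = judge_prefers_first(question, answer_a, answer_b)    # A shown first
    second_pass = judge_prefers_first(question, answer_b, answer_a)   # B shown first
    if first_pass and not second_pass:
        return "a"
    if not first_pass and second_pass:
        return "b"
    return None  # inconsistent verdict: discard it or send the pair to a human
```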