LLM-as-a-Judge is a dangerous way to do evaluation. In most cases, it just doesn't work. It gives you a false impression of having a grasp on your system's performance, luring you with generic metrics such as correctness, faithfulness, or completeness. These metrics hide several complexities:

- What does "completeness" mean for your application? For a marketing AI assistant, what distinguishes a complete post from an incomplete one? If the score goes up, is the post actually better?
- These metrics are often scores from 1 to 5. But what is the difference between a 3 and a 4? Does a 5 mean the output is perfect? What is perfect, then?
- If you "calibrate" the LLM-as-a-judge against scores given by users during a test session, how do you ensure the LLM's scoring matches user expectations? If I arbitrarily set all scores to 4, will I outperform your model?

However, LLM-as-a-judge being limited doesn't mean it's impossible to evaluate your AI system. Here are some strategies you can adopt:

- Online evaluation is the new king in the GenAI era
Log and trace LLM outputs, retrieved chunks, routing… every step of the process. Link each trace to user feedback as a binary label: was the final output good or bad? Then look at the data. Yourself, no LLM. Take notes and try to find patterns among good and bad traces. At this stage, you can use an LLM to help you find clusters within your notes, not within the data. After taking this time, you'll already have some clues about how to improve the system. (A minimal trace-logging sketch is below.)

- Evaluate the deterministic steps that come before the final output
Was the retrieval system performant? Did the router correctly categorize the initial query? Those steps of the agentic system are deterministic, meaning you can evaluate them precisely:
Retriever: Hit Rate@k, Mean Reciprocal Rank, Precision, Recall
Router: Precision, Recall, F1-Score
Create a small benchmark, synthetic or not, to evaluate those steps offline. It lets you improve them individually later on (hybrid search instead of vector search, fine-tuning a small classifier instead of relying on LLMs…). (See the metrics sketch below.)

- Don't use tools that promise to externalize evaluation
Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well.

...

These are unequivocal ideas, repeated across the AI community. Yet I still see companies' AI projects relying on LLM-as-a-judge and generic metrics. Being able to evaluate your system gives you the power to improve it. So take the time to create the right evaluation for your use case.
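A minimal sketch of what "log and trace every step, linked to binary user feedback" could look like in plain Python. The `Trace` dataclass and its field names are illustrative, not any specific tracing library's API; adapt them to your own stack:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trace:
    """One request through the agentic pipeline, with everything you'd want to reread later."""
    trace_id: str
    user_query: str
    routed_intent: str                      # what the router decided
    retrieved_chunks: list[str]             # what the retriever returned
    final_output: str                       # what the user actually saw
    user_feedback: Optional[str] = None     # binary label: "good" or "bad", None if not given
    notes: str = ""                         # your own observations after reading the trace

def review_queue(traces: list[Trace]) -> list[Trace]:
    """Traces that received feedback, bad ones first, so you read the failures yourself."""
    labeled = [t for t in traces if t.user_feedback is not None]
    return sorted(labeled, key=lambda t: t.user_feedback == "good")
```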
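And a sketch of the offline metrics for the deterministic steps, assuming you have a small labeled benchmark. The retriever metrics are written from scratch; the router metrics use scikit-learn. The toy labels are made up for illustration:

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries where at least one relevant doc appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if nothing relevant is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# Router: standard classification metrics over the same small benchmark.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["search", "chitchat", "search", "order_status"]   # gold intents (toy example)
y_pred = ["search", "search",   "search", "order_status"]   # router predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```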
👏 Couldn’t agree more. LLM-as-a-judge gives you numbers but not signal.
Or not evaluating them at all...you forgot that option 😂 I am sure that many companies do it
What's the better way?
If you have a better evaluation system, use it; otherwise, LLM-as-a-judge gives you a strong baseline with room to improve, and it doesn't take too much effort. The performance of LLM-as-a-judge also depends on how you set it up: multiple LLMs with a voting mechanism or in-context examples can improve accuracy, along with how you choose the evaluation policy. I'm not saying it doesn't have weaknesses, but improving on those weaknesses makes everything better.
This is the most important part for AI PMs: "Your evaluation system is your moat. If you externalize it, not only will you have no control over your AI application, but you also won't be able to improve it. This is your problem, your users, your revenue. Evaluate your system. Not a generic one. All problems are different. Yours is unique as well." Thanks for sharing ✌️
One thing, which only applies in specific situations: if you have the ability to do neuro-symbolic reasoning, either to check the result or to check the LLM-as-a-judge, then you can do a bunch of cool stuff. Neuro-symbolic reasoning just means translating between different domains, i.e., if you asked an LLM to solve simple math, then took the input math and had a calculator solve it, you can, in a sense, check your work.
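A rough sketch of that check-the-work idea: ask the LLM for an arithmetic result, recompute it deterministically, and compare. The `calc` helper here is a minimal safe expression evaluator written for illustration; the LLM call itself is left out:

```python
import ast
import operator

# Supported arithmetic operators for the safe evaluator.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Deterministically evaluate a simple arithmetic expression (no arbitrary code execution)."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def check_llm_math(expr: str, llm_answer: str, tol: float = 1e-9) -> bool:
    """Compare the LLM's claimed answer against the deterministic recomputation."""
    return abs(calc(expr) - float(llm_answer)) < tol

# check_llm_math("12 * (3 + 4)", "84")  -> True
```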
Let's complete that ridiculous train of thought… //Computers in the workplace are dangerous. Keep them in data centers and off desks.// //Learning from books is dangerous… learn only from personal experience with tangible things or trusted people. You don't know the authors of books, and you don't even know if what's in the book is really what they meant. Never trust printed books! Don't trust newspapers!//
Saying “LLM-as-judge doesn’t work” is too strong. It works in bounded settings when you: - use pairwise comparisons instead of raw Likert scores, - apply a tight rubric with checklists, - calibrate to human gold labels, and - guard against known biases. In those conditions, model judges often track human preferences well enough to rank models and catch regressions. They are weak as a final KPI, strong as a fast, cheap filter.
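For the pairwise-comparison point, a small sketch of the position-swap guard: judge each pair twice with the answer order flipped, and only keep verdicts that are consistent. `judge_prefers_first` stands in for whatever LLM judge call you use; it is a placeholder, not a real API:

```python
from typing import Callable, Optional

def pairwise_verdict(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_prefers_first: Callable[[str, str, str], bool],  # placeholder for your LLM judge call
) -> Optional[str]:
    """Return 'a', 'b', or None when the judge's preference flips with position (likely bias)."""
    first_pass = judge_prefers_first(question, answer_a, answer_b)    # A shown first
    second_pass = judge_prefers_first(question, answer_b, answer_a)   # B shown first
    if first_pass and not second_pass:
        return "a"
    if not first_pass and second_pass:
        return "b"
    return None  # inconsistent verdict: discard it or send the pair to a human
```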