Dr. Michael Fröhlich’s Post


Software Engineering at Langfuse

Now you know which scores represent "Quality" in your AI agent. But the next step: applying them to every AI call in production? That’s the real game. You can’t check thousands of user interactions manually.
You need automation. There are 3 fundamental ways:
1. Human Annotation → Ground truth. Small sample, deep accuracy.
2. Rule-based Checks → Black-and-white. Fast. Cheap. Every call.
3. LLM-as-a-Judge → Scales nuance (e.g. helpfulness, relevance).
Combine all 3 → Continuous, reliable, scalable evals. That’s how you stop hoping your AI works… and know it does.
Diving into AI Observability & Evals (5/6) #AIObservability #Tracing #LLM #AI
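A minimal sketch of what methods 2 and 3 could look like in code, assuming a hypothetical call_llm() helper for the judge model; the check names, prompt, and 1-5 scale are illustrative choices, not the Langfuse API.

```python
import json


def _is_json(text: str) -> bool:
    """Helper: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def rule_based_check(output: str) -> dict:
    """Method 2: deterministic, cheap checks that can run on every call."""
    return {
        "non_empty": len(output.strip()) > 0,
        "no_refusal": "I can't help with that" not in output,
        "valid_json": _is_json(output),
    }


JUDGE_PROMPT = """Rate the assistant's answer for helpfulness on a 1-5 scale.
Question: {question}
Answer: {answer}
Respond with only the number."""


def llm_as_judge(question: str, answer: str, call_llm) -> int:
    """Method 3: a judge model scores nuanced criteria like helpfulness.

    `call_llm` is a hypothetical function that sends a prompt to the judge
    model and returns its text response; parsing assumes the judge replies
    with just a digit.
    """
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(raw.strip())
```

In practice, these scores would be attached to each trace in an observability tool so they can be tracked continuously, with a small human-annotated sample (method 1) used to spot-check that the automated scores agree with ground truth.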

Alexis Gamboa

Co-Founder at loopid.com

6d

Dr. Michael Fröhlich If you had to choose: human annotations vs. LLM-as-a-Judge? We struggle with customers not being able to invest enough time curating the agent's behaviour, especially when scaling. How far do you think we can go with mostly LLM-as-a-Judge?

