Valentin von Seggern’s Post

I rarely do public speaking, but earlier this year I stepped on stage at Haystack US (Charlottesville, VA) to share how we at AMBOSS evaluate AI-powered search features: our AI Shortcuts, Vector Search, and more.

I'm curious: Are you using LLMs to evaluate features? How aligned are offline & online evaluations in your experience? Any cool tips & tricks you recently learned?

My take on why evaluation matters: 🔍 60% of US med students and hundreds of thousands of clinicians trust AMBOSS at the bedside. In a high-stakes domain like medicine, a shiny feature that looks smart can't ship without proof that it actually helps.

Here's our playbook:

1️⃣ Outcome-first metrics. There are no baskets or purchases here, so we redefined "conversion" around knowledge gained and task-completion signals.

2️⃣ CTR ≠ relevance (always). A high click rate might mean the snippet was wrong and forced users to dig deeper. Context is everything.

3️⃣ Hybrid evaluation loop. Offline evaluation plus an online A/B framework built on engagement heuristics tells us in days (not quarters) whether a new algorithm moves the needle (see the sketch below).

Huge kudos to the Haystack crew & OpenSource Connections for curating a room full of search nerds, and to my team (Mehdi, Ágnes, Johannes, Serdar, Hanhan, Boaz, Matteo, Daniele, Sergii) who make me look smarter than I am. 🤓

https://lnkd.in/dgABBbz7
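To make the hybrid-loop idea concrete, here is a minimal sketch of how an offline judge score and an online A/B lift could feed a single ship decision. All names (EvalItem, judge_fn, the thresholds, the dummy data) are illustrative assumptions, not AMBOSS's actual pipeline; in practice the judge would be an LLM grading answer relevance and the online lift would come from the engagement heuristics mentioned above.

```python
"""Hedged sketch: combine an offline judge score with an online A/B lift."""
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class EvalItem:
    query: str
    answer: str

def offline_score(items: list[EvalItem], judge_fn: Callable[[str, str], float]) -> float:
    """Average judge score (0..1) over a fixed offline query set."""
    return mean(judge_fn(i.query, i.answer) for i in items)

def online_lift(control: list[float], treatment: list[float]) -> float:
    """Relative lift of a task-completion / engagement heuristic in the treatment arm."""
    c, t = mean(control), mean(treatment)
    return (t - c) / c if c else 0.0

def ship_decision(offline: float, lift: float,
                  offline_floor: float = 0.8, lift_floor: float = 0.02) -> bool:
    """Only ship if both offline quality and online lift clear their thresholds."""
    return offline >= offline_floor and lift >= lift_floor

if __name__ == "__main__":
    # Stub judge: in a real loop this would be an LLM grading relevance.
    dummy_judge = lambda q, a: 1.0 if q.lower() in a.lower() else 0.5
    items = [
        EvalItem("sepsis criteria", "Sepsis criteria include ..."),
        EvalItem("warfarin dosing", "Dosing depends on INR ..."),
    ]
    offline = offline_score(items, dummy_judge)
    lift = online_lift([0.41, 0.39, 0.40], [0.44, 0.43, 0.45])
    print(f"offline={offline:.2f} lift={lift:.1%} ship={ship_decision(offline, lift)}")
```

The point of the two gates is the same as in the post: offline evaluation catches regressions cheaply before traffic is spent, and the online check confirms that offline gains actually translate into user-visible improvement.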

