What's the RIGHT way to benchmark multi-step reasoning in LLM agents? Tomaz Bratanic dives into creating evaluation datasets that reflect how agents actually work with graph databases, beyond simple text-to-query translation.