Debugging AI is a pain. What if your observability platform could let an AI fix bugs for you? This is what @samuel_colvin demoed with Pydantic Logfire.
✅ Full-stack traces (FastAPI + LLM)
✅ Auto-retries on validation errors
✅ An AI agent that queries logs & fixes your code
This is the future of building AI apps. See how it works in the recording or talk notes below.
Oh and Pydantic is sponsoring our AI coding course that starts soon and will provide $200 of credit to each student https://lnkd.in/e7Xwzcrd
Check out the talk for full details 👇 https://lnkd.in/eqgBkP3k
How Pydantic Logfire uses AI to debug AI apps
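For context on what "full-stack traces (FastAPI + LLM)" looks like in practice, here is a minimal sketch of instrumenting a FastAPI app and an OpenAI client with Logfire. It assumes a LOGFIRE_TOKEN is configured in the environment, the exact instrumentation helpers may differ by logfire SDK version, and the /summarise endpoint is a made-up example.

import logfire
from fastapi import FastAPI
from openai import OpenAI

logfire.configure()  # reads credentials (e.g. LOGFIRE_TOKEN) from the environment

app = FastAPI()
logfire.instrument_fastapi(app)  # traces every HTTP request handled by the app
logfire.instrument_openai()      # traces LLM calls made through the OpenAI client

client = OpenAI()

@app.get("/summarise")
def summarise(text: str):
    # Custom span so the HTTP request and the LLM call show up in one trace
    with logfire.span("summarise", text_length=len(text)):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"Summarise this: {text}"}],
        )
        return {"summary": response.choices[0].message.content}

With both instrumentations in place, the request and the LLM call appear as parent and child spans in the same trace, which is what makes "let an agent query the logs" practical.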
More Relevant Posts
-
🤖 LaunchDarkly AI Configs now support agents! In this tutorial, Scarlett Attensil shows you how to build a multi-agent system that:
- Uses RAG to answer questions based on your data
- Redacts PII for sensitive queries
- Is configurable at runtime, so you can change your models, prompts, and parameters without needing to deploy
- Automatically collects metrics such as latency, costs, and response quality
https://lnkd.in/eGyW8fzR
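As an illustration of the "configurable at runtime" idea only, here is a generic sketch. It is not the LaunchDarkly AI Configs SDK: get_ai_config() and call_llm() are hypothetical stand-ins for a remotely managed config source and a model client.

from dataclasses import dataclass

@dataclass
class AIConfig:
    model: str
    system_prompt: str
    temperature: float

def get_ai_config(config_key: str) -> AIConfig:
    # Hypothetical stand-in: in a real setup this would be fetched from a
    # remotely managed config service on each request, so changes to the
    # model, prompt, or parameters take effect without a deploy.
    return AIConfig(
        model="gpt-4o-mini",
        system_prompt="Answer using only the retrieved documents. Redact any PII.",
        temperature=0.2,
    )

def call_llm(model: str, system: str, user: str, temperature: float) -> str:
    # Stub for illustration; replace with your model client of choice.
    return f"[{model}] would answer: {user}"

def answer(question: str, retrieved_docs: list[str]) -> dict:
    cfg = get_ai_config("support-agent")  # re-read the config on every call
    prompt = f"{cfg.system_prompt}\n\nDocuments:\n" + "\n".join(retrieved_docs)
    reply = call_llm(model=cfg.model, system=prompt, user=question,
                     temperature=cfg.temperature)
    return {"model": cfg.model, "answer": reply}

The key design choice is that the agent reads its configuration at request time rather than baking it into the deployment, which is what makes model and prompt swaps possible without a release.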
-
🔹 Experimenting with Multi-Agent AI using CrewAI + Groq LLMs 🔹
I just built a simple but powerful pipeline where multiple AI agents collaborate to complete a task — powered by CrewAI and Groq’s blazing-fast LLaMA 3.1. ⚡
💡 What the system does:
A Researcher Agent gathers insights on the latest Generative AI trends in 2025.
A Writer Agent takes that research and turns it into a structured, human-like article.
Both agents are orchestrated by a Crew, sharing context and working together seamlessly.
🔧 Tech stack:
crewai → Orchestrating multi-agent workflows
langchain_groq → Plugging in Groq LLMs (llama-3.1-8b-instant)
Python + a few lines of code to bring it all together
📌 Why this matters:
Multi-agent AI isn’t just about chatbots anymore — it’s about distributed intelligence, where specialized agents collaborate like a real-world team. This architecture has huge potential for:
Automated research & reporting
Complex decision-making systems
Autonomous workflows in enterprises
Here’s the best part: swapping in a different LLM backend is as simple as passing llm=... into your agent definition. Flexibility + speed = 🔥
🌟 Excited about where multi-agent frameworks like CrewAI are heading. This is just the beginning.
👉 Full code available here: https://lnkd.in/dz4aNXir
👉 Curious to hear — how do you see multi-agent AI being applied in real-world business workflows?
#AI #CrewAI #Groq #MultiAgentSystems #GenerativeAI #LLMs
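A rough sketch of what such a pipeline can look like. It assumes GROQ_API_KEY is set in the environment; depending on your crewai version you may need its built-in LLM wrapper (e.g. LLM(model="groq/llama-3.1-8b-instant")) instead of langchain_groq.

from crewai import Agent, Task, Crew
from langchain_groq import ChatGroq

# Groq-hosted model shared by both agents (assumes GROQ_API_KEY is set)
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0.3)

researcher = Agent(
    role="Researcher",
    goal="Gather insights on the latest Generative AI trends in 2025",
    backstory="An analyst who tracks the GenAI ecosystem.",
    llm=llm,
)
writer = Agent(
    role="Writer",
    goal="Turn the research into a structured, human-readable article",
    backstory="A technical writer who favours clarity over jargon.",
    llm=llm,
)

research_task = Task(
    description="Summarise the most important Generative AI trends of 2025.",
    expected_output="A bullet-point list of trends with one-line explanations.",
    agent=researcher,
)
write_task = Task(
    description="Write a short article based on the research summary.",
    expected_output="A short article with headings.",
    agent=writer,
)

# The Crew runs the tasks in order and passes context between the agents
crew = Crew(agents=[researcher, writer], tasks=[research_task, write_task])
result = crew.kickoff()
print(result)

Swapping the backend is then a matter of constructing a different llm object and passing it into the Agent definitions, as the post notes.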
-
Breaking: OpenAI just dropped GPT-5 Codex CLI right as Anthropic recovers from their month-long model degradation nightmare 😳
Talk about kicking someone when they're down.
Why? Because while Claude Code users were struggling with broken suggestions and unreliable outputs, OpenAI quietly built the perfect competitor.
Here's what GPT-5 Codex CLI brings to the table:
– Truly Agentic: Works independently for hours, fixing tests and iterating without hand-holding
– Smart Resource Allocation: Quick for simple tasks, deep thinking for complex refactors
– Better Code Review: Finds high-impact bugs, skips the noise
– Visual Capabilities: Screenshots, mockup analysis, progress tracking
– Deep IDE Integration: Seamless VS Code extension bridging local and cloud work
Right now:
– The timing couldn't be more brutal for Anthropic.
– Claude Code spent a month delivering inconsistent, frustrating results.
– Their status page confirmed Sonnet 4 and Opus were affected for weeks.
Translation for developers: OpenAI swoops in with exactly what you've been craving—reliability over flashy features, consistent performance over cutting-edge promises, a tool that actually works when you need it.
Bonus twist for AI enthusiasts: This isn't just product competition anymore; it's a trust war. And OpenAI is winning by simply showing up when Anthropic stumbled.
Love it.
P.S. Check out 🔔aislice.substack.com🔔, it's the only no-nonsense AI newsletter for builders. Practical insights without the AI hype BS.
-
I boosted my AI Agent's performance by 184%
Using a 100% open-source technique:
Top AI Engineers never do manual prompt engineering!!
Today, I'll show you how to automatically find the best prompts for any agentic workflow you're building. And we'll use Comet's Opik to do so. 🚀
The idea is simple yet powerful:
1. Start with an initial prompt & eval dataset
2. Let the optimizer iteratively improve the prompt
3. Get the optimal prompt automatically! ✨
And this is done using just a few lines of code, as shown in the image below.
Why use Opik?
Opik is a 100% open-source LLM evaluation platform. It helps optimize LLM systems for better, faster, cheaper performance, from RAG chatbots to code assistants. Opik offers tracing, evaluations, and dashboards.
The best part is that everything can run 100% locally, because you can use any local LLM to power your optimizers and evaluators.
I've shared a link to their repo in the comments, where you can find more details.
____
Share this with your network if you found this insightful ♻️
Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
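The image referenced in the post is not reproduced here, so as an illustration only, here is a schematic of the optimize-and-evaluate loop being described. This is not the Opik API: run_agent(), score_prompt(), and propose_rewrite() are hypothetical stand-ins for the agent under test, an evaluator over the eval dataset, and an LLM-backed prompt rewriter.

def run_agent(prompt: str, user_input: str) -> str:
    # Stub: replace with your actual agent call (local LLM, API, etc.).
    return "stub answer"

def score_prompt(prompt: str, eval_dataset: list[dict]) -> float:
    # Run the agent with this prompt on every example and return accuracy.
    correct = sum(
        1 for ex in eval_dataset if run_agent(prompt, ex["input"]) == ex["expected"]
    )
    return correct / len(eval_dataset)

def propose_rewrite(prompt: str, score: float) -> str:
    # Stub for an LLM-backed rewriter that suggests an improved prompt
    # given the current prompt and its score.
    return prompt + "\nBe concise and cite the retrieved context."

def optimize(initial_prompt: str, eval_dataset: list[dict], rounds: int = 5) -> str:
    best_prompt = initial_prompt
    best_score = score_prompt(best_prompt, eval_dataset)
    for _ in range(rounds):
        candidate = propose_rewrite(best_prompt, best_score)
        candidate_score = score_prompt(candidate, eval_dataset)
        if candidate_score > best_score:  # keep only measurable improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt

A real optimizer adds smarter proposal strategies and statistics over multiple trials, but the loop shape (propose, evaluate against a fixed dataset, keep the winner) is the core idea.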
-
Akshay Pachaar really nails it here — this code perfectly illustrates one of the biggest challenges we face when building agents today, along with a clear path to address it. The distinction between prompting and metaprompting is such a valuable topic, and definitely worth exploring when optimizing our agent architectures. Great insights!
-
Google’s Gemini CLI AI agent has been integrated with the Zed code editor, bringing Gemini models directly into Zed’s Rust-based environment. The result, says Google, is a fast, responsive AI experience. https://lnkd.in/g7VdSmh6
-
This is the seventh article in the Agentic AI series, designed to help developers, innovators, and AI enthusiasts unlock the power of intelligent automation. From concept to deployment, this guide equips you with the knowledge and tools to build autonomous, reliable, and scalable multi-agent systems.
🔑 What you’ll learn:
➡️ Foundational Concepts – Agent autonomy, types (reactive, deliberative, hybrid), memory systems, and inter-agent communication.
➡️ Architectural Blueprint – How to design modular, scalable, and interoperable MAS with well-defined roles and messaging strategies.
➡️ Tooling & Tech Stack – Hands-on insights into LangChain, AutoGen, CrewAI, FAISS, Pub/Sub messaging, plus deployment with Docker & Kubernetes.
➡️ End-to-End Build – A full walkthrough: YAML-configured agent roles, Pub/Sub pipelines, FAISS-powered memory, GPT-4o reasoning, and containerised deployment.
🌏 Get ready to turn theory into practice and build production-grade multi-agent systems.
Read the full article here: https://lnkd.in/gX5weZY3
#AgenticAI #MultiAgentSystems #LLM #AIEngineering
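As a small, self-contained illustration of the "FAISS-powered memory" piece only: the embedding dimension and the embed() stub below are placeholders, and in the article's stack the vectors would come from a real embedding model.

import faiss
import numpy as np

DIM = 384  # placeholder embedding dimension

def embed(text: str) -> np.ndarray:
    # Stub: replace with a real embedding model (e.g. a sentence-transformer).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(DIM, dtype=np.float32)

class AgentMemory:
    """Vector store that agents can write observations to and query later."""

    def __init__(self):
        self.index = faiss.IndexFlatL2(DIM)
        self.texts: list[str] = []

    def remember(self, text: str) -> None:
        self.index.add(embed(text).reshape(1, -1))
        self.texts.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        if not self.texts:
            return []
        _, ids = self.index.search(embed(query).reshape(1, -1), min(k, len(self.texts)))
        return [self.texts[i] for i in ids[0]]

memory = AgentMemory()
memory.remember("The pricing agent reported a 5% increase in plan B.")
print(memory.recall("What changed about pricing?"))

In a multi-agent system this store typically sits behind a shared service so that agents communicating over Pub/Sub can read and write the same memory.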
-
Experienced many of these in production, including model changes and API changes overnight. Without stability, evaluation is hard, and building real products people rely on is harder.
For the past year, we have been evaluating AI agents for the Holistic Agent Leaderboard (hal.cs.princeton.edu). In the process, I’ve realized that we’re probably in the “Windows 95 era” of AI agent evaluations: there is an urgent need to improve standardization. Some of the challenges to look out for:
1. High evaluation costs prevent uncertainty estimation. Some benchmarks cost thousands of dollars per model to evaluate. Running multiple trials is prohibitively expensive.
2. Providers swap model weights behind stable endpoints. Together AI changed their DeepSeek R1 endpoint to serve DeepSeek R1 0528 on release day, keeping the same API endpoint name.
3. API changes break backward compatibility. When OpenAI released o4-mini and o3, they removed support for the "stop_keyword" argument that many agent scaffolds relied on. This forced agent developers to update their code.
4. Provider aggregators can serve different quantization. OpenRouter could serve a model with FP4 for one call and FP8 for another, routing requests to different providers without notifying users.
5. Errors can create false negatives. When agents hit rate limits but don't implement retries or properly surface the error, they fail silently.
6. Provider rate and spend limits constrain evaluation scale. Anthropic's default spend limit was just $5k per month even at the highest spending tier, requiring special approval for larger evaluation runs.
7. Critical infrastructure relies on hardcoded hacks. LiteLLM hardcoded whether models could use reasoning effort through a regex that only matched OpenAI's o-series model names, but it didn’t match GPT-5, leading to errors on launch day.
8. Reasoning effort settings aren't comparable across providers. Providers define "low," "medium," and "high" reasoning effort differently. LiteLLM maps "high" to 4096 reasoning tokens, while OpenAI doesn't disclose what their settings mean.
9. Provider APIs lack standardization for identical capabilities. OpenRouter uses a different parameter format for setting reasoning effort than the native OpenAI API, even when serving the exact same model.
10. Task specifications and agent scaffolds are improperly entangled. AssistantBench includes instructions like "don't guess the answer" directly in benchmark tasks, when these should be part of the agent scaffold.
11. It is hard to ensure computational reproducibility within a rapidly changing ecosystem. Reproducibility requires frozen dependencies so that results can be compared. When new models are released, there is a tradeoff between using the latest library versions at the expense of computational reproducibility, or hotpatching older libraries and incurring technical debt.
12. Upstream bugs in logging infrastructure can block evaluation for months. Weave, Wandb's logging library, had a bug that took months to resolve despite direct access to their engineering team. These dependencies create bottlenecks that evaluation frameworks can't work around.
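On point 5 (silent failures from rate limits), a minimal retry-with-backoff wrapper is often enough to turn a silent failure into either a successful call or a loud error. A sketch, with RateLimitError standing in for whichever exception your provider's SDK actually raises:

import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def call_with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # surface the error instead of failing silently
            # Exponential backoff with a little jitter to avoid retry storms
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))

# Usage (illustrative): result = call_with_retries(lambda: client.chat.completions.create(...))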
-
Even if you’re not deep into the emerging tech around AI agents, this is worth a read. Evals are where the rubber hits the road - or as this list implies, the sh*t hits the fan.
As a catalogue of challenges facing teams trying to properly evaluate agents, ostensibly these are engineering and architectural issues, but they read to me like huge questions for product people in this space. Will my product be reliable? Can my product be delivered in a sustainable and economical way? If my customers could see what’s actually happening beneath the surface of the product, would it match their expectations? Fundamentally, can what I’m selling deliver on its promise?
We’re super early stage on all of this, sure, and we’ll only get there by trying and failing - but all this stuff matters. Product people have a huge role to play in ensuring that each part of the stack, from user/business facing down through to the infrastructure, works as consistently, reliably and safely as you’d hope if you’re building a business on top of it.
-
We have started building an AI support agent system and are gradually improving it together with human involvement.
In recent days, we have been discussing with a client how to design a system of AI agents that can answer customer questions with minimal human participation. For the tech stack, we chose Python and FastAPI for the server, vector search (Weaviate or Elasticsearch), and a separate storage for operational data such as prices and schedules.
How does it work? When a customer asks a question, the agent searches the knowledge base and, if necessary, accesses up-to-date data. To retrieve the most current and continuously updated information (for example, working hours or center data from Google Sheets), the agent uses RAG — Retrieval-Augmented Generation. This ensures that the system always pulls in fresh numbers and facts rather than relying only on a static knowledge base.
After that, the response is checked by AI correctors using a simple three-point scale (from -1 to 1). Importantly, the AI correctors themselves are also selectively evaluated by a human operator. This not only allows us to monitor the system’s quality but also continuously refine prompts for both agents and correctors. As a result, the system is not static — it evolves over time with human input.
Ultimately, the customer always receives an accurate and verified answer, while humans intervene only when the system truly requires clarification or retraining.
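A schematic of the answer-then-grade flow described above. Every function here is a hypothetical stand-in: retrieval, the answering LLM, the corrector, and the escalation channel would be your own components.

def retrieve(question: str) -> list[str]:
    # Stand-in for vector search (Weaviate/Elasticsearch) plus fresh
    # operational data (e.g. working hours pulled from Google Sheets).
    return ["The center is open 9:00-18:00 on weekdays."]

def draft_answer(question: str, context: list[str]) -> str:
    # Stand-in for the answering agent's LLM call over the retrieved context.
    return "We are open 9:00-18:00 on weekdays."

def grade_answer(question: str, context: list[str], answer: str) -> int:
    # Stand-in for the AI corrector: -1 = wrong, 0 = unsure, 1 = good.
    return 1

def escalate_to_human(question: str, answer: str, score: int) -> str:
    # Low-confidence or wrong answers go to a human operator, whose feedback
    # is later used to refine prompts for both the agent and the corrector.
    return f"[escalated to operator] draft: {answer!r} (score={score})"

def handle_question(question: str) -> str:
    context = retrieve(question)
    answer = draft_answer(question, context)
    score = grade_answer(question, context, answer)
    if score < 1:
        return escalate_to_human(question, answer, score)
    return answer

print(handle_question("When are you open?"))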
-
Exciting advancements in AI debugging. 🌟