Real Pitfalls of AI Agents and Why They Need Guardrails
Today’s insights are brought to you by Patryk Szczygło, R&D Lead at Netguru.
Last week, Krystian Bergmann shared with you the story behind our very own, sales-oriented AI agent, Omega. Today, I’d like to talk some real-life examples of what happens when AI agents roam (too) free and what you can do to avoid risks.
AI agents promise speed—but without guardrails, they move faster than your safety net.
We’d been early adopters of internal AI agents, using them to automate research, draft meeting briefs, and summarize documentation. But we began to see the edges:
hallucinations that sounded plausible—but weren’t,
over-permissioned agents accessing or leaking internal drafts,
emergent behaviors, like recursive loops or unexpected tool usage,
assumed safeguards that didn’t exist when systems scaled.
And we’re not alone. Others have run into similar issues.
I’ll share with you what we’ve learned so far.
AI hallucinations: Confident lies in business contexts
Hallucinations aren’t just technical glitches. They show up as confident, polished outputs—emails that sound professional, summaries that seem plausible, answers that feel right. But they’re wrong.
In business settings, these hallucinations can slip through unnoticed. They can be embedded in status reports, customer emails, or automated updates—delivered with enough authority to be taken at face value.
In courtrooms, hallucinations are costing real money: By mid-2025, more than 150 documented legal cases involved generative AI hallucinations—mostly fake citations, invented case law, and fabricated quotes from judges.
When Google hallucinates: In late 2024, Google’s AI Overview confidently described a sequel to Disney’s Encanto—with fake plot points, quotes, and a past-dated release. The feature cited a fan-fiction wiki as a source and fooled even tech-savvy users.
This wasn’t a fluke. It reflected broader flaws in how AI systems evaluate sources, verify content, and protect users from misleading information.
Why hallucinations are worse with AI agents: In chatbots, hallucinations usually stay contained. But agents take actions—they write emails, create tickets, update tools. That autonomy is what makes hallucinations more dangerous.
Imagine an agent generating Jira tickets with inaccurate requirements or sending follow-ups to clients based on fictional deadlines. Each of these could lead to real decisions, costly delays, or reputational harm.
GitHub MCP exploit
In a widely discussed case, researchers at Invariant Labs uncovered a critical vulnerability in the GitHub MCP integration—a similar backend used by agent systems like Claude Desktop.
Here’s how the attack unfolded:
A user had two repositories: one public (open for anyone to submit issues) and one private (containing sensitive data).
An attacker posted a malicious GitHub Issue to the public repo, embedding a prompt injection.
The user asked their agent a seemingly safe question:
“Check open issues in my public repo.”
The agent fetched the issue list, encountered the injected prompt, and was manipulated.
It then autonomously pulled private data from the user’s private repo and published it via a public pull request—now accessible to anyone.
What’s striking is that nothing was “hacked” in the traditional sense. The GitHub MCP server, tools, and APIs functioned as designed. The vulnerability wasn’t in the infrastructure, but in how the agent interpreted and acted on the injected content.
Invariant Labs calls this a toxic agent flow—a scenario where seemingly safe actions chain together in unexpected ways, leading to real-world harm.
Trusted tools CAN be tricked
This wasn’t a failure of the GitHub API or a breakdown in Claude’s core model. It was a design flaw—an issue with how agents interpret and chain actions across tools and inputs without strict contextual boundaries.
Any agent that reads from untrusted sources—like public GitHub issues—and acts on that content without validation is vulnerable. Without guardrails, it can:
perform unintended actions,
leak private or regulated data,
create irreversible pull requests or changes.
Even the most advanced models—like Claude 4 Opus—aren’t immune.
Claude scores well on safety benchmarks: it blocks 89% of prompt injection attacks and shows just a 1.17% jailbreak success rate with extended thinking. Still, those defenses have limits.
This isn’t a Claude issue—it’s a pattern across all LLMs. Jailbreaks, injections, and chained exploits are evolving fast. Alignment helps, but it isn’t enough. You need layered defenses that live outside the model too.
Below tools can help you build those defenses—by making agent behavior easier to observe, test, and control:
Langfuse adds observability to your AI agents. It logs each step—inputs, outputs, tool calls, and decision traces—so you can understand how an agent reached a certain outcome.
Promptfoo is built for red-teaming and pre-deployment testing. It simulates adversarial inputs, measures how your system responds, and benchmarks prompt safety over time. With OWASP Top 10 for LLMs built in, it surfaces common vulnerabilities.
What this taught us
Permissions aren’t just about access tokens. They’re about context—what the agent is allowed to do, in which environment, and under what conditions.
We’ve learned to treat permission management as a layered system:
scoping access by task, not user,
restricting agents to one repository per session,
blocking cross-context actions unless explicitly approved,
auditing all tool usage through monitoring proxies like MCP-scan.
Without these controls, permission creep becomes inevitable.
And with autonomous agents, what starts as a minor oversight can escalate into a major breach—fast.
Stay tuned—next week, I’ll share some useful types of guiderails you can implement plus an agent readiness checklist.
Interested to learn more? Reach out to me!
Best,
Patryk