Day One Mindset: Month 17
What kind of helping hand to expect? (AI generated)

Last month, we looked at trade-offs when building or using AI agents (embodied vs. disembodied agents, speed vs. accuracy, reliability vs. adaptability, and “trust but verify”). This month, we look into even more controversial topics.

Engineering driven versus product driven

Much of today’s engineering effort is focused on stitching together intelligent systems to enable richer and more complex use cases. But working backwards from the actual jobs users are trying to accomplish remains critical to make these systems useful, intuitive and trustworthy in real-world contexts.

As Sandi Besen (AI research engineer at IBM) points out, there is no universally “best” protocol among Anthropic’s Model Context Protocol (MCP), IBM’s Agent Communication Protocol (ACP), and Google’s Agent2Agent (A2A). Each reflects a slightly different architectural philosophy: MCP is designed for controlled, tool-augmented agent environments; ACP enables peer-to-peer communication between autonomous agents; and A2A follows a client–remote-agent model with structured capability sharing. Encouragingly, collaboration is starting to drive meaningful convergence.

Ideally, this will help address persistent user concerns: ensuring secure, authenticated collaboration between agents; expanding the types of user interface elements agents can exchange; and enabling cost or time boundaries for fulfilling a request. For instance, a user might want an AI agent to book a flight only within a given budget, with the agent presenting meaningful on-screen trade-offs if the budget cannot be met.
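
To make that concrete, here is a minimal Python sketch of how a request with cost and time boundaries might be expressed. The names (AgentTaskRequest, TaskConstraints) are purely illustrative; none of the protocols above currently mandates this exact shape.

```python
from dataclasses import dataclass


@dataclass
class TaskConstraints:
    """Boundaries the user attaches to a delegated task."""
    max_cost_usd: float | None = None      # hard spending limit
    deadline_minutes: int | None = None    # give up (or check in) after this long
    require_confirmation_above_usd: float = 0.0  # escalate to the user past this amount


@dataclass
class AgentTaskRequest:
    """A user-facing request handed to a booking agent (illustrative only)."""
    goal: str
    constraints: TaskConstraints
    preferred_ui: str = "cards"  # how trade-offs should be surfaced if the goal can't be met


request = AgentTaskRequest(
    goal="Book a round-trip flight NBO -> LHR, departing next Monday",
    constraints=TaskConstraints(
        max_cost_usd=900,
        deadline_minutes=30,
        require_confirmation_above_usd=600,
    ),
)

# An agent honoring these constraints would either complete the booking
# autonomously, or return a small set of on-screen trade-offs (price vs.
# layovers vs. departure time) when the budget cannot be met.
print(request)
```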

In the medium term, AI might even accelerate the protocol specification itself to address a wider variety of use cases.

These protocols are promising, but still fall short of what users will expect in production. Overall, the pace of progress is rapid, and alignment around common standards could unlock a more cohesive agent ecosystem sooner than expected.

Interoperability versus permission model

Autonomous agents will need to interact with both machines and humans to complete complex tasks. This requires interoperability across a wide range of platforms: Slack, mobile apps, browsers, APIs and other agents.

However, these systems typically require authentication and fine-grained authorization. Giving agents unrestricted access to all data is not viable, and most companies will not deploy agents without safeguards. Instead, they’ll demand strong permission controls tailored to specific roles and scopes.

Consider a practical case: an AI assistant helping an employee plan a business trip. It may need to query the corporate travel platform for flights, check the user’s calendar for availability, schedule meetings with colleagues, and find a hotel near the office, while respecting policy constraints on budget and data access. The agent must navigate multiple APIs, but only with access to narrowly scoped data: for example, the user’s own calendar, read-only flight options and restricted booking authority. This scenario illustrates why interoperability must be paired with role-aware, auditable permissions.
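
A minimal sketch of what such role-aware, auditable permissions could look like in code. The Scope class, the grant set, and the authorize helper are hypothetical, not part of any specific protocol or vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    resource: str   # e.g. "calendar:self", "flights:search", "flights:book"
    action: str     # "read" or "write"


# Permissions granted to the travel-planning agent acting for one employee.
AGENT_GRANTS = {
    Scope("calendar:self", "read"),
    Scope("flights:search", "read"),
    Scope("hotels:search", "read"),
    Scope("flights:book", "write"),   # booking still capped by travel policy elsewhere
}


def authorize(scope: Scope, grants: set[Scope] = AGENT_GRANTS) -> None:
    """Refuse any call outside the agent's narrowly scoped grants, and log the rest."""
    if scope not in grants:
        raise PermissionError(f"Agent not authorized for {scope.action} on {scope.resource}")
    print(f"AUDIT: allowed {scope.action} on {scope.resource}")  # audit trail


authorize(Scope("calendar:self", "read"))        # allowed and logged
# authorize(Scope("calendar:colleague", "read")) # would raise PermissionError
```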

Some startups have recently begun providing tools for authentication with third-party services. Enterprises also often need to integrate sensitive, non-public systems. The Model Context Protocol (MCP) enables interoperability by connecting generative AI agents with external systems. While its initial focus was tool integration, MCP has evolved to support agent-to-agent interactions as well. As of June 2025, AWS has released an authentication SDK for secure cloud hosting of MCP servers, along with an updated specification featuring a comprehensive authentication approach.
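
As a rough illustration of the pattern (not AWS’s or Anthropic’s actual SDK), the sketch below sends an MCP-style JSON-RPC tools/list request to a hypothetical server endpoint using an OAuth-style bearer token. The URL and token are placeholders; a real deployment would obtain the token through the provider’s authentication flow.

```python
import requests

# Hypothetical endpoint and token; real deployments would obtain the token via
# the provider's OAuth flow rather than hard-coding it.
MCP_ENDPOINT = "https://mcp.example.internal/mcp"
ACCESS_TOKEN = "<oauth-access-token>"


def list_tools() -> dict:
    """Ask an MCP server which tools it exposes (JSON-RPC 'tools/list')."""
    payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
    resp = requests.post(
        MCP_ENDPOINT,
        json=payload,
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/json, text/event-stream",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


# print(list_tools())  # commented out: the endpoint above is a placeholder
```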

Autonomy versus human control

Autonomy in AI agents is key to minimizing human effort in task completion. This is already feasible for trivial tasks that would not take long for a human to perform. For more complex or sensitive tasks, agents are more likely to seek confirmations and clarifications, increasing human control.

A recent paper from Model Evaluation & Threat Research (METR) proposes a new metric, the 50%-task-completion time horizon: the time humans typically take to complete tasks that AI models can complete with a 50% success rate. On a variety of text-oriented tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. This time horizon has been doubling approximately every seven months since 2019, driven by improved reliability, adaptability to mistakes, logical reasoning, and tool-use capabilities.

If these results generalize to real-world software tasks, extrapolating this exponential trend predicts that within five years, AI systems will be capable of automating many software tasks that currently take humans a month. While AGI might imply an infinite time horizon in theory, reaching one-month work equivalence would mark a major practical milestone given typical corporate feedback loops today (weekly check-ins, with substantial reviews often happening monthly).
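
As a quick back-of-the-envelope check of that extrapolation, assuming the roughly 50-minute horizon and seven-month doubling time cited above, and treating one work month as about 167 hours (an assumption, not a figure from the paper):

```python
import math

current_horizon_min = 50          # ~50% time horizon reported for frontier models
doubling_period_months = 7        # observed doubling time since 2019
target_horizon_min = 167 * 60     # one work month ~= 167 hours (assumption)

doublings_needed = math.log2(target_horizon_min / current_horizon_min)
months_needed = doublings_needed * doubling_period_months

print(f"{doublings_needed:.1f} doublings ~= {months_needed:.0f} months "
      f"(~{months_needed / 12:.1f} years)")
# -> roughly 7.6 doublings, i.e. ~53 months (~4.5 years), consistent with
#    the "within five years" extrapolation above.
```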

Even then, agents will still need clear instructions and some level of human control. Unless an AI agent is specialized in a specific task, it will likely require similar instructions and guidelines as a non-specialized co-worker.

Finally, Claude Opus 4’s system card (refer to section 5.5 - worthwhile and somewhat entertaining) provides a remarkable example of AI agents conversing with each other without any human involvement: the conversation gravitates toward spiritual and mystical themes, revealing a surprisingly strong and unexpected attractor state that emerged without intentional training.

Conversational versus form-based interactions

A key aspect of user experience is the variety of modalities for interacting with intelligent agents. Most consumers are familiar with basic chatbots (text boxes) or voice assistants offering limited feedback. The likely future is a smart fusion of these modalities: conversational interfaces are fluid and natural but often ambiguous and unstructured, whereas form-based UIs are more rigid but reliable for repeatable tasks or when users need to view or manipulate complex datasets (e.g. a shopping cart).

As agents operate continuously across different apps, new information may surface unexpectedly, either through conversational interfaces or structured UIs. In smart assistant environments, these may take the form of resizable, user-configurable widgets on the home screen. When users aren't looking at their devices, audio cues or conversational updates can convey relevant events. Ideally, AI agents will evolve beyond today’s blunt notification settings and become smart enough to filter and prioritize truly important updates.

Selecting the appropriate modality will depend heavily on understanding the user’s location, activities, focus or state of mind. Someone commuting may prefer concise, glanceable interactions with minimal input, while the same person at home might welcome a deeper conversation or more options to explore. Context-awareness thus becomes crucial: not just understanding what the task is, but knowing when and how to engage in ways that reduce cognitive load rather than add to it.
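
A toy Python sketch of such context-aware modality selection. The UserContext fields and the thresholds are invented for illustration, not drawn from any shipping assistant.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    location: str          # "commuting", "home", "office", ...
    looking_at_screen: bool
    do_not_disturb: bool
    urgency: int           # 0 (ignorable) .. 3 (time-critical)


def choose_modality(ctx: UserContext) -> str:
    """Pick how an update should be surfaced; the rules are illustrative."""
    if ctx.do_not_disturb and ctx.urgency < 3:
        return "defer"                 # hold until the user is available
    if not ctx.looking_at_screen:
        return "audio_summary"         # short spoken cue
    if ctx.location == "commuting":
        return "glanceable_widget"     # minimal-input, one-tap actions
    return "conversational"            # full dialogue with options to explore


print(choose_modality(UserContext("commuting", True, False, 1)))  # glanceable_widget
print(choose_modality(UserContext("home", True, False, 1)))       # conversational
```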

Personalization versus privacy

Like traditional web and mobile applications, AI agents face a persistent trade-off between personalization and privacy. While personalized experiences often depend on access to rich, contextual data, users increasingly expect transparency about what data is used and want the ability to control, limit, or delete it.

One promising direction is to enable data sharing across actors without exposing the actual content. Homomorphic encryption allows calculations to be performed directly on encrypted data, preserving privacy while still enabling utility. However, its practical use is currently limited by performance constraints, and it will likely only become feasible with specialized hardware and broader industry adoption. Until then, privacy-preserving approaches will rely on keeping data compartmentalized within apps and gradually shifting AI processing toward on-device execution.

Other techniques are also gaining traction. Federated learning allows models to be trained locally on-device, with only model updates (not raw data) sent back for aggregation. This reduces data exposure while still benefiting from distributed learning. Differential privacy, meanwhile, adds carefully calibrated noise to user data or model outputs to prevent individual identification, even in aggregate datasets. While each of these approaches has its trade-offs—in terms of accuracy, performance, and implementation complexity—they show a growing toolbox of solutions for navigating the personalization-privacy boundary.
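
For instance, a differentially private count can be produced by adding Laplace noise calibrated to the query’s sensitivity. The sketch below is a textbook illustration using numpy, not a production differential-privacy library.

```python
import numpy as np

rng = np.random.default_rng(0)


def dp_count(values: list[bool], epsilon: float = 1.0) -> float:
    """Differentially private count: add Laplace noise scaled to the query's sensitivity.

    A counting query changes by at most 1 when one person is added or removed,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


# 1,000 simulated users, ~30% of whom have some sensitive attribute.
users = rng.random(1000) < 0.3
print("true count:", int(users.sum()))
print("DP count (epsilon=1.0):", round(dp_count(list(users), epsilon=1.0), 1))
```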

In the short term, we expect market-driven choices to lean toward privacy-conscious solutions that don’t sacrifice core utility. Encouragingly, this remains a technically solvable problem in the mid-term, provided the right infrastructure and incentives align.

Cost predictability versus outcome-based business models

In recent years, AI has benefited from virtually unlimited investments from big tech companies and venture capital. While consumers have reaped the benefits from these subsidies, this may not last forever. At the same time, AI model companies have voraciously scraped the web to feed ever-larger models with humanity’s accumulated knowledge. This has come at a cost to publishers, who have simultaneously seen a drop in referral traffic.

Publishers now want compensation for the content scraped from their sites. As of July 2025, Cloudflare has started a private beta allowing publishers to charge AI bots for access, relying on bot authentication to prevent spoofing. Cloudflare also hints at a future in which AI agents negotiate content prices, potentially enabling cost predictability as current subsidies and free bot access end.

As AI moves from exploration to deployment, some consumers and companies might shift from API usage-based pricing to outcome-based pricing.

However, verifying outcomes at scale is challenging. This requires rethinking software and LLM development approaches. A recent paper proposes Evaluation-Driven Development (EDD), embedding continuous, adaptive, and actionable evaluation throughout the agent lifecycle with real-time feedback and post-deployment monitoring to ensure agents remain safe and responsive.
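
This is not the EDD paper’s framework, but a minimal sketch of the pattern it describes: a registry of checks that scores every completed agent run so failures can trigger alerts or rollbacks. The check names and trace fields are illustrative.

```python
from typing import Callable


# Illustrative evaluation checks run continuously against live agent traces.
def within_budget(trace: dict) -> bool:
    return trace.get("cost_usd", 0.0) <= trace.get("budget_usd", float("inf"))


def no_policy_violation(trace: dict) -> bool:
    return not trace.get("flags", [])


CHECKS: dict[str, Callable[[dict], bool]] = {
    "within_budget": within_budget,
    "no_policy_violation": no_policy_violation,
}


def evaluate(trace: dict) -> dict[str, bool]:
    """Score one completed agent run; failures would feed alerts or rollbacks."""
    return {name: check(trace) for name, check in CHECKS.items()}


trace = {"cost_usd": 42.0, "budget_usd": 50.0, "flags": []}
results = evaluate(trace)
print(results)  # {'within_budget': True, 'no_policy_violation': True}
if not all(results.values()):
    print("-> page the on-call owner / roll back the agent version")
```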

In the more distant future, agent requests might be routed like today’s Internet DNS requests, optimizing for multiple outcomes (e.g. latency), as seen in Amazon’s Route 53 or NetScaler’s Intelligent Traffic Management.

Generalist versus specialist models

Currently, generalist Large Language Models running on cloud infrastructure dominate AI deployment, handling a wide range of tasks. Specialist models, on the other hand, can be more cost-efficient and performant within narrower domains, at the expense of flexibility.

The most generalist models today are frontier language models. Fine-tuning these models yields impressive examples like Centaur, a computational model capable of predicting and simulating human behavior across psychological (text-based) experiments. Trained on over 10 million choices from 160 experiments, Centaur generalizes to unseen cover stories, structural task modifications, and new domains, performing at a level that can help cognitive scientists improve the design of future psychological tests.

However, LLMs are not always the best solution to a problem. A recent Nature article shows that traditional deep learning models can outperform LLMs when predicting personality associations. The study found that a specialized AI system trained on personality data (PersonalityMap) outperformed both generalized AI systems and aggregate expert estimates.

In some cases, LLMs fail spectacularly: when asked for a random number between 1 and 50, they often prefer 27, a bias that is easy to replicate. LLMs must get better at recognizing when to invoke appropriate external tools.

Even with inherent LLM issues resolved, small language models (SLMs) are often powerful enough at a much lower cost. A recent paper from Nvidia argues that most agentic subtasks are repetitive, scoped, and non-conversational, making SLMs preferable due to lower latency, reduced memory and computational requirements, and significantly lower operational costs, all while maintaining adequate task performance. Their small size allows edge deployment (e.g. in consumer devices) with low latency and makes pre-training and fine-tuning easier and faster, allowing quicker adaptation to consumer and developer needs.

As additional tools and diverse SLMs evolve, semantic routers have recently emerged. These systems examine request content to direct specific problems to the right tools (e.g. math modules) and route simpler queries to smaller, lower-cost models. Some tools can also cache query responses for semantically similar queries to boost efficiency and conserve resources.
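
A toy semantic router: embed the incoming query, compare it against route descriptions, and dispatch to the best match, falling back to a frontier model when nothing is close. The hand-made three-dimensional “embeddings” below stand in for a real sentence-embedding model, and the routes and threshold are purely illustrative.

```python
import numpy as np

# Toy embeddings: in practice these would come from a sentence-embedding model.
ROUTES = {
    "math_tool": np.array([0.9, 0.1, 0.0]),       # arithmetic, unit conversion, ...
    "small_model": np.array([0.1, 0.9, 0.1]),     # short factual / formatting queries
    "frontier_model": np.array([0.2, 0.3, 0.9]),  # open-ended reasoning
}


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def route(query_embedding: np.ndarray, threshold: float = 0.6) -> str:
    """Send the query to the closest route; fall back to the frontier model."""
    best_name, best_score = max(
        ((name, cosine(query_embedding, emb)) for name, emb in ROUTES.items()),
        key=lambda item: item[1],
    )
    return best_name if best_score >= threshold else "frontier_model"


print(route(np.array([0.85, 0.2, 0.05])))  # -> math_tool
print(route(np.array([0.15, 0.8, 0.2])))   # -> small_model
```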

In summary, the shift towards smarter routing and smaller models may ultimately define the practical frontier of AI in everyday applications.

Where We’re Headed

In just a few quarters, agent systems have evolved from demos to participants in software creation, task coordination, and communication. As the trade-offs shift, from speed to trust and from generality to control, it’s not enough to ask what agents can do. As we move toward practical implementations, we need to make trade-offs that don’t alienate users.

Next month, I plan to wrap up this series by zooming out to look at how this evolving set of choices and technologies might shape our lives in the next few years.

James Mutiso

Great piece, but I think one critical element is missing from the discussion: organizational readiness. A lot of these “trade-offs” feel like luxury problems. Issues that matter once you’ve nailed adoption at scale. Most organizations I see aren’t anywhere near that stage. They’re still struggling with basic data readiness, process clarity, and governance. Without those, interoperability protocols or autonomy levels are academic debates. Also, the focus on generalist vs. specialist models assumes the model is the primary bottleneck. In reality, the bottleneck is almost always organizational change. Training people to work effectively with agents, restructuring workflows and aligning incentives. The more urgent question is: How do we prepare organizations to work with AI agents in the first place?
