Agent Observability: Latency vs. Throughput for Agentic AI Executions


1. Introduction

The discussion around ChatGPT (and generative AI in general) has now evolved into agentic AI. While ChatGPT is primarily a chatbot that generates text responses, AI agents can execute complex tasks autonomously, e.g., make a sale, plan a trip, book a flight, hire a contractor for a house job, or order a pizza. The figure below illustrates the evolution of agentic AI systems.

Fig: Agentic AI evolution

Bill Gates recently envisioned a future where we would have an AI agent that is able to process and respond to natural language and accomplish a number of different tasks. Gates used planning a trip as an example.

Ordinarily, this would involve booking your hotel, flights, restaurants, etc. on your own. But an AI agent would be able to use its knowledge of your preferences to book and purchase those things on your behalf.

In this article, we deep-dive into the inferencing aspects of agentic AI, e.g., observability, latency, throughput, and non-determinism, which are critical to deploying multi-agent systems (MAS) at scale.

We first consider the dimensions impacting LLM inferencing: input and output context window size, model size / precision, token latencies (first-token, inter-token, last-token), and throughput (detailed in Section 2).

We then extrapolate the same to agentic AI:

  • mapping token latency to latency of executing the first agent vs. the full agentic orchestration
  • considering the output of the (preceding) agent, together with the overall execution state / contextual understanding, as part of the input context window of the following agent; and finally
  • accommodating the inherent non-determinism in agentic executions.

2. LLM Inferencing

LLM inference sizing depends on many use-case dimensions, e.g.,

  • input and output context window: at a high level, words are converted into tokens, and models like Llama run on about 4k-8k tokens, or roughly 3000–6000 words in English (a rough token-count sketch follows this list).
  • model size: are we running the model at full precision, or a quantized version?
  • first-token latency, inter-token latency, last-token latency; and, finally
  • throughput: defined as the number of requests an LLM can process in a given period.
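
To make the context-window dimension concrete, the following is a minimal Python sketch that estimates token counts using the common ~0.75 words-per-token rule of thumb for English and checks whether a prompt plus its expected completion fits within a 4k/8k window. The ratio and the helper names are illustrative assumptions, not a real tokenizer; in practice, use the model's own tokenizer.

```python
# Rough sketch: estimate whether a prompt fits a model's context window.
# The 0.75 words-per-token ratio is a common rule of thumb for English text,
# not an exact tokenizer; swap in the model's real tokenizer for production use.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Approximate token count from word count (English-centric heuristic)."""
    return int(len(text.split()) / words_per_token)

def fits_context(prompt: str, expected_output_tokens: int, context_window: int = 8192) -> bool:
    """Check that prompt + expected completion fit in the model's context window."""
    return estimate_tokens(prompt) + expected_output_tokens <= context_window

prompt = "Summarize the quarterly sales report for the EMEA region."
print(estimate_tokens(prompt), fits_context(prompt, expected_output_tokens=512))
```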

Let us consider the batch scenario first. Here, we mostly know our input and output context lengths, so the focus is on optimizing throughput. (Latency is not relevant here given the offline / batch nature of the execution.) To achieve high throughput:

  • Determine whether your LLM fits in a single GPU.
  • If not, apply pipeline / tensor parallelism to optimize the number of GPUs needed. Then increase the batch size to be as large as possible (see the sizing sketch below).
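
As a rough illustration of the sizing logic in the two bullets above, here is a back-of-the-envelope Python sketch, assuming an 8B-parameter-class model at FP16 on a single 80 GB accelerator. All numbers (layers, hidden size, sequence length) are illustrative assumptions, and the sketch ignores activation memory, grouped-query attention, and framework overheads.

```python
# Back-of-the-envelope sizing sketch for the batch scenario (all numbers are
# illustrative assumptions, not vendor guidance).

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16, 1 for INT8, 0.5 for 4-bit."""
    return params_billion * bytes_per_param

def kv_cache_gb(batch_size: int, seq_len: int, layers: int, hidden: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache = 2 (K and V) * layers * hidden * seq_len * batch, in GB."""
    return 2 * layers * hidden * seq_len * batch_size * bytes_per_value / 1e9

GPU_MEM_GB = 80                                   # e.g. a single 80 GB accelerator
model = dict(params_billion=8, layers=32, hidden=4096)

weights = weights_gb(model["params_billion"], bytes_per_param=2.0)   # FP16
gpus_for_weights = -(-weights // GPU_MEM_GB)                         # ceiling division

# With the weights placed, grow the batch until the KV cache exhausts
# the remaining GPU memory.
free_gb = gpus_for_weights * GPU_MEM_GB - weights
batch = 1
while kv_cache_gb(batch + 1, seq_len=4096, layers=model["layers"],
                  hidden=model["hidden"]) <= free_gb:
    batch += 1

print(f"weights: {weights:.0f} GB, GPUs: {int(gpus_for_weights)}, max batch ~{batch}")
```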

For the streaming scenario, we need to consider the trade-off between throughput and latency. To understand latency, let us take a look at the processing stages of a typical LLM request: Prefill and Decoding (illustrated in the below figure).

Fig: LLM processing stages: Prefill & Decoding

Prefill is the phase in which the input prompt is processed; it accounts for the latency between pressing ‘enter’ and the first output token appearing on the screen. Decoding is the phase in which the remaining tokens of the response are generated. In most requests, prefill takes less than 20% of the end-to-end latency, while decoding takes more than 80%.

Given this, most LLM implementations tend to send tokens back to the client as soon as they are generated — to reduce latency.

To summarize, in streaming mode, we primarily care about the time to first token, as this is the time during which the client is waiting for the first token. Afterwards, the following tokens are generated much faster, and the rate of generation is usually faster than the average human reading speed.
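
A minimal sketch of how these streaming metrics could be measured on the client side; `stream_tokens` is a hypothetical stand-in for any client that yields tokens as they arrive, and the fake stream below only simulates prefill and decode delays.

```python
import time

# Minimal sketch: measure time-to-first-token (TTFT), average inter-token
# latency, and end-to-end latency for a streaming response.

def measure_streaming_latency(stream_tokens):
    start = time.perf_counter()
    timestamps = []
    for _ in stream_tokens:                        # iterate tokens as they arrive
        timestamps.append(time.perf_counter())
    ttft = timestamps[0] - start                   # prefill-dominated latency
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # average inter-token latency
    e2e = timestamps[-1] - start                   # end-to-end latency
    return ttft, itl, e2e

# Usage with a fake stream that simulates prefill and per-token decode delays:
def fake_stream(n=20, prefill=0.5, per_token=0.03):
    time.sleep(prefill)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token)

print(measure_streaming_latency(fake_stream()))
```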

Note that for RAG pipelines, even the first-token latency can be significantly high, since RAG prompts typically approach the full context window once retrieved document chunks are added to the input. In sequential (non-streaming) mode, we have to wait for the end result, so we care about the end-to-end latency: the time to produce all the tokens in the output (response) sequence.

Finally, regarding the trade-off between latency and throughput: increasing the batch size (running multiple requests through the LLM concurrently) tends to make latency worse but throughput better. Of course, upgrading the underlying hardware / GPU can improve both throughput and latency. Refer to Nvidia’s tutorial on LLM inference sizing for a detailed discussion on this topic.
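
The following toy model illustrates this trade-off: as the batch size grows, per-request latency increases while aggregate token throughput improves. The linear decode-step cost is an illustrative assumption; real curves depend on the hardware and serving stack.

```python
# Toy model of the latency / throughput trade-off as batch size grows.
# Assumption (illustrative): per-step decode time grows roughly linearly with
# batch size once the GPU is busy; real behavior depends on hardware and kernels.

def decode_step_ms(batch_size, base_ms=20.0, per_request_ms=1.5):
    return base_ms + per_request_ms * batch_size

def metrics(batch_size, output_tokens=256):
    step = decode_step_ms(batch_size)
    latency_s = output_tokens * step / 1000                # per-request latency
    throughput = batch_size * output_tokens / latency_s    # tokens/sec across the batch
    return latency_s, throughput

for b in (1, 4, 16, 64):
    lat, tput = metrics(b)
    print(f"batch={b:3d}  latency={lat:5.1f}s  throughput={tput:6.0f} tok/s")
```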

3. Agentic AI Inferencing

In Section 2, we discussed in detail the sizing dimensions impacting a single LLM use-case. In this section, we extend the same to agentic AI, which can be considered an orchestration of multiple LLM use-cases / agents.

Andrew Ng recently talked about this aspect:

Today, a lot of LLM output is for human consumption. But in an agentic workflow, an LLM might be prompted repeatedly to reflect on and improve its output, use tools, plan and execute multiple steps, or implement multiple agents that collaborate. So, we might generate hundreds of thousands of tokens or more before showing any output to a user. This makes fast token generation very desirable and makes slower generation a bottleneck to taking better advantage of existing models.

Below we highlight the key steps in extrapolating LLM to agentic AI inferencing:


Fig: Multi-agentic AI inferencing dimensions

3.1 Agent Observability

Token latency maps to agent processing latency. The first-token versus end-to-end latency discussion maps, in this case, to the latency of executing the first agent versus the end-to-end execution latency of the full orchestration / decomposed plan.

We thus need to balance the requirement of streaming agent outputs as soon as each agent finishes its execution versus outputting the result only once execution of the full orchestration has terminated.
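
A minimal asyncio sketch of the two observability modes, assuming hypothetical agent names and delays: stream each agent's result as soon as it finishes, versus report only once the whole orchestration has terminated.

```python
import asyncio

# Sketch of the two observability modes discussed above (agent names and
# delays are hypothetical stand-ins for LLM / tool calls).

async def run_agent(name: str, delay: float) -> str:
    await asyncio.sleep(delay)          # stand-in for an LLM / tool call
    return f"{name}: done"

async def orchestrate(stream: bool = True):
    tasks = [run_agent("search", 0.5), run_agent("summarize", 1.0), run_agent("book", 1.5)]
    if stream:
        # "First-agent latency": emit each result the moment it is available.
        for finished in asyncio.as_completed(tasks):
            print(await finished)
    else:
        # "End-to-end latency": wait for the full plan before emitting anything.
        print(await asyncio.gather(*tasks))

asyncio.run(orchestrate(stream=True))
```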

For a detailed discussion, refer to my previous article on stateful representation of AI agents enabling both real-time and batch observability of the agentic orchestration.

3.2 Agentic Context Window Size

The output of one agent becomes the input of the next agent to be executed in a multi-agent orchestration. So it is very likely that (at least some part of) the preceding agent's output, together with the overall execution state / contextual understanding (stored in the memory management layer), will become part of the input context passed to the following agent, and this needs to be accounted for in the agentic context window sizing.
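
A sketch of how the aggregated context could be assembled and kept within budget, assuming the same rough words-per-token heuristic as earlier; the prioritization rule (keep the shared execution state, truncate the preceding agent's output) is an illustrative choice, not a prescribed policy.

```python
# Sketch: assemble the next agent's input from the preceding agent's output
# plus shared execution state, while respecting a context-window budget.
# Token counting reuses the rough words-per-token heuristic; names are illustrative.

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

def build_agent_input(task: str, prev_output: str, shared_state: str,
                      context_window: int = 8192, reserve_for_output: int = 1024) -> str:
    budget = context_window - reserve_for_output - estimate_tokens(task)
    # Prioritize the shared execution state; truncate the preceding agent's
    # output (dropping its oldest words) if the combined prompt would overflow.
    if estimate_tokens(shared_state) + estimate_tokens(prev_output) > budget:
        keep_words = int(max(budget - estimate_tokens(shared_state), 0) * 0.75)
        prev_output = " ".join(prev_output.split()[-keep_words:]) if keep_words else ""
    return f"{task}\n\n[execution state]\n{shared_state}\n\n[previous agent output]\n{prev_output}"

# Usage (hypothetical content):
next_input = build_agent_input("Book the cheapest flight found above.",
                               prev_output="search agent results ...",
                               shared_state="user prefers morning flights; budget 500 USD")
print(estimate_tokens(next_input))
```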

3.3 Non-determinism in Agentic AI Execution

Finally, we need to consider the inherent non-determinism in agentic AI systems. For example, let us consider the e-shopping scenario illustrated in the below figure.

Fig: E-shopping scenario with non-determinism

There are two non-deterministic operators in the execution plan: ‘Check Credit’ and ‘Delivery Mode’. The ‘Delivery Mode’ choice indicates that the user can either pick up the order directly from the store or have it shipped to their address. Given this, shipping is a non-deterministic task and may not get invoked during an actual execution.

To summarize, given the presence of ‘choice’ operators in an orchestration, we do not know the exact tasks / agents that will get executed as part of a specific execution.

Different strategies can be applied here, including flattening the full execution plan to determine the tasks / agents that can potentially get executed in the best-case and worst-case (peak) scenarios, as sketched below.
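
A small sketch of the flattening idea, assuming an illustrative plan representation (lists for sequences, tuples for choice operators) modeled on the e-shopping scenario; it enumerates every concrete path and picks the best-case and worst-case (peak) agent sets. The plan format is an assumption for illustration, not a standard representation.

```python
from itertools import product

# Sketch: "flatten" an execution plan containing choice operators to enumerate
# which agents can run in the best-case and worst-case (peak) scenarios.

plan = [
    "search_product",
    ("choice", ["check_credit"], []),                    # credit check may be skipped
    "place_order",
    ("choice", ["ship_to_address"], ["store_pickup"]),   # delivery mode
]

def flatten(plan):
    """Yield every concrete agent sequence the plan can produce."""
    options = [[step] if isinstance(step, str) else step[1:] for step in plan]
    for combo in product(*options):
        yield [a for part in combo for a in ([part] if isinstance(part, str) else part)]

paths = list(flatten(plan))
best = min(paths, key=len)    # fewest agents invoked
worst = max(paths, key=len)   # peak scenario: most agents invoked
print(f"possible paths: {len(paths)}, best case: {best}, worst case: {worst}")
```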

Refer to my ICAART 2024 paper (Constraints Enabled Autonomous Agent Marketplace: Discovery and Matchmaking) for a detailed discussion on the applicable strategies to accommodate non-determinism in agentic AI executions.

Debmalya Biswas

AI @ UBS | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA