Agent Observability: Latency vs. Throughput for Agentic AI Executions


1. Introduction

The discussion around ChatGPT (and generative AI in general) has now evolved into agentic AI. While ChatGPT is primarily a chatbot that generates text responses, AI agents can execute complex tasks autonomously, e.g., make a sale, plan a trip, book a flight, hire a contractor for a house job, or order a pizza. The figure below illustrates the evolution of agentic AI systems.

Fig: Agentic AI evolution

Bill Gates recently envisioned a future where we would have an AI agent that is able to process and respond to natural language and accomplish a number of different tasks. Gates used planning a trip as an example.

Ordinarily, this would involve booking your hotel, flights, restaurants, etc. on your own. But an AI agent would be able to use its knowledge of your preferences to book and purchase those things on your behalf.

In this article, we deep-dive into the inferencing aspects of agentic AI, e.g., observability, latency, throughput, and non-determinism, which are critical to deploying multi-agent systems (MAS) at scale.

We first consider the dimensions impacting LLM inferencing: input and output context window size, model size / precision, token latencies (first-token, inter-token, last-token), and throughput (detailed in Section 2).

We then extrapolate the same to agentic AI:

  • mapping token latency to latency of executing the first agent vs. the full agentic orchestration
  • considering the output of the (preceding) agent, together with the overall execution state / contextual understanding, as part of the input context window of the following agent; and finally
  • accommodating the inherent non-determinism in agentic executions.

2. LLM Inferencing

LLM inference sizing depends on many use-case dimensions, e.g.,

  • input and output context window: at a high level, words are converted into tokens, and models like Llama run on about 4k-8k tokens, or roughly 3000–6000 words in English (a rough token-count sketch follows this list).
  • model size: are we running the model at full precision, or a quantized version?
  • first-token latency, inter-token latency, last-token latency; and, finally
  • throughput: defined as the number of requests an LLM can process in a given period.
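
To make the context-window dimension concrete, the following is a minimal Python sketch that estimates token counts using the common ~0.75 words-per-token rule of thumb for English and checks whether a prompt plus its expected completion fits within a 4k/8k window. The ratio and the helper names are illustrative assumptions, not a real tokenizer; in practice, use the model's own tokenizer.

```python
# Rough sketch: estimate whether a prompt fits a model's context window.
# The 0.75 words-per-token ratio is a common rule of thumb for English text,
# not an exact tokenizer; swap in the model's real tokenizer for production use.

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Approximate token count from word count (English-centric heuristic)."""
    return int(len(text.split()) / words_per_token)

def fits_context(prompt: str, expected_output_tokens: int, context_window: int = 8192) -> bool:
    """Check that prompt + expected completion fit in the model's context window."""
    return estimate_tokens(prompt) + expected_output_tokens <= context_window

prompt = "Summarize the quarterly sales report for the EMEA region."
print(estimate_tokens(prompt), fits_context(prompt, expected_output_tokens=512))
```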

Let us consider the batch scenario first. Here, we mostly know our input and output context lengths, so the focus is on optimizing throughput. (Latency is not relevant here given the offline / batch nature of the execution.) To achieve high throughput:

  • Determine whether your LLM fits in a single GPU.
  • If not, apply pipeline / tensor parallelism to optimize the number of GPUs needed. Then increase the batch size to be as large as possible (see the sizing sketch below).
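
As a rough illustration of the sizing logic in the two bullets above, here is a back-of-the-envelope Python sketch, assuming an 8B-parameter-class model at FP16 on a single 80 GB accelerator. All numbers (layers, hidden size, sequence length) are illustrative assumptions, and the sketch ignores activation memory, grouped-query attention, and framework overheads.

```python
# Back-of-the-envelope sizing sketch for the batch scenario (all numbers are
# illustrative assumptions, not vendor guidance).

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16, 1 for INT8, 0.5 for 4-bit."""
    return params_billion * bytes_per_param

def kv_cache_gb(batch_size: int, seq_len: int, layers: int, hidden: int,
                bytes_per_value: float = 2.0) -> float:
    """KV cache = 2 (K and V) * layers * hidden * seq_len * batch, in GB."""
    return 2 * layers * hidden * seq_len * batch_size * bytes_per_value / 1e9

GPU_MEM_GB = 80                                   # e.g. a single 80 GB accelerator
model = dict(params_billion=8, layers=32, hidden=4096)

weights = weights_gb(model["params_billion"], bytes_per_param=2.0)   # FP16
gpus_for_weights = -(-weights // GPU_MEM_GB)                         # ceiling division

# With the weights placed, grow the batch until the KV cache exhausts
# the remaining GPU memory.
free_gb = gpus_for_weights * GPU_MEM_GB - weights
batch = 1
while kv_cache_gb(batch + 1, seq_len=4096, layers=model["layers"],
                  hidden=model["hidden"]) <= free_gb:
    batch += 1

print(f"weights: {weights:.0f} GB, GPUs: {int(gpus_for_weights)}, max batch ~{batch}")
```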

For the streaming scenario, we need to consider the trade-off between throughput and latency. To understand latency, let us take a look at the processing stages of a typical LLM request: Prefill and Decoding (illustrated in the below figure).

Fig: LLM processing stages: Prefill & Decoding

Prefill is the phase in which the input prompt is processed; it accounts for the latency between pressing ‘enter’ and the first output token appearing on the screen. Decoding is the phase in which the remaining tokens of the response are generated. In most requests, prefill takes less than 20% of the end-to-end latency, while decoding takes more than 80%.

Given this, most LLM implementations tend to send tokens back to the client as soon as they are generated — to reduce latency.

To summarize, in streaming mode, we primarily care about the time to first token, as this is the time during which the client is waiting for the first token. Afterwards, the following tokens are generated much faster, and the rate of generation is usually faster than the average human reading speed.
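
A minimal sketch of how these streaming metrics could be measured on the client side; `stream_tokens` is a hypothetical stand-in for any client that yields tokens as they arrive, and the fake stream below only simulates prefill and decode delays.

```python
import time

# Minimal sketch: measure time-to-first-token (TTFT), average inter-token
# latency, and end-to-end latency for a streaming response.

def measure_streaming_latency(stream_tokens):
    start = time.perf_counter()
    timestamps = []
    for _ in stream_tokens:                        # iterate tokens as they arrive
        timestamps.append(time.perf_counter())
    ttft = timestamps[0] - start                   # prefill-dominated latency
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0   # average inter-token latency
    e2e = timestamps[-1] - start                   # end-to-end latency
    return ttft, itl, e2e

# Usage with a fake stream that simulates prefill and per-token decode delays:
def fake_stream(n=20, prefill=0.5, per_token=0.03):
    time.sleep(prefill)
    for i in range(n):
        yield f"tok{i}"
        time.sleep(per_token)

print(measure_streaming_latency(fake_stream()))
```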

Note that for RAG pipelines, even the first-token latency can be significantly high, since RAG prompts typically approach the full context window once retrieved document chunks are added to the input. In sequential (non-streaming) mode, we have to wait for the end result, so we care about the end-to-end latency: the time to produce all the tokens in the output (response) sequence.

Finally, regarding the trade-off between latency and throughput: increasing the batch size (running multiple requests through the LLM concurrently) tends to make latency worse but throughput better. Of course, upgrading the underlying hardware / GPU can improve both throughput and latency. Refer to Nvidia’s tutorial on LLM inference sizing for a detailed discussion on this topic.
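
The following toy model illustrates this trade-off: as the batch size grows, per-request latency increases while aggregate token throughput improves. The linear decode-step cost is an illustrative assumption; real curves depend on the hardware and serving stack.

```python
# Toy model of the latency / throughput trade-off as batch size grows.
# Assumption (illustrative): per-step decode time grows roughly linearly with
# batch size once the GPU is busy; real behavior depends on hardware and kernels.

def decode_step_ms(batch_size, base_ms=20.0, per_request_ms=1.5):
    return base_ms + per_request_ms * batch_size

def metrics(batch_size, output_tokens=256):
    step = decode_step_ms(batch_size)
    latency_s = output_tokens * step / 1000                # per-request latency
    throughput = batch_size * output_tokens / latency_s    # tokens/sec across the batch
    return latency_s, throughput

for b in (1, 4, 16, 64):
    lat, tput = metrics(b)
    print(f"batch={b:3d}  latency={lat:5.1f}s  throughput={tput:6.0f} tok/s")
```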

3. Agentic AI Inferencing

In Section 2, we discussed in detail the sizing dimensions impacting a single LLM use-case. In this section, we extend the same to agentic AI, which can be considered an orchestration of multiple LLM use-cases / agents.

Andrew Ng recently talked about this aspect:

Today, a lot of LLM output is for human consumption. But in an agentic workflow, an LLM might be prompted repeatedly to reflect on and improve its output, use tools, plan and execute multiple steps, or implement multiple agents that collaborate. So, we might generate hundreds of thousands of tokens or more before showing any output to a user. This makes fast token generation very desirable and makes slower generation a bottleneck to taking better advantage of existing models.

Below we highlight the key steps in extrapolating LLM to agentic AI inferencing:


Fig: Multi-agentic AI inferencing dimensions

3.1 Agent Observability

Token latency maps to agent processing latency. The first-token versus end-to-end latency discussion maps, in this case, to the latency of executing the first agent versus the end-to-end execution latency of the full orchestration / decomposed plan.

We thus need to balance the requirement of streaming agent outputs as soon as each agent finishes its execution versus outputting the result only once execution of the full orchestration has terminated.
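
A minimal asyncio sketch of the two observability modes, assuming hypothetical agent names and delays: stream each agent's result as soon as it finishes, versus report only once the whole orchestration has terminated.

```python
import asyncio

# Sketch of the two observability modes discussed above (agent names and
# delays are hypothetical stand-ins for LLM / tool calls).

async def run_agent(name: str, delay: float) -> str:
    await asyncio.sleep(delay)          # stand-in for an LLM / tool call
    return f"{name}: done"

async def orchestrate(stream: bool = True):
    tasks = [run_agent("search", 0.5), run_agent("summarize", 1.0), run_agent("book", 1.5)]
    if stream:
        # "First-agent latency": emit each result the moment it is available.
        for finished in asyncio.as_completed(tasks):
            print(await finished)
    else:
        # "End-to-end latency": wait for the full plan before emitting anything.
        print(await asyncio.gather(*tasks))

asyncio.run(orchestrate(stream=True))
```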

For a detailed discussion, refer to my previous article on stateful representation of AI agents enabling both real-time and batch observability of the agentic orchestration.

3.2 Agentic Context Window Size

The output of one agent becomes the input of the next agent to be executed in a multi-agent orchestration. So it is very likely that (at least some part of) the preceding agent's output, together with the overall execution state / contextual understanding (stored in the memory management layer), will become part of the input context passed to the following agent, and this needs to be accounted for in the agentic context window sizing.
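
A sketch of how the aggregated context could be assembled and kept within budget, assuming the same rough words-per-token heuristic as earlier; the prioritization rule (keep the shared execution state, truncate the preceding agent's output) is an illustrative choice, not a prescribed policy.

```python
# Sketch: assemble the next agent's input from the preceding agent's output
# plus shared execution state, while respecting a context-window budget.
# Token counting reuses the rough words-per-token heuristic; names are illustrative.

def estimate_tokens(text: str) -> int:
    return int(len(text.split()) / 0.75)

def build_agent_input(task: str, prev_output: str, shared_state: str,
                      context_window: int = 8192, reserve_for_output: int = 1024) -> str:
    budget = context_window - reserve_for_output - estimate_tokens(task)
    # Prioritize the shared execution state; truncate the preceding agent's
    # output (dropping its oldest words) if the combined prompt would overflow.
    if estimate_tokens(shared_state) + estimate_tokens(prev_output) > budget:
        keep_words = int(max(budget - estimate_tokens(shared_state), 0) * 0.75)
        prev_output = " ".join(prev_output.split()[-keep_words:]) if keep_words else ""
    return f"{task}\n\n[execution state]\n{shared_state}\n\n[previous agent output]\n{prev_output}"

# Usage (hypothetical content):
next_input = build_agent_input("Book the cheapest flight found above.",
                               prev_output="search agent results ...",
                               shared_state="user prefers morning flights; budget 500 USD")
print(estimate_tokens(next_input))
```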

3.3 Non-determinism in Agentic AI Execution

Finally, we need to consider the inherent non-determinism in agentic AI systems. For example, let us consider the e-shopping scenario illustrated in the below figure.

Fig: E-shopping scenario with non-determinism

There are two non-deterministic operators in the execution plan: ‘Check Credit’ and ‘Delivery Mode’. The ‘Delivery Mode’ choice indicates that the user can either pick up the order directly from the store or have it shipped to their address. Given this, shipping is a non-deterministic task and may not get invoked during an actual execution.

To summarize, given the presence of ‘choice’ operators in an orchestration, we do not know the exact tasks / agents that will get executed as part of a specific execution.

Different strategies can be applied here, including flattening the full execution plan to determine the tasks / agents that can potentially get executed in the best-case and worst-case (peak) scenarios, as sketched below.
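
A small sketch of the flattening idea, assuming an illustrative plan representation (lists for sequences, tuples for choice operators) modeled on the e-shopping scenario; it enumerates every concrete path and picks the best-case and worst-case (peak) agent sets. The plan format is an assumption for illustration, not a standard representation.

```python
from itertools import product

# Sketch: "flatten" an execution plan containing choice operators to enumerate
# which agents can run in the best-case and worst-case (peak) scenarios.

plan = [
    "search_product",
    ("choice", ["check_credit"], []),                    # credit check may be skipped
    "place_order",
    ("choice", ["ship_to_address"], ["store_pickup"]),   # delivery mode
]

def flatten(plan):
    """Yield every concrete agent sequence the plan can produce."""
    options = [[step] if isinstance(step, str) else step[1:] for step in plan]
    for combo in product(*options):
        yield [a for part in combo for a in ([part] if isinstance(part, str) else part)]

paths = list(flatten(plan))
best = min(paths, key=len)    # fewest agents invoked
worst = max(paths, key=len)   # peak scenario: most agents invoked
print(f"possible paths: {len(paths)}, best case: {best}, worst case: {worst}")
```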

Refer to my ICAART 2024 paper (Constraints Enabled Autonomous Agent Marketplace: Discovery and Matchmaking) for a detailed discussion on the applicable strategies to accommodate non-determinism in agentic AI executions.

Debmalya Biswas

AI @ UBS | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA