Agent Observability: Latency vs. Throughput for Agentic AI Executions
1. Introduction
The discussion around ChatGPT (and generative AI in general) has now evolved into agentic AI. While ChatGPT is primarily a chatbot that generates text responses, AI agents can execute complex tasks autonomously, e.g., make a sale, plan a trip, book a flight, hire a contractor for a house job, or order a pizza. The figure below illustrates the evolution of agentic AI systems.
Bill Gates recently envisioned a future where we would have an AI agent that is able to process and respond to natural language and accomplish a number of different tasks. Gates used planning a trip as an example.
Ordinarily, this would involve booking your hotel, flights, restaurants, etc. on your own. But an AI agent would be able to use its knowledge of your preferences to book and purchase those things on your behalf.
In this article, we deep-dive into the inferencing aspects of agentic AI, e.g., observability, latency, throughput, and non-determinism, which are critical to deploying multi-agent systems (MAS) at scale.
We first consider the dimensions impacting LLM inferencing, and then extrapolate the same to agentic AI inferencing.
2. LLM Inferencing
LLM inference sizing depends on many use-case dimensions, e.g., offline (batch) vs. online (streaming) execution, input and output context lengths, and the latency / throughput requirements of the use-case.
Let us consider the batch scenario first. Here, we mostly know our input and output context lengths, so the focus is on optimizing throughput. (Latency is not relevant here given the offline / batch nature of the execution.) To achieve high throughput, we want to batch as many concurrent requests as the GPU memory allows, maximizing hardware utilization; a minimal sketch follows.
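As a minimal sketch of the offline scenario (assuming a hypothetical batched `llm.generate` API, e.g., a vLLM-style engine), we simply chunk the requests into large batches and measure aggregate throughput, without worrying about per-request latency:

```python
import time

def run_offline_batches(llm, prompts, batch_size=32, max_new_tokens=256):
    """Offline / batch inference sketch: we only care about aggregate
    throughput (output tokens/sec), not per-request latency.
    `llm.generate` and the `.token_ids` field are hypothetical (assumed API)."""
    total_tokens, start = 0, time.time()
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i : i + batch_size]
        outputs = llm.generate(batch, max_new_tokens=max_new_tokens)  # one batched pass
        total_tokens += sum(len(o.token_ids) for o in outputs)
    elapsed = time.time() - start
    return total_tokens / elapsed  # throughput in output tokens/sec
```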
For the streaming scenario, we need to consider the trade-off between throughput and latency. To understand latency, let us take a look at the processing stages of a typical LLM request: Prefill and Decoding (illustrated in the figure below).
Prefill is the phase in which the input prompt is processed; it largely determines the latency between pressing ‘enter’ and the first output token appearing on the screen. Decoding is the phase in which the remaining tokens of the response are generated, one at a time. In most requests, prefill takes less than 20% of the end-to-end latency, while decoding takes more than 80%.
Given this, most LLM implementations tend to send tokens back to the client as soon as they are generated — to reduce latency.
To summarize, in streaming mode, we primarily care about the time to first token, as this is the time during which the client is waiting for the first token. Afterwards, the following tokens are generated much faster, and the rate of generation is usually faster than the average human reading speed.
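To make the prefill / decode split concrete, here is a back-of-the-envelope latency model; the prefill and decode rates are purely illustrative assumptions, not measurements:

```python
def request_latency(n_input_tokens, n_output_tokens,
                    prefill_rate=5000.0,   # prompt tokens processed per second (assumed)
                    decode_rate=50.0):     # output tokens generated per second (assumed)
    """Rough latency model: prefill processes the whole prompt before the
    first output token appears; decoding then generates tokens one at a time."""
    ttft = n_input_tokens / prefill_rate        # time to first token
    e2e = ttft + n_output_tokens / decode_rate  # end-to-end latency
    return ttft, e2e

# e.g., a chat-style request: short prompt, medium-length response
ttft, e2e = request_latency(n_input_tokens=500, n_output_tokens=400)
print(f"TTFT ~{ttft:.2f}s, end-to-end ~{e2e:.2f}s")  # decoding dominates the total
```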
Note that for RAG pipelines, even the first-token latency can be significantly high.
RAG pipelines typically fill most of the context window as a result of adding retrieved document chunks to the input prompt, which inflates the prefill phase. In sequential (non-streaming) mode, we have to wait for the end result, and hence we care about the end-to-end latency: the time to produce all the tokens in the (response) output sequence.
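Using the same toy model, we can see why RAG inflates even the first-token latency: the retrieved chunks land in the input prompt and stretch the prefill phase (the chunk count and sizes below are assumptions):

```python
# RAG case: retrieved chunks are prepended to the prompt, so prefill time
# (and hence TTFT) grows with the number / size of chunks. Numbers are assumed.
chunks, chunk_tokens = 10, 800
rag_ttft, rag_e2e = request_latency(
    n_input_tokens=500 + chunks * chunk_tokens,  # original prompt + retrieved context
    n_output_tokens=400,
)
print(f"RAG TTFT ~{rag_ttft:.2f}s vs. ~{ttft:.2f}s without retrieval")
```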
Finally, regarding the trade-off between latency and throughput: increasing the batch size (running multiple requests through the LLM concurrently) tends to make latency worse but throughput better; a toy sweep below illustrates this. Of course, upgrading the underlying hardware / GPU can improve both throughput and latency. Refer to Nvidia’s tutorial on LLM inference sizing for a detailed discussion on this topic.
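A simplistic sweep over batch sizes illustrates the trade-off; the per-step timings are assumptions, and real behaviour depends on the serving engine, model and GPU:

```python
def batched_decode_stats(batch_size, n_output_tokens=400,
                         base_step_time=0.02,      # seconds per decode step at batch size 1 (assumed)
                         step_time_growth=0.002):  # extra seconds per step per additional request (assumed)
    """Toy model: each decode step emits one token per request in the batch,
    but the step itself gets slower as the batch grows."""
    step_time = base_step_time + step_time_growth * (batch_size - 1)
    latency = n_output_tokens * step_time   # per-request decode latency
    throughput = batch_size / step_time     # aggregate output tokens/sec across the batch
    return latency, throughput

for bs in (1, 4, 16, 64):
    lat, tput = batched_decode_stats(bs)
    print(f"batch={bs:3d}  per-request latency ~{lat:5.1f}s  throughput ~{tput:6.0f} tok/s")
```

Under these assumed numbers, per-request latency grows several-fold from batch size 1 to 64, while aggregate throughput grows by close to an order of magnitude, which is the essence of the trade-off.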
3. Agentic AI Inferencing
In Section 2, we discussed in detail the sizing dimensions impacting a single LLM use-case. In this section, we extend the same to agentic AI, which can be considered an orchestration of multiple LLM use-cases / agents.
Andrew Ng recently talked about this aspect:
Today, a lot of LLM output is for human consumption. But in an agentic workflow, an LLM might be prompted repeatedly to reflect on and improve its output, use tools, plan and execute multiple steps, or implement multiple agents that collaborate. So, we might generate hundreds of thousands of tokens or more before showing any output to a user. This makes fast token generation very desirable and makes slower generation a bottleneck to taking better advantage of existing models.
Below we highlight the key steps in extrapolating LLM to agentic AI inferencing:
3.1 Agent Observability
Token latency maps to agent processing latency. The first-token versus end-to-end token latency discussion maps to first-agent versus end-to-end execution latency of the full orchestration / decomposed plan in this case.
We thus need to balance streaming each agent’s output as soon as that agent finishes executing against outputting the result only once the full orchestration has terminated.
For a detailed discussion, refer to my previous article on stateful representation of AI agents enabling both real-time and batch observability of the agentic orchestration.
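As a minimal sketch of the two observability modes (the `agents` list and its `run(context)` method are hypothetical, not the stateful representation from that article), intermediate results can either be streamed to an observer as each agent completes, or returned only once the whole orchestration has terminated:

```python
from typing import Callable, Dict, Iterable, Optional

def run_orchestration(agents: Iterable, context: Dict,
                      on_agent_done: Optional[Callable[[str, object], None]] = None) -> Dict:
    """Sequentially execute agents; optionally stream each agent's output as
    soon as it is available (first-agent latency) instead of waiting for the
    end of the full orchestration (end-to-end latency).
    `agents` and their `run(context)` method are assumed / hypothetical."""
    results = {}
    for agent in agents:
        output = agent.run(context)                 # blocking agent execution
        results[agent.name] = output
        context = {**context, agent.name: output}   # pass the output downstream
        if on_agent_done:                           # streaming / real-time observability
            on_agent_done(agent.name, output)
    return results                                  # batch observability: full result at the end

# Streaming mode: the observer sees each agent's output as it finishes, e.g.
# run_orchestration(agents, {"user_query": "..."},
#                   on_agent_done=lambda name, out: print(name, out))
```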
3.2 Agentic Context Window Size
The output of one agent becomes the input of the next agent to be executed in a multi-agent orchestration. So it is very likely that (at least part of) the preceding agent’s output, together with the overall execution state / contextual understanding stored in the memory management layer, will become part of the input context passed to the following agent. This needs to be taken into account as part of the agentic context window sizing requirements; a rough sizing sketch follows.
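The sketch below makes this concrete; the carry-over ratio and token counts are assumptions, not measured values:

```python
def estimate_agent_context(agent_prompt_tokens, prev_output_tokens,
                           memory_state_tokens, carry_ratio=0.5):
    """Rough estimate of the input context for the next agent in the chain:
    its own prompt + a fraction of the preceding agent's output + the
    execution state kept in the memory management layer.
    carry_ratio is an assumption about how much of the previous output
    is carried forward."""
    return (agent_prompt_tokens
            + int(carry_ratio * prev_output_tokens)
            + memory_state_tokens)

# e.g., 1k prompt, 3k tokens produced by the previous agent, 2k of shared state
print(estimate_agent_context(1_000, 3_000, 2_000))  # ~4,500 tokens of input context
```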
3.3 Non-determinism in Agentic AI Execution
Finally, we need to consider the inherent non-determinism in agentic AI systems. For example, let us consider the e-shopping scenario illustrated in the figure below.
There are two non-deterministic operators in the execution plan: ‘Check Credit’ and ‘Delivery Mode’. The choice ‘Delivery Mode’ indicates that the user can either pick up the order directly from the store or have it shipped to their address. Given this, shipping is a non-deterministic task and may not get invoked during the actual execution.
To summarize, given the presence of ‘choice’ operators in an orchestration, we do not know the exact tasks / agents that will get executed as part of a specific execution.
Different strategies can be applied here, including a flattening of the full execution plan to determine the tasks / agents that can potentially get executed as part of best-case and worst-case (peak) scenarios; a minimal sketch is shown below.
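In the sketch below, the plan encoding and agent names are hypothetical (loosely following the e-shopping figure); enumerating the branches of the ‘choice’ operators yields the best-case and worst-case (peak) sets of agents that may execute:

```python
from itertools import product

# Hypothetical encoding of the e-shopping plan: tuples are sequences,
# dicts are 'choice' operators with one entry per branch.
plan = (
    "browse_catalog",
    "add_to_cart",
    {"credit_ok": ("charge_card",), "credit_declined": ("notify_user",)},  # 'Check Credit'
    {"store_pickup": (), "home_delivery": ("pack_order", "ship_order")},   # 'Delivery Mode'
)

def flatten(node):
    """Return the list of possible agent sequences for a (sub-)plan."""
    if isinstance(node, str):
        return [[node]]
    if isinstance(node, dict):           # choice: exactly one branch is taken
        return [seq for branch in node.values() for seq in flatten(branch)]
    paths = [[]]                         # sequence: concatenate the children
    for child in node:
        paths = [p + q for p, q in product(paths, flatten(child))]
    return paths

paths = flatten(plan)
best = min(paths, key=len)   # best-case (fewest agents executed)
worst = max(paths, key=len)  # worst-case / peak scenario
print(len(paths), "possible executions; best:", best, "worst:", worst)
```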
Refer to my ICAART 2024 paper (Constraints Enabled Autonomous Agent Marketplace: Discovery and Matchmaking) for a detailed discussion on the applicable strategies to accommodate non-determinism in agentic AI executions.