Experimenting with AI Agents for Consulting Workflows: A Multi-Agent Case Study

In the rapidly evolving world of generative AI, we now have an array of powerful models (OpenAI’s GPT-4, Anthropic’s Claude 4, Google’s Gemini, and the new xAI Grok 4, among others) vying for supremacy. Each model has its own strengths and quirks. As an MBA student and tech enthusiast, I wanted to see how leveraging multiple AI agents, each doing what it does best, could streamline a consulting workflow for business insights and report generation.

I conducted an experiment using a multi-agent AI approach to produce a polished, consulting-style analysis. The idea was to break the project into specialized tasks and assign each to the AI model most suited for it. The results were eye-opening, showing how AI “teamwork” can yield a report worthy of a senior consultant. Below, I share the process, findings, and key takeaways from this experiment.

Why a Multi-Agent Workflow?

Modern large language models are incredibly capable generalists, but asking one model to do everything isn’t always optimal. Complex projects (like data analysis + visualization + narrative writing) can overwhelm a single AI agent, leading to errors or generic outputs. Instead, splitting the job into distinct subtasks and using multiple specialized agents can enhance clarity, quality, and performance. Research engineers have noted that when automating tasks like data extraction, analysis, and report writing, it’s wise to “start separating responsibilities” – each agent focusing on its own domain and tools. In other words, just as consulting teams have expert roles (data analyst, visualization specialist, presentation writer), an AI workflow can have dedicated agents for each step of the process.

This multi-agent strategy not only plays to each model’s strengths, but also mirrors real consulting workflows. It prevents any one AI from juggling too many tools or too much context at once, a known cause of confusion in single-agent setups. The expectation is that the end-to-end process becomes more efficient and error-resistant, with each “AI consultant” excelling at its given task.

Step 1: Data Querying with an LLM Agent (Text-to-SQL)

Every good analysis starts with good data. For the data extraction and query step, I built a custom LLM-based Data Agent using LlamaIndex connected to Google BigQuery. Under the hood, this agent was powered by Google’s Gemini 2.5 Pro model fine-tuned on my project’s database schema and context. Its job: convert natural language questions into advanced SQL queries against our BigQuery dataset.

Why Gemini? Google’s Gemini LLM has proven excellent for text-to-SQL tasks thanks to its strong natural language understanding and code generation abilities. It can translate complex user questions into correct SQL, even across multiple tables and filters. LlamaIndex, meanwhile, made it easy to integrate the database: it indexed the BigQuery table schemas and provided the relevant context to the LLM when needed. This retrieval-augmented generation (RAG) setup ensured the AI only received the schema information it needed, reducing the risk of hallucinated SQL.

Using this agent, I fed in business questions (in plain English) and received tailored SQL queries in return. Impressively, all the SQL queries in the analysis were generated by this fine-tuned LLM agent. For example, if I asked, “What are the top 5 products by sales growth in the last quarter, and their contribution to total revenue?”, the agent would produce a properly JOINed and filtered SQL query to answer it. The queries were then executed in BigQuery on the company’s dataset, yielding the raw results and initial charts.
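
For readers who want to try something similar, here is a minimal sketch of how such an agent can be wired up with LlamaIndex and a Gemini LLM. Treat it as illustrative rather than my exact setup: the connection string, table names, and model identifier are placeholders.

```python
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase, Settings
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.llms.gemini import Gemini  # requires the llama-index-llms-gemini package

# Placeholder BigQuery connection (needs the sqlalchemy-bigquery driver) and table names.
engine = create_engine("bigquery://my-gcp-project/sales_dataset")
sql_db = SQLDatabase(engine, include_tables=["orders", "products", "customers"])

# Use Gemini as the LLM that translates questions into SQL (model name is illustrative).
Settings.llm = Gemini(model="models/gemini-2.5-pro")

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_db,
    tables=["orders", "products", "customers"],
)

response = query_engine.query(
    "What are the top 5 products by sales growth in the last quarter, "
    "and their contribution to total revenue?"
)
print(response.metadata["sql_query"])  # the SQL the agent generated
print(response)                        # the synthesized natural-language answer
```

When the number of tables grows, the same pattern extends to retrieving schema context from a vector index instead of listing tables explicitly, which is what keeps the agent grounded in the actual database structure.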

This approach saved enormous time and ensured accuracy:

  • I didn’t have to manually write complex SQL. The AI agent handled it, and it did so in seconds, not hours.

  • By using a vector database of schema embeddings, the agent stayed aware of the database structure and produced highly relevant, correct queries.

  • In fact, others have found Gemini-based systems to excel at such tasks – one independent benchmark showed Gemini 2.5 Pro achieving the highest accuracy and consistency on complex SQL generation for financial data, outperforming even GPT-4 and Claude. This validated my choice to use Gemini as the “database specialist” of the team.

By the end of this step, I had clean data and even some preliminary visualizations (BigQuery can output basic charts) ready for the next agent.

Step 2: Automated Analysis Draft with Perplexity Labs

With data in hand, the next task was to generate a narrative analysis and visuals – essentially a first draft of a client-ready report. For this I turned to Perplexity Labs, an AI tool designed for deep research and report generation. Perplexity Labs acts like an autonomous analyst: it can take data and prompts, run its own searches and code, and produce rich outputs like charts, summaries, even interactive dashboards.

I uploaded the BigQuery results (data tables and charts) into Perplexity Labs and prompted it with a high-level instruction:

“Please analyze the data comprehensively and deliver a polished, McKinsey-calibre business analysis that reflects the depth and rigor of a senior consultant with 20 years at the firm. Present your findings in clear, well-structured narrative paragraphs only—no tables.”

In essence, I asked the AI for a top-tier consulting report, focusing purely on written insights (I didn’t want raw tables in the output). Perplexity Labs went to work – and notably, it took several minutes (Labs can run for ten minutes or more, using various tools to gather information and craft outputs). The wait was worth it. The AI delivered a structured draft report complete with key findings and even embedded charts that it generated from the data!

Perplexity Labs’ output impressed me for a few reasons:

  • It produced multiple custom charts and graphs to support the analysis, without me explicitly asking for each visual. The charts were relevant (e.g. trend lines, bar charts for comparisons) and helped illustrate the story in the data.

  • The written analysis was structured like a consultant’s storyline – with sections for an executive summary, analysis of trends, and recommendations. The tone and depth did feel reminiscent of a McKinsey-style report.

  • It appeared to use a broad knowledge base (likely pulling contextual info from the web or my attachments) to add insight. In fact, Perplexity Labs is known to search multiple data sources (web, academic, etc.) simultaneously and even consider attached files. The result was a more comprehensive narrative than a single source could provide.

Perplexity Labs can generate detailed reports with charts and insights autonomously, making it a powerful AI research assistant for creating business presentations.

By the end of this step, I had a solid first draft of the report. It wasn’t perfect – some sections needed expansion, and I noticed a couple of missing pieces in the visuals – but it was a tremendous head start.

(Side note: Perplexity Labs, launched in May 2025, is a new offering aimed at exactly these kinds of use cases. Instead of just giving a wall of text like a chatbot, Labs can output polished documents with charts, images, and even interactive elements. It’s like having a junior analyst who can write and make charts on command.)

Step 3: Enhancing Visuals with an AI Coding Assistant

While I loved the graphs Perplexity Labs generated, I identified a few additional visuals that would strengthen the report (for example, a more granular breakdown chart, or a different view of the data that wasn’t in the initial draft). To get these missing charts, I enlisted another specialist: GitHub Copilot, an AI coding assistant.

Using Copilot (within a Jupyter notebook environment), I described the extra analyses I needed, and it helped me whip up Python code (using libraries like Pandas and Matplotlib) to produce those charts. In essence, Copilot acted as my data visualization coder. For instance, I might prompt it, “Plot a histogram of customer acquisition by month for the last 2 years” or “Generate a comparative bar chart of Product A vs Product B profitability by quarter.” Copilot would suggest the code to do so, which I could tweak if necessary and run to get the chart.
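
To give a flavour of what that looked like, here is a simplified version of the kind of code Copilot helped produce for the profitability comparison. The file name and columns are placeholders standing in for my actual BigQuery export.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder export of the BigQuery results; columns assumed: quarter, product, profit.
df = pd.read_csv("quarterly_profitability.csv")

# Comparative bar chart: Product A vs Product B profitability by quarter.
pivot = (
    df[df["product"].isin(["Product A", "Product B"])]
    .pivot_table(index="quarter", columns="product", values="profit", aggfunc="sum")
    .sort_index()
)
ax = pivot.plot(kind="bar", figsize=(8, 4))
ax.set_ylabel("Profit")
ax.set_title("Product A vs Product B profitability by quarter")
plt.tight_layout()
plt.savefig("profitability_by_quarter.png", dpi=150)
```

In practice I still reviewed and tweaked the suggestions (column names, ordering, formatting) before dropping the charts into the draft.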

This step highlights an important aspect of AI collaboration: not all useful AI outputs come from natural language prompts alone. Sometimes, the quickest path to a result is to have an AI assist in writing code for analysis. By doing so, I was able to:

  • Create highly customized charts that the general-purpose analysis agent (Perplexity) didn’t cover.

  • Ensure data accuracy by directly using the data in code. Copilot even caught a few edge cases (like handling missing values) that improved the robustness of the charts.

  • Save time vs. writing the code purely myself – Copilot usually got me 80% of the way there, and I just fine-tuned the rest.

After this, I incorporated these new graphs into the report draft. Now I had all the pieces: a thorough analysis write-up with plenty of visuals to back it up. The last challenge was to polish and condense the findings into an executive-ready form.

Step 4: Summarizing and Polishing the Report with Grok 4

For the final step – synthesizing the refined analysis into a top-notch report – I turned to Grok 4, xAI’s latest large language model. I was curious how Grok (billed as having “superhuman-level reasoning” and a massive 256K-token context window) would perform at summarizing a lengthy, complex document. Grok 4 also takes an agentic approach, especially in its “Heavy” version, where multiple agents can collaborate on a task. This made it an intriguing choice for the Report Agent role – potentially it could “think deeply” through the content and organize it optimally.

I fed the entire draft report (which was quite long, dozens of pages of text and graphs) into Grok 4 and asked it to produce a concise, well-structured narrative highlighting all the key insights for an executive audience. Essentially: “Here’s the full analysis; now give me the 5-page bulletproof version.”
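
For anyone who wants to reproduce this step programmatically, a rough sketch against xAI’s OpenAI-compatible API might look like the following; the endpoint and model name are assumptions on my part, so check the current xAI documentation before relying on them.

```python
from openai import OpenAI

# Assumed endpoint and model name for xAI's OpenAI-compatible API -- verify against current docs.
client = OpenAI(api_key="YOUR_XAI_API_KEY", base_url="https://api.x.ai/v1")

with open("draft_report.md", "r", encoding="utf-8") as f:
    draft = f.read()

response = client.chat.completions.create(
    model="grok-4",
    messages=[
        {"role": "system",
         "content": "You are a senior strategy consultant editing a client-ready report."},
        {"role": "user",
         "content": ("Here is the full analysis. Produce a concise, well-structured narrative "
                     "highlighting the key insights for an executive audience, roughly five pages.\n\n"
                     + draft)},
    ],
)
print(response.choices[0].message.content)
```

Being able to pass the whole draft in a single request is largely what the big context window buys you; with a smaller-context model this step would need chunking and stepwise summarization.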

To my delight, Grok 4 delivered a fantastic final report. The output read as if a seasoned consultant had distilled the analysis down to the most crucial points and recommendations:

  • The language was crisp and professional, with zero fluff. It maintained the evidence-based tone (referring to data points from the analysis) but cut out extraneous details.

  • The structure was on-point: an executive summary at top, followed by sections logically flowing through findings, and a brief conclusion with actionable recommendations.

  • Impressively, Grok seemed to truly understand the document. It didn’t miss or misstate any of the important findings. Credit may go to its huge context window and robust reasoning – Grok is designed to handle very large inputs and reason “like a human consultant” in strategy and decision-making contexts.

At this stage, I had the polished analysis ready for the client. As a final QA, I cross-checked a few sections with the original data (always a good practice) and found everything in order. The multi-agent assembly line – from data query to draft to final copy – had produced a report that truly felt “McKinsey-calibre.”

Model Face-Off: ChatGPT vs Claude vs Gemini vs Grok

Throughout this experiment, I also ran a side-by-side comparison of the major AI models on the final report-writing task. I gave the same summarization prompt to each of the following: ChatGPT (GPT-4) via OpenAI, Anthropic’s Claude 4 Sonnet, Google Gemini 2.5 Pro (in “deep research” mode), and xAI’s Grok 4. My aim was to see how style and substance differed when each was asked to produce the polished report. The differences were quite illuminating:

  • ChatGPT (GPT-4) – Generalized and Safe: The report from ChatGPT was well-structured and fluent, but it stayed very high-level. It tended to use generic business language and cautious statements. All facts were correct, yet it lacked some depth or bold insights. It felt like it was playing it safe, possibly to avoid any hallucination. For a quick overview, it was fine, but it didn’t dive into the nuance of the data as much as I’d hoped.

  • Claude 4 (Sonnet Extended) – Extremely Detailed (to a Fault): Claude’s version was rich in detail and very lengthy. It didn’t seem to miss any point – in fact, it often expanded on points with additional context or hypotheticals. While this thoroughness was impressive, it made the report overly verbose. Important insights were somewhat buried in the volume of text. Claude was like that analyst who gives you a 50-slide deck where 20 would suffice. Good analysis, but not sharply edited for key messages.

  • Gemini 2.5 Pro (Deep Research) – Powerful Analysis, but Some Off-Context Tangents: The Gemini model, when used in a “deep research” capacity, did provide strong analytical paragraphs. However, I observed an interaction quirk: it occasionally went beyond the provided document context, as if it “activated” an internal research mode and brought in outside info that wasn’t relevant. This made parts of its report not directly tied to the dataset at hand – potentially problematic in a client setting. The core content was solid, but Gemini didn’t stay fully within scope, despite the fine-tuning. (This aligns with my earlier usage: Gemini shines in data/query tasks, but for constrained summary it needed more guardrails.)

  • Grok 4 – Balanced and Insightful: The Grok 4 report was the best structured and most on-point of all. It managed to combine the succinctness of ChatGPT with the depth of Claude – without the downsides of either. The writing was tight and executive-friendly. Every paragraph was relevant to the main story. It drew on the data and analysis appropriately, and even the phrasing felt like what you’d expect from a consultant with decades of experience. Grok did not drift off-topic, nor did it miss subtle insights. It struck that ideal balance between brevity and detail.

It’s worth noting that Grok 4’s strong performance here doesn’t mean it’s universally the top model for everything. As mentioned, my Gemini agent was superior for generating the SQL queries. In fact, a recent study found Gemini 2.5 Pro outperformed Grok 4 and others in real-world SQL accuracy tests, underscoring that each model has distinct strengths. The takeaway for me is that model selection should be task-dependent: use the right AI for the right job. In my experiment, that meant Gemini for data retrieval, Perplexity for initial analysis, Copilot for coding charts, and Grok for final synthesis.

Key Takeaways and Future Outlook

This multi-agent AI consulting workflow was a fascinating success. The experience taught me several important lessons about the future of consulting and how we might work alongside AI:

  • Leverage AI’s Specialized Strengths: No single AI (as of now) is best at everything. But if you compose them like a team of specialists, you can cover all bases with high quality. As we saw, one model might generate flawless SQL, another produces great visuals, and another writes an excellent narrative. Knowing the forte of each AI model is going to be a key skill.

  • Human Oversight and Orchestration Are Key: While the AI agents did the heavy lifting, the overall workflow still benefited from a human in the loop – namely, me orchestrating the process, verifying outputs, and adding prompts where needed. This hybrid approach aligns with how consulting is evolving. AI automates the grunt work (data crunching, initial drafts), freeing up human consultants to add judgment, domain knowledge, and the final polish. Studies already show AI can cut analysis and writing time nearly in half for professionals, allowing us to focus on higher-value thinking.

  • Role Definition for AI Agents: I learned the importance of “role assignment” for AI. Much like you’d assign roles in a project team, you should assign roles to different AI agents: data scientist, research analyst, visualizer, editor, and so on. Each model was given a task matched to what it does best, and this role clarity helped prevent errors and overlap. This concept of AI orchestration is gaining traction – multi-agent systems can tackle complex workflows more reliably by delegating subtasks to the right agent.

  • Quality of Output vs. Model Hype: Newer or bigger isn’t always better. It was eye-opening that the much-hyped models didn’t always win in every category. GPT-4 (ChatGPT) is incredibly powerful, yet a smaller model tuned on the domain might beat it for a specific task. Similarly, Grok 4 is touted as a game-changer, but I only saw its true edge in the final summarization, not earlier phases. The best solution might be an ensemble of models and tools, rather than betting on one “super AI” to do everything.

  • Future of Consulting – Human-AI Teams: As an MBA student preparing for a career in consulting, this experiment felt like a glimpse into the future. I can easily imagine project teams where alongside the human members, you have AI agents working on data and insights continuously. The human consultants define problems, guide the AI, and then synthesize the final strategy and client recommendations. This human-AI collaboration could massively accelerate project cycles while maintaining (or improving) quality. The key will be learning how to work with AI as teammates, including understanding their “personalities” (styles, limitations) and communicating effectively in prompts.

In conclusion, my multi-agent approach to generating a consulting report was a resounding success. It delivered not just an “optimal solution” for this case, but also a valuable lesson: the whole can truly be greater than the sum of its AI parts. By orchestrating specialized AI agents, we can achieve outcomes that none of them could produce alone. For anyone in business analytics or consulting, this opens up exciting opportunities to boost productivity, enhance insight quality, and redefine workflows. The future of consulting may well be consultants + AI agents working hand-in-hand – and based on my experience, that future is extremely promising.
