Decoding the True Cost of Generative AI for Your Enterprise
Image generated with Midjourney

"What’s the cost of generative AI?" is a question that comes up in almost every customer conversation I'm involved in: an airline looking to incorporate Large Language Models (LLMs) for customer service, an insurance company looking to use LLMs to summarize claims, and a bank using LLMs to classify customer complaints. Today, generative AI technologies are being used by enterprises for use cases including summarization, classification, content generation, information extraction, and content-grounded Q&A.

As enterprises start adopting LLMs for these tasks, they will quickly realize that the path to success is not straightforward. Choosing the "right" LLM from a collection of thousands of open-source and proprietary models requires a careful examination of the tradeoff between cost and performance. In this article, I look into the common pricing metrics that dominate the LLM landscape today. In a future article, I will walk through the cost/performance tradeoff and how to pick the right LLM for a specific use case.

💡 Unlike traditional ML, generative models such as LLMs require GPU compute, making their use several times more expensive.


The true cost of generative AI

There are four common costs to operationalizing generative AI within an enterprise. I will describe these in terms of an LLM because they are the most popular at the moment, but these costs apply to other types of generative models as well (e.g., diffusion models, GANs, and VAEs).

  • Inference cost — the cost of calling an LLM to generate a response
  • Tuning cost — the cost of tuning an LLM to drive tailored pre-trained model responses
  • Pre-training cost — the cost of training a new LLM from scratch
  • Hosting cost — the cost of deploying and maintaining a model behind an API, supporting inference or tuning

Inference cost

Each time you provide input to an LLM (the prompt) in order to generate an output (the completion), it uses compute resources. This process of invoking a trained LLM to generate an output is called inference. The cost of this process, primarily driven by GPU compute, is called the inference cost.

For LLMs, inference operates on discrete units of information called tokens. One token roughly consists of 3-5 characters of text.

The cost of a single inference depends on the number of tokens in both the prompt and the completion.

💰💡👉 Inference is typically priced per 1K tokens (about 750 words)

Example inference cost for 1,000 prompt tokens and 1,000 completion tokens
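To make this pricing model concrete, here is a minimal Python sketch of the per-call calculation; the rates shown are hypothetical placeholders, not any particular provider's pricing.

```python
def inference_cost(prompt_tokens: int, completion_tokens: int,
                   prompt_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Dollar cost of a single LLM call, priced per 1K tokens."""
    return ((prompt_tokens / 1000) * prompt_rate_per_1k
            + (completion_tokens / 1000) * completion_rate_per_1k)

# 1,000 prompt tokens and 1,000 completion tokens at hypothetical rates
# of $0.01 (prompt) and $0.03 (completion) per 1K tokens:
print(inference_cost(1000, 1000, 0.01, 0.03))  # 0.04
```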

Tuning cost

Pre-trained models may be tuned in two ways in order to drive tailored responses: prompt tuning and fine tuning.

Prompt tuning is an efficient, low-cost way of adapting a pre-trained model to new tasks: it learns a small set of additional prompt parameters while the model's weights stay frozen, so it does not alter the pre-trained model.

Fine tuning is the process of adapting a pre-trained model to perform a specific task by conducting additional training with new data. Fine tuning a model alters the pre-trained model and therefore has a much higher training cost.

💰💡👉 Tuning is typically priced per compute hour

The cost of tuning is a function of how much compute is required to perform the tuning operation. It is typically charged on a per compute hour basis, with the hourly rate depending on the type of GPU used to perform the tuning. Larger models will require more compute cycles for tuning, leading to a higher tuning cost.

Example tuning costs for an LLM undergoing prompt tuning and fine tuning
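As a minimal sketch, the same idea in Python; the $24/hr rate matches the case study later in this article, while the job durations are assumptions for illustration.

```python
def tuning_cost(compute_hours: float, rate_per_hour: float) -> float:
    """Dollar cost of a tuning job, priced per compute hour."""
    return compute_hours * rate_per_hour

# Assumed durations: prompt tuning finishes far faster than fine tuning
# because the base model's weights stay frozen.
print(tuning_cost(2, 24))   # prompt tuning, ~2 hours  -> $48
print(tuning_cost(48, 24))  # fine tuning,  ~48 hours -> $1,152
```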

Pre-training cost

Pre-training is the process of training a new LLM from scratch. This process gives enterprises full control over the data used to train an LLM, but it can also be cost prohibitive.

As an example, GPT-3's pre-training was reportedly performed using 1,024 GPUs over the course of 34 days, at an estimated cost of $4.6M in compute resources alone. Hence, only a small number of model providers in the marketplace have taken on the challenge of pre-training LLMs from scratch.

Example pre-training costs for an LLM undergoing 5 months of training
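A quick back-of-the-envelope check of the GPT-3 figure above, assuming the quoted GPU count and duration:

```python
# 1,024 GPUs running around the clock for 34 days:
gpu_hours = 1024 * 34 * 24        # 835,584 GPU-hours
implied_rate = 4.6e6 / gpu_hours  # implied blended rate: ~$5.50/GPU-hour
print(f"{gpu_hours:,} GPU-hours at ~${implied_rate:.2f}/GPU-hour ≈ $4.6M")
```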

Hosting cost

There are two ways to make an LLM available for inference:

  1. Inference API. The LLM is pre-deployed by a platform provider and made available via an API. The cost of inference for a single API call is based on the number of tokens processed (prompt plus completion) in that API call.
  2. Hosting. The LLM is made available for deployment by a platform provider. Customers who wish to deploy a model (making it available for their application) pay based on the duration of that model's deployment.

💰💡👉 Hosting is typically priced per hour

Example costs for hosting an LLM versus using it with an inference API
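The break-even point between the two options depends on volume: hosting is a flat hourly fee, while an inference API scales with tokens. A hedged sketch with hypothetical rates:

```python
api_rate_per_1k = 0.01        # $ per 1K tokens (prompt + completion), hypothetical
hosting_rate_per_hour = 6.00  # $ per hour of deployment, hypothetical
hours_per_month = 730

hosting_monthly = hosting_rate_per_hour * hours_per_month   # $4,380/month
breakeven_tokens = hosting_monthly / api_rate_per_1k * 1000
print(f"Hosting wins beyond ~{breakeven_tokens / 1e6:.0f}M tokens/month")  # ~438M
```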

Case study

Let’s examine the case of Emma, a fictitious head of product at a tech startup. We met Emma previously as we examined how she scaled her team too quickly. Emma is intrigued by the opportunities provided by generative AI and would like to evaluate the ROI of integrating LLMs within her firm.

One of the popular applications of generative AI is summarization. Emma's firm embraced remote work during the pandemic and made a strategic decision to remain remote-first permanently. As a result, a large number of online meetings occur within the firm. Let's assume that each employee attends an average of five 30-minute meetings each day, with an average of 3 employees in each meeting.

Emma has a great idea to use LLMs to automatically generate one-page summaries of each meeting in her firm, highlighting action items and important decisions that were made. Emma expects automating this process will save her colleagues time and enhance transparency within the firm, resulting in a productivity boost that can translate to a significant revenue increase next year.

Let's look into the cost of implementing Emma's scenario and whether the revenue increase justifies the investment.

Assumptions

  • Number of employees at Emma's firm: 700
  • Each meeting is (on average) 30 minutes long
  • Each employee has (on average) 5 meetings each day, with (on average) 2 colleagues per meeting (= 3 people per meeting)

Step 1: Estimate the number of summarization requests per day

Let N = Number of summarization requests per day.

N = Number of employees * Number of meetings per day per employee / Number of employees per meeting

N = 700 * 5 / 3 = 1,166 meetings per day (rounding down)

Step 2: Calculate the size of the prompt and completion

In a conversation, people typically speak at a rate of 120-150 words/min.

Therefore, a 30-minute call would contain approximately 30 * 150 = 4,500 words as a rough upper bound.

1,000 tokens for an LLM corresponds to approximately 750 words, which means a 30-minute meeting transcript would require about (4,500 words / 750 words per 1K tokens) * 1,000 tokens = 6,000 tokens. This will be the size of the prompt (ignoring a negligible number of tokens to specify the task instruction, such as "Produce a summary of this meeting").

The output from the model will be a one-page meeting summary. One page of text is approximately 500 words, corresponding to (500 / 750) * 1,000 ≈ 667 tokens. Therefore, roughly 667 tokens are needed for the completion.

Step 3: Calculate the cost of generating one meeting summary

Recall how to compute inference cost:

Inference Cost = # prompt tokens * prompt cost per token + # completion tokens * completion cost per token

For a single meeting, the inference cost would be:

Inference Cost = 6,000 tokens (prompt) * prompt cost per token + 667 tokens (completion) * completion cost per token

Many platform providers list their pricing in units of 1K tokens. For example, Anthropic's Claude 2 model has the following pricing (as of August 14, 2023):

  • $0.01102/1K tokens (prompt)
  • $0.03268/1K tokens (completion)

Therefore, our inference cost would be:

Inference Cost = 6 * prompt cost per 1K tokens + 0.667 * completion cost per 1K tokens

= 6 * $0.01102 + 0.667 * $0.03268 ≈ $0.09

Step 4: Calculate the total cost for producing meeting summaries for a single day at Emma's firm

With 1,166 meetings per day, and a cost of $0.09 per meeting, Emma's firm would spend $105 per day to produce meeting summaries.

This translates to an annual cost of 365 * $105 = $38,325. Let's round this up to $40K / year for our analysis.
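Pulling Steps 1-4 together, here is the whole calculation as a short Python script; the total comes out slightly below the rounded figures above because nothing is rounded along the way.

```python
# Step 1: summarization requests per day (rounding down, as in the text)
n = 700 * 5 // 3                         # 1,166 meetings per day

# Step 2: prompt and completion sizes in tokens
prompt_tokens = 30 * 150 / 750 * 1000    # 30-min transcript ≈ 6,000 tokens
completion_tokens = 500 / 750 * 1000     # one-page summary ≈ 667 tokens

# Step 3: cost per meeting at the Claude 2 rates quoted above
per_meeting = ((prompt_tokens / 1000) * 0.01102
               + (completion_tokens / 1000) * 0.03268)  # ≈ $0.088

# Step 4: annual cost
print(f"{n} meetings/day, ${per_meeting:.3f}/meeting, "
      f"${per_meeting * n * 365:,.0f}/year")            # ≈ $37K-$38K/year
```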

Analysis

Recall from our prior case study that Emma projects $27.5K in revenue for next year. She expects her implementation of LLMs to increase her projected revenue to an impressive all-time high of $40K/year! Nearly 1.5 times the revenue sounds great, but should she celebrate yet?

Looking at her income statement, an annual spend of $40K would crush her margins, yielding a very unhealthy Rule of 40 score of 13% (a healthy software business targets a combined revenue growth rate and profit margin of at least 40%)!

Ratio Analysis for Emma's case when she implements very large general-purpose LLMs


The LLM that Emma picked for this scenario is a large, general-purpose LLM with a high inference cost. These kinds of LLMs demonstrate impressively high performance on a wide range of tasks, but as we see in this example, the high cost of adopting such models at enterprise scale may not be viable.

Currently, a popular trend is to use large, general-purpose LLMs to experiment with various business use cases and evaluate ROI. Once a use case has been identified, it often makes sense to reduce operational costs by tuning a much smaller model that achieves comparable performance at a fraction of the cost.

For example, Emma can significantly reduce her LLM cost by tuning a much smaller LLM for her summarization use case. One example of a smaller model is MPT-7B-Instruct2, which has the following pricing as hosted on IBM's watsonx.ai:

  • $0.0006 per 1K tokens (prompt and completion)

The cost for Emma's firm of using this 7B-parameter model is then:

Inference Cost = 6 * $0.0006 + 0.667 * $0.0006 = $0.0036 + $0.0004 = $0.0040

Annual Cost = $0.0040 * 1,166 meetings/day * 365 days ≈ $1,702

Of course, this annual cost doesn't take into account the cost of tuning the model. Let's assume that we need 48 hours of compute to tune the model. At a rate of $24/hr, that adds a $1,152 tuning cost.

Total Cost = $1,702 (inference) + $1,152 (tuning) = $2,854
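As a sanity check, here is a short sketch comparing the two options with unrounded numbers; it lands near 13X, and rounding the large model's annual cost up to $40K gives the 14X quoted below.

```python
# Cost per meeting at the per-1K-token rates quoted in the article
large = 6 * 0.01102 + 0.667 * 0.03268  # ≈ $0.088 per meeting
small = 6 * 0.0006 + 0.667 * 0.0006    # ≈ $0.0040 per meeting

annual_large = large * 1166 * 365            # ≈ $37.4K
annual_small = small * 1166 * 365 + 48 * 24  # ≈ $2.85K, tuning included
print(f"~{annual_large / annual_small:.0f}x cheaper")  # ~13x unrounded
```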

Ratio Analysis for Emma's case when she implements a smaller, tuned LLM


By using a smaller, tuned LLM, Emma reduced her total cost by 14X!

💡 The cost of using an LLM through a subscription falls under Cost of Goods Sold (COGS), which will significantly impact the profitability of every enterprise embracing LLM technology.

This example illustrates the importance of picking the right model:

You can make or break your business depending on which LLM you pick

As a product leader, it's crucial to understand the true cost of generative AI at the scale of your enterprise to navigate the cost/performance trade-off.

In my next article, I’ll dive deeper into this trade-off between model size and performance and how to pick the right LLM that achieves reasonable performance at a fraction of the cost of large, general-purpose models.

I occasionally share my insider tips for maximizing strategy and performance tracking. Subscribe here for free to stay tuned.

Comments

Adefoluke Shemsu

Policy Fellow at SeedAI | Master's Candidate at GSPM | Ex-Accenture & Tech Startups | Lover of the interplay between big ideas and implementation

Massively helpful, Maryam! I enjoyed the case study as well. Thanks for sharing! How might you consider incorporating these costs at an AI product company? In a previous product role where we offered Banking as a Service, there was a CTM (Cost to Maintain) model that compiled the varied costs of software (specifically, the software enabling our products) and gave us a clearer snapshot of progress toward varied revenue goals, given our current stack of tech partnerships and features for customers. I wondered if LLM costs could be attributed to that same internal metric, given the prudence of not only understanding the ROI/value for our end users but also the ROI applied against GTM/Product KPIs. I'm especially curious about how this might break out when considering a customer-hosted solution versus a company-hosted one, where I imagine there being a significant difference in hosting and pre-training costs to consider. Do you think the equations from the case study would still be applicable as the provider of the AI product versus being the adopter of it?

Aga (Agnieszka) Walewska PMP®

Change Manager || AI Project Management || Business Automation Consultant || Process Mining / Celonis || Data Scientist

Thank you so much, Maryam Ashoori, PhD - it's exactly what I needed today!

Ravi Karan K.

Sales Strategy & New Initiatives

I was eagerly looking for resources to know about the costing of generative AI, this write up is a good primer in this regard.

Muru Ramakrishnamuthu

Director, Product Management & Data Management at Silicon Valley Bank

This can provide quick help, especially for small and medium-sized enterprises.

Karine Lévénès

Senior Project Manager Hybrid Cloud Services @ IBM Consulting

Very interesting article while enterprises are developing their AI strategy.

