Generative AI on Vertex AI quotas and system limits
Note: Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
This page introduces two ways to consume generative AI services, provides a list
of quotas by region and model, and shows you how to view and edit your quotas in
the Google Cloud console.
Overview
There are two ways to consume generative AI services. You can choose
pay-as-you-go (PayGo), or you can pay in advance using
Provisioned Throughput.
If you're using PayGo, your usage of generative AI features is subject to one of
the following quota systems, depending on which model you're using:
- Models earlier than Gemini 2.0 use a standard quota system for each generative AI model to help ensure fairness and to reduce spikes in resource use and availability. Quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.
- Newer models use Dynamic shared quota (DSQ), which dynamically distributes available PayGo capacity among all customers for a specific model and region, removing the need to set quotas and to submit quota increase requests. There are no quotas with DSQ.
To help ensure high availability for your application and to get predictable
service levels for your production workloads, see
Provisioned Throughput.
Non-Gemini and earlier Gemini models use the standard
quota system. For more information, see
Vertex AI quotas and limits.
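When shared capacity for a model is saturated, requests can fail with a 429 (RESOURCE_EXHAUSTED) error. The following is a minimal sketch of one common client-side pattern, retrying those errors with exponential backoff; it assumes the google-cloud-aiplatform Python SDK, and the project ID, model name, and prompt are placeholders.

```python
# A minimal sketch: retry 429 / RESOURCE_EXHAUSTED errors with exponential
# backoff. Assumes the google-cloud-aiplatform SDK; the project ID and
# prompt are placeholders.
import vertexai
from google.api_core import exceptions, retry
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
model = GenerativeModel("gemini-2.0-flash-001")

@retry.Retry(
    predicate=retry.if_exception_type(exceptions.ResourceExhausted),
    initial=1.0,     # first delay, in seconds
    multiplier=2.0,  # double the delay after each failed attempt
    maximum=32.0,    # cap the delay between attempts
    timeout=120.0,   # give up after two minutes
)
def generate(prompt: str) -> str:
    return model.generate_content(prompt).text

print(generate("Say hello."))
```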
Tuned model quotas
Tuned model inference shares the same quota as the base model.
There is no separate quota for tuned model inference.
Text embedding limits
Each request can include up to 250 input texts (one embedding is generated per input text) and up to 20,000 tokens. Only the first 2,048 tokens in each input text are used to compute the embeddings. For gemini-embedding-001, the quota is listed under the name gemini-embedding.
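As an illustration of the request limit, the following sketch splits a large corpus into requests of at most 250 input texts. It assumes the google-cloud-aiplatform Python SDK's TextEmbeddingModel interface; the project ID, model name, and corpus are placeholders.

```python
# A minimal sketch: batch inputs to respect the 250-texts-per-request limit.
# Assumes the google-cloud-aiplatform SDK; the project ID, model name, and
# corpus are placeholders. Inputs longer than 2,048 tokens are truncated
# server-side.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="my-project", location="us-central1")  # placeholder project
model = TextEmbeddingModel.from_pretrained("text-embedding-005")  # example model

texts = [f"document {i}" for i in range(1_000)]  # placeholder corpus

embeddings = []
for start in range(0, len(texts), 250):  # at most 250 input texts per request
    embeddings.extend(model.get_embeddings(texts[start:start + 250]))

print(len(embeddings))  # 1000
```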
Embed content input tokens per minute per base model
Unlike previous embedding models, which were primarily limited by RPM quotas, the quota for the Gemini Embedding model limits the number of input tokens that can be sent per minute per project.
Quota | Value
Embed content input tokens per minute | 5,000,000
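Because this quota counts input tokens per minute rather than requests per minute, client-side pacing has to track a token budget. The following is a minimal sketch of one possible approach; the roughly-four-characters-per-token estimate is an assumption rather than an official tokenizer, and the quota itself is enforced server-side.

```python
# A minimal client-side pacing sketch for a tokens-per-minute quota. The
# ~4-characters-per-token estimate is a rough assumption; the quota itself
# is enforced server-side.
import time

class TokenBudget:
    def __init__(self, tokens_per_minute: int = 5_000_000):
        self.limit = tokens_per_minute
        self.window_start = time.monotonic()
        self.used = 0

    def estimate(self, texts: list[str]) -> int:
        # Rough heuristic: about 4 characters per token.
        return sum(len(t) // 4 + 1 for t in texts)

    def wait_for(self, texts: list[str]) -> None:
        """Block until sending `texts` fits in the current one-minute window."""
        needed = self.estimate(texts)
        elapsed = time.monotonic() - self.window_start
        if elapsed >= 60:
            # A new one-minute window has started.
            self.window_start, self.used = time.monotonic(), 0
        elif self.used + needed > self.limit:
            # Budget exhausted: sleep out the rest of the window, then reset.
            time.sleep(60 - elapsed)
            self.window_start, self.used = time.monotonic(), 0
        self.used += needed
```

Calling wait_for(batch) before each embedding request keeps the per-minute token budget in view alongside the per-request limits described earlier.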
Vertex AI Agent Engine limits
The following limits apply to Vertex AI Agent Engine for a given project in each region:
Description | Limit
Create, delete, or update Vertex AI Agent Engine per minute | 10
Create, delete, or update Vertex AI Agent Engine sessions per minute | 100
Query or StreamQuery Vertex AI Agent Engine per minute | 60
Append event to Vertex AI Agent Engine sessions per minute | 300
Maximum number of Vertex AI Agent Engine resources | 100
Create, delete, or update Vertex AI Agent Engine memory resources per minute | 100
Get, list, or retrieve from Vertex AI Agent Engine Memory Bank per minute | 300
Batch prediction
The quotas and limits for batch inference jobs are the same across all regions.
Concurrent batch inference job limits for Gemini models
There are no predefined quota limits on batch inference for Gemini models. Instead, the batch service provides access to a large, shared pool of resources that is dynamically allocated based on the model's real-time availability and demand across all customers for that model. When more customers are active and the model's capacity is saturated, your batch requests might be queued until capacity becomes available.
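Because batch requests can wait in a queue before they run, a common pattern is to submit the job and poll until it reaches a terminal state. The following is a minimal sketch that assumes the google-cloud-aiplatform Python SDK's preview batch prediction interface; the project ID and Cloud Storage URIs are placeholders.

```python
# A minimal sketch: submit a Gemini batch inference job and poll while it
# may be queued for shared capacity. Assumes the google-cloud-aiplatform
# SDK's preview batch prediction API; the project ID and URIs are
# placeholders.
import time

import vertexai
from vertexai.preview import batch_prediction

vertexai.init(project="my-project", location="us-central1")  # placeholder project

job = batch_prediction.BatchPredictionJob.submit(
    source_model="gemini-2.0-flash-001",
    input_dataset="gs://my-bucket/batch_requests.jsonl",  # placeholder input
    output_uri_prefix="gs://my-bucket/batch_output/",     # placeholder output
)

# The job may wait in a pending state until capacity is allocated.
while not job.has_ended:
    time.sleep(60)
    job.refresh()

print(job.state)  # for example, JOB_STATE_SUCCEEDED
```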
For non-Gemini models, such as textembedding-gecko, concurrent batch prediction jobs are governed by the standard quota system. To adjust the quota in the Google Cloud console:
1. Copy and paste the property aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs into the Filter field, and then press Enter.
2. Click the three dots at the end of the row, and select Edit quota.
3. Enter a new quota value in the pane, and click Submit request.
Vertex AI RAG Engine
For each service to perform retrieval-augmented generation (RAG) using RAG Engine, the
following quotas apply, with the quota measured as requests per minute (RPM).
Service | Quota | Metric
RAG Engine data management APIs | 60 RPM | VertexRagDataService requests per minute per region
RetrievalContexts API | 1,500 RPM | VertexRagService retrieve requests per minute per region
base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region per base_model
An additional filter for you to specify is base_model: textembedding-gecko
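The retrieve quota applies to retrieval calls such as the following minimal sketch, which assumes the google-cloud-aiplatform Python SDK's preview rag module; the project ID, corpus resource name, and query text are placeholders.

```python
# A minimal sketch: a RAG Engine retrieval call, which counts against the
# VertexRagService retrieve RPM quota. Assumes the google-cloud-aiplatform
# SDK's preview rag module; the project ID and corpus name are placeholders.
import vertexai
from vertexai.preview import rag

vertexai.init(project="my-project", location="us-central1")  # placeholder project

response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus="projects/my-project/locations/us-central1/ragCorpora/123",
        )
    ],
    text="What are the RAG Engine quota limits?",
    similarity_top_k=10,
)
print(response)
```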
The following limits apply:
Service | Limit | Metric
Concurrent ImportRagFiles requests | 3 RPM | VertexRagService concurrent import requests per region
Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region
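To stay under the 10,000-files-per-request limit, a large corpus can be imported in slices, issued one request at a time, which also keeps you under the concurrent-import limit. The following is a minimal sketch that assumes the google-cloud-aiplatform Python SDK's preview rag module; the project ID, corpus resource name, and file paths are placeholders.

```python
# A minimal sketch: import a large file list in slices of at most 10,000
# paths per ImportRagFiles request, issued serially to stay under the
# concurrent-import limit. Assumes the google-cloud-aiplatform SDK's
# preview rag module; the project ID, corpus name, and paths are
# placeholders.
import vertexai
from vertexai.preview import rag

vertexai.init(project="my-project", location="us-central1")  # placeholder project

corpus_name = "projects/my-project/locations/us-central1/ragCorpora/123"
paths = [f"gs://my-bucket/docs/file_{i}.pdf" for i in range(25_000)]

for start in range(0, len(paths), 10_000):  # at most 10,000 files per request
    response = rag.import_files(corpus_name, paths[start:start + 10_000])
    print(response.imported_rag_files_count)
```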
Gen AI evaluation service
The Gen AI evaluation service uses gemini-2.0-flash as a default judge model for model-based metrics.
A single evaluation request for a model-based metric might result in multiple underlying requests to
the Gen AI evaluation service. Each model's quota is calculated on a per-project basis, which means
that any requests directed to gemini-2.0-flash for model inference and
model-based evaluation contribute to the quota.
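For context, a single model-based evaluation call can fan out into several judge-model requests, roughly one or more per dataset row and metric. The following is a minimal sketch that assumes the google-cloud-aiplatform Python SDK's vertexai.evaluation module; the project ID and dataset rows are placeholders.

```python
# A minimal sketch: a model-based evaluation whose underlying judge calls
# (gemini-2.0-flash by default) count against the per-project quota.
# Assumes the google-cloud-aiplatform SDK's vertexai.evaluation module;
# the project ID and dataset rows are placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="my-project", location="us-central1")  # placeholder project

eval_dataset = pd.DataFrame({
    "prompt": ["Summarize the quota page.", "Explain DSQ in one sentence."],
    "response": ["The page lists quotas ...", "DSQ shares capacity ..."],
})

# One pointwise metric means roughly one judge request per dataset row.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
)
result = eval_task.evaluate()
print(result.summary_metrics)
```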
Quotas for the Gen AI evaluation service and the underlying judge model are shown
in the following table:
Request quota | Default quota
Gen AI evaluation service requests per minute | 1,000 requests per project per region
Online prediction requests per minute for base_model: gemini-2.0-flash |
If you receive an error related to quotas while using the Gen AI evaluation service, you might
need to file a quota increase request. See View and manage
quotas for more information.
Limit | Value
Gen AI evaluation service request timeout | 60 seconds
When you use the Gen AI evaluation service for the first time in a new project, you might experience an initial setup delay of up to two minutes. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
The maximum input and output tokens for model-based metrics depend on the model used
as the judge model. See Google models for a
list of models.
Last updated 2025-08-15 UTC.