Serve Diffusion Transformer models using xDiT container on Cloud GPUs

This document shows you how to serve Diffusion Transformer (DiT) models with the xDiT container on Cloud GPUs in Vertex AI. This document covers the following topics: the benefits of serving DiT models with xDiT, supported models, the parallelism and optimization techniques that xDiT uses, how to get started in Model Garden, how to customize your deployment, and the xDiT arguments reference.

xDiT is an open-source library that accelerates inference for Diffusion Transformer (DiT) models by using parallelism and optimization techniques. These techniques enable a scalable multi-GPU setup for demanding workloads. For more information about xDiT, see the xDiT GitHub project.
Benefits

Using xDiT to serve DiT models on Vertex AI provides the following benefits: accelerated multi-GPU inference through hybrid parallelism, single-GPU optimizations such as torch.compile, onediff, and DiTFastAttn, and an xDiT-optimized serving container that you can deploy directly from Model Garden.
Supported models

xDiT is available for certain DiT model architectures in Vertex AI Model Garden, such as Flux.1 Schnell, CogVideoX-2b, and the Wan2.1 text-to-video model variants. To check whether a DiT model in Model Garden supports xDiT, view its model card in Model Garden.
Parallelism and optimization techniques
Hybrid parallelism for multi-GPU performance

xDiT uses a combination of parallelism techniques to maximize performance on multi-GPU setups. These techniques work together to distribute the workload and optimize resource utilization. For more information about performance improvements, see the xDiT report on Flux.1 Schnell or CogVideoX-2b. Google has reproduced these results on Vertex AI Model Garden.
Unified sequence parallelism
How it works: This technique splits the input data (such as splitting an image into patches) across multiple GPUs, which reduces memory usage and improves scalability.
Use case: Reduces memory usage per GPU, which enables larger models or higher resolutions.

PipeFusion
How it works: PipeFusion divides the DiT model into stages and assigns each stage to a different GPU, which enables parallel processing of different parts of the model.
Use case: Enables parallel processing of different parts of the model for a single input.

CFG parallelism
How it works: This technique accelerates models that use classifier-free guidance (CFG) by parallelizing the computation of the conditional and unconditional branches, which leads to faster inference.
Use case: Speeds up inference for models that use CFG to control output style.

Data parallelism
How it works: This method replicates the entire model on each GPU, with each GPU processing a different batch of input data.
Use case: Increases overall throughput by processing multiple inputs simultaneously.
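To combine these techniques, you set the corresponding xDiT environment variables (described in the xDiT arguments reference) and pass them as container environment variables when you deploy, as shown in Customize your deployment. The following sketch shows one possible hybrid configuration for a hypothetical 8-GPU deployment; the specific degree values are illustrative assumptions, chosen so that the product of the parallel degrees (times 2 when CFG parallelism is enabled) matches the GPU count.

# Illustrative hybrid parallelism settings for a hypothetical 8-GPU deployment.
# Assumption: PIPEFUSION_PARALLEL_DEGREE * ULYSSES_DEGREE * RING_DEGREE * 2 (CFG) == N_GPUS.
hybrid_parallel_env_vars = {
    "N_GPUS": "8",
    "PIPEFUSION_PARALLEL_DEGREE": "2",  # pipeline stages across GPUs
    "ULYSSES_DEGREE": "2",              # all-to-all sequence parallelism
    "RING_DEGREE": "1",                 # ring (peer-to-peer) sequence parallelism
    "USE_CFG_PARALLEL": "true",         # parallelize conditional and unconditional branches
}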
Single GPU acceleration

The xDiT library provides benefits for single-GPU serving by using torch.compile and onediff to enhance runtime speed on GPUs. You can also use these techniques in conjunction with hybrid parallelism. xDiT also has an efficient attention computation technique, called DiTFastAttn, to address DiT's computational bottleneck. This technique is only available for single-GPU setups or in conjunction with data parallelism.
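For example, a single-GPU deployment might enable compiler-based acceleration by overriding the following environment variables, which are described in the xDiT arguments reference. This is a minimal sketch; whether torch.compile or onediff gives the better speedup depends on the model.

# Illustrative single-GPU acceleration settings.
single_gpu_env_vars = {
    "N_GPUS": "1",
    "USE_TORCH_COMPILE": "true",  # kernel-level optimization through torch.compile
    # "USE_ONEDIFF": "true",      # alternative: OneDiff compilation acceleration
}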
Get started in Model Garden

The xDiT-optimized serving container is available in Vertex AI Model Garden. For supported models, deployments use this container when you use one-click deployment or the Colab Enterprise notebook examples. The following table compares the two deployment methods.
One-click deployment
Description: Deploys a model to a Vertex AI endpoint with pre-configured default settings directly from the Model Garden UI.
Pros: Simple, fast, and requires no code.
Use case: Quickly deploying models for testing or standard use cases without needing custom configurations.

Colab Enterprise notebook
Description: Uses the Vertex AI SDK for Python within a notebook to deploy a model, which allows for detailed configuration and customization.
Pros: Highly flexible; allows for full control over deployment parameters and parallelism strategies.
Use case: Advanced users who need to optimize performance for specific workloads or integrate deployment into a larger workflow.
The following examples use the Flux.1-schnell model to show you how to deploy a DiT model on an xDiT container.

Use one-click deployment

To deploy a custom Vertex AI endpoint with the xDiT container from a model card, use the one-click deployment option in the Model Garden UI. This option deploys the model to a Vertex AI endpoint with pre-configured default settings.
Use the Colab Enterprise notebook

For more flexibility and customization, you can use the Colab Enterprise notebook examples to deploy a Vertex AI endpoint with the xDiT container by using the Vertex AI SDK for Python.
import vertexai
from vertexai import model_garden
# Replace with your project ID and region
vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")
model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")
endpoint = model.deploy()
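After the deployment finishes, you can send requests to the endpoint. The following sketch assumes that the returned endpoint supports the standard Vertex AI predict() method and that the text-to-image container accepts an instance with a text field; the field names and the response format shown here are assumptions, so check the model card or notebook example for the exact request schema.

# Sketch: send a text-to-image request to the deployed endpoint.
# The instance fields below are illustrative; the actual schema depends on the container.
response = endpoint.predict(
    instances=[{"text": "A watercolor painting of a lighthouse at sunrise"}],
)
print(response.predictions[0])  # typically an encoded image; format depends on the container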
To learn how to customize the deployment by overriding default settings, see Customize your deployment.

Customize your deployment
Override default settings

Model Garden provides default xDiT parallelization configurations for supported models. You can view these default settings and override them to meet your needs.
To view the default deployment configuration for a model, such as black-forest-labs/FLUX.1-schnell, use the Vertex AI SDK for Python, as shown in the following code:

import vertexai
from vertexai import model_garden
vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")
model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")
deploy_options = model.list_deploy_options()
# Example Response
# ['black-forest-labs/flux1-schnell@flux.1-schnell']
# [model_display_name: "Flux1-schnell"
# container_spec {
# image_uri: "us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/xdit-serve.cu125.0-2.ubuntu2204.py310"
# env {
# name: "DEPLOY_SOURCE"
# value: "UI_NATIVE_MODEL"
# }
# env {
# name: "MODEL_ID"
# value: "gs://vertex-model-garden-restricted-us/black-forest-labs/FLUX.1-schnell"
# }
# env {
# name: "TASK"
# value: "text-to-image"
# }
# env {
# name: "N_GPUS"
# value: "2"
# }
# env {
# name: "USE_TORCH_COMPILE"
# value: "true"
# }
# env {
# name: "RING_DEGREE"
# value: "2"
# }
# ..........]
The list_deploy_options() method returns the container specifications, including the environment variables (env) that define the xDiT configuration.

To customize the parallelism strategy, override these environment variables when you deploy the model. The following example shows you how to modify the RING_DEGREE and ULYSSES_DEGREE values for a 2-GPU setup to change the parallelism approach:

import vertexai
from vertexai import model_garden
# Replace with your project ID and region
vertexai.init(project="<YOUR_PROJECT_ID>", location="<REGION>")
model = model_garden.OpenModel("black-forest-labs/FLUX.1-schnell")
# Custom environment variables to override default settings
# This example sets N_GPUS as 2, so RING_DEGREE * ULYSSES_DEGREE must equal 2
container_env_vars = {
    "N_GPUS": "2",
    "RING_DEGREE": "1",
    "ULYSSES_DEGREE": "2",
    # Add other environment variables to customize here
}
machine_type = "a3-highgpu-2g"
accelerator_type = "NVIDIA_H100_80GB"
accelerator_count = 2
# Deploy the model with the custom environment variables
endpoint = model.deploy(
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    container_env_vars=container_env_vars,
)
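Because the parallel degrees must be consistent with the number of GPUs (in this example, RING_DEGREE * ULYSSES_DEGREE must equal N_GPUS), you might want to validate the configuration before you deploy. The following helper is a hypothetical convenience function, not part of the SDK; it assumes that only sequence parallelism and optional CFG parallelism are combined.

# Hypothetical helper: check that the requested parallel degrees match N_GPUS.
def check_parallel_config(env_vars):
    n_gpus = int(env_vars.get("N_GPUS", "1"))
    ring = int(env_vars.get("RING_DEGREE", "1"))
    ulysses = int(env_vars.get("ULYSSES_DEGREE", "1"))
    cfg_factor = 2 if env_vars.get("USE_CFG_PARALLEL", "false").lower() == "true" else 1
    if ring * ulysses * cfg_factor != n_gpus:
        raise ValueError(
            f"RING_DEGREE ({ring}) * ULYSSES_DEGREE ({ulysses}) * CFG factor ({cfg_factor}) "
            f"must equal N_GPUS ({n_gpus})"
        )

check_parallel_config(container_env_vars)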
For more examples of serving recipes and configurations for different models, see the xDiT official documentation. For more information about the Model Garden SDK, see the documentation.

xDiT arguments reference

xDiT offers several server arguments that you can configure as environment variables to optimize performance. This section describes key arguments that you might need to configure.
Model Configuration

MODEL_ID (string): Specifies the model identifier to load. This value should match the model name in your registry or path.
Runtime Optimization Arguments

N_GPUS (integer): Specifies the number of GPUs to use for inference. The default is 1.

WARMUP_STEPS (integer): The number of warmup steps to perform before inference begins. This is important when PipeFusion is enabled to help with stable performance. The default is 1.

USE_PARALLEL_VAE (boolean): Enables efficient processing of high-resolution images (larger than 2048 pixels) by parallelizing the VAE component across devices. This can help prevent out-of-memory (OOM) issues for large images. The default is false.

USE_TORCH_COMPILE (boolean): Enables single-GPU acceleration through torch.compile, which provides kernel-level optimizations for improved performance. The default is false.

USE_ONEDIFF (boolean): Enables OneDiff compilation acceleration to optimize GPU kernel execution speed. The default is false.
Data Parallel Arguments

DATA_PARALLEL_DEGREE (integer): Sets the degree of data parallelism. To disable data parallelism, leave this value empty.

USE_CFG_PARALLEL (boolean): Enables parallel computation for classifier-free guidance (CFG), also known as Split Batch. When enabled, the constant parallelism degree is 2. Set this to true when you use CFG to control output style and content. The default is false.
Sequence Parallel Arguments (USP - Unified Sequence Parallelism)

ULYSSES_DEGREE (integer): Sets the Ulysses degree for the unified sequence parallel approach, which combines DeepSpeed-Ulysses and Ring-Attention. This setting controls the all-to-all communication pattern. To use the default, leave this value empty.

RING_DEGREE (integer): Sets the Ring degree for peer-to-peer communication in sequence parallelism. This works with ULYSSES_DEGREE to form the 2D process mesh. To use the default, leave this value empty.
Tensor Parallel Arguments

TENSOR_PARALLEL_DEGREE (integer): Sets the degree of tensor parallelism, which splits model parameters across devices along feature dimensions to reduce memory costs per device. To disable tensor parallelism, leave this value empty.

SPLIT_SCHEME (string): Defines how to split the model tensors across devices (for example, by attention heads or hidden dimensions). To use the default splitting scheme, leave this value empty.
Ray Distributed Arguments

USE_RAY (boolean): Enables the Ray distributed execution framework for scaling computations across multiple nodes. The default is false.

RAY_WORLD_SIZE (integer): The total number of processes in the Ray cluster. The default is 1.

VAE_PARALLEL_SIZE (integer): The number of processes dedicated to VAE parallel processing when you use Ray. The default is 0.

DIT_PARALLEL_SIZE (integer): The number of processes dedicated to DiT backbone parallel processing when you use Ray. The default is 0.
PipeFusion Parallel Arguments

PIPEFUSION_PARALLEL_DEGREE (integer): Sets the degree of parallelism for PipeFusion, a sequence-level pipeline parallelism that uses the input temporal redundancy characteristics of diffusion models. Higher values increase parallelism but also require more memory. The default is 1.

NUM_PIPELINE_PATCH (integer): The number of patches to split the sequence into for pipeline processing. To use automatic determination, leave this value empty.

ATTN_LAYER_NUM_FOR_PP (string): Specifies which attention layers to use for pipeline parallelism. You can provide values as a comma-separated string (for example, "10,9") or a space-separated string (for example, "10 9"). To use all layers, leave this value empty.
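For example, a PipeFusion-based deployment might combine these variables with WARMUP_STEPS from the runtime optimization arguments, because warmup is important when PipeFusion is enabled. The values in the following sketch are illustrative assumptions; PIPEFUSION_PARALLEL_DEGREE must be compatible with your GPU count.

# Illustrative PipeFusion settings; values are assumptions for a 4-GPU setup.
pipefusion_env_vars = {
    "N_GPUS": "4",
    "PIPEFUSION_PARALLEL_DEGREE": "4",  # split the DiT model into four pipeline stages
    "NUM_PIPELINE_PATCH": "8",          # number of patches per pipeline step
    "WARMUP_STEPS": "1",                # warmup helps stabilize PipeFusion performance
}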
Memory Optimization Arguments

ENABLE_MODEL_CPU_OFFLOAD (boolean): Offloads model weights to CPU memory when they are not in use. This reduces GPU memory usage but increases latency. The default is false.

ENABLE_SEQUENTIAL_CPU_OFFLOAD (boolean): Sequentially offloads model layers to the CPU during the forward pass. This enables inference of models that are larger than the available GPU memory. The default is false.

ENABLE_TILING (boolean): Reduces GPU memory usage by decoding the VAE component one tile at a time. This argument is useful for larger images or videos and can help prevent out-of-memory errors. The default is false.

ENABLE_SLICING (boolean): Reduces GPU memory usage by splitting the input tensor into slices for VAE decoding. The default is false.
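For example, if you run into out-of-memory errors when generating high-resolution images or long videos, you might combine several of these flags. This is a sketch; each offloading and tiling option trades some latency for lower GPU memory usage, so enable only what you need.

# Illustrative memory-saving settings for large outputs.
memory_saving_env_vars = {
    "ENABLE_TILING": "true",             # decode the VAE one tile at a time
    "ENABLE_SLICING": "true",            # split the VAE input into slices
    "ENABLE_MODEL_CPU_OFFLOAD": "true",  # offload idle weights to CPU memory
}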
DiTFastAttn Arguments (Attention Optimization)

USE_FAST_ATTN (boolean): Enables DiTFastAttn acceleration for single-GPU inference, which uses Input Temporal Reduction to reduce computational complexity. The default is false.

N_CALIB (integer): The number of calibration samples for DiTFastAttn optimization. The default is 8.

THRESHOLD (float): The similarity threshold for Temporal Similarity Reduction in DiTFastAttn. The default is 0.5.

WINDOW_SIZE (integer): The window size for Window Attention with Residual Caching, which is used to reduce spatial redundancy. The default is 64.

COCO_PATH (string): The path to the COCO dataset for DiTFastAttn calibration. This is required when USE_FAST_ATTN is true. If you are not using DiTFastAttn, leave this value empty.
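The following sketch shows how these variables might be set together for a single-GPU deployment. The calibration path is a placeholder that you would replace with a path the container can access, and the threshold and window values shown are the documented defaults.

# Illustrative DiTFastAttn settings for single-GPU inference.
ditfastattn_env_vars = {
    "N_GPUS": "1",
    "USE_FAST_ATTN": "true",
    "N_CALIB": "8",                # calibration samples
    "THRESHOLD": "0.5",            # temporal similarity threshold
    "WINDOW_SIZE": "64",           # window attention size
    "COCO_PATH": "/path/to/coco",  # placeholder: required when USE_FAST_ATTN is true
}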
Cache Optimization Arguments

USE_CACHE (boolean): Enables general caching mechanisms to reduce redundant computations. The default is false.

USE_TEACACHE (boolean): Enables the TeaCache optimization method for caching intermediate results. The default is false.

USE_FBCACHE (boolean): Enables the First-Block-Cache (FBCache) optimization method. The default is false.
Precision Optimization Arguments

USE_FP8_T5_ENCODER (boolean): Enables FP8 (8-bit floating point) precision for the T5 text encoder. This reduces memory usage and can improve throughput with minimal impact to quality. The default is false.