Choose a tuning method
LoRA and QLoRA are parameter-efficient fine-tuning (PEFT) methods that let you adapt a large language model to your specific tasks without the high computational cost of full fine-tuning. The following table compares LoRA and QLoRA to help you choose the best method for your use case; a configuration sketch for each method follows the table.
Method | Description | Pros | Cons |
---|---|---|---|
LoRA | A tuning method that freezes the original model's weights and injects trainable low-rank matrices into the Transformer layers. This significantly reduces the number of trainable parameters. | Faster tuning speed and lower cost compared to QLoRA. | Higher GPU memory usage, which limits batch size and maximum sequence length. May not be feasible on lower-memory GPUs. |
QLoRA | An optimized version of LoRA that uses 4-bit quantization and other memory-saving techniques to further reduce memory usage. | Significantly lower GPU memory usage, allowing for larger models, larger batch sizes, and longer sequence lengths on the same hardware. | Slower tuning speed and slightly higher cost compared to LoRA. |
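To make the difference concrete, the following is a minimal sketch of how each method is typically configured. It assumes the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries and uses openLLaMA-7B as an example checkpoint; neither the libraries nor the hyperparameter values are prescribed by this guide. The only change between the two methods is whether the frozen base weights are loaded in 16-bit or quantized to 4-bit; the trainable low-rank adapters are identical.

```python
# A minimal sketch, assuming the Hugging Face transformers, peft, and
# bitsandbytes libraries. The model name and hyperparameter values are
# illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "openlm-research/open_llama_7b"  # example checkpoint; use your own

# Shared adapter settings: the rank and scaling of the trainable low-rank
# matrices, and which attention projections receive them.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

def load_lora_model():
    """LoRA: keep the frozen base weights in 16-bit and attach adapters."""
    base = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
    )
    return get_peft_model(base, lora_config)

def load_qlora_model():
    """QLoRA: quantize the frozen base weights to 4-bit, then attach adapters."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
    )
    return get_peft_model(base, lora_config)
```

Because both paths attach the same adapters, you can start with one method and switch to the other without changing your training loop; only the base-model loading step differs.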
Tuning recommendations
The following table summarizes recommendations for tuning LLMs by using LoRA or QLoRA. A sketch of where these settings appear in a training configuration follows the table.
Specification | Recommended | Details |
---|---|---|
GPU memory efficiency | QLoRA | QLoRA has about 75% smaller peak GPU memory usage compared to LoRA. |
Speed | LoRA | LoRA tuning is about 66% faster than QLoRA.
Cost efficiency | LoRA | While both methods are relatively inexpensive, LoRA is up to 40% less expensive than QLoRA. |
Higher max sequence length | QLoRA | Higher max sequence length increases GPU memory consumption. QLoRA uses less GPU memory, so it can support higher max sequence lengths. |
Accuracy improvement | Same | Both methods offer similar accuracy improvements. |
Higher batch size | QLoRA | QLoRA supports much higher batch sizes than LoRA. For example, when tuning openLLaMA-7B on the same GPU, QLoRA can accommodate a significantly larger batch size.
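The batch size and max sequence length recommendations above correspond to two settings in a typical training setup. The following sketch, again assuming the Hugging Face `transformers` library, marks where they appear; the numeric values and the `"text"` dataset field are illustrative assumptions, not benchmarked recommendations.

```python
# A minimal sketch, assuming the Hugging Face transformers library; the numeric
# values and the "text" dataset field are illustrative assumptions.
from transformers import AutoTokenizer, TrainingArguments

MAX_SEQ_LENGTH = 1024        # longer sequences increase peak GPU memory
PER_DEVICE_BATCH_SIZE = 4    # QLoRA's lower memory footprint allows larger values

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")

def tokenize(example):
    # Truncate each example to the chosen max sequence length.
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )

training_args = TrainingArguments(
    output_dir="tuned-adapter",
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=4,  # raises the effective batch size without
                                    # raising per-device memory usage
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
)
```

When GPU memory is the bottleneck, lowering the per-device batch size and raising `gradient_accumulation_steps` keeps the effective batch size constant at the cost of tuning speed, which mirrors the LoRA-versus-QLoRA trade-off in the table.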