Choose a tuning method
LoRA and QLoRA are parameter-efficient fine-tuning (PEFT) methods that let you adapt a large language model to your specific tasks without the high computational cost of full fine-tuning. The following table compares LoRA and QLoRA to help you choose the best method for your use case; a configuration sketch for each method follows the table.
Method | Description | Pros | Cons |
---|---|---|---|
LoRA | A tuning method that freezes the original model's weights and injects trainable low-rank matrices into the Transformer layers. This significantly reduces the number of trainable parameters. | Faster tuning speed and lower cost compared to QLoRA. | Higher GPU memory usage, which limits batch size and maximum sequence length. May not be feasible on lower-memory GPUs. |
QLoRA | An optimized version of LoRA that uses 4-bit quantization and other memory-saving techniques to further reduce memory usage. | Significantly lower GPU memory usage, allowing for larger models, larger batch sizes, and longer sequence lengths on the same hardware. | Slower tuning speed and slightly higher cost compared to LoRA. |
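To make the difference concrete, the following is a minimal sketch of how each method is typically configured. It assumes the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries and uses openLLaMA-7B as an example checkpoint; neither the libraries nor the hyperparameter values are prescribed by this guide. The only change between the two methods is whether the frozen base weights are loaded in 16-bit or quantized to 4-bit; the trainable low-rank adapters are identical.

```python
# A minimal sketch, assuming the Hugging Face transformers, peft, and
# bitsandbytes libraries. The model name and hyperparameter values are
# illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "openlm-research/open_llama_7b"  # example checkpoint; use your own

# Shared adapter settings: the rank and scaling of the trainable low-rank
# matrices, and which attention projections receive them.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

def load_lora_model():
    """LoRA: keep the frozen base weights in 16-bit and attach adapters."""
    base = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        torch_dtype=torch.bfloat16,
    )
    return get_peft_model(base, lora_config)

def load_qlora_model():
    """QLoRA: quantize the frozen base weights to 4-bit, then attach adapters."""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
    )
    return get_peft_model(base, lora_config)
```

Because both paths attach the same adapters, you can start with one method and switch to the other without changing your training loop; only the base-model loading step differs.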
Tuning recommendations
The following table summarizes recommendations for tuning LLMs by using LoRA or QLoRA. A sketch of where these settings appear in a training configuration follows the table.
Specification | Recommended | Details |
---|---|---|
GPU memory efficiency | QLoRA | QLoRA has about 75% smaller peak GPU memory usage compared to LoRA. |
Speed | LoRA | LoRA tuning is about 66% faster than QLoRA.
Cost efficiency | LoRA | While both methods are relatively inexpensive, LoRA is up to 40% less expensive than QLoRA. |
Higher max sequence length | QLoRA | Higher max sequence length increases GPU memory consumption. QLoRA uses less GPU memory, so it can support higher max sequence lengths. |
Accuracy improvement | Same | Both methods offer similar accuracy improvements. |
Higher batch size | QLoRA | QLoRA supports much higher batch sizes than LoRA. For example, when tuning openLLaMA-7B on the same GPU, QLoRA can accommodate a significantly larger batch size.
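The batch size and max sequence length recommendations above correspond to two settings in a typical training setup. The following sketch, again assuming the Hugging Face `transformers` library, marks where they appear; the numeric values and the `"text"` dataset field are illustrative assumptions, not benchmarked recommendations.

```python
# A minimal sketch, assuming the Hugging Face transformers library; the numeric
# values and the "text" dataset field are illustrative assumptions.
from transformers import AutoTokenizer, TrainingArguments

MAX_SEQ_LENGTH = 1024        # longer sequences increase peak GPU memory
PER_DEVICE_BATCH_SIZE = 4    # QLoRA's lower memory footprint allows larger values

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")

def tokenize(example):
    # Truncate each example to the chosen max sequence length.
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )

training_args = TrainingArguments(
    output_dir="tuned-adapter",
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=4,  # raises the effective batch size without
                                    # raising per-device memory usage
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
)
```

When GPU memory is the bottleneck, lowering the per-device batch size and raising `gradient_accumulation_steps` keeps the effective batch size constant at the cost of tuning speed, which mirrors the LoRA-versus-QLoRA trade-off in the table.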