MoRA: High-Rank Adaptation for PEFT

Efficiently fine-tune LLMs with high-rank adaptation

Paper link: https://arxiv.org/pdf/2405.12130v1

One of the most popular parameter-efficient fine-tuning (PEFT) approaches is Low-Rank Adaptation (LoRA). LoRA freezes the pretrained weights and trains only a pair of small low-rank matrices whose product approximates the weight update. This is often effective, but it has limitations.

In particular, LoRA's reliance on low-rank updates may limit its ability to learn and store new information. From the paper:

One plausible explanation for this limitation observed with LoRA could be its reliance on low-rank updates (Lialin et al., 2023). The low-rank update matrix, ∆W, struggles to estimate the full-rank updates in FFT [full fine-tuning], particularly in memory-intensive tasks like continual pretraining that require memorizing domain-specific knowledge.

To address this problem, the paper proposes MoRA, which uses a single square matrix instead of a pair of low-rank matrices. This allows for higher-rank updates while using the same number of trainable parameters as LoRA. Because the square matrix no longer matches the layer's input and output dimensions, MoRA adds non-parameter operators that act like "compressors" and "decompressors". Think of them as a fixed coding scheme that shrinks the input before the square matrix processes it and expands the output afterward.

Methodology

To overcome the limitations of LoRA, the paper introduces MoRA. The main difference is that MoRA uses a square matrix (M) instead of two low-rank matrices. This allows MoRA to achieve higher ranks with the same number of trainable parameters, giving it more capacity to learn and store new information.
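
To make the parameter budget concrete, here is a minimal Python sketch. It assumes a weight matrix of shape d × k, the standard LoRA count of (d + k)·r trainable parameters, and a square matrix whose side r_hat is the largest integer with r_hat² ≤ (d + k)·r. The function names and example sizes are illustrative; the authors' code may round r_hat differently.

```python
import math

def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in LoRA's factors B (d x r) and A (r x k)."""
    return (d + k) * r

def square_side(d: int, k: int, r: int) -> int:
    """Largest r_hat with r_hat**2 <= LoRA's budget (assumed rounding rule)."""
    return math.isqrt(lora_param_count(d, k, r))

d = k = 4096  # illustrative hidden size
for r in (8, 128):
    budget = lora_param_count(d, k, r)
    r_hat = square_side(d, k, r)
    print(f"LoRA rank {r}: budget {budget} parameters, square side {r_hat}")
```

With d = k = 4096 and r = 8, the budget is 65,536 parameters and the square matrix is 256 × 256, so the square matrix itself can have rank up to 256, versus at most 8 for the LoRA update.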

However, a square matrix with the same parameter budget is much smaller than the layer's weight matrix, so it cannot take the layer's input or produce the layer's output directly. To address this, MoRA introduces non-parameter operators, f_comp and f_decomp, which act like compression and decompression functions.

These operators reduce the input dimension to fit the square matrix and expand its output back to the layer's output dimension. The paper explores several ways to implement these operators, including:

  • Truncation: Simply truncating the input and padding the output with zeros. This leads to significant information loss.
  • Sharing Rows and Columns: This method groups rows and columns of the square matrix together, sharing information and reducing redundancy. It's efficient for larger ranks, but less effective for smaller ranks.
  • Reshaping: This involves reshaping the input into a matrix before applying the square matrix, preserving more information but adding some computational overhead.
  • Rotation: Inspired by RoPE, this method builds on reshaping by rotating each chunk of the input differently, so the square matrix can tell the chunks apart. This further enhances the expressiveness of the square matrix.
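
The first and third options above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the input is a single vector, and the way the reshaping variant pads or trims to the output length d is an assumption.

```python
import numpy as np

def truncation_adapter(x, M, d):
    """Truncation: keep the first r_hat inputs, zero-pad the output to length d."""
    r_hat = M.shape[0]
    out = np.zeros(d, dtype=float)
    out[:r_hat] = M @ x[:r_hat]
    return out

def reshape_adapter(x, M, d):
    """Reshaping: split x into chunks of length r_hat and apply M to each chunk."""
    r_hat = M.shape[0]
    n = -(-x.shape[0] // r_hat)          # number of chunks (ceiling division)
    padded = np.zeros(n * r_hat, dtype=float)
    padded[: x.shape[0]] = x             # zero-pad the last chunk if needed
    flat = (padded.reshape(n, r_hat) @ M.T).reshape(-1)
    out = np.zeros(d, dtype=float)
    m = min(d, flat.size)                # assumed: trim or zero-pad to length d
    out[:m] = flat[:m]
    return out

x = np.arange(8, dtype=float)
M = np.eye(4)                            # identity, so the data flow is easy to see
print(truncation_adapter(x, M, d=8))
print(reshape_adapter(x, M, d=8))
```

With an identity M, the truncation variant passes through only the first four inputs, while the reshaping variant passes through all eight. That difference is the information loss the truncation bullet refers to.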

The key is that these operators are non-parameterized, meaning they don't require any learning. This ensures that MoRA can be merged back into the LLM after training, just like LoRA, without introducing additional computational cost during inference.
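
One way to see why merging works: if the compression and decompression steps are fixed linear maps, the whole adapter is a linear function of the input, so it corresponds to an ordinary weight update ∆W that can be added to the frozen weight. The sketch below assumes a toy chunk-wise adapter (hypothetical names, with the input length a multiple of the square matrix's side) and recovers ∆W by feeding in basis vectors.

```python
import numpy as np

def adapter(x, M):
    """Toy parameter-free wrapper: apply M to each length-r_hat chunk of x."""
    r_hat = M.shape[0]
    return (x.reshape(-1, r_hat) @ M.T).reshape(-1)

def materialize_delta_w(M, k):
    """Column j of delta_W is the adapter's response to basis vector e_j."""
    return np.stack([adapter(e, M) for e in np.eye(k)], axis=1)

rng = np.random.default_rng(0)
d = k = 8
M = rng.normal(size=(4, 4))      # stand-in for a trained square matrix
W0 = rng.normal(size=(d, k))     # stand-in for the frozen pretrained weight

delta_w = materialize_delta_w(M, k)
W_merged = W0 + delta_w          # a single dense weight, as after merging LoRA

x = rng.normal(size=k)
# The merged weight reproduces "frozen weight + adapter" exactly.
assert np.allclose(W_merged @ x, W0 @ x + adapter(x, M))
```

After merging, inference uses only W_merged, so the adapter adds no extra work at inference time.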

Experimental Setup

The authors evaluated MoRA on several tasks to understand its impact. They used different datasets and compared its performance against LoRA and other baseline methods.

Memorizing UUID Pairs

This task was designed to test how well the methods could learn and store new knowledge. The task involves associating pairs of UUIDs (randomly generated unique identifiers). Because the pairs are random, the model cannot rely on its pretrained knowledge and has to memorize them during fine-tuning. The paper found that MoRA significantly outperformed LoRA in this task.
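
A dataset of this kind is easy to generate with Python's standard library. The sketch below is only an illustration of the idea: the function name, the seeding scheme, and the number of pairs are assumptions, not the paper's setup.

```python
import random
import uuid

def make_uuid_pairs(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Return n reproducible (key, value) pairs of random version-4 UUID strings."""
    rng = random.Random(seed)

    def new_uuid() -> str:
        # Build a UUID from 128 seeded random bits so runs are repeatable.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    return [(new_uuid(), new_uuid()) for _ in range(n)]

pairs = make_uuid_pairs(3)
for key, value in pairs:
    print(key, "->", value)
```

During fine-tuning, the model would be shown each key and trained to produce the matching value.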

From the paper:

Our method shows significant improvements over LoRA with the same number of trainable parameters, benefiting from high-rank updating. We also report character-level accuracy at various training steps in Table 2. MoRA requires fewer training steps to memorize these UUID pairs compared to LoRA.
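
Character-level accuracy can be computed in several ways. The sketch below shows one plausible definition, position-wise matches divided by the longer string's length. This definition is an assumption for illustration and may differ from the one used in the paper.

```python
def char_accuracy(pred: str, target: str) -> float:
    """Fraction of positions where pred and target agree (assumed definition)."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0  # two empty strings match trivially
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / longest

print(char_accuracy("abcd", "abcd"))
print(char_accuracy("abcx", "abcd"))
```

Dividing by the longer length penalizes predictions that are too short or too long, not only those with wrong characters.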

Finetuning Tasks

The paper evaluated MoRA on three different finetuning tasks:

Instruction Tuning

This task aims to adapt the LLM to better understand and respond to specific instructions. The authors used Tülu v2, a dataset combining several high-quality instruction datasets.

Mathematical Reasoning

This task assesses the LLM's ability to solve mathematical problems. The authors trained on MetaMath, a dataset of 395k samples designed to enhance mathematical reasoning capabilities, and evaluated on GSM8K and MATH.

Continual Pretraining

This task continues pretraining the LLM on domain-specific text so that it performs well in domains like biomedicine or finance. The authors used PubMed abstracts and financial news data to train the models.

Pretraining

The paper also tested MoRA on a pretraining task, training transformer models from scratch on the C4 dataset. They compared MoRA with LoRA and ReLoRA, a method that merges low-rank matrices into the model during training to increase the rank of the updates. The paper also introduced ReMoRA, which combines MoRA with the merge-and-reinit strategy from ReLoRA.
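
The reason merge-and-reinit raises the rank can be shown with a toy NumPy example: each low-rank update has rank at most r, but the sum of several independent updates can have a higher rank. Random factors stand in for trained ones here, and the sizes are arbitrary, so this illustrates the linear algebra only, not the actual training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 2
accumulated = np.zeros((d, k))   # total update merged into the weight so far
ranks = []

for restart in range(3):
    # Reinit: fresh factors at the start of each cycle (random stand-ins
    # for factors that would normally be trained between merges).
    B = rng.normal(size=(d, r))
    A = rng.normal(size=(r, k))
    # Merge: fold the rank-r update into the accumulated weight change.
    accumulated += B @ A
    ranks.append(int(np.linalg.matrix_rank(accumulated)))

print(ranks)
```

Each cycle can add up to r to the rank of the accumulated update, which is how a sequence of low-rank steps can approximate a higher-rank change.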

Benchmarking and Main Results

Across the different tasks, the results showed that:

  • MoRA achieved comparable performance to LoRA on instruction tuning and mathematical reasoning tasks. This suggests that MoRA can effectively leverage existing knowledge.
  • MoRA outperformed LoRA on continual pretraining and the memory tasks. This shows that MoRA is better suited for tasks that require the model to acquire and memorize new knowledge.
  • MoRA's performance on pretraining was better than both LoRA and ReLoRA. This indicates that high-rank updating is beneficial even when training the model from scratch.
  • ReMoRA achieved further improvements over MoRA, demonstrating the effectiveness of merging the square matrix during training.

The paper also examined the singular values of the learned weight updates. The authors found that MoRA and ReMoRA had many more significant singular values than LoRA and ReLoRA, indicating that they achieved higher ranks. This suggests a correlation between the number of significant singular values and the overall performance of the model.
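
This kind of analysis can be reproduced in miniature with NumPy. The sketch below counts singular values above a threshold for a low-rank update and for a square-matrix update with the same parameter budget. The matrices are random stand-ins rather than trained weights, the block-diagonal form assumes a chunk-wise adapter without rotation, and the threshold is a parameter of this sketch.

```python
import numpy as np

def count_significant(delta_w, threshold=0.1):
    """Number of singular values of delta_w above the threshold."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    return int((s > threshold).sum())

rng = np.random.default_rng(0)
d = k = 64
r = 8
r_hat = 32                       # 32 * 32 == (d + k) * r == 1024 parameters

# LoRA-style update: product of two thin factors, rank at most r.
lora_delta = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))

# Square-matrix update: the same M applied to each chunk of the input,
# which is a block-diagonal delta_W.
M = rng.normal(size=(r_hat, r_hat))
mora_delta = np.kron(np.eye(d // r_hat), M)

lora_count = count_significant(lora_delta)
mora_count = count_significant(mora_delta)
print("LoRA-style:", lora_count, "square-matrix:", mora_count)
```

The low-rank update can never have more than r significant singular values, while the square-matrix update is not capped in the same way.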

Business Implications

MoRA has several important implications for businesses that rely on LLMs:

Improved Adaptability

MoRA's ability to learn new information effectively means that LLMs can be adapted to new tasks and domains more quickly and efficiently. This is particularly valuable for businesses that work in rapidly evolving industries or need to customize LLMs for specific applications.

Enhanced Memory

MoRA's increased capacity for memorization allows LLMs to retain and recall information more accurately. This is crucial for knowledge-intensive applications like question answering, summarization, or personalized recommendations.

Cost Savings

MoRA's parameter efficiency means that it can achieve comparable or even better performance than LoRA using the same number of trainable parameters. The gains therefore come without a larger fine-tuning budget, and because the adapter can be merged into the base weights, inference costs no more than it does for the original model. This makes LLMs more accessible for businesses with limited resources.

Flexibility

The ability to merge the square matrix back into the LLM after training provides flexibility. Businesses can fine-tune the model with MoRA and then easily deploy it without sacrificing inference speed.

Conclusion

The paper demonstrates the importance of high-rank updating for LLMs, particularly for tasks requiring the acquisition of new knowledge. MoRA, with its square matrix and non-parameterized operators, offers a promising alternative to LoRA, achieving comparable or better performance with the same number of trainable parameters.

What makes MoRA work?

MoRA's success lies in its ability to achieve higher ranks with the same number of trainable parameters as LoRA. This allows MoRA to store and process more complex information. Think of it like expanding the LLM's memory capacity while keeping its overall size the same. The paper also highlights the importance of choosing appropriate compression and decompression functions within the MoRA framework. These functions act like specialized coding systems, and a good choice limits how much information is lost when the data is shrunk and expanded before and after it's processed by the square matrix.

The paper is a step forward in addressing the challenges of adapting LLMs to new tasks. As research in this area continues, we can expect to see even more innovative and efficient PEFT methods, further accelerating the progress of LLMs and their impact on various domains.