Neural Network Quantization

This article provides a foundational understanding of quantization in the context of LLMs.

The primary challenge when working with LLMs is getting the models to run on commodity hardware. This matters especially for development and testing, since not everybody has access to high-end GPUs with many gigabytes of VRAM. Fortunately, recent developments have produced an effective technique for reducing model size, known as quantization.

Before discussing quantization, it is important to understand why neural networks take up so much memory in the first place. As shown in Figure 1, a standard neural network consists of layers of interconnected neurons, where each connection has its own weight and each neuron has its own bias and activation function.

Figure 1. An example of a neuron within a neural network. Each connection between neurons has its own weight, while each neuron has its own bias and activation function. Image used courtesy of V7 Labs

These weights and biases are often referred to as the "parameters" of a neural network.

Each neuron also has its own "activation," which is a number that dictates how active that neuron will be. The activation of a neuron is based on both its weight and bias values, as well as the activation function used.
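
As a minimal sketch of that relationship (plain NumPy with made-up values; ReLU is just one common choice of activation function):

```python
import numpy as np

def relu(z):
    # A common activation function: pass positive values through, zero out negatives.
    return np.maximum(z, 0.0)

inputs = np.array([0.5, -1.2, 3.0], dtype=np.float32)   # activations of upstream neurons
weights = np.array([0.8, -0.4, 0.1], dtype=np.float32)  # one weight per incoming connection
bias = np.float32(0.2)

activation = relu(np.dot(weights, inputs) + bias)  # this neuron's activation
print(activation)
```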

The weights and biases are the parameters that get tuned during training, and by extension, so do the neurons' activations.

These values, the weights, biases, and activations, are the majority of what gets physically stored in memory by a neural network. The standard is to represent these numbers as 32-bit floating-point values, which allows for a high level of precision and, ultimately, accuracy for the neural network.

This precision is why neural networks tend to take up so much memory. For a network with millions of parameters and activations, each stored as a 32-bit value, memory usage quickly adds up.

For example, the 50-layer ResNet architecture contains roughly 26 million weights and 16 million activations. By using 32-bit floating-point values to represent both the weights and the activations, the entire architecture would require 168 MB of storage.
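
As a quick sanity check of that figure, the arithmetic is straightforward (using the approximate counts quoted above):

```python
weights = 26_000_000      # approximate weight count for ResNet-50
activations = 16_000_000  # approximate activation count
bytes_per_value = 4       # a 32-bit float occupies 4 bytes

total_mb = (weights + activations) * bytes_per_value / 1_000_000
print(f"{total_mb:.0f} MB")  # -> 168 MB
```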

What is Quantization for Neural Networks?

Quantization is the process of reducing the precision of the weights, biases, and activations such that they consume less memory.

In other words, quantization takes a neural network that typically uses 32-bit floats to represent its parameters and converts it to use a smaller representation, such as 8-bit integers.

Going from 32-bit to 8-bit, for example, would reduce the model size by a factor of 4, so one obvious benefit of quantization is a significant reduction in memory.

An example is shown in the figure below:

Figure 2. Quantization shrinks neural networks by decreasing the precision of weights, biases, and activations. Image used courtesy of Qualcomm
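
To make the idea concrete, the sketch below maps a float32 tensor to int8 using a simple affine (scale and zero-point) scheme. Production frameworks use more careful variants of this, so treat it purely as an illustration:

```python
import numpy as np

def quantize_int8(x):
    # Simple affine quantization: float32 -> int8 plus a scale and zero point.
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate reconstruction of the original float values.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())  # rounding error on the order of scale/2
```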

Another benefit of quantization is that it can lead to lower network latency and better power efficiency.

Network speed is improved because operations can be performed on integer, rather than floating-point, data types. Integer operations are cheaper to execute on most processor cores, including microcontrollers.

Overall, power efficiency is then improved both because of decreased computation and decreased memory access.
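
As a rough illustration of what changes at the arithmetic level, a quantized kernel's inner loop multiplies 8-bit integers and accumulates the products in a 32-bit integer register before rescaling the result. The NumPy snippet below mimics just that multiply-accumulate step:

```python
import numpy as np

# Two int8 vectors standing in for a row of quantized weights and a vector
# of quantized activations.
a = np.random.randint(-128, 128, size=256, dtype=np.int8)
b = np.random.randint(-128, 128, size=256, dtype=np.int8)

# Multiply-accumulate in int32, which is wide enough to hold 256 products of int8 values.
acc = np.dot(a.astype(np.int32), b.astype(np.int32))
print(acc)
```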

Despite these benefits, the tradeoff of quantization is that neural networks can lose accuracy, since information is no longer represented as precisely. However, depending on how much precision is lost, the network architecture, and the training/quantization scheme, quantization often results in only a minimal loss of accuracy, especially when weighed against the improvements in latency, memory usage, and power.

How To Quantize a Machine Learning Model

In practice, there are two main ways to go about quantization:

  1. Post-training quantization
  2. Quantization-aware training

As the name implies, post-training quantization is a technique in which the neural network is entirely trained using floating-point computation and then gets quantized afterward.

To do this, once training is over, the neural network is frozen, meaning its parameters can no longer be updated, and the parameters are then quantized. The quantized model is what ultimately gets deployed to perform inference, without any further changes to the parameters.

Though this method is simple, it can lead to higher accuracy loss because all of the quantization-related errors occur after training is completed, and thus cannot be compensated for.
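
As one concrete example of the workflow just described, PyTorch ships a post-training quantization utility that converts the weights of selected layer types to int8 after training; the tiny model below is purely illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for a network that has already been trained in float32.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()  # freeze the network: no further parameter updates

# Post-training (dynamic) quantization: Linear weights are stored as int8.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)  # inference now uses the int8 weights
```

Dynamic quantization converts weights ahead of time and quantizes activations on the fly; static post-training quantization additionally calibrates activation ranges on sample data.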

Quantization-aware training, as shown in Figure 3, compensates for quantization-related errors by using the quantized version of the network in the forward pass during training.

Figure 3. Flow chart of quantization-aware training. Image used courtesy of Novac et al.

The idea is that the quantization-related errors will accumulate in the total loss of the model during training, and the training optimizer will work to adjust parameters accordingly and reduce error overall.
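
One common way to implement this is "fake quantization" with a straight-through estimator: the forward pass uses the rounded (quantized) weights so their error reaches the loss, while the backward pass treats the rounding as the identity so gradients still flow. A minimal sketch, assuming symmetric int8 quantization:

```python
import torch

def fake_quantize(w, num_bits=8):
    # Simulate symmetric integer quantization of a weight tensor.
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: the forward value is w_q, but the gradient
    # passes through unchanged, so the optimizer can adapt to the quantization error.
    return w + (w_q - w).detach()

weight = torch.randn(64, 128, requires_grad=True)
loss = fake_quantize(weight).sum()
loss.backward()  # gradients reach `weight` despite the rounding in the forward pass
```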

Quantization-aware training has the benefit of a much lower accuracy loss than post-training quantization.

For a much more in-depth, technical discussion of the mathematics behind quantization, it is recommended to read this paper from Gholami et al.

All in all, quantization is necessary for three main reasons:

  1. Quantization significantly reduces model size, which makes it feasible to run ML on memory-constrained devices such as microcontrollers.
  2. Quantization allows ML models to run on hardware with less processing capability.
  3. Quantization reduces power consumption.