AI: Deep Dive into Quantization

AI: Building an LLM but want to keep computational costs down? Deep Dive into Quantization

Potentially the most significant pain point that companies (especially startups and smaller enterprises) building large language models face is keeping computational costs down. Training and maintaining large language models requires substantial computational resources. This leads to high costs in terms of hardware, energy and infrastructure.

There are several strategies that companies building LLMs can implement in an attempt to reduce computational costs. One strategy is Model Optimization, which can be broken down into three techniques: pruning, knowledge distillation and quantization. Today we will deep dive into quantization, but first we will give a brief overview of the other two techniques.

Pruning

Pruning is the process of removing less important neurones and parameters from the model to reduce its size. Smaller models cost less to run because they require less hardware, energy and infrastructure. The aim of pruning is to shrink the model as far as possible before its performance is significantly affected.

Knowledge Distillation

Smaller models, or “students”, have lower computational requirements than larger models, or “teachers”. Knowledge distillation is a method in which the smaller model is trained to achieve performance similar to that of the larger model.

Quantization

Quantization is the technique of lowering the precision of the model’s weights and/or activations. The aim is to improve computational efficiency and reduce memory usage.

In practice, this means converting the high-precision numbers typically used in neural networks to lower-precision formats. For example, 32-bit floating point values could be converted to 16-bit. You might trial this lower-precision format and, if the model still performs to a high level, lower the precision of the weights and/or activations again. For example, you could convert 16-bit floating point values to 8-bit integers.
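To make the idea of "lower precision" concrete, here is a minimal sketch that uses only the Python standard library. It round-trips a few numbers through 32-bit and 16-bit floating point and prints how far each 16-bit value drifts from the original. The weight values are hypothetical and chosen purely for illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round-trip a Python float (64-bit) through 32-bit floating point."""
    return struct.unpack("f", struct.pack("f", x))[0]

def to_float16(x: float) -> float:
    """Round-trip a Python float through 16-bit (half precision) floating point."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Hypothetical weight values, not taken from any real model.
weights = [0.1, -0.7312, 0.00042, 1.5]

for w in weights:
    w32 = to_float32(w)
    w16 = to_float16(w)
    print(f"original={w!r}  fp32={w32!r}  fp16={w16!r}  fp16 error={abs(w - w16):.2e}")
```

Values that happen to be exactly representable (such as 1.5) survive unchanged, while others pick up a small rounding error that grows as the format gets narrower.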

In what different ways can I apply the technique of quantization to the model?

Uniform Quantization

The range of values is divided into equal-sized intervals. This is simple and efficient but may not be optimal for all data distributions.
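As a rough illustration of equal-sized intervals, the sketch below maps a list of floating point values onto 8-bit integer codes and back again. The function names and sample values are our own assumptions for the example, not part of any particular library.

```python
def uniform_quantize(values, num_bits=8):
    """Map floats onto integer codes using equal-sized intervals."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1            # 255 steps for 8 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def uniform_dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]

# Hypothetical values for illustration.
values = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, lo = uniform_quantize(values)
restored = uniform_dequantize(codes, scale, lo)

for v, c, r in zip(values, codes, restored):
    print(f"value={v:+.4f}  code={c:3d}  restored={r:+.4f}  error={abs(v - r):.5f}")
```

Because every interval has the same width, the worst-case rounding error is half of one step (`scale / 2`) regardless of where the value sits in the range.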

Non-Uniform Quantization

Intervals vary in size, often based on the data distribution, providing better precision for more frequently occurring values.
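One simple way to get intervals that follow the data is to place the quantization levels at quantiles, so that each level represents roughly the same number of values. The sketch below is one possible approach under that assumption (there are others, such as logarithmic or clustering-based schemes), and the helper names are hypothetical.

```python
import random

def quantile_codebook(values, num_levels):
    """Pick one representative value per equal-population bucket."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[int((i + 0.5) * n / num_levels)] for i in range(num_levels)]

def nearest_level(v, codebook):
    """Return the index of the codebook entry closest to v."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))

# Synthetic bell-shaped "weights": most values cluster around zero.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

codebook = quantile_codebook(weights, num_levels=16)   # 16 levels = 4 bits
gaps = [b - a for a, b in zip(codebook, codebook[1:])]

print("levels:", [round(c, 2) for c in codebook])
print("gap near the centre:", round(gaps[7], 3))
print("gap at the lower tail:", round(gaps[0], 3))
print("code for 0.05:", nearest_level(0.05, codebook))
```

The levels end up packed tightly around zero, where most of the values live, and spread out in the tails.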

Post-Training Quantization

Quantization is applied after the model has been trained by reducing the precision of weights and/or activations.

Dynamic Quantization

Weights are quantized ahead of time, while activations are kept at higher precision and quantized on the fly during inference, using ranges computed from each input as it arrives.
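The following is a simplified sketch of that split for a single dot product, written in plain Python rather than with a real inference framework. The weights are quantized once up front, and the activation scale is recomputed for every input. All names and numbers here are illustrative assumptions.

```python
def symmetric_quantize(values, num_bits=8):
    """Quantize to signed integers using a scale based on the largest magnitude."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    return [round(v / scale) for v in values], scale

# Done once, ahead of time: quantize the (hypothetical) weights.
weights = [0.42, -1.3, 0.07, 0.9]
q_weights, w_scale = symmetric_quantize(weights)

def dynamic_linear(activations):
    """Quantize activations on the fly, multiply in integers, then rescale."""
    q_acts, a_scale = symmetric_quantize(activations)   # range from this input only
    int_acc = sum(qw * qa for qw, qa in zip(q_weights, q_acts))
    return int_acc * w_scale * a_scale

activations = [0.5, -2.0, 3.1, 0.01]
exact = sum(w * a for w, a in zip(weights, activations))
approx = dynamic_linear(activations)
print(f"exact={exact:.4f}  quantized={approx:.4f}  error={abs(exact - approx):.4f}")
```

No calibration data is needed, at the cost of a little extra work at inference time to measure each input's range.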

Static Quantization

Both weights and activations are quantized. A calibration step using a representative dataset is performed to determine the optimal quantization ranges.
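A minimal sketch of the calibration step might look like the following. It records the minimum and maximum activation seen across a few representative batches, fixes a scale and zero point from that range, and then reuses them for every later input. The batches and function names are hypothetical, and real toolkits typically offer more robust range estimators than a plain min and max.

```python
def calibrate(batches):
    """Observe representative data and return the activation range."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    # Make sure zero is inside the range so it can be represented exactly.
    return min(lo, 0.0), max(hi, 0.0)

def make_quantizer(lo, hi, num_bits=8):
    """Build quantize/dequantize functions with a fixed scale and zero point."""
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)

    def quantize(values):
        # Values outside the calibrated range are clamped to 0..qmax.
        return [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]

    def dequantize(codes):
        return [(c - zero_point) * scale for c in codes]

    return quantize, dequantize, scale

# Hypothetical calibration batches of activations.
calibration_batches = [[0.0, 0.8, 2.1], [0.3, 1.7, 3.9], [0.1, 4.0, 2.2]]
lo, hi = calibrate(calibration_batches)
quantize, dequantize, scale = make_quantizer(lo, hi)

codes = quantize([1.0, 5.0])          # 5.0 lies outside the calibrated range
restored = dequantize(codes)
print("range:", (lo, hi), "codes:", codes, "restored:", restored)
```

The out-of-range input is clamped to the top of the calibrated range, which is why the calibration data needs to be representative of what the model will see in production.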

Quantization-Aware Training (QAT)

In this method, the model is trained with quantization in mind, simulating low-precision arithmetic during the forward and backward passes to adapt the model to the quantized domain.
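To give a feel for how this works, here is a toy sketch with a single weight. The forward pass uses a "fake quantized" copy of the weight snapped to a coarse grid, while the update is applied to a full-precision copy, treating the rounding step as if it had a gradient of one (commonly called a straight-through estimator). The grid size, learning rate and data are assumptions chosen for illustration, and real frameworks apply this per layer with far more machinery.

```python
def fake_quantize(w, scale=0.25):
    """Snap w to the nearest multiple of `scale` to simulate low precision."""
    return round(w / scale) * scale

# Toy data generated by y = 1.6 * x. Note that 1.6 is not on the 0.25 grid.
data = [(x, 1.6 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]

w = 0.0      # full-precision "latent" weight that the optimiser updates
lr = 0.05

for step in range(200):
    wq = fake_quantize(w)                      # forward pass sees the quantized weight
    grad = sum(2 * (wq * x - y) * x for x, y in data) / len(data)
    w -= lr * grad                             # straight-through: d(wq)/dw treated as 1

print(f"latent weight={w:.4f}  quantized weight={fake_quantize(w)}")
```

The latent weight settles near the boundary between the two grid points closest to 1.6, so the quantized weight the model would ship with is one of the nearest representable values.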

What are the Challenges of Quantization?

Accuracy Loss

When high-precision numbers are converted into lower-precision formats, the reduced precision introduces rounding errors, and these errors can lead to a drop in model accuracy. If accuracy drops too far, the cost savings that motivated quantization are no longer worth it. One way to mitigate this challenge is to use quantization-aware training, which lets the model adapt to the reduced precision while it is being trained.

Range and Dynamic Range Handling

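To avoid accuracy loss, it is crucial to choose the quantization range (the minimum and maximum values to be represented) carefully. Poor range selection will likely lead to significant quantization errors: a range that is too wide wastes precision on values that rarely occur, while a range that is too narrow clips values that fall outside it.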
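The effect of range selection is easy to see with a single outlier. In the sketch below, most values sit between -1 and 1, with one stray value at 50. We compare the error on the bulk of the values when the range is taken from the raw min and max versus when it is clipped to [-1, 1]. The data and the clipping bounds are assumptions made for the example.

```python
def quantize_dequantize(values, lo, hi, num_bits=8):
    """Clip to [lo, hi], quantize uniformly, then map back to floats."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        clipped = min(hi, max(lo, v))
        out.append(round((clipped - lo) / scale) * scale + lo)
    return out

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Bulk of the values in [-1, 1], plus one hypothetical outlier.
values = [i / 100 for i in range(-100, 101)] + [50.0]
bulk = values[:-1]

full_range = quantize_dequantize(values, min(values), max(values))
clipped_range = quantize_dequantize(values, -1.0, 1.0)

err_full = mean_abs_error(bulk, full_range[:-1])
err_clip = mean_abs_error(bulk, clipped_range[:-1])

print(f"error on bulk, min/max range: {err_full:.5f}")
print(f"error on bulk, clipped range: {err_clip:.5f}")
print(f"outlier after clipping:       {clipped_range[-1]:.2f}")
```

Clipping gives the bulk of the values much finer resolution, but the outlier itself is flattened to the edge of the range, so the right trade-off depends on how much that outlier matters to the model.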

Hardware Support

To run quantized models efficiently, you will need hardware that supports low-precision arithmetic operations. Many modern processors and accelerators (like GPUs and TPUs) offer support, but compatibility can vary.

How can Quantization specifically benefit startups and smaller businesses?

Limited Budgets

Small businesses often have limited budgets and may not have access to expensive hardware. Quantization reduces the computational requirements of models, which can result in significant savings on infrastructure.

Limited Hardware

Small businesses that run applications on local servers or edge devices are in greater need of energy efficiency. The lower-precision computations used in quantization typically consume less power, lowering energy costs.

Some small businesses rely on mobile applications or IoT devices. Lower-precision numbers require less storage, making models more compact and better suited to deployment on these devices.
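The storage saving is simple arithmetic: each weight takes 4 bytes at 32-bit precision, 2 bytes at 16-bit and 1 byte at 8-bit. The short sketch below works this through for a hypothetical model with 7 billion parameters. The parameter count is an assumption for illustration, and the figures cover the weights only, not activations or other runtime overhead.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def model_size_gb(num_params, fmt):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

num_params = 7_000_000_000   # hypothetical 7 billion parameter model

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {model_size_gb(num_params, fmt):.1f} GB")
```

Under these assumptions, moving from 32-bit floating point to 8-bit integers cuts weight storage to a quarter of its original size.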

Competitive Edge against bigger businesses

Small businesses often have a huge focus on scalability, as this is what helps the business to develop and grow. With lower computational and memory requirements, quantized models can be deployed on a broader range of devices, from smartphones to edge devices. This allows small businesses to expand their technological reach (increasing the opportunity for scalability) without the need for significant investment.

Small businesses wanting to compete with larger enterprises will often need to offer high-quality products and services while limiting the need for extensive resources. Quantization is an advanced technique, and small businesses that adopt it are able to leverage cutting-edge technology without a large hardware budget.

Conclusion

This deep dive into quantization shows that there are numerous ways for businesses building LLMs to keep costs down and increase efficiency. We believe that quantization may be one of the best ways for small businesses and startups to reduce computational costs while maintaining quality, giving them a competitive edge against larger businesses.

If you found this useful and would like to see more deep dives into either another technique for model optimisation or another way LLMs can save on costs, please tag us on LinkedIn letting us know what you’d like to see: AdaptTalent

Alternatively, if you are a business building LLMs and are looking for the right talent to support your business growth, Kieran Toner is our specialist in the Generative AI and Machine Learning space and is always happy for a confidential chat.