Trading Accuracy with Size: Quantization Technology


MACHINE LEARNING · LARGE LANGUAGE MODELS

6/7/2023 · 2 min read

As the field of artificial intelligence continues to evolve, large language models (LLMs) have become increasingly prevalent. These models, which can have billions or even trillions of parameters, are capable of impressive feats, from generating human-like text to answering complex questions. However, their size presents a significant challenge: they require substantial computational resources, making them inaccessible to many users with lower-end hardware.

One promising solution is quantization, a technique that can dramatically reduce the size of these models with little loss in performance. This article explores what quantization is, how it is implemented, and how it can shrink the largest, most powerful models enough to run on lower-end hardware.

Quantization is a process that reduces the numerical precision of the weights in a neural network. In machine learning, weights are typically stored as 32-bit floating-point numbers, but that level of precision is often not necessary for the model to function effectively. By reducing the precision of these weights, we can shrink the model's memory footprint, making it usable on devices with limited computational resources. In what follows I focus on quantizing from 32-bit down to 8-bit, though other bit widths are used as well.
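To make this concrete, here is a minimal sketch of absmax quantization in NumPy: 32-bit floats are mapped to 8-bit integers with a single scaling constant, then dequantized to recover an approximation. The helper names `quantize_absmax` and `dequantize` are my own, not from any particular library.

```python
import numpy as np

def quantize_absmax(weights: np.ndarray):
    # Map the largest-magnitude weight to 127 and round everything to int8.
    scale = 127.0 / np.max(np.abs(weights))
    q = np.round(weights * scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original 32-bit weights.
    return q.astype(np.float32) / scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_absmax(w)
print(np.abs(w - dequantize(q, scale)).max())  # small rounding error
```

Each weight now occupies one byte instead of four, at the cost of a small rounding error.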

In the approach discussed here, quantization is implemented through a two-part procedure. The first part is vector-wise quantization, where a separate normalization constant is used for each inner product in the matrix multiplication; this allows most of the features to be quantized effectively.
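A rough sketch of the idea, assuming row-wise constants for the left matrix and column-wise constants for the right one (the function name is illustrative):

```python
import numpy as np

def vectorwise_int8_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # One normalization constant per row of A and per column of B.
    a_scale = 127.0 / np.max(np.abs(A), axis=1, keepdims=True)   # shape (m, 1)
    b_scale = 127.0 / np.max(np.abs(B), axis=0, keepdims=True)   # shape (1, n)
    A8 = np.round(A * a_scale).astype(np.int8)
    B8 = np.round(B * b_scale).astype(np.int8)
    # Accumulate in int32, then undo the scaling with the outer
    # product of the per-row and per-column constants.
    C32 = A8.astype(np.int32) @ B8.astype(np.int32)
    return C32 / (a_scale * b_scale)

A = np.random.randn(8, 16).astype(np.float32)
B = np.random.randn(16, 4).astype(np.float32)
print(np.abs(vectorwise_int8_matmul(A, B) - A @ B).max())  # close to the fp32 result
```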

However, there are often outlier features that require more precision. These outliers are handled through a mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication, while still allowing more than 99.9% of values to be multiplied in 8-bit. This two-part procedure is referred to as LLM.int8().
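Below is a simplified sketch of that decomposition. It assumes a magnitude threshold (6.0 here, following the LLM.int8() paper) for picking out outlier feature dimensions; real kernels operate on fp16 tensors on the GPU, so treat this only as an illustration of the control flow.

```python
import numpy as np

def mixed_precision_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    # Feature dimensions (columns of X) containing any value above the
    # threshold are treated as outliers and multiplied in higher precision.
    outlier_cols = np.any(np.abs(X) > threshold, axis=0)
    X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]
    X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]

    # Regular dimensions: vector-wise int8 quantization, as in the previous sketch.
    x_scale = 127.0 / np.max(np.abs(X_reg), axis=1, keepdims=True)
    w_scale = 127.0 / np.max(np.abs(W_reg), axis=0, keepdims=True)
    X8 = np.round(X_reg * x_scale).astype(np.int8)
    W8 = np.round(W_reg * w_scale).astype(np.int8)
    C_reg = (X8.astype(np.int32) @ W8.astype(np.int32)) / (x_scale * w_scale)

    # Outlier dimensions: keep the matrix multiplication in 16-bit floating point.
    C_out = X_out.astype(np.float16) @ W_out.astype(np.float16)
    return C_reg + C_out.astype(np.float32)

X = np.random.randn(8, 64).astype(np.float32)
X[:, 3] *= 20.0                      # inject a systematic outlier dimension
W = np.random.randn(64, 16).astype(np.float32)
print(np.abs(mixed_precision_matmul(X, W) - X @ W).max())
```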

Quantization is a powerful tool for reducing the size of large language models. A multi-billion-parameter 16/32-bit model checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around the highly systematic emergent features in transformer language models that dominate attention and predictive performance.
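In practice you rarely write these kernels yourself: the bitsandbytes library implements LLM.int8() and is integrated into Hugging Face transformers. A usage sketch follows; it assumes both libraries and a CUDA GPU are available, the OPT checkpoint is just an example, and the exact loading API has shifted somewhat between transformers versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-6.7b"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # convert weights to Int8 at load time
    device_map="auto",
)

inputs = tokenizer("Quantization lets large models run on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```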

As models scale, outlier features with large magnitudes emerge and strongly affect every layer and its quantization. These outlier features are highly systematic, appearing in at most 7 unique feature dimensions. By using LLM.int8(), we can perform inference in LLMs with up to 175B parameters without any performance degradation.

While quantization is a powerful tool for reducing model size, it's important to note that as models continue to grow in capability, we should not expect their sizes to shrink proportionally. There will likely be a minimum size per capability, and quantization will probably remain the only viable method for dramatically reducing model size without compromising the model itself.

Quantization is a promising technique for making large language models more accessible. By reducing the precision of their weights, we can substantially decrease their memory footprint with little to no loss in performance. This makes it possible to run very large models, such as those with up to 175B parameters, on a single server with consumer GPUs (see the back-of-the-envelope arithmetic below). As the field of AI continues to evolve, techniques like quantization will be crucial for ensuring that the benefits of large language models reach as many users as possible.
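As a rough sense of scale (weights only, ignoring activations and the KV cache), here is why 8-bit weights make such deployments plausible:

```python
params = 175e9  # a hypothetical 175B-parameter model
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# fp32: 700 GB, fp16: 350 GB, int8: 175 GB
```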