Quantization is a model compression technique that reduces the precision of a neural network's numerical representations. It typically converts 32-bit or 16-bit floating-point weights and activations to lower-precision formats such as 8-bit or 4-bit integers, yielding smaller models and faster inference with minimal accuracy loss.
Neural networks store their learned knowledge as millions or billions of numerical parameters (weights), typically represented as 32-bit floating-point numbers (FP32). Each parameter uses 4 bytes of memory. For a 70-billion-parameter LLM, this means roughly 280 GB of memory just for the weights, requiring multiple expensive GPUs. Quantization addresses this by representing these numbers with fewer bits.
The key insight behind quantization is that neural networks are surprisingly tolerant of reduced numerical precision. While training requires high precision to accumulate tiny gradient updates, inference (making predictions) can often work well with much lower precision. Converting from FP32 to INT8 (8-bit integers) reduces memory by 4x, and going to INT4 (4-bit) achieves an 8x reduction. This means our 280 GB model could fit in 35 GB with INT4 quantization.
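The arithmetic is simple enough to verify directly. Here is a short Python sketch that computes weight storage for the 70B-parameter example above at each precision:

```python
# Back-of-the-envelope weight memory at different precisions.
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 70e9  # the 70B-parameter example
for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(num_params, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```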
There are two main approaches to quantization. Post-training quantization (PTQ) converts a pre-trained model to lower precision without additional training, making it quick and easy to apply. Quantization-aware training (QAT) incorporates quantization into the training process itself, typically achieving better accuracy at low bit-widths because the model learns to be robust to the reduced precision.
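To make the PTQ path concrete, here is a minimal sketch using PyTorch's built-in dynamic quantization, one common PTQ flavor in which weights are stored as INT8 and activations are quantized on the fly at inference time. The toy model is purely illustrative:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# A stand-in for a trained FP32 model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

quantized = quantize_dynamic(
    model,              # the trained FP32 model
    {nn.Linear},        # which module types to quantize
    dtype=torch.qint8,  # target 8-bit integer weights
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model, smaller weights
```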
Modern quantization techniques for LLMs have become remarkably sophisticated. Methods like GPTQ and AWQ (Activation-aware Weight Quantization), along with file formats like GGUF (used by llama.cpp), enable 4-bit quantization of large language models with minimal perplexity increase. These techniques make it possible to run models that previously required data center GPUs on consumer hardware or even mobile devices, democratizing access to powerful AI.
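In practice, most developers consume these techniques through pre-quantized checkpoints rather than quantizing models themselves. A hedged sketch with Hugging Face transformers (the repository name is hypothetical, and GPTQ loading requires the optional GPTQ backend packages to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/llm-70b-gptq-4bit"  # hypothetical pre-quantized repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the quantization config stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```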
The quantization workflow typically begins with calibration: a small representative dataset is passed through the full-precision model to collect statistics about the range and distribution of weights and activations in each layer. These statistics determine the optimal mapping from high-precision to low-precision values.
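A toy version of this calibration pass, assuming we track only the simplest statistic, the observed min/max range (production toolchains use richer statistics such as histograms or percentiles):

```python
import numpy as np

def calibrate_range(batches):
    """Record the min/max activation values seen across calibration batches."""
    lo, hi = np.inf, -np.inf
    for batch in batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

# stand-in for activations captured from one layer on representative inputs
batches = [np.random.randn(32, 512) * 3.0 for _ in range(10)]
act_min, act_max = calibrate_range(batches)
print(f"observed activation range: [{act_min:.2f}, {act_max:.2f}]")
```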
Next, using the calibration data, a quantization scheme maps the continuous range of floating-point values to a discrete set of lower-precision values. This involves determining scale factors and zero points for each layer or group of weights that minimize the quantization error.
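The standard affine (asymmetric) scheme does exactly this: a scale factor and a zero point map the calibrated float range onto the integer grid. A minimal NumPy sketch for INT8:

```python
import numpy as np

def quantize_int8(x, x_min, x_max):
    """Map floats in [x_min, x_max] onto the 256 INT8 levels [-128, 127]."""
    scale = (x_max - x_min) / 255.0
    zero_point = -128 - round(x_min / scale)  # integer value that represents x_min at -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats: x ~ (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(x, float(x.min()), float(x.max()))
err = np.abs(x - dequantize(q, scale, zp)).max()
print(f"max quantization error: {err:.5f} (at most scale/2 = {scale / 2:.5f})")
```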
Not all layers tolerate quantization equally. Sensitive layers (like the first and last layers, or attention mechanisms) may be kept at higher precision while less sensitive layers are aggressively quantized. This mixed-precision approach balances compression with accuracy preservation.
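A sketch of what such a mixed-precision plan might look like in configuration form; the layer-name patterns and bit-widths below are illustrative, not taken from any particular toolchain:

```python
# Sensitive layers keep more bits; the bulk of the network is compressed hard.
precision_plan = {
    "embed_tokens": 16,  # first layer: keep high precision
    "self_attn.*":   8,  # attention is often more quantization-sensitive
    "mlp.*":         4,  # feed-forward blocks usually tolerate 4-bit
    "lm_head":      16,  # last layer: keep high precision
}
```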
Finally, the quantized model is evaluated on benchmark tasks and compared against the full-precision version. If the accuracy degradation is within acceptable thresholds, the quantized model is deployed using inference engines optimized for low-precision computation, such as TensorRT, llama.cpp, or vLLM.
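The deployment decision can be as simple as an acceptance gate on the benchmark delta. A small sketch with placeholder scores:

```python
def within_tolerance(fp_score: float, quant_score: float, max_drop: float = 0.01) -> bool:
    """Accept the quantized model if relative degradation stays under max_drop."""
    return (fp_score - quant_score) / fp_score <= max_drop

# placeholder benchmark accuracies
print(within_tolerance(fp_score=0.742, quant_score=0.738))  # ~0.5% drop -> True
print(within_tolerance(fp_score=0.742, quant_score=0.720))  # ~3% drop   -> False
```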
A developer wants to run a 70-billion-parameter LLM locally on a 24 GB consumer GPU. Using GPTQ 4-bit quantization, the model's memory footprint drops from 140 GB (FP16) to approximately 35 GB. With further optimizations like offloading some layers to CPU RAM, the model runs on a single RTX 4090 with only a 2-3% drop in benchmark performance.
A travel app company quantizes their translation model from FP32 to INT8 for on-device inference. The model size drops from 800 MB to 200 MB, inference speed doubles on mobile processors, and battery consumption for translation tasks decreases by 40%. Translation quality, measured by BLEU scores, drops by less than 1 point.
An AI API provider serves millions of requests daily using large language models. By quantizing their serving models from FP16 to INT8 using AWQ, they double the number of requests each GPU can handle per second. This reduces their GPU fleet requirements by nearly half, saving hundreds of thousands of dollars per month in infrastructure costs.
Quantization is one of the most impactful techniques for making AI accessible and economically viable. It enables powerful models to run on cheaper hardware, reduces energy consumption and carbon footprint, lowers API serving costs, and brings AI capabilities to edge devices and personal computers. Without quantization, many of the recent advances in open-source LLMs would remain inaccessible to most developers and organizations.
Quantization trades precision for efficiency, but how much quality do you actually lose in production? Respan helps you compare quantized and full-precision model outputs side by side in real workloads, tracking quality metrics to ensure quantization savings do not come at an unacceptable accuracy cost.
Try Respan free