Model Quantization
A technique that reduces model size and inference cost by representing weights and activations with lower-precision numbers, such as converting 32-bit floats to 8-bit or 4-bit integers.
Quantization shrinks neural network models by using fewer bits to represent each number. A model stored in 32-bit floating point might be converted to 8-bit integers (INT8) or even 4-bit (INT4), reducing memory by 4-8x and speeding up inference on hardware that supports lower-precision arithmetic. The accuracy loss is typically small, often under 1% for well-executed quantization.
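The core round-trip can be shown in a few lines. This is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python (function names are illustrative, not a library API): a set of float weights is mapped to integers in [-127, 127] via a single scale factor, then mapped back, and the round-trip error is the quantization loss.

```python
def quantize_int8(weights):
    """Map a list of floats to INT8 values plus one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.82, -1.31, 0.04, 2.57, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# With round-to-nearest, each element's error is at most scale / 2,
# i.e. half of one quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Real deployments refine this scheme with per-channel or per-group scales, which is where the "well-executed" part of keeping accuracy loss small comes in.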
There are two main approaches: post-training quantization (PTQ), which converts an already-trained model, and quantization-aware training (QAT), which simulates quantization during training so the model learns to be robust to reduced precision. PTQ is simpler but less accurate; QAT produces better results but requires retraining.
For teams deploying self-hosted models, quantization is often the highest-impact optimization. Running a 70B parameter model in 4-bit quantization (GPTQ, AWQ, or GGUF format) requires roughly 35GB of VRAM for the weights instead of the ~140GB needed at 16-bit precision, making it feasible on a single high-end GPU. The practical impact is that quantization can cut your GPU costs by 50-75% while maintaining production-quality outputs. Libraries like llama.cpp, vLLM, and TGI support quantized inference out of the box.
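The VRAM figures above are simple arithmetic: parameter count times bits per weight. A back-of-envelope helper (weights only; the KV cache and activations add overhead on top):

```python
def weight_vram_gb(params_billion, bits):
    """Estimate weight memory in decimal GB for a given bit width."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

# 70B model:
# weight_vram_gb(70, 16) -> 140.0  (FP16/BF16)
# weight_vram_gb(70, 8)  -> 70.0   (INT8)
# weight_vram_gb(70, 4)  -> 35.0   (INT4)
```

This is why 4-bit is the usual target for single-GPU serving: it is the coarsest precision that current methods can reach with small accuracy loss, and it halves memory again relative to INT8.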
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.