Model Quantization
A technique that reduces model size and inference cost by representing weights and activations with lower-precision numbers, such as converting 32-bit floats to 8-bit or 4-bit integers.
Quantization shrinks neural network models by using fewer bits to represent each number. A model stored in 32-bit floating point might be converted to 8-bit integers (INT8) or even 4-bit (INT4), reducing memory by 4-8x and speeding up inference on hardware that supports lower-precision arithmetic. The accuracy loss is typically small, often under 1% for well-executed quantization.
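The core round-trip can be shown in a few lines. This is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python (function names are illustrative, not a library API): a set of float weights is mapped to integers in [-127, 127] via a single scale factor, then mapped back, and the round-trip error is the quantization loss.

```python
def quantize_int8(weights):
    """Map a list of floats to INT8 values plus one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest magnitude maps to 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.82, -1.31, 0.04, 2.57, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# With round-to-nearest, each element's error is at most scale / 2,
# i.e. half of one quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Real deployments refine this scheme with per-channel or per-group scales, which is where the "well-executed" part of keeping accuracy loss small comes in.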
There are two main approaches: post-training quantization (PTQ), which converts an already-trained model, and quantization-aware training (QAT), which simulates quantization during training so the model learns to be robust to reduced precision. PTQ is simpler but less accurate; QAT produces better results but requires retraining.
For teams deploying self-hosted models, quantization is often the highest-impact optimization. Running a 70B parameter model in 4-bit quantization (GPTQ, AWQ, or GGUF format) requires roughly 35GB of VRAM for the weights instead of the ~140GB needed at 16-bit precision, making it feasible on a single high-end GPU. The practical impact is that quantization can cut your GPU costs by 50-75% while maintaining production-quality outputs. Libraries like llama.cpp, vLLM, and TGI support quantized inference out of the box.
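The VRAM figures above are simple arithmetic: parameter count times bits per weight. A back-of-envelope helper (weights only; the KV cache and activations add overhead on top):

```python
def weight_vram_gb(params_billion, bits):
    """Estimate weight memory in decimal GB for a given bit width."""
    bytes_total = params_billion * 1e9 * bits / 8
    return bytes_total / 1e9

# 70B model:
# weight_vram_gb(70, 16) -> 140.0  (FP16/BF16)
# weight_vram_gb(70, 8)  -> 70.0   (INT8)
# weight_vram_gb(70, 4)  -> 35.0   (INT4)
```

This is why 4-bit is the usual target for single-GPU serving: it is the coarsest precision that current methods can reach with small accuracy loss, and it halves memory again relative to INT8.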
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.