Quantization

Updated Jun 13, 2026

Overview

Running LLMs at full precision is expensive in memory because their parameters are stored as high-precision numbers such as float32 or float16. Most consumer machines do not have enough memory for this, so we use a technique called quantization to reduce memory usage.

Quantization compresses model weights into smaller numeric formats. Because each parameter no longer needs a full high-precision value, large models become practical to run on local hardware without needing a data center.

Main idea: We trade a small amount of precision for a large reduction in memory usage.

How Quantization Works

Quantization takes a fully trained model and compresses its parameters into smaller numeric formats, usually integers.

Converts float32 or float16 values into int8 or int4
Integers store each value at a coarser resolution, so some precision is lost
Each parameter takes fewer bits, and therefore less memory
Values are typically mapped to integers using a scale factor and rounding

This process reduces memory usage dramatically while keeping most of the model’s performance intact. In most cases, the difference in output quality is minimal for everyday use.
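The conversion described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the simplest scheme (symmetric quantization with one scale factor for the whole list of weights); real tools generally use more elaborate per-block methods. The function names `quantize_int8` and `dequantize` and the sample weights are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric quantization of a list of floats to int8 (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # One scale factor maps the largest magnitude onto the int8 limit of 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 0.9]  # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one integer step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, and the restored values differ from the originals by at most half a step of size `scale`, which is the precision being traded away.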

Info: With quantization, the model becomes lighter without “forgetting” how to perform well.

As an example, a large model that originally requires 100GB of memory in float32 might drop to roughly 25GB with int8 or roughly 12GB with int4, depending on the method used. This is the key reason why large models can now run on consumer laptops instead of only on data center hardware.
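The arithmetic behind these figures is just parameters times bits per parameter. The sketch below assumes the 100GB model is stored as float32, which implies about 25 billion parameters, and it ignores the small overhead that real quantized files add for scale factors and metadata. The name `weight_memory_gb` is illustrative.

```python
BITS_PER_FORMAT = {"float32": 32, "float16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params, fmt):
    """Approximate weight storage in GB (10^9 bytes), ignoring format overhead."""
    return num_params * BITS_PER_FORMAT[fmt] / 8 / 1e9


num_params = 25_000_000_000  # hypothetical model: 100GB at float32

sizes = {fmt: weight_memory_gb(num_params, fmt) for fmt in BITS_PER_FORMAT}
for fmt, gb in sizes.items():
    print(f"{fmt}: {gb:.1f} GB")
```

Halving the bits per parameter halves the memory needed for weights, which is why 8-bit and 4-bit variants are the ones most often used on local hardware.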

You can typically find quantized versions of popular models on model hubs such as Hugging Face, and they are often labeled with their quantization level (e.g., “Q4” for 4-bit quantization).

Memory Still Includes Context

Even after quantization, memory is still needed for more than just weights.

Model weights are the biggest memory cost
Context window also uses memory
Processed input and output tokens are cached in memory during inference
Memory usage grows with the length of the prompt and the response

However, compared to model weights, the context window usually uses much less memory unless you are working with very large inputs.
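To get a feel for how context memory scales, here is a rough estimate. It assumes a standard transformer that caches one key vector and one value vector per token in every layer, stored as 16-bit values. The model shape (32 layers, 8 key/value heads, head size 128) is hypothetical and chosen only for illustration.

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate context cache size in GB for an assumed transformer layout."""
    # Factor 2: one key vector and one value vector per token, per layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1e9


# Hypothetical model shape, not taken from any specific model.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

short_context = kv_cache_gb(2_000, **shape)
long_context = kv_cache_gb(128_000, **shape)
print(f"2k tokens:   {short_context:.2f} GB")
print(f"128k tokens: {long_context:.2f} GB")
```

Under these assumptions a short prompt costs a fraction of a gigabyte, while a very long context can reach a size comparable to the quantized weights themselves.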

GGUF and Quantized Files

Quantized models are often stored in a special format called GGUF that is designed for local use.

Stores compressed weights and metadata
Optimized for local inference tools
Common in tools like Ollama and llama.cpp

File names often include patterns like “Q4” or “Q8”, which indicate the quantization level used.
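As a small illustration, the quantization level can be read out of a file name with a regular expression. The file names below are hypothetical, and naming conventions vary between publishers, so treat this as a sketch rather than a reliable parser.

```python
import re

# Matches "Q" or "q" followed by digits, delimited by ".", "_", "-" or the string edges.
QUANT_PATTERN = re.compile(r"(?:^|[._-])[Qq](\d+)(?=[._-]|$)")


def quant_bits_from_filename(filename):
    """Return the bit width suggested by a Q-style tag, or None if absent."""
    match = QUANT_PATTERN.search(filename)
    return int(match.group(1)) if match else None


# Hypothetical file names for illustration only.
examples = [
    "example-7b.Q4_K_M.gguf",
    "example-7b-q8_0.gguf",
    "example-7b-f16.gguf",
]
results = {name: quant_bits_from_filename(name) for name in examples}
for name, bits in results.items():
    print(name, "->", bits)
```

A file without a Q-style tag returns `None`, which usually means it is stored at a higher precision or follows a different naming scheme.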
