Quantization
An optimization technique that represents model weights and activations in lower bit precision (e.g., INT8) to accelerate inference and reduce memory footprint.
Quantization converts neural network parameters and computations from 32-bit floating point (FP32) to lower bit-width representations such as INT8 or FP16. This shrinks model size by up to 4x and typically improves inference speed by 2-4x depending on hardware, which is essential for real-time image processing on edge devices.
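A minimal sketch of the underlying mapping, assuming simple affine (scale and zero-point) quantization implemented with NumPy; the array names and ranges here are illustrative, not any specific library's API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    # Affine quantization: map the observed float range [min, max] onto [-128, 127].
    x_min, x_max = float(x.min()), float(x.max())
    scale = max(x_max - x_min, 1e-8) / 255.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Recover an approximation of the original float values.
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(1000).astype(np.float32)  # stand-in for FP32 layer weights
q, scale, zp = quantize_int8(weights)
max_error = np.abs(weights - dequantize_int8(q, scale, zp)).max()
```

Each FP32 value is stored as a single INT8 integer plus a shared scale and zero-point per tensor (or per channel), which is where the memory and bandwidth savings come from.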
Converting FP32 weights to INT8 reduces per-parameter memory from 4 bytes to 1 byte. ResNet-50 (about 100MB in FP32) shrinks to approximately 25MB after INT8 quantization, fitting within smartphone memory constraints while retaining practical accuracy.
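A quick back-of-envelope check of those numbers (ResNet-50 has roughly 25.6 million parameters; the exact on-disk size varies with the serialization format):

```python
num_params = 25_600_000          # approximate ResNet-50 parameter count
fp32_mb = num_params * 4 / 1e6   # 4 bytes per FP32 weight -> ~102 MB
int8_mb = num_params * 1 / 1e6   # 1 byte per INT8 weight  -> ~26 MB
print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB ({fp32_mb / int8_mb:.0f}x smaller)")
```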
- Post-Training Quantization (PTQ): Determines scale and zero-point using a small calibration dataset (a few hundred images); see the calibration sketch after this list. No retraining is needed, but accuracy loss can be noticeable
- Quantization-Aware Training (QAT): Simulates quantization during training so parameters adapt for quantized execution. Higher accuracy than PTQ at additional training cost
- Mixed precision: Keeps sensitive layers (typically the first and last) in FP16 while quantizing the rest to INT8, providing fine-grained control over the accuracy-speed trade-off
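A minimal sketch of the PTQ calibration step referenced above, assuming simple min/max range estimation over batches of activations (real toolchains also offer percentile- or entropy-based calibration); the generator and shapes are placeholders:

```python
import numpy as np

def calibrate_minmax(calibration_batches):
    # Track the observed activation range across the calibration images.
    x_min, x_max = np.inf, -np.inf
    for batch in calibration_batches:
        x_min = min(x_min, float(batch.min()))
        x_max = max(x_max, float(batch.max()))
    # Derive an INT8 scale and zero-point from the observed range.
    scale = max(x_max - x_min, 1e-8) / 255.0
    zero_point = int(round(-128 - x_min / scale))
    return scale, zero_point

# A few hundred representative images are usually enough for calibration.
calibration_batches = (np.random.rand(8, 224, 224, 3).astype(np.float32) for _ in range(32))
scale, zero_point = calibrate_minmax(calibration_batches)
```

QAT differs in that these quantize/dequantize operations are simulated inside the training graph so the weights learn to compensate for the rounding error.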
TensorFlow Lite offers automatic quantization via tf.lite.Optimize.DEFAULT, and ONNX Runtime supports INT8 execution natively. Quantized models can also run in WebAssembly runtimes for browser-based processing. Accuracy loss typically stays within 1-2%, which is acceptable for most production use cases.
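A sketch of full-integer PTQ using the standard TensorFlow Lite converter options; the saved-model path and the representative-data generator are placeholders you would replace with your own model and preprocessed calibration images:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few hundred preprocessed sample inputs for calibration.
    for _ in range(200):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]  # placeholder data

converter = tf.lite.TFLiteConverter.from_saved_model("resnet50_savedmodel")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full INT8 quantization of weights and activations.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("resnet50_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Omitting the representative dataset and the INT8 ops restriction falls back to weight-only (dynamic-range) quantization, which still shrinks the file but leaves activations in floating point.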