Inference
The process of feeding new data into a trained model to obtain predictions. Unlike training, inference does not update model parameters.
Inference is the process of passing unseen data through a trained neural network to obtain predictions such as class labels, bounding boxes, or segmentation masks. Because it executes only the forward pass with frozen weights, inference is computationally lighter per sample than training.
Inference performance is usually judged by the trade-off between latency and accuracy. Real-time object detection requires under 33 ms per frame (30 FPS). YOLOv8 achieves about 1.5 ms per image on a GPU, while MobileNetV3 runs in approximately 5 ms on a CPU.
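As an illustration of how per-image latency is typically measured, the sketch below times the forward pass of a small torchvision classifier on CPU. The choice of MobileNetV3-Small, the 224x224 input size, and the 100-iteration loop are assumptions for the example, not figures from this entry.

```python
import time
import torch
from torchvision import models

model = models.mobilenet_v3_small(weights="DEFAULT")
model.eval()  # inference mode: dropout/batch-norm behave deterministically

dummy = torch.randn(1, 3, 224, 224)  # a single 224x224 RGB image

with torch.no_grad():  # forward pass only: no gradients, no parameter updates
    model(dummy)  # warm-up run so one-time setup cost is not measured
    start = time.perf_counter()
    for _ in range(100):
        model(dummy)
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000

print(f"Average CPU latency: {elapsed_ms:.2f} ms per image")
```

On a GPU the same loop would also need `torch.cuda.synchronize()` before each timestamp, since CUDA kernels launch asynchronously.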
- Batch inference: Processing multiple inputs simultaneously exploits GPU parallelism, maximizing throughput. Server deployments typically use batch sizes of 8 to 64 (see the first sketch after this list)
- Edge inference: Running models on smartphones or IoT devices using engines like TensorFlow Lite, ONNX Runtime, and Core ML that optimize execution for constrained hardware (second sketch below)
- Inference optimization: Quantization (FP32 to INT8), pruning (removing redundant weights), and knowledge distillation (compressing large models) improve speed while preserving accuracy (third sketch below)
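A minimal batch-inference sketch, assuming a torchvision ResNet-50 and a batch of 32 random 224x224 images (both arbitrary choices): the inputs are stacked into one tensor so a single forward pass yields one prediction per image.

```python
import torch
from torchvision import models

model = models.resnet50(weights="DEFAULT").eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Stack 32 images into a single (N, C, H, W) tensor; one forward pass
# scores the whole batch, exploiting GPU parallelism.
batch = torch.randn(32, 3, 224, 224, device=device)

with torch.no_grad():
    logits = model(batch)               # shape: (32, 1000)
    predictions = logits.argmax(dim=1)  # one class index per image

print(predictions.shape)  # torch.Size([32])
```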
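For edge or otherwise portable deployment, one common path is exporting to ONNX and running the graph with ONNX Runtime. The sketch below exports a small torchvision model and runs it on the CPU execution provider; the model choice and file name are illustrative assumptions.

```python
import numpy as np
import onnxruntime as ort
import torch
from torchvision import models

model = models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 224, 224)

# Export the trained model to a portable ONNX graph.
torch.onnx.export(
    model, dummy, "mobilenet_v3_small.onnx",
    input_names=["input"], output_names=["logits"],
)

# Run the exported graph with ONNX Runtime on CPU.
session = ort.InferenceSession(
    "mobilenet_v3_small.onnx", providers=["CPUExecutionProvider"]
)
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
logits = session.run(None, {"input": image})[0]
print(logits.shape)  # (1, 1000)
```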
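Quantization can be illustrated with PyTorch's post-training dynamic quantization, which stores the weights of selected layer types as INT8. The toy classifier below is a stand-in model, not one mentioned in this entry.

```python
import torch
import torch.nn as nn

fp32_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Linear-layer weights are stored as INT8; activations are quantized
# dynamically at runtime, so no calibration dataset is needed.
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    print(int8_model(x).shape)  # torch.Size([1, 10])
```

Convolutional backbones more often use static, calibration-based quantization, but the dynamic flow keeps this example self-contained.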
Browser-based inference via WebAssembly enables client-side image processing without any server round trip, benefiting both privacy and latency. Because inference, rather than training, typically dominates a production system's ongoing compute costs, model optimization is critical for deployment at scale.