Attention Mechanism
A neural network component that dynamically computes relevance scores across input elements, enabling the model to focus on the most informative parts of the data.
Originally proposed for neural machine translation in 2014, attention became the cornerstone of the Transformer architecture in 2017.
In computer vision, self-attention models long-range dependencies between distant spatial locations, overcoming the limited receptive field of convolutions. Vision Transformer (ViT) showed that pure self-attention over image patches can match or exceed CNN performance.
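The patch-based tokenization that ViT applies before self-attention can be sketched as follows. The shapes assume ViT-Base's 224x224 input and 16x16 patches (yielding 196 tokens of dimension 768); the divisibility requirement and the absence of the learned linear projection are simplifications:

```python
import numpy as np

def image_to_patches(img, patch=16):
    # img: (H, W, C); H and W are assumed divisible by `patch`.
    H, W, C = img.shape
    # Split height and width into a grid of patch-sized blocks.
    blocks = img.reshape(H // patch, patch, W // patch, patch, C)
    # Reorder to (grid_h, grid_w, patch_h, patch_w, C), then flatten
    # each block into one token vector.
    tokens = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens  # (num_patches, patch*patch*C)

img = np.zeros((224, 224, 3))
tokens = image_to_patches(img)  # (196, 768): the ViT-Base token layout
```

In the full model, each flattened patch is then linearly projected to the model dimension and combined with a positional embedding before entering the Transformer encoder.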
- Scaled Dot-Product Attention: Computes weights from query Q, key K, and value V using softmax(QK^T / sqrt(d_k)) V
- Multi-Head Attention: Runs multiple attention operations in parallel across subspaces, capturing diverse relational patterns. Vision models typically use 12 to 16 heads
- Cross-Attention: Learns correspondences between modalities such as text and image. In Stable Diffusion, cross-attention aligns text embeddings with spatial features
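The scaled dot-product formula above can be sketched in a few lines of NumPy. This is a minimal single-head version without masking, batching, or the learned Q/K/V projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v) weighted average of values
```

Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishing gradients.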
Attention is integral to object detection (DETR), segmentation, and image generation. The quadratic cost of full self-attention in sequence length has spurred efficient variants, including linear attention, FlashAttention, and sparse attention patterns.
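As an illustration of the linear-attention idea, the softmax can be replaced by a positive feature map phi, after which the matrix products reassociate so the cost is linear rather than quadratic in sequence length. The choice phi(x) = elu(x) + 1 below is one published recipe, used here as an assumption; variants differ:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    def phi(x):
        # elu(x) + 1, a strictly positive feature map.
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (d_k, d_v): summarizes keys/values once,
                                     # independent of query count
    Z = Qf @ Kf.sum(axis=0) + eps    # (n_q,) per-query normalizer
    return (Qf @ KV) / Z[:, None]    # (n_q, d_v)
```

Because Kf.T @ V is computed once and reused for every query, the overall cost is O(n * d_k * d_v) instead of the O(n^2) of materializing the full attention matrix.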