July 15
Attention is a core component of the transformer architecture used in large language models (LLMs). But as LLMs grow larger and handle longer input sequences, the computational cost of attention becomes a bottleneck.

To address this challenge, researchers from Colfax Research, Meta, Nvidia, Georgia Tech, Princeton University, and Together AI have introduced FlashAttention-3, a new technique that significantly speeds up attention computation on Nvidia Hopper GPUs (H100 and H800). FlashAttention-3 builds upon previous work on FlashAttention and FlashAttention-2 and further optimizes…
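To see why attention becomes a bottleneck at long sequence lengths, here is a minimal sketch of standard (non-Flash) scaled dot-product attention in PyTorch. The function name and tensor shapes are illustrative, not taken from the FlashAttention codebase; the point is that the intermediate score matrix grows quadratically with sequence length, which is exactly the cost FlashAttention-style kernels are designed to avoid materializing in full.

```python
# Illustrative sketch only: naive attention, not the FlashAttention-3 kernel.
import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    # The score matrix is (seq_len x seq_len), so doubling the sequence
    # length quadruples its size -- the memory/compute bottleneck.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = k = v = torch.randn(1, 8, 4096, 64)  # hypothetical shapes for illustration
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```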