Matrix multiplications (MatMul) are the most computationally expensive operations in large language models (LLMs) built on the Transformer architecture. As LLMs scale to larger sizes, the cost of MatMul grows significantly, increasing memory usage and latency during training and inference.
Now, researchers at the University of California, Santa Cruz, Soochow University and University of California, Davis have developed a novel architecture that completely eliminates matrix multiplications from language models while maintaining strong performance at large scales.
In their paper, the researchers introduce MatMul-free language models that achieve performance on par with state-of-the-art Transformers while requiring far less memory during inference.
MatMul
Matrix multiplication is a fundamental operation in deep learning, where it is used to combine data and weights in neural networks. MatMul is crucial for tasks like transforming input data through layers of a neural network to make predictions during training and inference.
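To make this concrete, a dense layer's forward pass boils down to a single matrix multiplication between activations and a weight matrix. The minimal NumPy sketch below, with purely illustrative sizes, shows the operation that Transformers repeat across every layer and attention head.

```python
import numpy as np

# Toy sizes for illustration; real LLM layers are orders of magnitude larger.
batch, d_in, d_out = 4, 512, 512

X = np.random.randn(batch, d_in)   # input activations
W = np.random.randn(d_in, d_out)   # dense floating-point weight matrix
b = np.zeros(d_out)                # bias

# The matrix multiplication (MatMul) that dominates Transformer compute:
Y = X @ W + b
print(Y.shape)  # (4, 512)
```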
GPUs are designed to perform many MatMul operations simultaneously, thanks to their highly parallel architecture. This parallelism allows GPUs to handle the large-scale computations required in deep learning much faster than traditional CPUs, making them essential for training and running complex neural network models efficiently.
However, with LLMs scaling to hundreds of billions of parameters, MatMul operations have become a bottleneck, requiring very large GPU clusters during both training and inference phases. Replacing MatMul with a simpler operation can result in huge savings in memory and computation. But previous efforts to replace MatMul operations have produced mixed results, reducing memory consumption but slowing down operations because they do not perform well on GPUs.
Replacing MatMul with ternary operations
In the new paper, the researchers suggest replacing the traditional 16-bit floating point weights used in Transformers with ternary weights that can take one of only three values: -1, 0 and +1. They also replace MatMul with additive operations that provide equally good results at much lower computational cost. The models are composed of “BitLinear layers” that use these ternary weights.
“By constraining the weights to the set {−1, 0, +1} and applying additional quantization techniques, MatMul operations are replaced with addition and negation operations,” the researchers write.
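As a rough illustration of the idea (not the authors' implementation), the sketch below quantizes a weight matrix to {-1, 0, +1} and then computes a layer output using only additions, subtractions and a single rescaling multiply per output. The quantization rule shown is an assumption made for illustration.

```python
import numpy as np

def ternary_quantize(W):
    """Illustrative quantization: scale by the mean absolute weight, then
    round each weight to -1, 0 or +1 (the paper's exact scheme may differ)."""
    scale = np.mean(np.abs(W)) + 1e-8
    return np.clip(np.round(W / scale), -1, 1).astype(np.int8), scale

def ternary_matvec(x, W_t, scale):
    """Each output is a signed sum of inputs: add x[i] where the weight is +1,
    subtract it where the weight is -1, skip it where the weight is 0."""
    out = np.empty(W_t.shape[1])
    for j in range(W_t.shape[1]):
        col = W_t[:, j]
        out[j] = x[col == 1].sum() - x[col == -1].sum()
    return out * scale  # one rescaling multiply per output, none per weight

x = np.random.randn(512)
W = np.random.randn(512, 256)
W_t, s = ternary_quantize(W)
y = ternary_matvec(x, W_t, s)
print(y.shape)  # (256,)
```

In practice, the savings come from specialized kernels and hardware rather than Python loops, but the snippet shows why no multiply-accumulate over individual weights is needed.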
They also make more profound changes to the language model architecture. Transformer blocks consist of two main components: a token mixer and a channel mixer. The token mixer is responsible for integrating information across different tokens in a sequence. In traditional Transformer models, this is typically achieved using self-attention mechanisms, which use MatMul operations to compute relationships between all pairs of tokens to capture dependencies and contextual information.
However, in the MatMul-free architecture described in the paper, the token mixer is implemented using a MatMul-free Linear Gated Recurrent Unit (MLGRU). The GRU is a deep learning architecture for sequence modeling that was popular before the advent of Transformers. The MLGRU processes the sequence of tokens by updating hidden states through simple ternary operations, without the need for expensive matrix multiplications.
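The paper specifies the MLGRU's exact gate equations; the snippet below is only a hedged, GRU-flavored sketch of the pattern described here, in which ternary projections feed element-wise gates that update a hidden state token by token, with no attention-style MatMul between tokens. The gate names, activation functions and the dense stand-in for the ternary projection are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ternary_linear(x, W_t):
    """Stand-in for a BitLinear projection with weights in {-1, 0, +1};
    written as a dense product here purely for brevity -- a real
    implementation would use only additions and negations."""
    return x @ W_t

def mlgru_step(x_t, h_prev, Wf_t, Wc_t, Wg_t):
    """One recurrent step: gates and candidate come from ternary projections,
    and the hidden state is updated with element-wise products only."""
    f_t = sigmoid(ternary_linear(x_t, Wf_t))  # forget gate
    c_t = np.tanh(ternary_linear(x_t, Wc_t))  # candidate state
    h_t = f_t * h_prev + (1.0 - f_t) * c_t    # element-wise blend of old and new
    g_t = sigmoid(ternary_linear(x_t, Wg_t))  # output gate
    return g_t * h_t, h_t                     # (output, new hidden state)

d = 64
h = np.zeros(d)
Wf, Wc, Wg = (np.sign(np.random.randn(d, d)) for _ in range(3))  # toy ternary weights
for x_t in np.random.randn(10, d):            # a sequence of 10 token vectors
    o_t, h = mlgru_step(x_t, h, Wf, Wc, Wg)
```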
The channel mixer is responsible for integrating information across different feature channels within a single token’s representation. The researchers implemented their channel mixer using a Gated Linear Unit (GLU), which is also used in Llama-2 and Mistral. However, they modified the GLU to work with ternary weights instead of relying on MatMul operations. This enabled them to reduce computational complexity and memory usage while maintaining the effectiveness of feature integration.
“By combining the MLGRU token mixer and the GLU channel mixer with ternary weights, our proposed architecture relies solely on addition and element-wise products,” the researchers write.
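For illustration, a GLU-style channel mixer with ternary projections can be sketched along the lines below, following the common gate/up/down wiring of Llama-style models; the activation choice and exact wiring are assumptions rather than the paper's verbatim formulation.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def ternary_linear(x, W_t):
    """Stand-in for a ternary BitLinear projection (dense here for brevity)."""
    return x @ W_t

def glu_channel_mixer(x, Wg_t, Wu_t, Wd_t):
    gate = silu(ternary_linear(x, Wg_t))    # gating branch
    up = ternary_linear(x, Wu_t)            # value branch
    return ternary_linear(gate * up, Wd_t)  # element-wise product, then project down

d, d_hidden = 64, 256
Wg, Wu = (np.sign(np.random.randn(d, d_hidden)) for _ in range(2))  # toy ternary weights
Wd = np.sign(np.random.randn(d_hidden, d))
y = glu_channel_mixer(np.random.randn(d), Wg, Wu, Wd)
print(y.shape)  # (64,)
```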
Evaluating MatMul-free language models
The researchers compared two variants of their MatMul-free LM against the advanced Transformer++ architecture, used in Llama-2, on multiple model sizes.
Interestingly, their scaling projections show that the MatMul-free LM makes more efficient use of additional compute to improve performance than the Transformer++ architecture does.
The researchers also evaluated the quality of the models on several language tasks. The 2.7B MatMul-free LM outperformed its Transformer++ counterpart on two advanced benchmarks, ARC-Challenge and OpenbookQA, while maintaining comparable performance on the other tasks.
“These results highlight that MatMul-free architectures are capable of achieving strong zero-shot performance on a diverse set of language tasks, ranging from question answering and commonsense reasoning to physical understanding,” the researchers write.
Expectedly, MatMul-free LM has lower memory usage and latency compared to Transformer++, and its memory and latency advantages become more pronounced as the model size increases. For the 13B model, the MatMul-free LM used only 4.19 GB of GPU memory at a latency of 695.48 ms, whereas Transformer++ required 48.50 GB of memory at a latency of 3183.10 ms.
Optimized implementations
The researchers created an optimized GPU implementation and a custom FPGA configuration for MatMul-free language models. With the GPU implementation of the ternary dense layers, they were able to accelerate training by 25.6% and reduce memory consumption by up to 61.0% over an unoptimized baseline implementation.
“This work goes beyond software-only implementations of lightweight models and shows how scalable, yet lightweight, language models can both reduce computational demands and energy use in the real world,” the researchers write.
The researchers believe their work can pave the way for the development of more efficient and hardware-friendly deep learning architectures.
Due to computational constraints, they were not able to test the MatMul-free architecture on very large models with more than 100 billion parameters. However, they hope their work will serve as a call to action for institutions and organizations that have the resources to build the largest language models to invest in accelerating lightweight models.
Ideally, this architecture will make language models much less dependent on high-end GPUs like those from Nvidia, and will enable researchers to run powerful models on other, less expensive and less supply-constrained types of processors. The researchers have released the code for the algorithm and models for the research community to build on.
“By prioritizing the development and deployment of MatMul-free architectures such as this one, the future of LLMs will only become more accessible, efficient, and sustainable,” the researchers write.