
How Microsoft’s Next-Generation BitNet Architecture Will Enhance LLM Efficiency

MONews



1-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very limited number of bits, 1-bit LLMs significantly reduce the memory and computational resources required to run them.

Microsoft Research has been pushing the boundaries of 1-bit LLMs with its BitNet architecture. In a new paper, the researchers introduce BitNet a4.8, a technique that further improves the efficiency of 1-bit LLMs without sacrificing performance.

The emergence of 1-bit LLMs

Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This demands substantial memory and compute, limiting accessibility and deployment options. 1-bit LLMs address the problem by drastically reducing the precision of model weights while matching the performance of full-precision models.

Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. While this approach significantly reduces memory and I/O costs, the computational cost of matrix multiplication remains a bottleneck, and optimizing neural networks with extremely low-bit parameters is difficult.
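To make the weight format concrete, here is a minimal sketch of absmean-style ternary quantization in PyTorch. It is illustrative only; the function name and details are assumptions for this sketch, not taken from the BitNet code.

```python
import torch

def ternarize_weights(w: torch.Tensor, eps: float = 1e-5):
    """Map full-precision weights to {-1, 0, +1} (about 1.58 bits per weight).

    Scales by the mean absolute value, then rounds and clips to the ternary
    set, in the spirit of the absmean quantization described for BitNet b1.58.
    """
    scale = w.abs().mean() + eps                      # per-tensor scaling factor
    w_ternary = torch.clamp(torch.round(w / scale), -1, 1)
    return w_ternary, scale                           # keep the scale to rescale outputs

# A toy weight matrix collapses to at most three distinct values.
w = torch.randn(4, 4)
w_q, s = ternarize_weights(w)
print(w_q.unique())   # values drawn from {-1., 0., 1.}
```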

Two techniques can help address this problem. Sparsification reduces the number of computations by pruning activations with small magnitudes. This works particularly well in LLMs because activation values tend to have a long-tailed distribution, with a few very large values and many small ones.
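As a rough illustration, magnitude-based top-k sparsification can be sketched in a few lines of PyTorch (the keep ratio and function name here are placeholders, not values from the paper):

```python
import torch

def sparsify_topk(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the largest-magnitude activations and zero out the rest.

    Because LLM activations are long-tailed, most of the signal survives
    even when many small entries are dropped.
    """
    k = max(1, int(x.numel() * keep_ratio))
    threshold = x.abs().flatten().topk(k).values.min()   # k-th largest magnitude
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

x = torch.randn(2, 8)
print(sparsify_topk(x))   # roughly half of the entries become zero
```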

Quantization, on the other hand, represents activations with fewer bits, reducing the computational and memory cost of processing them. However, simply lowering the precision of activations can introduce significant quantization error and degrade performance.
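A simple, again purely illustrative, way to picture activation quantization is symmetric absmax rounding to a low-bit integer grid:

```python
import torch

def quantize_activations(x: torch.Tensor, bits: int = 4):
    """Symmetric absmax quantization of activations to a signed integer grid.

    Fewer bits make the downstream matmuls cheaper, but the grid gets
    coarser and the rounding error grows, which is the trade-off described above.
    """
    qmax = 2 ** (bits - 1) - 1                         # e.g. 7 for 4-bit
    scale = x.abs().max().clamp(min=1e-5) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return x_q, scale                                  # x_q * scale approximates x

x = torch.randn(2, 8)
x_q, s = quantize_activations(x, bits=4)
print((x - x_q * s).abs().max())   # the quantization error
```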

Moreover, combining sparsification and quantization is not straightforward, and doing so poses particular challenges when training 1-bit LLMs.

“Both quantization and sparsification introduce non-differentiable operations, making gradient calculations during training particularly difficult,” Furu Wei, partner research manager at Microsoft Research, told VentureBeat.

Gradient computation is essential for calculating errors and updating parameters when training a neural network. The researchers also had to ensure that their technique could be implemented efficiently on existing hardware while retaining the benefits of sparsification and quantization.
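One standard workaround for non-differentiable rounding (not necessarily the exact recipe used in the paper) is the straight-through estimator, which quantizes in the forward pass but lets gradients flow as if the operation were the identity:

```python
import torch

def quantize_ste(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize in the forward pass, but pass gradients straight through.

    round() has zero gradient almost everywhere, so the trick is to write the
    output as x + (q(x) - x).detach(): the forward value is q(x), while the
    backward pass sees the identity function and gradients flow to x.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-5) / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return x + (x_q - x).detach()

x = torch.randn(4, requires_grad=True)
y = quantize_ste(x).sum()
y.backward()
print(x.grad)   # all ones: gradients pass through the rounding step
```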

BitNet a4.8

BitNet a4.8 solves the 1-bit LLM optimization problem through what the researchers describe as “hybrid quantization and sparsification.” They designed an architecture that selectively applies quantization or sparsification to different components of the model, based on the specific distribution patterns of their activations. The architecture uses 4-bit activations for the inputs to attention and feedforward network (FFN) layers, and applies sparsification with 8-bit values to intermediate states, keeping only the top 55% of the parameters. The architecture is also optimized to run efficiently on existing hardware.
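A minimal sketch of that hybrid routing, using illustrative helper names and the 4-bit, 8-bit, and 55% figures from the description above, might look like this:

```python
import torch

def quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric absmax quantization (as in the earlier sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-5) / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def attention_or_ffn_input(x: torch.Tensor) -> torch.Tensor:
    """Inputs to attention and FFN layers: 4-bit activations."""
    return quantize(x, bits=4)

def intermediate_state(x: torch.Tensor, keep_ratio: float = 0.55) -> torch.Tensor:
    """Intermediate states: keep the top 55% by magnitude, at 8-bit precision."""
    k = max(1, int(x.numel() * keep_ratio))
    threshold = x.abs().flatten().topk(k).values.min()
    x_sparse = torch.where(x.abs() >= threshold, x, torch.zeros_like(x))
    return quantize(x_sparse, bits=8)

x = torch.randn(2, 16)
print(attention_or_ffn_input(x).unique().numel())    # only a handful of 4-bit levels
print((intermediate_state(x) == 0).float().mean())   # roughly 45% of entries are zero
```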

“With BitNet b1.58, the inference bottleneck in 1-bit LLM shifts from memory/IO to computation, which is constrained by the activation bits (e.g. 8 bits in BitNet b1.58),” Wei said. “BitNet a4.8 pushes the activation bits to 4 bits, allowing us to leverage 4-bit kernels (e.g. INT4/FP4) to double the speed of LLM inference on GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both memory/IO and computational constraints of LLM inference.”

BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states of the attention mechanism. The KV cache is an important component of transformer models, storing the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially when processing long sequences.
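To illustrate the idea (the per-vector scaling and the int8 container below are assumptions for this sketch, not the paper's kernel), 3-bit KV quantization could look like this:

```python
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 3):
    """Quantize key/value states to a signed 3-bit grid with per-vector scales.

    The KV cache grows with sequence length, so shrinking each cached entry
    directly cuts the memory cost of long contexts.
    """
    qmax = 2 ** (bits - 1) - 1                              # 3 for 3-bit
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    kv_q = torch.clamp(torch.round(kv / scale), -qmax, qmax).to(torch.int8)
    return kv_q, scale            # 3-bit values held in an int8 container for clarity

# A toy cache of shape (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
k_q, k_scale = quantize_kv(k)
print(k_q.unique())               # integer values in [-3, 3]
```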

The promise of BitNet a4.8

Experimental results show that BitNet a4.8 provides similar performance to its predecessor, BitNet b1.58, while using less compute and memory.

Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and delivers a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup through 4-bit activation kernels. But the design could deliver much more.

“The expected computational improvements are based on existing hardware (GPUs),” Wei said. “Using hardware specifically optimized for 1-bit LLM can significantly improve computational performance. BitNet introduces a new computational paradigm that minimizes the need for matrix multiplication, which is a major focus of current hardware design optimization.”

The efficiency of BitNet a4.8 makes it particularly suitable for deploying LLMs on edge and resource-constrained devices, which can have significant privacy and security implications. With LLMs running on-device, users can benefit from these models without sending their data to the cloud.

Wei and his team are continuing their work on 1-bit LLMs.

“We continue to advance our research and vision for the 1-bit LLM era,” Wei said. “While our current focus is on model architecture and software support (e.g. bitnet.cpp), we aim to explore co-design and co-evolution of model architecture and hardware to fully exploit the potential of 1-bit LLM.”
