Mixture of Experts (MoE) is a deep learning model architecture in which computational cost grows sub-linearly with the number of parameters, enabling simpler scaling. MoE is currently the only approach demonstrated to scale deep learning models to trillions of parameters, paving the way for models that can learn even more information and power computer vision, speech recognition, natural language processing, and machine translation systems, including applications that can help people and organizations in new ways.
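The core idea behind that scaling property can be sketched in a few lines: each token activates only the top-k of E experts, so per-token compute stays roughly constant no matter how many experts (and hence parameters) the model holds. This is an illustrative pure-Python sketch, not Tutel's API; all names are hypothetical.

```python
# Hypothetical sketch of sparse top-k expert gating, the mechanism that lets
# MoE grow parameters without growing per-token compute.

def top_k_gate(scores, k=2):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def moe_forward(token, gate_scores, experts, k=2):
    """Route one token through its top-k experts and mix their outputs."""
    chosen = top_k_gate(gate_scores, k)
    total = sum(gate_scores[e] for e in chosen)
    # Only the k selected experts run; the other experts cost nothing here.
    return sum(experts[e](token) * (gate_scores[e] / total) for e in chosen)

# Eight experts, but each token pays the compute cost of only two of them.
experts = [lambda x, m=m: x * m for m in range(1, 9)]
scores = [0.05, 0.1, 0.4, 0.05, 0.3, 0.02, 0.05, 0.03]
out = moe_forward(10.0, scores, experts, k=2)
```

Real MoE layers do this with learned gating networks over token embeddings, but the routing-and-mixing structure is the same.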
Tutel is a high-performance MoE library developed by Microsoft researchers to aid the development of large-scale DNN (deep neural network) models. Tutel is highly optimized for the new Azure NDm A100 v4 series, and its diverse, flexible MoE algorithmic support lets developers across AI domains execute MoE more easily and efficiently. For a single MoE layer, Tutel achieves an 8.49x speedup on one NDm A100 v4 node with 8 GPUs and a 2.75x speedup on 64 NDm A100 v4 nodes with 512 A100 GPUs, compared with state-of-the-art MoE implementations such as the one in Meta's Facebook AI Research Sequence-to-Sequence Toolkit (fairseq) for PyTorch.
Thanks to its optimized all-to-all communication, Tutel delivers a more than 40% end-to-end speedup for Meta's 1.1-trillion-parameter MoE language model on 64 NDm A100 v4 nodes. Running on the Azure NDm A100 v4 cluster, Tutel combines broad compatibility and comprehensive capabilities with strong performance. Tutel is free, open-source software that has been integrated into fairseq.
Tutel complements existing high-level MoE solutions such as fairseq and FastMoE by focusing on MoE-specific computation, all-to-all communication optimizations, and diverse, flexible MoE algorithmic support. Tutel has a straightforward user interface that makes it easy to combine with other MoE solutions. Developers can also use the Tutel interface to include independent MoE layers in their own DNN models and immediately take advantage of the highly optimized MoE features.
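The "independent MoE layer" idea can be pictured as a self-contained layer object that slots between existing layers of a model. The sketch below shows the interface style in plain Python; it is not Tutel's actual API, and every name in it is illustrative (consult the Tutel repository for the real signatures).

```python
# Hypothetical drop-in MoE layer: owns its gate and experts, exposes a plain
# callable interface so it can replace any feed-forward block in a model.

class SimpleMoELayer:
    def __init__(self, experts, gate_fn):
        self.experts = experts    # list of callables, one per expert
        self.gate_fn = gate_fn    # maps a token to per-expert scores

    def __call__(self, tokens):
        outputs = []
        for t in tokens:
            scores = self.gate_fn(t)
            # Top-1 routing for simplicity; real layers support top-k.
            best = max(range(len(scores)), key=lambda e: scores[e])
            outputs.append(self.experts[best](t))
        return outputs

# Usage: slot the layer into an existing forward pass.
layer = SimpleMoELayer(
    experts=[lambda x: x + 100, lambda x: x - 100],
    gate_fn=lambda t: [1.0, 0.0] if t >= 0 else [0.0, 1.0],
)
result = layer([5, -5])  # positive tokens route to expert 0, negative to expert 1
```

The point of the design is that the surrounding model never needs to know about routing: it calls the layer like any other module.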
For lack of efficient implementations, MoE-based DNN models rely on a naive combination of off-the-shelf DNN operators provided by deep learning frameworks such as PyTorch and TensorFlow to assemble the MoE computation. Because of redundant data processing, this approach incurs significant performance overhead. Tutel instead designs and implements several GPU kernels that provide operators for MoE-specific calculations. For example, Tutel reduces the time complexity of dispatching the gating output from O(N³) to O(N²), which dramatically improves data transmission efficiency. Tutel also uses a fast cumsum-minus-one operator, which achieves a 24x speedup over the fairseq implementation.
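To see what the cumsum-minus-one (exclusive prefix sum) step computes, consider MoE dispatch: given each token's assigned expert, the prefix sum yields each token's slot inside that expert's receive buffer, so tokens can be scattered without collisions. This is a hedged pure-Python sketch of the logic; Tutel implements it as a fused GPU kernel, and the function names here are invented.

```python
# Illustrative exclusive prefix sum ("cumsum minus one") over expert
# assignments: each token's running count for its expert is its buffer slot.

def exclusive_cumsum_positions(expert_ids, num_experts):
    """For each token, its position within its expert's receive buffer."""
    positions, counts = [], [0] * num_experts
    for e in expert_ids:
        positions.append(counts[e])  # inclusive cumsum minus one == count so far
        counts[e] += 1
    return positions, counts

# Six tokens routed to three experts.
pos, counts = exclusive_cumsum_positions([0, 1, 0, 2, 1, 0], num_experts=3)
# Tokens for expert 0 land in slots 0, 1, 2; expert 1 in 0, 1; expert 2 in 0.
```

On a GPU this loop becomes a parallel scan, which is why a dedicated kernel pays off compared with composing generic framework operators.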
For large-scale MoE training, Tutel optimizes all-to-all collective communication on Azure NDm A100 v4 clusters, including CPU-GPU binding and adaptive routing (AR) tuning. On a NUMA (non-uniform memory access) system, proper CPU-GPU binding is essential for all-to-all performance, especially on NDm A100 v4 nodes. Unfortunately, current machine learning frameworks lack an efficient all-to-all communication library, which causes performance regressions in large-scale distributed training. Tutel optimizes the binding automatically and provides an intuitive interface for user fine-tuning. Tutel also exploits multipath technology, specifically AR, on NDm A100 v4 clusters. In MoE's all-to-all communication, the total data traffic for each GPU stays constant, but the data size exchanged between each GPU pair shrinks as the number of GPUs grows, which makes multipath routing increasingly beneficial.
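The traffic pattern described above is easy to verify with back-of-envelope arithmetic: in a balanced all-to-all, each GPU's total send volume is fixed, so the chunk destined for any single peer shrinks as 1/P with the number of GPUs P. The 256 MiB figure below is an assumed, illustrative total, not a measurement.

```python
# Back-of-envelope sketch of balanced all-to-all message sizing: fixed total
# per GPU, per-peer chunk shrinking as 1/P. Illustrative numbers only.

def per_pair_message_bytes(total_bytes_per_gpu, num_gpus):
    """Bytes each GPU sends to one peer in a balanced all-to-all."""
    return total_bytes_per_gpu // num_gpus

total = 256 * 1024 * 1024  # assume each GPU dispatches 256 MiB of tokens
chunks = {p: per_pair_message_bytes(total, p) for p in (8, 64, 512)}
# At 512 GPUs, each pairwise message is only 0.5 MiB, small enough that
# single-path routing underutilizes the fabric -- hence adaptive routing.
```

Small per-pair messages are exactly the regime where spreading each transfer across multiple network paths (AR) recovers bandwidth.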
On Azure NDm A100 v4, Meta has used Tutel to train its large language model, which uses an attention-based neural architecture similar to GPT-3. The model consists of 32 attention layers, each with 32 x 128-dimensional heads. An MoE layer appears in every other layer, and each GPU hosts one expert. Because all-to-all communication becomes the bottleneck as the number of GPUs grows, Tutel's gain ranges from up to 131 percent with 8 A100 GPUs down to 40 percent with 512 A100 GPUs. Further optimizations are expected in the next release.
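A hedged back-of-envelope shows how the expert count drives a parameter total on the order of 1.1 trillion. The assumptions below are not stated in the article: model dimension equal to 32 heads x 128 = 4096, an FFN hidden size of 4x the model dimension, each expert being a standard two-matrix feed-forward block, and biases ignored.

```python
# Rough parameter accounting for the MoE model described above, under
# assumed (not source-confirmed) architectural details.

model_dim = 32 * 128                 # 4096, from the stated head layout
ffn_hidden = 4 * model_dim           # assumed 4x FFN expansion
params_per_expert = 2 * model_dim * ffn_hidden  # two weight matrices per expert
num_experts = 512                    # one expert per GPU on 512 GPUs
moe_layers = 32 // 2                 # an MoE layer in every other of 32 layers

expert_params = params_per_expert * num_experts * moe_layers
# Roughly 1.1e12 parameters from the experts alone, dominating the total.
```

The dense attention and shared layers add comparatively little, which is the point of MoE: almost all of the 1.1 trillion parameters live in experts that each token never touches.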
MoE is a technology with great potential. It allows holistic training that combines techniques from several areas, such as systematic routing and large-scale network balancing, and it can also take advantage of GPU-based acceleration. Tutel outperformed the fairseq framework and has since been incorporated into the DeepSpeed framework. Tutel and related integrations will benefit Azure services, especially for companies that want to scale large models easily. MoE is still in its early stages and greater efforts are required to realize its full potential, so Tutel will continue to evolve and offer more exciting results.