Large-scale network models based on the Transformer architecture are highly versatile across many fields. Because such models are compute-intensive and very large, large-scale training on domestic heterogeneous accelerators is constrained by computing and communication efficiency, resulting in poor training performance. To address this problem, the hot functions and performance bottlenecks of the training process are analyzed, and corresponding performance optimization methods are proposed based on the hardware characteristics of domestic heterogeneous accelerators. To remedy the poor performance of low-precision training, a low-precision encapsulation optimization is applied to the underlying matrix-multiplication core operator. To reduce the significant kernel-launch latency caused by fine-grained core operators, the LightSeq framework is ported to the domestic heterogeneous platform for the first time, and its core fine-grained operators are specifically tuned to the hardware architecture according to the characteristics of the network structure, thereby accelerating training. For large-scale training, to overcome the low bandwidth of cross-node communication, distributed communication is optimized at both the data-transmission and hardware-topology levels, improving communication efficiency by reducing communication frequency and increasing communication bandwidth. Experimental results on the WMT '14 English-German translation dataset show that the optimizations double single-node performance with no loss of training accuracy. The computing scale is then gradually expanded to 128 nodes (512 accelerator cards) for large-scale distributed training and verification; while maintaining the performance improvement, scaling efficiency exceeds 90% on 256 accelerator cards.
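As a rough illustration of the frequency-reduction side of the communication optimization, the following is a minimal sketch (not the paper's implementation) of gradient accumulation with PyTorch DistributedDataParallel, in which the cross-node all-reduce fires only once per group of micro-batches; the model interface, data loader, and accumulation factor are placeholder assumptions.

```python
import contextlib
from torch.nn.parallel import DistributedDataParallel as DDP


def train_epoch(model: DDP, loader, optimizer, accum_steps: int = 4):
    """Accumulate gradients locally; all-reduce only every `accum_steps` steps."""
    model.train()
    optimizer.zero_grad()
    for step, (src, tgt) in enumerate(loader):
        # Synchronize gradients across nodes only on the last micro-batch
        # of each accumulation group; skip the all-reduce otherwise.
        do_sync = (step + 1) % accum_steps == 0
        ctx = contextlib.nullcontext() if do_sync else model.no_sync()
        with ctx:
            # Assumption: the wrapped model returns a scalar loss.
            loss = model(src, tgt) / accum_steps  # scale loss for accumulation
            loss.backward()
        if do_sync:
            optimizer.step()
            optimizer.zero_grad()
```

This pattern trades slightly larger effective batches for fewer cross-node collectives, which is the general idea behind reducing communication frequency when inter-node bandwidth is the bottleneck.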