Alibaba Cloud engineer and researcher Ennan Zhai shared his research paper via GitHub, revealing the cloud provider's data center network design for Large Language Model (LLM) training. The PDF document, titled "Alibaba HPN: A Data Center Network for Large Language Model Training," describes how Alibaba used Ethernet to let its 15,000 GPUs communicate with one another.
General cloud computing generates constant but small data streams at speeds below 10 Gbps. LLM training, by contrast, produces periodic bursts of data that can reach 400 Gbps. According to the paper, "this characteristic of LLM training predisposes Equal-Cost Multi-Path (ECMP), the commonly used load-balancing scheme in traditional data centers, to hash polarization, which causes problems such as uneven traffic distribution."
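As a rough illustration of why a few huge flows are harder to balance than many small ones, here is a minimal, self-contained simulation of ECMP-style hashing. The path count, flow names, and hash function below are our own assumptions, not Alibaba's, and real hash polarization also involves correlated hash functions across switch tiers; this sketch only shows the uneven-spread symptom:

```python
import random
import zlib

N_PATHS = 8

def ecmp_path(flow_id: str) -> int:
    """Pick a path by hashing the flow id (standing in for the 5-tuple)."""
    return zlib.crc32(flow_id.encode()) % N_PATHS

def link_load(flows: dict[str, float]) -> list[float]:
    """Sum per-flow bandwidth (Gbps) onto each equal-cost path."""
    load = [0.0] * N_PATHS
    for flow_id, gbps in flows.items():
        load[ecmp_path(flow_id)] += gbps
    return load

# Cloud-style traffic: thousands of flows, each under 10 Gbps.
cloud = {f"tenant-{i}": random.uniform(0.1, 10) for i in range(4000)}

# LLM-training-style traffic: a handful of synchronized 400 Gbps bursts.
llm = {f"gpu-pair-{i}": 400.0 for i in range(16)}

print("cloud load per path:", [round(x) for x in link_load(cloud)])
print("LLM load per path:  ", [round(x) for x in link_load(llm)])
```

With thousands of small flows the per-path totals come out nearly even, but the sixteen elephant flows typically land three or four to one path and none on another, leaving some links saturated while others sit idle.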
To avoid this, Zhai and his team developed the High-Performance Network (HPN), which uses a "2-layer, dual-plane architecture" that reduces the number of possible ECMP instances while allowing the system to accurately select network paths capable of handling "elephant flows."
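The payoff of that design choice can be sketched with some back-of-the-envelope arithmetic. The switch counts below are made up for illustration and do not come from the paper; the point is only that ECMP fan-out multiplies across tiers, so removing a tier and pinning each NIC port to one of two independent planes shrinks the choice set:

```python
aggs_per_plane = 8    # hypothetical aggregation switches reachable per plane
cores = 64            # hypothetical core switches in a 3-tier design

# Classic 3-tier Clos: pick any aggregation switch, then any core switch,
# so the candidate paths multiply across the tiers.
paths_3_tier = (aggs_per_plane * 2) * cores       # 1024 candidate paths

# 2-tier, dual-plane: the port's plane is fixed, so the only ECMP decision
# left is which aggregation switch within that plane to cross.
paths_dual_plane = aggs_per_plane                 # 8 candidate paths

print(f"3-tier: {paths_3_tier} paths, 2-tier dual-plane: {paths_dual_plane} paths")
```

With so few candidate paths left, the system can afford to pick among them precisely instead of hashing blindly.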
HPN also uses dual top-of-rack (ToR) switches that back each other up. A ToR switch is the most common single point of failure in LLM training, which requires GPUs to complete iterations in sync. Alibaba Cloud divides its data centers into hosts, with each host equipped with eight GPUs. Each GPU has its own dual-port network interface card (NIC), and each GPU-NIC pair is called a 'rail'.
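In code form, the host layout described above might look like the following sketch; the class names and switch names are hypothetical, not Alibaba's:

```python
from dataclasses import dataclass

@dataclass
class Rail:
    gpu: int
    tor_a: str   # first ToR switch this rail's dual-port NIC plugs into
    tor_b: str   # second ToR switch, taking over if the first fails

@dataclass
class Host:
    name: str
    rails: list[Rail]

def build_host(name: str, tor_a: str, tor_b: str) -> Host:
    """Wire all eight rails of a host to a pair of ToR switches."""
    return Host(name, [Rail(gpu=i, tor_a=tor_a, tor_b=tor_b) for i in range(8)])

host = build_host("host-0", "tor-0a", "tor-0b")
# Dual-ToR property: losing either switch leaves every rail connected.
assert all(r.tor_a != r.tor_b for r in host.rails)
```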
Each host also has an additional NIC that connects to the frontend network. Each rail connects to two different ToR switches, so the whole host keeps working even if one switch fails. Although Alibaba Cloud chose not to use NVLink for inter-host communication, it still relies on Nvidia's proprietary technology for the intra-host network, since communication between GPUs within a host requires far more bandwidth. Because communication between rails is much slower, the "dedicated 400 Gbps RDMA network throughput resulting in a total bandwidth of 3.2 Tbps" per host is more than enough to make full use of the PCIe Gen5 x16 bandwidth of the graphics cards.
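A quick back-of-the-envelope check shows how those figures line up. The PCIe numbers are standard spec values rather than anything from the paper, and the 2 x 200 Gbps port split is our reading of the 400 Gbps per-rail figure:

```python
nic_ports_per_gpu = 2
gbps_per_port = 200                        # assumed 2 x 200 Gbps per rail
gpus_per_host = 8

rail_gbps = nic_ports_per_gpu * gbps_per_port      # 400 Gbps per GPU
host_gbps = rail_gbps * gpus_per_host              # 3,200 Gbps = 3.2 Tbps

# PCIe Gen5 x16: 32 GT/s per lane, 16 lanes, 128b/130b encoding overhead.
pcie5_x16_gbps = 32 * 16 * (128 / 130)             # ~504 Gbps usable

print(f"per rail: {rail_gbps} Gbps, per host: {host_gbps / 1000} Tbps")
print(f"PCIe Gen5 x16 ceiling: {pcie5_x16_gbps:.0f} Gbps, so a 400 Gbps rail fits")
```

In other words, a 400 Gbps rail sits just under the roughly 504 Gbps a PCIe Gen5 x16 slot can actually deliver, so the network, not the bus, remains the binding constraint.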
Alibaba Cloud also uses 51.2 Tb/s single-chip Ethernet ToR switches, as multi-chip solutions are more prone to instability, with a failure rate four times higher than that of single-chip switches. However, these switches run hot, and no heatsink on the market could keep them from shutting down due to overheating. So the company created its own solution: a vapor chamber heatsink with several columns in the center to transport thermal energy much more efficiently.
Ennan Zhai and his team will present their work at the SIGCOMM (Special Interest Group on Data Communication) conference in Sydney, Australia, in August. Many companies, including AMD, Intel, Google, and Microsoft, will likely take an interest in this project, mainly because they have joined forces to create Ultra Accelerator Link, an open-standard interconnect designed to compete with NVLink.
This is especially true since Alibaba Cloud has been using HPN for over eight months, meaning the technology has already been tried and tested. But HPN still has some disadvantages, the biggest being its complex wiring structure. With each host having nine NICs and each rail NIC connected to two different ToR switches, there are many opportunities to mix up which cable goes to which port.
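That is exactly the kind of mistake that is cheap to catch in software. A hypothetical cabling audit, with invented host and switch names, might look like this:

```python
# Map of (host, rail, port) -> ToR switch, as recorded during cabling.
cabling = {
    ("host-0", "rail-3", "port-0"): "tor-0a",
    ("host-0", "rail-3", "port-1"): "tor-0a",   # oops: both ports on one switch
}

def miswired_rails(cabling: dict[tuple, str]) -> set[tuple]:
    """Return (host, rail) pairs whose two ports land on the same ToR."""
    seen: dict[tuple, str] = {}
    bad = set()
    for (host, rail, _port), tor in cabling.items():
        key = (host, rail)
        if key in seen and seen[key] == tor:
            bad.add(key)
        seen[key] = tor
    return bad

print(miswired_rails(cabling))   # {('host-0', 'rail-3')}
```

A rail wired this way would still pass basic connectivity tests, but it would silently lose the dual-ToR failover guarantee, which is why the wiring complexity matters.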
Nevertheless, this technology is probably more affordable than NVLink, allowing any institution setting up a data center to save a lot of money on setup costs (and perhaps even avoid Nvidia technology altogether, especially if it is one of the companies subject to US sanctions in the ongoing chip war with China).