Research on the Application of Equal Cost Multi-Path (ECMP) Technology in Data Center Networks

Research on the Application of Equal Cost Multi-Path (ECMP) Technology in Data Center Networks

Time: September 3rd, 2024

At present, a large number of ECMP (Equal-Cost Multipath Routing, abbreviated as ECMP) are applied in the Fabric architecture widely used in data center networks. Its advantages are mainly reflected in improving network redundancy and reliability, and also improving network resource utilization; a large number of ECMP links will cause other problems during operation in specific scenarios. For example, when an ECMP link is disconnected, all link traffic in the ECMP group will be re-hashed, which will cause an avalanche in stateful server areas (such as LVS), or multi-level ECMP hash polarization will occur, causing link congestion, etc.

This article will analyze the above issues based on the ECMP operation principle and explore how to optimize the use of ECMP.

Equal-cost multipath routing

Equal-cost multi-path routing means that there are multiple paths with equal costs to the same destination address. When the device supports equal-cost routing, the Layer 3 forwarding traffic sent to the destination IP or destination network segment can be shared through different paths to achieve load balancing of network links and fast switching when a link fails.

ECMP Implementation Process

▲Fig 1: ECMP flow chart

Step 1: Selection of HASH Factor

First, the data packet forwarding queries the routing table to confirm that there are multiple equal-cost routes and then extracts the key fields involved in the HASH calculation, namely the HASH factor, according to the traffic balancing algorithm currently configured by the user. The HASH factors that can be selected for ECMP traffic balancing are as follows:

Traffic load balancing mode	HASH Factor
SRC-MAC	Source IP address (SlP)
DST-MAC
SRC-DST-MAC
SRC-MAC
SRC-IP	Source and destination IP address (SlP + DlP)
DST-IP	Source and destination IP address (SlP + DlP)
SRC-DST-IP	Source and destination IP address, L4 port source and destination (SIP + DIP + SP + DP)
SRC-DST-IP-L4PORT	Enhanced mode, extracting message fields based on load balancing profile, can define and configure existing hash factors, or customize hash perturbation factors

▲ Table 1: Traffic balancing mode corresponding to HASH factor table

Note: Because ECMP is a three-layer forwarding, even if the configuration is based on source MAC, destination MAC, or source-destination MAC as the HASH factor, the system will default to selecting the source IP as the HASH factor. In addition, when selecting to extract the HASH factor as the destination IP, the system will default to selecting the source-destination IP as the HASH factor.

Step 2: HASH calculation

HASH factor extracted in step 1, the corresponding HASH lb-key (load-balance key) is calculated according to the HASH algorithm. The HASH algorithms supported by ECMP traffic balancing include XOR, CRC, CRC+ scrambling, etc.

There are many types of HASH algorithms. We will use the XOR algorithm as an example. The XOR algorithm rule includes that if the two input bits are the same, they are 0, and if they are different, they are 1. The HASH factors are different, and the operation results are also different.

1. HASH factor is IP address source (SIP)

● SIP XOR 0, to get a 32-bit value a;

● Perform XOR calculation on the high 16 bits and low 16 bits of value a to get the 16-bit value b;

● Perform XOR calculation on bits 15 to 12 and bits 11 to 8 of value b to get 4-bit value c;

● The value c replaces the 11~8 bits of the value b to get the value d;

● The lower 10 bits of the value d are the lb key.

2. HASH factor is SIP+DIP/DIP

● DIP XOR SIP, get a 32-bit value a;

● The remaining operation steps are consistent with the SIP operation.

3.HASH factor is SIP+DIP+SP+DP

● SIP XOR DIP gets the 32-bit value a;

● The lower 16 bits of value a are XORed with SP to get the 32-bit value b;

● The lower 16 bits of value b are XORed with DP to obtain the 32-bit value c;

● The high 16 bits of the value c, are XORed with the low 16 bits to get the 16-bit value d;

● XOR 11 to 8 bits of the value d with 15 to 12 bits to get the 4-bit value e.

● The value e replaces the 11th to 8th bits of the value d to obtain the value f;

● The lower 10 bits of the value f are truncated to form lb-key.

Step 3: Confirm the next forwarding hop

After the data message passes through the routing table lookup, the corresponding ECMP base value (base-ptr) is found. After the HASH lb-key is obtained through the HASH algorithm according to the HASH factor, the ECMP next-hop link number (Member-count) is calculated, and then the addition operation is performed with the ECMP base value to obtain the forwarding next-hop index, which determines the next-hop forwarding route.

Calculation formula: Next-hop = (lb-key % Member-count) + base-ptr

The above process is the normal forwarding process of ECMP, but problems may occur during operation in a specific network environment. Next, we will continue to analyze two common problems encountered by ECMP in data center networks.

Problem 1: A single link failure causes all data flows in the ECMP group to be re-hashed

When the Leaf switch sends 6 data streams to the LVS server, the Leaf first performs HASH calculations and load balances them to each LVS server. Normal traffic forwarding is shown in Figure 2:

▲ Fig 2: ECMP forwarding diagram

When a certain LVS server network card fails or a link fails, the Leaf switch will re-hash the data flow in the ECMP group and then load balance it to the remaining valid links, which will cause the TCP session to be disconnected and an avalanche phenomenon to occur. For example, in some payment services, the payment process of the same user will call multiple business services. The business side requires that the payment process should fall on the same processing server. When a single link fails, it will not only affect the users carried by the LVS where the link is located but also affect the users carried by other LVS in the ECMP group, as shown in Figure 3:

▲ Fig 3: ECMP forwarding diagram after failure

Optimization plan:

To prevent a single LVS server failure or a single link failure from causing the entire ECMP group traffic to be re-hashed, ECMP can use an elastic HASH algorithm for optimization. After using the elastic HASH algorithm, only the traffic on the failed link is re-hashed to other active links, while the data flow on the non-faulty link does not need to change the next hop. The implementation effect is shown in Figure 4:

▲Fig 4: ECMP elastic HASH algorithm

Specific implementation principle of elastic HASH:

▲Fig 5: Elastic HASH process

Generate an index table (RH Flow Set Table) on the switch to store the next-hop routing address corresponding to the relevant index value. After the data message passes through the routing table, the corresponding ECMP base value is found, and the HASH factor is extracted for the HASH operation. When the HASH Key and the ECMP number are taken as the remainder, the remainder operation is performed with the initial number regardless of whether there is a faulty link. Therefore, the operation results are consistent, and the data of the non-faulty link is still forwarded according to the original link. As shown in the figure below, after link 3 fails, the software CPU will update the RH flow table in time and replace the failed link with the normal link evenly.

▲Fig 6: Schematic diagram of elastic HASH index table replacement

Problem 2: HASH polarization problem

As shown in Figure 7, when both Leaf devices and Spine devices use an even number of uplink links and the ECMP algorithm and HASH factor are consistent, the data flow undergoes a HASH calculation on the Leaf device, and the data flow load is shared among the two Spine devices. After balancing, data flows 1, 2, and 3 are forwarded to Spine-1, and data flows 4, 5, and 6 are forwarded to Spine-2. The spine then performs HASH calculation to share the load among the two DCI cores. Because the HASH algorithm used in the Spine layer is consistent with the HASH algorithm of the Leaf, Finally, data streams 1, 2, and 3 of Spine-1 are all forwarded to DCI-1 and are not load-balanced to any data stream on DCI-2. Data streams 4, 5, and 6 of Spine-2 are all forwarded to DCI-2 and are not load-balanced to any data stream on DCI-1. The same is true for the data stream sent by Leaf-2, which leads to the HASH polarization problem. As a result, one link between SPINE and DCI is idle, which greatly wastes network resources and even causes traffic congestion.

▲Fig 7: HASH polarization

Optimization plan:

When Leaf devices and Spine devices from the same manufacturer use the same number of uplink links, avoid using the same load-balancing algorithm on two adjacent devices.

When the device is running HASH calculation, in addition to the traditional five-tuple, a disturbance factor can be added to avoid the same HASH calculation results.

During the calculation of HASH disturbance, the HASH factor is still extracted normally, and then the user-defined random disturbance factor is added. After the HASH algorithm is calculated, the HASH calculation results of different switches will be inconsistent, so as to avoid the occurrence of HASH polarization.

▲ Fig 8: HASH perturbation calculation process

Implementation of Dynamic Load Balancing Technology

In data center networks, there are many burst flows, and elephant flows and mouse flows coexist. The HASH algorithm based on data flow quintuples described in this article is combined with HASH perturbation factor technology to achieve traffic load balancing, but it cannot achieve traffic load balancing between multiple links in a network where elephant flows and mouse flows coexist.

Ruijie Networks' next-generation 25G data center network solution the new chip can support the DLB (Dynamic load balance) feature, which can realize dynamic HASH load balancing based on the traffic load status. The specific implementation method is that the switch creates a flow table for each data flow to be load balanced, records traffic statistics based on the flow table and dynamically adjusts the link load balancing according to the traffic statistics.

Summary

This article discusses the application of Equal Cost Multi-Path (ECMP) technology in data center networks. ECMP is widely used in data center networks to improve network redundancy, reliability, and resource utilization. However, it can also cause issues such as re-hashing of traffic and hash polarization. The document explores optimization techniques to mitigate these problems, including the use of elastic HASH algorithms and dynamic load balancing technology.

Technology Feast | How to build a lossless network for RDMA

Technology Feast | A Brief Discussion on 100G Optical Modules in Data Centers

Higher Education

5-Star Hotel Solution