feat: Add Expert Affinity Aware EPLB algorithm. #2

shangyuan-ant · 2025-09-19T07:57:56Z

Motivation

The built-in EPLB algorithm balances computational load across GPUs and machines, but it does not adequately account for inter-expert communication, such as cross-node traffic. In large-scale expert-parallel deployments, excessive cross-node communication is likely to reduce computational efficiency.

Modifications

Building on expert load tracking, we also record the top-k expert groups activated in each iteration and use them to compute an expert affinity matrix (i.e., the probability of co-activation). After intra-card load balancing via EPLB, we adjust card placement based on the affinity between the highest-load expert on one GPU and the experts on other GPUs, thereby reducing subsequent cross-node communication. This approach can achieve an additional ~5% performance improvement over standard EPLB.
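The affinity matrix described above can be illustrated with a minimal sketch. It assumes that affinity is the fraction of tokens in which two experts appear together in the same top-k selection; the function name `expert_affinity` and the normalization by token count are illustrative assumptions, not taken from the PR code.

```python
from itertools import combinations


def expert_affinity(topk_ids, num_experts):
    """Estimate pairwise co-activation probabilities (hypothetical helper).

    topk_ids: one list of selected expert ids per token.
    Returns a symmetric num_experts x num_experts matrix whose entry [a][b]
    is the fraction of tokens that activated both expert a and expert b.
    """
    num_tokens = len(topk_ids)
    affinity = [[0.0] * num_experts for _ in range(num_experts)]
    for experts in topk_ids:
        # Count every unordered pair of distinct experts chosen for this token.
        for a, b in combinations(sorted(set(experts)), 2):
            affinity[a][b] += 1.0
            affinity[b][a] += 1.0
    if num_tokens:
        # Assumption: normalize counts by the number of tokens observed.
        for row in affinity:
            for j in range(num_experts):
                row[j] /= num_tokens
    return affinity


# Four tokens, top-2 routing over four experts.
topk_ids = [[0, 1], [0, 1], [2, 3], [0, 2]]
affinity = expert_affinity(topk_ids, num_experts=4)
```

In this toy input, experts 0 and 1 are selected together for two of the four tokens, so they have the highest affinity and would be the strongest candidates for placement on the same node.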

Accuracy Tests

Benchmarking and Profiling

■ request-rate = 5 | max-concurrency(batch-size) = (512 896 1024 1536 2048)
■ num-prompts = 4096 | input-len = 4096 | output-len = 1536
■ dataset: ShareGPT_V3_unfiltered_cleaned_split.json

| batch-size | Metric | W/o EPLB | With EPLB (vanilla) | With EPLB (Expert-Affinity Aware) |
|---|---|---|---|---|
| 64 | P50-TTFT | 566.78 | 540.49 | 559.74 |
| | P90-TPOT | 45.02 | 44.95 | 44.94 |
| | QPS | 1.35 | 1.36 | 1.36 |
| 128 | P50-TTFT | 539.93 | 537.22 | 541.04 |
| | P90-TPOT | 49.18 | 49.10 | 49.10 |
| | QPS | 2.36 | 2.36 | 2.36 |
| 256 | P50-TTFT | 764.42 | 754.62 | 758.67 |
| | P90-TPOT | 56.32 | 56.18 | 56.06 |
| | QPS | 3.37 | 3.37 | 3.37 |
| 1536 | P50-TTFT | 1464.77 | 1463.27 | 1485.56 |
| | P90-TPOT | 85.12 | 84.31 | 81.38 |
| | P95-ITL | 102.60 | 100.71 | 97.22 |
| | QPS | 4.48 | 4.48 | 4.49 |
| 2048 | P50-TTFT | 1470.45 | 1463.95 | 1480.91 |
| | P90-TPOT | 85.15 | 84.60 | 81.39 |
| | P95-ITL | 102.87 | 100.21 | 97.08 |
| | QPS | 4.48 | 4.48 | 4.49 |
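As a rough way to read the benchmark numbers, the sketch below computes the relative reduction of the latency metrics at batch size 1536, using only values copied from the results above. The helper `pct_reduction` is illustrative, and the comparison assumes that lower P90-TPOT and P95-ITL values are better.

```python
def pct_reduction(baseline, value):
    """Relative reduction of `value` versus `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# (W/o EPLB, vanilla EPLB, affinity-aware EPLB) from the batch-size 1536 rows.
rows_1536 = {
    "P90-TPOT": (85.12, 84.31, 81.38),
    "P95-ITL": (102.60, 100.71, 97.22),
}

reductions = {}
for metric, (no_eplb, vanilla, affinity_aware) in rows_1536.items():
    vs_no_eplb = pct_reduction(no_eplb, affinity_aware)
    vs_vanilla = pct_reduction(vanilla, affinity_aware)
    reductions[metric] = (vs_no_eplb, vs_vanilla)
    print(f"{metric}: {vs_no_eplb:.2f}% vs no EPLB, {vs_vanilla:.2f}% vs vanilla EPLB")
```

On these two rows the affinity-aware variant reduces latency by roughly 3.5% relative to vanilla EPLB and by roughly 4.4% to 5.2% relative to running without EPLB.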

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

Signed-off-by: shangyuan-ant <[email protected]>

yuan-luo · 2025-09-27T09:41:03Z

Could you paste the performance gain result?

chenglu66 · 2025-12-04T12:11:25Z

Hi, I noticed the following calculation in the code and I don't quite understand it. If cross-node communication is high, then after swapping g1 and g2, what used to be intra-node communication for g1 becomes inter-node communication, so this term should be a cost rather than a gain.

    # Compute the gain from swapping g1 and g2
    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            gain += group_comm_cost[g1, other_g1]
            gain -= group_comm_cost[g2, other_g1]

    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            gain += group_comm_cost[g2, other_g2]
            gain -= group_comm_cost[g1, other_g2]

Should the code be the following instead?

    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            gain += group_comm_cost[g2, other_g1]
            gain -= group_comm_cost[g1, other_g1]

    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            gain += group_comm_cost[g1, other_g2]
            gain -= group_comm_cost[g2, other_g2]

I have only just started reading the code, so please point it out if I have misunderstood.

shangyuan-ant · 2025-12-05T02:09:32Z

Hi chenglu, your understanding is correct. The gains here should be, respectively:

    gain += (group_comm_cost[g2, other_g1].item() - group_comm_cost[g1, other_g1].item())

    gain += (group_comm_cost[g1, other_g2].item() - group_comm_cost[g2, other_g2].item())

I will push a new version.
Thanks for the correction :)
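To check the sign convention discussed in this thread, here is a small self-contained sketch. It assumes `cost[a][b]` is a symmetric communication volume between groups and that "gain" means the reduction in total cross-node volume after swapping g1 and g2. The helpers `swap_gain` and `cross_node_cost` and the plain nested lists are illustrative; the PR code indexes a tensor and calls `.item()`.

```python
def swap_gain(cost, node_groups, node1, g1, node2, g2):
    """Reduction in cross-node traffic if g1 (on node1) and g2 (on node2) swap."""
    gain = 0.0
    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            # g2 joins other_g1's node (traffic saved); g1 leaves it (traffic added).
            gain += cost[g2][other_g1] - cost[g1][other_g1]
    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            # g1 joins other_g2's node (traffic saved); g2 leaves it (traffic added).
            gain += cost[g1][other_g2] - cost[g2][other_g2]
    return gain


def cross_node_cost(cost, node_groups):
    """Total communication volume between groups placed on different nodes."""
    total = 0.0
    for i in range(len(node_groups)):
        for j in range(i + 1, len(node_groups)):
            for a in node_groups[i]:
                for b in node_groups[j]:
                    total += cost[a][b]
    return total


# Toy symmetric cost matrix for four groups on two nodes.
cost = [
    [0, 1, 8, 1],
    [1, 0, 2, 9],
    [8, 2, 0, 1],
    [1, 9, 1, 0],
]
before = [[0, 1], [2, 3]]
after = [[0, 2], [1, 3]]  # groups 1 and 2 swapped

gain = swap_gain(cost, before, node1=0, g1=1, node2=1, g2=2)
saved = cross_node_cost(cost, before) - cross_node_cost(cost, after)
```

In this toy case the swap moves the heavily communicating pairs (0, 2) and (1, 3) onto the same nodes, and the corrected formula agrees with the brute-force difference in cross-node volume.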

feat: Add Expert Affinity Aware EPLB algorithm.

1d5f106

Signed-off-by: shangyuan-ant <[email protected]>

This was referenced Oct 20, 2025

[Don't merge] Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices sgl-project/sglang#11854

Closed

[Don't merge] Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices #4

Draft
