@shangyuan-ant shangyuan-ant commented Sep 19, 2025

Motivation

The native EPLB implementation focuses on balancing computational load across GPUs and machines, but it does not adequately account for inter-expert communication, such as cross-node traffic. In large-scale expert-parallel deployments, excessive cross-node communication is likely to reduce computational efficiency.

Modifications

Building on expert load tracking, we additionally record the top-k expert groups activated in each iteration and use them to compute an expert affinity matrix (i.e., the probability of co-activation). After intra-card load balancing via EPLB, we adjust card placement based on the affinity between the highest-load expert on one GPU and the experts on other GPUs, thereby reducing subsequent cross-node communication. This approach can achieve an additional ~5% performance improvement over standard EPLB.
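To make the affinity idea concrete, here is a minimal sketch of estimating co-activation frequencies from recorded top-k routing decisions. It is an illustration only, not the code in this PR: the function name `expert_affinity`, the per-token list input format, and the normalization by token count are all assumptions.

```python
from itertools import combinations


def expert_affinity(topk_ids, num_experts):
    """Estimate how often each pair of experts is activated together.

    topk_ids: list of per-token lists of selected expert ids
              (hypothetical input format, assumed for this sketch).
    Returns a symmetric num_experts x num_experts matrix whose entry
    [i][j] is the fraction of tokens routed to both expert i and expert j.
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for experts in topk_ids:
        # Count every unordered pair of distinct experts chosen for this token.
        for i, j in combinations(sorted(set(experts)), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    num_tokens = max(len(topk_ids), 1)  # avoid division by zero on empty input
    return [[c / num_tokens for c in row] for row in counts]


# Three tokens with top-2 routing over 4 experts.
affinity = expert_affinity([[0, 1], [0, 1], [2, 3]], num_experts=4)
```

Pairs with a high entry are candidates for placement on the same node, since tokens that activate one of them tend to activate the other as well.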

Accuracy Tests

Benchmarking and Profiling

■ request-rate = 5 | max-concurrency(batch-size) = (512 896 1024 1536 2048)
■ num-prompts = 4096 | input-len = 4096 | output-len = 1536
■ dataset: ShareGPT_V3_unfiltered_cleaned_split.json
| batch-size | Metric | W/o EPLB | With EPLB (vanilla) | With EPLB (Expert-Affinity Aware) |
|---|---|---|---|---|
| 64 | P50-TTFT | 566.78 | 540.49 | 559.74 |
| | P90-TPOT | 45.02 | 44.95 | 44.94 |
| | QPS | 1.35 | 1.36 | 1.36 |
| 128 | P50-TTFT | 539.93 | 537.22 | 541.04 |
| | P90-TPOT | 49.18 | 49.10 | 49.10 |
| | QPS | 2.36 | 2.36 | 2.36 |
| 256 | P50-TTFT | 764.42 | 754.62 | 758.67 |
| | P90-TPOT | 56.32 | 56.18 | 56.06 |
| | QPS | 3.37 | 3.37 | 3.37 |
| 1536 | P50-TTFT | 1464.77 | 1463.27 | 1485.56 |
| | P90-TPOT | 85.12 | 84.31 | 81.38 |
| | P95-ITL | 102.60 | 100.71 | 97.22 |
| | QPS | 4.48 | 4.48 | 4.49 |
| 2048 | P50-TTFT | 1470.45 | 1463.95 | 1480.91 |
| | P90-TPOT | 85.15 | 84.60 | 81.39 |
| | P95-ITL | 102.87 | 100.21 | 97.08 |
| | QPS | 4.48 | 4.48 | 4.49 |
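For reading the numbers above, the relative change between two columns is (baseline - candidate) / baseline. The snippet below is a small sketch applying that formula to the batch-size 1536 latency rows; the helper name `relative_reduction` is hypothetical, and the values are copied from the benchmark results.

```python
def relative_reduction(baseline, candidate):
    """Percentage by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline * 100.0


# Batch size 1536: (vanilla EPLB, expert-affinity-aware EPLB).
rows_1536 = {
    "P90-TPOT": (84.31, 81.38),
    "P95-ITL": (100.71, 97.22),
}

reductions = {
    name: relative_reduction(vanilla, affinity_aware)
    for name, (vanilla, affinity_aware) in rows_1536.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower than vanilla EPLB")
```

The same helper can be pointed at the W/o EPLB column to compare against the unbalanced baseline instead.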

Checklist

@yuan-luo
Collaborator

Could you paste the performance gain result?

@chenglu66

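Hello, I noticed the following calculation in the code and I don't quite understand it. If cross-node communication is high, then after swapping g1 and g2, communication that was originally intra-node for g1 becomes inter-node, so this should count as a cost rather than a gain.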

    # Compute the gain from swapping g1 and g2
    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            gain += group_comm_cost[g1, other_g1]
            gain -= group_comm_cost[g2, other_g1]

    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            gain += group_comm_cost[g2, other_g2]
            gain -= group_comm_cost[g1, other_g2]

Should the code instead be the following?

    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            gain += group_comm_cost[g2, other_g1]
            gain -= group_comm_cost[g1, other_g1]

    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            gain += group_comm_cost[g1, other_g2]
            gain -= group_comm_cost[g2, other_g2]

I have only just started reading the code, so please point it out if I have misunderstood.

@shangyuan-ant
Author

Hi chenglu, your understanding is correct. The two gain updates here should be, respectively:

    gain += (group_comm_cost[g2, other_g1].item() - group_comm_cost[g1, other_g1].item())

    gain += (group_comm_cost[g1, other_g2].item() - group_comm_cost[g2, other_g2].item())

I will push a new version.
Thanks for the correction :)
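The corrected sign convention can be sanity-checked by comparing the incremental gain against a brute-force recomputation of total cross-node cost before and after the swap. The sketch below is a plain-Python illustration under stated assumptions, not the PR's implementation: the helpers `cross_node_cost` and `swap_gain`, the nested-list cost matrix, and the toy placement are all hypothetical, whereas the PR indexes a tensor as `group_comm_cost[g1, other_g1]`.

```python
def cross_node_cost(node_groups, cost):
    """Total communication cost between groups placed on different nodes."""
    total = 0.0
    nodes = list(node_groups)
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            for ga in node_groups[nodes[a]]:
                for gb in node_groups[nodes[b]]:
                    total += cost[ga][gb]
    return total


def swap_gain(node_groups, cost, node1, g1, node2, g2):
    """Reduction in cross-node cost if g1 (on node1) and g2 (on node2) trade places."""
    gain = 0.0
    for other_g1 in node_groups[node1]:
        if other_g1 != g1:
            # g2 becomes a node-local peer of other_g1 (cost saved),
            # while g1 becomes a remote peer of other_g1 (cost added).
            gain += cost[g2][other_g1] - cost[g1][other_g1]
    for other_g2 in node_groups[node2]:
        if other_g2 != g2:
            gain += cost[g1][other_g2] - cost[g2][other_g2]
    return gain


# Toy placement: groups 0 and 2 talk heavily, as do groups 1 and 3,
# but each pair is initially split across the two nodes.
node_groups = {0: [0, 1], 1: [2, 3]}
cost = [
    [0, 1, 5, 1],
    [1, 0, 1, 5],
    [5, 1, 0, 1],
    [1, 5, 1, 0],
]

before = cross_node_cost(node_groups, cost)
gain = swap_gain(node_groups, cost, node1=0, g1=1, node2=1, g2=2)
after = cross_node_cost({0: [0, 2], 1: [1, 3]}, cost)
```

In this toy case the incremental gain equals `before - after`, which is the property the corrected formula is meant to satisfy; with the original signs the result would be the negation.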
