Description
Distributed training currently does not work across multiple nodes in compliant detonation chamber. This is due to a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed: for multi-node runs, the master node's address is read from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. This works for signed builds, but AZ_BATCH_MASTER_NODE is disabled in detonation chamber. Can we modify the codebase so that multi-node training works there as well? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.
Given the current implementation of set_environment_variables_for_nccl_backend (shown below), this should be a straightforward change: remove the if statement on single_node so that AZ_BATCHAI_MPI_MASTER_NODE is used in every case. I have already verified these changes in compliant detonation chamber.
def set_environment_variables_for_nccl_backend():
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""
    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )
    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
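
For discussion, here is a minimal sketch of what the function might look like once the single_node branch is removed. It is not the verified patch. The function name, the env parameter (added only so the logic can be exercised without touching the real process environment), and the port handling are assumptions: the sketch honors a preset MASTER_PORT and otherwise falls back to "6105", mirroring the current multi-node branch, and the maintainers may prefer a different default.

```python
import os


def set_environment_variables_for_nccl_backend_sketch(env=os.environ, default_port="6105"):
    """Hypothetical variant that always reads AZ_BATCHAI_MPI_MASTER_NODE."""
    # NCCL settings copied unchanged from the current implementation.
    env["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    env["NCCL_IB_DISABLE"] = "0"  # for IB

    # Same lookup for single-node and multi-node runs; AZ_BATCH_MASTER_NODE
    # is no longer consulted.
    master_node = env["AZ_BATCHAI_MPI_MASTER_NODE"]

    # Assumption: keep a preset MASTER_PORT, otherwise use default_port.
    master_port = env.get("MASTER_PORT", default_port)

    # set env variables
    env["MASTER_ADDR"] = master_node
    env["MASTER_PORT"] = master_port
    return master_node, master_port
```

Passing a plain dict such as {"AZ_BATCHAI_MPI_MASTER_NODE": "10.0.0.4"} as env makes it possible to check the resulting MASTER_ADDR and MASTER_PORT locally, without an AzureML run.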