
Enable distributed training over multiple nodes in compliant detonation chamber #93

@nifarn

Distributed training currently does not work across multiple nodes in the compliant detonation chamber. The cause is a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed: for multi-node runs, the master node's address is read from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. This works for signed builds, but AZ_BATCH_MASTER_NODE is disabled in the detonation chamber. Can we modify the codebase to support multi-node training there? Based on this stackoverflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.

Given the current implementation of set_environment_variables_for_nccl_backend (shown below), this should be a straightforward change: remove the if statement on single_node so that the master address is always read from AZ_BATCHAI_MPI_MASTER_NODE. I have already verified this change in the compliant detonation chamber.

def set_environment_variables_for_nccl_backend():
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""

    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )

    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")

        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
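
For reference, the proposed change could look roughly like the sketch below. It is only an illustration of dropping the single_node branch, not the final patch. The helper name resolve_master_endpoint, the optional env parameter (added so the function can be exercised without touching the real process environment), and the choice to keep a MASTER_PORT override with "54965" as the fallback are all assumptions on my part; the issue text only specifies that the address should come from AZ_BATCHAI_MPI_MASTER_NODE.

```python
import os
from typing import MutableMapping, Optional, Tuple

# Assumption: reuse the port from the former single-node branch as the default.
DEFAULT_MASTER_PORT = "54965"


def resolve_master_endpoint(env: MutableMapping[str, str]) -> Tuple[str, str]:
    """Hypothetical helper: pick the master address and port for all runs.

    The address always comes from AZ_BATCHAI_MPI_MASTER_NODE, so
    AZ_BATCH_MASTER_NODE is never read.
    """
    master_node = env["AZ_BATCHAI_MPI_MASTER_NODE"]
    # Assumption: honor an explicitly set MASTER_PORT, else use the default.
    master_port = env.get("MASTER_PORT", DEFAULT_MASTER_PORT)
    return master_node, master_port


def set_environment_variables_for_nccl_backend(
    env: Optional[MutableMapping[str, str]] = None,
) -> None:
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""
    if env is None:
        env = os.environ

    # NCCL environment. Still works without it.
    env["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    env["NCCL_IB_DISABLE"] = "0"  # for IB

    # No single_node check: the same variable is used for one or many nodes.
    master_node, master_port = resolve_master_endpoint(env)

    # set env variables
    env["MASTER_ADDR"] = master_node
    env["MASTER_PORT"] = master_port
```

Passing a plain dict as env makes it easy to check the behavior locally, e.g. that AZ_BATCH_MASTER_NODE is ignored even when it is present.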
