I have problems getting CUDA-aware MPI running in Julia. I have a C++ example that works perfectly if I run it with mpiexec -n 2 ./alltoall_test . The equivalent Julia code, run with mpiexec -n 2 julia alltoall_test.jl , fails with a segmentation fault: [1642930332.032032] [gcn19:4087661:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22 . I have set up MPI.jl to use the system MPI and it reports the correct version, so my impression is that the problem lies on the CUDA side, which wants to download artefacts as soon as I use CuArray , despite CUDA being installed. My impression is that I am using a CUDA version that is incompatible with the one MPI was compiled against, but I do not know how to verify this. I used the JULIA_CUDA_USE_BINARY_BUILDER=false setting. The two test codes are:

using MPI
using CUDA
np = 2
MPI.Init()
comm = MPI.COMM_WORLD
mpiid = MPI.Comm_rank(comm)
print("The MPI rank is: $mpiid\n")
# Assign one GPU per MPI rank.
device!(mpiid)
print("The CUDA device is: $(device())\n")
n = 1024
data_cpu = rand(n)
data_out_cpu = similar(data_cpu)
data = CuArray(data_cpu)
data_out = similar(data)
# Test the alltoall on the CPU
mpi_data_cpu = MPI.UBuffer(data_cpu, 512)
mpi_data_out_cpu = MPI.UBuffer(data_out_cpu, 512)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
# Test the alltoall on the GPU
print("$mpiid has CUDA: $(MPI.has_cuda())\n")
mpi_data = MPI.UBuffer(data, 512)
mpi_data_out = MPI.UBuffer(data_out, 512)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
# Close the MPI.
MPI.Finalize()

The C++ test code:

#include <iostream>
#include <vector>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>
int main()
{
    MPI_Init(NULL, NULL);
    int n, id;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    const size_t size_tot = 1024*1024*1024;
    const size_t size_max = size_tot / n;
    // CPU TEST
    std::vector<double> a_cpu_in (size_tot);
    std::vector<double> a_cpu_out(size_tot);
    std::fill(a_cpu_in.begin(), a_cpu_in.end(), id);
    std::cout << id << ": Starting CPU all-to-all\n";
    auto time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
            a_cpu_in .data(), size_max, MPI_DOUBLE,
            a_cpu_out.data(), size_max, MPI_DOUBLE,
            MPI_COMM_WORLD);
    auto time_end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ": Finished CPU all-to-all in " << std::to_string(duration) << " (ms)\n";
    // GPU TEST
    int id_local = id % 4;
    cudaSetDevice(id_local);
    double* a_gpu_in;
    double* a_gpu_out;
    cudaMalloc((void **)&a_gpu_in , size_tot * sizeof(double));
    cudaMalloc((void **)&a_gpu_out, size_tot * sizeof(double));
    cudaMemcpy(a_gpu_in, a_cpu_in.data(), size_tot*sizeof(double), cudaMemcpyHostToDevice);
    int id_gpu;
    cudaGetDevice(&id_gpu);
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Starting GPU all-to-all\n";
    time_start = std::chrono::high_resolution_clock::now();
    MPI_Alltoall(
            a_gpu_in , size_max, MPI_DOUBLE,
            a_gpu_out, size_max, MPI_DOUBLE,
            MPI_COMM_WORLD);
    time_end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Finished GPU all-to-all in " << std::to_string(duration) << " (ms)\n";
    MPI_Finalize();
    return 0;
}

I did that, but it does not seem to work. I already have it loaded as an environment variable, but I see this happening at the moment the code reaches the first data = CuArray(data_cpu) line.

chiel@gcn19:/gpfs/scratch1/shared/chiel/MicroHH.jl/test$ JULIA_CUDA_USE_BINARY_BUILDER=false mpiexec -n 2 julia alltoall_test.jl 
The MPI rank is: 1
The MPI rank is: 0
The CUDA device is: CuDevice(0)
The CUDA device is: CuDevice(1)
 Downloading Downloading artifact: CUDA
 artifact: CUDA
              
JULIA_CUDA_USE_BINARY_BUILDER=false

needs to be set for each process. One way could be to put it in ~/.bashrc or similar; another way is to set it explicitly everywhere, i.e. the idea is:

mpiexec -n 2 'JULIA_CUDA_USE_BINARY_BUILDER=false julia alltoall_test.jl'

Maybe you will need to put JULIA_CUDA_USE_BINARY_BUILDER=false julia alltoall_test.jl in a shell script and call

mpiexec -n 2 my_julia_multi_gpu_app.sh
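
To verify that the setting actually reaches every rank, each process could print it before any CUDA call; a minimal sketch (using the variable name as written above):

using MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
# Print the value this rank sees, before CUDA.jl is touched at all.
println("rank $rank: JULIA_CUDA_USE_BINARY_BUILDER = ",
        get(ENV, "JULIA_CUDA_USE_BINARY_BUILDER", "<unset>"))
MPI.Finalize()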
              

This does not work. I added a line to my script to check the variable (which is now in my .bashrc) and it has the correct value, yet Julia still starts installing its own CUDA:

The MPI rank is: 1
The MPI rank is: 0
The CUDA device is: CuDevice(1), JULIA_CUDA_USE_BINARY_BUILDER is false
The CUDA device is: CuDevice(0), JULIA_CUDA_USE_BINARY_BUILDER is false
 Downloading Downloading artifact: CUDA
 artifact: CUDA

With the variable (and some more environment variables for MPI) set, CUDA artifacts are not downloaded and CUDA-aware MPI works.

Now, what happens if you do mpiexec -n 1 ...? Does it still download the artifacts? And what if you run the same without artifacts? This should help to further localize the issue…

Moreover, did you install MPI with JULIA_MPI_BINARY=system? Otherwise, I think CUDA-aware MPI cannot be supported yet.

One more thing: to make it easier to choose the right MPI library etc., you can use the Julia mpiexec wrapper:
https://juliaparallel.github.io/MPI.jl/stable/configuration/#Julia-wrapper-for-mpiexec
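
A quick way to verify from within Julia which MPI library MPI.jl ended up wrapping, and whether it reports CUDA support (a sketch; the printed values depend on your installation):

using MPI
MPI.Init()
# Name and version of the MPI library that MPI.jl wraps; with JULIA_MPI_BINARY=system
# this should be the same system MPI the C++ example was built against.
println(MPI.identify_implementation())
# Whether that MPI library advertises CUDA awareness.
println(MPI.has_cuda())
MPI.Finalize()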

Also with mpiexec -n 1 the artifact download starts, and I do not know how to suppress it. CUDA-aware MPI works perfectly in the enclosed C++ example, which still gives me hope that I can somehow fix this. I also installed MPI with JULIA_MPI_BINARY=system, so I am a little confused about where to look now.

I deleted my .julia folder to reinstall the packages from scratch. I now manage to avoid the artifact download, but the error remains the same: a segfault. I have tested both mpiexec and mpiexecjl. A single-process run (which probably skips the actual transfer in the MPI.Alltoall! call) does not crash.

That is good, the first issue - downloading of CUDA artifacts - is solved.

For the second issue - the error message - I assume that you know how to set the required variables for CUDA-aware MPI in general as you say it works with C++…

So, maybe try this in order to see if it is related to the functionality you are using in your all-to-all example:

using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
#rreq = MPI.Irecv!(recv_mesg, src,  src+32, comm)
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")

If the problem remains, set LD_PRELOAD in order to point to your libcuda.so and libcudart.so. On Piz Daint this was, e.g., done as follows: LD_PRELOAD=/usr/lib64/libcuda.so:/usr/local/cuda/lib64/libcudart.so
This was a workaround required there before Cray fixed an issue in Cray-MPICH…

rank=1, size=2, dst=0, src=0
rank=0, size=2, dst=1, src=1
[1642940044.417983] [gcn19:4129247:0]    gdr_copy_md.c:122  UCX  ERROR gdr_pin_buffer failed. length :65536 ret:22
signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/help.jl:15
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68

Now with your preloading suggestion: I did not find a libcuda.so, except in the stubs folder in the lib directory. Preloading those with export LD_PRELOAD=/sw/arch/Centos8/EB_production/2021/software/CUDA/11.3.1/lib/libcudart.so:/sw/arch/Centos8/EB_production/2021/software/CUDA/11.3.1/lib/stubs/libcuda.so gives me, on the line device!(mpiid):

CUDA error: OS call failed or operation not supported on this OS (code 304, ERROR_OPERATING_SYSTEM)
Stacktrace:
  [1] throw_api_error(
signal (15): Terminated
in expression starting at none:0
              

I have checked that, but I am not using threads and the assignment of devices to MPI tasks seems to go OK. I noticed that CUDA does not like the stub library and gives an error about it. If I only preload libcudart.so then CUDA.version() gives me v"11.6.0", which is not my CUDA version, because that is 11.3.1.

I tried it and indeed it is pointing to the artifact CUDA. I deleted the artifact and ran it again, and then, despite the correct environment variables, it started downloading the artifact again. I think I need to point Julia more explicitly to the correct, already installed CUDA, but I do not know how.

┌ Debug: Trying to use artifacts...
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:131
┌ Debug: Selecting artifacts based on driver compatibility 11.6.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:143
 Downloading artifact: CUDA
 Downloaded artifact: CUDA
┌ Debug: Using CUDA 11.6.0 from an artifact at /home/chiel/.julia/artifacts/7b09e1deca842d1e5467b6f7a8ec5a96d47ae0b4
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:168
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.6
Libraries: 
- CUBLAS: 11┌ Debug:  cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│   type: type=SOME TYPE; val=0
│   value: type=int; val=POINTER (IN HEX:0x0x14e36fa302e0)
│  Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
.┌ Debug:  cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│   type: type=SOME TYPE; val=1
│   value: type=int; val=POINTER (IN HEX:0x0x14e36f10d8b0)
│  Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
8┌ Debug:  cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│   type: type=SOME TYPE; val=2
│   value: type=int; val=POINTER (IN HEX:0x0x14e36f10d8c0)
│  Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.57.2
┌ Debug: Using CUDNN from an artifact at /home/chiel/.julia/artifacts/b0757335df76c8a6732f8261b705210afd7d2583
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:536
- CUDNN: 8.30.2 (for CUDA 11.5.0)┌ Debug: CuDNN (v8302) function cudnnGetVersion() called:
│ Time: 2022-01-23T14:29:47.448540 (0d+0h+0m+0s since start)
│ Process=2487884; Thread=2487884; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/nYggH/lib/cudnn/CUDNN.jl:134
┌ Debug: Using CUTENSOR library cutensor from an artifact at /home/chiel/.julia/artifacts/b4714d43eda0a77581c8664d279f7456a0adfb47
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:609
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
┌ Debug: Toolchain with LLVM 12.0.1, CUDA driver 11.6 and toolkit 11.6 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5 and 8.0; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5 and 7.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/compatibility.jl:210
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
┌ Debug: Toolchain with LLVM 12.0.1, CUDA driver 11.6 and toolkit 11.6 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5 and 8.0; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5 and 7.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/compatibility.jl:210
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_USE_BINARY_BUILDER: false
1 device:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
              

You typed the variable name wrong: it must be JULIA_CUDA_USE_BINARYBUILDER, not JULIA_CUDA_USE_BINARY_BUILDER. Better copy-paste long variable names :wink:
That should solve the artifact downloading issue. Then, we will see if there is still another issue…
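
As a side note, one way to take the shell environment out of the equation is to set the variable at the top of the Julia script itself, before CUDA.jl is first used. This is only a sketch, and it assumes CUDA.jl reads the variable lazily (which the download starting only at the first CuArray call suggests):

# Select the locally installed CUDA toolkit from within the script,
# before CUDA.jl initializes (note the exact spelling of the variable).
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using MPI
using CUDA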

Excellent observation, sorry for missing that! This at least brings me to an almost correct CUDA installation, but still with an error. I wonder where this CUDA driver 11.6 comes from; nvidia-smi gives me: NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4, so I do not know where the 11.6 originates from.

CUDA toolkit 11.3, local installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.6

It does give more error information though:

[1642947535.067588] [gcn19:4179357:0]    cuda_ipc_md.c:233  UCX  ERROR cuIpcGetMemHandle(&key->ph, (CUdeviceptr)addr)() failed: invalid argument
Chiil:

    I wonder where this CUDA driver 11.6 comes from; nvidia-smi gives me: NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4, so I do not know where the 11.6 originates from.

    CUDA toolkit 11.3, local installation
    NVIDIA driver 470.57.2, for CUDA 11.4
    CUDA driver 11.6

I suppose @maleadt can immediately tell how to interpret this…

A minor addition to @samo's hints: you could try setting the CUDA memory pool to none:

export JULIA_CUDA_MEMORY_POOL=none

This may help for the CUDA-aware MPI error.

For the artifact download issue, I’d make sure, starting from scratch once more, to:

  • Have the MPI and CUDA that were used to build the CUDA-aware MPI on your path (or their modules loaded)
  • Make sure to have:
    export JULIA_CUDA_MEMORY_POOL=none
    export JULIA_MPI_BINARY=system
    export JULIA_CUDA_USE_BINARYBUILDER=false
    
  • Add CUDA and MPI packages in Julia. Build MPI.jl in verbose mode to check whether correct versions are built/used:
    julia -e 'using Pkg; pkg"add CUDA"; pkg"add MPI"; Pkg.build("MPI"; verbose=true)'
    
  • Then in Julia, upon loading the MPI and CUDA modules, you can check (see the sketch after this list):
    • CUDA version: CUDA.versioninfo()
    • Whether MPI has CUDA: MPI.has_cuda()
    • Whether you are using the correct MPI implementation: MPI.identify_implementation()
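
Put together, a minimal check could look like this (a sketch; the reported versions and paths will of course depend on your system):

using MPI
using CUDA
MPI.Init()
# Which MPI library MPI.jl was built against (with JULIA_MPI_BINARY=system this should be the cluster's MPI).
println(MPI.identify_implementation())
# Whether that MPI library advertises CUDA awareness.
println(MPI.has_cuda())
# Which CUDA toolkit CUDA.jl picked up ("local installation" rather than "artifact installation").
CUDA.versioninfo()
MPI.Finalize()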