I have problems getting CUDA-aware MPI running in Julia. I have a C++ example that works perfectly if I run it with `mpiexec -n 2 ./alltoall_test`. The equivalent Julia code, run with `mpiexec -n 2 julia alltoall_test.jl`, fails with a segmentation fault:
[1642930332.032032] [gcn19:4087661:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22
I have set up MPI.jl against the system MPI and it reports the correct version, so my impression is that the problem lies on the CUDA side, which wants to download artifacts as soon as I use `CuArray`, despite CUDA being installed. My impression is that I am using a CUDA version that is incompatible with the one MPI was compiled against, but I do not know how to verify this. I used the `JULIA_CUDA_USE_BINARY_BUILDER=false` setting. The two test codes are:
using MPI
using CUDA
np = 2
MPI.Init()
comm = MPI.COMM_WORLD
mpiid = MPI.Comm_rank(comm)
print("The MPI rank is: $mpiid\n")
device!(mpiid)
print("The CUDA device is: $(device())\n")
n = 1024
data_cpu = rand(n)
data_out_cpu = similar(data_cpu)
data = CuArray(data_cpu)
data_out = similar(data)
# Test the alltoall on the CPU
mpi_data_cpu = MPI.UBuffer(data_cpu, 512)
mpi_data_out_cpu = MPI.UBuffer(data_out_cpu, 512)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
# Test the alltoall on the GPU
print("$mpiid has CUDA: $(MPI.has_cuda())\n")
mpi_data = MPI.UBuffer(data, 512)
mpi_data_out = MPI.UBuffer(data_out, 512)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
# Close the MPI.
MPI.Finalize()
#include <iostream>
#include <vector>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>
int main()
{
MPI_Init(NULL, NULL);
int n, id;
MPI_Comm_size(MPI_COMM_WORLD, &n);
MPI_Comm_rank(MPI_COMM_WORLD, &id);
const size_t size_tot = 1024*1024*1024;
const size_t size_max = size_tot / n;
// CPU TEST
std::vector<double> a_cpu_in (size_tot);
std::vector<double> a_cpu_out(size_tot);
std::fill(a_cpu_in.begin(), a_cpu_in.end(), id);
std::cout << id << ": Starting CPU all-to-all\n";
auto time_start = std::chrono::high_resolution_clock::now();
MPI_Alltoall(
a_cpu_in .data(), size_max, MPI_DOUBLE,
a_cpu_out.data(), size_max, MPI_DOUBLE,
MPI_COMM_WORLD);
auto time_end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
std::cout << id << ": Finished CPU all-to-all in " << std::to_string(duration) << " (ms)\n";
// GPU TEST
int id_local = id % 4;
cudaSetDevice(id_local);
double* a_gpu_in;
double* a_gpu_out;
cudaMalloc((void **)&a_gpu_in , size_tot * sizeof(double));
cudaMalloc((void **)&a_gpu_out, size_tot * sizeof(double));
cudaMemcpy(a_gpu_in, a_cpu_in.data(), size_tot*sizeof(double), cudaMemcpyHostToDevice);
int id_gpu;
cudaGetDevice(&id_gpu);
std::cout << id << ", " << id_local << ", " << id_gpu << ": Starting GPU all-to-all\n";
time_start = std::chrono::high_resolution_clock::now();
MPI_Alltoall(
a_gpu_in , size_max, MPI_DOUBLE,
a_gpu_out, size_max, MPI_DOUBLE,
MPI_COMM_WORLD);
time_end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
std::cout << id << ", " << id_local << ", " << id_gpu << ": Finished GPU all-to-all in " << std::to_string(duration) << " (ms)\n";
MPI_Finalize();
return 0;
}
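For reference, a sketch of how the C++ test could be compiled; the compiler wrapper name (`mpicxx`), the source file name, and the `CUDA_HOME` path are assumptions that depend on the MPI installation and the system:
```bash
# Compile the CUDA-aware MPI all-to-all test and link against the CUDA runtime.
# Assumes CUDA_HOME points at the local CUDA toolkit (e.g. /usr/local/cuda).
mpicxx -O2 -o alltoall_test alltoall_test.cpp -I${CUDA_HOME}/include -L${CUDA_HOME}/lib64 -lcudart
```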
I did that, but it does not seem to work. I already have it loaded as an environment variable, but I see this happening the moment the code reaches the first `data = CuArray(data_cpu)` line:
chiel@gcn19:/gpfs/scratch1/shared/chiel/MicroHH.jl/test$ JULIA_CUDA_USE_BINARY_BUILDER=false mpiexec -n 2 julia alltoall_test.jl
The MPI rank is: 1
The MPI rank is: 0
The CUDA device is: CuDevice(0)
The CUDA device is: CuDevice(1)
Downloading Downloading artifact: CUDA
artifact: CUDA
`JULIA_CUDA_USE_BINARY_BUILDER=false` needs to be set for each process. One way could be to put it in `~/.bashrc` or similar; another is to set it explicitly for every process, i.e. the idea is:
mpiexec -n 2 'JULIA_CUDA_USE_BINARY_BUILDER=false julia alltoall_test.jl'
Maybe you will need to put `JULIA_CUDA_USE_BINARY_BUILDER=false julia alltoall_test.jl` in a shell script and call
mpiexec -n 2 my_julia_multi_gpu_app.sh
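A minimal sketch of such a wrapper script, using the file name and the variable spelling exactly as suggested above:
```bash
#!/bin/bash
# my_julia_multi_gpu_app.sh -- set the environment for every MPI rank, then run the Julia test
# (variable name as written in the suggestion above).
export JULIA_CUDA_USE_BINARY_BUILDER=false
exec julia alltoall_test.jl "$@"
```
Make it executable with `chmod +x my_julia_multi_gpu_app.sh` before launching it through `mpiexec`.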
This does not work. I added a line to my script to check the variable (which is now set in my `.bashrc`), and it has the correct value, yet it still starts installing its own CUDA:
The MPI rank is: 1
The MPI rank is: 0
The CUDA device is: CuDevice(1), JULIA_CUDA_USE_BINARY_BUILDER is false
The CUDA device is: CuDevice(0), JULIA_CUDA_USE_BINARY_BUILDER is false
Downloading Downloading artifact: CUDA
artifact: CUDA
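For reference, the check line added to the script could look roughly like this (a sketch; the exact line is not shown in the thread, and it assumes it replaces the existing device print in alltoall_test.jl, which already has `using CUDA`):
```julia
# Hypothetical check: print the selected device together with the environment variable's value.
flag = get(ENV, "JULIA_CUDA_USE_BINARY_BUILDER", "unset")
print("The CUDA device is: $(device()), JULIA_CUDA_USE_BINARY_BUILDER is $flag\n")
```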
(and some more environment variables for MPI). With that setup, CUDA artifacts are not downloaded and CUDA-aware MPI works.
Now, what happens if you do `mpiexec -n 1 ...`? Does it still download the artifacts? And what if you run the same without artifacts? This should help to further localize the issue…
Moreover, did you install MPI with `JULIA_MPI_BINARY=system`? Otherwise I think CUDA-aware MPI cannot be supported yet.
One more thing: to make it easier to choose the right MPI library etc., you can use the Julia mpiexec wrapper:
https://juliaparallel.github.io/MPI.jl/stable/configuration/#Julia-wrapper-for-mpiexec
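For completeness, a sketch of how the system-MPI build and the wrapper installation could look (assuming the CUDA-aware MPI module is already loaded in the environment):
```bash
# Build MPI.jl against the system MPI and install the mpiexecjl wrapper into ~/.julia/bin.
export JULIA_MPI_BINARY=system
julia -e 'using Pkg; Pkg.build("MPI"; verbose=true)'
julia -e 'using MPI; MPI.install_mpiexecjl()'
```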
Also with `mpiexec -n 1` the artifact download starts; I do not know how to suppress this. CUDA-aware MPI works perfectly in the enclosed C++ example, which still gives me hope that I can somehow fix this. I also installed MPI with `JULIA_MPI_BINARY=system`, so I am a little confused about where to look now.
I deleted my `.julia` folder to reinstall the packages from scratch. I now manage to avoid the artifact download, but the error remains the same: a segfault. I have tested both `mpiexec` and `mpiexecjl`. A single-process run (which probably skips the entire transfer in the `MPI.Alltoall!` function) does not crash.
That is good; the first issue - the downloading of CUDA artifacts - is solved.
For the second issue - the error message - I assume that you know how to set the required variables for CUDA-aware MPI in general, as you say it works with C++…
So maybe try this, in order to see whether it is related to the functionality you are using in your all-to-all example:
using MPI
using CUDA
MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)
dst = mod(rank+1, size)
src = mod(rank-1, size)
println("rank=$rank, size=$size, dst=$dst, src=$src")
N = 4
send_mesg = CuArray{Float64}(undef, N)
recv_mesg = CuArray{Float64}(undef, N)
fill!(send_mesg, Float64(rank))
#rreq = MPI.Irecv!(recv_mesg, src, src+32, comm)
MPI.Sendrecv!(send_mesg, dst, 0, recv_mesg, src, 0, comm)
println("recv_mesg on proc $rank: $recv_mesg")
If the problem remains, set `LD_PRELOAD` to point to your `libcuda.so` and `libcudart.so`. On Piz Daint this was done, e.g., as follows:
LD_PRELOAD=/usr/lib64/libcuda.so:/usr/local/cuda/lib64/libcudart.so
This was a workaround that was required there before Cray fixed an issue in Cray-MPICH…
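For example, a sketch of how the preload could be combined with the launch; the library paths are the Piz Daint ones quoted above, and `sendrecv_test.jl` is a hypothetical name for the test script:
```bash
# Preload the driver and runtime libraries for every rank, then launch the Sendrecv test on 2 ranks.
export LD_PRELOAD=/usr/lib64/libcuda.so:/usr/local/cuda/lib64/libcudart.so
mpiexec -n 2 julia sendrecv_test.jl
```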
rank=1, size=2, dst=0, src=0
rank=0, size=2, dst=1, src=1
[1642940044.417983] [gcn19:4129247:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22
signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/help.jl:15
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68
Now to your preloading suggestion: I did not find a `libcuda.so`, except in the `stubs` folder inside `lib`. Preloading those with
export LD_PRELOAD=/sw/arch/Centos8/EB_production/2021/software/CUDA/11.3.1/lib/libcudart.so:/sw/arch/Centos8/EB_production/2021/software/CUDA/11.3.1/lib/stubs/libcuda.so
gives me, on the `device!(mpiid)` line:
CUDA error: OS call failed or operation not supported on this OS (code 304, ERROR_OPERATING_SYSTEM)
Stacktrace:
[1] throw_api_error(
signal (15): Terminated
in expression starting at none:0
I have checked that, but I am not using threads, and the assignment of devices to MPI tasks seems to go OK. I noticed that CUDA.jl does not like the stub library and gives an error about it. If I only preload `libcudart.so`, then `CUDA.version()` gives me v"11.6.0", which is not my CUDA version, because that is 11.3.1.
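As an aside, one way to look for the driver's `libcuda.so` (as opposed to the toolkit's stub) could be the following; this is a sketch, and paths differ per system:
```bash
# List the libcuda.so entries known to the dynamic linker; the driver library normally lives
# in a system library directory such as /usr/lib64, not in the CUDA toolkit's stubs folder.
ldconfig -p | grep libcuda.so
```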
I tried it, and indeed it is pointing to the CUDA artifact. I deleted the artifact and ran it again, and then, despite the correct environment variables, it started downloading the artifact again. I think I need to point Julia more explicitly to the correct, already installed CUDA, but I do not know how:
┌ Debug: Trying to use artifacts...
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:131
┌ Debug: Selecting artifacts based on driver compatibility 11.6.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:143
Downloading artifact: CUDA
Downloaded artifact: CUDA
┌ Debug: Using CUDA 11.6.0 from an artifact at /home/chiel/.julia/artifacts/7b09e1deca842d1e5467b6f7a8ec5a96d47ae0b4
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:168
CUDA toolkit 11.6, artifact installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.6
Libraries:
- CUBLAS: 11┌ Debug: cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│ type: type=SOME TYPE; val=0
│ value: type=int; val=POINTER (IN HEX:0x0x14e36fa302e0)
│ Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│ COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
.┌ Debug: cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│ type: type=SOME TYPE; val=1
│ value: type=int; val=POINTER (IN HEX:0x0x14e36f10d8b0)
│ Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│ COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
8┌ Debug: cuBLAS (v11.6) function cublasStatus_t cublasGetProperty(libraryPropertyType, int*) called:
│ type: type=SOME TYPE; val=2
│ value: type=int; val=POINTER (IN HEX:0x0x14e36f10d8c0)
│ Time: 2022-01-23T14:29:46 elapsed from start 0.000000 minutes or 0.000000 seconds
│ Process=2487884; Thread=22969315264320; GPU=0; Handle=POINTER (IN HEX:0x(nil))
│ COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS ~/.julia/packages/CUDA/nYggH/lib/cublas/CUBLAS.jl:220
- CURAND: 10.2.9
- CUFFT: 10.7.0
- CUSOLVER: 11.3.2
- CUSPARSE: 11.7.1
- CUPTI: 16.0.0
- NVML: 11.0.0+470.57.2
┌ Debug: Using CUDNN from an artifact at /home/chiel/.julia/artifacts/b0757335df76c8a6732f8261b705210afd7d2583
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:536
- CUDNN: 8.30.2 (for CUDA 11.5.0)┌ Debug: CuDNN (v8302) function cudnnGetVersion() called:
│ Time: 2022-01-23T14:29:47.448540 (0d+0h+0m+0s since start)
│ Process=2487884; Thread=2487884; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ CUDA.CUDNN ~/.julia/packages/CUDA/nYggH/lib/cudnn/CUDNN.jl:134
┌ Debug: Using CUTENSOR library cutensor from an artifact at /home/chiel/.julia/artifacts/b4714d43eda0a77581c8664d279f7456a0adfb47
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/bindeps.jl:609
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
┌ Debug: Toolchain with LLVM 12.0.1, CUDA driver 11.6 and toolkit 11.6 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5 and 8.0; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5 and 7.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/compatibility.jl:210
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
┌ Debug: Toolchain with LLVM 12.0.1, CUDA driver 11.6 and toolkit 11.6 supports devices 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5 and 8.0; PTX 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5 and 7.0
└ @ CUDA.Deps ~/.julia/packages/CUDA/nYggH/deps/compatibility.jl:210
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_USE_BINARY_BUILDER: false
1 device:
0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
You typed the variable name wrong: it must be `JULIA_CUDA_USE_BINARYBUILDER`, not `JULIA_CUDA_USE_BINARY_BUILDER`. Better to copy-paste long variable names.
That should solve the artifact downloading issue. Then we will see if there is still another issue…
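For clarity, the corrected setting looks like this:
```bash
# Correct spelling: no underscore between BINARY and BUILDER.
export JULIA_CUDA_USE_BINARYBUILDER=false
```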
Excellent observation, sorry for missing that! This at least brings me to an almost correct CUDA installation, but still with an error. I wonder where this CUDA driver 11.6 comes from; nvidia-smi gives me NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4, so I do not know where the 11.6 originates from.
CUDA toolkit 11.3, local installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.6
It does give more error information though:
[1642947535.067588] [gcn19:4179357:0] cuda_ipc_md.c:233 UCX ERROR cuIpcGetMemHandle(&key->ph, (CUdeviceptr)addr)() failed: invalid argument
Chiil:
I wonder where this CUDA driver 11.6 comes from; nvidia-smi gives me NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4, so I do not know where the 11.6 originates from.
CUDA toolkit 11.3, local installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.6
I suppose @maleadt can immediately tell how to interpret this…
A minor addition to @samo's hints: you could try setting the CUDA memory pool to `none`:
export JULIA_CUDA_MEMORY_POOL=none
This may help with the CUDA-aware MPI error.
For the artifact download issue, I'd make sure, starting from scratch once more, to:
1. Have the MPI and CUDA that were used to build the CUDA-aware MPI on the path (or their modules loaded).
2. Make sure to have:
export JULIA_CUDA_MEMORY_POOL=none
export JULIA_MPI_BINARY=system
export JULIA_CUDA_USE_BINARYBUILDER=false
3. Add the CUDA and MPI packages in Julia, and build MPI.jl in verbose mode to check whether the correct versions are built/used:
julia -e 'using Pkg; pkg"add CUDA"; pkg"add MPI"; Pkg.build("MPI"; verbose=true)'
Then in Julia, upon loading the MPI and CUDA modules, you can check:
- the CUDA version: `CUDA.versioninfo()`
- whether MPI has CUDA support: `MPI.has_cuda()`
- whether you are using the correct MPI implementation: `MPI.identify_implementation()`
A combined sketch of these checks is shown below.
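For convenience, a minimal script (a sketch; `check_setup.jl` is a hypothetical file name) that runs all three checks on every rank:
```julia
# check_setup.jl -- print the MPI/CUDA configuration seen by each MPI rank.
using MPI
using CUDA

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
println("rank $rank: ", MPI.identify_implementation())  # which MPI library MPI.jl was built against
println("rank $rank: CUDA-aware MPI: ", MPI.has_cuda()) # whether that MPI reports CUDA support
rank == 0 && CUDA.versioninfo()                         # which CUDA toolkit/driver CUDA.jl uses
MPI.Finalize()
```
Run it, e.g., with `mpiexecjl -n 2 julia check_setup.jl`.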