@stamoor However, when I run LAMMPS with Kokkos for a system of around 500K atoms (only short-range interaction forces), I get very low performance, almost half of what I get when running with the GPU package. I also get the following warning at the beginning of my simulations:
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
What type of GPU and CPU?
What is your run command?
What pair and fix styles are you using in the input?
Are these styles Kokkos-enabled?
CPU: Intel Ice Lake (Intel® Xeon® Platinum 8358) CPUs; GPU: NVIDIA Ampere A100 GPUs with NVLink interconnect.
This is my run command:
mpirun -n $SLURM_NPROCS /home/madibi/mylammps/build_mpi_kokkos/lmp -k on g 1 -sf kk -pk kokkos newton on neigh half comm device -in mw_hydrate_Teql.in
I run with different numbers of processes, from 1 to 64 (the maximum number of CPU cores available on the node).
Pair style: sw. Fixes: I use fix nph and fix langevin simultaneously.
As far as I know, all of these pair styles and fixes are Kokkos-enabled.
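For reference, these styles appear in my input along these general lines (the potential file and all numerical values below are placeholders, not my actual settings):
pair_style      sw
pair_coeff      * * mW.sw mW                               # placeholder potential file / element
fix             baro all nph iso 1.0 1.0 1000.0            # barostat only (placeholder P, Pdamp)
fix             lang all langevin 300.0 300.0 100.0 48279  # thermostat (placeholder T, damp, seed)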
This looks like an ideal case for Kokkos. I would expect to see the Kokkos package as fast or faster than the GPU package.
How many GPUs?
You will probably only want to run a single MPI rank per GPU and leave most of the CPU cores idle. This is because everything is running on the GPU, and there isn’t a good way for the CPU to help out (without slowing the GPU down).
Currently you are only running a single GPU (-k on g 1); be sure to use -k on g 4 for 4 GPUs per node, -k on g 6 for 6 GPUs per node, etc.
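For example, on one node with 4 GPUs, a job along these lines (the Slurm options are illustrative, adjust to your cluster) gives one MPI rank per GPU:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
mpirun -np 4 ./lmp -k on g 4 -sf kk -pk kokkos newton on neigh half comm device -in mw_hydrate_Teql.in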
Also be sure you are using GPU-aware MPI when running on multiple GPUs. If there is a problem with your MPI library, LAMMPS will disable GPU-aware MPI automatically and give a warning, and that fallback can really slow down performance.
Thanks for the reply.
I have 4 GPUs available per node, but I can run simulations on up to 8 nodes, so I can use 32 GPUs in total. There are around 0.5 million atoms in the simulation box. I have another system with around 2 million atoms, but the runtime is still higher for that system as well, compared to running with the GPU package. I also used -k on g 4, but the performance is still low compared to the GPU package with the same number of GPUs.
How can I be sure that I am using GPU-aware MPI? I followed the instructions on the LAMMPS webpage to compile LAMMPS with KOKKOS.
Can you try running this Stillinger-Weber benchmark: lammps/in.intel.sw at 2b8d6fc4d93f6c7dcce870ed134e6d3ac45f47b7 · lammps/lammps · GitHub
Here is what I get on a single A100 with NVLink; the only difference is the CPU (2.8 GHz AMD EPYC Milan 7543P).
Loop time of 11.705 on 1 procs for 6200 steps with 512000 atoms
Performance: 45.765 ns/day, 0.524 hours/ns, 529.689 timesteps/s, 271.201 Matom-step/s
Or I could try running your input if you can share it.
Here is what I get for 2 and 4 GPUs for the same 512k atom SW problem:
Loop time of 7.8263 on 2 procs for 6200 steps with 512000 atoms
Performance: 68.446 ns/day, 0.351 hours/ns, 792.201 timesteps/s, 405.607 Matom-step/s
Loop time of 5.38935 on 4 procs for 6200 steps with 512000 atoms
Performance: 99.396 ns/day, 0.241 hours/ns, 1150.418 timesteps/s, 589.014 Matom-step/s
For GPU-aware MPI, you can force it with -pk kokkos newton on neigh half gpu/aware on; in this case it will segfault if there is a problem, instead of automatically disabling GPU-aware MPI.
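For example, with your build and the benchmark input, that would look something like:
mpirun -np 4 /home/madibi/mylammps/build_mpi_kokkos/lmp -k on g 4 -sf kk -pk kokkos newton on neigh half gpu/aware on comm device -in in.intel.sw
If your MPI is Open MPI, another quick check of the build is ompi_info --parsable --all | grep mpi_built_with_cuda_support:value, which should report true.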
Running with KOKKOS, I got this for the Stillinger-Weber benchmark:
Loop time of 11.9299 on 1 procs for 6200 steps with 512000 atoms
Performance: 44.902 ns/day, 0.534 hours/ns, 519.704 timesteps/s
I used 1 CPU and 1 GPU.
and I got the following performance when running with the GPU package (64 CPU cores, 1 GPU):
Loop time of 109.331 on 64 procs for 6200 steps with 512000 atoms
Performance: 4.900 ns/day, 4.898 hours/ns, 56.708 timesteps/s
So it seems that my Kokkos-enabled LAMMPS produces essentially the same results as yours for the benchmark. The pair style in the benchmark is the same as in my own simulation file. Do you think my simulation performance is lower with KOKKOS because of fix nph and fix langevin?
For the GPU package, 64 CPU cores is probably too many for a single GPU unless you have a really huge number of atoms, since you are basically strong-scaling the kernels you offload to the GPU, which leads to overhead. You also want to make sure you enable the CUDA MPS (Multi-Process Service) daemon when using more than 1 MPI rank per GPU (see Multi-Process Service :: GPU Deployment and Management Documentation); this can really help performance.
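A minimal way to turn MPS on and off around a run (assuming you are allowed to start the daemon on the compute node; some clusters provide their own wrapper for this) is:
nvidia-cuda-mps-control -d            # start the MPS control daemon
mpirun -np 8 ./lmp ...                # run with multiple ranks sharing the GPU
echo quit | nvidia-cuda-mps-control   # shut the daemon down afterwards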
For Kokkos, both fix nph and fix langevin are ported to Kokkos and should be running on the GPU, so I wouldn’t expect them to significantly slow down the simulation. So I’m not sure why it is slower; I would need to do some profiling. Is the input something you can share? Or I could give you instructions on how to get a quick profile.
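(If you want to try yourself first: one quick option, assuming Nsight Systems is installed on your cluster, is to wrap a short single-GPU run with something like nsys profile -o lmp_profile --stats=true ./lmp -in in.intel.sw -k on g 1 -sf kk -pk kokkos newton on neigh half and see where the GPU time goes.)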
However, when I run the benchmark with KOKKOS on 4 GPUs, I get the same or lower performance:
Loop time of 11.9368 on 1 procs for 6200 steps with 512000 atoms
Performance: 44.876 ns/day, 0.535 hours/ns, 519.401 timesteps/s
That is still running on a single GPU (1 procs). You need to change both the number of MPI ranks and the number of GPUs in the Kokkos args (e.g. -k on g 4 -sf kk).
For example: mpiexec -np 4 ./lmp -in in.intel.sw -k on g 4 -sf kk -pk kokkos newton on neigh half
I did, and the performance is even lower!
Loop time of 49.4631 on 4 procs for 6200 steps with 512000 atoms
Performance: 10.830 ns/day, 2.216 hours/ns, 125.346 timesteps/s
This is my command line for the above run:
mpirun -n $SLURM_NPROCS /home/madibi/mylammps/build_mpi_kokkos/lmp -k on g 4 -sf kk -pk kokkos newton on neigh half comm device -in in.intel.sw