On the western side of Oxfordshire are the Cotswolds, a pleasant hill range with a curious etymology: the hills of the goddess Cuda (maybe, see footnote). Cuda is a powerful yet wrathful goddess, and staying on her good side does feel like druidry. The first druidic test is getting software to work: the wild magic makes the rules of this test change continually. Therefore, I am writing a summary of what works as of late 2023.

When searching online documentation for Cuda-enabled software, check the publication date. It is a rapidly changing field.
ChatGPT gives outdated information.
The top example of this is pip install tensorflow-gpu , which is already included with pip install tensorflow>=2.12 . In fact, the major change is a recent trend to include everything in the pip installer —CUDA toolkit included.
Check what the newest official documentation says first, then when stuff is failing, check newer blogs, then older blogs.
The Nvidia documentation can be tricky to navigate. Say I am “interested” in learning more about lifecycle cadence. Google gives me https://docs.nvidia.com/datacenter/tesla/drivers/#lifecycle . I notice the Tesla URL and try ampere or hopper: 404s. I look at the top right for the date: June 2023. So it is not an abandoned page, just a misleading URL to confuse novice druids.

Drivers

This blog post will be discussing software using the Cuda toolkit, not driver installation. But a few maxims need airing.
As with everything in software, do not install the latest version; install the version that will have the longest lifetime.

Parenthetically, always double check compatibilities when installing drivers/Cuda (the high druid council of Cuda calls this a “support matrix”). Ignoring the complication of the video driver, an Nvidia GPU requires a driver, which has a three-digit release number, e.g. R520, and the Cuda platform, which has a version and subversion (currently 12.3), is paired with the driver, e.g. R520 goes with Cuda 11.8. Traditionally there was a ±3 release/subversion wiggle room, and this lifecycle cadence gets quoted dogmatically. As far as I can tell this is no longer really true, and it is more fruitful to always double check.
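
To see which driver release is actually running (the driver version reported by nvidia-smi maps onto the release branch, e.g. 520.61.05 is R520), here is a minimal sketch, assuming nvidia-smi is on the path:

# query the installed driver version; the integer part gives the release branch
import subprocess
out = subprocess.run(['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
                     capture_output=True, text=True)
print(out.stdout.strip())  # e.g. 520.61.05 -> R520: check the pair against the support matrix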

About the lifetime: a driver Release has a Branch Designation. These can be Long Term Support Branch, New Feature Branch, Evaluation/Developer Branch and Production Branch. Do check what your chosen branch designation is.

This sounds obvious, but it does happen: unfortunately for me, on a cluster I use, the sys admin had to use an image created by a central sys admin who had chosen a production branch, which was later reclassed as a development branch, meaning Cuda issues galore. I am hoping this will go away when CentOS 7 is buried. Hoping.

Cuda Compat

At the end of 2022, Nvidia introduced Cuda 12.0 and with it the cuda-compat package. This has its own support matrix: https://docs.nvidia.com/deploy/cuda-compatibility/#use-the-right-compat-package .
This package ( conda install conda-forge::cuda-compat ) might be the trick you need.
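
To check whether a compat libcuda is actually visible, a sketch (the locations below are common guesses, not guaranteed):

import glob, os
# typical system location, plus a recursive look inside the conda prefix
for pattern in ('/usr/local/cuda/compat/libcuda.so*',
                os.environ.get('CONDA_PREFIX', '.') + '/**/libcuda.so*'):
    print(pattern, '->', glob.glob(pattern, recursive=True))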

Clean-up in aisle five

Often when you install something you get an incompatible-version issue. This is generally caused by conda/pip packages that were made lazily and simply use a frozen dependency list: numpy, OpenSSL and six are three common packages that end up yoyoing in versions and causing damage. In the case of OpenSSL one has to rm -rf the package, as it will have broken pip. When six, the Python 2-to-3 compatibility module, is involved, the offending code is properly ancient. Conda has a handy --freeze-installed flag to stop this; pip does not. With Cuda-dependent installations, some developers are lazy, or want to make your life easier, by installing the Cuda toolkit (and CuDNN) for you. This will undoubtedly cause issues.
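
A minimal sketch to catch the yoyoing: snapshot the usual suspects before an install and rerun afterwards to spot silent downgrades.

from importlib import metadata
# the three repeat offenders named above
for pkg in ('numpy', 'six', 'pyopenssl'):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, 'not installed')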

For example, say you conda installed your ideal version of the Cuda toolkit, and then a conda package for a piece of software you want to use. PyTorch then tells you that you have incompatible driver and CuDNN versions…

RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (8, 4, 4) but found runtime version (8, 0, 5). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. One possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

The $LD_LIBRARY_PATH option actually works with PyTorch and OpenMM, but not with tensorflow or jax, unless you import pytorch first and then tensorflow —I will admit I have used this horrible trick a few times.

The first step is to establish what is going wrong. OpenMM works if PyTorch works, so I have used the latter as my verbose proxy.

# What cudatoolkit is installed?
conda list cudatoolkit
# Nvidia version?
conda list cuda-toolkit
# CuDNN
conda list cudnn
# Where are the libdevice files?
find $CONDA_PREFIX -name "libdevice.*"

The latter, libdevice, is a key Cuda toolkit library, which basically allows bytecode to be compiled for the GPU: there should really be only one copy, as the CuDNN etc. issues are due to different versions coexisting.
Here is what I got from a nasty installation:

$CONDA_PREFIX/lib/python3.10/site-packages/triton/third_party/cuda/lib/libdevice.10.bc
$CONDA_PREFIX/lib/python3.10/site-packages/jaxlib/cuda/nvvm/libdevice/libdevice.10.bc
$CONDA_PREFIX/lib/libdevice.10.bc
$CONDA_PREFIX/nvvm/libdevice/libdevice.10.bc

Triton is what PyTorch uses, and the last two are installed by Conda and by Pip respectively.

There are three ways of dealing with this:

Simply take note of the libdevice versions and remove the relevant parent folder. Don’t just rename the folder (say to nvvm-bk ), as that will not fool anyone.

In the above example, the folder nvvm appears twice: NVVM (Nvidia Virtual Machine) is a bridge of sorts to the GPU, and in the nvvm folder there will likely be an nvvm/bin/cicc , which is an internal low-level compiler —not to be confused with the high-level (=normal usage) one, nvcc , which is commonly used to see the installed version of the Cuda toolkit via nvcc --version (the apt/dnf installation will be in /usr/local/cuda/bin/nvcc ). The other file of note in the folder, nvvm/lib64/libdevice.so, is more low-level still and I do not believe it has caused me incompatibility issues… but to be safe the whole repeated folder can go.
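
To see which nvcc, if any, is on the path and which toolkit version it claims, a quick sketch:

import shutil, subprocess
# fall back to the usual apt/dnf location if nothing is on the path
nvcc = shutil.which('nvcc') or '/usr/local/cuda/bin/nvcc'
try:
    print(subprocess.run([nvcc, '--version'], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print('no nvcc found at', nvcc)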

Environment variables

The above is a bit brutal, as one cannot go back. Alternatively, the $LD_LIBRARY_PATH environment variable most likely has, or ought to have, the conda lib and site-packages folders as colon-separated paths, as set in the snippet below.

Note that setting LD_LIBRARY_PATH after a module has been imported will not do anything, as the library is already loaded. So if this is done in, say, a Jupyter notebook, do it first (also, a Jupyter shell escape ( !export FOO=foo ) is not kept, as it runs in a separate process, hence the tedious os.environ ).

import os, sys
os.environ['LD_LIBRARY_PATH'] = ':'.join([os.environ['CONDA_PREFIX'] + f'/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages', os.environ['CONDA_PREFIX']+'/lib'])
import torch
assert torch.cuda.is_available()
print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
print(f'Using CuDNN: {torch.backends.cudnn.enabled} ({torch.backends.cudnn.version()})')
device = torch.device("cuda")
# Create a random tensor and transfer it to the GPU
x = torch.rand(5, 3).to(device)
print("A random tensor:", x)
y = x * x
print("After calculation:", y)
print("Calculated on:", y.device)

For tensorflow this environment variable may be ignored, so also set $CUDA_HOME or add to $XLA_FLAGS an entry --xla_gpu_cuda_data_dir=👾👾👾 .
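
A sketch of setting these before the import, assuming as in the block above that the toolkit lives under $CONDA_PREFIX (adjust to your installation):

import os
prefix = os.environ['CONDA_PREFIX']
os.environ['CUDA_HOME'] = prefix
# append rather than overwrite any pre-existing XLA flags
os.environ['XLA_FLAGS'] = (os.environ.get('XLA_FLAGS', '') +
                           f' --xla_gpu_cuda_data_dir={prefix}').strip()
# only now import tensorflow

Then the test: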

import os, sys
os.environ['LD_LIBRARY_PATH'] = ':'.join([os.environ['CONDA_PREFIX'] + f'/lib/python{sys.version_info.major}.{sys.version_info.minor}/site-packages', os.environ['CONDA_PREFIX']+'/lib'])
import tensorflow as tf
print(sys.version_info)
print(tf.config.list_physical_devices('GPU'))
print('CUDA build:', tf.test.is_built_with_cuda())
print("CUDA version:", tf.sysconfig.get_build_info()["cuda_version"])
print("cuDNN version:", tf.sysconfig.get_build_info()["cudnn_version"])
print("CUDA library paths:", tf.sysconfig.get_lib())
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
c = tf.matmul(a, b)
print(c)

For Jax it can be $CUDA_PATH.
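
Again a sketch, assuming the conda prefix holds the toolkit; set it before any jax import:

import os
os.environ['CUDA_PATH'] = os.environ['CONDA_PREFIX']

Then the test: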

from jax.lib import xla_bridge
assert xla_bridge.get_backend().platform != 'cpu'
import jax.numpy as jnp
from jax import random
key = random.PRNGKey(0)
x = random.normal(key, (5000, 5000), dtype=jnp.float32)
print( jnp.dot(x, x.T) )

As said, if torch is happy, so is OpenMM. To test the latter:

import openmm as mm
plat = mm.Platform.getPlatformByName('CUDA')
print( plat.getOpenMMVersion() )
# this will fail with no failures:
assert plat.supportsKernels('CUDA'), f'failures: {plat.getPluginLoadFailures()}'
# ditto:
mm.Platform.findPlatform('CUDA') # No Platform supports all the requested kernels
# however... this will show Reference, CPU and CUDA as valid?!
print([mm.Platform.getPlatform(index).getName() for index in range(mm.Platform.getNumPlatforms())])
# everything is fine:
print(mm.version.openmm_library_path)
print(mm.pluginLoadedLibNames)
# so to test, one has to make an OpenMM Simulation object
# (modeller, system and integrator come from your own setup)
from openmm.app import Simulation
simulation = Simulation(modeller.topology, system, integrator)
platform: mm.Platform = simulation.context.getPlatform()
assert platform.getName() == 'CUDA', platform.getPluginLoadFailures()
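
For a check that does not need a prepared topology, here is a minimal sketch with a throwaway single-particle system:

import openmm as mm
# one free particle is enough to force platform selection
system = mm.System()
system.addParticle(1.0)  # mass in daltons
integrator = mm.VerletIntegrator(0.001)  # timestep in ps
context = mm.Context(system, integrator, mm.Platform.getPlatformByName('CUDA'))
print('Running on:', context.getPlatform().getName())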

Once a solution is found, make sure to store the environment variable for the next time the conda environment is activated, via conda env config vars set LD_LIBRARY_PATH=👾👾👾

Arduous way

As mentioned, the issue is the dependencies of certain packages. These can be inspected in the terminal via conda search 👾👾👾 --info , or online at https://anaconda.org/ by searching for the name, picking a build by paying attention to the version, then going to files and the green ℹ️ symbol. To install a package without dependencies use --no-deps .
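
On the pip side, a sketch of the same inspection via importlib (the package name is just an example):

from importlib import metadata
# declared dependencies: look for pinned nvidia/cuda/cudnn wheels
for requirement in metadata.requires('torch') or []:
    print(requirement)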

There is an environment variable for conda, $CONDA_OVERRIDE_CUDA, which can be used to specify a version, e.g. 11.8. This will have no effect on, say, jax installed by pip.
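
The override can also be set for a single command; a sketch, assuming conda is on the path (the package is an example):

import os, subprocess
env = dict(os.environ, CONDA_OVERRIDE_CUDA='11.8')
subprocess.run(['conda', 'install', '-y', 'openmm'], env=env, check=True)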

Singularity

A different problem entirely is Singularity. There is a flag --nv which adds the relevant virtual folders. This will however not add /usr/bin/nvidia-smi : you can bind this or copy the binary over into the container, assuming the kernel drivers are recent enough. The files in /usr/local/cuda are from the Cuda toolkit, so it is fine if that folder is missing, as Conda can add them.
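
A quick sketch to run inside the container, to see what --nv actually provided:

import os, shutil
print('nvidia-smi on PATH:', shutil.which('nvidia-smi'))
libs = '/.singularity.d/libs'
print(libs, '->', os.listdir(libs) if os.path.isdir(libs) else 'missing')
print('/usr/local/cuda present:', os.path.isdir('/usr/local/cuda'))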

Walkthrough

Say I want to install a package which has a conda yaml provided —yay, how nice!

# common fluff
unset LD_LIBRARY_PATH
unset CUDA_HOME
unset CUDA_DIR
unset XLA_FLAGS
# install
conda env create -f 👾👾.yml
conda activate 👾👾
python 👾👾.py

Worst-case scenario: it fails to install because of the python version (hello PyMOL…).

unset LD_LIBRARY_PATH
unset CUDA_HOME
unset CUDA_DIR
unset XLA_FLAGS
export CONDA_PREFIX
conda create --name RFdiffusion -y python=3.👾
conda activate RFdiffusion
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX:👾👾
# 👾👾 = either /.singularity.d/libs or /usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
conda env config vars set PYTHONUSERBASE=$CONDA_PREFIX  # this stops pip from installing outside the PATH (a yellow warning you sometimes get)
conda deactivate
conda activate 👾👾
conda env update --name 👾👾 --file 👾👾

Say it installed fine but pytorch is unhappy.

find $CONDA_PREFIX -name "libdevice.*"
find $CONDA_PREFIX -name "libnvidia*"
conda list cudatoolkit
conda list cuda-toolkit
conda list cudnn

Say there are multiple versions of libdevice. Anecdotally, the lib/nvidia version is the problem, so I delete the whole folder. Same with Jax’s nvidia folder.
The latter conda list commands will give version numbers; the guilty parties will be obvious.
However, conda uninstallation does not accept --no-deps , so, say for OpenMM, I reinstall the Cuda toolkit at the correct version, one that does not glitch with my drivers or with CuDNN etc. Generally, just to add to the mix, I run:

export CONDA_CHANNELS="nvidia/label/cuda-11.8.0"
conda install -y nvidia/label/cuda-11.8.0::cuda
conda install -y nvidia/label/cuda-11.8.0::cuda-toolkit
conda install -y nvidia/label/cuda-11.8.0::cuda-nvrtc 
conda install -y nvidia/label/cuda-11.8.0::libcufile 
conda install -y nvidia/label/cuda-11.8.0::cuda-tools
# these will be in `$CONDA_PREFIX/lib` so make sure this is set:
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX:👾👾
conda deactivate
conda activate 👾👾

A lot of those are redundant, but you might prefer a different version (e.g. 11.6.2 is nice).

Conclusion

I am sorry you are having Cuda issues. I hope this helped.

Footnote

A ‘wold’ is a forested hill (cf. ‘Wald’ in German, forest), while Cot’s is a name and might come from Cuda, a Brittonic mother goddess, equivalent to Danu in Gaelic mythology, the other branch of Celtic culture.