Could not load library libcudnn_adv_train.so.8 error on lambda workstation - Technical Help

Hello.

Every time I try to fit an LSTM in Tensorflow on our lambda workstation, the python kernel dies with the following error message:

Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!

The complete error stack is at the bottom of this message.

This problem appears to be similar to this thread from a year ago that indicates that lambda stack does not (yet?) support libcudnn v8.

To fix this, I’ve tried invoking the update command below from the Lambda Stack webpage:

sudo apt-get update && sudo apt-get dist-upgrade

This does not fix the problem.

Is this a known problem with lambda stack? I’m running a preinstalled lambda stack on a lambda workstation and am not aware of having taken any other action that would update or desync tensorflow and cuDNN.

Thank you in advance for any help you might be able to provide.

Invalid MIT-MAGIC-COOKIE-1 key
2021-11-12 10:29:13.341802: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-12 10:29:14.026866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22357 MB memory: → device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:67:00.0, compute capability: 8.6
2021-11-12 10:29:14.027819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21993 MB memory: → device: 1, name: NVIDIA RTX A5000, pci bus id: 0000:68:00.0, compute capability: 8.6
2021-11-12 10:29:15.356581: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-12 10:29:16.144405: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
[lambda-dual:06296] *** Process received signal ***
[lambda-dual:06296] Signal: Aborted (6)
[lambda-dual:06296] Signal code: (-6)
[lambda-dual:06296] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1665725210]
[lambda-dual:06296] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f166572518b]
[lambda-dual:06296] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1665704859]
[lambda-dual:06296] [ 3] /usr/lib/python3/dist-packages/tensorflow/python/…/libcudnn.so.8(cudnnRNNForwardTraining+0x230)[0x7f15a0d96480]
[lambda-dual:06296] [ 4] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(cudnnRNNForwardTraining+0x8c)[0x7f15a5fb48cc]
[lambda-dual:06296] [ 5] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport16DoRnnForwardImplIfEEN10tensorflow6StatusEPNS_6StreamERKNS0_18CudnnRnnDescriptorERKNS0_32CudnnRnnSequenceTensorDescriptorERKNS_12DeviceMemoryIT_EERKNSD_IiEERKNS0_29CudnnRnnStateTensorDescriptorESH_SN_SH_SH_SC_PSF_SN_SO_SN_SO_bPNS_16ScratchAllocatorESQ_PNS_3dnn13ProfileResultE+0x1080)[0x7f15a5f88750]
[lambda-dual:06296] [ 6] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport12DoRnnForwardEPNS_6StreamERKNS_3dnn13RnnDescriptorERKNS4_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNSB_IiEERKNS4_24RnnStateTensorDescriptorESE_SK_SE_SE_SA_PSC_SK_SL_SK_SL_bPNS_16ScratchAllocatorESN_PNS4_13ProfileResultE+0x65)[0x7f15a5f88ec5]
[lambda-dual:06296] [ 7] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN15stream_executor6Stream14ThenRnnForwardERKNS_3dnn13RnnDescriptorERKNS1_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNS8_IiEERKNS1_24RnnStateTensorDescriptorESB_SH_SB_SB_S7_PS9_SH_SI_SH_SI_bPNS_16ScratchAllocatorESK_PNS1_13ProfileResultE+0x93)[0x7f15c1b1d483]
[lambda-dual:06296] [ 8] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(+0xa83f526)[0x7f15babf4526]
[lambda-dual:06296] [ 9] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE25ComputeAndReturnAlgorithmEPNS_15OpKernelContextEPN15stream_executor3dnn15AlgorithmConfigEbbi+0x5f1)[0x7f15babfd631]
[lambda-dual:06296] [10] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x59)[0x7f15bac07959]
[lambda-dual:06296] [11] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x24e)[0x7f15a563b4de]
[lambda-dual:06296] [12] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0x976e48)[0x7f15a5731e48]
[lambda-dual:06296] [13] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x2a5)[0x7f15b4b7aed5]
[lambda-dual:06296] [14] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x47)[0x7f15b4b77eb7]
[lambda-dual:06296] [15] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0xe236ef)[0x7f15a5bde6ef]
[lambda-dual:06296] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f16656c5609]
[lambda-dual:06296] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f1665801293]
[lambda-dual:06296] *** End of error message ***

After trying several different possible fixes, I did a clean reinstall of Ubuntu 20.04 and Lambda Stack. The problem persisted with the same error message.

I then did a second clean reinstall of Ubuntu 20.04 and manually installed graphics drivers (495.29.05), CUDA (11.5), cuDNN (8.3.1.22), and TensorFlow (2.7.0). My LSTM test script now works correctly.

I suspect that this is a problem with Lambda Stack and have forwarded my test script on to Lambda Labs technical support in case they’d like to try to reproduce the problem.

This is normally an issue with Anaconda or python venv/virtualenv (without --system-site-packages).

The issue is that Anaconda removes all the normal system paths, while tensorflow/pytorch are built against cuDNN (CUDA, etc.), so they do not find the libraries. The libraries are also included in the python path (which is likely part of the issue generally), under /usr/lib/python3/dist-packages:
/usr/lib/python3/dist-packages/tensorflow/libcudnn.so.8
/usr/lib/python3/dist-packages/torch/lib/libcudnn.so.8
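
To check whether a given directory really holds everything cuDNN 8 needs, a small sketch like the following may help. It assumes the cuDNN 8 layout, where libcudnn.so.8 loads six component libraries (ops, cnn and adv, each in an infer and a train variant), and it assumes the two directories shown in this thread. `missing_cudnn_parts` is a hypothetical helper name, not something shipped with Lambda Stack or TensorFlow.

```shell
# Print the cuDNN 8 component libraries that are missing from directory $1.
missing_cudnn_parts() {
    dir=$1
    for part in ops_infer ops_train cnn_infer cnn_train adv_infer adv_train; do
        # Matches libcudnn_<part>.so.8 as well as fully versioned names
        # such as libcudnn_<part>.so.8.2.1.
        set -- "$dir"/libcudnn_"$part".so.8*
        [ -e "$1" ] || echo "libcudnn_${part}.so.8"
    done
}

# Directories taken from this thread; adjust them for your own install.
for d in /usr/lib/python3/dist-packages/tensorflow /usr/lib/x86_64-linux-gnu; do
    echo "== $d"
    missing_cudnn_parts "$d"
done
```

If a directory prints no names under its header, all six components are present there. A directory that does not exist reports all six as missing.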

NVIDIA's default install also goes to a non-standard location under /usr/local (which is meant for local site software).
Standard location for 3rd party software is: /opt///

Installing NVIDIA's own packages can work, or the rpm for cuDNN can be installed to work around this when using Anaconda or venv/virtualenv. With venv/virtualenv you can also use --system-site-packages.
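
For the venv case, the --system-site-packages option can be sketched as below. This is only an illustration: `make_system_venv` is a hypothetical wrapper name and the example path in the comments is made up. The only real interface used is `python3 -m venv` with its `--system-site-packages` flag.

```shell
# Create a virtual environment that can still see the system-wide packages
# (for example a TensorFlow installed under /usr/lib/python3/dist-packages).
# Any extra arguments are passed straight through to `python3 -m venv`.
make_system_venv() {
    python3 -m venv --system-site-packages "$@"
}

# Example usage (hypothetical path):
#   make_system_venv "$HOME/envs/tf"
#   . "$HOME/envs/tf/bin/activate"
#   python3 -c 'import tensorflow as tf; print(tf.__version__)'
```

The choice is recorded in the environment's pyvenv.cfg as `include-system-site-packages = true`, which is a quick way to confirm how an existing environment was created.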

Thank you, but I wasn’t using Anaconda or a python virtual environment when I experienced this problem. In order to minimize confounding factors, I did a clean install of Ubuntu, a clean install of Lambda Stack, and ran an LSTM test script directly from the command line.

Since the same test script runs fine from the command line after I did a second clean install of Ubuntu and manually installed TensorFlow, I suspect the problem lies elsewhere.

FWIW, I’ve subsequently installed miniconda on the workstation, and TensorFlow works just fine within a conda environment, returning all the expected messages about how it’s finding the GPUs.

If anyone else is running into this problem: I was able to solve it by copying the libraries to the folder where tensorflow expects them to be, like so:

sudo cp /usr/lib/python3/dist-packages/tensorflow/libcudnn* /usr/lib/x86_64-linux-gnu/

After that cudnn works for me without doing any updates whatsoever.
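
Before running the copy, it may be worth previewing which bundled files are actually absent from the system directory, so that an existing system cuDNN is not overwritten by accident. A minimal sketch, assuming the same two directories as the cp command above; `cudnn_files_to_copy` is a hypothetical helper name.

```shell
# List the libcudnn* files in source directory $1 that do not yet exist
# (under the same file name) in destination directory $2.
cudnn_files_to_copy() {
    src=$1
    dst=$2
    for f in "$src"/libcudnn*; do
        [ -e "$f" ] || continue                      # glob matched nothing
        [ -e "$dst/$(basename "$f")" ] || echo "$f"  # only report absent files
    done
}

# Preview only: nothing is copied here.
cudnn_files_to_copy /usr/lib/python3/dist-packages/tensorflow /usr/lib/x86_64-linux-gnu
```

Each printed path is a file that the cp command would newly add. Files that already exist in the destination are not listed.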

Yes, that can work also, but it is incomplete.

What that does is copy the cuDNN libraries into a directory on the system-wide library search path.
That will help if you are not using the standard python, which knows where to find them for the specific build of pytorch or tensorflow.

If you are using Anaconda, it installs its own python, pytorch, tensorflow, cuda-toolkit and cudnn. However, Anaconda does not point LD_LIBRARY_PATH to the new location, so you have to set it yourself:
$ export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
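
One detail with that export: if LD_LIBRARY_PATH was empty beforehand, the result ends in a trailing colon, and the dynamic loader treats an empty entry as the current directory. A slightly more defensive sketch is shown below. `prepend_conda_lib` is a hypothetical helper name, and it assumes a conda environment is active so that CONDA_PREFIX is set.

```shell
# Prepend the active conda environment's lib directory to LD_LIBRARY_PATH
# without leaving an empty entry when the variable was previously unset.
prepend_conda_lib() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "CONDA_PREFIX is not set; activate a conda environment first" >&2
        return 1
    fi
    # ${VAR:+:${VAR}} expands to ":<old value>" only if VAR is non-empty.
    export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}
```

Calling `prepend_conda_lib` after `conda activate` has the same effect as the export above when LD_LIBRARY_PATH already has a value.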

And if you are using python venv or virtualenv - I normally install the appropriate cuDNN manually for those virtual environments for the given build I am using.

And docker would have that in the image if it had tensorflow/cuDNN.