Hello,
I am trying to create a Kubernetes cluster on AGX Xavier Jetsons running JetPack 4.6.1. I created the cluster successfully, and then, to expose the GPUs to the service pods, I followed NVIDIA's README:
https://github.com/NVIDIA/k8s-device-plugin?fbclid=IwAR1_LG86MIM4P-KbGsZ5kkIbwHchpKh9HX6P47pI-rbOmhk6TA3iVQ6Jeac
and tried to add NVIDIA's Kubernetes device plugin. The plugin could not start and failed with the following error:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 19s kubelet Pod sandbox changed, it will be killed and re-created.
Normal Killing 19s kubelet Stopping container nvidia-device-plugin-ctr
Normal Pulled 16s kubelet Container image "nvcr.io/nvidia/k8s-device-plugin:v0.14.0" already present on machine
Normal Created 15s kubelet Created container nvidia-device-plugin-ctr
Warning Failed 10s kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: src: /etc/vulkan/icd.d/nvidia_icd.json, src_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/vulkan/icd.d/nvidia_icd.json, dst_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json
src: /usr/lib/aarch64-linux-gnu/libcuda.so, src_lnk: tegra/libcuda.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcuda.so, dst_lnk: tegra/libcuda.so
src: /usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, src_lnk: tegra/libdrm.so.2, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libdrm_nvdc.so, dst_lnk: tegra/libdrm.so.2
src: /usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, src_lnk: tegra/libnvv4l2.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4l2.so.0.0.999999, dst_lnk: tegra/libnvv4l2.so
src: /usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, src_lnk: tegra/libnvv4lconvert.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4lconvert.so.0.0.999999, dst_lnk: tegra/libnvv4lconvert.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, src_lnk: ../../../tegra/libv4l2_nvargus.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvargus.so, dst_lnk: ../../../tegra/libv4l2_nvargus.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvcuvidvideocodec.so, src_lnk: ../../../tegra/libv4l2_nvcuvidvideocodec.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvcuvidvideocodec.so, dst_lnk: ../../../tegra/libv4l2_nvcuvidvideocodec.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, src_lnk: ../../../tegra/libv4l2_nvvidconv.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvidconv.so, dst_lnk: ../../../tegra/libv4l2_nvvidconv.so
src: /usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, src_lnk: ../../../tegra/libv4l2_nvvideocodec.so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so, dst_lnk: ../../../tegra/libv4l2_nvvideocodec.so
src: /usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, src_lnk: tegra/libvulkan.so.1.2.141, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libvulkan.so.1.2.141, dst_lnk: tegra/libvulkan.so.1.2.141
src: /usr/lib/aarch64-linux-gnu/tegra/libcuda.so, src_lnk: libcuda.so.1.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libcuda.so, dst_lnk: libcuda.so.1.1
src: /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurface.so, src_lnk: libnvbufsurface.so.1.0.0, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libnvbufsurface.so, dst_lnk: libnvbufsurface.so.1.0.0
src: /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so, src_lnk: libnvbufsurftransform.so.1.0.0, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so, dst_lnk: libnvbufsurftransform.so.1.0.0
src: /usr/lib/aarch64-linux-gnu/tegra/libnvbuf_utils.so, src_lnk: libnvbuf_utils.so.1.0.0, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libnvbuf_utils.so, dst_lnk: libnvbuf_utils.so.1.0.0
src: /usr/lib/aarch64-linux-gnu/tegra/libnvdsbufferpool.so, src_lnk: libnvdsbufferpool.so.1.0.0, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libnvdsbufferpool.so, dst_lnk: libnvdsbufferpool.so.1.0.0
src: /usr/lib/aarch64-linux-gnu/tegra/libnvid_mapper.so, src_lnk: libnvid_mapper.so.1.0.0, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/tegra/libnvid_mapper.so, dst_lnk: libnvid_mapper.so.1.0.0
src: /usr/share/glvnd/egl_vendor.d/10_nvidia.json, src_lnk: ../../../lib/aarch64-linux-gnu/tegra-egl/nvidia.json, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/share/glvnd/egl_vendor.d/10_nvidia.json, dst_lnk: ../../../lib/aarch64-linux-gnu/tegra-egl/nvidia.json
src: /usr/lib/aarch64-linux-gnu/libcudnn.so.8, src_lnk: libcudnn.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn.so.8, dst_lnk: libcudnn.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn.so, src_lnk: /etc/alternatives/libcudnn_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn.so, dst_lnk: /etc/alternatives/libcudnn_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8, src_lnk: libcudnn_ops_infer.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8, dst_lnk: libcudnn_ops_infer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so, src_lnk: /etc/alternatives/libcudnn_ops_infer_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so, dst_lnk: /etc/alternatives/libcudnn_ops_infer_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8, src_lnk: libcudnn_ops_train.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8, dst_lnk: libcudnn_ops_train.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so, src_lnk: /etc/alternatives/libcudnn_ops_train_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so, dst_lnk: /etc/alternatives/libcudnn_ops_train_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8, src_lnk: libcudnn_adv_infer.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8, dst_lnk: libcudnn_adv_infer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so, src_lnk: /etc/alternatives/libcudnn_adv_infer_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so, dst_lnk: /etc/alternatives/libcudnn_adv_infer_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8, src_lnk: libcudnn_cnn_infer.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8, dst_lnk: libcudnn_cnn_infer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so, src_lnk: /etc/alternatives/libcudnn_cnn_infer_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so, dst_lnk: /etc/alternatives/libcudnn_cnn_infer_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8, src_lnk: libcudnn_adv_train.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8, dst_lnk: libcudnn_adv_train.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so, src_lnk: /etc/alternatives/libcudnn_adv_train_so, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so, dst_lnk: /etc/alternatives/libcudnn_adv_train_so
src: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8, src_lnk: libcudnn_cnn_train.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8, dst_lnk: libcudnn_cnn_train.so.8.2.1
src: /usr/include/cudnn_adv_infer.h, src_lnk: /etc/alternatives/cudnn_adv_infer_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_adv_infer.h, dst_lnk: /etc/alternatives/cudnn_adv_infer_h
src: /usr/include/cudnn_adv_train.h, src_lnk: /etc/alternatives/cudnn_adv_train_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_adv_train.h, dst_lnk: /etc/alternatives/cudnn_adv_train_h
src: /usr/include/cudnn_backend.h, src_lnk: /etc/alternatives/cudnn_backend_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_backend.h, dst_lnk: /etc/alternatives/cudnn_backend_h
src: /usr/include/cudnn_cnn_infer.h, src_lnk: /etc/alternatives/cudnn_cnn_infer_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_cnn_infer.h, dst_lnk: /etc/alternatives/cudnn_cnn_infer_h
src: /usr/include/cudnn_cnn_train.h, src_lnk: /etc/alternatives/cudnn_cnn_train_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_cnn_train.h, dst_lnk: /etc/alternatives/cudnn_cnn_train_h
src: /usr/include/cudnn.h, src_lnk: /etc/alternatives/libcudnn, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn.h, dst_lnk: /etc/alternatives/libcudnn
src: /usr/include/cudnn_ops_infer.h, src_lnk: /etc/alternatives/cudnn_ops_infer_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_ops_infer.h, dst_lnk: /etc/alternatives/cudnn_ops_infer_h
src: /usr/include/cudnn_ops_train.h, src_lnk: /etc/alternatives/cudnn_ops_train_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_ops_train.h, dst_lnk: /etc/alternatives/cudnn_ops_train_h
src: /usr/include/cudnn_version.h, src_lnk: /etc/alternatives/cudnn_version_h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/include/cudnn_version.h, dst_lnk: /etc/alternatives/cudnn_version_h
src: /etc/alternatives/libcudnn, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_v8.h
src: /etc/alternatives/libcudnn_adv_infer_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_adv_infer_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_adv_infer.so.8
src: /etc/alternatives/libcudnn_adv_train_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_adv_train_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_adv_train.so.8
src: /etc/alternatives/libcudnn_cnn_infer_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_cnn_infer_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8
src: /etc/alternatives/libcudnn_cnn_train_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_cnn_train_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_cnn_train.so.8
src: /etc/alternatives/libcudnn_ops_infer_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_ops_infer_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_ops_infer.so.8
src: /etc/alternatives/libcudnn_ops_train_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_ops_train_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn_ops_train.so.8
src: /etc/alternatives/libcudnn_so, src_lnk: /usr/lib/aarch64-linux-gnu/libcudnn.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/libcudnn_so, dst_lnk: /usr/lib/aarch64-linux-gnu/libcudnn.so.8
src: /etc/alternatives/cudnn_adv_infer_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_adv_infer_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_adv_infer_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_adv_infer_v8.h
src: /etc/alternatives/cudnn_backend_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_backend_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_backend_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_backend_v8.h
src: /etc/alternatives/cudnn_cnn_train_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_cnn_train_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_cnn_train_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_cnn_train_v8.h
src: /etc/alternatives/cudnn_ops_train_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_ops_train_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_ops_train_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_ops_train_v8.h
src: /etc/alternatives/cudnn_adv_train_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_adv_train_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_adv_train_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_adv_train_v8.h
src: /etc/alternatives/cudnn_cnn_infer_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_cnn_infer_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_cnn_infer_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_cnn_infer_v8.h
src: /etc/alternatives/cudnn_ops_infer_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_ops_infer_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_ops_infer_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_ops_infer_v8.h
src: /etc/alternatives/cudnn_version_h, src_lnk: /usr/include/aarch64-linux-gnu/cudnn_version_v8.h, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/etc/alternatives/cudnn_version_h, dst_lnk: /usr/include/aarch64-linux-gnu/cudnn_version_v8.h
src: /usr/lib/aarch64-linux-gnu/libcudnn_static.a, src_lnk: /etc/alternatives/libcudnn_stlib, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libcudnn_static.a, dst_lnk: /etc/alternatives/libcudnn_stlib
src: /usr/lib/libvisionworks_sfm.so.0.90, src_lnk: libvisionworks_sfm.so.0.90.4, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/libvisionworks_sfm.so.0.90, dst_lnk: libvisionworks_sfm.so.0.90.4
src: /usr/lib/libvisionworks.so, src_lnk: libvisionworks.so.1.6, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/libvisionworks.so, dst_lnk: libvisionworks.so.1.6
src: /usr/lib/libvisionworks_tracking.so.0.88, src_lnk: libvisionworks_tracking.so.0.88.2, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/libvisionworks_tracking.so.0.88, dst_lnk: libvisionworks_tracking.so.0.88.2
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so.8, src_lnk: libnvinfer.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so.8, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.8, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so.8, src_lnk: libnvparsers.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so.8, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, src_lnk: libnvonnxparser.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so.8, dst_lnk: libnvonnxparser.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer.so, src_lnk: libnvinfer.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer.so, dst_lnk: libnvinfer.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, src_lnk: libnvinfer_plugin.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so, dst_lnk: libnvinfer_plugin.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvparsers.so, src_lnk: libnvparsers.so.8.2.1, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvparsers.so, dst_lnk: libnvparsers.so.8.2.1
src: /usr/lib/aarch64-linux-gnu/libnvonnxparser.so, src_lnk: libnvonnxparser.so.8, dst: /run/containerd/io.containerd.runtime.v2.task/k8s.io/nvidia-device-plugin-ctr/rootfs/usr/lib/aarch64-linux-gnu/libnvonnxparser.so, dst_lnk: libnvonnxparser.so.8
, stderr: nvidia-container-cli: mount error: open failed: /sys/fs/cgroup/devices/system.slice/containerd.service/kubepods-besteffort-pod42d1b19e_561f_489a_9bff_1afbbdab3791.slice/devices.allow: no such file or directory: unknown
Warning BackOff 9s (x2 over 10s) kubelet Back-off restarting failed container nvidia-device-plugin-ctr in pod nvidia-device-plugin-daemonset-pz65q_kube-system(42d1b19e-561f-489a-9bff-1afbbdab3791)
I get the same error when I try to deploy a container from NGC (compatible with JetPack 4.6.1).
After several hours of trying to figure out what is going wrong, I found that JetPack 4.6.1 ships nvidia-container-toolkit version 1.7.0-1, while the plugin requires >= 1.11.0. The problem is that when I try to install nvidia-container-toolkit_1.11.0 on the AGX, I get dependency errors:
orfeas@xavier-agx-01:~/Documents/jetpack_debs/jetpack_511$ sudo dpkg -i nvidia-container-toolkit_1.11.0_rc.1-1_arm64.deb
(Reading database ... 170933 files and directories currently installed.)
Preparing to unpack nvidia-container-toolkit_1.11.0_rc.1-1_arm64.deb ...
Unpacking nvidia-container-toolkit (1.11.0~rc.1-1) over (1.7.0-1) ...
dpkg: dependency problems prevent configuration of nvidia-container-toolkit:
nvidia-container-toolkit depends on libnvidia-container-tools (>= 1.10.0-1); however:
Version of libnvidia-container-tools on system is 1.7.0-1.
dpkg: error processing package nvidia-container-toolkit (--install):
dependency problems - leaving unconfigured
Errors were encountered while processing:
nvidia-container-toolkit
This is probably because nvidia-container-toolkit_1.11.0_rc.1-1_arm64.deb is compatible only with JetPack 5.x.
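For anyone hitting the same mismatch, here is a minimal sketch of how the installed version could be compared against the minimum mentioned above. The helper names `version_ge` and `check_toolkit` are made up for illustration, and the comparison relies on GNU `sort -V`, which is only an approximation of Debian version ordering (`dpkg --compare-versions` is the exact tool on a Debian-based system).

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Assumption: GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum toolkit version quoted in this thread; adjust if the plugin changes.
required_toolkit="1.11.0"

# check_toolkit VERSION: print a verdict for one installed version string.
check_toolkit() {
    if version_ge "$1" "$required_toolkit"; then
        echo "ok: $1 >= $required_toolkit"
    else
        echo "too old: $1 < $required_toolkit"
    fi
}

# On the Jetson the installed version could be read with dpkg-query:
#   check_toolkit "$(dpkg-query -W -f='${Version}' nvidia-container-toolkit)"
# Here the version reported on JetPack 4.6.1 in this thread is used directly.
check_toolkit "1.7.0-1"
```

The same check applies to libnvidia-container-tools, which the dpkg output above says must be >= 1.10.0-1.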
I could simply reflash the Jetsons with JetPack 5.x to fix the problem, but NVIDIA's images for JetPack 5.x are about 12GB (with PyTorch), which is prohibitive for my use case. On JetPack 4.6.1, NVIDIA's images are about 1.9GB (with PyTorch), which is acceptable; that is why I am trying to set this up on JetPack 4.6.1.
What can I do to get my cluster up and running with GPU support on JetPack 4.6.1 (or on JetPack 5.x, but without 12GB images…)?
My Kubernetes version is:
1.26.3
This is my first time posting on the NVIDIA developer forum, so if you need further information I can provide it.
Thank you in advance.
In JetPack 5, all the libraries are included within the container, while in JetPack 4 they are mounted from the host.
So to save storage space, you can use the container on a clean r35 environment without installing the components (e.g. CUDA, cuDNN, …).
Thanks.
Thank you for the quick response!
Do you mean not installing CUDA, cuDNN, etc. on the Jetson itself (and just using the CUDA, cuDNN, etc. inside the 12GB container)?
If so, is there no way to fix this? For example, by using a more lightweight image as in r32 (mounting CUDA or similar in r35), or by setting up Kubernetes on r32 with GPU support?
I am using Knative (a serverless framework) on top of Kubernetes, and it is not efficient to load 12GB into RAM for each function I want to run (for example, a function that runs a simple neural net).
This is obviously not scalable.
I really need to fix this.
Thank you in advance!
Could you share the steps to reproduce this error?
We need to check it further to see if this can be fixed on r32.
Thanks.
Yes, sure.
The problem is unrelated to Knative (the NVIDIA Kubernetes plugin already fails without it), so I believe there is no need to provide the Knative steps.
First, I downloaded JetPack 4.6.1 with SDK Manager and flashed it to the Jetson AGX Xavier. I think this is fairly straightforward, but I can provide the exact steps for the flash as well.
After that, I found that Docker and containerd are already installed on the Jetson.
Then I ran the command: sudo swapoff -a
to disable swap. This step is critical for setting up the cluster correctly.
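As a side note, `swapoff -a` does not persist across reboots, so it can be worth confirming that swap is really off before starting the kubelet. Below is a small sketch that inspects `/proc/swaps`; the function name `swap_active` is made up for illustration.

```shell
#!/bin/sh
# swap_active FILE: succeeds if FILE (in /proc/swaps format) lists a swap area.
swap_active() {
    # The first line of /proc/swaps is a header; every further non-empty
    # line describes one active swap device or file.
    [ "$(tail -n +2 "$1" | grep -c .)" -gt 0 ]
}

# Only run the check where /proc/swaps exists (Linux).
if [ -r /proc/swaps ]; then
    if swap_active /proc/swaps; then
        echo "swap is still enabled; run: sudo swapoff -a"
    else
        echo "swap is disabled"
    fi
fi
```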
Then I installed Kubernetes with the following commands (to be fair, I followed this tutorial after the "Initialize Kubeadm On Master Node To Setup Control Plane" section):
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update -y
sudo apt-get install -y kubelet kubeadm kubectl
Then I created the cluster with the following commands:
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Instead of Calico as my network plugin (as the tutorial suggests), I installed Flannel with this command (this network plugin worked for me):
kubectl apply -f https://github.com/coreos/flannel/raw/master/Documentation/kube-flannel.yml
After that I had to install MetalLB as my network load balancer (for Knative), but I don't think this will be helpful for you. If you want to, though, you can follow the steps in this tutorial (only the Layer 2 Configuration steps).
After that I followed NVIDIA's official README to set up the NVIDIA device plugin for Kubernetes:
https://github.com/NVIDIA/k8s-device-plugin
Now you should see that NVIDIA's plugin cannot start, and if you describe the pod you should see an error similar to the one I posted in my initial question.
Thank you in advance.
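Since the event message is very long, the useful part is the short text after "stderr:". A rough sketch for pulling just that part out of the describe output follows; `extract_stderr` is a made-up helper name, and the sample input is a shortened stand-in for the real event text from the first post.

```shell
#!/bin/sh
# extract_stderr: read event text on stdin and print only what follows
# the last "stderr: " marker on each line that contains one.
extract_stderr() {
    sed -n 's/^.*stderr: //p'
}

# Shortened sample shaped like the event message in this thread.
printf '%s\n' \
  'Warning Failed 10s kubelet Error: error running hook: exit status 1, stdout: src: (many lines)' \
  ', stderr: nvidia-container-cli: mount error: open failed: (path): no such file or directory: unknown' \
  | extract_stderr

# On the cluster, the input would come from kubectl instead, for example:
#   kubectl -n kube-system describe pod nvidia-device-plugin-daemonset-pz65q | extract_stderr
```

The pod name in the comment is the one from the events above; it changes every time the DaemonSet recreates the pod.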
I was completely wrong that the whole image is loaded into RAM when I call a function. I am sorry about that…
This is only a storage problem (as you said), and I think I can work around it (I will flash JetPack 5.1.1). The problem I reported still remains, though, for anyone else who is limited to JetPack 4.6.1.
Thank you for the quick response!
Just want to confirm again.
After you flash Xavier with JetPack 5.1.1, is the problem fixed?
Thanks.
Hi there again!
I flashed the Jetson with JetPack 5.1.1 and now everything works fine (Kubernetes + GPU support + Knative (for serverless)). I can finally use the GPU inside the pods/services. There is a way to fix it even on JetPack 4.6.1, based on this tutorial, but I had already flashed the Jetson when I discovered it. I am providing this URL for anyone else who tries the same thing I did.
@AastaLLL thank you for your help!
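For anyone who wants to confirm that GPUs are visible to pods once the plugin is running, here is a hedged sketch that generates a small test Pod requesting one GPU through the `nvidia.com/gpu` resource advertised by the device plugin. The function name, the pod name `gpu-test`, the `sleep` command, and the image name are placeholders for illustration, not values from this thread.

```shell
#!/bin/sh
# gpu_test_manifest IMAGE: print a Pod manifest that requests one GPU.
# The pod name and the sleep command are illustrative choices.
gpu_test_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: $1
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Placeholder image name; substitute an image built for your JetPack release.
gpu_test_manifest "example.registry/l4t-image:tag"

# To use it on a real cluster:
#   gpu_test_manifest "<image>" | kubectl apply -f -
#   kubectl describe node | grep nvidia.com/gpu
```

If the plugin is healthy, the node description should list `nvidia.com/gpu` under its capacity, and the test pod should be scheduled instead of staying pending.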