
I have been trying to do something very simple:

docker run --runtime=nvidia --rm nvidia/cuda

However I got the error

docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/8cb963c23bee566216d2d890e60f62ae497be2857ef31e519ebd31e43e91a865/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown.

So I tried to do sudo apt install nvidia-container-runtime

but I got

E: Unable to locate package nvidia-container-runtime

So I followed the advice of this page and ran

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update

With this, sudo apt install nvidia-container-runtime succeeded.
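Note: depending on how the package is installed, the nvidia runtime may also need to be registered with Docker and the daemon restarted before --runtime=nvidia resolves. A minimal sketch, assuming a standard Docker setup and that /etc/docker/daemon.json has no other settings you need to keep (it mirrors the config shown further down in this thread):

sudo tee /etc/docker/daemon.json <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker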

Then I tried to run the Docker container from the start of this question:
docker run --runtime=nvidia --rm nvidia/cuda

and now I got a completely different error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\\\"\"": unknown.

I don't know how to proceed from here to be able to run the container. Any help will be greatly appreciated.

^^ The above should be for the x86_64 architecture;
for Xavier AGX you may want to use l4t containers from ngc.nvidia.com:

nvcr.io/nvidia/l4t-base:r32.4.2
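For example, something like the following (assuming JetPack/L4T R32.4.2 on the Xavier, so the container tag matches the host L4T version; add the X11 options only if you need display output):

sudo docker run --runtime nvidia -it --rm --network host nvcr.io/nvidia/l4t-base:r32.4.2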

Sorry to jump in here. I just started with the AGX Xavier, and I'm having an issue with Docker.

thuy@worker03-xavieragx:~$ sudo docker run --runtime nvidia --network host -it -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r32.3.1
Unable to find image 'nvcr.io/nvidia/l4t-base:r32.3.1' locally
r32.3.1: Pulling from nvidia/l4t-base
8aaa03d29a6e: Pull complete
e73d3a974854: Pull complete
2c14cdba18f5: Pull complete
23dd63c7659b: Pull complete
3bd414bd9504: Pull complete
cafd526eb263: Pull complete
483b0873e636: Pull complete
2568c5428ff2: Pull complete
6bcd9356d42f: Pull complete
c7f6d0180a4e: Pull complete
beddc9b83fb0: Pull complete
656f2307c79e: Pull complete
fe2e73a571b7: Pull complete
f5decba41c07: Pull complete
f0b6e413c48c: Pull complete
Digest: sha256:e8987d52ddb9496948e02656fc62d46561abce25bfe83203f4bc24c67e094578
Status: Downloaded newer image for nvcr.io/nvidia/l4t-base:r32.3.1
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\n\\\"\"": unknown.
ERRO[0072] error waiting for container: context canceled

thuy@worker03-xavieragx:~$ nvidia-container-cli list
nvidia-container-cli: initialization error: driver error: failed to process request

Not sure why I'm getting the driver error, since everything was installed through JetPack 4.4, which should include all the necessary NVIDIA drivers?
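If it helps narrow things down, and assuming the standard libnvidia-container troubleshooting flags also apply on Jetson, running the CLI with debug output should show where the driver lookup fails:

sudo nvidia-container-cli -k -d /dev/tty info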

Can you give me some pointers here or let me know if I should open a new thread for this?

Thanks,

No, there is no issue with Docker itself. Switching to the second Xavier, it works fine (so I can remove the current drivers and reinstall them later). I'm now having an issue with this Xavier when joining an existing Kubernetes cluster: the nvidia-device-plugin-daemonset does not work. I just want to expose the GPU to the cluster.

I used this command for the plugin on the master node:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.6.0/nvidia-device-plugin.yml
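To check whether the plugin pod actually came up on the Xavier node and whether the GPU is advertised, standard kubectl commands along these lines should work (replace the node name with yours):

kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin
kubectl describe node thuy-xavier-02 | grep -i 'nvidia.com/gpu'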

thuy@thuy-xavier-02:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:58:53Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/arm64"}

thuy@thuy-xavier-02:~$ docker version
Client:
Version: 19.03.6
API version: 1.40
Go version: go1.12.17
Git commit: 369ce74a3c
Built: Fri Feb 28 23:47:53 2020
OS/Arch: linux/arm64
Experimental: false

thuy@thuy-xavier-02:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
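One thing worth double-checking (this is based on how the device plugin is usually deployed on Jetson, not on anything confirmed in this thread): Kubernetes does not pass --runtime nvidia itself, so the nvidia runtime generally has to be made the default in daemon.json, roughly like this, followed by a restart of the Docker daemon:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}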

thuy@thuy-xavier-02:~$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
Version: 19.03.6
API version: 1.40
Go version: go1.12.17
Git commit: 369ce74a3c
Built: Fri Feb 28 23:47:53 2020
OS/Arch: linux/arm64
Experimental: false

Sorry for switching topics,

I figured out that the issue is that the nvidia-device-plugin in Kubernetes requires nvidia-smi, but I don't have nvidia-smi on the Xavier, only tegrastats.

I don't know if I should try to install nvidia-smi on the Xavier AGX board, or try to figure out how to make the nvidia-device-plugin work with tegrastats.
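For what it's worth, a quick way to confirm that on the board (plain shell; the paths assume a stock JetPack install):

which nvidia-smi    # prints nothing, there is no nvidia-smi on Jetson
which tegrastats    # e.g. /usr/bin/tegrastats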

If you have any experience with this, it’d be great to know.

Thanks,