Docker Installation Guide and GPU Usage - Gai's Blog

yum remove docker \
           docker-client \
           docker-client-latest \
           docker-common \
           docker-latest \
           docker-latest-logrotate \
           docker-logrotate \
           docker-selinux \
           docker-engine-selinux \
           docker-engine

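Install Docker CE

There are several ways to install Docker CE; this guide adds Docker's yum repository and installs from it.

Install the required dependencies

yum install -y yum-utils device-mapper-persistent-data lvm2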

Add the Docker yum repository

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
# If downloads are slow, the Aliyun mirror can be added instead
# yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

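Install the latest Docker CE (18.06.1-ce at the time of writing)

yum install docker-ce

If the installation fails with a dependency problem related to audit-libs, see the audit-libs error notes further down.

If yum asks you to accept a GPG key, check that its fingerprint is 060A 61C5 1B55 8A7F 742B 77AA C52F EB6B 621E 9F35 and accept it only if it matches.

To install a specific version, first list all available versions with yum list docker-ce --showduplicates | sort -r , for example: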

docker-ce.x86_64            3:18.09.0-3.el7                    docker-ce-stable
docker-ce.x86_64            3:18.09.0-3.el7                    @docker-ce-stable
docker-ce.x86_64            18.06.1.ce-3.el7                   docker-ce-stable
docker-ce.x86_64            18.06.0.ce-3.el7                   docker-ce-stable
docker-ce.x86_64            18.03.1.ce-1.el7.centos            docker-ce-stable

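Then run yum install docker-ce-<VERSION STRING> . Here <VERSION STRING> is the version column with any epoch prefix (such as 3:) removed, up to but not including the first hyphen. For example: yum install docker-ce-18.06.1.ce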
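As a rough illustration of that rule, the version string can be derived from the second column of the yum listing with plain shell parameter expansion. The function name version_string is made up for this sketch and is not part of yum or Docker.

```shell
#!/bin/sh
# Hypothetical helper: turn the version column printed by
# "yum list docker-ce --showduplicates" into the string that goes
# after "docker-ce-" in the install command.
version_string() {
    v="${1#*:}"               # drop an epoch prefix such as "3:" if present
    printf '%s\n' "${v%%-*}"  # keep everything before the first hyphen
}

version_string "3:18.09.0-3.el7"    # prints 18.09.0
version_string "18.06.1.ce-3.el7"   # prints 18.06.1.ce
```

The result can then be used as yum install "docker-ce-$(version_string 18.06.1.ce-3.el7)" .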

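Verify the installation

Start the Docker service

systemctl start docker
# service docker start   # also works

If this fails, see the notes on Docker service startup errors further down.

Run the hello-world image

docker run hello-world

This command first looks for the hello-world image locally. If it is not found, Docker pulls it from Docker Hub, the official image registry, and then starts a container from it. On success it prints the following: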

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

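Grant permissions

By default the docker command talks to a daemon socket owned by root, so it has to be run as root or through sudo. To let a regular user run Docker without sudo, add that user to the docker group.

First check whether the docker group exists with cat /etc/group | grep docker (note: docker, not dockerroot). If it does not exist, create it with groupadd docker . Then add the user to the group with usermod -aG docker username ; the change takes effect the next time that user logs in.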
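A plain grep docker also matches dockerroot, which is why the check above needs care. The sketch below anchors the match to the whole group name; group_exists is a made-up helper name, and getent group docker is an existing alternative that performs the same exact-name lookup.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a group with exactly the name $1
# exists in a file laid out like /etc/group (name:password:gid:members).
# $2 is optional and defaults to /etc/group.
group_exists() {
    grep -q "^$1:" "${2:-/etc/group}"
}

# Usage on the host (as root); "username" is a placeholder:
#   group_exists docker || groupadd docker
#   usermod -aG docker username
```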

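Switch Docker Hub to a mirror in China

Reference: https://www.docker-cn.com/registry-mirror

Edit /etc/docker/daemon.json (create it if it does not exist) and add the registry-mirrors key, then restart the Docker service so the change takes effect.

{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}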

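Move the data directory

Docker's data directory defaults to /var/lib/docker , which is where image data is stored. If the partition it lives on is small, the directory can be moved to a larger one. On this machine the root directory / is mounted on a small disk and /home on a large one, so the data is moved under /home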

service docker stop
mkdir /home/dockerData
mv /var/lib/docker /home/dockerData
ln -s /home/dockerData/docker /var/lib/docker
service docker start
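The move and the symlink can be wrapped into one function so the two paths cannot drift apart. This is only a sketch of the same mv plus ln -s steps; move_data_dir is a made-up name, and the Docker service still has to be stopped before and started after.

```shell
#!/bin/sh
# Hypothetical helper mirroring the mv + ln -s steps.
# $1 = current data directory, $2 = directory on the larger partition.
move_data_dir() {
    src="$1"
    dest_parent="$2"
    mkdir -p "$dest_parent" || return 1
    mv "$src" "$dest_parent"/ || return 1
    # Leave a symlink at the old path so Docker keeps finding its data.
    ln -s "$dest_parent/$(basename "$src")" "$src"
}

# Usage (as root):
#   service docker stop
#   move_data_dir /var/lib/docker /home/dockerData
#   service docker start
#   readlink -f /var/lib/docker    # shows where the data now lives
```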

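Install Docker Compose

Docker encourages running one process per container, so a service made up of several processes needs several containers. Those containers usually depend on each other (the database container has to start first, for example), and managing that by hand is tedious. Docker Compose addresses this: it lets you define a group of related containers in a YAML template and, through that configuration, create and start them in order.

Run the following commands to install it. 1.23.1 was the latest version at the time of writing; check the releases page on GitHub for the current one.

curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

Run docker-compose --version to verify the installation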
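To make the idea of a YAML template with dependent containers concrete, here is a minimal sketch that writes an example docker-compose.yml. The function name, the two service names, the images and the password are all placeholders chosen for illustration, not something from the original setup.

```shell
#!/bin/sh
# Writes an illustrative docker-compose.yml into the directory given as $1.
# Service names, images and the password are placeholders for this sketch.
write_compose_example() {
    cat > "$1/docker-compose.yml" <<'EOF'
version: "3"
services:
  db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: example
  web:
    image: nginx:1.15
    depends_on:
      - db          # start the db container before web
EOF
}

# Usage:
#   mkdir demo && write_compose_example demo
#   cd demo && docker-compose up -d     # creates and starts db, then web
#   docker-compose down                 # stops and removes both containers
```

Note that depends_on only controls start order; it does not wait for the database to be ready to accept connections.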

Dependency problem related to audit-libs. The error looks like this:

Error: Package: audit-libs-python-2.8.1-3.el7.x86_64 (lianjia-base)
           Requires: audit-libs(x86-64) = 2.8.1-3.el7
           Installed: audit-libs-2.8.1-3.el7_5.1.x86_64 (@updates)
               audit-libs(x86-64) = 2.8.1-3.el7_5.1
           Available: audit-libs-2.8.1-3.el7.x86_64 (lianjia-base)
               audit-libs(x86-64) = 2.8.1-3.el7
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

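The error shows that a package in Docker's dependency chain, audit-libs-python, requires exactly audit-libs 2.8.1-3.el7, while the audit-libs already installed (2.8.1-3.el7_5.1) is newer. The fix is to remove the installed version and let yum pull in the required one when Docker is installed. Many packages depend on audit-libs, so removing it with yum would remove those packages as well; rpm -e --nodeps xxx removes just the one package and ignores dependencies.

After running rpm -e --nodeps audit-libs-2.8.1-3.el7_5.1.x86_64 , installing Docker again may produce the message below. The server has audit-libs installed for two architectures (x86_64 and i686); with only one of them removed, the remaining one trips yum's multilib version check.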

Error: Multilib version problems found. This often means that the root
      cause is something else and multilib version checking is just
      pointing out that there is a problem. Eg.:

        1. You have an upgrade for audit-libs which is missing some
           dependency that another package requires. Yum is trying to
           solve this by installing an older version of audit-libs of the
           different architecture. If you exclude the bad architecture
           yum will tell you what the root cause is (which package
           requires what). You can try redoing the upgrade with
           --exclude audit-libs.otherarch ... this should give you an error
           message showing the root cause of the problem.

        2. You have multiple architectures of audit-libs installed, but
           yum can only see an upgrade for one of those architectures.
           If you don't want/need both architectures anymore then you
           can remove the one with the missing update and everything
           will work.

        3. You have duplicate versions of audit-libs installed already.
           You can use "yum check" to get yum show these errors.

      ...you can also use --setopt=protected_multilib=false to remove
      this checking, however this is almost never the correct thing to
      do as something else is very likely to go wrong (often causing
      much more problems).

      Protected multilib versions: audit-libs-2.8.1-3.el7.x86_64 != audit-libs-2.8.1-3.el7_5.1.i686

The fix is to remove the audit-libs build for the other architecture as well: run rpm -e --nodeps audit-libs-2.8.1-3.el7_5.1.i686 , after which yum install docker-ce completes without problems.
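Before removing anything with --nodeps, it can help to see which architectures of audit-libs are actually installed. rpm -q audit-libs lists the installed builds; the small filter below (arches is a made-up name) reduces such a list to the distinct architectures, so more than one line of output means a multi-architecture install.

```shell
#!/bin/sh
# Hypothetical helper: read package names in name-version-release.arch form
# on stdin and print each distinct architecture once, sorted.
arches() {
    sed 's/.*\.//' | sort -u
}

# Usage on the host:
#   rpm -q audit-libs | arches
```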

Running systemctl start docker fails to start the Docker service, with the following error:

Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.

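This same message appears no matter why the Docker service failed to start. The actual cause has to be read from systemctl status docker.service or journalctl -xe and then resolved case by case.

Running journalctl -xe > error_info saves the log to a file for easier reading. Open error_info and locate the Docker-related part. In my case the final error was: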

dockerd[3518]: Error starting daemon: Error initializing network controller: error obtaining controller instance: failed to create NAT chain DOCKER: iptables failed: iptables --wait -t nat -N DOCKER: iptables v1.4.21: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)

This looks like a networking problem involving iptables and NAT. Further up the log there are many errors like nf_nat_ipv4: Unknown symbol nf_nat_l3proto_register (err 0) , and above those are the earliest relevant errors:

kernel: xt_conntrack: Unknown symbol nf_ct_l3proto_module_put (err 0)
kernel: xt_conntrack: Unknown symbol nf_ct_l3proto_try_module_get (err 0)
dockerd[3518]: time="2018-11-02T13:32:27.308082925+08:00" level=warning msg="Running modprobe xt_conntrack failed with message: `modprobe: ERROR: could not insert 'xt_conntrack': Unknown symbol in module, or unknown parameter (see dmesg)\ninstall /bin/true \ninsmod /lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/net/netfilter/xt_conntrack.ko.xz`, error: exit status 1"

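So the xt_conntrack kernel module cannot be loaded, and the reason is that the kernel symbols nf_ct_l3proto_module_put and nf_ct_l3proto_try_module_get cannot be resolved.

modinfo xt_conntrack shows that the module depends on nf_conntrack. I tried loading that with modprobe -v nf_conntrack and checked with lsmod | grep nf_conntrack , which printed nothing, so the module had not been loaded. At this point I was stuck. Many thanks to @武帅 for pointing out that the module might have been blacklisted, which prevents it from loading (I had not considered that module blacklists existed at all). The fix was as follows:

Open /etc/modprobe.d/blacklist.conf , which contains entries related to nf_conntrack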

# I/O dynamic configuration support for s390x (bz #563228)
blacklist chsc_sch
blacklist nf_conntrack
blacklist nf_conntrack_ipv6
blacklist xt_conntrack
blacklist nf_conntrack_ftp
blacklist xt_state
blacklist iptable_nat
blacklist ipt_REDIRECT
blacklist nf_nat
blacklist nf_conntrack_ipv4

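A series of networking modules are disabled here with the blacklist keyword. The fix is to comment out the lines from nf_conntrack through nf_conntrack_ipv4.

Open /etc/modprobe.d/connectiontracking.conf , which also contains entries related to nf_conntrack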

install nf_nat /bin/true
install xt_state  /bin/true
install iptable_nat /bin/true
install nf_conntrack /bin/true
install nf_defrag_ipv4   /bin/true
install nf_conntrack_ipv4 /bin/true
install nf_conntrack_ipv6  /bin/true

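A line of the form install xxx /bin/true tells modprobe to run /bin/true instead of loading module xxx, so the load is skipped while still reporting success. This blocks the module just as a blacklist entry does. The fix is to stop the file from being read, either by renaming it so that it no longer ends in .conf (here it was renamed to connectiontracking.conf.old ) or by commenting out every line in it.

Then start the Docker service again with systemctl start docker and pull the hello-world image. It now succeeds and the problem is solved.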
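The two files above were found by reading them by hand. A sketch for locating such entries automatically is shown below; find_module_blocks is a made-up helper, and it assumes the blocking entries live as plain text files under /etc/modprobe.d.

```shell
#!/bin/sh
# Hypothetical helper: list every active "blacklist <module>" or
# "install <module> ..." line for module $1 under directory $2
# (default /etc/modprobe.d). Commented-out lines are not matched, and
# longer names such as nf_conntrack_ipv4 do not match nf_conntrack.
find_module_blocks() {
    grep -rnE "^[[:space:]]*(blacklist|install)[[:space:]]+$1([[:space:]]|\$)" \
        "${2:-/etc/modprobe.d}"
}

# Usage:
#   find_module_blocks nf_conntrack
#   find_module_blocks xt_conntrack
```

Each output line gives the file, the line number and the blocking entry, which are the lines to comment out.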

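2. Install Nvidia-Docker

With plain Docker installed, using a GPU inside a container is possible but very cumbersome. To make this easier, Nvidia built Nvidia-Docker on top of Docker, which makes it straightforward for deep learning frameworks running in a container to use the GPU.

The following is a simplified summary of the official installation guide; for the complete version see https://github.com/NVIDIA/nvidia-docker

The official guide has instructions for each supported system. Since this machine runs CentOS 7 with Docker CE, the section followed here is CentOS 7 (docker-ce), RHEL 7.4/7.5 (docker-ce), Amazon Linux 1/2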

GNU/Linux x86_64 with kernel version > 3.10

Docker >= 1.12

NVIDIA GPU with Architecture > Fermi (2.1)

NVIDIA drivers ~= 361.93 (untested on older versions)

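In short, having the GPU driver installed on the host is enough (no need to set up CUDA and cuDNN yourself anymore).

GPU driver download page: https://www.nvidia.com/Download/index.aspx

The options chosen for this machine were: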

Product Type: Tesla

Product Series: P-Series

Product: Tesla P40

Operating System: Linux 64-bit

Windows Driver Type: Standard

CUDA Toolkit: 10.0

Language: English(US)

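For the CUDA Toolkit option, pick the highest version: a driver that supports a newer CUDA version also supports older ones, but not the other way around.

The downloaded file is NVIDIA-Linux-x86_64-410.72.run

For installing the GPU driver itself, see the earlier post on that topic.

Remove old versions

If nvidia-docker 1.0 was installed previously, remove it first

docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
yum remove nvidia-docker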

Add the repository

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | tee /etc/yum.repos.d/nvidia-docker.repo

Install

yum install -y nvidia-docker2
pkill -SIGHUP dockerd

Test

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Output like the following means the installation succeeded

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:00:06.0 Off |                  N/A |
| N/A   23C    P0    48W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:00:07.0 Off |                  N/A |
| N/A   23C    P0    44W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

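After installing nvidia-docker, the contents of /etc/docker/daemon.json (the file edited earlier to configure the Docker Hub mirror) may have changed. Check it and, if necessary, add the registry mirror setting again.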
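For reference, a daemon.json that carries both settings might look like the file written by the sketch below. The runtimes entry is an assumption about what the nvidia-docker2 package registers; compare it with the file actually present on the machine rather than copying it blindly. write_daemon_json is a made-up helper name.

```shell
#!/bin/sh
# Writes a daemon.json that keeps both the nvidia runtime and the registry
# mirror. $1 is the output path; write to a scratch file first and compare
# it with the live /etc/docker/daemon.json before replacing anything.
# Assumption: the "runtimes" entry matches what nvidia-docker2 installed.
write_daemon_json() {
    cat > "$1" <<'EOF'
{
    "registry-mirrors": ["https://registry.docker-cn.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
}

# Usage:
#   write_daemon_json /tmp/daemon.json
#   diff /tmp/daemon.json /etc/docker/daemon.json
```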

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=46031 /home/dockerData/docker/overlay2/2965080837ad6a78dd0013f677706eac50d147e457a719281750755f7aecbdb1/merged]\\nnvidia-container-cli: requirement error: unsatisfied condition: driver < 385\\n\\"\"": unknown.

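This error came from running docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi . The key part is at the end, requirement error: unsatisfied condition: driver < 385 , which means the installed driver does not meet the image's requirements. No tag was given for the CUDA image, so the latest one was used, and at the time the latest CUDA version was 10.0. The GPU driver on the affected server had been installed earlier for CUDA 9, so it does not support CUDA 10.

There are two fixes. One is to upgrade the GPU driver, which is the way to go if CUDA 10 is actually needed. The other is to pin the CUDA image tag: the current driver supports CUDA 9.0, and the image is only being run to verify the Nvidia-Docker installation, so docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi works and prints the GPU information.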
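The choice between the two tags can be sketched as a check on the driver's major version. The helper names are made up, and the threshold of 410 for CUDA 10.0 is an assumption based on the driver version used earlier in this post and on NVIDIA's release notes; verify it for your own setup.

```shell
#!/bin/sh
# Hypothetical helper: succeed if driver version $1 (e.g. 410.72) has a
# major number of at least $2.
driver_at_least() {
    major="${1%%.*}"
    [ "$major" -ge "$2" ]
}

# Pick an image tag for the smoke test.
# Assumption: drivers from the 410 series on can run CUDA 10.0 images;
# anything older falls back to the 9.0 tag used in the text.
cuda_tag_for_driver() {
    if driver_at_least "$1" 410; then
        echo "10.0-base"
    else
        echo "9.0-base"
    fi
}

# Usage on the host:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   docker run --runtime=nvidia --rm "nvidia/cuda:$(cuda_tag_for_driver "$v")" nvidia-smi
```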

3. Run TensorFlow in a container

Next, verify that TensorFlow works inside a container. Reference: https://www.tensorflow.org/install/docker

docker run -it --rm tensorflow/tensorflow \
    python -c "import tensorflow as tf; print(tf.__version__)"

If the TensorFlow version number is printed, it works; here it was 1.11.0

docker run --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.contrib.eager.num_gpus())"

If TensorFlow logs its GPU initialization and prints the number of GPUs, it works; here the output was

2018-11-02 10:02:03.213647: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-02 10:02:03.827049: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-02 10:02:03.827656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:00:06.0
totalMemory: 22.38GiB freeMemory: 22.22GiB
2018-11-02 10:02:03.916592: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-02 10:02:03.917168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 1 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:00:07.0
totalMemory: 22.38GiB freeMemory: 22.22GiB
2018-11-02 10:02:03.917248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0, 1
2018-11-02 10:02:04.566762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-02 10:02:04.566813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 1
2018-11-02 10:02:04.566824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N N
2018-11-02 10:02:04.566831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1:   N N
2018-11-02 10:02:04.567666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21551 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:00:06.0, compute capability: 6.1)
2018-11-02 10:02:04.960299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 21551 MB memory) -> physical GPU (device: 1, name: Tesla P40, pci bus id: 0000:00:07.0, compute capability: 6.1)
2

PS: the -e NVIDIA_VISIBLE_DEVICES option selects which GPU (or GPUs) the container can use. For example, to use only the first GPU:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -it --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.contrib.eager.num_gpus())"
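To expose several GPUs, NVIDIA_VISIBLE_DEVICES takes a comma-separated list of indices. A small sketch for building that list from separate arguments follows; gpu_list is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: join GPU indices into the comma-separated form that
# NVIDIA_VISIBLE_DEVICES expects, e.g. "0 1" -> "0,1".
gpu_list() {
    # The subshell keeps the IFS change from leaking into the caller.
    ( IFS=,; printf '%s\n' "$*" )
}

# Usage:
#   docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES="$(gpu_list 0 1)" \
#       --rm nvidia/cuda:9.0-base nvidia-smi
```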

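4. Summary

This completes the Docker installation and GPU usage inside containers.

A container does not get exclusive use of a GPU. Several containers sharing a GPU behave like several programs sharing one: it works as long as GPU memory and compute are coordinated among them.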

References

Official installation guides
- Docker: https://docs.docker.com/install/linux/docker-ce/centos/
- Nvidia-Docker: https://github.com/NVIDIA/nvidia-docker
- TensorFlow: https://www.tensorflow.org/install/docker
- Kernel module blacklisting
  - https://www.linuxquestions.org/questions/linux-kernel-70/block-a-kernel-module-to-be-loaded-4175490812/
  - https://www.cyberciti.biz/faq/linux-disable-mounting-of-uncommon-filesystem/