VERSION="v1.17.0"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-$VERSION-linux-amd64.tar.gz
sudo tar zxvf crictl-$VERSION-linux-amd64.tar.gz -C /usr/local/bin
rm -f crictl-$VERSION-linux-amd64.tar.gz
Install critest:
VERSION="v1.17.0"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/critest-$VERSION-linux-amd64.tar.gz
sudo tar zxvf critest-$VERSION-linux-amd64.tar.gz -C /usr/local/bin
rm -f critest-$VERSION-linux-amd64.tar.gz
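A quick sanity check after unpacking (the version string printed will match whichever release you downloaded):
crictl --version
# crictl version v1.17.0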
The crictl command connects to unix:///var/run/dockershim.sock by default.
To connect to a different runtime you need to set the endpoint, in one of three ways:
Via the command-line flags --runtime-endpoint and --image-endpoint
Via the environment variables CONTAINER_RUNTIME_ENDPOINT and IMAGE_SERVICE_ENDPOINT
Via the endpoint settings in a configuration file, --config=/etc/crictl.yaml
(an example of the flag and environment-variable forms follows the configuration file below)
My current configuration is /etc/crictl.yaml:
crictl configuration file /etc/crictl.yaml
runtime-endpoint: unix:///var/run/containerd/containerd.sock
image-endpoint: unix:///var/run/containerd/containerd.sock
timeout: 10
#debug: true
debug: false
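Alternatively, if you would rather not maintain /etc/crictl.yaml, the same endpoints can be passed on the command line or exported as environment variables (a minimal sketch, using the containerd socket from the config above):

# one-off, via command-line flags
crictl --runtime-endpoint unix:///var/run/containerd/containerd.sock \
       --image-endpoint unix:///var/run/containerd/containerd.sock ps

# per-shell, via environment variables
export CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/containerd/containerd.sock
export IMAGE_SERVICE_ENDPOINT=unix:///var/run/containerd/containerd.sock
crictl ps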
If /etc/crictl.yaml is not configured, running crictl ps reports errors:
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
E0525 16:36:22.395915 241945 remote_runtime.go:390] "ListContainers with filter from runtime service failed" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory\"" filter="&ContainerFilter{Id:,State:&ContainerStateValue{State:CONTAINER_RUNNING,},PodSandboxId:,LabelSelector:map[string]string{},}"
FATA[0000] listing containers: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"
On Fedora, even after configuring /etc/crictl.yaml as above, crictl ps still failed:
FATA[0000] validate service connection: CRI v1 runtime API is not implemented for endpoint "unix:///var/run/containerd/containerd.sock": rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService
See crictl rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService #356 ; this looks like a version mismatch between the containerd runtime and crictl (using matching versions is recommended). For example, in my environment:
# containerd --version
containerd containerd.io 1.6.20 2806fc1057397dbaeefbea0e4e17bddfbd388f38
# crictl --version
crictl version v1.26.0
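A straightforward fix is to reinstall crictl at a release that matches the runtime, reusing the installation steps above (a sketch; pick the VERSION appropriate for your containerd / Kubernetes release):

VERSION="v1.26.0"    # choose the cri-tools release that matches your runtime
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-$VERSION-linux-amd64.tar.gz
sudo tar zxvf crictl-$VERSION-linux-amd64.tar.gz -C /usr/local/bin
rm -f crictl-$VERSION-linux-amd64.tar.gz
crictl --version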
crictl pods lists the pods on the host
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
173ffdaeefab9 13 hours ago Ready kube-proxy-vwqsn kube-system 0 (default)
0952e1399e340 13 hours ago Ready kube-scheduler-z-k8s-m-1 kube-system 0 (default)
424cc7a5a9bfc 13 hours ago Ready kube-controller-manager-z-k8s-m-1 kube-system 0 (default)
7249ac0122d31 13 hours ago Ready kube-apiserver-z-k8s-m-1 kube-system 0 (default)
crictl images lists the images on the host
IMAGE TAG IMAGE ID SIZE
k8s.gcr.io/coredns/coredns v1.8.6 a4ca41631cc7a 13.6MB
k8s.gcr.io/kube-apiserver v1.24.2 d3377ffb7177c 33.8MB
k8s.gcr.io/kube-apiserver v1.24.3 d521dd763e2e3 33.8MB
k8s.gcr.io/kube-controller-manager v1.24.2 34cdf99b1bb3b 31MB
k8s.gcr.io/kube-controller-manager v1.24.3 586c112956dfc 31MB
k8s.gcr.io/kube-proxy v1.24.2 a634548d10b03 39.5MB
k8s.gcr.io/kube-proxy v1.24.3 2ae1ba6417cbc 39.5MB
k8s.gcr.io/kube-scheduler v1.24.2 5d725196c1f47 15.5MB
k8s.gcr.io/kube-scheduler v1.24.3 3a5aa3a515f5d 15.5MB
k8s.gcr.io/pause 3.5 ed210e3e4a5ba 301kB
k8s.gcr.io/pause 3.6 6270bb605e12e 302kB
k8s.gcr.io/pause 3.7 221177c6082a8 311kB
Using crictl images -a additionally shows the complete image IDs (sha256 digests).
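For scripting, the crictl releases I have used also support -q to print only image IDs and --digests to show the repository digests (treat the exact flags as an assumption to verify against your crictl version):

crictl images -q          # image IDs only, one per line
crictl images --digests   # include repository digests alongside tags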
List the containers on the host (this command is similar to docker ps -a from Docker Atlas; note that it shows all containers, both running and stopped):
crictl ps -a
crictl ps -a lists the containers on the host
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
fd65e2a037600 2ae1ba6417cbc 16 hours ago Running kube-proxy 0 173ffdaeefab9 kube-proxy-vwqsn
5922644848149 3a5aa3a515f5d 16 hours ago Running kube-scheduler 0 0952e1399e340 kube-scheduler-z-k8s-m-1
58e48e8bc6861 586c112956dfc 16 hours ago Running kube-controller-manager 0 424cc7a5a9bfc kube-controller-manager-z-k8s-m-1
901b1dc06eed1 d521dd763e2e3 16 hours ago Running kube-apiserver 0 7249ac0122d31 kube-apiserver-z-k8s-m-1
To see only the running containers (excluding stopped ones), drop the -a flag.
Execute a command in a container (similar to the exec subcommand of docker or kubectl, running a command directly inside the container):
crictl exec -it fd65e2a037600 ls
Example output:
bin boot dev etc go-runner home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
However, not every container can be exec'd into. In the example above, only the container of the kube-proxy-vwqsn pod succeeded; my attempts to run ls in the containers of kube-apiserver-z-k8s-m-1 and the other control plane pods all failed:
crictl exec -it 901b1dc06eed1 ls
The error says it cannot open the /dev/pts/0 device:
FATA[0000] execing command in container: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "3152d5e5f78f25a91d5e2a659c6e8036bf07978dbb8db5c95d1470089b968c9d": OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown
Why is there no /dev/pts/0 device inside the container? (I have not found a solution yet.)
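One thing that may be worth trying (an assumption on my part, not verified on this node): it is the -t flag that asks the runtime to allocate a pseudo-terminal under /dev/pts, so a plain non-interactive exec may still work:

crictl exec 901b1dc06eed1 ls /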
Check a container's logs (here, the apiserver logs):
crictl logs 901b1dc06eed1
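crictl logs also accepts --tail and -f, which helps with chatty components such as the apiserver:

crictl logs --tail=100 901b1dc06eed1   # only the last 100 lines
crictl logs -f 901b1dc06eed1           # follow the log stream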
crictl runp runs a new pod sandbox ("Run a new pod").
crictl run runs a new container inside a sandbox ("Run a new container inside a sandbox").
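A minimal usage sketch (the pod-config.json / container-config.json files and their contents are illustrative, modeled on the upstream crictl examples, not taken from this cluster):

cat > pod-config.json <<'EOF'
{
  "metadata": {
    "name": "test-sandbox",
    "namespace": "default",
    "attempt": 1,
    "uid": "test-sandbox-uid"
  },
  "log_directory": "/tmp",
  "linux": {}
}
EOF

cat > container-config.json <<'EOF'
{
  "metadata": { "name": "busybox" },
  "image": { "image": "docker.io/library/busybox:latest" },
  "command": ["top"],
  "log_path": "busybox.log",
  "linux": {}
}
EOF

crictl runp pod-config.json                       # run a new pod sandbox
crictl run container-config.json pod-config.json  # run a new container inside a (new) sandbox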
I ran into an error:
E0719 20:50:53.394867 2319718 remote_runtime.go:201] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format \"slice:prefix:name\" for systemd cgroups, got \"/k8s.io/81805934b02f89c87f1babefc460beb28679e184b777ebb8082942c3776a8d5b\" instead: unknown"
FATA[0001] run pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/k8s.io/81805934b02f89c87f1babefc460beb28679e184b777ebb8082942c3776a8d5b" instead: unknown
Configuring a cgroup driver: Migrating to the systemd driver describes converting existing nodes to the systemd driver; that step is completed when the node joins the cluster.
Troubleshooting: unable to run a pod
See Impossible to create or start a container after reboot (OCI runtime create failed: expected cgroupsPath to be of format "slice:prefix:name" for systemd cgroups, got "/kubepods/burstable/…") #4857 :
The cause is that kubelet / crictl / containerd are using inconsistent cgroup drivers. There are two cgroup drivers: the cgroupfs cgroup driver and the systemd cgroup driver:
When preparing the z-k8s high-availability Kubernetes cluster I adopted cgroup v2 (via the systemd process manager) and also configured the systemd cgroup driver, so the containerd runtime uses the systemd cgroup driver.
Configuring the first control plane node already enables the systemd cgroup driver for kubelet by default (since Kubernetes 1.22, kubelet defaults to the systemd cgroup driver and needs no special configuration). This can be checked with kubectl edit cm kubelet-config -n kube-system, which shows that the cluster's current setting is cgroupDriver: systemd.
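A non-interactive check amounts to the same thing (a sketch; on older kubeadm clusters the ConfigMap may carry a versioned name such as kubelet-config-1.21, and the node-local copy lives in /var/lib/kubelet/config.yaml):

kubectl get cm kubelet-config -n kube-system -o yaml | grep cgroupDriver
# expected:
#     cgroupDriver: systemd
grep cgroupDriver /var/lib/kubelet/config.yaml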
If kubelet is in fact not using the systemd cgroup driver, it can be corrected by following Update the cgroup driver on all nodes.
However, crictl by default does not appear to report the systemd cgroup driver, which can be checked with the following command:
crictl info | grep system
The information shown by crictl info is wrong; this anomaly is covered in crictl info get wrong container runtime cgroupdrives when use containerd. #728 , although that bug should already have been fixed by Change the type of CRI runtime option #5300
Looking carefully at the crictl info | grep systemd output, two fields relate to cgroups:
"SystemdCgroup": true
"systemdCgroup": false,
Why is one true and the other false?
Checking /etc/containerd/config.toml, there are also several cgroup-related settings:
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
systemd_cgroup = false
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
Revising the containerd systemd_cgroup configuration (failed; no special configuration is needed)
I tried changing the systemd_cgroup = false above to systemd_cgroup = true; as a result containerd failed after the restart and the node went NotReady.
Checking the containerd service status shows:
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.548636577+08:00" level=info msg="loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." type=io.containerd.tracing.processor.v1
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.548669143+08:00" level=info msg="skip loading plugin \"io.containerd.tracing.processor.v1.otlp\"..." error="no OpenTelemetry endpoint: skip plugin" type=io.containerd.tracing.processor.v1
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.548696451+08:00" level=info msg="loading plugin \"io.containerd.internal.v1.tracing\"..." type=io.containerd.internal.v1
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.548761787+08:00" level=error msg="failed to initialize a tracing processor \"otlp\"" error="no OpenTelemetry endpoint: skip plugin"
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.548852549+08:00" level=info msg="loading plugin \"io.containerd.grpc.v1.cri\"..." type=io.containerd.grpc.v1
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.549391266+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.549774946+08:00" level=info msg=serving... address=/run/containerd/containerd.sock.ttrpc
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.549940533+08:00" level=info msg=serving... address=/run/containerd/containerd.sock
At this point crictl ps also fails with:
E0722 09:59:01.808419 2346431 remote_runtime.go:536] "ListContainers with filter from runtime service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService" filter="&ContainerFilter{Id:,State:&ContainerStateValue{State:CONTAINER_RUNNING,},PodSandboxId:,LabelSelector:map[string]string{},}"
FATA[0000] listing containers: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService
Carefully cross-checking the containerd config.toml, there are actually two places that configure the systemd cgroup driver; when preparing the z-k8s high-availability Kubernetes cluster I had only revised one of them, following the official documentation.
See deprecated (?) systemd_cgroup still printed by "containerd config default" #4574 :
Since containerd 1.3.7, setting systemd_cgroup = true for the io.containerd.grpc.v1.cri plugin is no longer supported, so the following fails:
containerd config default > /etc/containerd/config.toml
sed -i -e 's/systemd_cgroup = false/systemd_cgroup = true/' /etc/containerd/config.toml
Starting containerd then fails with:
Jul 19 23:28:15 z-k8s-n-1 containerd[2321243]: time="2022-07-19T23:28:15.549391266+08:00" level=warning msg="failed to load plugin io.containerd.grpc.v1.cri" error="invalid plugin config: `systemd_cgroup` only works for runtime io.containerd.runtime.v1.linux"
The correct configuration method is given in a comment on how to configure systemd cgroup with 1.3.X #4203. It turns out the official Kubernetes documentation is right: the setting now belongs under io.containerd.runc.v2, exactly as configured when installing the official containerd binaries:
Configure containerd's runc to use the systemd cgroup driver
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
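If this section does need editing, a reasonable follow-up (a sketch) is simply to restart containerd and confirm the node returns to Ready:

sudo systemctl restart containerd
systemctl --no-pager status containerd   # should be active (running), with no cri plugin load errors
kubectl get nodes                         # the node should report Ready again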
Another comment on how to configure systemd cgroup with 1.3.X #4203 explains that there are two systemd cgroup settings:
systemd_cgroup is the setting for shim v1 and will be removed once shim v1 is deprecated; in other words, with current containerd this setting no longer needs to be touched.
SystemdCgroup is the one that actually matters for the v2 configuration, and it lives under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
So it seems containerd's config.toml does not need to be modified.
Verify whether containerd is using systemd cgroups
See the discussion in config.toml SystemdCgroup not work #4900 :
Do not rely on the crictl info output; it is very confusing.
The correct method is to look at systemctl status containerd, where the running service shows:
CGroup: /system.slice/containerd.service
├─2306099 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 2cd1b6245
├─2306139 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id c380743d5
└─2346698 /usr/local/bin/containerd
In addition, check the systemd-cgls output: the containers all sit under systemd's cgroups, which shows that systemd cgroups are being used correctly.
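For example (a sketch; with the systemd driver and cgroup v2 the pod cgroups appear as kubepods*.slice entries rather than a flat /kubepods hierarchy):

systemd-cgls --no-pager | grep kubepods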
crictl ps output: the containers on the node and the pods they belong to
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
e06473dd8f0d1 5e3a0e9ab91a1 9 hours ago Running speaker 37 ed5621760e601 speaker-vlhld
7f4fb84d66dce 0c16e5f81be4b 9 hours ago Running master 11 b6ce34d7999e5 gpu-operator-1673526262-node-feature-discovery-master-6594glnlk
592bc16348f70 5185b96f0becf 9 hours ago Running coredns 37 23d008f4aaa1f coredns-d669857b7-x6wlt
65b28faa3741e 526bd4754c9cd 9 hours ago Running cilium-agent 162 aa9e0b0642dff cilium-2x6gn
fa4c96d33f275 0da6a335fe135 9 hours ago Running node-exporter 33 7e041042fae43 kube-prometheus-stack-1680871060-prometheus-node-exporter-dkqqs
0d517d735bddc 6d23ec0e8b87e 9 hours ago Running kube-scheduler 17 5290816dcc423 kube-scheduler-z-k8s-m-2
56b89f3c82bf9 6039992312758 9 hours ago Running kube-controller-manager 17 1db70ddeb87be kube-controller-manager-z-k8s-m-2
04465ed654902 0346dbd74bcb9 9 hours ago Running kube-apiserver 39 f6178f5243ea2 kube-apiserver-z-k8s-m-2
crictl pods output: the pods on the node
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
b6ce34d7999e5 9 hours ago Ready gpu-operator-1673526262-node-feature-discovery-master-6594glnlk gpu-operator 11 (default)
23d008f4aaa1f 9 hours ago Ready coredns-d669857b7-x6wlt kube-system 13 (default)
aa9e0b0642dff 9 hours ago Ready cilium-2x6gn kube-system 18 (default)
ed5621760e601 9 hours ago Ready speaker-vlhld metallb-system 18 (default)
7e041042fae43 9 hours ago Ready kube-prometheus-stack-1680871060-prometheus-node-exporter-dkqqs prometheus 5 (default)
1db70ddeb87be 9 hours ago Ready kube-controller-manager-z-k8s-m-2 kube-system 5 (default)
f6178f5243ea2 9 hours ago Ready kube-apiserver-z-k8s-m-2 kube-system 13 (default)
5290816dcc423 9 hours ago Ready kube-scheduler-z-k8s-m-2 kube-system 5 (default)
937886c263a4a 14 hours ago NotReady coredns-d669857b7-x6wlt kube-system 12 (default)
bcdad02f446e2 14 hours ago NotReady gpu-operator-1673526262-node-feature-discovery-master-6594glnlk gpu-operator 10 (default)
0d77009112fcd 14 hours ago NotReady kube-scheduler-z-k8s-m-2 kube-system 4 (default)
47804f95cfe93 14 hours ago NotReady kube-controller-manager-z-k8s-m-2 kube-system 4 (default)
3d5a1982db649 14 hours ago NotReady kube-apiserver-z-k8s-m-2 kube-system 12 (default)
376b18ec2589e 14 hours ago NotReady cilium-2x6gn kube-system 17 (default)
c64cc652fb3d5 14 hours ago NotReady kube-prometheus-stack-1680871060-prometheus-node-exporter-dkqqs prometheus 4 (default)
Running crictl pods -q outputs the full pod IDs, similar to:
crictl pods -q output: full pod IDs on the node
b6ce34d7999e5cf6ae273da17c47be7c76a674c75e0039792b5fe77469601fa8
23d008f4aaa1ffb0dbd1d5ee077852bb8f34b7924690633bc938448e376c91f1
aa9e0b0642dff640fb73206b50cbc1afdb25f85e236c79621645062608812867
ed5621760e60195ae1f294609ec1460c2c050cec2696e417c4d49f798263f7a6
7e041042fae4358206b7075b1e39ae87f739e3397958d483b3b18e2985392ce3
1db70ddeb87be8ec951b0b9fa8b4dbf70f453359f238e222c13c63488fdfc798
f6178f5243ea2b06e1d12623643e8caaa9d0145f7582e5c685c5f663804bcf8e
5290816dcc423a4fd04c9994bc9dcc075284b72e71de77471ae1206e646a2c64
937886c263a4a916fddc9178049e6e657eb4ae8b9c95d63d73ed2084bff9a629
bcdad02f446e2601277dc4f57d5e7d8ff04ec5d8c58f50a0bdec18a9b080f60f
0d77009112fcd344e5f45cc1e15f6f836e649c19adeda68715ba18e855e8c664
47804f95cfe93f5ad79e077e58033b13a4b554e1e1107fb97a4d4744d1413f52
3d5a1982db6493bea8ae6faa86be1d4f08bcefb96cb303017412275e63f20d12
376b18ec2589ed4a338df74fda73a473fca899cc5ef0d80226b342481cff7556
c64cc652fb3d5c277ae388c0ad75350c2672da2dcf897f2f5496c02848b3a5ba
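The full IDs are convenient to feed into other crictl subcommands; a small sketch that inspects every pod sandbox on the node (crictl inspectp prints the sandbox status as JSON):

for pod in $(crictl pods -q); do
    crictl inspectp "$pod" | head -n 20
done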
Debugging Kubernetes nodes with crictl
how to configure systemd cgroup with 1.3.X #4203 (systemd cgroup configuration for newer containerd releases)
Troubleshooting CRI-O container runtime issues (Red Hat OpenShift Atlas troubleshooting with crictl, with several worked examples)
containerd 1.4.9 Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService