Description of problem:
While conducting 500 pods/node testing with resources set on all containers, at a scale of 10 worker nodes, many pods become stuck in the ContainerCreating state.
13 node cluster
* 3 control plane nodes, 10 worker nodes
* kubeletconfig is adjusted to 530 max-pods
* cluster network host prefix is a /21 to provide plenty of IP addresses for pods on each host
* deployed network type is OpenShiftSDN
Workload is:
* 5000 namespaces
* 1 deployment per namespace, 1 service and route per deployment
* 1 pod replica per deployment
* 1 container per pod replica
* 50m cpu requests and limits set per container
* 100Mi memory requests and limits set per container
* Container image is quay.io/redhat-performance/test-gohttp-probe:v0.0.2
Prior to the resource testing, 6 similar tests ran successfully:
* 3 x 2500 namespaces (1 deploy/ns, 1 pod replica/deploy, 4 containers/pod, no set resources - best effort pods) - 250pods/node
* 3 x 5000 namespaces (1 deploy/ns, 1 pod replica/deploy, 3 containers/pod, no set resources - best effort pods) - 500pods/node
It is not clear whether this is a "compounded" issue, i.e. whether the previous tests (6 in total) caused the subsequent test with guaranteed resources set to fail.
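For reference, the per-node pod capacity and the workload's QoS class can be spot-checked with something like the following (a sketch, not commands captured from this run; it assumes the jetlag=true label, which the workload uses as its node selector, is present on the workers):
# Confirm the kubeletconfig max-pods bump (530) is in effect on the workers
oc get nodes -l jetlag=true -o custom-columns=NAME:.metadata.name,ALLOCATABLE_PODS:.status.allocatable.pods
# Confirm a sample boatload pod landed in the Guaranteed QoS class
oc get po -n boatload-1 -o jsonpath='{.items[0].status.qosClass}{"\n"}'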
Version-Release number of selected component (if applicable):
ocp 4.10.0-0.nightly-2021-11-21-005535
How reproducible:
Unclear at the moment; I will reattempt the same test without the prior tests to confirm, after gathering as much diagnostic data as possible.
Steps to Reproduce:
Actual results:
Many pods remain stuck in ContainerCreating (for 38+ hours in the captured run), with repeated FailedCreatePodSandBox / "context deadline exceeded" events.
Expected results:
Containers to become running and test to succeed without issues
Additional info:
Not all hosts failed to run their respective pods:
# Hosts with pods stuck in containercreating
# oc get po -A -o wide | grep boatload | grep ContainerCreating | awk '{print $8}' | sort | uniq -c
481 jetlag-bm13
437 jetlag-bm14
481 jetlag-bm16
465 jetlag-bm17
431 jetlag-bm18
487 jetlag-bm19
# Count of boatload pods running on each worker node
# oc get po -A -o wide | grep boatload | grep Running | awk '{print $8}' | sort | uniq -c
19 jetlag-bm13
63 jetlag-bm14
500 jetlag-bm15
19 jetlag-bm16
35 jetlag-bm17
69 jetlag-bm18
11 jetlag-bm19
501 jetlag-bm20
500 jetlag-bm21
501 jetlag-bm22
Example pods:
# oc get po -A -o wide | grep boatload | head
boatload-1 boatload-1-1-boatload-7fbbfdd9d4-9wfpf 1/1 Running 0 38h 10.130.9.24 jetlag-bm22 <none> <none>
boatload-10 boatload-10-1-boatload-9b7c4f96-llc8k 1/1 Running 0 38h 10.131.23.79 jetlag-bm15 <none> <none>
boatload-100 boatload-100-1-boatload-699b746448-sqq7l 1/1 Running 0 38h 10.130.4.32 jetlag-bm19 <none> <none>
boatload-1000 boatload-1000-1-boatload-6649fcdfbd-xwzqv 0/1 ContainerCreating 0 38h <none> jetlag-bm16 <none> <none>
boatload-1001 boatload-1001-1-boatload-76cfbcdcb5-6ngmw 1/1 Running 0 38h 10.131.9.208 jetlag-bm17 <none> <none>
boatload-1002 boatload-1002-1-boatload-6d4fff77f8-rhnbb 1/1 Running 0 38h 10.131.9.167 jetlag-bm17 <none> <none>
boatload-1003 boatload-1003-1-boatload-5ff58b46c9-vh7t7 1/1 Running 0 38h 10.130.5.117 jetlag-bm19 <none> <none>
boatload-1004 boatload-1004-1-boatload-6b7797978d-vcxzc 1/1 Running 0 38h 10.130.4.98 jetlag-bm19 <none> <none>
boatload-1005 boatload-1005-1-boatload-844768f7b9-6bpv4 1/1 Running 0 38h 10.130.9.105 jetlag-bm22 <none> <none>
boatload-1006 boatload-1006-1-boatload-78dd5c8f54-7fv6q 0/1 ContainerCreating 0 38h <none> jetlag-bm13 <none> <none>
Example stuck pod description:
# oc describe po -n boatload-1000
Name: boatload-1000-1-boatload-6649fcdfbd-xwzqv
Namespace: boatload-1000
Priority: 0
Node: jetlag-bm16/10.5.190.41
Start Time: Sun, 21 Nov 2021 18:29:12 -0600
Labels: app=boatload-1000-1
pod-template-hash=6649fcdfbd
Annotations: openshift.io/scc: restricted
Status: Pending
IPs: <none>
Controlled By: ReplicaSet/boatload-1000-1-boatload-6649fcdfbd
Containers:
boatload-1:
Container ID:
Image: quay.io/redhat-performance/test-gohttp-probe:v0.0.2
Image ID:
Port: 8000/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 50m
memory: 100Mi
Requests:
cpu: 50m
memory: 100Mi
Environment:
PORT: 8000
LISTEN_DELAY_SECONDS: 0
LIVENESS_DELAY_SECONDS: 0
READINESS_DELAY_SECONDS: 0
RESPONSE_DELAY_MILLISECONDS: 0
LIVENESS_SUCCESS_MAX: 0
READINESS_SUCCESS_MAX: 0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bts7d (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-bts7d:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Guaranteed
Node-Selectors: jetlag=true
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 79s (x524 over 38h) kubelet Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
All nodes are showing Ready, along with the crio runtime version:
# oc get no -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
jetlag-bm10 Ready master 43h v1.22.1+35a59a5 10.5.190.19 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm11 Ready master 43h v1.22.1+35a59a5 10.5.190.32 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm12 Ready master 43h v1.22.1+35a59a5 10.5.190.59 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm13 Ready worker 43h v1.22.1+35a59a5 10.5.190.42 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm14 Ready worker 43h v1.22.1+35a59a5 10.5.190.6 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm15 Ready worker 43h v1.22.1+35a59a5 10.5.190.3 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm16 Ready worker 43h v1.22.1+35a59a5 10.5.190.41 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm17 Ready worker 43h v1.22.1+35a59a5 10.5.190.15 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm18 Ready worker 43h v1.22.1+35a59a5 10.5.190.38 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm19 Ready worker 43h v1.22.1+35a59a5 10.5.190.29 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm20 Ready worker 43h v1.22.1+35a59a5 10.5.190.51 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm21 Ready worker 43h v1.22.1+35a59a5 10.5.190.30 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
jetlag-bm22 Ready worker 43h v1.22.1+35a59a5 10.5.190.37 <none> Red Hat Enterprise Linux CoreOS 410.84.202111201002-0 (Ootpa) 4.18.0-305.28.1.el8_4.x86_64 cri-o://1.23.0-26.rhaos4.10.git2874194.el8
Snippet of logs from bm16 with the stuck pod boatload-1000-1-boatload-6649fcdfbd-xwzqv:
Nov 23 15:11:52 jetlag-bm16 hyperkube[4099]: E1123 15:11:52.608785 4099 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\\\": rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv" podUID=f574c2d0-1ef5-4dc6-a861-02efab8e91cf
Nov 23 15:12:06 jetlag-bm16 hyperkube[4099]: I1123 15:12:06.608384 4099 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:12:06 jetlag-bm16 crio[4063]: time="2021-11-23 15:12:06.608871400Z" level=info msg="Running pod sandbox: boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv/POD" id=f2a8c8bc-633f-4fdd-8163-12cde38f0f88 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:12:06 jetlag-bm16 crio[4063]: time="2021-11-23 15:12:06.608939657Z" level=warning msg="error reserving pod name k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0 for id 70e38c207577f89f92000f0a6935b9bd0a6c58cadb3381ce4ff6eb260701bd19: name is reserved"
Nov 23 15:12:06 jetlag-bm16 crio[4063]: time="2021-11-23 15:12:06.608976046Z" level=info msg="Creation of sandbox k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0 not yet finished. Waiting up to 4m0s for it to finish" id=f2a8c8bc-633f-4fdd-8163-12cde38f0f88 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:12:52 jetlag-bm16 crio[4063]: time="2021-11-23 15:12:52.648784847Z" level=info msg="Got pod network &{Name:boatload-1000-1-boatload-6649fcdfbd-xwzqv Namespace:boatload-1000 ID:cd6a10461b6569af6af6bc516b00181055cf809475e5c5582fb7411d737ba83c UID:f574c2d0-1ef5-4dc6-a861-02efab8e91cf NetNS:/var/run/netns/2a2dad39-9173-43e0-87ae-964ebdb293da Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Nov 23 15:12:52 jetlag-bm16 crio[4063]: time="2021-11-23 15:12:52.648871102Z" level=info msg="Deleting pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv from CNI network \"multus-cni-network\" (type=multus)"
Nov 23 15:13:52 jetlag-bm16 crio[4063]: time="2021-11-23 15:13:52.649872109Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0(cd6a10461b6569af6af6bc516b00181055cf809475e5c5582fb7411d737ba83c): error removing pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): netplugin failed: \"2021-11-23T15:12:52Z [verbose] Del: boatload-1000:boatload-1000-1-boatload-6649fcdfbd-xwzqv:f574c2d0-1ef5-4dc6-a861-02efab8e91cf:openshift-sdn:eth0 {\\\"cniVersion\\\":\\\"0.3.1\\\",\\\"name\\\":\\\"openshift-sdn\\\",\\\"type\\\":\\\"openshift-sdn\\\"}\\n\"" id=4ed446a4-dccf-47b5-a2f8-f3f65bbaacb3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:16:06 jetlag-bm16 hyperkube[4099]: E1123 15:16:06.608772 4099 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:16:06 jetlag-bm16 hyperkube[4099]: E1123 15:16:06.608788 4099 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:16:06 jetlag-bm16 hyperkube[4099]: E1123 15:16:06.608912 4099 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\\\": rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv" podUID=f574c2d0-1ef5-4dc6-a861-02efab8e91cf
Nov 23 15:16:18 jetlag-bm16 hyperkube[4099]: I1123 15:16:18.608217 4099 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:16:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:16:18.608704970Z" level=info msg="Running pod sandbox: boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv/POD" id=b1f9fa78-9084-4386-a0e9-85091ffc4961 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:16:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:16:18.655281765Z" level=info msg="Got pod network &{Name:boatload-1000-1-boatload-6649fcdfbd-xwzqv Namespace:boatload-1000 ID:358773ee954a51345c852253e83190282730d751b9b39aa7deeae908b421ccf9 UID:f574c2d0-1ef5-4dc6-a861-02efab8e91cf NetNS:/var/run/netns/b9e58ba9-c93a-4c36-a3eb-1638d22da8b9 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Nov 23 15:16:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:16:18.655320815Z" level=info msg="Adding pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv to CNI network \"multus-cni-network\" (type=multus)"
Nov 23 15:20:18 jetlag-bm16 hyperkube[4099]: E1123 15:20:18.608622 4099 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:20:18 jetlag-bm16 hyperkube[4099]: E1123 15:20:18.608637 4099 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:20:18 jetlag-bm16 hyperkube[4099]: E1123 15:20:18.608758 4099 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000(f574c2d0-1ef5-4dc6-a861-02efab8e91cf)\\\": rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv" podUID=f574c2d0-1ef5-4dc6-a861-02efab8e91cf
Nov 23 15:20:32 jetlag-bm16 hyperkube[4099]: I1123 15:20:32.608156 4099 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:20:32 jetlag-bm16 crio[4063]: time="2021-11-23 15:20:32.608524091Z" level=info msg="Running pod sandbox: boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv/POD" id=7f6b1553-bab7-482f-81e9-2cc4044ce357 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:20:32 jetlag-bm16 crio[4063]: time="2021-11-23 15:20:32.608630663Z" level=warning msg="error reserving pod name k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0 for id 73292d4bd616988cacf8c8abe6d232fafd3cb3c711cf2d9a616e0b2e62a4eee7: name is reserved"
Nov 23 15:20:32 jetlag-bm16 crio[4063]: time="2021-11-23 15:20:32.608654528Z" level=info msg="Creation of sandbox k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0 not yet finished. Waiting up to 4m0s for it to finish" id=7f6b1553-bab7-482f-81e9-2cc4044ce357 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Nov 23 15:21:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:21:18.658500951Z" level=info msg="Got pod network &{Name:boatload-1000-1-boatload-6649fcdfbd-xwzqv Namespace:boatload-1000 ID:358773ee954a51345c852253e83190282730d751b9b39aa7deeae908b421ccf9 UID:f574c2d0-1ef5-4dc6-a861-02efab8e91cf NetNS:/var/run/netns/b9e58ba9-c93a-4c36-a3eb-1638d22da8b9 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Nov 23 15:21:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:21:18.658591955Z" level=info msg="Deleting pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv from CNI network \"multus-cni-network\" (type=multus)"
While examining one of the hosts with a number of stuck pods (jetlag-bm17), I attempted to remediate it by restarting the kubelet and, shortly afterwards, crio. The pods became unstuck and started running after doing so.
I also attempted this on another host, jetlag-bm19, restarting only crio, and this also resolved the issue after some period of time; there were no more stuck pods. In the present state, those nodes no longer have any stuck pods:
# oc get po -A -o wide | grep boatload | grep ContainerCreating | awk '{print $8}' | sort | uniq -c
481 jetlag-bm13
437 jetlag-bm14
481 jetlag-bm16
431 jetlag-bm18
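For completeness, the remediation described above amounts to roughly the following, run on the affected node (a sketch, e.g. over ssh or oc debug node/<name>; not the exact sessions):
# jetlag-bm17: restart the kubelet, then shortly afterwards crio
sudo systemctl restart kubelet
sudo systemctl restart crio
# jetlag-bm19: restarting crio alone was sufficient
sudo systemctl restart crio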
It seems a large number of pods are stuck in network creation:
grep -r 'ExecPlugin(' crio-goroutine-stacks-2021-11-23T152424Z.log | wc -l
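For anyone re-running this analysis: the goroutine dump referenced above can be produced on the node with something like the following (a sketch; it assumes CRI-O's SIGUSR1 handler writes a crio-goroutine-stacks-<timestamp>.log dump, and the /tmp location is an assumption):
# Ask the running crio to dump its goroutine stacks
sudo pkill -USR1 -x crio
# Count goroutines blocked in CNI plugin execution (adjust the path to wherever the dump landed)
sudo grep -c 'ExecPlugin(' /tmp/crio-goroutine-stacks-*.log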
Looking further into this, it seems like something is also wrong with the openshift-multus ip-reconciler pods (and the OLM collect-profiles pods) on this node:
openshift-multus ip-reconciler-27292455--1-wd94g 0/1 ContainerCreating 0 40h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292475--1-h7fmr 0/1 ContainerCreating 0 39h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292485--1-gkgxc 0/1 ContainerCreating 0 39h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292515--1-wnwvp 0/1 ContainerCreating 0 39h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292530--1-l9kvb 0/1 ContainerCreating 0 38h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292585--1-27l6s 0/1 ContainerCreating 0 37h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292615--1-zpsnb 0/1 ContainerCreating 0 37h <none> jetlag-bm16 <none> <none>
openshift-multus ip-reconciler-27292625--1-8grf9 0/1 ContainerCreating 0 37h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292410--1-dq2tg 0/1 ContainerCreating 0 40h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292440--1-xp94j 0/1 ContainerCreating 0 40h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292455--1-ntqnb 0/1 ContainerCreating 0 40h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292470--1-k76vq 0/1 ContainerCreating 0 39h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292500--1-m4sqj 0/1 ContainerCreating 0 39h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292530--1-54vqb 0/1 ContainerCreating 0 38h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292545--1-79pjh 0/1 ContainerCreating 0 38h <none> jetlag-bm16 <none> <none>
openshift-operator-lifecycle-manager collect-profiles-27292575--1-z6dkj 0/1 ContainerCreating 0 38h <none> jetlag-bm16 <none> <none>
Interestingly, it seems the multus ip-reconciler pod itself relies on the pod network, and it hits the same teardown failure:
Nov 23 12:22:44 jetlag-bm16 crio[4063]: time="2021-11-23 12:22:44.650636389Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_ip-reconciler-27292515--1-wnwvp_openshift-multus_6cdec360-bb54-423e-a915-db3ec050f82e_0(ccaedb039cf231f4cca6a30a1e430de2625dbadf685e2a1c2a563e3f741358e3): error removing pod openshift-multus_ip-reconciler-27292515--1-wnwvp from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): netplugin failed: \"2021-11-23T12:21:44Z [verbose] Del: openshift-multus:ip-reconciler-27292515--1-wnwvp:6cdec360-bb54-423e-a915-db3ec050f82e:openshift-sdn:eth0 {\\\"cniVersion\\\":\\\"0.3.1\\\",\\\"name\\\":\\\"openshift-sdn\\\",\\\"type\\\":\\\"openshift-sdn\\\"}\\n\"" id=6d2bdbec-f862-42e6-872c-863f86150584 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Finally, we have this error:
Nov 23 12:22:47 jetlag-bm16 crio[4063]: time="2021-11-23 12:22:47.670618881Z" level=error msg="error loading cached network config: network \"multus-cni-network\" not found in CNI cache"
So I am reassigning this to Multus for now for further investigation.
A couple of notes based on the error message as well as the provided must-gather:
The error message:
Nov 23 12:22:47 jetlag-bm16 crio[4063]: time="2021-11-23 12:22:47.670618881Z" level=error msg="error loading cached network config: network \"multus-cni-network\" not found in CNI cache"
Appears to come from ocicni: https://github.com/cri-o/ocicni/blob/master/pkg/ocicni/ocicni.go#L502
In this case, I don't believe it's Multus loading this libcni cache.
In the provided must-gather, this error ("error loading cached network config") occurs only twice.
I further analyzed the must-gather looking for similar errors, and found that Multus did, at least at times, experience API connectivity issues. Using this command:
cat ./must-gather/path/to/kubelet_service.log | grep -i multus | grep -i error
I can see a number of errors (~19) appearing over a ~2 minute period between Nov 21 19:18-19:20, which look like:
Nov 21 19:20:45.048296 jetlag-bm11 hyperkube[4106]: E1121 19:20:45.048253 4106 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"210b6f0c-a3e6-4569-a327-7ebf4a638a17\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_installer-7-jetlag-bm11_openshift-kube-controller-manager_210b6f0c-a3e6-4569-a327-7ebf4a638a17_0(aabe8f63424931c86d3441b4677aa64599347135309eea68c0ce006f145d98cb): error removing pod openshift-kube-controller-manager_installer-7-jetlag-bm11 from CNI network \\\"multus-cni-network\\\": plugin type=\\\"multus\\\" name=\\\"multus-cni-network\\\" failed (delete): Multus: [openshift-kube-controller-manager/installer-7-jetlag-bm11]: error getting pod: Get \\\"https://[api-int.jetlag-ibm1.performance-scale.cloud]:6443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-7-jetlag-bm11?timeout=1m0s\\\": dial tcp 10.5.190.43:6443: connect: connection refused\"" pod="openshift-kube-controller-manager/installer-7-jetlag-bm11" podUID=210b6f0c-a3e6-4569-a327-7ebf4a638a17
This appears to be an API connectivity issue, which may be related to Peter's assessment that OVN is experiencing problems handling all the requests.
For errors reading "Error stopping network on cleanup", I do see one instance during the same time period, which reads:
must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-95458c54c1f17a0f6d3e76eac8b50e3c2a4531d9fb0bc859ea1518845f63a156/host_service_logs/masters/crio_service.log:3644:Nov 21 19:18:19.487729 jetlag-bm10 crio[4009]: time="2021-11-21 19:18:19.487692065Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_apiserver-7ff54f9fc4-bkfxz_openshift-oauth-apiserver_71cf19aa-6555-44f7-ba8f-491f23ed87b7_0(f5124bf00af73eb30b6074543fefb2bf4087262a9cca69e47b91d69545f32015): error removing pod openshift-oauth-apiserver_apiserver-7ff54f9fc4-bkfxz from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): Multus: [openshift-oauth-apiserver/apiserver-7ff54f9fc4-bkfxz]: error getting pod: Get \"https://[api-int.jetlag-ibm1.performance-scale.cloud]:6443/api/v1/namespaces/openshift-oauth-apiserver/pods/apiserver-7ff54f9fc4-bkfxz?timeout=1m0s\": dial tcp 10.5.190.43:6443: connect: connection refused" id=da5bdd26-44c8-40aa-8ca1-b9db82e51181 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
This is also indicative of a failure to reach the API server.
I'm not certain which component should be assigned for analysis of the OCICNI error; however, my recommendation is that the BZ be assigned to the Networking component for an analysis of the API connectivity from the OVN side.
Nov 24 12:42:34 jetlag-bm16 hyperkube[4099]: E1124 12:42:34.608477 4099 pod_workers.go:836] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"boatload-2079-1-boatload-597c5f8dd9-2b6sw_boatload-2079(6a29a012-a934-4da2-8838-52f8875f2dcb)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"boatload-2079-1-boatload-597c5f8dd9-2b6sw_boatload-2079(6a29a012-a934-4da2-8838-52f8875f2dcb)\\\": rpc error: code = Unknown desc = Kubelet may be retrying requests that are timing out in CRI-O due to system load: context deadline exceeded: error reserving pod name k8s_boatload-2079-1-boatload-597c5f8dd9-2b6sw_boatload-2079_6a29a012-a934-4da2-8838-52f8875f2dcb_0 for id 5e2984836950f79fc0833a46a5a41f37d0815f3b94c2c48f4b8940c93d1a2dd6: name is reserved\"" pod="boatload-2079/boatload-2079-1-boatload-597c5f8dd9-2b6sw" podUID=6a29a012-a934-4da2-8838-52f8875f2dcb
This looks related to: https://bugzilla.redhat.com/show_bug.cgi?id=1785399
Additionally, there are many occurrences of the CNI cache error noted above:
[core@jetlag-bm16 ~]$ journalctl -u crio | grep -i "CNI cache" | wc -l
11086
Looking at the CNI configuration, we can see the network name that ocicni uses for its on-disk cache, which appears in the error messages:
[core@jetlag-bm16 ~]$ sudo cat /etc/kubernetes/cni/net.d/00-multus.conf
{ "cniVersion": "0.3.1", "name": "multus-cni-network", "type": "multus", "namespaceIsolation": true, "globalNamespaces": "default,openshift-multus,openshift-sriov-network-operator", "logLevel": "verbose", "bin
Dir": "/opt/multus/bin", "readinessindicatorfile": "/var/run/multus/cni/net.d/80-openshift-network.conf", "kubeconfig": "/etc/kubernetes/cni/net.d/multus.d/multus.kubeconfig", "delegates": [ { "cniVersion": "0
.3.1", "name": "openshift-sdn", "type": "openshift-sdn" } ] }
Given: "name": "multus-cni-network"
And we see errors like:
Nov 23 12:22:47 jetlag-bm16 crio[4063]: time="2021-11-23 12:22:47.670618881Z" level=error msg="error loading cached network config: network \"multus-cni-network\" not found in CNI cache"
This is the default network name in OpenShift according to the CNI configuration. crio caches this information using ocicni (and libcni? unsure), and the cache entry carries the name "multus-cni-network"; the error message is from crio, which is looking for this cached config on disk but not finding it. So I think we need a crio / ocicni expert to analyze this.
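One place a crio/ocicni person could start is comparing what is actually cached on disk against the pods crio is trying to tear down (a sketch; /var/lib/cni/results is the libcni default cache location and is an assumption here, as crio may be configured with a different cache dir):
# Count cached CNI results for the default network on the affected node
sudo ls /var/lib/cni/results/ | grep multus-cni-network | wc -l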
A bit more detail on how reproducible this bug is and what scenarios seem to reproduce it:
I have now been able to reproduce this in the scenario without resource limits set, so it seems it could be exacerbated by a compounding effect of repeatedly running tests that scale to 500 pods/node across multiple nodes (10 nodes). I did not see this with 5 nodes running the same tests.
I am under the impression that this is still a crio bug, because I can systemctl restart crio and the pods come out of ContainerCreating.
Note when crio was restarted:
[root@jetlag-bm13 ~]# systemctl status crio
● crio.service - Container Runtime Interface for OCI (CRI-O)
Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/crio.service.d
└─10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf, 20-stream-address.conf
Active: active (running) since Thu 2021-12-02 17:09:31 UTC; 1h 48min ago
For example, take the pod boatload-10-1-boatload-9b7c4f96-cvd59, which was stuck in ContainerCreating. I restarted crio at 17:09:31, and we can see in the crio logs that the pod now comes up; oc queries against the cluster show the pod as Running.
Crio logs:
# grep "boatload-10-1" crio-post.log
......
Dec 02 17:07:35 jetlag-bm13 crio[4054]: time="2021-12-02 17:07:35.288562036Z" level=info msg="Running pod sandbox: boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/POD" id=2e27b135-a159-4cb5-b1a9-3f6b735b45a3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 17:07:35 jetlag-bm13 crio[4054]: time="2021-12-02 17:07:35.288664413Z" level=warning msg="error reserving pod name k8s_boatload-10-1-boatload-9b7c4f96-cvd59_boatload-10_4063738a-1db9-4e42-9f2c-f070db3979f7_0 for id cf26a79ce4ff0813a41c79aeaafb1f1398ee9c9dabbbb36164506b3fb2db9b40: name is reserved"
Dec 02 17:07:35 jetlag-bm13 crio[4054]: time="2021-12-02 17:07:35.288688655Z" level=info msg="Creation of sandbox k8s_boatload-10-1-boatload-9b7c4f96-cvd59_boatload-10_4063738a-1db9-4e42-9f2c-f070db3979f7_0 not yet finished. Waiting up to 4m0s for it to finish" id=2e27b135-a159-4cb5-b1a9-3f6b735b45a3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 17:08:21 jetlag-bm13 crio[4054]: time="2021-12-02 17:08:21.407332423Z" level=info msg="Got pod network &{Name:boatload-10-1-boatload-9b7c4f96-cvd59 Namespace:boatload-10 ID:0609e955578cccdcfc4220378aee9074a8b727ce4ab478a481f1062fd1c95f3e UID:4063738a-1db9-4e42-9f2c-f070db3979f7 NetNS:/var/run/netns/af6a2d13-cf8d-4c89-a9a7-7d15f65a2ada Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 17:08:21 jetlag-bm13 crio[4054]: time="2021-12-02 17:08:21.407411823Z" level=info msg="Deleting pod boatload-10_boatload-10-1-boatload-9b7c4f96-cvd59 from CNI network \"multus-cni-network\" (type=multus)"
Dec 02 17:08:27 jetlag-bm13 crio[4054]: time="2021-12-02 17:08:27.183614655Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_boatload-10-1-boatload-9b7c4f96-cvd59_boatload-10_4063738a-1db9-4e42-9f2c-f070db3979f7_0(0609e955578cccdcfc4220378aee9074a8b727ce4ab478a481f1062fd1c95f3e): error removing pod boatload-10_boatload-10-1-boatload-9b7c4f96-cvd59 from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): netplugin failed: \"2021-12-02T17:08:21Z [verbose] Del: boatload-10:boatload-10-1-boatload-9b7c4f96-cvd59:4063738a-1db9-4e42-9f2c-f070db3979f7:openshift-sdn:eth0 {\\\"cniVersion\\\":\\\"0.3.1\\\",\\\"name\\\":\\\"openshift-sdn\\\",\\\"type\\\":\\\"openshift-sdn\\\"}\\n\"" id=d84735af-79af-4b65-8d59-209ef829b8f6 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 17:10:05 jetlag-bm13 crio[4136313]: time="2021-12-02 17:10:05.101495652Z" level=info msg="Running pod sandbox: boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/POD" id=07489922-a763-423e-95d8-46acd23eb5f1 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 17:10:07 jetlag-bm13 crio[4136313]: time="2021-12-02 17:10:07.274084114Z" level=info msg="Got pod network &{Name:boatload-10-1-boatload-9b7c4f96-cvd59 Namespace:boatload-10 ID:59a757f69399947f30a3648a623a1a259d5df9dc903a028bfcf7b0d5a4abbb88 UID:4063738a-1db9-4e42-9f2c-f070db3979f7 NetNS:/var/run/netns/1a0b80ba-b15f-42a8-915d-35d5d9c19d4a Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 17:10:07 jetlag-bm13 crio[4136313]: time="2021-12-02 17:10:07.274125256Z" level=info msg="Adding pod boatload-10_boatload-10-1-boatload-9b7c4f96-cvd59 to CNI network \"multus-cni-network\" (type=multus)"
Dec 02 17:11:37 jetlag-bm13 crio[4136313]: 2021-12-02T17:11:37Z [verbose] Add: boatload-10:boatload-10-1-boatload-9b7c4f96-cvd59:4063738a-1db9-4e42-9f2c-f070db3979f7:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/1a0b80ba-b15f-42a8-915d-35d5d9c19d4a"}],"ips":[{"version":"4","interface":0,"address":"10.128.19.226/21"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.16.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Dec 02 17:11:37 jetlag-bm13 crio[4136313]: I1202 17:11:37.810910 4172319 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"boatload-10", Name:"boatload-10-1-boatload-9b7c4f96-cvd59", UID:"4063738a-1db9-4e42-9f2c-f070db3979f7", APIVersion:"v1", ResourceVersion:"2982587", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.19.226/21] from openshift-sdn
Dec 02 17:11:37 jetlag-bm13 crio[4136313]: time="2021-12-02 17:11:37.837146163Z" level=info msg="Got pod network &{Name:boatload-10-1-boatload-9b7c4f96-cvd59 Namespace:boatload-10 ID:59a757f69399947f30a3648a623a1a259d5df9dc903a028bfcf7b0d5a4abbb88 UID:4063738a-1db9-4e42-9f2c-f070db3979f7 NetNS:/var/run/netns/1a0b80ba-b15f-42a8-915d-35d5d9c19d4a Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 17:11:37 jetlag-bm13 crio[4136313]: time="2021-12-02 17:11:37.837281153Z" level=info msg="Checking pod boatload-10_boatload-10-1-boatload-9b7c4f96-cvd59 for CNI network multus-cni-network (type=multus)"
Dec 02 17:11:38 jetlag-bm13 crio[4136313]: time="2021-12-02 17:11:38.518841494Z" level=info msg="Ran pod sandbox 59a757f69399947f30a3648a623a1a259d5df9dc903a028bfcf7b0d5a4abbb88 with infra container: boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/POD" id=07489922-a763-423e-95d8-46acd23eb5f1 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 17:11:38 jetlag-bm13 crio[4136313]: time="2021-12-02 17:11:38.805721350Z" level=info msg="Creating container: boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/boatload-1" id=09f764a1-fbec-4280-9408-ecc4638c52df name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 17:12:08 jetlag-bm13 crio[4136313]: time="2021-12-02 17:12:08.917455413Z" level=info msg="Created container d29a1b1e5991a91439b6cb120f8b1c114b43d5bfc503f16eea836fe3f1922bb6: boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/boatload-1" id=09f764a1-fbec-4280-9408-ecc4638c52df name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 17:12:09 jetlag-bm13 crio[4136313]: time="2021-12-02 17:12:09.199963444Z" level=info msg="Started container" PID=15228 containerID=d29a1b1e5991a91439b6cb120f8b1c114b43d5bfc503f16eea836fe3f1922bb6 description=boatload-10/boatload-10-1-boatload-9b7c4f96-cvd59/boatload-1 id=2a5012a5-3986-4b53-ac2b-0b24d51840d8 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=59a757f69399947f30a3648a623a1a259d5df9dc903a028bfcf7b0d5a4abbb88
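And confirming from the API side that the previously stuck pod is now Running (a sketch of the check, not output captured from the cluster):
# The pod that was stuck in ContainerCreating before the crio restart
oc get po -n boatload-10 -o wide | grep 9b7c4f96-cvd59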
Please advise: what additional data would help debug this issue?
This is on openshift-sdn, not OVN. The context deadline exceeded is the 4-minute timeout:
Nov 23 15:16:18 jetlag-bm16 crio[4063]: time="2021-11-23 15:16:18.655320815Z" level=info msg="Adding pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv to CNI network \"multus-cni-network\" (type=multus)"
Nov 23 15:20:18 jetlag-bm16 hyperkube[4099]: E1123 15:20:18.608622 4099 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
Nov 23 15:20:18 jetlag-bm16 hyperkube[4099]: E1123 15:20:18.608637 4099 kuberuntime_manager.go:818] "CreatePodSandbox for pod failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv"
In the openshift-sdn logs for this pod:
2021-11-23T15:02:23.755503272Z I1123 15:02:23.755368 5080 pod.go:541] CNI_DEL boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv
2021-11-23T15:05:01.175191940Z I1123 15:05:01.175139 5080 pod.go:505] CNI_ADD boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv got IP 10.129.16.251, ofport 29198
2021-11-23T15:10:58.829235866Z I1123 15:10:58.829192 5080 pod.go:541] CNI_DEL boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv
2021-11-23T15:13:40.023412192Z I1123 15:13:40.023306 5080 pod.go:505] CNI_ADD boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv got IP 10.129.18.237, ofport 29407
2021-11-23T15:19:21.047188256Z I1123 15:19:21.047150 5080 pod.go:541] CNI_DEL boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv
2021-11-23T15:22:05.210671646Z I1123 15:22:05.210611 5080 pod.go:505] CNI_ADD boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv got IP 10.129.20.221, ofport 29621
2021-11-23T15:27:51.635704973Z I1123 15:27:51.635562 5080 pod.go:541] CNI_DEL boatload-1000/boatload-1000-1-boatload-6649fcdfbd-xwzqv
openshift-sdn prints the "got IP ..." line after CNI success. You can see there is no CNI success entry between 15:16:18 and 15:20:18. The previous CNI delete:
Nov 23 15:13:52 jetlag-bm16 crio[4063]: time="2021-11-23 15:13:52.649872109Z" level=error msg="Error stopping network on cleanup: failed to destroy network for pod sandbox k8s_boatload-1000-1-boatload-6649fcdfbd-xwzqv_boatload-1000_f574c2d0-1ef5-4dc6-a861-02efab8e91cf_0(cd6a10461b6569af6af6bc516b00181055cf809475e5c5582fb7411d737ba83c): error removing pod boatload-1000_boatload-1000-1-boatload-6649fcdfbd-xwzqv from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): netplugin failed: \"2021-11-23T15:12:52Z [verbose] Del: boatload-1000:boatload-1000-1-boatload-6649fcdfbd-xwzqv:f574c2d0-1ef5-4dc6-a861-02efab8e91cf:openshift-sdn:eth0
This doesn't seem to line up with a CNI delete in the sdn log either, so something is lagging behind. I would suggest looking at the metrics here:
https://github.com/openshift/sdn/blob/master/pkg/network/node/pod.go#L472
This will show whether it is an openshift-sdn or a crio problem.
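If it helps, the CNI_ADD/CNI_DEL history quoted above can be pulled per node with something like the following (a sketch; it assumes the sdn daemonset pods carry the app=sdn label and an "sdn" container, with jetlag-bm16 as the example node):
# Grab the sdn pod running on the node hosting the stuck pod and grep its CNI history for that pod
SDN_POD=$(oc get po -n openshift-sdn -l app=sdn --field-selector spec.nodeName=jetlag-bm16 -o name | head -1)
oc logs -n openshift-sdn "$SDN_POD" -c sdn | grep boatload-1000-1-boatload-6649fcdfbd-xwzqv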