
Troubleshooting Ansible Automation Platform


Red Hat Ansible Automation Platform 2.4

Troubleshoot issues with Ansible Automation Platform

Red Hat Customer Content Services

Legal Notice

Abstract

This guide provides troubleshooting topics for Red Hat Ansible Automation Platform. Use the Troubleshooting Ansible Automation Platform guide to troubleshoot your Ansible Automation Platform installation.

Providing feedback on Red Hat documentation

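If you have a suggestion to improve this documentation, or find an error, contact technical support at https://access.redhat.com to create an issue in the Ansible Automation Platform Jira project by using the docs-product component.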

Chapter 1. Diagnosing the problem

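To start troubleshooting Ansible Automation Platform, use the must-gather command on OpenShift Container Platform or the sos utility on a VM-based installation to collect configuration and diagnostic information. You can attach the output of these utilities to your support case.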

1.1. Troubleshooting Ansible Automation Platform on OpenShift Container Platform by using the must-gather command

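The oc adm must-gather command line interface (CLI) command collects information from your Ansible Automation Platform installation deployed on OpenShift Container Platform. It gathers information that is often needed for debugging issues, including resource definitions and service logs. Running the oc adm must-gather CLI command creates a new directory containing the collected data, which you can use to troubleshoot or attach to your support case. If your OpenShift environment does not have access to registry.redhat.io and you cannot run the must-gather command, run the oc adm inspect command instead.

Prerequisites

The OpenShift CLI (oc) is installed.

Procedure

Log in to the cluster: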

oc login <openshift_url>

Run one of the following commands, depending on your level of access in the cluster:

Run must-gather across the entire cluster:

oc adm must-gather --image=registry.redhat.io/ansible-automation-platform-24/aap-must-gather-rhel8 --dest-dir <dest_dir>

--image specifies the image that gathers the data. --dest-dir specifies the directory for the output.

Run must-gather for a specific namespace in your cluster:
```
oc adm must-gather --image=registry.redhat.io/ansible-automation-platform-24/aap-must-gather-rhel8 --dest-dir <dest_dir> -- /usr/bin/ns-gather <namespace>
```
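-- /usr/bin/ns-gather limits the must-gather data collection to the specified namespace.

To attach the must-gather archive to your support case, create a compressed file from the must-gather directory created earlier and attach it to your support case. For example, on a computer that uses a Linux operating system, run the following command, replacing must-gather.local.5421342344627712289/ with the name of your must-gather directory: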
```
$ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/
```
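If you prefer to script the archiving step, the following Python sketch produces a gzip-compressed tarball with the same directory contents as the tar command above. It uses only the standard library tarfile module; the function name archive_must_gather and its default archive name are illustrative assumptions, not part of any Red Hat tooling.

```python
import tarfile
from pathlib import Path


def archive_must_gather(src_dir, archive_path="must-gather.tar.gz"):
    """Pack a must-gather output directory into a gzip-compressed tarball.

    Hypothetical helper: produces an archive with the same layout as
    `tar cvaf must-gather.tar.gz <dir>/` when the name ends in .tar.gz.
    """
    src = Path(src_dir)  # Path() drops any trailing slash
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores entries under the directory's own name, so absolute
        # paths from the collecting machine are not embedded in the archive.
        tar.add(str(src), arcname=src.name)
    return Path(archive_path)
```

For example, archive_must_gather("must-gather.local.5421342344627712289/") writes must-gather.tar.gz in the current directory, which you can then attach to the support case.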

Additional resources

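For information about installing the OpenShift CLI (oc), see Installing the OpenShift CLI in the OpenShift Container Platform documentation. For information about running the oc adm inspect command, see the oc adm inspect section in the OpenShift Container Platform documentation.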

1.2. Troubleshooting VM-based installations by generating an sos report

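The sos utility collects configuration, diagnostic, and troubleshooting data from Ansible Automation Platform on a VM-based installation. For more information about installing and using the sos utility, see Generating an sos report for technical support.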

Chapter 2. Resources for troubleshooting automation controller

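For information about troubleshooting automation controller, see Troubleshooting automation controller in the Automation controller administration guide. For information about troubleshooting the performance of automation controller, see Performance troubleshooting for automation controller in the Automation controller administration guide.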

Chapter 3. Backup and recovery

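For information about performing a backup and recovery of Ansible Automation Platform, see Backup and restore in the Automation controller administration guide. For information about troubleshooting backup and recovery for installations of Ansible Automation Platform Operator on OpenShift Container Platform, see the Troubleshooting section in the Red Hat Ansible Automation Platform operator backup and recovery guide: https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/2.4/html/red_hat_ansible_automation_platform_operator_backup_and_recovery_guide/aap-troubleshoot-backup-recover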

Chapter 4. Execution environments

Troubleshoot issues with execution environments.

4.1. Issue - Cannot select the "Use in Controller" option for execution environment image on private automation hub

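You cannot use the Use in Controller option for an execution environment image on private automation hub. You also receive the error message: "No Controllers available". To resolve this issue, connect automation controller to your private automation hub instance.

Procedure

Change the /etc/pulp/settings.py file on private automation hub and add one of the following parameters, depending on your configuration:

Single controller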

CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.node>']

Many controllers behind a load balancer

CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.loadbalancer>']

Many controllers without a load balancer

CONNECTED_ANSIBLE_CONTROLLERS = ['<https://my.controller.node1>', '<https://my.controller2.node2>']
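Because a malformed URL in this setting is easy to miss, the sketch below shows one way to render and update the CONNECTED_ANSIBLE_CONTROLLERS line programmatically. It is a minimal illustration that assumes the setting fits on a single line in /etc/pulp/settings.py and that controller or load balancer URLs use https, as in the examples above. The helper names render_setting and update_settings_file are hypothetical; back up the file before editing it.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

SETTING = "CONNECTED_ANSIBLE_CONTROLLERS"


def render_setting(urls):
    """Return the settings.py line for the given controller or load balancer URLs."""
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only https:// URLs with a host part are acceptable.
        if parsed.scheme != "https" or not parsed.netloc:
            raise ValueError(f"expected an https:// URL, got {url!r}")
    return f"{SETTING} = {list(urls)!r}"


def update_settings_file(path, urls):
    """Replace an existing single-line CONNECTED_ANSIBLE_CONTROLLERS entry, or append one."""
    path = Path(path)
    new_line = render_setting(urls)
    lines = path.read_text().splitlines() if path.exists() else []
    pattern = re.compile(rf"^\s*{SETTING}\s*=")
    replaced = False
    for i, line in enumerate(lines):
        if pattern.match(line):
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n")
    return new_line
```

Rewriting the whole line, instead of appending a second assignment, avoids leaving an older value earlier in the file that a later reader might mistake for the active one.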

Stop all of the private automation hub services:

# systemctl stop pulpcore.service pulpcore-api.service pulpcore-content.service [email protected] [email protected] nginx.service redis.service

Start all of the private automation hub services:

# systemctl start pulpcore.service pulpcore-api.service pulpcore-content.service [email protected] [email protected] nginx.service redis.service

Verification

Verify that you can now use the Use in Controller option in private automation hub.

Chapter 5. Installation

Troubleshoot issues with your installation.

5.1. Issue - Cannot locate certain packages that come bundled with the Ansible Automation Platform installer

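You cannot locate certain packages that come bundled with the Ansible Automation Platform installer, or you see a "Repositories disabled by configuration" message. To resolve this issue, enable the repository by using the subscription-manager command on the command line. For more information about resolving this issue, see the Troubleshooting section of Attaching your Red Hat Ansible Automation Platform subscription in the Red Hat Ansible Automation Platform Planning Guide.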

Chapter 6. Jobs

Troubleshoot issues with jobs.

6.1. Issue - Jobs are failing when run against localhost

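In Ansible Automation Platform 2 and its containerized execution environments, the use of localhost has changed. For more information, see Converting playbooks for AAP 2 in the Red Hat Ansible Automation Platform Upgrade and Migration Guide.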

6.2. Issue - Jobs are failing with "ERROR! couldn't resolve module/action" error message

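Jobs fail with the error message "ERROR! couldn't resolve module/action 'module name'. This often indicates a misspelling, missing collection, or incorrect module path". This error can happen when the collection associated with the module is missing from the execution environment. The recommended resolution is to create a custom execution environment and add the required collections inside that execution environment. For more information about creating an execution environment, see Using Ansible Builder in Creating and Consuming Execution Environments.

Alternatively, you can complete the following steps:

Create a collections folder inside your project repository.

Add a requirements.yml file inside the collections folder and add the collection: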

collections:
- <collection_name>
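The two manual steps above can be scripted. This Python sketch creates collections/requirements.yml inside a checked-out project repository, using the same layout as the example. write_collection_requirements is a hypothetical helper, and it writes the YAML by hand to avoid depending on a YAML library.

```python
from pathlib import Path


def write_collection_requirements(project_dir, collections):
    """Create <project_dir>/collections/requirements.yml listing the given collections.

    Hypothetical helper: emits a top-level `collections:` key with one
    list item per collection name, matching the example layout.
    """
    if not collections:
        raise ValueError("at least one collection name is required")
    coll_dir = Path(project_dir) / "collections"
    coll_dir.mkdir(parents=True, exist_ok=True)
    lines = ["collections:"] + [f"- {name}" for name in collections]
    target = coll_dir / "requirements.yml"
    target.write_text("\n".join(lines) + "\n")
    return target
```

For example, write_collection_requirements("/path/to/project", ["community.general"]) creates the file; commit and push it so that the next project sync picks it up. The collection name here is only an example.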

6.3. Issue - Jobs are failing with "Timeout (12s) waiting for privilege escalation prompt" error message

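This error can happen when the timeout value is too small, causing the job to stop before it completes. The default timeout value for connection plugins is 10 seconds. To resolve this issue, increase the timeout value by using one of the following methods.

Note: The following changes affect all of the jobs in automation controller. To use a timeout value for a specific project, add an ansible.cfg file in the root of the project directory and add the timeout parameter value to that ansible.cfg file.

Add ANSIBLE_TIMEOUT as an environment variable in the automation controller UI

Go to automation controller. From the navigation panel, select Settings → Jobs settings. Under Extra Environment Variables, add the following:

"ANSIBLE_TIMEOUT": 60

Add a timeout value in the [defaults] section of the ansible.cfg file by using the CLI

Edit the /etc/ansible/ansible.cfg file and add the following: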

[defaults]
timeout = 60
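If you manage ansible.cfg with scripts rather than by hand, the following sketch sets timeout under [defaults] with Python's configparser while keeping the other options. It is an illustration under stated assumptions: configparser discards comments and normalizes formatting when it writes the file back, so review the result (or keep a backup) before replacing a hand-maintained /etc/ansible/ansible.cfg. The helper name set_default_timeout is hypothetical.

```python
import configparser
from pathlib import Path


def set_default_timeout(cfg_path, seconds=60):
    """Set `timeout` under [defaults] in an ansible.cfg, keeping other options.

    Note: comments in the original file are not preserved on rewrite.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # interpolation=None leaves '%' characters in existing values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    path = Path(cfg_path)
    if path.exists():
        parser.read(path)
    if not parser.has_section("defaults"):
        parser.add_section("defaults")
    parser.set("defaults", "timeout", str(seconds))
    with path.open("w") as fh:
        parser.write(fh)
    return parser.getint("defaults", "timeout")
```

The same helper works for a project-level ansible.cfg in the root of the project directory, which limits the change to that project.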

Run an ad hoc command with a timeout

To run an ad hoc command from the command line, add the --timeout flag to the ansible-playbook command, for example:

# ansible-playbook --timeout=60 <your_playbook.yml>

Additional resources

For more information about the DEFAULT_TIMEOUT configuration setting, see DEFAULT_TIMEOUT in the Ansible Community Documentation.

6.4. Issue - Jobs in automation controller are stuck in a pending state

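After you launch jobs in automation controller, the jobs stay in a pending state and do not start. There are a few reasons why jobs can become stuck in a pending state. For more information about troubleshooting this issue, see Playbook stays in pending in the Automation controller administration guide.

Cancel all pending jobs

Run the following commands to list all of the pending jobs: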

# awx-manage shell_plus

>>> UnifiedJob.objects.filter(status='pending')

Run the following command to cancel all of the pending jobs:

>>> UnifiedJob.objects.filter(status='pending').update(status='canceled')

Cancel a single job by using a job ID

To cancel a specific job, run the following commands, replacing <job_id> with the ID of the job that you want to cancel:

# awx-manage shell_plus

>>> UnifiedJob.objects.filter(id=<job_id>).update(status='canceled')
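The same ORM statements can be passed to awx-manage shell_plus on standard input, which is the pattern this guide uses for SESSIONS_PER_USER in Chapter 7. The sketch below builds the cancel statement, validates the job ID as a positive integer so that nothing else is injected into the shell session, and then runs it. cancel_statement and run_in_shell_plus are hypothetical helpers; the sketch assumes it runs as root on an automation controller node with awx-manage on the PATH.

```python
import shutil
import subprocess


def cancel_statement(job_id=None):
    """Build the ORM statement: cancel one job if job_id is given, else all pending jobs."""
    if job_id is None:
        return "UnifiedJob.objects.filter(status='pending').update(status='canceled')"
    # Reject anything that is not a positive integer (bool is an int subclass, so exclude it).
    if isinstance(job_id, bool) or not isinstance(job_id, int) or job_id <= 0:
        raise ValueError(f"job_id must be a positive integer, got {job_id!r}")
    return f"UnifiedJob.objects.filter(id={job_id}).update(status='canceled')"


def run_in_shell_plus(statement):
    """Feed one statement to `awx-manage shell_plus --quiet` on stdin and return its output."""
    if shutil.which("awx-manage") is None:
        raise RuntimeError("awx-manage not found; run this on an automation controller node")
    result = subprocess.run(
        ["awx-manage", "shell_plus", "--quiet"],
        input=statement + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_in_shell_plus(cancel_statement(42)) cancels job 42, and run_in_shell_plus(cancel_statement()) cancels every pending job. List the pending jobs first, as shown above, so that you know what will be canceled.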

6.5. Issue - Jobs in private automation hub are failing with "denied: requested access to the resource is denied, unauthorized: Insufficient permissions" error message

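When you use an execution environment in private automation hub, jobs fail with the error message "denied: requested access to the resource is denied, unauthorized: Insufficient permissions". This issue happens when your private automation hub is protected with a password or token and the registry credential is not assigned to the execution environment.

Procedure

Go to automation controller. From the navigation panel, select Administration → Execution Environments. Click the execution environment assigned to the failing job template. Click Edit. Assign the appropriate Registry credential from your private automation hub to the execution environment. For more information about creating new credentials in automation controller, see Creating new credentials in the Automation controller user guide.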

Chapter 7. Login

Troubleshoot login issues.

7.1. Issue - Logging in to the automation controller UI results in "Invalid username or password. Please try again."

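When you try to log in to the automation controller UI, the login fails and you see the error message "Invalid username or password. Please try again." One reason this can happen is that the value of Maximum number of simultaneous logged in sessions is 0. The Maximum number of simultaneous logged in sessions value determines the maximum number of sessions allowed per user. If this value is 0, you cannot log in to automation controller. The default value is -1, which disables the limit on sessions, so you can have multiple sessions without an enforced maximum.

As the root user, run the following command from the command line to set the SESSIONS_PER_USER variable to -1, which disables the maximum number of sessions allowed: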

# echo "settings.SESSIONS_PER_USER = -1" | awx-manage shell_plus --quiet

Verification

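Verify that you can log in to automation controller successfully. For more information about installing and using the controller node CLI, see AWX Command Line Interface and AWX manage utility. For more information about session limits, see Session limits in the Automation controller administration guide.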

Chapter 8. Networking

Troubleshoot networking issues.

8.1. Issue - The default subnet used in Ansible Automation Platform containers conflicts with the internal network

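The default subnet used in Ansible Automation Platform containers conflicts with your internal network, resulting in "No route to host" errors. To resolve this issue, update the default Classless Inter-Domain Routing (CIDR) value so that it does not conflict with the CIDR used by the default Podman networking plugin.

Procedure

On all controller and hybrid nodes, run the following commands to create a file called custom.py: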

# touch /etc/tower/conf.d/custom.py

# chmod 640 /etc/tower/conf.d/custom.py

# chown root:awx /etc/tower/conf.d/custom.py

Add the following to the /etc/tower/conf.d/custom.py file:

DEFAULT_CONTAINER_RUN_OPTIONS = ['--network', 'slirp4netns:enable_ipv6=true,cidr=192.0.2.0/24']

192.0.2.0/24 is the value of the new CIDR in this example.

Stop and start the automation controller service on all controller and hybrid nodes:
```
# automation-controller-service stop
```
```
# automation-controller-service start
```
All containers now start on the new CIDR.
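Before you pick a replacement CIDR, you can check it against the subnets already in use on your internal network. This sketch uses Python's ipaddress module and, when the candidate is free, returns the DEFAULT_CONTAINER_RUN_OPTIONS line in the same form as the custom.py example above. check_container_cidr is a hypothetical helper, and the list of internal networks is an assumption you must fill in for your environment. For context, slirp4netns documents 10.0.2.0/24 as its default CIDR, which can overlap internal 10.0.0.0/8 ranges.

```python
import ipaddress


def check_container_cidr(candidate, internal_networks):
    """Return the custom.py line for `candidate` if it overlaps none of the internal networks."""
    # strict=True rejects values with host bits set, such as 192.0.2.1/24.
    cidr = ipaddress.ip_network(candidate, strict=True)
    clashes = [
        net
        for net in map(ipaddress.ip_network, internal_networks)
        if net.version == cidr.version and net.overlaps(cidr)
    ]
    if clashes:
        raise ValueError(f"{cidr} overlaps internal network(s): {', '.join(map(str, clashes))}")
    options = ["--network", f"slirp4netns:enable_ipv6=true,cidr={cidr}"]
    return f"DEFAULT_CONTAINER_RUN_OPTIONS = {options!r}"
```

For example, check_container_cidr("192.0.2.0/24", ["10.0.0.0/8", "172.16.0.0/12"]) returns the exact line shown in the custom.py step, while a candidate such as 10.0.2.0/24 is rejected because it falls inside 10.0.0.0/8.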

Chapter 9. Playbooks

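You can use automation content navigator to interactively troubleshoot your playbook. For more information about troubleshooting a playbook with automation content navigator, see Troubleshooting Ansible content with automation content navigator in the Automation Content Navigator Creator Guide.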

Chapter 10. Subscriptions

For information about keeping your automation controller subscription in compliance, see Troubleshooting: Keeping your subscription in compliance in the Automation controller user guide.

The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/ . In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version. Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law. Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux ® is the registered trademark of Linus Torvalds in the United States and other countries. Java ® is a registered trademark of Oracle and/or its affiliates. XFS ® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL ® is a registered trademark of MySQL AB in the United States, the European Union and other countries. Node.js ® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.