Description of problem:
OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
After an idle period with no instances, the /var/run/openvswitch directory disappears on the controllers.
[root@controller-0 ~]# ll /var/run/openvswitch
ls: cannot access /var/run/openvswitch: No such file or directory
[root@controller-0 ~]# 
[root@controller-0 ~]# ovs-vsctl show
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
2018-06-11 10:56:42.594 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:42.595 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:56:50.597 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:50.597 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:56:58.600 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:58.601 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:57:06.249 102859 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
2018-06-11 10:57:06.249 102859 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] dies due to the error: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
2018-06-11 10:57:06.603 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:57:06.604 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:57:07.778 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 141, in dump_flows
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 67, in _get_dp
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 50, in _get_dp
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 69, in _get_dp_by_dpid
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int 
2018-06-11 10:57:07.781 102859 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.
Version-Release number of selected component (if applicable):
OSP 13
Puddle 2018-06-06.3
How reproducible:
Deploy an environment.
Leave the environment for a day without any instances running.
Try to boot an instance and check the status of OVS and the /var/run/openvswitch directory.
SOS reports attached.
ovs-vswitchd.log
----------------
2018-06-10T14:44:02.200Z|00379|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 11 flow_mods in the 2 s starting 57 s ago (11 deletes)
2018-06-11T03:25:01.887Z|00380|vlog|WARN|failed to open /var/log/openvswitch/ovs-vswitchd.log for logging: Permission denied
[root@controller-0 heat-admin]# ll /var/log/openvswitch/
total 48
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 12 03:26 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs   484 Jun 10 11:11 ovsdb-server.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 11 03:25 ovsdb-server.log-20180612
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 12 03:26 ovs-vswitchd.log
-rw-r-----. 1 openvswitch hugetlbfs 42895 Jun 10 14:44 ovs-vswitchd.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 11 03:25 ovs-vswitchd.log-20180612
[root@controller-0 heat-admin]# ll /var/log/openvswitch/ -d
drwxr-x---. 2 root root 186 Jun 12 08:08 /var/log/openvswitch/
As per the logs, once the deployment completed, all the tests passed, which implies that the permissions were fine during the deployment. But if the environment is left overnight, the ownership of the log folder /var/log/openvswitch changes to root:root. This results in a permission denied error in the ovsdb-server and ovs-vswitchd processes, and those processes fail, resulting in this error.
Looking into the possibility of log management tools affecting the ownership.
The ownership of the directory /var/log/openvswitch is never changed to openvswitch:hugetlbfs; it is always root:root, as it is created by systemd. After the log files are rotated, the new log files cannot be opened by the OVS daemons. This results in a permission denied error being logged to the older log file (after logrotate).
2018-06-11T03:25:01.887Z|00380|vlog|WARN|failed to open /var/log/openvswitch/ovs-vswitchd.log for logging: Permission denied
Though this permission denied error is an issue to be looked into, it does not explain why the ovsdb-server and ovs-vswitchd services are going into an error state.
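For reference, if the directory really is created at boot by systemd (as suggested above) rather than by the package, a tmpfiles.d entry along the following lines would give it the ownership the daemons expect. This is only a sketch: the drop-in path is hypothetical, and the user/group names are taken from the file listing above.
-------
# /etc/tmpfiles.d/openvswitch.conf (hypothetical drop-in)
# type  path                  mode  user         group      age
d       /var/log/openvswitch  0750  openvswitch  hugetlbfs  -
-------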
[root@computeovsdpdk-0 heat-admin]# logrotate -f /etc/logrotate.conf 
[root@computeovsdpdk-0 heat-admin]# ll /var/log/openvswitch/ 
total 296
-rw-r-----. 1 openvswitch hugetlbfs      0 Jun 12 12:55 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs    352 Jun 10 11:25 ovsdb-server.log-20180611.gz
-rw-r-----. 1 openvswitch hugetlbfs    168 Jun 12 12:49 ovsdb-server.log-20180612.gz
-rw-r-----. 1 openvswitch hugetlbfs      0 Jun 12 12:55 ovs-vswitchd.log
-rw-r-----. 1 openvswitch hugetlbfs 289322 Jun 10 14:43 ovs-vswitchd.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs    167 Jun 12 12:49 ovs-vswitchd.log-20180612.gz
[root@computeovsdpdk-0 heat-admin]# ps aux  | grep ovs
openvsw+    1465  0.0  0.0  55088  2944 ?        S<s  12:38   0:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach
openvsw+    1554  198  0.1 2904664 160420 ?      S<Lsl 12:38  35:14 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
42435       3358  0.0  0.0  11684  1392 ?        Ss   12:39   0:00 /bin/bash /neutron_ovs_agent_launcher.sh
42435       4002  0.0  0.0  52656  2836 ?        S    12:39   0:00 ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json
[root@computeovsdpdk-0 heat-admin]# docker exec -it logrotate_crond bash
()[root@01b72ee362c4 /]#  /sbin/lsof -nPs +L1 +D /var/log 2>&1|    awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' |    /bin/xargs -n1 -r -t kill -HUP
kill -HUP 1465 
kill -HUP 1554 
()[root@01b72ee362c4 /]# 
------------------------------
To explain the behavior: since the /var/log/openvswitch directory has root:root ownership, after logrotate the OVS processes are not able to open the new files. Because of this, the lsof command lists the rotated files as "(deleted)".
---------
[root@computeovsdpdk-0 heat-admin]# /sbin/lsof -nPs +L1 +D /var/log 
ovsdb-ser 1465 openvswitch   22w   REG    8,2      220     0  14127046 /var/log/openvswitch/ovsdb-server.log-20180612 (deleted)
ovs-vswit 1554 openvswitch   83w   REG    8,2      220     0  14127063 /var/log/openvswitch/ovs-vswitchd.log-20180612 (deleted)
--------
Now the "logrotate_crond" container, which runs periodically to kill the processes which have the "(deleted)" log files, kill the ovs processes, which results in the ovsdb-server and ovs-vswitchd service error.
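For clarity, this is the cron job from the logrotate_crond transcript above, restated with comments; the behavior is unchanged.
-------
# 1. lsof -nPs +L1 +D /var/log  -> list open files under /var/log whose link
#    count is < 1, i.e. log files that have already been deleted by rotation.
# 2. awk '... (deleted) ... {print $2}' -> keep only the rows for deleted
#    files under /var/log and print their PID column.
# 3. xargs -n1 -r -t kill -HUP -> send SIGHUP to each such PID; here that is
#    ovsdb-server (1465) and ovs-vswitchd (1554), which terminates them.
/sbin/lsof -nPs +L1 +D /var/log 2>&1 |
    awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' |
    /bin/xargs -n1 -r -t kill -HUP
-------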
There are 2 issues here:
1) After logrotate, OVS logs are not captured because of the permission denied error - Aaron, could you comment on it?
2) The logrotate_crond container is killing the OVS processes, which is unexpected even if logging is failing - I will continue to look into it.
In the Open vSwitch spec file, we do set the log directory ownership:
%attr(750,openvswitch,openvswitch) /var/log/openvswitch
Which RPM version is this?
From bz description: OSP 13 Puddle 2018-06-06.3
http://download.devel.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2018-06-06.3
which includes the cross-tagged openvswitch-2.9.0-19.el7fdp.1 build.
Checked on another OSP13 deployment (with the latest puddle) and found that the directory ownership is still root:root.
--------------
[root@overcloud-controller-0 heat-admin]# ll /var/log/openvswitch/ -d
drwxr-x---. 2 root root 54 Jun 13 08:56 /var/log/openvswitch/
[root@overcloud-controller-0 heat-admin]# yum list openvswitch
Loaded plugins: product-id, search-disabled-repos, subscription-manager
This system is not registered with an entitlement server. You can use subscription-manager to register.
Installed Packages
openvswitch.x86_64                                                                     2.9.0-19.el7fdp.1                                                                     @rhos-13.0-signed
[root@overcloud-controller-0 heat-admin]# ll /var/log/openvswitch/
total 12
-rw-r-----. 1 openvswitch hugetlbfs  359 Jun 13 08:56 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs 4747 Jun 13 09:23 ovs-vswitchd.log
---------------
Checking the overcloud-full image (where openvswitch is pre-installed), the directory ownership is root:root. Does it happen if openvswitch is updated rather than installed?
-------
 [root@undercloud01 images]# ll mount_image/var/log/openvswitch/ -d
drwxr-x---. 2 root root 6 May 30 09:59 mount_image/var/log/openvswitch/
-------
Sounds related to a bug we had to fix for /var/log/lastlog in overcloud-full images.
This seems to be either a virt-sysprep interaction issue or an issue in the logrotate config, which needs to make sure the process is bumped (SIGHUP?) when log files are rotated, so that it starts logging to the new file instead of the old, deleted one.
Fixing the permissions will give correct ownership, but the log rotation configuration also has to be clean, or this will continue to happen.
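As a reference point, here is a hedged sketch of a logrotate stanza that asks the daemons to reopen their logs on rotation instead of relying on an external HUP. The stanza actually shipped by the openvswitch package may differ; the ovs-appctl targets are assumed from the pidfiles under /var/run/openvswitch shown earlier.
-------
/var/log/openvswitch/*.log {
    daily
    compress
    missingok
    sharedscripts
    postrotate
        # Ask each running OVS daemon (ovsdb-server, ovs-vswitchd) to reopen
        # its log file; target names are derived from the pidfiles.
        for pidfile in /var/run/openvswitch/*.pid; do
            [ -e "$pidfile" ] || continue
            ovs-appctl -t "$(basename "$pidfile" .pid)" vlog/reopen 2>/dev/null || :
        done
    endscript
}
-------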
The problem isn't /var/log/lastlog directly - the virt-sysprep line is killing everything in /var/log; we then have hacks to put back lastlog.
If this works outside of deployment images, then it's likely a virt-sysprep issue and I know how to fix it.
After rebooting the controllers, it works again.
[root@newplan-controller-1 ~]# ls -l /var/run/ |grep openvswitch
drwxr-xr-x.  2 openvswitch hugetlbfs  240 Jun 14 17:57 openvswitch
(In reply to Saravanan KR from comment #15)
> To explain the behavior: since the /var/log/openvswitch directory has
> root:root ownership, after logrotate the OVS processes are not able to open
> the new files. Because of this, the lsof command lists the rotated files as
> "(deleted)".
> ---------
> [root@computeovsdpdk-0 heat-admin]# /sbin/lsof -nPs +L1 +D /var/log 
> ovsdb-ser 1465 openvswitch   22w   REG    8,2      220     0  14127046
> /var/log/openvswitch/ovsdb-server.log-20180612 (deleted)
> ovs-vswit 1554 openvswitch   83w   REG    8,2      220     0  14127063
> /var/log/openvswitch/ovs-vswitchd.log-20180612 (deleted)
> --------
> Now the "logrotate_crond" container, whose cron job periodically kills
> processes that are still holding "(deleted)" log files open, kills the OVS
> processes, which results in the ovsdb-server and ovs-vswitchd service errors.
> There are 2 issues here:
> 1) After logrotate, OVS logs are not captured because of the permission
> denied error - Aaron, could you comment on it?
> 2) The logrotate_crond container is killing the OVS processes, which is
> unexpected even if logging is failing - I will continue to look into it.
If indeed 2) is one of the root causes for this problem, then the fix for that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946
(In reply to Michele Baldessari from comment #26)
> (In reply to Saravanan KR from comment #15)
> > 2) The logrotate_crond container is killing the OVS processes, which is
> > unexpected even if logging is failing - I will continue to look into it.
> If indeed 2) is one of the root causes for this problem, then the fix for
> that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946
Indeed, (2) is the cause of the OVS processes being killed, resulting in the OVS service error and the disappearance of /var/run/openvswitch. Thanks for pointing it out.
And (1) should also be fixed, because after log rotation no OVS logs will be collected any more after a day, and these are one of the main debugging logs for OVS-DPDK deployments.
(In reply to Saravanan KR from comment #27)
> (In reply to Michele Baldessari from comment #26)
> > (In reply to Saravanan KR from comment #15)
> > > 2) The logrotate_crond container is killing the OVS processes, which is
> > > unexpected even if logging is failing - I will continue to look into it.
> > If indeed 2) is one of the root causes for this problem, then the fix for
> > that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946
> Indeed, (2) is the cause of the OVS processes being killed, resulting in the
> OVS service error and the disappearance of /var/run/openvswitch. Thanks for
> pointing it out.
> And (1) should also be fixed, because after log rotation no OVS logs will be
> collected any more after a day, and these are one of the main debugging logs
> for OVS-DPDK deployments.
As per https://bugzilla.redhat.com/show_bug.cgi?id=1584946#c19, it seems that the OVS service no longer stops due to log rotation once https://review.openstack.org/#/c/575525 is applied.
There's an ordering problem with the spec file. When the images are created, the package is installed once; however, at that point the OVS group/user doesn't exist yet:
Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch-2.9.0-19.el7fdp.1    warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
################################# [100%]
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
[root@localhost ~]# ls -ld /var/log/openvswitch/
drwxr-x---. 2 root root 6 May 30 09:59 /var/log/openvswitch/
After removing the RPM and reinstalling it without deleting the user, things work:
Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch-2.9.0-19.el7fdp.1    ################################# [100%]
[root@localhost ~]# ls -ld /var/log/openvswitch/
drwxr-x---. 2 openvswitch openvswitch 6 May 30 09:59 /var/log/openvswitch/
[root@localhost ~]#
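For context, the usual RPM pattern that avoids this ordering problem is to create the user and group in a %pre scriptlet, so they exist before the payload (and its %attr ownership) is laid down. The following is only a sketch; the actual Open vSwitch spec file may differ, and the useradd options shown here are assumptions.
-------
%pre
# Create the account before any files are installed so that %attr can apply.
getent group openvswitch >/dev/null || groupadd -r openvswitch
getent passwd openvswitch >/dev/null || \
    useradd -r -g openvswitch -d / -s /sbin/nologin \
            -c "Open vSwitch daemons" openvswitch
exit 0

%files
%attr(750,openvswitch,openvswitch) /var/log/openvswitch
-------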
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2018:2083