$ borg -V
borg 1.1.16
Operating system (distribution) and version.
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Hardware / network configuration, and filesystems used.
BTRFS on target host
Describe the problem you're observing.
I've been seeing locking-related errors for some time; see the backtraces below. So far, I've only seen this issue on one out of two servers. The only difference I'm aware of is that the affected machine uses BTRFS as its filesystem.
Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
No, but the issue reappears sporadically.
Include any warning/errors/backtraces from the system logs
Traceback (most recent call last):
File "borg/locking.py", line 130, in acquire
FileExistsError: [Errno 17] File exists: '/backup/borg_repos/backup01/lock.exclusive'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "borg/remote.py", line 247, in serve
File "borg/remote.py", line 375, in open
File "borg/repository.py", line 190, in __enter__
File "borg/repository.py", line 421, in open
File "borg/locking.py", line 350, in acquire
File "borg/locking.py", line 363, in _wait_for_readers_finishing
File "borg/locking.py", line 134, in acquire
File "borg/locking.py", line 159, in kill_stale_lock
FileNotFoundError: [Errno 2] No such file or directory: '/backup/borg_repos/backup01/lock.exclusive'
Borg server: Platform: Linux backup02 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Borg server: Linux: debian bullseye/sid
Borg server: Borg: 1.1.16 Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
Borg server: PID: 1808518 CWD: /backup/borg_repos/backup01
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/backup/borg_repos/backup01']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Platform: Linux backup01 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Linux: debian bullseye/sid
Borg: 1.1.16 Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
PID: 67423 CWD: /root
sys.argv: ['borg', 'create', '--error', '--compression', 'zstd', '--lock-wait', '43200', '--noatime', '--patterns-from', '/etc/borg/patterns-main', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}']
SSH_ORIGINAL_COMMAND: None
Analyzing this:
locking.py:130: the exclusive lock directory (ELD) already exists when we try to create it
locking.py:134: we only get here if it is NOT our own lock, so we try to kill stale locks now
locking.py:159: the ELD is gone when we try to look into it
So, as the ELD is gone at line 159, something must have removed it (which is why I was asking about break-lock and about other borg processes that may have raced to delete a stale lock).
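To make the suspected race concrete, here is a minimal, self-contained Python sketch (not borg's actual code; the repository path and the body of kill_stale_lock are made up for illustration): two processes both find a seemingly stale lock.exclusive directory, both try to remove it, and the loser hits FileNotFoundError because the winner already deleted it - matching the backtrace above.

# Minimal sketch (NOT borg's code): two processes race to remove a stale lock.exclusive
import os
import shutil
import tempfile
from multiprocessing import Process

def kill_stale_lock(lock_dir):
    # illustrative only; the real logic lives in borg/locking.py
    try:
        os.listdir(lock_dir)        # like locking.py:159, look inside the ELD
        shutil.rmtree(lock_dir)     # remove the (assumed stale) lock
        print(f"{os.getpid()}: removed stale lock")
    except FileNotFoundError:
        # the other process deleted it between our check and our access
        print(f"{os.getpid()}: FileNotFoundError - lock already gone")

if __name__ == "__main__":
    repo = tempfile.mkdtemp()
    lock_dir = os.path.join(repo, "lock.exclusive")
    os.mkdir(lock_dir)              # simulate a leftover exclusive lock directory
    procs = [Process(target=kill_stale_lock, args=(lock_dir,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shutil.rmtree(repo, ignore_errors=True)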
This is alfredo08154711 writing. It's not clear to me whether any action is required on my side.
BTW, it's also not clear to me which versions my fixes have been merged into. I only know that they were merged into the master branch about a year ago.
And I'm still waiting for a discussion of my suggested contribution, an option --shared for borg with-lock (a PR, IIRC): "Lock the repository with a shared (read) lock instead of an exclusive (write) lock." I've been using it for my repository rsync copies (redundancy for business continuity) for a year without any observed issues.
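For illustration of what shared vs. exclusive semantics buy you, here is a generic sketch using fcntl.flock with a made-up lock file path; it is not borg's lock implementation or the proposed PR code. Several shared (read) locks can be held at once, e.g. while a repository copy is taken, while an exclusive (write) lock is refused as long as readers hold theirs.

# Generic illustration of shared vs. exclusive lock semantics (NOT borg's implementation)
import fcntl

LOCK_FILE = "/tmp/demo-repo.lock"    # made-up path for this demo

def acquire(mode):
    f = open(LOCK_FILE, "w")
    fcntl.flock(f, mode | fcntl.LOCK_NB)   # non-blocking; raises if the lock is unavailable
    return f

reader1 = acquire(fcntl.LOCK_SH)     # shared (read) lock, e.g. held while copying the repo
reader2 = acquire(fcntl.LOCK_SH)     # a second shared lock is granted concurrently
try:
    writer = acquire(fcntl.LOCK_EX)  # exclusive (write) lock must wait
except BlockingIOError:
    print("exclusive lock refused while shared locks are held")
reader1.close()
reader2.close()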
Cheers
Thomas
On 9 June 2021 at 14:15:07 CEST, TW ***@***.***> wrote:
Note: the master branch locking code recently got improvements/fixes by @alfredo08154711, and it already handles this case.
hi @alfredo08154711 - no action strictly required, but maybe you want to have a look at the PR.
yes, your fixes are only in the master branch (and thus in borg 1.2).
i first wanted to have that tested there before doing big changes in the 1.1 locking code. not sure how much testing the 1.2 betas have gotten so far - a lot of people might hold back and use only a stable release rather than additionally testing a beta.
about the with-lock --shared PR: will try to review it, but I'd also like other users/developers to review it and give feedback.
do you mean #4120?
Thank you very much, Mr. Waldmann.
Actually, I would have liked to meet you in person some time - when I started on the fixes, I still had a job at AIRBUS in Ulm and therefore commuted weekly through your area (I live 8 km from Aschaffenburg, towards Frankfurt). But because of the nationwide corona quarantine, I gave up on the idea of meeting you.
But what hasn't happened yet can still happen...
Best regards
Thomas Portmann
On 9 June 2021 at 14:50:34 CEST, TW ***@***.***> wrote:
hi @alfredo08154711 - yes, your fixes are only in the master branch (and thus in borg 1.2).
i first wanted to have that tested there before doing big changes in the 1.1 locking code. not sure how much testing the 1.2 betas have gotten so far - a lot of people might hold back and use only a stable release rather than additionally testing a beta.
about the with-lock --shared PR: will try to review it, but I'd also like other users/developers to review it and give feedback.
No, break-lock is not used.
Do you have multiple borg client processes running against that one repo? At the same time?
According to the crontab file, two jobs are actually started simultaneously:
# create backup "main" at [email protected]:~
30 21 * * * root borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-main -- '[email protected]:~'::main-{now:\%Y-\%m-\%dT\%H:\%M:\%S}
# prune backup "main" at [email protected]:~
30 20 * * * root borg prune '[email protected]:~' --lock-wait 43200 --keep-within 7d --keep-daily 7 --keep-weekly 4 --prefix main-
# create backup "postgres" at [email protected]:~
02 22 * * * root borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-postgres -- '[email protected]:~'::postgres-{now:\%Y-\%m-\%dT\%H:\%M:\%S}
# prune backup "postgres" at [email protected]:~
30 20 * * 1-5 root borg prune '[email protected]:~' --lock-wait 43200 --keep-daily 7 --keep-weekly 4 --prefix postgres-
# check repo at [email protected]:~
00 23 * * 0 root borg check --lock-wait 43200 '[email protected]:~' >/dev/null 2>&1
# intended for use by monitoring and health checks
0 12 * * * root borg info --glob '*' --lock-wait 43200 --last 300 --json '[email protected]:~' >/var/lib/tocco-borg/info-1.tmp && mv /var/lib/tocco-borg/info-1.tmp /var/lib/tocco-borg/info-1
I guess that could explain why it only happens on one server. I can't see any other simultaneous executions of borg.
Note: the master branch locking code recently got improvements/fixes by @alfredo08154711, and it already handles this case.
I noticed the errors a while back and saw that a refactoring of the locking mechanism was ongoing, so I refrained from reporting the issue under the assumption that it would likely be fixed by that change. I assumed the change went into 1.1.16, but I guess I was wrong about that.
I'm thinking I should just update the one server to borg 1.2 and see if that helps. Considering that only the secondary backup server is affected, the risk should be manageable.