On every reboot or power loss, my Ceph managers crash, and the CephFS snap_schedule module has not been working since 2023-02-05 18:00.
The ceph-mgr starts anyway, but it generates a crash report, putting the Ceph cluster into HEALTH_WARN status.
I have the issue on every node (3-node cluster), probably since the Quincy update.
Does anyone observe the same problem?
Do you have any recommendations or fixes?
snap_schedule unavailable
root@pve3:~# ceph fs snap-schedule status / | jq
Error ENOENT: Module 'snap_schedule' is not available
The crash info:
root@pve3:~# ceph crash info '2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773'
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/snap_schedule/module.py\", line 38, in __init__\n    self.client = SnapSchedClient(self)",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 169, in __init__\n    with self.get_schedule_db(fs_name) as conn_mgr:",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 203, in get_schedule_db\n    db.executescript(dump)",
        "sqlite3.OperationalError: unable to open database file"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773",
    "entity_name": "mgr.pve3",
    "mgr_module": "snap_schedule",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "OperationalError",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-mgr",
    "stack_sig": "2fb4f03ffef7798ee981190306cedadb7d698a3a4cd6dbb59c0400ec3f76b6ba",
    "timestamp": "2023-04-11T06:23:22.105089Z",
    "utsname_hostname": "pve3",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.102-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z)"
}
Additional information on my Ceph setup:
balancer on (always on)
crash on (always on)
devicehealth on (always on)
orchestrator on (always on)
pg_autoscaler on (always on)
progress on (always on)
rbd_support on (always on)
status on (always on)
telemetry on (always on)
volumes on (always on)
dashboard on
iostat on
nfs on
prometheus on
restful on
snap_schedule on
stats on
alerts -
influx -
insights -
localpool -
mirroring -
osd_perf_query -
osd_support -
selftest -
telegraf -
test_orchestrator -
zabbix -
root@pve3:~# ceph crash ls
ID ENTITY NEW
2023-02-10T15:55:33.246668Z_d7bfe3b0-2647-4583-b257-60cc8bebb820 mgr.pve3
2023-02-10T21:10:48.333710Z_fab6d271-2708-4bf4-a70b-36918d268a14 mgr.pve1
2023-02-10T21:11:32.956340Z_d5fb4e86-1dbc-4247-845f-b9777af168c4 osd.2
2023-02-10T21:11:34.464160Z_0b4effa2-0538-4090-9337-89259d3d78bd osd.4
2023-02-16T11:07:38.833449Z_6065ff7b-ef02-4d6f-bbd2-a3beb7ca7f9a mgr.pve2
2023-02-19T23:12:31.685803Z_9a53dab2-3817-4ebd-8a0b-f479d277d751 mgr.pve3
2023-02-19T23:30:00.594498Z_5da1ca3c-8d1b-470f-818f-e8bde0ed01f7 mgr.pve1
2023-03-05T16:00:29.692042Z_0f6ae1ae-8d74-4a0e-9e41-3172086f8460 mgr.pve2
2023-03-12T11:46:48.532363Z_2ec74f76-9cbe-438c-8948-82d3732f0aea mgr.pve3
2023-03-12T21:32:17.037999Z_6df8796c-ec4c-42f6-b352-b273e345c22e mgr.pve1
2023-03-13T09:19:09.578815Z_849121da-6ad9-4036-b3b6-c556263a0f05 mgr.pve2
2023-03-13T12:59:09.792996Z_b1de9b4f-c8a4-48ae-9dff-d3413e527e43 mgr.pve3
2023-03-13T13:34:43.233360Z_4ed2be9e-2f07-4853-b3bf-1e0efe69afea osd.5
2023-03-18T09:22:35.338683Z_ab963d23-e823-40aa-b0ad-0330b5265ff7 mgr.pve1
2023-03-24T17:39:43.801037Z_40bdd143-b099-4f15-bfd6-39a0d544ee38 mgr.pve1
2023-04-07T14:02:36.377029Z_c1510456-3f13-4a5e-ba86-9ea7779817f2 mgr.pve3
2023-04-07T14:03:37.643954Z_4c946518-c119-480b-87b5-2b0d48f583df mgr.pve2
2023-04-07T19:19:58.561573Z_e5057586-899c-4bd3-bf4e-f08d469f2013 mgr.pve1 *
2023-04-08T17:43:39.891129Z_1770ab87-0303-4d0e-bbbb-6e62143876c0 mgr.pve3 *
2023-04-08T18:18:12.318248Z_b8563587-a64f-4b97-a4ab-2d55049d1261 mgr.pve2 *
2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773 mgr.pve3 *
Marking this thread as "Fixed", since probably related problems have been discussed here:
- https://www.spinics.net/lists/ceph-users/msg74696.html
- https://tracker.ceph.com/issues/57851
And the proposed fix also resolves the problem above:
https://github.com/ceph/ceph/pull/48449/commits/8d853cc4990dc4dbccdc916115b0b30e0ac9dc19
This fix will probably be included in the next Ceph update.
The problem seems to be caused by the migration from 16 (Pacific) to 17 (Quincy) when snap_schedule was enabled before the migration.
The sqlite DB storage has been moved to the cephfs_metadata, and the mgr can't migrate the old database without this minor patch.
I added the two lines from the patch manually to the file (the path can be guessed from the crash report), rebooted the node (restarting the mgr might have been enough), and then made this mgr the active one. I made the change on one node only, since it's only needed once, for the DB migration to the new storage. The patch can then be reverted if desired.
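For context, the backtrace shows the migration uses the classic sqlite dump-and-restore pattern: the old schedule DB is dumped to SQL text and replayed into the new DB with executescript() (the call that failed). A sketch of that pattern, using in-memory DBs and a made-up table, not the real snap_schedule schema:

```python
import sqlite3

# Old DB with some schedule-like data (placeholder schema).
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE schedules (path TEXT, schedule TEXT)")
old.execute("INSERT INTO schedules VALUES ('/', '1h')")
old.commit()

# Dump the whole DB to SQL text ...
dump = "\n".join(old.iterdump())

# ... and replay it into the new DB (in Ceph's case, the DB stored
# in the cephfs metadata pool). This is the db.executescript(dump)
# call from the crash backtrace.
new = sqlite3.connect(":memory:")
new.executescript(dump)

print(new.execute("SELECT * FROM schedules").fetchall())
# [('/', '1h')]
```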
This immediately fixed the fs snap-schedule status command:
"start": "2022-07-17T00:00:00",
"created": "2022-07-17T22:44:20",
"first": "2023-04-11T21:00:00",
"last": "2023-04-11T21:00:00",
"last_pruned": "2023-04-11T21:00:00",
"created_count": 1,
"pruned_count": 1,
"active": true
And the CephFS automatic snapshots work again (now with an appended _UTC suffix):
root@pve3:~# ls -tr -1 /mnt/pve/cephfs/.snap
weekly_2022-07-10_231701
daily_2022-07-10_231701
scheduled-2023-04-11-21_00_00_UTC
scheduled-2023-02-05-18_00_00
scheduled-2023-02-05-17_00_00
scheduled-2023-02-05-16_00_00
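In case it helps anyone scripting around these snapshots: the scheduled names appear to follow scheduled-YYYY-MM-DD-HH_MM_SS, optionally with the trailing _UTC since the fix. A small parsing sketch (function name is mine, the format is just what the listing above suggests):

```python
from datetime import datetime

def parse_snap(name: str) -> datetime:
    # Strip the "scheduled-" prefix and optional "_UTC" suffix,
    # then parse the remaining timestamp.
    stamp = name.removeprefix("scheduled-").removesuffix("_UTC")
    return datetime.strptime(stamp, "%Y-%m-%d-%H_%M_%S")

print(parse_snap("scheduled-2023-04-11-21_00_00_UTC"))
# 2023-04-11 21:00:00
```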
As you may have noticed, I lost the retention settings, so I had to re-apply and verify them:
root@pve3:~# ceph fs snap-schedule retention add / m 12
ceph fs snap-schedule retention add / w 4
ceph fs snap-schedule retention add / d 7
ceph fs snap-schedule retention add / h 24
Retention added to path /
Retention added to path /
Retention added to path /
Retention added to path /
root@pve3:~# ceph fs snap-schedule status | jq
{
  "fs": "cephfs",
  "subvol": null,
  "path": "/",
  "rel_path": "/",
  "schedule": "1h",
  "retention": {
    "m": 12,
    "w": 4,
    "d": 7,
    "h": 24
  },
  "start": "2022-07-17T00:00:00",
  "created": "2022-07-17T22:44:20",
  "first": "2023-04-11T21:00:00",
  "last": "2023-04-11T21:00:00",
  "last_pruned": "2023-04-11T21:00:00",
  "created_count": 1,
  "pruned_count": 1,
  "active": true
}
The details of my CephFS snapshot setup are on my blog.
I hope this post will save some hours for people experiencing the same issue.