On every reboot or power loss, my Ceph managers crash, and cephfs snap_schedule has not been working since 2023-02-05 18:00.
The ceph-mgr starts anyway, but it generates a crash report, putting the Ceph cluster into HEALTH_WARN status.
I have the issue on every node (3-node cluster), probably since the Quincy update.
Does anyone observe the same problem?
Do you have any recommendations or fixes?
snap_schedule unavailable
root@pve3:~# ceph fs snap-schedule status / | jq
Error ENOENT: Module 'snap_schedule' is not available
The crash info:
root@pve3:~# ceph crash info '2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773'
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/snap_schedule/module.py\", line 38, in __init__\n    self.client = SnapSchedClient(self)",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 169, in __init__\n    with self.get_schedule_db(fs_name) as conn_mgr:",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 203, in get_schedule_db\n    db.executescript(dump)",
        "sqlite3.OperationalError: unable to open database file"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773",
    "entity_name": "mgr.pve3",
    "mgr_module": "snap_schedule",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "OperationalError",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-mgr",
    "stack_sig": "2fb4f03ffef7798ee981190306cedadb7d698a3a4cd6dbb59c0400ec3f76b6ba",
    "timestamp": "2023-04-11T06:23:22.105089Z",
    "utsname_hostname": "pve3",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.102-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z)"
}
Additional information on the Ceph setup (manager module list):
balancer           on (always on)
crash              on (always on)
devicehealth       on (always on)
orchestrator       on (always on)
pg_autoscaler      on (always on)
progress           on (always on)
rbd_support        on (always on)
status             on (always on)
telemetry          on (always on)
volumes            on (always on)
dashboard          on
iostat             on
nfs                on
prometheus         on
restful            on
snap_schedule      on
stats              on
alerts             -
influx             -
insights           -
localpool          -
mirroring          -
osd_perf_query     -
osd_support        -
selftest           -
telegraf           -
test_orchestrator  -
zabbix             -
root@pve3:~# ceph crash ls
ID                                                                ENTITY    NEW 
2023-02-10T15:55:33.246668Z_d7bfe3b0-2647-4583-b257-60cc8bebb820  mgr.pve3       
2023-02-10T21:10:48.333710Z_fab6d271-2708-4bf4-a70b-36918d268a14  mgr.pve1       
2023-02-10T21:11:32.956340Z_d5fb4e86-1dbc-4247-845f-b9777af168c4  osd.2         
2023-02-10T21:11:34.464160Z_0b4effa2-0538-4090-9337-89259d3d78bd  osd.4         
2023-02-16T11:07:38.833449Z_6065ff7b-ef02-4d6f-bbd2-a3beb7ca7f9a  mgr.pve2       
2023-02-19T23:12:31.685803Z_9a53dab2-3817-4ebd-8a0b-f479d277d751  mgr.pve3       
2023-02-19T23:30:00.594498Z_5da1ca3c-8d1b-470f-818f-e8bde0ed01f7  mgr.pve1       
2023-03-05T16:00:29.692042Z_0f6ae1ae-8d74-4a0e-9e41-3172086f8460  mgr.pve2       
2023-03-12T11:46:48.532363Z_2ec74f76-9cbe-438c-8948-82d3732f0aea  mgr.pve3       
2023-03-12T21:32:17.037999Z_6df8796c-ec4c-42f6-b352-b273e345c22e  mgr.pve1       
2023-03-13T09:19:09.578815Z_849121da-6ad9-4036-b3b6-c556263a0f05  mgr.pve2       
2023-03-13T12:59:09.792996Z_b1de9b4f-c8a4-48ae-9dff-d3413e527e43  mgr.pve3       
2023-03-13T13:34:43.233360Z_4ed2be9e-2f07-4853-b3bf-1e0efe69afea  osd.5         
2023-03-18T09:22:35.338683Z_ab963d23-e823-40aa-b0ad-0330b5265ff7  mgr.pve1       
2023-03-24T17:39:43.801037Z_40bdd143-b099-4f15-bfd6-39a0d544ee38  mgr.pve1       
2023-04-07T14:02:36.377029Z_c1510456-3f13-4a5e-ba86-9ea7779817f2  mgr.pve3       
2023-04-07T14:03:37.643954Z_4c946518-c119-480b-87b5-2b0d48f583df  mgr.pve2       
2023-04-07T19:19:58.561573Z_e5057586-899c-4bd3-bf4e-f08d469f2013  mgr.pve1   *   
2023-04-08T17:43:39.891129Z_1770ab87-0303-4d0e-bbbb-6e62143876c0  mgr.pve3   *   
2023-04-08T18:18:12.318248Z_b8563587-a64f-4b97-a4ab-2d55049d1261  mgr.pve2   *   
2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773  mgr.pve3   *
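Once the crash reports have been reviewed, the HEALTH_WARN caused by the new (starred) entries can be cleared by archiving them. A minimal sketch, assuming you have already looked at each report:

```shell
# Acknowledge a single crash report by its ID...
ceph crash archive 2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773

# ...or acknowledge all of them at once. Archived reports remain visible
# via `ceph crash ls-all` but no longer count towards HEALTH_WARN.
ceph crash archive-all
```

Archiving does not fix anything by itself; the mgr will keep crashing on the next reboot until the module is actually repaired.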
Marking this thread as "Fixed", since probably related problems have been discussed here:
- https://www.spinics.net/lists/ceph-users/msg74696.html
- https://tracker.ceph.com/issues/57851
The fix proposed there also resolves the problem above:
https://github.com/ceph/ceph/pull/48449/commits/8d853cc4990dc4dbccdc916115b0b30e0ac9dc19
This fix will probably land in the next Ceph update.
The problem seems to be caused by the migration from 16 (Pacific) to 17 (Quincy) when snap_schedule was enabled before the migration.
The sqlite DB storage was moved to the cephfs metadata pool, and the mgr can't migrate the old database without this minor patch.
I added the two lines from the patch manually to the files (the paths can be read from the crash report), rebooted the node (restarting the mgr would probably have been enough), and then made this mgr the active one. I made the change on one node only, since it is only needed for the one-time migration of the DB to the new storage. The patch can then be removed if desired.
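For reference, the restart-and-failover steps described above look roughly like this (a sketch, assuming pve3 is the patched node and another node currently holds the active mgr):

```shell
# Restart the manager on the patched node so it picks up the modified
# snap_schedule module (a full reboot is not required).
systemctl restart ceph-mgr@pve3

# Fail the currently active manager so a standby (here: the patched pve3)
# takes over and performs the one-time sqlite DB migration.
ceph mgr fail

# The module should now load and the command should return the schedule JSON.
ceph fs snap-schedule status / | jq
```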
This immediately fixed the fs snap-schedule status command:

"start": "2022-07-17T00:00:00",
"created": "2022-07-17T22:44:20",
"first": "2023-04-11T21:00:00",
"last": "2023-04-11T21:00:00",
"last_pruned": "2023-04-11T21:00:00",
"created_count": 1,
"pruned_count": 1,
"active": true

And the cephfs automatic snapshots work again (with an appended _UTC):
root@pve3:~# ls -tr -1 /mnt/pve/cephfs/.snap
weekly_2022-07-10_231701
daily_2022-07-10_231701
scheduled-2023-04-11-21_00_00_UTC
scheduled-2023-02-05-18_00_00
scheduled-2023-02-05-17_00_00
scheduled-2023-02-05-16_00_00
As you might have noticed, I lost the retention settings, so I had to re-apply and verify them:
root@pve3:~# ceph fs snap-schedule retention add / m 12
ceph fs snap-schedule retention add / w 4
ceph fs snap-schedule retention add / d 7
ceph fs snap-schedule retention add / h 24
Retention added to path /
Retention added to path /
Retention added to path /
Retention added to path /
root@pve3:~# ceph fs snap-schedule status | jq
{
  "fs": "cephfs",
  "subvol": null,
  "path": "/",
  "rel_path": "/",
  "schedule": "1h",
  "retention": {
    "m": 12,
    "w": 4,
    "d": 7,
    "h": 24
  },
  "start": "2022-07-17T00:00:00",
  "created": "2022-07-17T22:44:20",
  "first": "2023-04-11T21:00:00",
  "last": "2023-04-11T21:00:00",
  "last_pruned": "2023-04-11T21:00:00",
  "created_count": 1,
  "pruned_count": 1,
  "active": true
}
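To spot-check just the retention policy without scrolling through the whole status output, the JSON can be filtered with jq (assuming jq is installed, as used above). Here the filter runs against a saved copy of the status JSON; on the cluster you would pipe ceph fs snap-schedule status / into jq directly:

```shell
# Extract only the retention object from the snap-schedule status JSON.
status='{"fs": "cephfs", "path": "/", "schedule": "1h", "retention": {"m": 12, "w": 4, "d": 7, "h": 24}, "active": true}'
echo "$status" | jq -c '.retention'
# → {"m":12,"w":4,"d":7,"h":24}
```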
The details of my cephfs snapshot setup are on my blog.
I hope this post saves some hours for people experiencing the same issue.