Fixing a Placement Group Error in Ceph
Roman Bogachev
VMware Specialist | Drone Pilot | Traveler
Troubleshooting a Ceph cluster PG problem: HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
After replacing patch cords on a test cluster, I found an error in Ceph.
```
$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.2e is active+clean+inconsistent, acting [8]
1 scrub errors
```
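If the PG id is not obvious at a glance, it can be pulled out of the health output mechanically. A minimal sketch, run here against the captured line rather than a live cluster (on a real cluster you would pipe `ceph health detail` directly; the sample variable is for illustration only):

```shell
# Sample line copied from `ceph health detail` output above.
health_output='pg 2.2e is active+clean+inconsistent, acting [8]'

# Extract the id of any PG reported as inconsistent.
pgid=$(echo "$health_output" \
    | grep -oE 'pg [0-9a-f]+\.[0-9a-f]+ is [^,]*inconsistent' \
    | awk '{print $2}')
echo "$pgid"   # 2.2e
```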
Check the OSD log files:
```
$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph.audit.log:2017-04-24 15:38:46.124303 mon.2 192.168.2.120:6789/0 103771 : audit [INF] from='client.? 192.168.2.101:0/2439225090' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.2e"}]: dispatch
/var/log/ceph/ceph.log:2017-04-24 10:10:24.576558 osd.3 192.168.2.100:6804/4914 4445 : cluster [INF] 2.2e deep-scrub starts
/var/log/ceph/ceph.log:2017-04-24 10:10:37.434117 osd.3 192.168.2.100:6804/4914 4446 : cluster [ERR] 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940079 osd.3 192.168.2.100:6804/4914 4447 : cluster [ERR] 2.2e deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940085 osd.3 192.168.2.100:6804/4914 4448 : cluster [ERR] 2.2e deep-scrub 1 errors
/var/log/ceph/ceph.log:2017-04-24 15:38:46.717506 osd.3 192.168.2.100:6804/4914 4459 : cluster [INF] 2.2e repair starts
/var/log/ceph/ceph.log:2017-04-24 15:38:56.741299 osd.3 192.168.2.100:6804/4914 4460 : cluster [ERR] 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
```
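The important line is the `[ERR]` one naming the shard and the object (soid) that failed the read. A small sketch extracting both from that log line, so you know which OSD and which RBD object are affected before repairing (the line is copied from the log above; on newer Ceph releases `rados list-inconsistent-obj <pgid>` gives a structured view of the same information):

```shell
# The deep-scrub error line, copied from the cluster log above.
err_line='cluster [ERR] 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error'

# Which shard (OSD) reported the read error, and for which object.
shard=$(echo "$err_line" | sed -E 's/.*shard ([0-9]+):.*/\1/')
soid=$(echo "$err_line"  | sed -E 's/.*soid ([^ ]+) candidate.*/\1/')
echo "shard=$shard soid=$soid"
```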
Start the Placement Group repair:
```
$ ceph pg repair 2.2e
instructing pg 2.2e on osd.3 to repair
```
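The single-PG repair above generalizes: when several PGs are inconsistent, the same command can be issued for each of them. A sketch that only echoes the repair commands so it stays runnable without a cluster (drop the `echo` prefix on a real cluster; the second PG, 3.1a, is invented here purely for illustration):

```shell
# Sample `ceph health detail` output; pg 3.1a is a made-up second entry.
health_detail='pg 2.2e is active+clean+inconsistent, acting [8]
pg 3.1a is active+clean+inconsistent, acting [5]'

# Build one `ceph pg repair` command per inconsistent PG (dry run: echo only).
repair_cmds=$(echo "$health_detail" | awk '/inconsistent/ {print "ceph pg repair " $2}')
echo "$repair_cmds"
```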
A few seconds later the PG has been repaired and the cluster is back to normal operation.
```
$ ceph health detail
HEALTH_OK
```
Check the PG details:
```
$ ceph pg 2.2e query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 516,
    "up": [
        3,
        8
    ],
    "acting": [
        3,
        8
    ],
    "actingbackfill": [
        "3",
        "8"
    ],
```
Check the log once more and confirm that the problem has been fixed:
```
$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:46.717501 7f2070c50700  0 log_channel(cluster) log [INF] : 2.2e repair starts
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:56.741297 7f2070c50700 -1 log_channel(cluster) log [ERR] : 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.692446 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.752099 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 1 errors, 1 fixed
```
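The last log line is the one that matters: the repair succeeded only if the number of errors found matches the number fixed. A small sketch checking exactly that, on the line copied from the OSD log above:

```shell
# Final repair summary, copied from ceph-osd.3.log above.
repair_line='2.2e repair 1 errors, 1 fixed'

# The repair is complete only when every found error was also fixed.
errors=$(echo "$repair_line" | sed -E 's/.*repair ([0-9]+) errors.*/\1/')
fixed=$(echo "$repair_line"  | sed -E 's/.*, ([0-9]+) fixed/\1/')
[ "$errors" = "$fixed" ] && echo "all errors fixed"
```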
Good practice is to use LACP bonding on servers, so that a single bad link does not surface as storage errors. But since this is a test cluster, errors like this are to be expected.