Fixing a Placement Group Error in Ceph

Troubleshooting a Ceph cluster PG issue: HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

After replacing patch cords on the test cluster, I noticed an error in Ceph.

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.2e is active+clean+inconsistent, acting [8]
1 scrub errors
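
Before digging into the logs, the inconsistent objects can be listed directly. A minimal check, assuming a Ceph release that already ships rados list-inconsistent-obj (Jewel or newer); it shows which shard failed and why (read error, digest mismatch, and so on):

$ rados list-inconsistent-obj 2.2e --format=json-pretty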

Check the OSD log files:

$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph.audit.log:2017-04-24 15:38:46.124303 mon.2 192.168.2.120:6789/0 103771 : audit [INF] from='client.? 192.168.2.101:0/2439225090' entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "2.2e"}]: dispatch
/var/log/ceph/ceph.log:2017-04-24 10:10:24.576558 osd.3 192.168.2.100:6804/4914 4445 : cluster [INF] 2.2e deep-scrub starts
/var/log/ceph/ceph.log:2017-04-24 10:10:37.434117 osd.3 192.168.2.100:6804/4914 4446 : cluster [ERR] 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940079 osd.3 192.168.2.100:6804/4914 4447 : cluster [ERR] 2.2e deep-scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph.log:2017-04-24 10:10:48.940085 osd.3 192.168.2.100:6804/4914 4448 : cluster [ERR] 2.2e deep-scrub 1 errors
/var/log/ceph/ceph.log:2017-04-24 15:38:46.717506 osd.3 192.168.2.100:6804/4914 4459 : cluster [INF] 2.2e repair starts
/var/log/ceph/ceph.log:2017-04-24 15:38:56.741299 osd.3 192.168.2.100:6804/4914 4460 : cluster [ERR] 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
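
The "candidate had a read error" message means a replica of the object on osd.8 could not be read from the underlying device, so the disk behind that OSD is worth checking as well. A rough sketch (the dev_node field name depends on the OSD backend, and /dev/sdX is a placeholder, not the actual device):

$ ceph osd metadata 8 | grep dev_node
$ smartctl -a /dev/sdX | grep -i -E 'reallocated|pending|uncorrect'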

Start the Placement Group repair:

$ ceph pg repair 2.2e
instructing pg 2.2e on osd.3 to repair
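
While the repair is running, its progress can be followed in the cluster log, for example (just a convenience, not required):

$ ceph -w | grep 2.2e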

A few seconds later the PG is successfully repaired and the cluster returns to its normal state.

$ ceph health detail
HEALTH_OK

Let's check the PG information:

$ ceph pg 2.2e query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "epoch": 516,
    "up": [
        3,
        8
    ],
    "acting": [
        3,
        8
    ],
    "actingbackfill": [
        "3",
        "8"
    ],
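
To make sure the object is actually readable again, a manual deep-scrub of the same PG can be triggered and the result checked in the OSD log:

$ ceph pg deep-scrub 2.2e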

Check the log once more and confirm that the problem has been fixed:

$ grep 2.2e /var/log/ceph/*
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:46.717501 7f2070c50700 0 log_channel(cluster) log [INF] : 2.2e repair starts
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:38:56.741297 7f2070c50700 -1 log_channel(cluster) log [ERR] : 2.2e shard 8: soid 2:74433408:::rbd_data.a12342ae8944a.000000000000071f:head candidate had a read error
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.692446 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.3.log:2017-04-24 15:39:07.752099 7f206e44b700 -1 log_channel(cluster) log [ERR] : 2.2e repair 1 errors, 1 fixed

It is good practice to use LACP on the servers. But since this is a test cluster, errors like this are not unexpected.
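
For reference, a minimal LACP bond on Debian/Ubuntu with ifenslave could look like the sketch below; the interface names eth0/eth1, the bond name and the address are assumptions for illustration, not the actual configuration of this cluster:

# /etc/network/interfaces (fragment) - hypothetical example
auto bond0
iface bond0 inet static
    address 192.168.2.100
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4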
