Goal
Resolve the following fault:
# ceph -s
  cluster:
    id:     7e720238-7ada-4922-ba2e-xxxxxx4e4
    health: HEALTH_WARN
            Degraded data redundancy: 85 pgs unclean, 85 pgs degraded, 85 pgs undersized

  services:
    mon: 3 daemons, quorum ns-storage-020100,ns-storage-020101,ns-storage-020102
    mgr: ns-storage-020100(active), standbys: ns-storage-020101, ns-storage-020102
    osd: 18 osds: 18 up, 18 in; 43 remapped pgs

  data:
    pools:   3 pools, 1152 pgs
    objects: 250 objects, 631 MB
    usage:   40579 MB used, 66966 GB / 67006 GB avail
    pgs:     1024 active+clean
             85   active+undersized+degraded
             43   active+clean+remapped
Attempting to repair the PGs with ceph pg repair failed.
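The attempt looked like this (a sketch; pg 3.c is one of the stuck PGs listed below):
# ceph pg repair 3.c
ceph pg repair tells the primary OSD to repair inconsistent object replicas found by scrub. It cannot create a replica on an OSD that CRUSH never mapped, so it does nothing for undersized PGs.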
Check the PG state in detail:
# ceph health detail
.......
pg 3.c is stuck undersized for 6137651.255431, current state active+undersized+degraded, last acting [14,13]
pg 3.d is stuck undersized for 6137651.146218, current state active+undersized+degraded, last acting [15,13]
.......
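The full list of stuck PGs, and the placement details of any single one, can be pulled before changing anything. A sketch using standard commands (pg 3.c is taken from the output above):
# ceph pg dump_stuck undersized
# ceph pg 3.c query
In the query output, "up" and "acting" both showing only two OSDs (e.g. [14,13]) indicates that CRUSH cannot find a third placement target, as opposed to a mapped OSD being down.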
State analysis
PG state | Explanation
---|---
unclean | the PG is faulty: it has not achieved the specified number of replicas
degraded | some objects in the PG have not yet been replicated the required number of times
undersized | the PG has fewer replicas than the replica count configured for its pool
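To tie the table to the numbers above: every stuck PG has exactly two OSDs in its acting set, so comparing that against the pools' configured replica counts confirms the undersized reading. A sketch using a standard command:
# ceph osd pool ls detail
A pool reporting size 3 while its PGs carry acting sets like [14,13] is missing one replica per PG.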
Root cause
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-12 16.00000 root noah
-9 8.00000 host ns-storage-020100.vclound.com
12 hdd 4.00000 osd.12 up 1.00000 1.00000
13 hdd 4.00000 osd.13 up 1.00000 1.00000
-10 8.00000 host ns-storage-020101.vclound.com
14 hdd 4.00000 osd.14 up 1.00000 1.00000
15 hdd 4.00000 osd.15 up 1.00000 1.00000
-11 0 host ns-storage-020102.vclound.com
-1 55.63620 root default
-2 15.63620 host ns-storage-020100
0 hdd 3.63620 osd.0 up 1.00000 1.00000
1 hdd 4.00000 osd.1 up 1.00000 1.00000
2 hdd 4.00000 osd.2 up 1.00000 1.00000
3 hdd 4.00000 osd.3 up 1.00000 1.00000
-3 16.00000 host ns-storage-020101
4 hdd 4.00000 osd.4 up 1.00000 1.00000
5 hdd 4.00000 osd.5 up 1.00000 1.00000
6 hdd 4.00000 osd.6 up 1.00000 1.00000
7 hdd 4.00000 osd.7 up 1.00000 1.00000
-4 24.00000 host ns-storage-020102
8 hdd 4.00000 osd.8 up 1.00000 1.00000
9 hdd 4.00000 osd.9 up 1.00000 1.00000
10 hdd 4.00000 osd.10 up 1.00000 1.00000
11 hdd 4.00000 osd.11 up 1.00000 1.00000
16 hdd 4.00000 osd.16 up 1.00000 1.00000 <--- should normally be under the noah root
17 hdd 4.00000 osd.17 up 1.00000 1.00000 <--- should normally be under the noah root
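Pool 3 (the pool behind pg 3.c and pg 3.d above) should be placed under root noah; if in doubt, this can be verified from the CRUSH rules. A sketch, assuming the affected pool is named noah-pool (the real pool name is not shown above):
# ceph osd pool get noah-pool crush_rule
# ceph osd crush rule dump
A "take" step with "item_name": "noah" in the pool's rule confirms that placement starts at the root whose third host is empty.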
Under root noah, host ns-storage-020102.vclound.com has weight 0 and holds no OSDs: osd.16 and osd.17 were mistakenly placed under host ns-storage-020102 in root default. CRUSH therefore cannot find a third host for PGs placed via root noah, which is exactly why 85 PGs sit at active+undersized+degraded with two-OSD acting sets such as [14,13].
Migrate the OSDs to the noah root
# ceph osd crush rm osd.16
removed item id 16 name 'osd.16' from crush map
# ceph osd crush rm osd.17
removed item id 17 name 'osd.17' from crush map
# ceph osd crush add osd.16 4.0 host=ns-storage-020102.vclound.com
add item id 16 name 'osd.16' weight 4 at location {host=ns-storage-020102.vclound.com} to crush map
# ceph osd crush add osd.17 4.0 host=ns-storage-020102.vclound.com
add item id 17 name 'osd.17' weight 4 at location {host=ns-storage-020102.vclound.com} to crush map
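The same move can usually be done in one step instead of the remove/add pair above. A hedged alternative sketch:
# ceph osd crush set osd.16 4.0 host=ns-storage-020102.vclound.com
# ceph osd crush set osd.17 4.0 host=ns-storage-020102.vclound.com
ceph osd crush set updates the weight and location of an existing item in place, avoiding the brief window in which the OSD is missing from the CRUSH map entirely.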
Observe the OSD tree
# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-12 24.00000 root noah
-9 8.00000 host ns-storage-020100.vclound.com
12 hdd 4.00000 osd.12 up 1.00000 1.00000
13 hdd 4.00000 osd.13 up 1.00000 1.00000
-10 8.00000 host ns-storage-020101.vclound.com
14 hdd 4.00000 osd.14 up 1.00000 1.00000
15 hdd 4.00000 osd.15 up 1.00000 1.00000
-11 8.00000 host ns-storage-020102.vclound.com
16 4.00000 osd.16 up 1.00000 1.00000
17 4.00000 osd.17 up 1.00000 1.00000
-1 47.63620 root default
-2 15.63620 host ns-storage-020100
0 hdd 3.63620 osd.0 up 1.00000 1.00000
1 hdd 4.00000 osd.1 up 1.00000 1.00000
2 hdd 4.00000 osd.2 up 1.00000 1.00000
3 hdd 4.00000 osd.3 up 1.00000 1.00000
-3 16.00000 host ns-storage-020101
4 hdd 4.00000 osd.4 up 1.00000 1.00000
5 hdd 4.00000 osd.5 up 1.00000 1.00000
6 hdd 4.00000 osd.6 up 1.00000 1.00000
7 hdd 4.00000 osd.7 up 1.00000 1.00000
-4 16.00000 host ns-storage-020102
8 hdd 4.00000 osd.8 up 1.00000 1.00000
9 hdd 4.00000 osd.9 up 1.00000 1.00000
10 hdd 4.00000 osd.10 up 1.00000 1.00000
11 hdd 4.00000 osd.11 up 1.00000 1.00000
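One side effect is visible above: re-adding osd.16 and osd.17 via ceph osd crush rm/add dropped their hdd device class (the CLASS column is empty for both). If any CRUSH rule filters on device class, it should be restored; a sketch for Luminous or newer:
# ceph osd crush set-device-class hdd osd.16 osd.17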
Ceph completes the recovery automatically
# ceph -s
  cluster:
    id:     7e720238-7ada-4922-ba2e-d9d9a49ac4e4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ns-storage-020100,ns-storage-020101,ns-storage-020102
    mgr: ns-storage-020100(active), standbys: ns-storage-020101, ns-storage-020102
    osd: 18 osds: 18 up, 18 in

  data:
    pools:   3 pools, 1152 pgs
    objects: 250 objects, 631 MB
    usage:   40584 MB used, 66966 GB / 67006 GB avail
    pgs:     1152 active+clean

  io:
    recovery: 341 kB/s, 0 objects/s
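With only 631 MB of data the backfill finishes almost immediately; on a fuller cluster the recovery can be watched until every PG returns to active+clean. A sketch using standard commands:
# ceph -w
# ceph pg dump pgs_brief | grep -c undersized
HEALTH_OK plus a zero count of undersized PGs confirms the fix.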