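This procedure has already been carried out in a production environment (on a cell node). These notes record the general process; some commands are taken from the Oracle documentation and some are the commands that were actually run.
References: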
Maintaining Oracle Exadata Storage Servers
3.3.6 Replacing a Hard Disk Proactively
How to Replace a Hard Drive in an Exadata Storage Cell Server (Hard Failure) (Doc ID 1386147.1)
How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure) (Doc ID 1390836.1)
Deciding When a Hard Disk on an Exadata Server Should Be Replaced (original title: 决定在什么时候应该更换Exadata服务器上的硬盘) (Doc ID 2661785.1)
Exadata ALTER PHYSICALDISK N:N DROP FOR REPLACEMENT is hung (Doc ID 2574663.1)
Exadata Storage software has a complete set of automated operations for hard disk maintenance, when a hard disk has failed or has been flagged as a problematic disk. But there are situations where a hard disk has to be removed proactively from the configuration.
In the CellCLI ALTER PHYSICALDISK command, the DROP FOR REPLACEMENT option checks whether a normally functioning hard disk can be removed safely without the risk of data loss. However, after the command runs, the grid disks on the hard disk are inactivated on the storage cell and set to offline in the Oracle ASM disk groups.
The redundancy of the disk group is compromised until the hard disk has been replaced or re-enabled, and the subsequent rebalance completes. This is especially important for disk groups using normal redundancy.
To reduce the risk of having a disk group without full redundancy and proactively replace a hard disk, follow this procedure:
Identify the physical disk and its associated LUN, cell disk, and grid disks:
# cellcli -e "list diskmap" | grep 'X:Y'
The output is similar to the following:
20:5 KEBTDJ 5 normal 559G
CD_05_exaceladm01 /dev/sdf
"DATAC1_CD_05_exaceladm01, DBFS_DG_CD_05_exaceladm01,
RECOC1_CD_05_exaceladm01"
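When several disks have to be checked, it can help to pull the cell disk, device, and grid disk names out of the diskmap row programmatically. The following is a minimal sketch that assumes the row looks like the sample above: the physical disk name comes first, the cell disk name sits immediately before the `/dev/...` device, and the grid disks are the quoted list at the end. The function name `parse_diskmap_row` is hypothetical, and the column layout may differ between Exadata software versions.

```python
import re

def parse_diskmap_row(text):
    """Parse one (possibly line-wrapped) row of `list diskmap` output.

    Assumed layout (taken from the sample row, not from a formal spec):
    physical disk name first, cell disk name right before the /dev/ device,
    grid disks as a quoted comma-separated list at the end.
    """
    quoted = re.search(r'"([^"]*)"', text)
    grid_disks = [g.strip() for g in quoted.group(1).split(",")] if quoted else []
    # Everything before the quoted list is whitespace-separated columns.
    head = text[:quoted.start()] if quoted else text
    fields = head.split()
    dev_idx = next(i for i, f in enumerate(fields) if f.startswith("/dev/"))
    return {
        "physical_disk": fields[0],
        "cell_disk": fields[dev_idx - 1],
        "device": fields[dev_idx],
        "grid_disks": grid_disks,
    }

sample = '''20:5 KEBTDJ 5 normal 559G
CD_05_exaceladm01 /dev/sdf
"DATAC1_CD_05_exaceladm01, DBFS_DG_CD_05_exaceladm01,
RECOC1_CD_05_exaceladm01"'''

row = parse_diskmap_row(sample)
print(row["physical_disk"], row["cell_disk"], row["device"])
print(row["grid_disks"])
```

The grid disk list is what the later ASM steps operate on, so it is worth comparing against the real `list griddisk` output before using it.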
Look up the LUN:
CellCLI> list lun where deviceName='/dev/sdf/'
0_5 0_5 normal
Drop the grid disks at the ASM level:
SQL> ALTER DISKGROUP diskgroup_name DROP DISK asm_disk_name;
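Because one physical disk usually carries a grid disk for each disk group, the DROP DISK statement has to be issued once per grid disk. The sketch below generates those statements under two assumptions that should be verified against `v$asm_disk` first: grid disks follow the default naming pattern `<diskgroup>_CD_<nn>_<cell>`, and each ASM disk name is the grid disk name in upper case. The helper name `drop_disk_statements` is hypothetical.

```python
def drop_disk_statements(grid_disks):
    """Build one ALTER DISKGROUP ... DROP DISK statement per grid disk.

    Assumes default Exadata naming: <diskgroup>_CD_<nn>_<cell>, with the
    ASM disk name equal to the upper-cased grid disk name.
    """
    statements = []
    for grid_disk in grid_disks:
        diskgroup, sep, _ = grid_disk.partition("_CD_")
        if not sep:
            raise ValueError(f"unexpected grid disk name: {grid_disk}")
        statements.append(
            f"ALTER DISKGROUP {diskgroup} DROP DISK {grid_disk.upper()};"
        )
    return statements

drop_stmts = drop_disk_statements([
    "DATAC1_CD_05_exaceladm01",
    "DBFS_DG_CD_05_exaceladm01",
    "RECOC1_CD_05_exaceladm01",
])
for stmt in drop_stmts:
    print(stmt)
```

The statements are only printed, not executed, so they can be reviewed before being pasted into SQL*Plus on the ASM instance.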
Wait for the rebalance to complete:
SQL> select * from v$asm_operation;
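The rebalance is finished when the query above returns no rows. Rather than re-running it by hand, the wait can be scripted as a simple polling loop. This is only a sketch: `fetch_operations` is a hypothetical callable that runs the query through whatever database driver is available and returns the rows as a list, and the 60-second interval is an arbitrary choice.

```python
import time

def wait_for_rebalance(fetch_operations, poll_seconds=60, sleep=time.sleep):
    """Poll until fetch_operations() returns no rows; return the poll count.

    fetch_operations is a caller-supplied (hypothetical) function that runs
    `select * from v$asm_operation` and returns the result rows as a list.
    """
    polls = 0
    while True:
        rows = fetch_operations()
        polls += 1
        if not rows:
            return polls
        sleep(poll_seconds)

# Demonstration with canned results instead of a real database connection:
# two polls still show a running rebalance, the third shows none.
fake_results = iter([[("REBAL", "RUN")], [("REBAL", "RUN")], []])
polls = wait_for_rebalance(lambda: next(fake_results), sleep=lambda s: None)
print(polls)
```

The sketch does not inspect the rows themselves, so an operation that has stopped with an error would still need to be checked manually.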
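Drop the disk for replacement:
CellCLI> alter physicaldisk 20:4 serviceled on -- the older method (lighting the service LED); it is obsolete and no longer works
CellCLI> ALTER PHYSICALDISK 20:4 DROP FOR REPLACEMENT; -- this is the command to use, but it can hang; for the fix, see Doc ID 2574663.1 in the references above
After the drop for replacement completes, the LED of that disk on the storage cell turns blue. (The cell has an HDD map that shows which slot each disk is in, but to be sure of pulling the right disk, it is still best to have its LED lit.)
Replace the disk: pull out the old disk and insert the new one. The official documentation recommends waiting three minutes before inserting the new disk (in the actual operation, we did not wait three minutes).
Check the LUN, cell disk, and grid disk information: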
CellCLI> list lun lun_name
CellCLI> list celldisk where lun=lun_name
CellCLI> list griddisk where celldisk=celldisk_name
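Confirm that the disks have been added back to ASM; the following query should return no rows. If they have not been added, add them manually. Normally the LUN, cell disk, and grid disks are created automatically (this is visible in the cell's alert log).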
SQL> SELECT path,header_status FROM v$asm_disk WHERE group_number=0;
Add the disks to ASM manually:
alter diskgroup DATA_ABC add disk 'o/192.168.0.1/DATA_ABC_CD_04_abccel02' rebalance power 4;
alter diskgroup RECO_ABC add disk 'o/192.168.0.1/RECO_ABC_CD_04_abccel02' rebalance power 4;
alter diskgroup DBFS_DG add disk 'o/192.168.0.1/DBFS_DG_CD_04_abccel02' rebalance power 4;
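The three statements above differ only in the disk group and grid disk name, so they can be generated from the cell IP address and the list of grid disks. The following sketch assumes the default naming pattern `<diskgroup>_CD_<nn>_<cell>` so that the disk group can be derived from the grid disk name; the helper name `add_disk_statements` is hypothetical, and the rebalance power of 4 simply mirrors the statements above.

```python
def add_disk_statements(cell_ip, grid_disks, power=4):
    """Build one ALTER DISKGROUP ... ADD DISK statement per grid disk.

    Assumes grid disks are named <diskgroup>_CD_<nn>_<cell>; the disk path
    has the form o/<cell IP>/<grid disk name>, as in the statements above.
    """
    statements = []
    for grid_disk in grid_disks:
        diskgroup, sep, _ = grid_disk.partition("_CD_")
        if not sep:
            raise ValueError(f"unexpected grid disk name: {grid_disk}")
        statements.append(
            f"alter diskgroup {diskgroup} add disk "
            f"'o/{cell_ip}/{grid_disk}' rebalance power {power};"
        )
    return statements

add_stmts = add_disk_statements(
    "192.168.0.1",
    ["DATA_ABC_CD_04_abccel02", "RECO_ABC_CD_04_abccel02", "DBFS_DG_CD_04_abccel02"],
)
for stmt in add_stmts:
    print(stmt)
```

As with the drop step, the statements are printed for review rather than executed.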
Check the rebalance. Done.
Addendum: what if the wrong disk is pulled? Put it back in. The official documentation covers this:
3.3.9 Removing and Replacing the Same Hard Disk
What happens if you accidentally remove the wrong hard disk?
If you inadvertently remove the wrong hard disk, then put the disk back. It will automatically be added back in the Oracle ASM disk group, and its data is resynchronized.
If a disk is inserted into the wrong slot and is rejected, the official documentation says to re-enable it:
3.3.10 Re-Enabling a Hard Disk That Was Rejected
If a physical disk was rejected because it was inserted into the wrong slot, you can re-enable the disk.
Run the following command:
Caution:
The following command removes all data on the physical disk.
CellCLI> ALTER PHYSICALDISK hard_disk_name reenable force
The following is an example of the output from the command:
Physical disk 20:0 was reenabled.