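Recently, while deploying an Oracle ODA X9-2 appliance, I ran into a problem where the interconnect (heartbeat) network would not come up. After the interconnect cables (25Gb direct connection) were plugged in, the NIC link LEDs stayed off, and the operating system could not detect the NIC link status either. Swapping the cables made no difference, which pointed to the NIC rather than the cabling.
Because the server was brand new, a hardware failure seemed unlikely, so I considered whether this could be a software problem such as a compatibility or firmware version issue. After some searching on MOS, I found this note: Network not working on ODA X9-2 systems using Mellanox Dual Port SFP28 CX5 25Gb Ethernet Adapter with firmware 16.29.1436 (Doc ID 2910654.1). I checked the NIC model and confirmed that it matched the description in the note. According to the note, upgrading the NIC firmware resolves the problem. I followed that procedure, and the NIC returned to normal afterwards.
The procedure was as follows:
1. Symptom: the NIC cannot be used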
# odacli configure-firstnet
INFO: Using default bonding configuration
Select the Interface to configure the network on (btbond1) [btbond1]:
WARNING: Port [p6p1] is not connected
WARNING: Interface [btbond1] is not connected correctly, please check the connection
Checking the status of the ports, we will see that the link on ports p6p1 and p6p2 is down:
# ethtool p6p1
Settings for p6p1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: no
# ethtool p6p2
Settings for p6p2:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Unknown! (255)
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: no
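The link check above can also be scripted. The following is a minimal sketch, assuming the text printed by `ethtool <port>` has already been captured as a string; the function names `parse_ethtool` and `link_is_up` and the shortened `SAMPLE_DOWN` text are hypothetical and used only for illustration.

```python
def parse_ethtool(output: str) -> dict:
    """Collect 'Key: value' pairs from captured ethtool output (hypothetical helper)."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Lines without a colon (extra link modes, the lone "link" word) and
        # lines with an empty value ("Settings for p6p1:") are skipped.
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def link_is_up(output: str) -> bool:
    """Return True only when ethtool reports 'Link detected: yes'."""
    return parse_ethtool(output).get("Link detected", "").lower() == "yes"


# Shortened excerpt of the output shown above for a port that is down.
SAMPLE_DOWN = """Settings for p6p1:
Supported ports: [ FIBRE ]
Speed: Unknown!
Duplex: Unknown! (255)
Link detected: no
"""

print(link_is_up(SAMPLE_DOWN))  # False
```

On a real system the string would come from running `ethtool p6p1` and `ethtool p6p2`; how the output is captured is left out of the sketch on purpose.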
2. Check the NIC firmware information
# odacli describe-component
For systems with image 19.16 the firmware package can be found under the following location:
# cd /opt/oracle/oak/pkgrepos/thirdpartypkgs/Firmware/Controller/Mellanox/0x1017/16.32.1010/ORC0000000004/Base
# ls
componentmetadata.xml fw-ConnectX5-rel-16_32_1010-8201339_Ax_Bx-UEFI-14.25.17.signed.bin metadata.xml
# fwupdate list controller
==================================================
CONTROLLER
==================================================
ID Type Manufacturer Model Product Name FW Version BIOS Version EFI Version FCODE Version Package Version NVDATA Version XML Support
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
c0 HDC Intel 0x2826 0x4873 - - - - - - N/A
c1 NVMe Intel 0x0b60 INTEL SSDPF2KX076T9S 2CV1RC30 - - - - - N/A
c2 NET Mellanox 0x1017 Oracle 25GbE Dual-Port SF 16.29.1436 - 14.22.16 - - - N/A <<<< this is the Mellanox card, ID c2
c3 NET Intel 0x1589 Oracle Quad Port 10GBase - - 4.5.5 3.10.0 8000A87E - N/A
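To spot an affected card without reading the table by eye, the listing can be scanned for Mellanox rows that carry the firmware version named in the MOS note. This is a rough sketch under the assumption that the output of `fwupdate list controller` is available as plain text like the listing above; `mellanox_rows`, `affected_controller_ids`, `AFFECTED_FW` and `LISTING` are hypothetical names. Because the product name contains spaces, the sketch does not try to rebuild the columns and only looks for the version string among the row's fields.

```python
import re

AFFECTED_FW = "16.29.1436"  # firmware version named in Doc ID 2910654.1


def mellanox_rows(listing: str) -> list:
    """Return the whitespace-split fields of every Mellanox controller row."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with an ID such as c0, c1, ...; the third field is the manufacturer.
        if len(fields) >= 4 and re.fullmatch(r"c\d+", fields[0]) and fields[2] == "Mellanox":
            rows.append(fields)
    return rows


def affected_controller_ids(listing: str, affected: str = AFFECTED_FW) -> list:
    """IDs of Mellanox controllers whose row contains the affected firmware version."""
    return [fields[0] for fields in mellanox_rows(listing) if affected in fields[4:]]


# Two rows copied from the listing above.
LISTING = """c1 NVMe Intel 0x0b60 INTEL SSDPF2KX076T9S 2CV1RC30 - - - - - N/A
c2 NET Mellanox 0x1017 Oracle 25GbE Dual-Port SF 16.29.1436 - 14.22.16 - - - N/A
"""

print(affected_controller_ids(LISTING))  # ['c2']
```

The returned ID is the value that would be passed to `fwupdate update controller -n`.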
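3. Firmware update procedure (run from the firmware package directory shown above, which contains metadata.xml)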
# fwupdate update controller -n c2 -x metadata.xml
The following actions will be taken:
===================================================================================
ID Priority Action Status Old Firmware Ver. Proposed Ver. New Firmware Ver. System Reboot
-------------------------------------------------------------------------------------------------------------------------------------
c2 1 Check FW Success 16.29.1436 16.32.1010 N/A System Reset
Do you wish to process the above actions? [y/n]? y
Update c2: Updating c2: Success
Sleeping for 15 seconds for a component to recover
Verifying all priority 1 updates
Execution Summary
===================================================================================
ID Priority Action Status Old Firmware Ver. Proposed Ver. New Firmware Ver. System Reboot
-------------------------------------------------------------------------------------------------------------------------------------
c2 1 Post Power Pending 16.29.1436 16.32.1010 N/A System Reset
System Reboot required for some applied firmware
Do you wish to automatically reboot now? [y/n]? y
As shown above, you are asked whether to reboot now; enter y (yes) to reboot.
4. Reboot the server, then check the NIC hardware and the firmware version from the host OS; both are back to normal
# fwupdate list controller
==================================================
CONTROLLER
==================================================
ID Type Manufacturer Model Product Name FW Version BIOS Version EFI Version FCODE Version Package Version NVDATA Version XML Support
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
c0 HDC Intel 0x2826 0x4873 - - - - - - - N/A
c1 NVMe Intel 0x0b60 INTEL SSDPF2KX076T9S 2CV1RC30 - - - - - - N/A
c2 NET Mellanox 0x1017 Oracle 25GbE Dual-Port SF 16.32.1010 - - 14.25.17 - - - N/A <<< should show 16.32.1010
c3 NET Intel 0x1589 Oracle Quad Port 10GBase - - - 4.5.5 3.10.0 8000A87E - N/A
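When checking the reported version in a script, plain string comparison is unreliable because "16.9" sorts after "16.29" as text. A small sketch of a numeric comparison follows; `version_tuple`, `is_at_least` and `TARGET_FW` are hypothetical names, and treating any version at or above 16.32.1010 as acceptable is an assumption, since the source only confirms 16.32.1010 itself.

```python
TARGET_FW = "16.32.1010"  # version applied by the update shown above


def version_tuple(version: str) -> tuple:
    """Turn a dotted version such as '16.32.1010' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(found: str, required: str = TARGET_FW) -> bool:
    """Compare two dotted versions numerically, field by field."""
    return version_tuple(found) >= version_tuple(required)


print(is_at_least("16.29.1436"))  # False
print(is_at_least("16.32.1010"))  # True
```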
Also check that ports p6p1 and p6p2 are up, then re-run odacli configure-firstnet:
# ethtool p6p1
Settings for p6p1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 25000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes