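The server is an IBM X3650 with five disks in RAID 5 plus one hot spare (although whoever built the machine apparently never configured that disk as a hot spare; details below).
Disk status at the time of the failure: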
[root@serv1 cmdline]# ./arcconf GETCONFIG 1
Could not open log file: UcliEvt.log
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Okay
Channel description : SAS/SATA
Controller Model : IBM ServeRAID 8k
Controller Serial Number : 40703B9
Physical Slot : 0
Installed memory : 256 MB
Copyback : Disabled
Data scrubbing : Enabled
Defunct disk drive count : 1
Logical drives/Offline/Critical : 1/0/1
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : 5.2-0 (15421)
Firmware : 5.2-0 (15421)
Driver : 1.1-5 (2453)
Boot Flash : 5.1-0 (15411)
--------------------------------------------------------
Controller Battery Information
--------------------------------------------------------
Status : Okay
Over temperature : No
Capacity remaining : 100 percent
Time remaining (at current draw) : 4 days, 5 hours, 20 minutes
--------------------------------------------------------
Controller Vital Product Data
--------------------------------------------------------
VPD Assigned# : 39R8875
EC Version# : J85096
Controller FRU# : 25R8076
Battery FRU# : 25R8088
----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
Logical drive name :
RAID level : 5
Status of logical drive : Critical
Size : 1143986 MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery
Partitioned : Yes
Number of segments : 5
Stripe-unit size : 256 KB
Stripe order (Channel,Device) : DDD 0,1 0,2 0,3 0,4
Defunct segments : 0,0
Defunct stripes : No
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
Device is a Hard drive
State : Defunct
Supported : Yes
Transfer Speed : Defunct
Reported Channel,Device : 0,0
Reported Location : Enclosure 0, Slot 0
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504MHL
World-wide name : 500000E01F7F8CF1
Size : 0 MB
Write Cache : Unknown
FRU : 43X0817
PFA : Yes
Device #1
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,1
Reported Location : Enclosure 0, Slot 1
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M6N
World-wide name : 500000E01F7DC3D1
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #2
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,2
Reported Location : Enclosure 0, Slot 2
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M8J
World-wide name : 500000E01F7DCE11
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #3
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,3
Reported Location : Enclosure 0, Slot 3
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M5C
World-wide name : 500000E01F7DBEE1
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #4
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,4
Reported Location : Enclosure 0, Slot 4
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504JWK
World-wide name : 500000E01F6EA341
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : Yes
Device #5
Device is a Hard drive
State : Ready
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,5
Reported Location : Enclosure 0, Slot 5
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504MKL
World-wide name : 500000E01F812561
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #6
Device is an Enclosure services device
Reported Channel,Device : 2,0
Enclosure ID : 0
Type : SES2
Vendor : IBM-ESXS
Model : VSC7160
Firmware : 1.07
Status of Enclosure services device
Temperature : Normal
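As the output above shows, Device #0 is in the Defunct state, meaning it is unusable; in other words, it is dead. Device #5 was supposed to be the hot spare, so at this point it should have taken over, gone through Rebuilding, and then become Online. It was presumably never set up properly at installation time (its state is Ready, not Hot Spare), so it did not take over when Device #0 died.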
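The check described above (which drive is Defunct, and whether any drive is actually in the Hot Spare state) can be sketched as a small parser over the GETCONFIG text. This is only a minimal sketch in Python: the function names `parse_devices` and `summarize` are my own, the sample text is a trimmed copy of the output above, and it assumes the "Device #N" / "key : value" layout shown there. It also reports the PFA flag that arcconf prints per drive.

```python
import re

# Trimmed copy of the "Physical Device information" section shown above.
SAMPLE = """\
Device #0
Device is a Hard drive
State : Defunct
Reported Channel,Device : 0,0
PFA : Yes
Device #4
Device is a Hard drive
State : Online
Reported Channel,Device : 0,4
PFA : Yes
Device #5
Device is a Hard drive
State : Ready
Reported Channel,Device : 0,5
PFA : No
Device #6
Device is an Enclosure services device
Reported Channel,Device : 2,0
"""


def parse_devices(text):
    """Return one dict of 'key : value' fields per 'Device #N' block."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Device #(\d+)$", line)
        if header:
            current = {"index": int(header.group(1))}
            devices.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain commas.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return devices


def summarize(devices):
    """Group drive indexes by the states that matter in this incident."""
    # The enclosure services device has no State field, so it is skipped.
    drives = [d for d in devices if "State" in d]
    return {
        "defunct": [d["index"] for d in drives if d["State"] == "Defunct"],
        "hot_spares": [d["index"] for d in drives if d["State"] == "Hot Spare"],
        "idle_ready": [d["index"] for d in drives if d["State"] == "Ready"],
        "pfa": [d["index"] for d in drives if d.get("PFA") == "Yes"],
    }


summary = summarize(parse_devices(SAMPLE))
print(summary)
```

On the sample, the summary lists Device #0 as defunct, no hot spares at all, and Device #5 sitting idle in Ready, which matches what happened here.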
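We booked an on-site visit from IBM after-sales support. When the IBM engineer arrived, he found that the drive LED on the server had not turned red or amber. In other words, the indicator light itself was broken.
After he replaced the disk in Device #0: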
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
Logical drive name :
RAID level : 5
Status of logical drive : Critical
Size : 1143986 MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery
Partitioned : Yes
Number of segments : 5
Stripe-unit size : 256 KB
Stripe order (Channel,Device) : 0,0 0,1 0,2 0,3 0,4
Defunct segments : No
Defunct stripes : No
Device #0
Device is a Hard drive
State : Rebuilding
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,0
Reported Location : Enclosure 0, Slot 0
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : ST3300655SS
Firmware : BA2D
Serial number : 3LM0LE9Z
World-wide name : 5000C5001C8401B0
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0805
PFA : No
Set Device #5 as a hot spare:
[root@serv1 cmdline]# ./arcconf SETSTATE 1 DEVICE 0 5 HSP
Checked again three hours later:
[root@serv1 cmdline]# ./arcconf GETCONFIG 1
Could not open log file: UcliEvt.log
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Okay
Channel description : SAS/SATA
Controller Model : IBM ServeRAID 8k
Controller Serial Number : 40703B9
Physical Slot : 0
Installed memory : 256 MB
Copyback : Disabled
Data scrubbing : Enabled
Defunct disk drive count : 0
Logical drives/Offline/Critical : 1/0/0
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : 5.2-0 (15421)
Firmware : 5.2-0 (15421)
Driver : 1.1-5 (2453)
Boot Flash : 5.1-0 (15411)
--------------------------------------------------------
Controller Battery Information
--------------------------------------------------------
Status : Okay
Over temperature : No
Capacity remaining : 100 percent
Time remaining (at current draw) : 4 days, 5 hours, 20 minutes
--------------------------------------------------------
Controller Vital Product Data
--------------------------------------------------------
VPD Assigned# : 39R8875
EC Version# : J85096
Controller FRU# : 25R8076
Battery FRU# : 25R8088
----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
Logical drive name :
RAID level : 5
Status of logical drive : Okay
Size : 1143986 MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back) when protected by battery
Partitioned : Yes
Number of segments : 5
Stripe-unit size : 256 KB
Stripe order (Channel,Device) : 0,0 0,1 0,2 0,3 0,4
Defunct segments : No
Defunct stripes : No
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
Device #0
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,0
Reported Location : Enclosure 0, Slot 0
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : ST3300655SS
Firmware : BA2D
Serial number : 3LM0LE9Z
World-wide name : 5000C5001C8401B0
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0805
PFA : No
Device #1
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,1
Reported Location : Enclosure 0, Slot 1
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M6N
World-wide name : 500000E01F7DC3D1
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #2
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,2
Reported Location : Enclosure 0, Slot 2
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M8J
World-wide name : 500000E01F7DCE11
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #3
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,3
Reported Location : Enclosure 0, Slot 3
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504M5C
World-wide name : 500000E01F7DBEE1
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #4
Device is a Hard drive
State : Online
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,4
Reported Location : Enclosure 0, Slot 4
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504JWK
World-wide name : 500000E01F6EA341
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : Yes
Device #5
Device is a Hard drive
State : Hot Spare
Supported : Yes
Transfer Speed : SAS 3.0 Gb/s
Reported Channel,Device : 0,5
Reported Location : Enclosure 0, Slot 5
Reported ESD : 2,0
Vendor : IBM-ESXS
Model : MBA3300RC
Firmware : SA06
Serial number : BJ504MKL
World-wide name : 500000E01F812561
Size : 286102 MB
Write Cache : Disabled (write-through)
FRU : 43X0817
PFA : No
Device #6
Device is an Enclosure services device
Reported Channel,Device : 2,0
Enclosure ID : 0
Type : SES2
Vendor : IBM-ESXS
Model : VSC7160
Firmware : 1.07
Status of Enclosure services device
Temperature : Normal
Command completed successfully.
Hopefully the hot spare will take over the next time a disk dies.
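Since the drive LED gave no warning this time, a periodic software check on the controller summary may be a useful backstop. The following is a minimal sketch in Python, not a finished monitoring tool: the function names are hypothetical, it assumes the summary lines keep the wording shown in the output above, and `fetch_config` simply runs the same `./arcconf GETCONFIG 1` command used in this post.

```python
import re
import subprocess


def fetch_config(arcconf="./arcconf", controller=1):
    """Run 'arcconf GETCONFIG <controller>' and return its text output."""
    result = subprocess.run(
        [arcconf, "GETCONFIG", str(controller)],
        capture_output=True,
        text=True,
    )
    return result.stdout


def problems(text):
    """Return a list of human-readable issues found in GETCONFIG output."""
    defunct = re.search(r"Defunct disk drive count\s*:\s*(\d+)", text)
    logical = re.search(
        r"Logical drives/Offline/Critical\s*:\s*(\d+)/(\d+)/(\d+)", text
    )
    if defunct is None or logical is None:
        raise ValueError("controller summary lines not found")

    issues = []
    if int(defunct.group(1)) > 0:
        issues.append("defunct drives: " + defunct.group(1))
    _total, offline, critical = (int(g) for g in logical.groups())
    if offline > 0:
        issues.append("offline logical drives: %d" % offline)
    if critical > 0:
        issues.append("critical logical drives: %d" % critical)
    # A drive only counts as a spare if its state is literally 'Hot Spare'.
    if not re.search(r"State\s*:\s*Hot Spare", text):
        issues.append("no hot spare configured")
    return issues


# Summary lines taken from the two GETCONFIG runs in this post.
BEFORE = (
    "Defunct disk drive count : 1\n"
    "Logical drives/Offline/Critical : 1/0/1\n"
    "State : Ready\n"
)
AFTER = (
    "Defunct disk drive count : 0\n"
    "Logical drives/Offline/Critical : 1/0/0\n"
    "State : Hot Spare\n"
)

print(problems(BEFORE))
print(problems(AFTER))
```

In real use, something like `problems(fetch_config())` could be run from cron and mailed out whenever the list is non-empty; how alerts are delivered is left open here.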