A Backend Service Crash Caused by a Single Disk

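Two days ago, a DBA and a hardware engineer discovered this problem while replacing a hard disk. Fortunately it was handled promptly, so the impact did not spread any further.

What was the problem?

This incident came down to bad blocks on the server's RAID array. They caused database writes to fail, MySQL kept retrying its flushes to disk, and in the end the MySQL service became intermittent, repeatedly going down and coming back.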

I. Database error symptoms

1. The MySQL error log

171208 19:16:07 InnoDB: Rollback of non-prepared transactions completed

171208 19:16:18 InnoDB: Warning: purge reached the head of the history list,

InnoDB: but its length is still reported as 583274! Make a detailed bug

InnoDB: report, and submit it to http://bugs.mysql.com

InnoDB: Error: tried to read 16384 bytes at offset 0 413483008.

InnoDB: Was only able to read -1.

171208 19:18:58 InnoDB: Operating system error number 5 in a file operation.

InnoDB: Error number 5 means 'Input/output error'.

InnoDB: Some operating system error numbers are described at

InnoDB: http://dev.mysql.com/doc/refman/5.1/en/operating-system-error-codes.html

InnoDB: File operation call: 'read'.

InnoDB: Cannot continue operation.
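
The key line in this log is the operating system error number, because it shows the failure comes from below MySQL rather than from InnoDB itself. As a minimal sketch (the names `OS_ERROR_RE` and `find_os_errors` are made up for illustration, and the timestamp format is assumed to match the excerpt above), the error log could be scanned for such lines like this:

```python
import re

# Matches lines such as:
# "171208 19:18:58 InnoDB: Operating system error number 5 in a file operation."
# The timestamp layout is an assumption based on the excerpt in this post.
OS_ERROR_RE = re.compile(
    r"^(?P<stamp>\d{6}\s+\d{1,2}:\d{2}:\d{2})\s+InnoDB: "
    r"Operating system error number (?P<errno>\d+)"
)


def find_os_errors(lines):
    """Return (timestamp, errno) pairs for each OS-level error InnoDB logged."""
    hits = []
    for line in lines:
        match = OS_ERROR_RE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), int(match.group("errno"))))
    return hits


# Lines copied from the error log shown above.
SAMPLE_LOG = [
    "171208 19:16:07 InnoDB: Rollback of non-prepared transactions completed",
    "InnoDB: Was only able to read -1.",
    "171208 19:18:58 InnoDB: Operating system error number 5 in a file operation.",
    "InnoDB: Error number 5 means 'Input/output error'.",
]
```

Calling `find_os_errors(SAMPLE_LOG)` picks out only the 19:18:58 entry with error number 5, which is the cue to stop looking at MySQL and start looking at the disk.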

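2. MySQL connections dropped or could not be established. The symptoms were as follows:

(1) The monitoring system raised an alert

Mysql of 3301 port is down

(2) Terminal sessions occasionally failed to connect

Connections from the terminal failed frequently, which means client connections were affected as well.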
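
An alert like the one above usually comes from a simple TCP probe. The following is only a sketch of such a check, not the monitoring system actually used here; the function name `port_is_open` and the 3-second timeout are assumptions:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("127.0.0.1", 3301)` would return False during the windows in which MySQL had gone down. A successful TCP connect only proves the port is listening, so a real check would probably also run a trivial query.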

3. Checking the operating system log at this point

Running tail -f /var/log/messages showed the following errors:

Dec 8 19:48:40 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:49:51 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:49:51 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:51:01 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:51:01 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:51:48 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:51:48 kernel: end_request: I/O error, dev sdb, sector 72629874

Dec 8 19:53:00 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002

Dec 8 19:53:00 kernel: end_request: I/O error, dev sdb, sector 72629874
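
Every I/O error in this excerpt points at the same device and the same sector, which fits a bad block better than a disk that has failed outright. A small sketch that tallies these kernel messages by device and sector makes that pattern easy to spot (the names `IO_ERROR_RE` and `count_io_errors` are illustrative, and the regular expression only covers the `end_request` format shown above):

```python
import re
from collections import Counter

# Matches kernel lines such as:
# "kernel: end_request: I/O error, dev sdb, sector 72629874"
IO_ERROR_RE = re.compile(
    r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def count_io_errors(lines):
    """Count I/O errors per (device, sector) pair found in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("sector")))] += 1
    return counts


# Lines copied from the system log shown above.
SAMPLE_LOG = [
    "Dec 8 19:48:40 kernel: end_request: I/O error, dev sdb, sector 72629874",
    "Dec 8 19:49:51 kernel: sd 0:2:1:0: SCSI error: return code = 0x00070002",
    "Dec 8 19:49:51 kernel: end_request: I/O error, dev sdb, sector 72629874",
]
```

On the sample, `count_io_errors(SAMPLE_LOG)` reports two errors, both for `sdb` at sector 72629874; the SCSI return-code line is ignored because it carries no sector.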

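II. Root cause analysis

Background: the disks were configured as RAID 5, and the array had dropped a disk in the past. During the repair, the engineer hot-swapped another disk directly, after which the operating system began reporting frequent I/O errors, eventually leading to the problems described above.

1. Command for checking Dell disk status

omreport storage controller controller=0

All physical disks were reported as healthy, but the RAID 5 virtual disk was in an abnormal state, as shown below: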

ID : 1

Status : Critical

Name : Virtual Disk 1

State : Ready

Encrypted : No

Layout : RAID-5

Size : 2,233.50 GB (2398202363904 bytes)

Device Name : /dev/sdb

Bus Protocol : SAS

Media : HDD

Read Policy : Adaptive Read Ahead

Write Policy : Force Write Back

Cache Policy : Not Applicable

Stripe Element Size : 64 KB

Disk Cache Policy : Enabled
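
Output in this "Key : Value" form is easy to check automatically. The sketch below parses one such block into a dictionary and flags the virtual disk when its status is anything other than healthy. The function names are hypothetical, and treating `Ok` as the only healthy `Status` value is an assumption to verify against your OpenManage version:

```python
def parse_omreport_block(text):
    """Parse 'Key : Value' lines into a dict, splitting on the first ' : ' only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip blank lines and lines without a separator
            fields[key.strip()] = value.strip()
    return fields


def is_degraded(fields):
    """Assumption: any Status other than 'Ok' means the virtual disk needs attention."""
    return fields.get("Status", "").lower() != "ok"


# A few fields copied from the omreport output shown above.
SAMPLE_OUTPUT = """\
ID : 1
Status : Critical
Name : Virtual Disk 1
State : Ready
Layout : RAID-5
Device Name : /dev/sdb
"""
```

With the sample above, `is_degraded(parse_omreport_block(SAMPLE_OUTPUT))` is True. Note that `State : Ready` alone would have looked fine, so it is the `Status` field that matters here.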

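III. Repair procedure

1. Repair approach

What was the cause? We suspect that bad blocks developed on the RAID array during the disk replacement or the earlier disk drop-out.

We therefore repaired it as follows:

2. Reboot the system

After the reboot, the errors no longer appeared in the system log.

3. Repair with omconfig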

omconfig storage vdisk action=clearvdbadblocks controller=0 vdisk=1

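Even though our physical servers are set up with RAID and a battery-backed RAID controller, the hardware still has failure modes of its own.

These are problems that every engineer needs to discover and understand.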
