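Problem
The test server kept crashing: about once a week at first, and later as soon as the application service started.
Server OS: CentOS 6.5. Kernel version: 2.6.32-431.el6.x86_64
Server system log analysis
I checked the log at /var/log/messages; the entries below are the errors that show up most often.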
Dec 4 14:11:46 localhost abrtd: Init complete, entering main loop
Dec 4 14:11:53 localhost modem-manager: (ttyS1) closing serial device...
Dec 4 14:11:53 localhost modem-manager: (ttyS1) opening serial device...
Dec 4 14:11:59 localhost modem-manager: (ttyS1) closing serial device...
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: APEI generic hardware error status
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: severity: 2, corrected
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: section: 0, severity: 2, corrected
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: flags: 0x01
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: primary
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: fru_text: CorrectedErr
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: section_type: memory error
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: node: 15424
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: device: 12343
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC
Dec 4 14:12:16 localhost kernel: [Hardware Error]: Machine check events logged
[crash]
Dec 9 04:05:06 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
[reboot]
Dec 9 04:05:06 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1601" x-info="http://www.rsyslog.com"] start
Dec 9 04:05:06 localhost kernel: Initializing cgroup subsys cpuset
Dec 9 04:05:11 localhost abrtd: Init complete, entering main loop
Dec 9 04:05:19 localhost modem-manager: (ttyS1) closing serial device...
Dec 9 04:05:19 localhost modem-manager: (ttyS1) opening serial device...
Dec 9 04:05:25 localhost modem-manager: (ttyS1) closing serial device...
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: APEI generic hardware error status
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: severity: 2, corrected
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: section: 0, severity: 2, corrected
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: flags: 0x01
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: primary
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: fru_text: CorrectedErr
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: section_type: memory error
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: node: 24208
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: device: 12343
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC
Dec 9 04:05:52 localhost kernel: [Hardware Error]: Machine check events logged
[crash]
Dec 11 10:40:00 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
[reboot]
Dec 11 10:40:00 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1603" x-info="http://www.rsyslog.com"] start
Dec 11 10:40:00 localhost kernel: Initializing cgroup subsys cpuset
Dec 11 10:40:00 localhost kernel: Initializing cgroup subsys cpu
When I first saw these errors I was fairly lost. "Hardware Error" made me think the machine was beyond saving.
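To make records like these easier to compare across crashes, the per-field kernel lines can be grouped into one dictionary per error. The following is only a sketch under assumptions: the function name `parse_apei_records` and the regular expression are my own, and the pattern is based solely on the line layout shown in the log excerpt, so other kernels or syslog formats may need adjustments.

```python
import re
from typing import Dict, Iterable, List

# Matches kernel lines shaped like the ones in the excerpt, e.g.
#   Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: node: 15424
# Lines without the "{N}" prefix (such as "Machine check events logged")
# are deliberately not matched.
HW_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\{\d+\}\[Hardware Error\]:\s*(?P<body>.*)$"
)

# The first line of every record in the excerpt starts with this text.
RECORD_START = "Hardware error from APEI"


def parse_apei_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group '{N}[Hardware Error]' lines into one dict per error record."""
    records: List[Dict[str, str]] = []
    for line in lines:
        match = HW_LINE.match(line.strip())
        if not match:
            continue
        body = match.group("body").strip()
        if body.startswith(RECORD_START):
            # A new record begins; remember when it was logged.
            records.append({"timestamp": match.group("ts")})
            continue
        if not records:
            continue  # field line with no record header before it
        # Split "key: value" on the first colon; lines without a colon
        # (e.g. "primary") carry no key/value pair and are skipped.
        key, sep, value = body.partition(":")
        if sep:
            records[-1][key.strip()] = value.strip()
    return records
```

Run against the excerpt, each record would carry fields such as `section_type`, `node`, `device` and `error_type`, which makes it easy to see that both crashes report the same `device` value but different `node` values.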
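Solution
Searching Bing for the key phrase "Hardware error from APEI Generic Hardware Error Source: 1" turned up one article that matched reasonably well. Roughly, the problem comes from an issue between the system and the ECC memory.
I then did 2 things:
- 1. Pulled out the memory modules, cleaned off the dust, and reinserted them in different slots [the problem was still there after a reboot]
- 2. Upgraded the kernel (from 2.6.32-431.el6.x86_64 to )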
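After a kernel upgrade it is worth confirming that the machine actually booted into the new kernel rather than the old one. The sketch below compares release strings numerically. The helper names `release_key` and `is_newer` are my own, and since the target version is not given above, the release strings in the usage comment are purely hypothetical examples.

```python
import platform
import re
from typing import Tuple

# The kernel release the server ran before the upgrade (from the post).
OLD_RELEASE = "2.6.32-431.el6.x86_64"


def release_key(release: str) -> Tuple[int, ...]:
    """Turn '2.6.32-431.el6.x86_64' into (2, 6, 32, 431).

    Only the leading numeric components are kept; parsing stops at the
    first non-numeric part such as 'el6'.
    """
    key = []
    for part in re.split(r"[.-]", release):
        if not part.isdigit():
            break
        key.append(int(part))
    return tuple(key)


def is_newer(running: str, old: str = OLD_RELEASE) -> bool:
    """Return True if `running` sorts after `old` numerically."""
    return release_key(running) > release_key(old)


# Usage on the server itself (platform.release() returns the same
# string as `uname -r`):
#     print(platform.release(), is_newer(platform.release()))
# Hypothetical examples, not the version used in the post:
#     is_newer("2.6.32-504.el6.x86_64")   -> True
#     is_newer("2.6.32-431.el6.x86_64")   -> False
```

Comparing tuples of integers avoids the trap of plain string comparison, where "2.6.32-1000" would sort before "2.6.32-431".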
The server has now been running for more than a week with no crashes so far, and /var/log/messages shows no errors at all.
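A quick way to back up a claim like "no errors in the log" is to count the hardware-error lines per day and check that nothing shows up after the change. This is a minimal sketch, assuming the classic syslog layout seen in the excerpt (month and day as the first two fields); the function name `hardware_errors_per_day` is my own.

```python
from collections import Counter
from typing import Iterable

# Every hardware-error line in the excerpt contains this marker.
MARKER = "[Hardware Error]"


def hardware_errors_per_day(lines: Iterable[str]) -> Counter:
    """Count lines containing '[Hardware Error]', keyed by 'Mon D'."""
    counts: Counter = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            # Syslog lines start with month and day, e.g. "Dec 4".
            counts[" ".join(fields[:2])] += 1
    return counts


# Usage on the server (reading the log usually needs root):
#     with open("/var/log/messages", errors="replace") as log:
#         print(hardware_errors_per_day(log))
# An empty Counter means no hardware-error lines were found.
```

Because classic syslog timestamps carry no year, the keys are only unambiguous within a single log file that has not wrapped around a full year.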
Afterthoughts
The problem on this server may be related to the several sudden power outages it went through earlier.