Database Hang Caused by Improper Memory Parameter Settings

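Symptoms:
A two-node RAC database suddenly hung and returned to normal after one instance was restarted.
Analysis:
The failure window was roughly 8:30-10:00. The alert log showed the following errors:
alert_crm2.log: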
mon may 27 06:54:26 2013
success:>mon may 27 07:32:24 2013
thread 2>ora-07445: exception encountered: core dump [kksmapcursor()+323] [sigsegv] [addr:0x8] [pc:0x763597b] [address not mapped to object] []
ora-03135: connection lost contact
mon may 27 09:54:56 2013
errors>ora-07445: exception encountered: core dump [kksmapcursor()+323] [sigsegv] [addr:0x8] [pc:0x763597b] [address not mapped to object] []
ora-03135: connection lost contact
mon may 27 09:54:56 2013
errors>ora-07445: exception encountered: core dump [kksmapcursor()+323] [sigsegv] [addr:0x8] [pc:0x763597b] [address not mapped to object] []
ora-03135: connection lost contact
mon may 27 09:54:56 2013
errors>ora-07445: exception encountered: core dump [kksmapcursor()+323] [sigsegv] [addr:0x8] [pc:0x763597b] [address not mapped to object] []
ora-03135: connection lost contact
incident>user (ospid: 15258): terminating the instance
mon may 27 09:55:05 2013
ora-1092 :>ora-00600: internal error code, arguments: [723], [109464], [127072], [memory leak], [], [], [], [] incident>loadavg : 69.72 40.04 27.44
memory>swap info: free = 0.00m alloc = 0.00m total = 0.00m
f s uid pid ppid c pri ni addr sz wchan stime tty time cmd
0 s>#0 0x0000003e2d6d50e7 in semop () from /lib64/libc.so.6
#1 0x000000000778a4f6>#7 0x0000000003b87b4a in kjdrchkdrm ()
#8 0x0000000003a38c5a>
snap id snap time sessions curs/sess
--------- ------------------- -------- ---------
begin snap: 6261 27-may-13 09:00:40 404 7.5
end snap: 6262 27-may-13 10:00:34 488 5.3
elapsed: 59.90 (mins)
db time: 10,417.13 (mins)
top 5 timed foreground events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
avg
wait % db
event waits time(s) (ms)>gc current block 2-way 411,847 673 2 3.8 cluster
gc>
*** 2013-05-27 07:26:41.101
trace>(session) sid: 1645 ser: 1 trans: (nil), creator: 0x590fc76e0
flags: (0x51) usr/->dumping session wait history:
0:> wait_id=348204630 seq_num=17176 snap_id=1
wait> wait times (usecs) - max=infinite
wait> occurred after 228 microseconds of elapsed time
1:>
crmm01_130527_0800.nmon.xlsx:
cpu total crmm01 user% sys% wait% idle% cpu% cpus
9:38:00 2.4 0.8 6.2 90.7 3.2 24
9:39:31 1.3 1 5.9 91.9 2.3 24
9:41:01 16 5 7.6 71.4 21 24
9:42:33 91.3 7.9 0.2 0.6 99.2 24
time pid %cpu %usr %sys size resset restext resdata shdlib minorfault majorfault command
8:07:34 773 0.76 0 0.76 0 0 0 0 0 0 0>8:07:34 774 36.91 0 36.91 0 0 0 0 0 0 0 kswapd1 1.54 0
memory mb crmm01>
paging>node1: swap increased after 8:00.
crmm01_130527_0800.nmon.xlsx:
paging>8:03:03 589 10.81 kswapd0
8:06:03 589 1.68>
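From the logs above, the customer's database showed several problems during the failure window:
1. Memory was tight (a. lmd0 was performing memory-release operations; b. free>
2. Free swap pages were scarce, with heavy page in/out activity
3. Severe shared>
4. lmd0 on instance 1 hung (stalled) from 9:42 to 9:44,
Before I had fully sorted this out, the customer's database hung again. This time the customer took a systemstate>hang analysis: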
instances (db_name.oracle_sid):>
chains> chain 1 signature hash: 0xb52ba8a9
[b] chain 2 signature: 'latch:> chain 2 signature hash: 0x985d217a
[c] chain 3 signature: 'latch:> chain 3 signature hash: 0xb52ba8a9
chain 1:
-------------------------------------------------------------------------------
oracle> p2: 'number'=0x101
p3: 'tries'=0x0
time> short stack: wait> time waited: 4.944027 secs p2: 'number'=0x101
p3: 'tries'=0x0
2.> time waited: 0.104395 secs p2: 'number'=0x101
p3: 'tries'=0x0
3.> time waited: 0.079024 secs p2: 'number'=0x101
p3: 'tries'=0x0
}
and> {
instance: 1 (crm.crm1)
os> p2: 'number'=0x115
p3: 'tries'=0x0
time> current sql:
short> time waited: 5.627769 secs p2: 'number'=0x101
p3: 'tries'=0x0
2.> time waited: 0.465190 secs p2: 'number'=0x101
p3: 'tries'=0x0
3.> time waited: 0.082002 secs p2: 'number'=0x101
p3: 'tries'=0x0
}
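The dump shows that this occurrence resembled the previous one: a large number of latch: shared pool waits.
The customer's database configuration:
Physical memory is 24 GB, while memory_target is set to 22 GB, which looks very unreasonable. The situation closely resembles a case I handled earlier (ORA-609: a crash suspected to be caused by an oversized memory_target, http://blog.csdn.net/zhou1862324/article/details/17288103). In both cases an oversized memory_target led to swap page in/out activity, which eventually caused the database to hang or crash.
The earlier case involved a very important production database at another customer. The database crashed repeatedly, to the customer's great distress. After the customer accepted my recommendation and lowered memory_target to a reasonable value, the problem did not recur.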
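The sizing concern can be made concrete with a small calculation. The sketch below is only an illustration: the function names and the 25% OS reserve threshold are assumptions introduced here, not an Oracle rule or a value from the case. It shows how little headroom a 22 GB memory_target leaves on a 24 GB host.

```python
def memory_headroom(physical_gb: float, memory_target_gb: float) -> dict:
    """Return the memory left outside memory_target and the fraction it claims."""
    if physical_gb <= 0:
        raise ValueError("physical_gb must be positive")
    return {
        "headroom_gb": physical_gb - memory_target_gb,
        "target_fraction": memory_target_gb / physical_gb,
    }


def is_oversized(physical_gb: float, memory_target_gb: float,
                 min_os_reserve_fraction: float = 0.25) -> bool:
    """Flag a memory_target that leaves less than the chosen reserve for the OS.

    min_os_reserve_fraction is a hypothetical threshold picked for illustration;
    the right value depends on the workload and on what else runs on the host.
    """
    info = memory_headroom(physical_gb, memory_target_gb)
    return info["headroom_gb"] < physical_gb * min_os_reserve_fraction


if __name__ == "__main__":
    # Values from this case: 24 GB physical memory, memory_target = 22 GB.
    info = memory_headroom(24, 22)
    print(f"headroom: {info['headroom_gb']} GB, "
          f"memory_target uses {info['target_fraction']:.0%} of physical memory")
    print("oversized:", is_oversized(24, 22))
```

With the values from this case, only 2 GB remains for the OS, the clusterware, and everything else on the host, which is consistent with the swap activity seen in the nmon data.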
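So for this case I gave the customer two recommendations:
1. Reduce memory_target and memory_max_target to leave more memory for the OS and lower the chance of swap page in/out.
2. Enable hugepages. Hugepages are locked in memory and cannot be swapped out. However, hugepages are incompatible with memory_target, so memory_target must be disabled and sga_target and pga_aggregate_target set instead.
For more on hugepages, see an article I reposted, HugePages on Oracle Linux 64-bit (http://blog.csdn.net/zhou1862324/article/details/17540277).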
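For the second recommendation, a rough page count can be estimated from the planned SGA size and the host's huge page size. The sketch below is a minimal illustration under stated assumptions: the 16 GiB SGA is a hypothetical figure, not the customer's setting, and the function names are introduced here. It reads the Hugepagesize line as it appears in /proc/meminfo on Linux and rounds the SGA size up to whole pages.

```python
import re


def parse_hugepage_size_kb(meminfo_text: str) -> int:
    """Extract the huge page size in kB from the text of /proc/meminfo."""
    match = re.search(r"^Hugepagesize:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("Hugepagesize line not found")
    return int(match.group(1))


def hugepages_needed(sga_bytes: int, hugepage_kb: int) -> int:
    """Round the SGA size up to a whole number of huge pages."""
    page_bytes = hugepage_kb * 1024
    return -(-sga_bytes // page_bytes)  # integer ceiling division


if __name__ == "__main__":
    # Sample text in the /proc/meminfo layout; on a real host, read the file instead.
    sample = "MemTotal:       24675324 kB\nHugepagesize:       2048 kB\n"
    size_kb = parse_hugepage_size_kb(sample)
    sga_bytes = 16 * 1024**3  # hypothetical 16 GiB SGA, for illustration only
    print("huge pages needed:", hugepages_needed(sga_bytes, size_kb))
```

This gives a lower bound only. The shared memory segments an instance actually allocates can need somewhat more than the nominal SGA size, so the value eventually set in vm.nr_hugepages should be checked against the running instance.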
Solution:
In the end the customer chose to reduce memory_target and memory_max_target, and the problem has not recurred.
