Oracle 11g控制文件损坏问题分析

对于oracle 11g版本以下数据库当控制文件损坏后，我们在mount数据库时，会有很明显的ora-600错误，这样就很容易知道控制文件损坏
对于oracle 11g版本以下数据库当控制文件损坏后，我们在mount数据库时，会有很明显的ora-600错误，这样就很容易知道控制文件损坏的错误，但是对于oracle 11g r2就不是很明显了，
当时是一个oracle 11g 的rac系统，出现问题时数据库实例可以nomount打开但是在mount控制文件时就会出现如下告警：
ora-3113 end of file on communication channel
然后整个sqlplus连接终止,需要重新连接，当然我们知道通常mount阶段无法进行，问题就出在控制文件本身的存在损坏的问题，但是对于专业的人员来说，如果仅仅满足这样的心态，显然是不行的，所以需要对其进行进一步分析：
但是在asm日志中我们可以看到如下信息：
tue mar 27 13:35:11 2012
note: client prod1:prod registered, osid 6726, mbr 0x1
tue mar 27 13:35:24 2012
note: asm client prod1:prod disconnected unexpectedly.
note: check client alert log.
note: trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+asm1/trace/+asm1_ora_6726.trc
tue mar 27 13:40:35 2012
note: client prod1:prod registered, osid 7477, mbr 0x1
tue mar 27 13:41:45 2012
note: asm client prod1:prod disconnected unexpectedly.
note: check client alert log.
note: trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+asm1/trace/+asm1_ora_7477.trc
tue mar 27 13:41:47 2012
note: client prod1:prod registered, osid 7736, mbr 0x1
tue mar 27 13:42:01 2012
note: asm client prod1:prod disconnected unexpectedly.
note: check client alert log.
note: trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+asm1/trace/+asm1_ora_7736.tr
对于生成的trace文件我们仅能够看到如下些信息：
2012-03-27 13:41:08.022438 :802eefe8:kfns:kfn.c@702:kfndispatch(): calling server stub for kfnop=5
2012-03-27 13:41:13.027006 :802ef0f4:kfnu:kfns.c@1924:kfnsbackground(): kfnsbackground completed in 5 seconds (kfnpm=0)
2012-03-27 13:41:13.027012 :802ef0f5:kfns:kfn.c@729:kfndispatch(): completed kfnop=5
2012-03-27 13:41:13.027122 :802ef0f6:kfns:kfn.c@702:kfndispatch(): calling server stub for kfnop=5
对于此问题显然没什么用处，并且问题应该还是在数据库方面。
所以对数据库实例的alert告警检查，当执行alter database mount状态时的日志如下：
tue mar 27 11:42:01 2012
alter database mount
this instance was first to mount
tue mar 27 11:42:01 2012
note: loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
note: loaded library: system
tue mar 27 11:42:01 2012
success: diskgroup proddata was mounted
tue mar 27 11:42:01 2012
note: dependency between database prod and diskgroup resource ora.proddata.dg is established
user (ospid: 26774): terminating the instance
tue mar 27 11:42:07 2012
system state dump requested by (instance=1, osid=26774), summary=[abnormal instance termination].
system state dumped to trace file /d01/oracle/11.2.0/admin/prod1_db01/diag/rdbms/prod/prod1/trace/prod1_diag_26656.trc
dumping diagnostic data in directory=[cdmp_20120327114207], requested by (instance=1, osid=26774), summary=[abnormal instance termination].
instance terminated by user, pid = 26774
还是不明显的日志提示，检查告警trace文件：/d01/oracle/11.2.0/admin/prod1_db01/diag/rdbms/prod/prod1/trace/prod1_diag_26656.trc也无明细的信息
后来采用10046事件来跟踪mount这个过程，才看到了比较明细的提示，
alter session set events='10046 trace name context forever,level 12';
trace file /d01/oracle/11.2.0/admin/prod1_db01/diag/rdbms/prod/prod1/trace/prod1_ora_7764.trc
oracle database 11g enterprise edition release 11.2.0.2.0 - 64bit production
with the partitioning, real application clusters, automatic storage management, olap,
data mining and real application testing options
oracle_home = /d01/oracle/11.2.0
system name:    linux
node name:      db01.clc.com
release:        2.6.18-238.el5
version:        #1 smp sun dec 19 14:22:44 est 2010
machine:        x86_64
instance name: prod1
redo thread mounted by this instance: 0
oracle process number: 31
unix process pid: 7764, image: oracle@db01.clc.com (tns v1-v3)
*** 2012-03-27 13:41:55.101
*** session id:(1751.3) 2012-03-27 13:41:55.101
*** client id:() 2012-03-27 13:41:55.101
*** service name:() 2012-03-27 13:41:55.101
*** module name:(oraagent.bin@db01.clc.com (tns v1-v3)) 2012-03-27 13:41:55.101
*** action name:() 2012-03-27 13:41:55.101
error: kccpb_sanity_check_2
control file sequence number mismatch!
fhcsq: 312916 bhcsq: 313137 cfn 0
*** 2012-03-27 13:41:55.101
submitting synchronized dump request [268435460]. summary=[controlfile header dump (kccpbsc)].
*** 2012-03-27 13:41:57.102
kjzduptcctx: notifying diag for crash event
----- abridged call stack trace -----
ksedsts()+461----- end of abridged call stack trace -----
*** 2012-03-27 13:41:57.141
user (ospid: 7764): terminating the instance
ksuitm: waiting up to [5] seconds before killing diag(7652)
如上红色字段可以看到，是控制文件中序列号不匹配造成控制文件一致性验证损坏，而无法正常mount数据库。
这样问题就明了了，，可以修改或重建控制文件方式来打开数据库。
更多oracle相关信息见oracle 专题页面 ?tid=12

Oracle 11g控制文件损坏问题分析

推荐信息