【Oracle】RAC11gR2 Grid启动顺序及启动故障诊断思路

从11gr2开始，oracle rac的架构有了比较大的变化，集群层面相交于之前的版本有了比较大的变动，原来的rac架构基本上属于cssd、crsd、evmd三大光秃秃的主干进程，日志数量较少，对于rac无法启动原因，采用最原始的方法逐一查看各个进程的日志也可找到无法启动
从11gr2开始，oracle rac的架构有了比较大的变化，集群层面相交于之前的版本有了比较大的变动，原来的rac架构基本上属于cssd、crsd、evmd三大光秃秃的主干进程，日志数量较少，对于rac无法启动原因，采用最原始的方法逐一查看各个进程的日志也可找到无法启动的原因。然而从11gr2之后，集群层发生了比较大的变动，以下是$grid_home/log/rac1/下的目录情况：
[grid@rac1 rac1]$ ls
acfs     acfsrepl      acfssec agent          client   crfmond cssd   cvu      evmd   gnsd   mdnsd racg
acfslog acfsreplroot admin    alertrac1.log crflogd crsd     ctssd diskmon gipcd gpnpd ohasd srvm
可以看到在这个目录中的文件夹非常多，在rac无法启动的情况下，如果去所有日志下查看无法启动的原因无疑效率极低。所以我们需要一个比较明确的诊断思路。
ok，接下来进入正题，希望可以为大家的日常诊断提供帮助。
第一步，在诊断grid无法启动的情况之前我们需要先了解11gr2中grid的启动流程，下面这张图比较清晰的说明了现在grid的启动顺序：
从图中我们可以看到，相比的原来oracle 10g的集群架构，11gr2有了比较大的改动。具体的进程作用在这里不再赘述，不了解的可以自己去恶补一下，这里只说进程启动顺序相关的内容。在启动集群的过程中首先启动的是ohasd进程，在ohasd进程启动之后会启动4个agent：
1.cssd agent
以root用户权限启动，负责启动cssd进程。
2.orarootagent
以root用户权限启动，负责启动以下这些守护进程：crsd进程、ctssd进程、diskmon进程、acfs进程。这些进程也都是以root用户权限启动。
3.oraagent
以grid用户权限启动，负责mdnsd进程、gipcd进程、gpnpd进程、evmd进程、asm进程（11gr2之后的asm在集群中被放置到了更底层，和之前版本区别较大）。
4.cssdmonitor。
以root用户权限启动，负责cssdmonitor进程的启动。
从图中我们可以看到之后又由crsd进程负责启动了两个agent：orarootagent和oraagent（最后的进程中我们可以看到两个oraagent进程，就是之前启动的那个加上这个），之后再由orarootagent和oraagent去负责启动之后的用户资源，进程启动到这里我认为grid底层启动完毕，之后再由orarootagent和oraagent启动的资源出现的问题不再本文的讨论范围内了。
第二步，我们已经对grid的进程启动顺序进行了梳理，之后对于grid无法启动的诊断也就变得简单。我们只要通过ps -ef|grep /oracle/app/grid/product/11.2.0（$grid_home）就可以了解到grid已经启动到哪一步，哪些进程已经启动，哪些进程还未启动，卡在了哪个进程上，这样我们就能快速找到应该查看的日志。比如crsd进程没有启动，我们就可以通过查看$grid_home/log/rac1/crsd目录下的crsd.log来进行查看，究竟在crsd进程启动过程中遭遇了哪些错误导致进程无法正常启动。
举例：
[grid@rac1 crsd]$ ps -ef|grep /oracle
root     15235     1 0 14:12 ?        00:00:06 /oracle/app/grid/product/11.2.0/bin/ohasd.bin reboot
grid     15356     1 0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/oraagent.bin
grid     15367     1 0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/mdnsd.bin
grid     15378     1 0 14:12 ?        00:00:02 /oracle/app/grid/product/11.2.0/bin/gpnpd.bin
grid     15388     1 2 14:12 ?        00:00:19 /oracle/app/grid/product/11.2.0/bin/gipcd.bin
root     15390     1 0 14:12 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/orarootagent.bin
root     15403     1 0 14:12 ?        00:00:08 /oracle/app/grid/product/11.2.0/bin/osysmond.bin
root     15477     1 0 14:12 ?        00:00:02 /oracle/app/grid/product/11.2.0/bin/ologgerd -m -d /oracle/app/grid/product/11.2.0/crf/db/rac1
root     15637     1 0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/cssdmonitor
root     15665     1 0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/cssdagent
grid     15676     1 0 14:22 ?        00:00:00 /oracle/app/grid/product/11.2.0/bin/ocssd.bin
grid     15730 13826 0 14:27 pts/1    00:00:00 grep /oracle
从以上的输出我们就可以看到，此时grid无法启动的原因在于cssd进程无法启动，所以我们直接查看ocssd.log，查看无法启动的原因，在日志中找到以下内容：
2016-05-09 14:30:26.476: [    cssd][1104030016]clssnmvdhbvalidatencopy: node 2, rac2,has a disk hb, but no network hb, dhb has rcfg 358258450, wrtcnt, 177436, lats 10923264, lastseqno 177435, uniqueness 1462763679, timestamp 1462775426/10874194
可以看到是因为私网出现了问题，导出有disk hb，而没有network hb，修复私网问题后，集群可以正常启动。
第三步，附送一篇mos文章：id 1623340.1，里边罗列了grid各个进程无法启动的常见原因以及对应的日志：
1.1.1. 集群状态
查询集群和守护进程的状态:
$grid_home/bin/crsctl check crs
crs-4638: oracle high availability services is online
crs-4537: cluster ready services is online
crs-4529: cluster synchronization services is online
crs-4533: event manager is online
$grid_home/bin/crsctl stat res -t -init
--------------------------------------------------------------------------------
name           target state        server                   state_details
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.asm
      1        online online       rac1                  started
ora.crsd
      1        online online       rac1
ora.cssd
      1        online online       rac1
ora.cssdmonitor
      1        online online       rac1
ora.ctssd
      1        online online       rac1                  observer
ora.diskmon
      1        online online       rac1
ora.drivers.acfs
      1        online online       rac1
ora.evmd
      1        online online       rac1
ora.gipcd
      1        online online       rac1
ora.gpnpd
      1        online online       rac1
ora.mdnsd
      1        online online       rac1
对于11.2.0.2和以上的版本，会有以下两个额外的进程：
ora.cluster_interconnect.haip
      1        online online       rac1
ora.crf
      1        online online       rac1
对于11.2.0.3以上的非exadata的系统，ora.diskmon会处于offline的状态，如下：
ora.diskmon
      1        offline offline rac1
对于 12c 以上的版本,会出现ora.storage资源：
ora.storage
1 online online racnode1 stable
如果守护进程 offline我们可以通过以下命令启动:
$grid_home/bin/crsctl start res ora.crsd -init
1.1.2. 问题1: ohasd 无法启动
由于 ohasd.bin的责任是直接或者间接的启动集群所有的其它进程，所以只有这个进程正常启动了，其它的进程才能起来，如果ohasd.bin 的进程没有起来，当我们检查资源状态的时候会报错crs-4639 (could not contact oracle high availability services);如果 ohasd.bin已经启动了，而再次尝试启时，错误 crs-4640会出现；如果它启动失败了，那么我们会看到以下的错误信息：
crs-4124: oracle high availability services startup failed.
crs-4000: command start failed, or completed with errors.
自动启动 ohasd.bin依赖于以下的配置：
1. 操作系统配置了正确的run level:
os 需要在 crs 启动之前设置成指定的run level 来确保crs 的正常启动。
我们可以通过以下方式找到 crs 需要 os设置的 run level：
cat /etc/inittab|grep init.ohasd
h1:35:respawn:/etc/init.d/init.ohasd run >/dev/null 2>&1 /null
以上例子展示了，crs需要 os 运行在 run level 3或 5；请注意，由于操作系统的不同，crs启动需要的 os的 run level也会不同。
找到当前 os 正在运行的 run level:
who -r
2. init.ohasd run启动
在 linux/unix平台上，由于init.ohasd run是配置在 /etc/inittab中，进程init（进程id 1，linux，solars和hp-ux上为/sbin/init，aix上为/usr/sbin/init）会启动并且产生init.ohasd run进程，如果这个过程失败了，就不会有init.ohasd run的启动和运行，ohasd.bin也是无法启动的：
ps -ef|grep init.ohasd|grep -v grep
root      2279     1 0 18:14 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run
注意：oracle linux (ol6)以及red hat linux 6 (rhel6) 已经不再支持inittab 了，所以init.ohasd 会被配置在/etc/init 中，并被/etc/init 启动，尽管如此，我们还是应该能看到进程/etc/init.d/init.ohasd run 被启动；
如果任何 rc snncommand 的脚本（在rcn.d 中，如s98gcstartup）在启动的过程中挂死，此时init 的进程可能无法启动/etc/init.d/init.ohasd run；您需要寻求os 厂商的帮助，找到为什么snncommand 脚本挂死或者无法正常启动的原因；
错误[ohasd()] crs-0715:oracle high availability service has timed out waiting for init.ohasd to be started.可能会在 init.ohasd无法在指定时间内启动后出现
如果系统管理员无法在短期内找到 init.ohasd 无法启动的原因，以下办法可以作为一个临时的解决办法：
cd
nohup ./init.ohasd run &
3. clusterware自动启动；--自动启动默认是开启的
默认情况下 crs自动启动是开启的，我们可以通过以下方式开启：
$grid_home/bin/crsctl enable crs
检查这个功能是否被开启:
$grid_home/bin/crsctl config crs
如果以下信息被输出在os的日志中
feb 29 16:20:36 racnode1 logger: oracle cluster ready services startup disabled.
feb 29 16:20:36 racnode1 logger: could not access /var/opt/oracle/scls_scr/racnode1/root/ohasdstr
原因是由于这个文件不存在或者不可访问，产生这个问题的原因一般是人为的修改或者是打 gi 补丁的过程中使用了错误的 opatch (如：使用solaris 平台上的opatch 在linux 上打补丁)
4. syslogd启动并且 os 能够执行 init 脚本 s96ohasd
节点启动之后，os可能停滞在一些其它的 snn的脚本上，所以可能没有机会执行到脚本 s96ohasd；如果是这种情况，我们不会在os 日志中看到以下信息
（aix /var/adm/syslog    linux /var/log/messages）
jan 20 20:46:51 rac1 logger: oracle ha daemon is enabled for autostart.
如果在 os 日志里看不到上面的信息，还有一种可能是 syslogd（(/usr/sbin/syslogd）没有被完全启动。grid在这种情况下也是无法正常启动的，这种情况不适用于 aix的平台。
为了了解 os 启动之后是否能够执行 s96ohasd脚本，可以按照以下的方法修改该脚本：
from:
    case `$cat $autostartfile` in
      enable*)
        $logerr oracle ha daemon is enabled for autostart.
to:
    case `$cat $autostartfile` in
      enable*)
        /bin/touch /tmp/ohasd.start.`date`
        $logerr oracle ha daemon is enabled for autostart.
重启节点后，如果您没有看到文件 /tmp/ohasd.start.timestamp 被创建，那么就是说 os停滞在其它的 snn的脚本上。如果您能看到 /tmp/ohasd.start.timestamp生成了，但是oracle ha daemon is enabled for autostart没有写入到messages文件里，就是 syslogd没有被完全启动了。以上的两种情况，您都需要寻求系统管理员的帮助，从 os的层面找到问题的原因，对于后一种情况，有个临时的解决办法是“休眠”2分钟，按照以下的方法修改ohasd 脚本：
from:
    case `$cat $autostartfile` in
      enable*)
        $logerr oracle ha daemon is enabled for autostart.
to:
    case `$cat $autostartfile` in
      enable*)
        /bin/sleep 120
        $logerr oracle ha daemon is enabled for autostart.
5. grid_home所在的文件系统在执行初始化脚本 s96ohasd的时候在线；正常情况下一旦 s96ohasd执行结束，我们会在 os message里看到以下信息：
jan 20 20:46:51 rac1 logger: oracle ha daemon is enabled for autostart.
..
jan 20 20:46:57 rac1 logger: exec /ocw/grid/perl/bin/perl -i/ocw/grid/perl/lib /ocw/grid/bin/crswrapexece.pl /ocw/grid/crs/install/s_crsconfig_rac1_env.txt /ocw/grid/bin/ohasd.bin reboot
如果您只看到了第一行，没有看到最后一行的信息，很可能是 grid_home 所在的文件系统在脚本 s96ohasd执行的时候还没有正常挂载。
6. oracle local registry (olr, $grid_home/cdata/${hostname}.olr)有效并可以正常读写
ls -l $grid_home/cdata/*.olr
-rw------- 1 root oinstall 272756736 feb 2 18:20 rac1.olr
如果 olr 是不可读写的或者损坏的，我们会在 ohasd.log中看到以下的相关信息
..
2010-01-24 22:59:10.470: [ default][1373676464] initializing olr
2010-01-24 22:59:10.472: [ ocrosd][1373676464]utopen:6m':failed in stat ocr file/disk /ocw/grid/cdata/rac1.olr, errno=2, os err string=no such file or directory
2010-01-24 22:59:10.472: [ ocrosd][1373676464]utopen:7:failed to open any ocr file/disk, errno=2, os err string=no such file or directory
2010-01-24 22:59:10.473: [ ocrraw][1373676464]proprinit: could not open raw device
2010-01-24 22:59:10.473: [ ocrapi][1373676464]a_init:16!: backend init unsuccessful : [26]
2010-01-24 22:59:10.473: [ crsocr][1373676464] ocr context init failure. error: procl-26: error while accessing the physical storage operating system error [no such file or directory] [2]
2010-01-24 22:59:10.473: [ default][1373676464] olr initalization failured, rc=26
2010-01-24 22:59:10.474: [ default][1373676464]created alert : (:ohas00106:) : failed to initialize oracle local registry
2010-01-24 22:59:10.474: [ default][1373676464][panic] ohasd exiting; could not init olr
或者
..
2010-01-24 23:01:46.275: [ ocrosd][1228334000]utread:3: problem reading buffer 1907f000 buflen 4096 retval 0 phy_offset 102400 retry 5
2010-01-24 23:01:46.275: [ ocrraw][1228334000]propriogid:1_1: failed to read the whole bootblock. assumes invalid format.
2010-01-24 23:01:46.275: [ ocrraw][1228334000]proprioini: all disks are not ocr/olr formatted
2010-01-24 23:01:46.275: [ ocrraw][1228334000]proprinit: could not open raw device
2010-01-24 23:01:46.275: [ ocrapi][1228334000]a_init:16!: backend init unsuccessful : [26]
2010-01-24 23:01:46.276: [ crsocr][1228334000] ocr context init failure. error: procl-26: error while accessing the physical storage
2010-01-24 23:01:46.276: [ default][1228334000] olr initalization failured, rc=26
2010-01-24 23:01:46.276: [ default][1228334000]created alert : (:ohas00106:) : failed to initialize oracle local registry
2010-01-24 23:01:46.277: [ default][1228334000][panic] ohasd exiting; could not init olr
或者
..
2010-11-07 03:00:08.932: [ default][1] created alert : (:ohas00102:) : ohasd is not running as privileged user
2010-11-07 03:00:08.932: [ default][1][panic] ohasd exiting: must be run as privileged user
或者
ohasd.bin comes up but output of crsctl stat res -t -initshows no resource, and ocrconfig -local -manualbackup fails
或者
..
2010-08-04 13:13:11.102: [   crspe][35] resources parsed
2010-08-04 13:13:11.103: [   crspe][35] server [] has been registered with the pe data model
2010-08-04 13:13:11.103: [   crspe][35] startupcmd_req = false:
2010-08-04 13:13:11.103: [   crspe][35] server [] has changed state from [invalid/unitialized] to [visible]
2010-08-04 13:13:11.103: [ crsocr][31] multi write batch processing...
2010-08-04 13:13:11.103: [ default][35] dump state starting ...
..
2010-08-04 13:13:11.112: [   crspe][35] servers:
:visible:address{{absolute|node:0|process:-1|type:1}}; recovered state:visible. assigned to no pool
------------- server pools:
free [min:0][max:-1][importance:0] no servers assigned
2010-08-04 13:13:11.113: [   crspe][35] dumping ice contents...:ice operation count: 0
2010-08-04 13:13:11.113: [ default][35] dump state done.
解决办法就是使用下面的命令，恢复一个好的备份 ocrconfig -local -restore 。
默认情况下，olr在系统安装结束后会自动的备份在 $grid_home/cdata/$host/backup_$time_stamp.olr。
7. ohasd.bin可以正常的访问到网络的socket 文件:
2010-06-29 10:31:01.570: [ commcrs][1206901056]clsclisten: permission denied for (address=(protocol=ipc)(key=procr_local_conn_0_prol))
2010-06-29 10:31:01.571: [ ocrsrv][1217390912]th_listen: clsclisten failed clsc_ret= 3, addr= [(address=(protocol=ipc)(key=procr_local_conn_0_prol))]
2010-06-29 10:31:01.571: [ ocrsrv][3267002960]th_init: local listener did not reach valid state
在 grid infrastructure环境中，和 ohasd有关的 socket文件属主应该是 root用户，但是在 oracle restart的环境中，他们应该是属于 grid用户的，关于更多的关于网络 socket文件权限和属主，请参考章节网络socket 文件，属主和权限给出的例子.
8. ohasd.bin能够访问日志文件的位置:
os messages/syslog 显示以下信息:
feb 20 10:47:08 racnode1 ohasd[9566]: ohasd exiting; directory /ocw/grid/log/racnode1/ohasd not found.
请参考章节日志位置,属主和权限部分的例子，并确定这些必要的目录是否有丢失的，并且是按照正确的权限和属主创建的。
9. 节点启动后，在suse linux 的系统上，ohasd可能无法启动，此问题请参考 note 1325718.1 - ohasd not starting after reboot on sles
10. ohasd无法启动，使用 ps -ef| grep ohasd.bin显示 ohasd.bin的进程已经启动，但是 $grid_home/log//ohasd/ohasd.log在好几分钟之后都没有任何信息更新,使用os 的 truss 工具可以看到该进程一致在循环的执行关闭从未被打开的文件句柄的操作:
..
15058/1:         0.1995 close(2147483646)                               err#9 ebadf
15058/1:         0.1996 close(2147483645)                               err#9 ebadf
..
通过 ohasd.bin的 call stack，可以看到以下信息:
_close sclssutl_closefiledescriptors main ..
这是由于 bug 11834289 导致的，该问题在 11.2.0.3 和之上的版本已经被修复,该bug 的其它症状还有：集群的进程无法启动，而且做call stack 和truss 查看的时候也会看到相同的情况(循环的执行os 函数 close) . 如果该bug 发生在启动其它的资源时，我们会看到错误信息：crs-5802: unable to start the agent process提示。
11. 其它的一些潜在的原因和解决办法请参见 note 1069182.1 - ohasd failed to start: inappropriate ioctl for device
12. ohasd.bin正常启动，但是, crsctl check crs只显示以下一行信息:
crs-4638: oracle high availability services is online
并且命令 crsctl stat res -p -init无法显示任何信息
这个问题是由于 olr 损坏导致的，请参考 note 1193643.1 进行恢复。
13. 如果ohasd 仍然无法启动，请参见ohasd 的日志/log//ohasd/ohasd.log和 ohasdout.log来获取更多的信息；
1.1.3. 问题2: ohasd agents 未启动
ohasd.bin 会启动4 个 agents/monitors 来启动其它的资源:
oraagent:负责启动 ora.asm, ora.evmd, ora.gipcd, ora.gpnpd, ora.mdnsd等
  orarootagent:负责启动 ora.crsd, ora.ctssd, ora.diskmon, ora.drivers.acfs等
  cssdagent / cssdmonitor:负责启动 ora.cssd(对应ocssd.bin) 和ora.cssdmonitor(对应cssdmonitor)
如果 ohasd.bin不能正常地启动以上任何一个 agents，集群都无法运行在正常的状态。
1. 通常情况下，agents无法启动的原因是 agent的日志或者日志所在的目录没有正确设置属主和权限。
关于日志文件和文件夹的权限和属主设置，请参见章节日志文件位置,属主和权限中的介绍。
2. 如果 agent 的二进制文件（oraagent.bin或者 orarootagent.bin等）损坏, agent也将无法启动，从而导致相关的资源也无法启动：
2011-05-03 11:11:13.189
[ohasd(25303)]crs-5828:could not start agent '/ocw/grid/bin/orarootagent_grid'. details at (:crsagf00130:) {0:0:2} in /ocw/grid/log/racnode1/ohasd/ohasd.log.
2011-05-03 12:03:17.491: [    agfw][1117866336] {0:0:184} created alert : (:crsagf00130:) :  failed to start the agent /ocw/grid/bin/orarootagent_grid
2011-05-03 12:03:17.491: [    agfw][1117866336] {0:0:184} agfw proxy server sending the last reply to pe for message:resource_start[ora.diskmon 1 1] id 4098:403
2011-05-03 12:03:17.491: [    agfw][1117866336] {0:0:184} can not stop the agent: /ocw/grid/bin/orarootagent_grid because pid is not initialized
..
2011-05-03 12:03:17.492: [   crspe][1128372576] {0:0:184} fatal error from agfw proxy: unable to start the agent process
2011-05-03 12:03:17.492: [   crspe][1128372576] {0:0:184} crs-2674: start of 'ora.diskmon' on 'racnode1' failed
..
2011-06-27 22:34:57.805: [    agfw][1131669824] {0:0:2} created alert : (:crsagf00123:) :  failed to start the agent process: /ocw/grid/bin/cssdagent category: -1 operation: fail loc: canexec2 os error: 0 other : no exe permission, file [/ocw/grid/bin/cssdagent]
2011-06-27 22:34:57.805: [    agfw][1131669824] {0:0:2} created alert : (:crsagf00126:) : agent start failed
..
2011-06-27 22:34:57.806: [    agfw][1131669824] {0:0:2} created alert : (:crsagf00123:) : failed to start the agent process: /ocw/grid/bin/cssdmonitor category: -1 operation: fail loc: canexec2 os error: 0 other : no exe permission, file [/ocw/grid/bin/cssdmonitor]
解决办法：您可以和正常节点上的 agent 文件进行比较，并且恢复一个好的副本回来。
1.1.4. 问题3: ocssd.bin 无法启动

【Oracle】RAC11gR2 Grid启动顺序及启动故障诊断思路

推荐信息