Oracle RAC Multi-Node Instance Database Crash Recovery

A multi-node database crash in Oracle RAC is a critical situation, but with a methodical approach, recovery is manageable. I recently encountered one and successfully reinstated the Oracle RAC environment. Here’s how I did it:

✅ Steps to Reinstate Oracle RAC Multi-Node Instance After a Crash:
1️⃣ Identify the Cause of the Crash:

Checked alert logs (alert.log) and trace files in $ORACLE_BASE/diag/rdbms///trace/.
Verified ASM logs and clusterware logs (crsd.log and ocssd.log, or crsd.trc and ocssd.trc in releases that write clusterware logs to the ADR).
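When an alert log is large, a quick tally of ORA- error codes helps narrow down where to look first. The following is a minimal sketch, not part of the original procedure: the function name `summarize_ora_errors` is hypothetical, and the log path is whatever your environment uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count each ORA- error code in a log file and print
# "<code> <count>" lines, most frequent first.
# Usage: summarize_ora_errors /path/to/your/alert.log
summarize_ora_errors() {
  local logfile="$1"
  grep -oE 'ORA-[0-9]+' "$logfile" \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $2, $1 }'
}
```

The counts only point to the most frequent errors; the surrounding alert log lines and the trace files they reference still need to be read for the actual cause.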
2️⃣ Check Cluster and Instance Status:

Used crsctl status resource -t to check CRS resource status.
Listed cluster nodes with olsnodes -n (adding -s also shows whether each node is Active or Inactive).
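The status check can be scripted to list only the resources that should be running but are not. This sketch assumes the non-tabular output of `crsctl status resource` (without -t), which prints NAME=, TARGET= and STATE= lines for each resource; the function name `find_offline_resources` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `crsctl status resource` output on stdin and print
# the names of resources whose TARGET includes ONLINE but whose STATE
# includes OFFLINE on at least one node.
# Usage: crsctl status resource | find_offline_resources
find_offline_resources() {
  awk -F= '
    $1 == "NAME"   { name = $2 }
    $1 == "TARGET" { target = $2 }
    $1 == "STATE"  { if (target ~ /ONLINE/ && $2 ~ /OFFLINE/) print name }
  '
}
```

Resources that are intentionally stopped (TARGET=OFFLINE) are skipped, so the output is limited to resources the clusterware expected to be up.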
3️⃣ Restart the Failed Instance(s):

Attempted to start the instance with srvctl start instance -d -i .
If that failed, checked for leftover background processes from the crashed instance (ps -ef | grep pmon) and cleaned them up before retrying.
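To see at a glance which instances still have a PMON process on a node, the ps output can be reduced to instance names. This is a small sketch that assumes the usual process naming of ora_pmon_ followed by the SID for database instances and asm_pmon_ followed by the SID for ASM; the function name `list_pmon_instances` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -ef` output on stdin and print the instance
# names (SIDs) that have a PMON background process, one per line.
# Usage: ps -ef | list_pmon_instances
list_pmon_instances() {
  grep -oE '(ora|asm)_pmon_[^[:space:]]+' \
    | sed -E 's/^(ora|asm)_pmon_//' \
    | LC_ALL=C sort -u
}
```

An instance that srvctl reports as down but that still appears here has leftover processes to investigate before the next start attempt.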
4️⃣ Perform Crash Recovery:

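Confirmed in the alert log that SMON had started automatic instance recovery.
Manually checked redo log groups and members for corruption (v$log, v$logfile) and applied recovery if needed.
Used RECOVER AUTOMATIC DATABASE in SQL*Plus (the SQL form is ALTER DATABASE RECOVER AUTOMATIC DATABASE) when media recovery was required.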
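The redo log check can be run as one query that joins v$log and v$logfile, so the status of each group and the location of its members appear together for every thread. This is a sketch under assumptions: the function name `redo_status_sql` is hypothetical, and the usage line assumes OS authentication as SYSDBA on the database node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a SQL*Plus script that lists every online redo
# log group with its thread, sequence, status and member files.
# Usage: redo_status_sql | sqlplus -s "/ as sysdba"
redo_status_sql() {
  cat <<'SQL'
SET LINESIZE 200 PAGESIZE 100
SELECT l.thread#, l.group#, l.sequence#, l.status, l.archived, f.member
FROM   v$log l
JOIN   v$logfile f ON f.group# = l.group#
ORDER  BY l.thread#, l.group#;
EXIT
SQL
}
```

The quoted heredoc keeps the shell from expanding the `$` in the view names. The CURRENT group of each thread is the one instance recovery depends on most.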
5️⃣ Validate ASM & Cluster Services:

Restarted cluster resources if needed (crsctl start res -all).
Ensured ASM disk groups were mounted and their disks accessible (asmcmd lsdg).
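The lsdg output can be filtered to show only disk groups that need attention. This sketch assumes the header row contains columns named State, Offline_disks and Name, and it looks the columns up by name because the column set differs between releases; the function name `check_asm_diskgroups` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `asmcmd lsdg` output on stdin and print disk
# groups that are not MOUNTED or that report offline disks.
# Usage: asmcmd lsdg | check_asm_diskgroups
check_asm_diskgroups() {
  awk '
    NR == 1 {
      # Map each header name to its column position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("State" in col) || !("Offline_disks" in col) || !("Name" in col)) {
        print "unexpected lsdg header" > "/dev/stderr"
        exit 2
      }
      next
    }
    NF > 0 {
      state   = $(col["State"])
      offline = $(col["Offline_disks"])
      name    = $(col["Name"])
      if (state != "MOUNTED" || offline + 0 > 0)
        print name, state, "offline_disks=" offline
    }
  '
}
```

Empty output means every listed disk group is mounted with no offline disks. A disk group that is missing from lsdg entirely will not be reported, so compare the list against the disk groups you expect.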
6️⃣ Verify Data Integrity & Availability:

Queried v$database to confirm OPEN_MODE was READ WRITE.
Tested listener reachability with tnsping and end-to-end application connectivity with sqlplus.
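Because v$database shows only the database-wide open mode, it helps to also confirm that every RAC instance is OPEN through gv$instance. The sketch below is one possible way to script that check; the function names `instance_status_sql` and `report_not_open` are hypothetical, and the usage line assumes OS authentication as SYSDBA.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a SQL*Plus script that emits one
# "<instance_name> <status>" line per RAC instance.
instance_status_sql() {
  cat <<'SQL'
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT instance_name || ' ' || status FROM gv$instance ORDER BY inst_id;
EXIT
SQL
}

# Hypothetical helper: read "<instance_name> <status>" lines on stdin, print
# the instances whose status is not OPEN, and return non-zero if any are
# found or if no rows were read at all.
# Usage: instance_status_sql | sqlplus -s "/ as sysdba" | report_not_open
report_not_open() {
  awk '
    NF >= 2 {
      seen++
      name = $1
      $1 = ""; sub(/^ /, "")        # remainder of the line is the status
      if ($0 != "OPEN") { print name, $0; bad++ }
    }
    END { exit (bad > 0 || seen == 0) ? 1 : 0 }
  '
}
```

An instance that is down does not appear in gv$instance at all, so the number of rows should also be compared with the number of instances you expect.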
7️⃣ Proactive Measures for Future Stability:

Implemented cluster node fencing to prevent split-brain scenarios.
Set up automated alerts & health monitoring (OEM / Cloud Control).

This was a great learning experience, reinforcing the importance of high availability and disaster recovery strategies in Oracle RAC environments.

💡 Have you faced similar RAC challenges? Let’s discuss best practices in the comments!
