Smart Monitoring! How does Oracle
RAC manage Resource and State?
Copyright © 2019 Oracle and/or its affiliates.
Anil Nair
Sr Principal Product Manager,
Oracle Real Application Clusters (RAC)
@RACMasterPM
http://www.linkedin.com/in/anil-nair-01960b6
http://www.slideshare.net/AnilNair27/
The preceding is intended to outline our general product direction. It is intended for information purposes
only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code,
or functionality, and should not be relied upon in making purchasing decisions. The development,
release, timing, and pricing of any features or functionality described for Oracle’s products may change
and remains at the sole discretion of Oracle Corporation.
Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and
prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed
discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and
Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q
under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website
at http://www.oracle.com/investor. All information in this presentation is current as of September
2019 and Oracle undertakes no duty to update any statement in light of new information or future events.
Safe Harbor
How to achieve Maximum Availability?
• Quickly detect the outage
• Quickly resolve the outage, with minimum disruption
• Fail over to a disaster recovery site for site-level failures
Scope of Outage
• Application/Client outage
• Host CPU, memory, network, or storage outage
• Complete site outage
It is very important to identify the failure quickly and attempt to resolve it locally.
Oracle Database HA Features* (* not a complete list)

Process, resource, instance, and node failures:
• Detect outage: Oracle Cluster Synchronization Services, Oracle LMS process, Oracle Clusterware Agents, Oracle LMON process, Oracle ASM, Oracle Memory Guard
• Resolve outage: node eviction by CSS/Agents, instance eviction by LMON, resource move by Oracle Clusterware Agents, service shutdown by Memory Guard

Complete site outage:
• Resolve outage: failover to the remote site with Data Guard
Oracle Cluster Synchronization Services
Detect and Evict Unresponsive Nodes
CSSD provides node membership services:
• CSSD is started by the CSSDAgent
• Runs as the Oracle user
• Sends heartbeats both to the voting disk and, via the private network, to the remote CSSDs
• Evicts the node if heartbeats are missing
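The eviction decision boils down to a missed-heartbeat counter: if a node has not been heard from for longer than a tolerated interval, it becomes an eviction candidate. The following is a toy Python sketch of that idea only; the `misscount` value, node names, and class are illustrative and not Oracle's implementation:

```python
# Toy model of heartbeat tracking: a node becomes an eviction
# candidate once no heartbeat has been seen for `misscount` seconds.
# Names and values are illustrative, not Oracle's CSSD internals.
class HeartbeatMonitor:
    def __init__(self, misscount=30):
        self.misscount = misscount   # seconds of silence tolerated
        self.last_seen = {}          # node -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def eviction_candidates(self, now):
        return sorted(node for node, t in self.last_seen.items()
                      if now - t > self.misscount)

mon = HeartbeatMonitor(misscount=30)
mon.heartbeat("node1", now=0)
mon.heartbeat("node2", now=0)
mon.heartbeat("node1", now=25)          # node1 keeps beating, node2 goes silent
print(mon.eviction_candidates(now=40))  # -> ['node2']
```

In the real cluster the same logic runs against both the network heartbeat and the voting-disk heartbeat, so a node that loses only one channel can still be distinguished from one that is truly dead.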
Oracle Clusterware CSSD Monitor & Agent
Monitors CSSD and other critical processes
Pre-11.2 Runtime View
Clusterware Agent & Resource(s) Startup
Oracle Clusterware Agents
• Agents are spawned by OHASD and CRSD, and they monitor the corresponding resources
• Actions are based on the policy master
• They are persistent processes and therefore perform better than the script-based CRS resource actions of pre-11.2 releases
• For example, the CHECK_INTERVAL of the VIP resource is 1s starting with 11.2
Agents in Cluster Startup
• OHASD invokes the following agents
• cssdagent
• orarootagent
• oraagent
• cssdmonitor
• CRSD invokes the following agents
• orarootagent
• oraagent
• orajagent aka Java Agent (new in 12.2)
• Any user defined agents
Agent Actions
• START, STOP
• CHECK: If it notices any state change during this action, then the agent
framework notifies Oracle Clusterware about the change in the state of the
specific resource.
• CLEAN: The CLEAN entry point acts whenever there is a need to clean up a
resource. It is a non-graceful operation that is invoked when users must
forcefully terminate a resource. This command cleans up the resource-specific
environment so that the resource can be restarted.
• ABORT: If any of the other entry points hang, the agent framework calls the
ABORT entry point to abort the ongoing action.
Resource State Information
• Check returns one of the following values to indicate the
resource state:
• ONLINE
• UNPLANNED_OFFLINE
• PLANNED_OFFLINE
• UNKNOWN
• PARTIAL
• FAILED
• Checks are implicitly called after start, stop, clean.
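The entry-point model above can be sketched as a tiny agent skeleton: each entry point manipulates the resource and the check reports one of the documented state values. This is a hypothetical illustration of the agent model, not Oracle's agent framework API; the class and fields are invented:

```python
# Toy agent skeleton: start/stop/clean entry points for one resource,
# with check() returning one of the documented state values and being
# implicitly called after start, stop, and clean.
ONLINE = "ONLINE"
UNPLANNED_OFFLINE = "UNPLANNED_OFFLINE"
PLANNED_OFFLINE = "PLANNED_OFFLINE"

class ToyAgent:
    def __init__(self):
        self.running = False
        self.stopped_on_purpose = False

    def start(self):
        self.running = True
        self.stopped_on_purpose = False
        return self.check()          # check is implicit after start

    def stop(self):
        self.running = False
        self.stopped_on_purpose = True
        return self.check()          # ... and after stop

    def clean(self):
        # non-graceful cleanup so the resource can be restarted
        self.running = False
        return self.check()

    def check(self):
        if self.running:
            return ONLINE
        return PLANNED_OFFLINE if self.stopped_on_purpose else UNPLANNED_OFFLINE

agent = ToyAgent()
print(agent.start())   # -> ONLINE
agent.running = False  # simulate a crash rather than a requested stop
print(agent.check())   # -> UNPLANNED_OFFLINE
print(agent.stop())    # -> PLANNED_OFFLINE
```

The distinction between PLANNED_OFFLINE and UNPLANNED_OFFLINE is what lets the framework decide whether a restart or failover action is warranted.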
Check Action (CRSD)
• CRSD also has a deep check implementation that runs once every 10 checks
• The deep check makes sure that the OCR thread within the CRSD process is not hung
• The deep check also makes sure that the Policy Engine module within CRSD is not hung
• The agent ignores the first two consecutive deep check failures before declaring that the daemon has failed
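The "ignore the first two consecutive failures" rule is a simple debounce, so a transient hiccup does not trigger a daemon restart. A toy sketch of that rule (the three-in-a-row threshold comes from the slide; the function itself is illustrative):

```python
# Debounce deep-check failures: declare the daemon failed only after
# more than `tolerated` consecutive failed checks; any success resets
# the counter, so isolated failures are ignored.
def daemon_failed(deep_check_results, tolerated=2):
    consecutive = 0
    for ok in deep_check_results:
        consecutive = 0 if ok else consecutive + 1
        if consecutive > tolerated:
            return True
    return False

print(daemon_failed([True, False, False, True, False]))  # transient -> False
print(daemon_failed([True, False, False, False]))        # 3 in a row -> True
```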
HAIP
High Availability for the Private Network
HAIP Network Configuration
Node A Node B Node C
SW1 - 192.168.0.0/24
SW2 - 10.0.0.0/24
• Highly Available Network providing
redundancy and aggregation
functions for the private
interconnect.
• No longer requires OS level
bonding configuration
• Better utilization of private
interfaces configured in the cluster
profile.
• Used by both Oracle Clusterware
components and the database.
HAIP Implementation Details
• All networks configured in the cluster profile are used.
• Configures HAIP Addresses on the private interconnect.
• Addresses created through Link Local Address Protocol.
• Creates IP address in the 169.254.0.0/16 subnet
• Maximum of 4 HAIP addresses configured on any node
• Tolerates interface failures
• HAIP address on failed interface dynamically moved to another
interface
• Dynamically add/remove interfaces from the cluster profile
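The failover behavior amounts to reassigning the link-local addresses from a failed interface onto the surviving ones. A toy placement sketch (interface names, the round-robin policy, and the function are invented for illustration; the real placement is managed internally by Clusterware):

```python
# Toy HAIP-style placement: spread the 169.254/16 addresses over the
# currently healthy private interfaces; when an interface fails, a
# re-placement over the survivors moves its address automatically.
def place(addresses, interfaces):
    # round-robin the HAIP addresses over the healthy interfaces
    return {addr: interfaces[i % len(interfaces)]
            for i, addr in enumerate(addresses)}

haips = ["169.254.0.1", "169.254.128.1"]
print(place(haips, ["eth1", "eth2"]))  # one address per interface
print(place(haips, ["eth1"]))          # eth2 failed: both land on eth1
```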
HAIP Failure Handling
(Diagram sequence: three instances, each with HAIP addresses in 169.254.0.x and 169.254.128.x on two private networks, 10.0.0.0/24 and 192.168.0.0/24. When an interface fails, its 169.254.128.x address moves to a surviving interface; when the interface recovers, the address moves back.)
LMS
Manage the Global Buffer Cache

LMS manages the global buffer cache:
• LMS ships blocks based on requests by remote clients
• LMS has its own retry mechanism to handle block-shipping failures; retries are very expensive and bad for performance
• LMS can offload work to its slaves (LMS CR slaves) to mitigate outliers
• LMS is monitored by LMHB
(Diagram: four instances, each SGA holding a buffer cache, shared pool, in-memory area, and miscellaneous components served by LMS processes; together the per-instance SGAs form the total SGA.)
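A block-shipping request with a bounded retry loop can be sketched as follows. The retry count, the failure model, and the function are made up for illustration; the point is only that retries are bounded because each one is expensive:

```python
# Toy retry loop for shipping a block to a remote requester. Retries
# are bounded: each retry is expensive and hurts performance, so after
# max_retries the failure is escalated instead of retried forever.
def ship_block(send, block, max_retries=3):
    for attempt in range(1, max_retries + 1):
        if send(block):
            return attempt           # number of attempts it took
    raise RuntimeError("block shipping failed after %d attempts" % max_retries)

outcomes = iter([False, True])       # first send fails, second succeeds
print(ship_block(lambda b: next(outcomes), block=42))  # -> 2
```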
CR Slaves to Mitigate Performance Outliers
• In previous releases, LMS worked on incoming consistent read requests in a sequential fashion
• Sessions requesting consistent blocks that require applying a lot of undo may keep LMS busy
• Starting with Oracle RAC 12c Release 2, LMS offloads work to 'CR slaves' if the amount of undo to be applied exceeds a certain, dynamic threshold
• The default is 1 slave, and additional slaves are spawned as needed

Example of undo churn on one account (a consistent read as of time T must undo the later changes):

Time  Account  Amount
T     13579    $2500
T+1   13579    $2000
T+2   13579    $1000
T+3   13579    $200
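The offload decision reduces to comparing the undo depth of a request against a threshold. A toy dispatch sketch (the threshold value, the request model, and the function are invented; Oracle's actual threshold is internal and dynamic):

```python
# Toy dispatch: handle cheap consistent-read requests inline and
# offload requests that need deep undo application to CR slaves.
def dispatch(requests, undo_threshold=8):
    inline, offloaded = [], []
    for name, undo_records in requests:
        (offloaded if undo_records > undo_threshold else inline).append(name)
    return inline, offloaded

reqs = [("q1", 2), ("q2", 50), ("q3", 0), ("q4", 12)]
print(dispatch(reqs))  # -> (['q1', 'q3'], ['q2', 'q4'])
```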
LMON
Instance Membership

IMR – Instance Membership Recovery
LMON has the final word on which instances are part of a cluster database:
• LMON has its own heartbeat to the other LMONs and to the control file
• If there is a timeout, LMON can evict another instance
IMR – Send Timeouts
• An "IPC send timeout" occurs when a cross-instance message is not acknowledged by the remote instance within 5 minutes (the default timeout), resulting in an ORA-29740
• The ORA-29770 and ORA-29771 error messages were introduced in 11.2 to take action, in most cases, before an "IPC send timeout" is hit
Review: IPC Send Timeouts
• This example is from a 4-node cluster. The alert log from instance 1 showed a send timeout, and the receiver was on instance 4:
alert_p599a.log-Mon Apr 17 09:42:10 2006
alert_p599a.log:IPC Send timeout detected. Sender ospid 3859
alert_p599d.log-Mon Apr 17 09:42:11 2006
alert_p599d.log:IPC Send timeout detected. Receiver ospid 9014
alert_p599d.log-Mon Apr 17 09:42:11 2006
Solving IPC Send Timeouts
• In 11.2, a non-fatal background process called LMHB (heartbeat monitor) was created to monitor health via periodic heartbeats
• Processes monitored by LMHB:
• LMON (global enqueue service monitor)
• LMD0 (global enqueue service daemon)
• LMS* (global cache service process)
• LCK0 (Lock Process)
• DIAG and DIA0 (Diagnostic Processes)
• RMS0 (Oracle RAC management server)
• Possibly more depending on version
ORA-29770 and ORA-29771
• Any non-fatal process blocking the monitored processes (e.g. holding latches) will be terminated after a timeout, regardless of system load.
• Fatal processes will only be terminated when load is low.
• Exceptions (no kill) are given when any of LM*, LCK, DIA*
processes are in the middle of CF enqueue or CF I/O operations,
row cache and library cache background operations, or doing
system state dump.
Evict Sick or Unresponsive Nodes
• LMS1 (ospid: 22636) has detected no messaging activity from instance 1
LMS1 (ospid: 22636) issues an IMR to resolve the situation
Communications reconfiguration: instance number 1
• Evicting instance 1 from cluster
Waiting for instances to leave: 1
Sat Jul 24 10:38:45 2010
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 1
Waiting for instances to leave: 1
• Analysis: Instance 1 was hanging and not responding, so instance 2 evicted instance 1 and waited for instance 1 to abort.
Automatic Storage Management
Stripe and Mirror Everything
Oracle ASM – Automatic Storage Management
• Shared disk groups (Disk Group A, Disk Group B) form an ASM cluster pool of storage
• Wide file striping
• One-to-one mapping of ASM instances to servers
(Diagram: a five-node RAC cluster; each node runs its own ASM instance, serving database instances DBA, DBB, and DBC from the shared disk groups.)
Removal of the One-to-One Mapping and HA
Oracle Flex ASM provides even higher HA:
• Databases share ASM instances
• A node without a local ASM instance runs as an ASM client of a remote node (e.g., Node1 as a client of Node2)
• If the serving ASM instance fails, the client reconnects to another ASM instance (e.g., Node1 becomes a client of Node4)
(Diagram: a five-node RAC cluster in which only a subset of nodes runs ASM instances over Disk Groups A and B.)
ASM Flex Disk Groups
Database-oriented storage management for additional flexibility and availability:
• In a conventional disk group, the files of all databases (DB1, DB2, DB3) are intermixed across the disk group
• In a Flex Disk Group, each database's files are organized into their own File Group (e.g., a file group per database holding its File 1, File 2, File 3, ...)
ASM Flex Disk Groups (12.2 Flex Disk Group organization)
Database-oriented storage management for more flexibility and availability. Flex Disk Groups enable:
• Quota management – limit the space databases can allocate in a disk group, and thereby improve the ability to consolidate databases into fewer disk groups
• Redundancy change – use lower redundancy for less critical databases, and even change redundancy online
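Quota management is essentially a per-database allocation check inside the disk group. A toy sketch of the bookkeeping (the class, quota numbers, and GB units are invented for illustration; real quotas are set on ASM file groups):

```python
# Toy quota check for a flex disk group: each database may allocate
# space only while it stays within its file-group quota, which is what
# makes it safe to consolidate many databases into one disk group.
class FlexDiskGroup:
    def __init__(self, quotas):
        self.quotas = dict(quotas)              # db -> quota in GB
        self.used = {db: 0 for db in self.quotas}

    def allocate(self, db, gb):
        if self.used[db] + gb > self.quotas[db]:
            raise ValueError("quota exceeded for %s" % db)
        self.used[db] += gb
        return self.used[db]

dg = FlexDiskGroup({"DB1": 100, "DB2": 20})
print(dg.allocate("DB1", 60))  # -> 60
print(dg.allocate("DB2", 15))  # -> 15
# dg.allocate("DB2", 10) would raise: quota exceeded for DB2
```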
Hang Manager
Detects and Resolves Hangs and Deadlocks
Overlooked & Underestimated – Hang Manager
• Customers experience database hangs for a variety of reasons: high system load, workload contention, network congestion, or errors
• Before Hang Manager was introduced with Oracle RAC 11.2.0.2, Oracle required information to troubleshoot a hang, e.g. system state dumps (for RAC: global system state dumps), and customers usually had to reproduce the problem with additional events set
Why is a Hang Manager required?
Hang Manager – Workings
• Always on – enabled by default
• Reliably detects database hangs
• Autonomously resolves them
• Considers QoS policies during hang resolution
• Logs all detected hangs and their resolutions
• New SQL interface to configure sensitivity (Normal/High)

Hang Manager Optimizations
• Hang Manager auto-tunes itself by periodically collecting instance- and cluster-wide hang statistics
• Metrics like cluster health and instance health are tracked over a moving average
• This moving average is considered during resolution
• Holders waiting on a SQL*Net break/reset are fast-tracked
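Tracking health over a moving average can be sketched with a fixed-size window. The window length, the health scale, and the class are invented for illustration; they stand in for the instance- and cluster-wide metrics Hang Manager collects over time:

```python
from collections import deque

# Toy moving average over periodic health samples: only the most
# recent `window` samples contribute, so the average reflects the
# recent trend rather than the whole history.
class MovingAverage:
    def __init__(self, window=5):
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        return self.average()

    def average(self):
        return sum(self.samples) / len(self.samples)

health = MovingAverage(window=3)
for sample in (90, 80, 70, 60):   # health trending down
    avg = health.add(sample)
print(avg)                        # average of the last 3 samples -> 70.0
```

A resolution decision based on this average is less jumpy than one based on the latest sample alone, which is the point of considering the moving average during hang resolution.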
DBMS_HANG_MANAGER.Sensitivity
• Early warning is exposed via a V$ view
• Sensitivity can be set higher if the user feels the default level is too conservative
• Hang Manager behavior can be further fine-tuned by setting appropriate QoS policies
Hang Sensitivity Level | Description                                                                                                   | Note
NORMAL                 | Hang Manager uses its default internal operating parameters to try to meet typical requirements for any environment. | Default
HIGH                   | Hang Manager is more alert to sessions waiting in a chain than at NORMAL sensitivity.                                |
Data Guard
Respond to catastrophic site failures
Data Guard: Real-time Data Protection
• Included with Oracle Database Enterprise Edition
• Failover to a remote site (Primary Data Center to DR Data Center)
• Automatic Block Repair
• Managed via the Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
Active Data Guard: Advanced Capabilities
• A licensable option to Oracle Database Enterprise Edition
• Zero data loss at any distance
• Automatic Block Repair
• Offload fast incremental backups
• Offload read-only workload to an open standby database
• Managed via the Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
Getting the Most Out of Your Active Data Guard DR Site
Active Data Guard: Advanced Capabilities
• Zero data loss at any distance
• Automatic Block Repair
• DML Redirection
• Offload fast incremental backups
• Offload read-mostly workload to an open standby database
• Managed via the Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
Data Guard Standby Redo Apply
• In a typical RAC primary and RAC standby configuration, only one node of the standby can apply redo
• The other RAC nodes of the standby typically sit idle, even if apply is CPU-bound
• Another instance takes over redo apply only if the instance applying redo crashes
Multi-Instance Redo Apply
• Utilizes all RAC nodes on the standby to apply redo
• Parallel, multi-instance recovery means "the standby DB will keep up"
• Standby recovery utilizes CPU and I/O across all nodes of the RAC standby
• Up to 3500+ MB/sec apply rate on an 8-node RAC
• Multi-instance apply runs on all MOUNTED instances or all OPEN instances
• Exposed in the Broker with the 'ApplyInstances' property on the standby:
  recover managed standby database disconnect using instances 4;
Multi-Instance Redo Apply Performance
Utilize all Oracle RAC instances on the standby database to parallelize recovery.
(Chart: standby apply rates in MB/sec for OLTP and batch workloads on Exadata scale nearly linearly across 1, 2, 4, and 8 instances: one workload grows from 190 to 1480 MB/sec, the other from 700 to 5000 MB/sec.)
Autonomous Database = RAC on Exadata (& More)
(Diagram: Autonomous Database = Oracle RAC on Exadata plus automated data center operations in Oracle Cloud.)
• Oracle RAC is enabled in the Oracle Autonomous Cloud offering
• Oracle RAC meets and exceeds the stringent Autonomous Transaction Processing Dedicated (ATP-D) requirements
• Successfully providing scalability and availability to the Oracle Database for all
Summary
The Oracle RAC Family of Solutions is an integrated stack that works together cohesively to ensure that, regardless of the failure, the stack continues to run with minimal or no interruption to user sessions, both on-premises and in Oracle Cloud environments.

More Related Content

PPTX
Oracle RAC features on Exadata
PDF
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
PDF
Oracle Clusterware Node Management and Voting Disks
PDF
Make Your Application “Oracle RAC Ready” & Test For It
PDF
Oracle RAC - New Generation
PDF
Exadata master series_asm_2020
PDF
Oracle Extended Clusters for Oracle RAC
PDF
New Generation Oracle RAC Performance
Oracle RAC features on Exadata
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
Oracle Clusterware Node Management and Voting Disks
Make Your Application “Oracle RAC Ready” & Test For It
Oracle RAC - New Generation
Exadata master series_asm_2020
Oracle Extended Clusters for Oracle RAC
New Generation Oracle RAC Performance

What's hot (20)

PDF
Understanding oracle rac internals part 1 - slides
PDF
Oracle RAC 19c: Best Practices and Secret Internals
PDF
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
PDF
MAA Best Practices for Oracle Database 19c
PDF
Oracle RAC Internals - The Cache Fusion Edition
PDF
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
PDF
The Oracle RAC Family of Solutions - Presentation
PDF
Oracle RAC 19c and Later - Best Practices #OOWLON
PDF
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
PDF
Understanding oracle rac internals part 2 - slides
PDF
Oracle Flex ASM - What’s New and Best Practices by Jim Williams
PDF
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
PPTX
Part1 of SQL Tuning Workshop - Understanding the Optimizer
PDF
A deep dive about VIP,HAIP, and SCAN
PDF
Tanel Poder - Scripts and Tools short
PDF
DB Time, Average Active Sessions, and ASH Math - Oracle performance fundamentals
PDF
Oracle data guard for beginners
PDF
Oracle RAC on Extended Distance Clusters - Customer Examples
PDF
Migration to Oracle Multitenant
PDF
Oracle statistics by example
Understanding oracle rac internals part 1 - slides
Oracle RAC 19c: Best Practices and Secret Internals
Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...
MAA Best Practices for Oracle Database 19c
Oracle RAC Internals - The Cache Fusion Edition
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
The Oracle RAC Family of Solutions - Presentation
Oracle RAC 19c and Later - Best Practices #OOWLON
Oracle RAC Virtualized - In VMs, in Containers, On-premises, and in the Cloud
Understanding oracle rac internals part 2 - slides
Oracle Flex ASM - What’s New and Best Practices by Jim Williams
[Oracle DBA & Developer Day 2016] しばちょう先生の特別講義!!ストレージ管理のベストプラクティス ~ASMからExada...
Part1 of SQL Tuning Workshop - Understanding the Optimizer
A deep dive about VIP,HAIP, and SCAN
Tanel Poder - Scripts and Tools short
DB Time, Average Active Sessions, and ASH Math - Oracle performance fundamentals
Oracle data guard for beginners
Oracle RAC on Extended Distance Clusters - Customer Examples
Migration to Oracle Multitenant
Oracle statistics by example
Ad

Similar to Smart monitoring how does oracle rac manage resource, state ukoug19 (20)

PDF
AIOUG-GroundBreakers-Jul 2019 - 19c RAC
PPSX
RAC - The Savior of DBA
PPTX
Anil nair rac_internals_sangam_2016
PPTX
Oracle real application clusters system tests with demo
PDF
New availability features in oracle rac 12c release 2 anair ss
PDF
Racsig rac internals
PDF
An introduction to_rac_system_test_planning_methods
PDF
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - 19c RAC
PDF
Scaling paypal workloads with oracle rac ss
PDF
Oracle RAC 12c Rel. 2 for Continuous Availability
PPT
les_01.ppt of the Oracle course train_1 file
PPTX
HPC Controls Future
PDF
Rac introduction
PDF
IBM MQ High Availabillity and Disaster Recovery (2017 version)
PPTX
C15LV: Ins and Outs of Concurrent Processing Configuration in Oracle e-Busine...
PPT
01_Architecture_JFV14_01_Architecture_JFV14.ppt
PDF
Managing troubleshooting cluster_360dgrees
PDF
Using Machine Learning to Debug Oracle RAC Issues
PDF
les08.pdf
PDF
Oracle RAC 12c and Policy-Managed Databases, a Technical Overview
AIOUG-GroundBreakers-Jul 2019 - 19c RAC
RAC - The Savior of DBA
Anil nair rac_internals_sangam_2016
Oracle real application clusters system tests with demo
New availability features in oracle rac 12c release 2 anair ss
Racsig rac internals
An introduction to_rac_system_test_planning_methods
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - 19c RAC
Scaling paypal workloads with oracle rac ss
Oracle RAC 12c Rel. 2 for Continuous Availability
les_01.ppt of the Oracle course train_1 file
HPC Controls Future
Rac introduction
IBM MQ High Availabillity and Disaster Recovery (2017 version)
C15LV: Ins and Outs of Concurrent Processing Configuration in Oracle e-Busine...
01_Architecture_JFV14_01_Architecture_JFV14.ppt
Managing troubleshooting cluster_360dgrees
Using Machine Learning to Debug Oracle RAC Issues
les08.pdf
Oracle RAC 12c and Policy-Managed Databases, a Technical Overview
Ad

More from Anil Nair (6)

PDF
Using Machine Learning to Debug complex Oracle RAC Issues
PDF
Rac 12c rel2_operational_best_practices_sangam_2017_as_pdf
PPTX
Rac 12c rel2_operational_best_practices_sangam_2017
PPTX
Collaborate 17 Oracle RAC 12cRel 2 Best Practices
PDF
Step by Step instructions to install Cluster Domain deployment model
PPTX
Con8780 nair rac_best_practices_final_without_12_2content
Using Machine Learning to Debug complex Oracle RAC Issues
Rac 12c rel2_operational_best_practices_sangam_2017_as_pdf
Rac 12c rel2_operational_best_practices_sangam_2017
Collaborate 17 Oracle RAC 12cRel 2 Best Practices
Step by Step instructions to install Cluster Domain deployment model
Con8780 nair rac_best_practices_final_without_12_2content

Recently uploaded (20)

PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Welding lecture in detail for understanding
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Lesson 3_Tessellation.pptx finite Mathematics
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPTX
Geodesy 1.pptx...............................................
PPT
Mechanical Engineering MATERIALS Selection
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PPTX
additive manufacturing of ss316l using mig welding
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPT
Project quality management in manufacturing
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Internet of Things (IOT) - A guide to understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
CYBER-CRIMES AND SECURITY A guide to understanding
Welding lecture in detail for understanding
Model Code of Practice - Construction Work - 21102022 .pdf
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Lesson 3_Tessellation.pptx finite Mathematics
UNIT 4 Total Quality Management .pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Geodesy 1.pptx...............................................
Mechanical Engineering MATERIALS Selection
Embodied AI: Ushering in the Next Era of Intelligent Systems
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
additive manufacturing of ss316l using mig welding
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Project quality management in manufacturing
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx

Smart monitoring how does oracle rac manage resource, state ukoug19

  • 1. 1 Smart Monitoring! How does Oracle RAC manage Resource and State? Copyright © 2019 Oracle and/or its affiliates. Anil Nair Sr Principal Product Manager, Oracle Real Application Clusters (RAC) @RACMasterPM http://www.linkedin.com/in/anil-nair-01960b6 http://www.slideshare.net/AnilNair27/
  • 2. The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation. Statements in this presentation relating to Oracle’s future plans, expectations, beliefs, intentions and prospects are “forward-looking statements” and are subject to material risks and uncertainties. A detailed discussion of these factors and other risks that affect our business is contained in Oracle’s Securities and Exchange Commission (SEC) filings, including our most recent reports on Form 10-K and Form 10-Q under the heading “Risk Factors.” These filings are available on the SEC’s website or on Oracle’s website at http://www.oracle.com/investor. All information in this presentation is current as of September 2019 and Oracle undertakes no duty to update any statement in light of new information or future events. Safe Harbor Copyright © 2019 Oracle and/or its affiliates.
  • 3. Quickly Detect Outage Quickly Resolve Outage with minimum disruption Failover to disaster recovery site for site level failures How to achieve Maximum Availability?
  • 4. Application/Client Outage Host CPU, Memory, Network, Storage Outage Complete site Outage Scope of Outage Very Important to identify the failure quickly and attempt to resolve it locally
  • 5. Detect Outage Oracle Cluster Synchronization Services Oracle LMS process Oracle Clusterware Agents Oracle LMON process Oracle ASM Oracle Memory Guard Resolve Outage Node Eviction by CSS/Agents Instance eviction by LMON Resource move by Oracle Clusterware Agents Service Shutdown by Memory Guard 5 Failover to remote site Data Guard Oracle Database HA Features* * Not a complete list Process, Resource, Instance, Node Failures Compete Site Outage
  • 6. 6 Oracle Cluster Synchronization Services Detect and Evict Un-Responsive Nodes
  • 7. CSSD provides Node Membership services • CSSD is started by CSSDAgent • Runs as Oracle User • Sends Heartbeat to both Voting disk and via Private network to remote CSSD • Evicts the node if Heartbeats are missing 7
  • 8. 8 Oracle Clusterware CSSD Monitor & Agent Monitors CSSD and other critical processes
  • 10. Clusterware Agent & Resource(s) Startup
  • 11. Oracle Clusterware Agents • Agents are spawned by OHASD and CRSD and they monitor the corresponding resources. • Actions based on policy master • They are persistent processes, therefore they have better performance over the script based CRS resource action in pre- 11.2 releases. • For example, CHECK_INTERVAL of VIP resource is 1s starting with 11.2.*
  • 12. Agents in Cluster Startup • OHASD invokes the following agents • cssdagent • orarootagent • oraagent • cssdmonitor • CRSD invokes the following agents • orarootagent • oraagent • orajagent aka Java Agent (new in 12.2) • Any user defined agents
  • 13. Agent Actions • START, STOP • CHECK: If it notices any state change during this action, then the agent framework notifies Oracle Clusterware about the change in the state of the specific resource. • CLEAN: The CLEAN entry point acts whenever there is a need to clean up a resource. It is a non-graceful operation that is invoked when users must forcefully terminate a resource. This command cleans up the resource-specific environment so that the resource can be restarted. • ABORT: If any of the other entry points hang, the agent framework calls the ABORT entry point to abort the ongoing action.
  • 14. Resource State Information • Check returns one of the following values to indicate the resource state: • ONLINE • UNPLANNED_OFFLINE • PLANNED_OFFLINE • UNKNOWN • PARTIAL • FAILED • Checks are implicitly called after start, stop, clean.
  • 15. Check Action (CRSD) • CRSD also has a deep check implementation once every 10 checks • Deep check involves making sure that the OCR thread within the CRSD Process is not hung • Deep check also involves making sure the Policy Engine Module within CRSD is not hung • Agent will ignore the first two consecutive deep check failures before declaring that the daemon has failed
  • 16. HAIP High Availability for the Private Network
  • 17. HAIP Network Configuration Node A Node B Node C SW1 - 192.168.0.0/24 SW2 - 10.0.0.0/24 • Highly Available Network providing redundancy and aggregation functions for the private interconnect. • No longer requires OS level bonding configuration • Better utilization of private interfaces configured in the cluster profile. • Used by both Oracle Clusterware components and the database.
  • 18. HAIP Implementation Details • All networks configured in cluster profile is used. • Configures HAIP Addresses on the private interconnect. • Addresses created through Link Local Address Protocol. • Creates IP address in the 169.254.0.0/16 subnet • Maximum of 4 HAIP addresses configured on any node • Tolerates interface failures • HAIP address on failed interface dynamically moved to another interface • Dynamically add/remove interfaces from the cluster profile
  • 19. HAIP Failure Handling Inst 1 Inst 2 SW1 - 10.0.0.0/24 Inst 3 SW1 - 192.168.0.0/24 169.254.0.1 169.254.0.3169.254.0.2 169.254.128.1 169.254.128.3169.254.128.2169.254.128.1 169.254.128.2 169.254.128.3
  • 20. HAIP Failure Handling Inst 1 Inst 2 SW1 - 10.0.0.0/24 Inst 3 SW1 - 192.168.0.0/24 169.254.0.1 169.254.0.3169.254.0.2 169.254.128.3169.254.128.2 169.254.128.1 169.254.128.2 169.254.128.3
  • 21. HAIP Failure Handling Inst 1 Inst 2 SW1 - 10.0.0.0/24 Inst 3 SW1 - 192.168.0.0/24 169.254.0.1 169.254.0.3169.254.0.2 169.254.128.3 169.254.128.1 169.254.128.2 169.254.128.3
  • 22. HAIP Failure Handling Inst 1 Inst 2 SW1 - 10.0.0.0/24 Inst 3 SW1 - 192.168.0.0/24 169.254.0.1 169.254.0.3169.254.0.2 169.254.128.1 169.254.128.2 169.254.128.3
  • 23. HAIP Failure Handling Inst 1 Inst 2 SW1 - 10.0.0.0/24 Inst 3 SW1 - 192.168.0.0/24 169.254.0.1 169.254.0.3169.254.0.2 169.254.128.2169.254.128.1 169.254.128.3
  • 25. • LMS ships blocks based on requests by remote clients • LMS has its own retry mechanism to handle block shipping failures • Very Expensive • Bad for performance • LMS can offload to its slaves to mitigate outliers • LMS CR Slaves • LMS monitored by LMHB LMS manages the Global Buffer Cache Buffer Cache LMS* Shared Pool In- Memory Misc Buffer Cache LMS* Shared Pool In- Memory Misc Buffer Cache LMS* Shared Pool In- Memory Misc Buffer Cache LMS* Shared Pool In- Memory Misc Total SGA S G A S G A S G A
  • 26. CR Slaves to Mitigate Performance Outliers • In previous releases, LMS work on incoming consistent read requests in sequential fashion • Sessions requesting consistent blocks that require applying lot of undo may cause LMS to be busy • Starting with Oracle RAC 12c Release 2, LMS offloads work to ‘CR slaves’ if the amount of UNDO to be applied exceeds a certain, dynamic threshold • Default is 1 slave and additional slaves are spawned as needed 26 Time Account Amount T 13579 $2500 T+1 13579 $2000 T+2 13579 $1000 T+3 13579 $200
  • 28. LMON Has the Final Word on Which Instances Are Part of a Cluster DB • LMON has its own heartbeat to the other LMONs and to the control file • If there is a timeout, LMON can evict another instance • IMR – Instance Membership Recovery
  • 29. IMR – Send Timeouts • An "IPC send timeout" occurs when a cross-instance message is not acknowledged by the remote instance within 5 minutes (default timeout), resulting in an ORA-29740 • The ORA-29770 and ORA-29771 error messages were introduced in 11.2 to take action before an "IPC send timeout" is hit in most cases
  • 30. Review: IPC Send Timeouts • This example is from a 4-node cluster. The alert log from instance 1 showed a send timeout and the receiver was on instance 4:
alert_p599a.log-Mon Apr 17 09:42:10 2006
alert_p599a.log:IPC Send timeout detected. Sender ospid 3859
alert_p599d.log-Mon Apr 17 09:42:11 2006
alert_p599d.log:IPC Send timeout detected. Receiver ospid 9014
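Correlating these messages across alert logs can be scripted. A small sketch matching the log format shown above:

```python
# Sketch: pair sender and receiver ospids from "IPC Send timeout" lines
# across instance alert logs (format as in the excerpt above).
import re

def find_ipc_timeouts(log_lines):
    """Yield (role, ospid) for each IPC Send timeout line."""
    pat = re.compile(r"IPC Send timeout detected\. (Sender|Receiver) ospid (\d+)")
    for line in log_lines:
        m = pat.search(line)
        if m:
            yield m.group(1), m.group(2)

logs = [
    "alert_p599a.log:IPC Send timeout detected. Sender ospid 3859",
    "alert_p599d.log:IPC Send timeout detected. Receiver ospid 9014",
]
print(list(find_ipc_timeouts(logs)))
# [('Sender', '3859'), ('Receiver', '9014')]
```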
  • 31. Solving IPC Send Timeouts • In 11.2, a non-fatal background process called LMHB (heartbeat monitor) was created to monitor process health via periodic heartbeats • Processes monitored by LMHB: • LMON (global enqueue service monitor) • LMD0 (global enqueue service daemon) • LMS* (global cache service processes) • LCK0 (lock process) • DIAG and DIA0 (diagnostic processes) • RMS0 (Oracle RAC management server) • Possibly more depending on version
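The monitoring pattern can be sketched as follows. The process names and timeout are illustrative; LMHB's real mechanics are internal to Oracle:

```python
# Hedged sketch of the LMHB pattern: each monitored process posts a heartbeat
# timestamp, and the monitor flags any process whose last beat is older than
# the timeout. Names and timeout value are illustrative.
class HeartbeatMonitor:
    def __init__(self, timeout_sec: float):
        self.timeout = timeout_sec
        self.last_beat = {}

    def beat(self, process: str, now: float):
        self.last_beat[process] = now

    def stalled(self, now: float):
        """Return processes that missed their heartbeat window."""
        return [p for p, t in self.last_beat.items() if now - t > self.timeout]

mon = HeartbeatMonitor(timeout_sec=5.0)
mon.beat("LMON", now=100.0)
mon.beat("LMS0", now=100.0)
mon.beat("LMON", now=104.0)    # LMON keeps beating, LMS0 goes quiet
print(mon.stalled(now=107.0))  # ['LMS0']
```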
  • 32. ORA-29770 and ORA-29771 • Any non-fatal process blocking the monitored processes (e.g. holding latches) will be terminated after a timeout regardless of system load • Fatal processes will only be terminated when load is low • Exceptions (no kill) are made when any of the LM*, LCK, or DIA* processes are in the middle of CF enqueue or CF I/O operations, row cache or library cache background operations, or a system state dump
  • 33. Evict Sick, Unresponsive Nodes
LMS1 (ospid: 22636) has detected no messaging activity from instance 1
LMS1 (ospid: 22636) issues an IMR to resolve the situation
Communications reconfiguration: instance number 1
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Sat Jul 24 10:38:45 2010
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 1
Waiting for instances to leave: 1
  • Analysis: Instance 1 was hanging and not responding, so instance 2 evicted instance 1 and waited for instance 1 to abort
  • 34. Confidential – Oracle Internal/Restricted/Highly Restricted • Automatic Storage Management: Stripe and Mirror Everything
  • 35. Oracle ASM – Automatic Storage Management (diagram: a five-node RAC cluster with a one-to-one mapping of ASM instances to servers; each node runs an ASM instance plus database instances for DBA, DBB, and DBC, all sharing an ASM cluster pool of storage with shared disk groups A and B and wide file striping)
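The stripe-and-mirror idea can be sketched as round-robin extent placement with the mirror copy forced onto a different disk. This is a deliberate simplification of ASM's actual allocator, for illustration only:

```python
# Illustrative sketch of "stripe and mirror everything": spread file extents
# round-robin across all disks, placing each extent's mirror on a different
# disk so a single disk failure never loses both copies.
def allocate_extents(num_extents: int, disks: list):
    """Return [(primary_disk, mirror_disk)] for each extent."""
    placement = []
    for i in range(num_extents):
        primary = disks[i % len(disks)]
        mirror = disks[(i + 1) % len(disks)]  # never the same disk as primary
        placement.append((primary, mirror))
    return placement

print(allocate_extents(4, ["disk1", "disk2", "disk3"]))
# [('disk1', 'disk2'), ('disk2', 'disk3'), ('disk3', 'disk1'), ('disk1', 'disk2')]
```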
  • 36. Removal of One-to-One Mapping and HA • Oracle Flex ASM provides even higher HA (diagram: a five-node RAC cluster in which databases share ASM instances; only three nodes run an ASM instance, and Node1 runs as an ASM client to Node2)
  • 37. Removal of One-to-One Mapping and HA • Oracle Flex ASM provides even higher HA (diagram: after the ASM instance on Node2 is gone, Node1 runs as an ASM client to Node4 instead)
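The failover idea behind the two diagrams can be sketched as reassigning the database clients of a failed ASM instance to a surviving one. All names here are illustrative:

```python
# Sketch of the Flex ASM idea (names are illustrative): database instances
# are clients of some available ASM instance; if their ASM instance fails,
# they reconnect to a surviving one instead of crashing.
def reassign_clients(assignment: dict, failed_asm: str, surviving: list):
    """Move clients of the failed ASM instance to a surviving one."""
    return {db: (surviving[0] if asm == failed_asm else asm)
            for db, asm in assignment.items()}

before = {"DB_node1": "ASM_node2", "DB_node3": "ASM_node3"}
after = reassign_clients(before, failed_asm="ASM_node2", surviving=["ASM_node4"])
print(after)  # {'DB_node1': 'ASM_node4', 'DB_node3': 'ASM_node3'}
```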
  • 38. ASM Flex Disk Groups • Database-oriented storage management for additional flexibility and availability (diagram: in a conventional disk group, the files of DB1, DB2, and DB3 are intermixed; in a Flex Disk Group, each database's files are organized into their own file group)
  • 39. ASM Flex Disk Groups • Database-oriented storage management for more flexibility and availability (diagram: 12.2 Flex Disk Group organization with per-database file groups and quotas) • Flex Disk Groups enable: – Quota management: limit the space databases can allocate in a disk group, improving customers' ability to consolidate databases into fewer disk groups – Redundancy change: use lower redundancy for less critical databases, and even change redundancy online
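The quota-management idea can be sketched as an allocation check. Sizes and names are illustrative; ASM enforces quotas internally via quota groups:

```python
# Hedged sketch of quota management in a Flex Disk Group: reject an
# allocation when it would push a database past its quota (sizes in GB).
class FlexDiskGroup:
    def __init__(self):
        self.quota = {}  # db name -> quota in GB
        self.used = {}   # db name -> GB allocated so far

    def set_quota(self, db: str, gb: int):
        self.quota[db] = gb

    def allocate(self, db: str, gb: int) -> bool:
        """Allocate space for db; return False if the quota would be exceeded."""
        if self.used.get(db, 0) + gb > self.quota.get(db, float("inf")):
            return False
        self.used[db] = self.used.get(db, 0) + gb
        return True

dg = FlexDiskGroup()
dg.set_quota("DB1", 100)
print(dg.allocate("DB1", 80))  # True  (80 of 100 GB used)
print(dg.allocate("DB1", 30))  # False (80 + 30 > 100 GB quota)
```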
  • 40. Hang Manager – Detects and Resolves Hangs and Deadlocks
  • 41. Overlooked & Underestimated – Hang Manager • Why is a Hang Manager required? • Customers experience database hangs for a variety of reasons: high system load, workload contention, network congestion or errors • Before Hang Manager was introduced with Oracle RAC 11.2.0.2, Oracle required information to troubleshoot a hang, e.g. system state dumps (for RAC: global system state dumps) • Customers usually had to reproduce the problem with additional events set
  • 42. Hang Manager – Workings • Always on – enabled by default • Reliably detects database hangs • Autonomically resolves them • Considers QoS policies during hang resolution • Logs all detected hangs and their resolutions • New SQL interface to configure sensitivity (Normal/High)
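Hang detection can be illustrated with a simplified wait-for-chain walk. Real Hang Manager analysis is far richer, but a cycle in a waits-for map is the classic deadlock signature:

```python
# Simplified sketch: each session waits on at most one holder; a cycle in
# the waits-for map means a deadlock, a cycle-free chain is just a hang.
def find_deadlock(waits_for: dict):
    """Return a cyclic wait chain if one exists, else None."""
    for start in waits_for:
        chain, s = [start], start
        while s in waits_for:
            s = waits_for[s]
            if s in chain:
                return chain[chain.index(s):] + [s]
            chain.append(s)
    return None

# Session 12 waits on 34, 34 waits on 56, 56 waits back on 12: a deadlock.
print(find_deadlock({12: 34, 34: 56, 56: 12}))  # [12, 34, 56, 12]
print(find_deadlock({12: 34, 34: 56}))          # None (a chain, not a cycle)
```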
  • 43. Hang Manager Optimizations • Hang Manager auto-tunes itself by periodically collecting instance- and cluster-wide hang statistics • Metrics like cluster health and instance health are tracked over a moving average • This moving average is considered during resolution • Holders waiting on SQL*Net break/reset are fast-tracked
  • 44. DBMS_HANG_MANAGER.Sensitivity • Early warning exposed via a V$ view • Sensitivity can be set higher if the user feels the default level is too conservative • Hang Manager behavior can be further fine-tuned by setting appropriate QoS policies • Hang sensitivity levels: – NORMAL (default): Hang Manager uses its default internal operating parameters to meet typical requirements for any environment – HIGH: Hang Manager is more alert to sessions waiting in a chain than at the NORMAL level
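For example, raising the sensitivity from the default looks roughly like the following (usage per the 12.2 DBMS_HANG_MANAGER package documentation; verify the exact syntax against your release):

```sql
EXEC DBMS_HANG_MANAGER.SET(DBMS_HANG_MANAGER.SENSITIVITY, DBMS_HANG_MANAGER.SENSITIVITY_HIGH);
```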
  • 45. Data Guard – Respond to Catastrophic Site Failures
  • 46. Data Guard: Real-Time Data Protection • Included with Oracle Database Enterprise Edition • Failover from the primary data center to the DR data center • Automatic block repair • Managed via Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
  • 47. Active Data Guard: Advanced Capabilities • A licensable option to Oracle Database Enterprise Edition • Zero data loss at any distance • Automatic block repair • Offload fast incremental backups and read-only workload to the open standby database • Managed via Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
  • 48. Active Data Guard: Advanced Capabilities • Getting the most out of your Active Data Guard DR site • Zero data loss at any distance • Automatic block repair • DML redirection • Offload fast incremental backups and read-mostly workload to the open standby database • Managed via Data Guard Broker (Enterprise Manager Cloud Control or DGMGRL)
  • 49. Data Guard Standby Redo Apply • In a typical RAC primary and RAC standby configuration, only one node of the standby can apply redo • The other RAC nodes of the standby are typically in waiting mode, even if the apply is CPU-bound • Another instance takes over redo apply only if the instance applying redo crashes
  • 50. Data Guard Standby Redo Apply (diagram)
  • 51. Multi-Instance Redo Apply • Utilize all RAC nodes on the standby to apply redo • Parallel, multi-instance recovery means "the standby DB will keep up" • Standby recovery utilizes CPU and I/O across all nodes of the RAC standby • Up to 3500+ MB/sec apply rate on an 8-node RAC • Multi-instance apply runs on all MOUNTED instances or all OPEN instances • Exposed in the Broker with the 'ApplyInstances' property; on the standby: recover managed standby database disconnect using instances 4;
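The equivalent Broker configuration is a one-line property change in DGMGRL ('stby' below is a placeholder database name; check the Broker reference for your release):

```
DGMGRL> EDIT DATABASE 'stby' SET PROPERTY 'ApplyInstances' = 4;
```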
  • 53. Multi-Instance Redo Apply Performance • Utilize all Oracle RAC instances on the standby database to parallelize recovery • Standby apply rates in MB/sec for OLTP and batch workloads on Exadata (bar chart): across 1, 2, 4, and 8 instances, one workload series scales 190 / 380 / 740 / 1480 and the other 700 / 1400 / 2752 / 5000 — near-linear scaling with instance count
  • 54. Autonomous Database = RAC on Exadata (& More) • Autonomous Database: automated data center operations on Oracle Cloud • Oracle RAC is enabled in the Oracle Autonomous Database Cloud offering • Oracle RAC meets and exceeds the stringent Autonomous Transaction Processing Dedicated (ATP-D) requirements • Successfully provides scalability and availability to the Oracle Database for all
  • 55. Summary • The Oracle RAC family of solutions is an integrated stack that works together cohesively to ensure that, regardless of the failure, the stack continues to run with minimal or no interruption to user sessions, in both on-premises and Oracle Cloud environments