Managing a Large OLTP Database
Paresh Patel
Database Engineer
11/19/2014
1
• Introduction to PayPal
• Who am I
• Overview of PayPal Database infrastructure
• Capacity management
• Planned maintenances
• Performance management
• Troubleshooting
• Summary
• Q & A
Disclaimer: Some of the observations here may not apply to your environment, so test them or contact Oracle before implementing them.
2
Agenda
Who am I
• Database Engineer, MTS2
• Oracle RAC Certified Professional with more than a decade’s experience starting with Oracle 9i
• Oracle RAC, ADG, performance tuning and GoldenGate expert
• Conversant with MongoDB, Cassandra and Couchbase
3
4
Database Deployment Pattern
[Diagram: database deployment pattern. Labels: OCI clients (OCI1 … OCIn), load balancers, Primary Data Center (Primary DB, ADG, GG, ADG (LDR), ASYNC), Remote Data Center (ADG, GG, ADG (DR), ASYNC).]
Note: The primary, all ADG targets and all GG targets are RAC clusters.
– To support defined business goals, size the database tier so that it provides uninterrupted service to end users
– The following KPIs are used to determine how much business the DB tier can support:
– Storage Read/Write IOPS
• Virtual Instruments
Output from VI
• asmcmd iostat
– CPU
• vmstat
5
Capacity Management
– Interconnect (applicable to Oracle RAC)
• nmon utility (AIX)
• netstat -i -I ibd1 -P udp 1 (Solaris, AIX)
• /usr/sbin/perfquery --extended <lid> <port> (Exadata)
» The ibstat command provides the LID and port
• DBA_HIST_IC_DEVICE_STATS (populated only when the UDP protocol is used)
Output from netstat command measuring in and out packets per second
6
Capacity Management Continued…
– Latency of cluster related wait events
• V$EVENT_HISTOGRAM
• DBA_HIST_EVENT_HISTOGRAM
• Goal is to keep the average wait time for gc * grant wait events below 1 ms
• Goal is to keep the average wait time for gc block transfer wait events below 1.5 ms
AWR snippet showing details of cluster-related wait events
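A minimal sketch of checking the latency goals above from histogram data. It assumes rows shaped like V$EVENT_HISTOGRAM output (EVENT, WAIT_TIME_MILLI, WAIT_COUNT), where WAIT_TIME_MILLI is the upper bound of each bucket; the sample counts are made up. Because every wait is counted at its bucket ceiling, the average is an upper bound, not the exact value.

```python
from collections import defaultdict

# Hypothetical rows shaped like V$EVENT_HISTOGRAM output:
# (event, wait_time_milli, wait_count); wait_time_milli is the
# upper bound of the bucket (1, 2, 4, 8, ... ms).
SAMPLE_ROWS = [
    ("gc cr grant 2-way", 1, 9000),
    ("gc cr grant 2-way", 2, 800),
    ("gc cr grant 2-way", 4, 200),
    ("gc current block 2-way", 1, 7000),
    ("gc current block 2-way", 2, 2000),
    ("gc current block 2-way", 4, 1000),
]


def summarize(rows):
    """Return {event: (pct_of_waits_in_1ms_bucket, upper_bound_avg_ms)}."""
    total = defaultdict(int)
    within_1ms = defaultdict(int)
    weighted = defaultdict(int)
    for event, bucket_ms, count in rows:
        total[event] += count
        weighted[event] += bucket_ms * count  # count each wait at its bucket ceiling
        if bucket_ms <= 1:
            within_1ms[event] += count
    return {
        event: (100.0 * within_1ms[event] / total[event],
                weighted[event] / total[event])
        for event in total
        if total[event]
    }


if __name__ == "__main__":
    for event, (pct, avg_ms) in summarize(SAMPLE_ROWS).items():
        print(f"{event}: {pct:.1f}% of waits in the 1 ms bucket, "
              f"average at most {avg_ms:.2f} ms")
```

For an exact average, total wait time divided by total waits from the system-level event statistics is the better source; the histogram is mainly useful for spotting the long tail.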
7
Capacity Management Continued…
– Homegrown tools
• Provides holistic view of all databases
– Helps in detecting and mitigating a problem quickly
• Provides detailed instance level view of all critical metrics
– Executions, redo/sec, active sessions, physical reads, consistent gets, buffer gets, load, CPU etc.
– Same stats are used to derive deviations for key metrics
Snippet from homegrown tool for monitoring Database instance
• AWR warehouse
– Helps us monitor key metrics across Read Replicas
– Keep historical AWR data
– SQL profiling for W-o-W/M-o-M deviation in execs, lio, pio, cpu, elapsed time
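A minimal sketch of the week-over-week deviation check mentioned above. The input shape (per-SQL weekly totals keyed by a SQL identifier), the 20% threshold and the sample numbers are all assumptions for illustration; this is not the homegrown tool itself.

```python
def pct_deviation(previous, current):
    """Percentage change from previous to current; None when there is no baseline."""
    if previous == 0:
        return None
    return 100.0 * (current - previous) / previous


def flag_deviations(last_week, this_week, threshold_pct=20.0):
    """Return {(sql_id, metric): pct} for metrics that moved more than threshold_pct."""
    flagged = {}
    for sql_id, metrics in this_week.items():
        baseline = last_week.get(sql_id, {})
        for metric, value in metrics.items():
            pct = pct_deviation(baseline.get(metric, 0), value)
            if pct is not None and abs(pct) > threshold_pct:
                flagged[(sql_id, metric)] = round(pct, 1)
    return flagged


# Hypothetical weekly totals for one SQL statement (made-up id and numbers).
LAST_WEEK = {"sqlid_a": {"execs": 1000, "lio": 50000, "elapsed_s": 40.0}}
THIS_WEEK = {"sqlid_a": {"execs": 1050, "lio": 80000, "elapsed_s": 70.0}}

if __name__ == "__main__":
    print(flag_deviations(LAST_WEEK, THIS_WEEK))
```

In this sample, executions grew by only 5% while logical I/O and elapsed time jumped, which is the kind of pattern that usually points at a plan or data change rather than a traffic increase.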
8
Database Monitoring
– AWR
• @?/rdbms/admin/awrgrpt.sql (Global report of RAC cluster)
• @?/rdbms/admin/awrgdrpt.sql (Global diff report of RAC cluster)
• @?/rdbms/admin/awrddrpt.sql (Instance diff report)
NOTE: Generate reports from physical standbys rather than from the live (primary) database
9
Database Monitoring Continued…
– Monitor ADG
• Using STATSPACK to monitor ADGs
NOTE: Please follow MOS: Doc ID 454848.1 to install Statspack on ADG
• Using homegrown tools
Snippet from homegrown tool: executions, RO vs. live
– System Metrics
• Homegrown utilities
• Oracle OSWatcher
10
Database Monitoring Continued…
– Performing DDL operations on busy tables
– Set DDL_LOCK_TIMEOUT to 10 seconds
» If the lock is not acquired within that time, the DDL errors out
– Make sure any running DML batch job is cleared before issuing the DDL
– Any new DML will queue behind the DDL
– Expect hard parses
– Creating/Dropping Indexes
– Always create indexes in invisible mode to avoid adverse effects such as plan changes and cursor invalidations
– Make an index visible only after verifying the explain plans of the affected SQL with OPTIMIZER_USE_INVISIBLE_INDEXES=TRUE set at session level
– To avoid impact on production databases, make indexes invisible before dropping them
– Leverage physical standby for testing
– Convert the standby to a snapshot standby for testing after taking a guaranteed restore point (GRP)
11
Planned Maintenance
– Patching process
– Build ORACLE_HOME and GRID_HOME gold images after extensive tests with production-like workload and concurrency
– Make sure all patches can be applied in a rolling fashion
– Test and verify client connectivity
– Plan the patching ramp-up (start with tier 3 databases)
– Copy important files from the old home to the new home
» GRID_HOME: gpnp, crs, dbs, cdata, network/admin etc.
» ORACLE_HOME: network/admin, dbs etc.
– For RAC, compile the binaries with the same interconnect protocol that the other nodes of the cluster use
» /usr/ccs/bin/make -f ins_rdbms.mk ipc_rds ioracle (to compile using RDS)
» /usr/ccs/bin/make -f ins_rdbms.mk ipc_g ioracle (to compile using UDP)
» To check the protocol, use skgxpinfo after setting the correct environment variables, or run nm on $ORACLE_HOME/lib/libskgxp11.so
Planned Maintenance Continued…
Snippet from nm command output
– Minimize brownout during RAC reconfiguration
• Take the instance out of traffic
• Shutting down instance(s) in a RAC cluster has a direct impact on ongoing DML
• Shrink the DB_CACHE_SIZE, DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE pools gradually
» alter system set DB_CACHE_SIZE=5G scope=memory sid='A_1';
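A minimal sketch of planning a gradual shrink as a series of statements like the one above. The starting size, target size and step size are hypothetical; the slide does not say what step the author uses, and the generated statements are meant to be reviewed and run one at a time, not blindly scripted.

```python
def shrink_steps(current_gb, target_gb, step_gb):
    """Sizes (in GB) to apply in order, stepping down from current_gb to target_gb."""
    if step_gb <= 0 or target_gb > current_gb:
        raise ValueError("need step_gb > 0 and target_gb <= current_gb")
    sizes = []
    size = current_gb
    while size > target_gb:
        size = max(size - step_gb, target_gb)  # never undershoot the target
        sizes.append(size)
    return sizes


def shrink_statements(parameter, sid, current_gb, target_gb, step_gb):
    """Render one ALTER SYSTEM statement per step for a single instance."""
    return [
        f"alter system set {parameter}={size}G scope=memory sid='{sid}';"
        for size in shrink_steps(current_gb, target_gb, step_gb)
    ]


if __name__ == "__main__":
    # Hypothetical: instance A_1 goes from 20 GB to 5 GB in 5 GB steps.
    for statement in shrink_statements("DB_CACHE_SIZE", "A_1", 20, 5, 5):
        print(statement)
```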
Planned Maintenance Continued…
– Database Switchover to Physical Standby
• To minimize the downtime
– Set the parameter below on the current primary:
» alter system set "_SWITCHOVER_TO_STANDBY_OPTION"="OPEN_ONE_IGNORE_SESSIONS"; (applicable from 11.2.0.2 onwards)
NOTE: Killing sessions before a database switchover can take minutes. To avoid that, we set this parameter, which essentially ignores the sessions. In Oracle Database 12c, this setting is the default.
– Defer all archive destinations except the new primary target
– Enable flashback and take a Guaranteed Restore Point
– Mount all instances of the new primary target before switchover to avoid brownout during RAC reconfiguration
– Create online redo log files on the new primary target
– Set the following parameters on the new primary target to avoid high physical reads and load
» _DB_BLOCK_PREFETCH_QUOTA = 0
» _DB_BLOCK_PREFETCH_LIMIT = 0
» _DB_FILE_NONCONTIG_MBLOCK_READ_COUNT = 0
» _DB_CACHE_PRE_WARM = FALSE
NOTE: These parameters disable read-ahead right after switchover. Only index full scan operations benefit from this.
– Once the switchover-to-physical-standby command has completed successfully on the current primary and MRP has detected the End-Of-Redo indicator, issue the switchover to primary on the physical standby. The old primary can be shut down after the new primary is up and running
14
Planned Maintenance Continued...
– Database Switchover to Physical Standby
– In a failover situation, to avoid rebuilding all standbys, flash them back to the activation SCN and apply redo from the new target
NOTE: Enable flashback on the standbys before applying any redo if it is not already enabled.
– How to switchover/failover GoldenGate:
» Copy the dirprm and dirchk directories to the new target
» Make the necessary changes to configuration parameters such as RMTTRAIL, CACHEDIRECTORY etc.
» Use TRANLOGOPTIONS ARCHIVEDLOGONLY to recover data from archive logs in a failover situation
– Plan stability is the key
– Explain plan stays stable during various growth phases of a segment
– Avoid plan invalidation when stats are published
– Disable _OPTIM_PEEK_USER_BINDS parameter
– Less Data skewness
– Set stats manually to derive the explain plan
15
Planned Maintenance Continued…
– Plan stability is the key
– STATS we set manually
» New table: # of rows set to 1,000,000 and # of blocks to 100,000
» PK-based index stats: # of blocks set to 1,000, # of distinct values to 1,000,000, clustering factor to 100,000
» Non-unique index stats: # of blocks set to 1,500, # of distinct values to 500,000, clustering factor to 150,000
» Column stats: density set to 1 / number of distinct values
Use DBMS_STATS to set stats manually
dbms_stats.set_table_stats
dbms_stats.set_index_stats
dbms_stats.set_column_stats
– To avoid overhead of auto stats collection job
– Investigating SQL Plan baselines for next upgrade cycle
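A minimal sketch that renders the manual statistics listed above as DBMS_STATS calls. The owner, table, index and column names are hypothetical, and mapping the slide's "# of blocks" for an index to the numlblks (leaf blocks) argument is an assumption. Verify the argument names against the DBMS_STATS documentation for your release before running any generated text.

```python
def density(num_distinct):
    """Column density as on the slide: 1 / number of distinct values."""
    if num_distinct <= 0:
        raise ValueError("num_distinct must be positive")
    return 1.0 / num_distinct


def table_stats_call(owner, table, numrows=1_000_000, numblks=100_000):
    # Defaults are the "new table" values from the slide.
    return (f"dbms_stats.set_table_stats(ownname => '{owner}', tabname => '{table}', "
            f"numrows => {numrows}, numblks => {numblks});")


def index_stats_call(owner, index, numlblks, numdist, clstfct):
    # Assumption: the slide's "# of blocks" is mapped to leaf blocks (numlblks).
    return (f"dbms_stats.set_index_stats(ownname => '{owner}', indname => '{index}', "
            f"numlblks => {numlblks}, numdist => {numdist}, clstfct => {clstfct});")


def column_stats_call(owner, table, column, distcnt):
    return (f"dbms_stats.set_column_stats(ownname => '{owner}', tabname => '{table}', "
            f"colname => '{column}', distcnt => {distcnt}, "
            f"density => {density(distcnt):.10f});")


if __name__ == "__main__":
    # APP, ORDERS, ORDERS_PK and ORDER_ID are hypothetical names.
    calls = [
        table_stats_call("APP", "ORDERS"),
        index_stats_call("APP", "ORDERS_PK", 1_000, 1_000_000, 100_000),
        column_stats_call("APP", "ORDERS", "ORDER_ID", 1_000_000),
    ]
    print("begin\n  " + "\n  ".join(calls) + "\nend;\n/")
```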
16
Planned Maintenance Continued…
– Oracle RAC
– Oracle RAC works great, but there is a certain amount of CPU overhead depending on the workload
– To reduce CPU overhead, isolate workloads to a subset of the cluster's nodes using database services
– LMS processes directly impact system CPU utilization and interconnect traffic
– Starting/stopping an instance causes RAC reconfiguration
» To reduce reconfiguration during planned maintenance, follow the tips in the “Planned Maintenance” section above
– Use the UDP protocol over Ethernet and the RDS protocol over IB. Please check http://www.oracle.com/technetwork/database/clustering/tech-generic-unix-new-166583.html for the certification matrix
– RDS is a low-latency protocol compared to UDP, but it doesn’t support an Active-Active configuration unless bonding is done at the OS level
– Use UDP to enhance the network throughput
– Always start the LMS, LMD, LGWR and VKTM processes with RT priority:
» Set _HIGH_PRIORITY_PROCESSES to 'LMS*|VKTM|LMD*|LGWR'
» chown root:dba $ORACLE_HOME/bin/oradism; chmod 4750 $ORACLE_HOME/bin/oradism
17
Performance Management
– Oracle RAC
– Disable DRM on critical databases, as it causes unacceptable and unpredictable freezes
» Disable it by setting the _GC_POLICY_TIME parameter to 0
– Monitor the average response time for cluster-related wait events
– Disable CRS autostart and set “RESTART_ATTEMPTS” to 0 for the DB resource to prevent CRS and the database from coming up after a crash
» crsctl disable crs
» crsctl modify res ora.testdb.db -attr "RESTART_ATTEMPTS=0"
– ASSM vs MSSM
– With a very high level of concurrency, ASSM may cause contention, while MSSM allows you to set freelists and freelist groups to larger values
– Use an ASSM tablespace to create indexes online, due to a bug which is exposed only in MSSM
» Bug 18715233 (ORA-00600: internal error code, arguments: [kdifind:objdchk_kcbgcur_6], [1], [31226], [0], [0], [], [], [], [], [], [], [])
– Data Reorganization
– Periodically reorganize data so that rows related to one logical entity sit in fewer data blocks
– If the rows of a table on disk are sorted in the same order as the index keys, the database performs the minimum number of I/Os on the table to read rows via the index
– Keep the old and new tables in sync using Oracle GoldenGate and switch the public synonym to the new table
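A minimal sketch of why row order matters for index access, using a simplified model of the clustering factor: walk the index in key order and count how often the table block changes. The block size of four rows and the key layouts are made up for illustration, and the model ignores everything else the database does.

```python
def clustering_factor(rows_in_index_order):
    """Count table block changes while walking (key, block_id) pairs in key order."""
    changes = 0
    previous_block = None
    for _key, block_id in rows_in_index_order:
        if block_id != previous_block:
            changes += 1
            previous_block = block_id
    return changes


def layout(keys_in_storage_order, rows_per_block):
    """Assign block ids by storage position, then return (key, block_id) sorted by key."""
    placed = [(key, position // rows_per_block)
              for position, key in enumerate(keys_in_storage_order)]
    return sorted(placed)


if __name__ == "__main__":
    # Rows stored in key order: one block change per block.
    ordered = layout([1, 2, 3, 4, 5, 6, 7, 8], rows_per_block=4)
    # Rows interleaved across blocks: one block change per row.
    scattered = layout([1, 3, 5, 7, 2, 4, 6, 8], rows_per_block=4)
    print("ordered:", clustering_factor(ordered))
    print("scattered:", clustering_factor(scattered))
```

With the ordered layout the count equals the number of blocks; with the scattered layout it equals the number of rows, which is the situation a reorganization is meant to fix.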
18
Performance Management Continued…
– Active Data Guard
– All blocks are mastered on the node where media recovery is running
– Starting/stopping media recovery invokes RAC reconfiguration
– Query response time on the node where MRP is running is always higher than on non-MRP node(s)
– If the primary database crashes, query response time on ADG goes up right after the primary comes back online, as ADG applies redo as fast as it can to resolve the apply lag
– For critical read-mostly databases, we maintain a mix of ADG and Oracle GoldenGate reader farms
– For quick session failover, set _ABORT_ON_MRP_CRASH to true to crash all instances of the cluster when MRP crashes. Create a CRS resource to introduce the same behavior on GG-based ROs
NOTE: ADG Internals by Sai Devabhaktuni http://sai-oracle.blogspot.com/2012/11/internals-of-active-dataguard.html
Snippet of ADG monitoring from homegrown tool
19
Performance Management Continued…
– Outliers
– ASH
– V$EVENT_HISTOGRAM
– Top SQLs
– Maintain inventory of TOP SQLs (by cluster wait time, executions, buffer gets, CPU etc.)
– Check AWR diff report or DBA_HIST_SQLSTAT
– Generate reports comparing metric data across ROs from the AWR warehouse
– Bigger SGA
– Turn off Automatic SGA management
– Set appropriate values for _LM_TICKETS and GCS_SERVER_PROCESSES
» Follow MOS note: Best Practices and Recommendations for RAC databases using SGA larger than 300GB (Doc ID 1619155.1)
– Consider configuring the DB_KEEP_CACHE_SIZE and DB_RECYCLE_CACHE_SIZE pools and putting appropriate segments in them
– Managing Sequences
– Ordered sequences present scalability challenges due to high GC message activity
– Try to keep sequences NOORDER and route the write workload to a designated node
– Watch out for large gaps in sequence values if write traffic is routed to a set of nodes
– Create a logon trigger to handle sequence order in a failover scenario
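A toy model of the sequence behavior described above: each instance caches its own range of a NOORDER sequence, so values handed out across nodes are neither gap-free nor in time order. The class, the cache size of 20 and the node names are hypothetical; this illustrates the effect only and is not how Oracle implements sequences internally.

```python
class CachedSequence:
    """Toy model of a NOORDER sequence with a separate cache per instance."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.next_unallocated = 1  # shared high-water mark
        self.caches = {}           # instance -> [next_value, end_exclusive]

    def nextval(self, instance):
        cache = self.caches.get(instance)
        if cache is None or cache[0] >= cache[1]:
            # Cache empty or exhausted: reserve the next range for this instance.
            start = self.next_unallocated
            self.next_unallocated += self.cache_size
            cache = [start, start + self.cache_size]
            self.caches[instance] = cache
        value = cache[0]
        cache[0] += 1
        return value


if __name__ == "__main__":
    seq = CachedSequence(cache_size=20)
    calls = ["node1", "node1", "node1", "node2", "node2", "node1"]
    print([seq.nextval(node) for node in calls])
```

Routing all writes to one designated node keeps the values increasing; once a second node starts serving writes, its values jump to the next cached range, which is where the large gaps come from.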
20
Performance Management Continued…
– V$SESSION
– Active session count is an indicator of user activity in the database
– ACTION, MODULE and CLIENT_IDENTIFIER can reveal the most important information about application requests
» An OCI client can set bind variable values, the client application name etc. using APIs
NOTE: We use this workaround because the _optim_peek_user_binds parameter is set to FALSE
– WAIT_TIME_MICRO shows how long the session has been waiting, or how long its last wait took if it is not currently waiting
– EVENT shows why the session is waiting
– Most of the time, a query on v$session provides enough clues to diagnose the issue further
– V$ACTIVE_SESSION_HISTORY
– Provides
» Timing and duration of the issue
» session details
» wait event information
» Blocking session information
» Wait time information
» IN_XXXX columns provide session’s execution state information
– Check the last X minutes of data to get clues about where the problem could be
– ASH data can be inconsistent due to lack of read consistency in underlying X$ fixed tables
– Always take a copy of v$active_session_history right after an incident
NOTE: Deep dive into ASH by Sai Devabhaktuni http://sai-oracle.blogspot.com/2012/11/deep-dive-into-ash.html
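A minimal sketch of the "check the last X minutes" step, run against a saved copy of ASH samples. It assumes each sample has been reduced to a (sample_time, session_id, event) tuple, with event set to None for sessions that were on CPU; the session ids and counts are made up. Since ASH samples active sessions at a fixed interval, the sample count per event approximates where the time went.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_events(samples, now, minutes, limit=3):
    """Rank events by sample count over the last `minutes` before `now`."""
    cutoff = now - timedelta(minutes=minutes)
    counts = Counter(
        event or "ON CPU"  # assumption: a missing event means the session was on CPU
        for sample_time, _session_id, event in samples
        if cutoff <= sample_time <= now
    )
    return counts.most_common(limit)


if __name__ == "__main__":
    now = datetime(2014, 11, 19, 12, 0, 0)
    # Hypothetical samples copied out of v$active_session_history.
    samples = [
        (now - timedelta(seconds=10), 101, "gc buffer busy acquire"),
        (now - timedelta(seconds=9), 101, "gc buffer busy acquire"),
        (now - timedelta(seconds=8), 102, "gc buffer busy acquire"),
        (now - timedelta(seconds=8), 103, None),
        (now - timedelta(seconds=7), 103, None),
        (now - timedelta(minutes=20), 104, "log file sync"),  # outside the window
        (now - timedelta(seconds=5), 104, "log file sync"),
    ]
    for event, count in top_events(samples, now, minutes=5):
        print(f"{event}: {count} samples")
```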
21
Troubleshooting
– Homegrown tools
– Provide various database metrics from V$SESSION, V$SYSSTAT and V$SYSTEM_EVENT every 10 seconds
– Executions, redo/sec, active sessions, physical reads, consistent gets, buffer gets, load, CPU etc.
NOTE: The tools do not use any GV$ queries, because Oracle spawns new processes on all RAC instances to serve them
– Reproducing issue in test environment
– Some issues that happen in production don’t produce enough diagnostic data for Oracle to provide an RCA and a possible fix
– Identify the workload and concurrency at the time the problem occurred
– Set up an identical environment and run the workload with the same concurrency
– Log files
– RDBMS and ASM Alert log files
– agent, crsd log files
– gipcd, cssd log files
– System log files under /var/adm/
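A minimal sketch of the 10-second sampling described under "Homegrown tools" above: two snapshots of cumulative counters (as V$SYSSTAT reports them) are turned into per-second rates. The snapshot values are made up, and skipping counters that went backwards (for example after an instance restart) is a design assumption, not something the slides state.

```python
def per_second_rates(previous, current, interval_s=10):
    """Turn two snapshots of cumulative counters into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name, value in current.items():
        # Skip counters with no baseline or that went backwards between snapshots.
        if name in previous and value >= previous[name]:
            rates[name] = (value - previous[name]) / interval_s
    return rates


if __name__ == "__main__":
    # Hypothetical snapshots taken 10 seconds apart on one instance.
    before = {"execute count": 1_000, "redo size": 500_000, "physical reads": 40_000}
    after = {"execute count": 26_000, "redo size": 1_500_000, "physical reads": 41_000}
    for name, rate in per_second_rates(before, after).items():
        print(f"{name}: {rate:.1f}/sec")
```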
22
Troubleshooting Continued…
• Always perform scale-up tests with 5x workload for new features, patches and Oracle version upgrades
• Drive the database stack to failure to test capacity limits
• Master important views such as v$session and v$active_session_history
• Take advantage of Snapshot Standby for testing
• Stable execution plans are the key to stable performance
• Measure capacity along various dimensions, including the interconnect
• Monitor databases using a complementary set of tools to fully understand the database profile
• The right tools help troubleshoot issues more quickly
23
Summary
Q & A
Thank You!
24

NoCOUG_201411_Patel_Managing_a_Large_OLTP_Database
