Performance Optimization With SAP On DB2 - Key Performance Indicators
Jan Kritter
SAP DB6 Technology Centre of Expertise
[email protected]
Overview
Performance is key: response time & throughput are critical in deciding whether the following are met or missed:
Service level agreements
User expectations
Smooth operation of your business
KPIs help to isolate areas with sub-optimal performance and hence enable you to:
Quickly verify the health of the database's performance
Catch trends before they become a problem
Determine optimization potential of new business applications
Help shorten turn-around times to a solution if a problem occurs
Value proposition:
Nuts and bolts on what KPIs to pay attention to most
How to proactively and reactively analyze your system and tune it accordingly
You learn
Which database performance indicators are the most important ones for performance monitoring and tuning
What they mean from a technical perspective
What influences them and what in turn is affected by them
How to monitor them
What thresholds typically work well for SAP systems
Agenda: topics
At-a-glance overview
1. Motivation: Tasks that are important to determine and ensure adequate database performance 1, 2
  Quickly verify the database's performance
  Detect performance problems, analyze, and help shorten turn-around times to solution
  Catch bad trends before they become a problem
  Determine optimization need and potential at the database level of new business applications during their test phase
2. Key performance indicators - introduction
  What are key performance indicators (KPIs) and performance indicators (PIs)? What are they good for?
3. KPIs - where to access them
  SAP transactions
  Command line
  SAP performance warehouse
  3rd party
4. KPIs - how to store them
  Not at all (in the case of online observation)
  Download (from within SAP transactions)
  In tables (through table functions)
  In files (through OS level commands)
5. Analyses: common scenarios
  Spot check, ad hoc, system status (proactive & reactive, online)
  Trend (proactive, offline)
  Problem (reactive, offline)
6. Monitoring strategy: Which data to collect
7. KPIs - details
  Main families of KPIs
  Which KPIs are the important ones?
  The thresholds we work with
  Potential problems they point to and further analyses to carry out
1: The new 9.7 monitoring approach of wait events is explicitly excluded here; it will eventually be addressed in another session
2: Only levels 0 and 1 are shown; deeper levels have been neglected here and will be shown and addressed on the more detailed slides only
1. Motivation for this session: Tasks that enable you to determine and ensure adequate database performance
For more in-depth data, also collect dbm, db, buffer pool, table space, table, and SQL data
For finer-grained data, especially if the application runs for a long time and is complex, collect both the KPIs and table function output repeatedly, as often as every few seconds
NB: Be very careful to space out subsequent data collections so that they neither overlap (i.e. one is still running when the next is already kicked off) nor produce unacceptable load, hence performance overhead. This requires careful testing of the data collector on a representative (regarding workload, data volume, and available resources) system; a minimal sketch follows below.
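As an illustration of the "in tables" storage option, a minimal sketch of such a repeated collection (assuming DB2 9.x with the SYSIBMADM.SNAPDB administrative view; the MONHIST schema and table name are hypothetical):

    -- One-time setup: a history table for a handful of database-level KPI counters
    CREATE TABLE monhist.db_kpi_hist (
      collected_at        TIMESTAMP NOT NULL,
      pool_data_l_reads   BIGINT,
      pool_data_p_reads   BIGINT,
      pool_index_l_reads  BIGINT,
      pool_index_p_reads  BIGINT,
      total_sorts         BIGINT,
      sort_overflows      BIGINT,
      lock_waits          BIGINT,
      lock_wait_time      BIGINT,
      deadlocks           BIGINT,
      lock_timeouts       BIGINT
    );

    -- Collection step, scheduled repeatedly (e.g. every few seconds) and spaced
    -- out so that one run never overlaps with the next:
    INSERT INTO monhist.db_kpi_hist
      SELECT CURRENT TIMESTAMP,
             pool_data_l_reads, pool_data_p_reads,
             pool_index_l_reads, pool_index_p_reads,
             total_sorts, sort_overflows,
             lock_waits, lock_wait_time,
             deadlocks, lock_timeouts
        FROM SYSIBMADM.SNAPDB;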
Definition: Key Performance Indicators are the top tier of performance indicators, pointing towards:
System-wide or otherwise far-reaching performance problems
Serious performance problems
The most frequently occurring performance problems
Performance indicators are the level beneath KPIs as far as extent, impact, and frequency 1 are concerned
The classification is not based on hard and fast rules but rather on experience; exceptions exist, and transitions between categories are sometimes blurred and shifting
KPIs are often composed of PIs
This leads to families or groups of indicators
Analyses then are top-down, starting with the KPI at the head of a tree and proceeding with successively finer levels of PIs (the best example is buffer pool quality; see the sketch below)
Indicators are not only influenced by members of their own family but also interact with members of other families (influencing them and being influenced by them), creating a web of possible dependencies that needs to be navigated during a more complex performance analysis
Very important: Indicators are to be considered as indicators [sic!], not incontrovertible proof, and often need to be corroborated with other indicators and further facts
1: Extent: how widely the system is affected (the system as a whole vs. localized)
Impact: how strongly performance is affected
Frequency: how often (according to our experience) the problem occurs
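To illustrate such a family, a sketch of the buffer pool quality tree computed from its underlying monitoring elements (assuming the SYSIBMADM.SNAPBP administrative view of DB2 9.x):

    -- KPI: overall buffer pool quality, with the two PIs beneath it
    -- (data hit ratio and index hit ratio), per buffer pool
    SELECT bp_name,
           -- hit ratio = 100% - physical reads / logical reads
           100 - 100.0 * (pool_data_p_reads + pool_index_p_reads)
                 / NULLIF(pool_data_l_reads + pool_index_l_reads, 0) AS overall_quality_pct,
           100 - 100.0 * pool_data_p_reads
                 / NULLIF(pool_data_l_reads, 0)  AS data_hit_ratio_pct,
           100 - 100.0 * pool_index_p_reads
                 / NULLIF(pool_index_l_reads, 0) AS index_hit_ratio_pct
      FROM SYSIBMADM.SNAPBP;

A low overall value then leads down the tree: whichever of the two lower-level ratios is worse points to the family member to analyze next.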
SAP transactions:
ST04 → Performance
Database, Buffer pools, Table spaces, Tables, Applications, Lock waits and Deadlocks: database KPIs
Granularity: system-wide vs. individual server(s), time (hour, day, week, month), transactions
ST06: OS data
I/O bottlenecks (through I/O wait time, which, however, is not conclusive; use further diagnostics such as iostat or NMON)
1: Not addressed here; refer to the remote expert session on this topic by Steffen Siegmund (the developer of the SAP performance warehouse)
1: db2support, _the_ data collector par excellence for DB6 defect problems, is not used here, as it easily collects several hundred MB worth of data of which little is performance-relevant; hence, specifically collecting only performance-relevant data is more effective
Kinds of analyses:
Spot check, current system status, ad hoc analysis
Trend analysis
Performance problem analysis 1
Types:
Proactive
Reactive
Modes:
Online
Offline
1: Ad hoc analysis and performance problem analysis can be very similar in nature, the difference being that the former tends to be more spur-of-the-moment, online, in situ, and less deep, while the latter is more premeditated, uses offline data, and goes deeper.
Again, this is more a general classification with often blurring boundaries, especially as the former often leads to the latter: a spot check determines whether a performance problem might exist; seeing a possible one, you carry on with a somewhat more thorough ad hoc analysis; and upon finding more indicators for a problem, you start a full-blown in-depth performance problem analysis.
If regularly, then more rarely, as a table snapshot takes much more space (a few hundred KB each time)
Regularly when it is known to be of use, e.g. during an analysis where one wants to find out which are the busiest tables
SQL statements:
Only save expensive statements (see 'Expensive statements' below for details)
1: These monitoring elements exist in the snapshots too, but here they are presented as collections of monitoring elements grouped for a specific purpose, which is why they are called administrative _views_ (see the example below)
DB2 online documentation: Snapshot monitor SQL Administrative Views
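For example, a sketch using one such grouped view (assuming SYSIBMADM.TOP_DYNAMIC_SQL, DB2 9.x, which collects dynamic SQL monitoring elements for the specific purpose of spotting heavy statements):

    -- Statements with the highest average execution time
    SELECT num_executions,
           average_execution_time_s,
           stmt_sorts,
           SUBSTR(stmt_text, 1, 80) AS stmt_text
      FROM SYSIBMADM.TOP_DYNAMIC_SQL
     ORDER BY average_execution_time_s DESC
     FETCH FIRST 10 ROWS ONLY;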
7. KPIs
Overview
In this section we look at the details of KPIs and related topics that must be taken into account to make full use of the information KPIs offer:
Introduction: on thresholds
Performance indicators: absolute values, changes, and trends
How performance indicators can be used
Interpretation of deviation in performance indicators
Change and constancy
Deviation: What is normal, what is not
When things are different: temporary changes vs. permanent deviations or changes
Triage: Before you analyze, how to decide what to analyze and what not
KPIs: details
The most important areas of performance problems
The details provided on the individual problems
7. KPIs
Introduction
The thresholds we work with are no hard and fast rules, cast in stone and immutable, because
they have been established over time through analyzing a large number of systems: when noticing a problem we looked at the system's values for its KPIs; when not, we noted the values as well;
given that systems change in their hardware, functionality, requirements etc., these values are, and will keep, evolving;
given that we wanted them to be useful most of the time, we discarded outliers; conversely, they in all likelihood cannot be applied to highly specialized systems 1;
as a consequence, they work most of the time with standard, middle-of-the-road, vanilla-flavored systems, which means they will never be a perfect fit for any system, but a reasonable fit for most, if not almost all 2;
hence, the more a system is specialized, the more one needs to expect and accept deviations between the system's values and the recommended KPI threshold values
they are intended as a starting point against which system parameters can be compared to catch problems, mostly when there are not yet any other (historical) values from the system itself. After some time of operation, the system's values should rather be compared to its own history.
1: Not necessarily as far as the application is concerned but as far as performance-influencing features like widely and abruptly varying workloads or exotic hardware are concerned
2: According to 'Life can be described with a Gaussian profile' :)
Isolated values such as the initial values of a system: the values right after the system has e.g. been installed or set live (I/O read and write time, number of users, etc.), i.e. when no historical data exists that could be used for comparisons 3
Ideally, they can be put into perspective by looking at the factors that caused them to change
but often these factors are not all known, or even if they are, their effects are not fully understood
especially as these factors can be many, complex, and interacting with each other
and it is almost impossible to isolate even just the important ones in an active production system
while things tend to be a bit easier in a test environment where you have full control, especially over what is going on at a given time (preferably nothing but the test you are running), hence the value of a representative test system to reproduce and then analyze in
#: Slides 15 to 18 were not shown in the presentation but only referred to; they are included here as promised to provide further information on KPIs
1: Which can even hold for LPARs using virtualization, e.g. the maximum CPU entitlement or the CPU cap (if an LPAR is capped regarding CPU)
2: Fixed, that is, until they are changed by adding e.g. RAM
3: Which is why thresholds were created as crutches; consequently, once the system has been running for some time, current values should rather be compared to history data than to one-size-fits-all general threshold values
4: These thresholds have been established over time through past analyses of systems and have thus proven by now to be useful in an (often rather extensive) number of contexts, which makes them suitable as general values against which to measure systems. (First and foremost in the absence of anything better!) However, systems are vastly different, which means that one-size-fits-all, middle-of-the-road values are usually helpful but not the best. Often they would turn out to be significantly off the mark (if the system and its performance were to be thoroughly analyzed), either too ambitious (because the present system, perfectly tuned, would still fall short of a threshold, e.g. because of innate limitations such as available hardware resources or workload) or by far not ambitious enough (for the same reasons). The solution for this is, after a system has run stably for a while, to carry out an in-depth analysis of its individual performance aspects and their influences to detect and realize further performance potential.
5: Initially, comparing system values with thresholds is a good idea: first because there is often little else to compare with (unless there is already a similar, comparable installation whose values can be used), but also to see how far away from established values a system is; second because a deviation, especially an important one, has a good chance of having a significant problem as its cause, such as system setup, parameterization, possibly even sizing. However, once a system has been up and running under representative conditions (business-as-usual mode), comparisons against the system's own historical data become more and more important. Differences and changes against its own earlier data, new trends, or changes in existing trends become much more important signals to be heeded than comparisons with standard thresholds
Neither to compare Wednesday with Friday (if weekend closing runs on Fridays);
An online backup that exceptionally runs 2 hours later than usual can invalidate a whole number of findings
Problem: find out what is too much (deviation for absolute values) or disproportionate (changes in absolute values and trends), given influences (e.g. hardware characteristics and parameterization) and changes in influences (e.g. database size, table growth, number of users, amount of throughput, parallel online backup)
Problem: compared to the kind and amount of change that occurred (a 20% increase in the number of rows should not cause a 100% increase in the runtime of an SQL statement)
1: An example for an exception is a scenario that scales logarithmically, such as table growth where the table is accessed through a fully selective index
7. KPIs
Triage: Before you analyze
How to decide what to analyze and what not?
The urgency with which something must be looked at is mostly a function of:
Pragmatics: once-only occurrences can probably be ignored (unless the system was in serious trouble)
Effort: the more effort is involved, the higher a problem's impact on the system has to be to warrant an analysis 1
7. KPIs: Details
Overview
This section contains:
The most important areas of performance problems:
High response times
Resource bottlenecks (CPU, RAM, I/O)
Buffer pool problems
Locking problems
Sorting problems
Expensive SQL statements
7. KPIs: Details
Details provided on problems
For each of the most important problems that can affect performance, the following information will be provided:
Symptoms: For each family of problems, what are its most important symptoms?
Problems: What, for each family, are the problems that occur when things go cactus?
KPIs: Which KPI(s) indicate the problem?
KPI source & monitoring elements involved: Where can the KPIs be found and what database monitoring elements are involved in a KPI?
Thresholds: What threshold values have proven to be the most useful?
Analysis: What, if any, further steps to analyze the problem need to be taken?
Ways to confirm or refute the current hypothesis as to what the cause(s) might be
Ways to dig deeper into the problem
Remedies and comments:
What corrective measures exist to solve the performance problem?
Further things worth mentioning
Symptoms:
System (as a whole) is running slowly 0
Slow execution of report / transaction
As a consequence, time is lost, mostly by waiting for the response, system-wide or for individual applications
Problems (all from transaction ST03) 1:
1. High overall response time (time-consuming application → verify application requirements, functions, realization)
2. High CPU time
3. High database time
KPI(s):
1. Total response time
2. Total time spent on CPU, percentage of total response time spent on CPU
3. Total time spent on the database, percentage of total response time spent on the database
KPI source & monitoring elements involved:
1. ST03: Total response time
2. ST03: CPU response time
3. ST03: Database response time
Notes on KPIs:
Calculation: average of STAD entries, aggregated as required (by server, period, transaction, etc.)
Thresholds:
Overall response time: 1000 ms
CPU response time: 400 ms or 40% of total response time
Database response time: 400 ms or 40% of total response time
Analysis:
(KPI #3)
Check for expensive SQL statements in the database's KPIs and its SQL cache (see the sketch below)
(KPIs #1 to #3)
Narrow down the problem: compare total times for the system with individual servers, time frames, and transactions, looking to successively narrow down the problem (e.g. to (1) the database server; (2) yesterday between 2 am and 4 am; (3) transaction Z_ESF2009)
2: 'lost' here means time spent but nothing to show for it, caused by:
(a) waiting for an available CPU (→ CPU bottleneck);
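For the SQL cache check under KPI #3, a hedged sketch (assuming the SYSIBMADM.SNAPDYN_SQL administrative view of DB2 9.x; the share-of-total computation is illustrative only):

    -- Statements consuming the largest share of execution time in the SQL cache
    SELECT SUBSTR(stmt_text, 1, 80) AS stmt_text,
           num_executions,
           total_exec_time,          -- elapsed execution time (seconds part)
           100.0 * total_exec_time
                 / NULLIF(SUM(total_exec_time) OVER (), 0) AS pct_of_total
      FROM SYSIBMADM.SNAPDYN_SQL
     ORDER BY total_exec_time DESC
     FETCH FIRST 10 ROWS ONLY;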
6: I/O wait time is considered idle time, as the CPU during I/O wait time has nothing else to do but wait for I/O requests
7: 80% is a (near) real-time value; the traditional, established average is recommended to be as low as 20%; however, caution should be exercised with this value as it only provides relative safety against peaks where CPU utilization hits 100%; to verify, system CPU percentage or run queue + swap-in will be of no help as they, too, will equally have any peaks averaged out over time
8: Which could also explain why you have a CPU bottleneck, that is, when sizing has been carried out for an average workload, not taking into account that some slices of the workload are particularly CPU-bound
Spikes are more difficult to avoid, as they may require a high amount of RAM which will only be used for a short time per hour, or at an even lower frequency
Paging that is rather equally distributed (paging more or less continuously) is easier to solve: as most of the time such paging occurs only as a trickle, a small amount of additional RAM should solve the problem
Symptoms:
Database performance is lacking
Buffer pool quality is low: overall hit ratio, data and/or index hit ratio
Problems and causes:
1. Buffer pool size too small
2. Table scans
3. Full index scans or large index range scans
KPI(s):
1. Buffer pool size to total database size
2. Table scans
  1. Data logical reads over index logical reads
  2. Buffer pool data hit ratio minus index hit ratio
3. No reliable KPI available at the buffer pool level
KPI source & monitoring elements involved:
1. Buffer pool size to total database size:
  1. Buffer pool size: Transaction ST04 → Performance → Database (tab strip) → Buffer pool: Total size
  2. Database size: Transaction ST04 → Space → History → Database and Table spaces
2. Transaction ST04 → Performance → Database (tab strip) → Buffer pool:
  1. Data logical reads, index logical reads
  2. Buffer pool quality; data hit ratio, index hit ratio
3. No reliable KPI available at the buffer pool level
Thresholds:
1. Buffer pool size should be ~ 1% to 3% of the total database size
2. Table scans
  1. Data logical reads / index logical reads = 1 : 1 (better 1 : 1.5)
  2. Data buffer pool quality - index buffer pool quality = -3% (-2.5% often is enough)
3. No threshold available at the buffer pool level 1
Analysis:
2. Check SQL cache for table scans
3. Check SQL cache for full index scans and large index scans
Also check for sort problems, lock problems, and expensive SQL statements
Remedies and comments:
1. Increase buffer pool size (which you should, however, only do after having looked for table scans and full index and index range scans)
Sub-optimal database performance can also be related to sort problems, lock problems, and expensive SQL statements, which may, but not necessarily, reflect in the buffer pool figures, so make sure to look into them in any case
1: While theoretically the KPI for full index scans and expensive index scans is the inverse of table scans, this (almost) never occurs in real life. (OK, in the unlikely event of your buffer pool data hit ratio being 2.5% or more better than the index hit ratio, or the ratio of data logical reads over index logical reads being 0.75 or less, you know where to look. Let me know when this happens, I would love to have a look at that system.)
In real life you'll find full index scans or large index range scans by looking at the number of index rows read, per execution and in absolute numbers, in the buffer pool.
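A sketch of checking thresholds 2.1 and 2.2 above on a running system (again assuming the SYSIBMADM.SNAPBP administrative view):

    -- 2.1: ratio of data logical reads to index logical reads (suspicious if > 1)
    -- 2.2: data hit ratio minus index hit ratio (suspicious if below -3%)
    SELECT bp_name,
           1.0 * pool_data_l_reads
               / NULLIF(pool_index_l_reads, 0) AS data_to_index_reads,
           (100 - 100.0 * pool_data_p_reads  / NULLIF(pool_data_l_reads, 0))
         - (100 - 100.0 * pool_index_p_reads / NULLIF(pool_index_l_reads, 0))
                                               AS data_minus_index_quality_pct
      FROM SYSIBMADM.SNAPBP;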
Symptoms:
Performance of one or several applications leaves something to be desired
Performance rapidly degrades with increasing workload
Applications take an extremely long time to complete
Applications are terminated, a short dump occurs
Problems and causes:
1. Lock list is too small, causing lock and exclusive lock escalations, which in turn can cause deadlocks
2. High time waited for a lock to be released so that the current application may acquire it: data is locked by a badly performing application (holding locks for too long a time, i.e. not releasing them as soon as possible, or a sub-optimal application logic requiring locks to be held for an extended time), resulting in extended lock wait times; this is often exacerbated in the case of a highly parallelized, badly performing application where a large number of work processes run in parallel
3. (Exclusive) lock escalations occur: an application uses either the maximum of lock space allowed for a single application or the lock list runs full
4. Deadlock: bad application logic (usually the case) or simply bad luck; a deadlock is detected by the deadlock monitor, whose default is to check for deadlocks every 300 seconds, i.e. 5 minutes
5. Lock timeout: if an application holds locks longer than a maximum amount of time (default is 3600 seconds, i.e. 1 hour), the application is terminated and the lock released
KPI(s) (all in transaction ST04 → Performance → Database (tab strip) → Locks and Deadlocks):
1. Lock list
2. Lock waits - Average time waited
3. Lock escalations & Exclusive lock escalations
4. Locks - Deadlocks detected
5. Locks - Lock timeouts
A system will quite naturally see a high number of them in the course of time, most of them quite benign, i.e. fast.
The problems start when several applications vie for the same object (row or table)
1. Lock list too small:
Should right itself with a bit of time (< 2 hours, often less) when using STMM.
If you do not use STMM, increase the size of the lock list.
If you use STMM and your workload has sudden, occasional, high requirements in lock list size, you can exclude the lock list from STMM and manually find out which value works (increase by 10% to 50% per iteration, depending on the numbers of KPIs #2 to #5); see the sketch below.
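A sketch of the corresponding checks and the manual override (SQL through ADMIN_CMD; the LOCKLIST value of 8192 4-KB pages is purely a hypothetical starting point):

    -- Current lock list configuration; VALUE_FLAGS = AUTOMATIC means STMM manages it
    SELECT name, value, value_flags
      FROM SYSIBMADM.DBCFG
     WHERE name IN ('locklist', 'maxlocks');

    -- Setting a fixed value takes the lock list out of STMM;
    -- iterate in 10% to 50% steps as described above
    CALL SYSPROC.ADMIN_CMD('UPDATE DB CFG USING LOCKLIST 8192');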
2. (lock wait)
Carry out a lock wait analysis using db6util -ls (lock situation)
Collect regular snapshots on applications, looking for their number of lock waits and time waited on locks
Once the application is identified, take an ST12 trace (combined ABAP / SQL trace) looking for expensive ABAP statements (which may cause the lock wait situation) and long-running SQL statements, especially those that run a surprisingly long time for a good access plan (which may be the statements suffering from lock wait situations); in particular, if the application at different times (and with an already warmed-up buffer pool) has significantly different runtimes, the ABAP and SQL statements that show largely differing times are likely to be involved in lock wait situations
3. ((exclusive) lock escalation)
Take regular lock snapshots and application snapshots to see when and where lock escalations occur
Collect LOCKS_HELD (administrative view) information and filter for lock_object_type = table lock, lock_mode = IX (intent exclusive), SIX (share with intention exclusive), X (exclusive lock), or Z (super exclusive lock), and lock_escalation (small integer); see the sketch after this list
4. (deadlock)
Run the application, monitoring it in SM50 or SM66; once it is stuck for more than a few seconds go look up
its report and main action (highlight the work process and select Choose or Details) and look into the
corresponding application using its PID in ST04 Performance Application for further details
5. (timeout)
Collect regular snapshot information on applications, looking for their number of lock timeouts, to identify which applications suffer
Collect regular SNAPSHOT_LOCKWAIT information (table function), filtering for the application identified above
Carry out a lock wait analysis using db6util -ls (lock situation) to catch the problem when it occurs
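For step 3, a sketch of the filter just described (assuming the SYSIBMADM.LOCKS_HELD administrative view of DB2 9.x; verify the exact object type and mode encodings on your release):

    -- Table locks in exclusive-leaning modes, including the escalation flag
    SELECT agent_id, appl_name, tabschema, tabname,
           lock_object_type, lock_mode, lock_escalation
      FROM SYSIBMADM.LOCKS_HELD
     WHERE lock_object_type = 'TABLE_LOCK'
       AND lock_mode IN ('IX', 'SIX', 'X', 'Z');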
Symptoms:
Some applications or SQL statements take a longer time than expected
Problems and causes:
1. The sort heap is not large enough, hence the sort flows over
2. The result set to be sorted is not in the order required and flows over
KPI(s):
1. Sort heap & sort overflows
  1. Average sort time
  2. Percentage of sort overflows
2. SQL statement
  1. Explain plan
  2. SELECT statement with number of rows written <> 0 and high sort time
KPI source & monitoring elements involved:
1. Sort heap & sort overflows
  1. DB, application, dynamic SQL snapshot: total_sort_time / total_sorts
  2. DB, application, dynamic SQL snapshot: sort_overflows / total_sorts
2. SQL statement
  1. No KPI, interpretation of the SQL access plan
  2. SQL Cache: SELECT with a number of rows written <> 0 1; SQL Cache: sort time = considerable 2
1: A SELECT should have 0 rows written; sorts and joins are carried out in the sort heap; as long as the sort is carried out in the sort heap, the number of rows written stays at 0; only if the sort spills over onto disk does the counter of rows written start
2: The meaning of 'considerable': Not every sort overflow is worth tuning; one that spills and takes 10 ms will not help you gain much time, whether you avoid the sort (you gain 10 ms) or at least the spill (you'll gain around 9 or 9.5 ms); only small overflows that occur _very_ often or massive overflows that come with considerable sort time are worth looking into
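A hedged sketch of that SQL cache check (assuming SYSIBMADM.SNAPDYN_SQL; the sort-time floor of 1000 ms standing in for 'considerable' is hypothetical):

    -- SELECT statements that wrote rows, i.e. whose sorts or joins spilled to disk
    SELECT num_executions,
           rows_written,
           stmt_sorts,
           sort_overflows,
           total_sort_time,           -- in milliseconds
           SUBSTR(stmt_text, 1, 80) AS stmt_text
      FROM SYSIBMADM.SNAPDYN_SQL
     WHERE (stmt_text LIKE 'SELECT%' OR stmt_text LIKE 'select%')
       AND rows_written > 0
       AND total_sort_time > 1000
     ORDER BY total_sort_time DESC;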
Thresholds:
1. Sort heap & sort overflows
  1. Average sort time: > 1 ms (good systems, mostly depending on the size of the sorts required by the applications, are often below 0.1 ms)
  2. % of sort overflows: < 1% (good systems often have < 1/1000) 1
2. SQL statements
  1. When the access plan looks too expensive
  2. SQL Cache: cf. footnote #3
Remedies and comments:
Like locks, sorts per se are nothing bad
A problem occurs when the sort area is not large enough and the sort spills over into temporary table space and onto disk
Analysis:
Usually, no further analysis is required at this point
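A sketch of computing both sort KPIs at the database level (assuming SYSIBMADM.SNAPDB; total_sort_time is reported in milliseconds):

    -- 1.1: average sort time in ms; 1.2: percentage of sorts that overflowed
    SELECT 1.0   * total_sort_time / NULLIF(total_sorts, 0) AS avg_sort_time_ms,
           100.0 * sort_overflows  / NULLIF(total_sorts, 0) AS sort_overflow_pct
      FROM SYSIBMADM.SNAPDB;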
Missing index
Sub-optimal access plan determined by the optimizer because of limited resources available (heap sizes, dbm & db cfg are used by the optimizer while calculating the best access plan; 'best' means: by taking into account available resources)
Expensive ABAP coding (or other programming language), e.g. a SELECT * without WHERE clause, i.e. copying the table into an internal table, with a subsequent program-internal filter on 1 column and 1 line
Expensive ABAP coding (or other programming language), e.g. a SELECT that will return a significant number of rows of a large table, as is typical with mass updates of interest rates in the core banking sector
Definitions of what an expensive statement is: the most current definitions for expensive SQL statements (in DB2):
Long-running SQL statement
Statement that uses up a high percentage of the total time spent in the database
High percentage of all rows read
High number of rows read per execution
High percentage of all rows written when the statement is a SELECT statement: this indicates a sort overflow or a join that overflowed
High number of rows written per execution with a SELECT statement
Expensive SQL statements (how to identify, analyze, and tune them) are an extensive field in their own right and will not be addressed in detail here but in (at least) 2 separate expert sessions (currently slated for Q1/2010)
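To make one of the definitions above concrete, a sketch that ranks candidates by rows read per execution (again assuming SYSIBMADM.SNAPDYN_SQL):

    -- Candidate expensive statements: high number of rows read per execution
    SELECT num_executions,
           rows_read,
           rows_read / NULLIF(num_executions, 0) AS rows_read_per_exec,
           SUBSTR(stmt_text, 1, 80)              AS stmt_text
      FROM SYSIBMADM.SNAPDYN_SQL
     WHERE num_executions > 0
     ORDER BY rows_read_per_exec DESC
     FETCH FIRST 10 ROWS ONLY;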