Database Survival Guide 2
If your test fails and you see an error in alert.log, the problem might be caused by a bug in the engine itself. Here is a way to dig deeper into such a problem.
5.1 Go deep with an Oracle error
Here is an example taken from SF ticket 174496. The original situation is an ASM disk group on a 10gR2 engine. An upgrade to 11.2.0.4.4 was done,
but after the upgrade, when trying to include the existing disk group in the configuration, the message below was printed :
5.1.1 Sample error
SQL> ALTER DISKGROUP ALL MOUNT
NOTE: Diskgroups listed in ASM_DISKGROUPS are data1
ORA-15032: not all alterations performed
ORA-15036: disk '/dev/zvol/rdsk/array/basempbparis' is truncated
ERROR: ALTER DISKGROUP ALL MOUNT
5.1.2 Oerr command
The first step is to look up each error with the oerr tool :
oracle11@archi5> oerr ora 15032
15032, 00000, "not all alterations performed"
// *Cause: At least one ALTER DISKGROUP action failed.
// *Action: Check the other messages issued along with this summary error.
oracle11@archi5> oerr ora 15036
15036, 00000, "disk '%s' is truncated"
// *Cause: The size of the disk, as reported by the operating system,
//         was smaller than the size of the disk as recorded in the
//         disk header block on the disk.
// *Action: Check if the system configuration has changed.
This tool explains the error in detail, but it does not give the solution. I have not changed my disk group, so why would it have been truncated?
5.1.3 Google
Google might give a solution, and here it does. Somebody faced the same problem before and was kind enough to explain it in his blog :
https://oraclehandson.wordpress.com/2010/04/13/changing-the-disk-size-in-asm/. Unfortunately, it does not always work out like that!
5.1.4 My Oracle support site
In my opinion, an Oracle support login is mandatory for you. If I connect to the site and enter the following string in the search engine :
ORA-15036: disk is truncated
I immediately get two knowledge base articles that explain in detail how to fix the issue. I trust this solution because it is proposed
by Oracle engineers. The two knowledge base articles are :
- ASM disk group mount fails with ORA-15036: disk <name> is truncated (Doc ID 1077175).
- How To Obtain The Raw Device Size in Solaris To Diagnostic & Fix The ORA-15036 ("Disk <Disk Name> I Is Truncated") In ASM 10gR2 (10.2).
(Doc ID 1486324.1)
The root cause of this issue is that the new Oracle release does not compute the size of an ASM device in the same manner as the old one.
5.1.5 Adrci again
If your problem put the engine in a faulty state, it will create an incident and a problem. There is a fast way to know whether your tests produced
incidents or problems. To illustrate that, I have artificially created an incident (as oracle user at the unix prompt) :
adrci
adrci> show incident
ADR Home = /ldata/oracle/diag/rdbms/thunder/thunder:
*************************************************************************
INCIDENT_ID PROBLEM_KEY CREATE_TIME
128194 ORA 700 [kgerev1] 2016-03-04 11:32:11.864000 +01:00
adrci> show problem
ADR Home = /ldata/oracle/diag/rdbms/thunder/thunder:
************************************************************************
PROBLEM_ID PROBLEM_KEY LAST_INCIDENT LASTINC_TIME
3 ORA 700 [kgerev1] 128194 2016-03-04 11:32:11.864000 +01:00
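To check for incidents from a script rather than interactively, adrci can be given its command on the command line with exec=. The sketch below is only an illustration: count_incidents is a hypothetical helper, and it assumes that incident rows start with a numeric INCIDENT_ID and that the "N rows fetched" summary line adrci usually prints should be skipped. The demo input is the sample output shown above, so the block runs without a database.

```shell
#!/bin/sh
# Count incident rows in the text printed by "show incident".
# Assumption: an incident row has a purely numeric first field (INCIDENT_ID)
# followed by more fields; the "N rows fetched" summary line is skipped.
count_incidents() {
    awk '$1 ~ /^[0-9]+$/ && NF > 1 && !/rows fetched/ { n++ } END { print n + 0 }'
}

# On a database host (not run here), the input would come from adrci itself:
#   adrci exec="show incident" | count_incidents

# Demo on the sample output of this section.
sample='ADR Home = /ldata/oracle/diag/rdbms/thunder/thunder:
*************************************************************************
INCIDENT_ID PROBLEM_KEY CREATE_TIME
128194 ORA 700 [kgerev1] 2016-03-04 11:32:11.864000 +01:00
1 rows fetched'
printf '%s\n' "$sample" | count_incidents
```

A result of 0 after a test run suggests the engine did not record any incident during that run.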
5.2 Another example
During an installation, I faced a problem with the listener installation. I got :
PRCT-1011 : Failed to run "srvctl". Detailed error: #@=result[0]: version={12.1.0.2.0}
This error message does not say much. Oerr gives no clue, and Google is vague. But My Oracle Support gave me the solution :
I realized I had to restart the installation because I had not been careful enough with the warnings about missing packages. The missing package was SUNWeu8os.
5.3 Execution plan of a slow request.
Sometimes the request completes correctly but lasts for many seconds. You have to wonder whether it can be sped up. At the database level, the
tool that gives the answer is the explain plan command.
5.3.1 A simple case: influence of an index
I illustrate that with a simple case. I have a huge table with an index on it. I will run the same query in two situations :
- The index is usable
- The index has been disabled
My table is source5. It has an index named LINE_IDX on column line.
The request is : select count(*) from source5 where line=1;
5.3.2 With sqlplus
Here the index is usable. The important figures are the cost and the time :
SQL> explain plan for select count(*) from source5 where line=1;
Explained.
SET LINESIZE 130
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------
Plan hash value: 3418924228
------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 133 (1)| 00:00:02
| 1 | SORT AGGREGATE | | 1 | 13 | |
|* 2 | INDEX RANGE SCAN|LINE_IDX|62256 | 790K| 133 (1)| 00:00:02
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
When I disable the index, the estimated query duration jumps from 2 seconds to almost 3 minutes!
alter index line_idx unusable;
To get the plan :
explain plan for select count(*) from source5 where line=1;
Explained.
Then, to display the plan :
SET LINESIZE 130
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
Plan hash value: 2301655728
------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time
------------------------------------------------------------------------
| 0 |SELECT STATEMENT | | 1 | 13 | 14645 (1)| 00:02:56
| 1 | SORT AGGREGATE | | 1 | 13 | |
|* 2 |TABLE ACCESS FULL | SOURCE5 | 62256 | 790K | 14645 (1)| 00:02:56
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
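Since the same three statements are typed for every plan, they can be generated by a small shell helper and piped into sqlplus. This is only a sketch: explain_script is a hypothetical name, and the esg/esg account in the commented usage line is the test account that appears in other transcripts of this guide.

```shell
#!/bin/sh
# Print a SQL*Plus script that explains one statement and displays its plan.
# The statement is passed as the first argument, without a trailing semicolon.
explain_script() {
    printf 'SET LINESIZE 130\n'
    printf 'EXPLAIN PLAN FOR %s;\n' "$1"
    printf 'SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);\n'
}

# On a database host (not run here):
#   explain_script "select count(*) from source5 where line=1" | sqlplus -s esg/esg

# Show the generated script for the query of this section.
explain_script "select count(*) from source5 where line=1"
```

Running it once with the index usable and once with the index unusable gives the two plans compared above. The cost goes from 133 to 14645, roughly a factor of 110.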
5.3.3 With database control
From the performance page, I click on a particular SQL request and choose to display its plan. Then I choose the table format, and I get the display
below :
5.3.4 With enterprise manager database express
See video below : https://drive.google.com/a/eservglobal.com/file/d/0B6Rt2fV3N-dcOXo3aWFWRmpDVWM/view?usp=sharing
5.4 Get the DML being run by java
Investigations into problems with composites often get blocked because it is hard to get the command actually issued by the Java program.
Here is an attempt to extract what is really done in the database when a composite is fired. I will take the example of an actor creation.
5.4.1 Setup
In the GUI of CC_ADMIN, I chose to create actor batch2.
(as oracle user connected as sysdba to the database)
alter system set timed_statistics = true;
alter system set statistics_level=all;
alter system set sql_trace=true;
(clean up trace files as oracle user at the unix prompt)
oracle11@zonebase> pwd
/ldata/oracle/diag/rdbms/thunder/thunder/trace
oracle11@zonebase> rm -rf *
Now perform the query you want to analyze.
5.4.2 Gather results
First copy all trace files to another location :
mkdir /ldata/tracebatch2
cp * /ldata/tracebatch2
Then run tkprof as oracle user at the unix prompt :
for i in *.trc ; do tkprof "$i" "$i.txt" ; done
A set of text files is produced.
5.5 Oracle memory in Solaris
As a general rule, the SGA is the shared memory area that holds the buffer cache and the computed execution plans. The PGA is the area that gives memory
to the server processes interacting with the database. Since we are OLTP (Online Transaction Processing), the advised ratio is 60% SGA, 40% PGA.
We had many problems with automatic memory management, which is why we have disabled it. In short, suppose you have a server with
16 GB of memory. You will give Oracle 75% of it, that is 12 GB.
So the SGA will be 0.6 * 12 = 7.2 GB (about 7 GB) and the PGA will be 0.4 * 12 = 4.8 GB (about 5 GB).
This is a maximum, and on testbeds you will also get good results with lower configurations. This configuration gives better results on site because,
after some hours of activity, the buffer cache of the database is filled with the thousands of blocks that are most commonly used, and there is very
little disk read activity. The metric is the buffer cache hit ratio. It is around 99.5%.
On a testbed, since the tables involved have few rows, the figures below should be enough (minimum) :
SGA = 2 GB RAM, PGA = 1 GB RAM.
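The 75% / 60% / 40% rule of thumb above can be written down as a small calculation. The sketch below works in MB with integer arithmetic, so results are rounded down. The function name oracle_memory_split is only an illustration.

```shell
#!/bin/sh
# Split server RAM (in MB) into Oracle, SGA and PGA targets using the
# 75% / 60% / 40% rule of thumb from the text. Integer arithmetic rounds down.
oracle_memory_split() {
    total_mb=$1
    oracle_mb=$(( total_mb * 75 / 100 ))
    sga_mb=$(( oracle_mb * 60 / 100 ))
    pga_mb=$(( oracle_mb * 40 / 100 ))
    echo "oracle=${oracle_mb}M sga=${sga_mb}M pga=${pga_mb}M"
}

oracle_memory_split 16384   # 16 GB server, prints: oracle=12288M sga=7372M pga=4915M
```

The values are upper bounds for a live site. On a testbed the smaller minimum figures given above remain a reasonable starting point.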
I will explain below how to change the memory settings. There is a less obvious operating system setting that has to be managed for the operation to succeed: Solaris has
an upper limit on the shared memory allocated to a user (in our case oracle). This limit is stored in the /etc/project file. It can also be checked with the
command prctl $$. In the example below, the limit is set to around 2 GB. I will increase it to 5 GB to avoid problems at startup of the instance.
5.5.1 Increasing sga and pga
SQL> show parameter pga_aggregate_target;
pga_aggregate_target big integer 100M
SQL> show parameter sga;
sga_max_size big integer 800M
sga_target big integer 800M
SQL> alter system set sga_max_size=2G scope=spfile;
SQL> alter system set sga_target=2G scope=spfile;
SQL> alter system set pga_aggregate_target=1G scope=spfile;
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> exit
oracle11@archi5> prctl $$
process: 23046: -ksh
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
process.max-port-events
privileged 65.5K - deny -
system 2.15G max deny -
process.max-msg-messages
-memory
system 16.0EB max deny -
project.max-port-ids
privileged 8.19K - deny -
system 65.5K max deny -
project.max-shm-memory
privileged 1.86GB - deny -
system 16.0EB max deny -
project.max-shm-ids
oracle11@archi5> exit
root11@archi5# diff /etc/project /etc/project.new
< user.oracle:100::::process.max-sem-nsems=(priv,256,deny);project.max-sem-ids=(priv,100,deny);project.max-shm-ids=(priv,100,deny);project.max-shm-memory=(priv,2000000000,deny)
> user.oracle:100::::process.max-sem-nsems=(priv,256,deny);project.max-sem-ids=(priv,100,deny);project.max-shm-ids=(priv,100,deny);project.max-shm-memory=(priv,5000000000,deny)
root11@archi5# cp /etc/project.new /etc/project
root11@archi5# su - oracle
oracle11@archi5> prctl $$
process: 24261: -ksh
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
process.max-port-events
privileged 65.5K - deny -
system 2.15G max deny -
project.max-shm-memory
privileged 4.66GB - deny -
system 16.0EB max deny -
SQL> startup
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 2137886720 bytes
Fixed Size 2252576 bytes
Variable Size 687866080 bytes
Database Buffers 1442840576 bytes
Redo Buffers 4927488 bytes
Database mounted.
Database opened.
SQL> show parameter pga_aggregate_target;
pga_aggregate_target big integer 1G
SQL> show parameter sga;
sga_max_size big integer 2G
sga_target big integer 2G
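The reason for raising project.max-shm-memory before the restart can be checked with plain arithmetic: an sga_max_size of 2G is 2147483648 bytes, which is more than the original limit of 2000000000 bytes. The sketch below makes that comparison. sga_fits_shm_limit is a hypothetical helper, and it only compares the nominal SGA size with the limit, which is a simplification.

```shell
#!/bin/sh
# Return success when an SGA of the given size (bytes) fits under the
# project.max-shm-memory limit (bytes). Simplification: only the nominal
# SGA size is compared with the limit.
sga_fits_shm_limit() {
    [ "$1" -le "$2" ]
}

sga_bytes=$(( 2 * 1024 * 1024 * 1024 ))   # sga_max_size=2G

# Compare against the old and the new limit from /etc/project.
for limit in 2000000000 5000000000; do
    if sga_fits_shm_limit "$sga_bytes" "$limit"; then
        echo "limit $limit: ok"
    else
        echo "limit $limit: too small for a $sga_bytes byte SGA"
    fi
done
```

With the old limit the check fails, and with the new limit it passes, which matches the successful startup shown in the transcript.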
5.5.2 View buffer occupation
Here is how to know if your database buffer cache is really occupied by blocks :
COLUMN OBJECT_NAME FORMAT A40
COLUMN NUMBER_OF_BLOCKS FORMAT 999,999,999,999
SELECT o.OBJECT_NAME, COUNT(*) NUMBER_OF_BLOCKS FROM DBA_OBJECTS o, V$BH bh
WHERE o.DATA_OBJECT_ID = bh.OBJD AND o.OWNER != 'SYS' GROUP BY o.OBJECT_NAME ORDER BY COUNT(*);
SOURCE4 46,080
SOURCE3 79,415
SOURCE2 125,640
SOURCE1 316,894
We can compare the duration of a query when the whole table is in the buffer cache and when nothing is in the buffer cache.
Here are two runs of the same command. The difference is that in the second run the rows have already been placed into the buffer cache.
oracle11@archi5> time echo "update source1 set line=line+1;" | sqlplus esg/esg
34167722 rows updated.
real 17m36.62s
I run it again now that the blocks are in memory :
oracle11@archi5> time echo "update source1 set line=line+1;" | sqlplus esg/esg
34167722 rows updated.
real 10m55.29s
The duration is 62% of the initial one! The conclusion: never stop a database!
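The 62% figure can be verified from the two "real" durations. The sketch below converts them to seconds and prints the ratio. to_seconds is a hypothetical helper that assumes the XmY.Zs format printed in the transcripts above.

```shell
#!/bin/sh
# Convert a duration in "XmY.Zs" form (as in the transcripts above) to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.2f\n", $1 * 60 + $2 }'
}

cold=$(to_seconds 17m36.62s)   # first run, empty buffer cache
warm=$(to_seconds 10m55.29s)   # second run, blocks already cached

# Ratio of the warm run to the cold run, as a percentage.
awk -v c="$cold" -v w="$warm" 'BEGIN { printf "warm/cold = %.0f%%\n", w / c * 100 }'
```

That is 655.29 seconds against 1056.62 seconds, so the warm run saves almost seven minutes.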
5.6 What to do if :
5.6.1 My database is stuck
5.6.1.1 Reason 1: the server ran out of memory
I run the top command or, as user lsav, /opt/ESG/bin/CKS_Top. I also open Database Control.
Here is a display where the problem is present. In that case, it is useless to look for an error in a test: the error is the system itself!
5.6.1.2 Reason 2 : another zone is keeping the server busy
Since the testbeds are multi-zone, the zones can slow each other down. For example, I have launched many provisioning scripts on a database, there is a
JBoss in the global zone, and a WebLogic server is starting in a third zone. In that situation, the fourth zone will probably be slow to respond.
The right command to show that, issued in the global zone, is : root# prstat -Z
It is shown below. You can have the same behaviour on a VM if your antivirus, music and a video are running at the same time as your test.
5.6.1.3 Reason 3 : a session is causing huge I/O
This can be seen with the following command in the global zone : root# iostat -xndz 4
If the busy column (%b) is near 100% and you see that huge amounts of data are being read or written, you will have trouble with individual tests.
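To avoid reading the busy column by eye, the iostat output can be filtered with awk. The sketch below assumes the usual column layout of iostat -xn on Solaris, where %b is the 10th column and the device name the 11th. busy_devices is a hypothetical helper, and the sample lines are made up for the demo.

```shell
#!/bin/sh
# Print "device busy%" for devices whose %b value reaches a threshold.
# Assumption: input uses the "iostat -xn" layout, with %b in column 10
# and the device name in column 11. Header lines are skipped because
# their 10th field is not numeric.
busy_devices() {
    awk -v limit="$1" 'NF == 11 && $10 ~ /^[0-9]+$/ && $10 + 0 >= limit { print $11, $10 }'
}

# On a live system (not run here):
#   iostat -xndz 4 | busy_devices 90

# Demo on made-up sample lines in that layout.
sample='    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.0    2.0    8.0   16.0  0.0  0.0    0.0    1.2   0   1 c0t0d0
  250.0  900.0 32000.0 64000.0  0.0  9.8    0.0    8.5   0  98 c0t1d0'
printf '%s\n' "$sample" | busy_devices 90
```

A device that keeps showing up across several 4-second samples is a good candidate for the bottleneck.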
To go further with system monitoring, it is helpful to install the FRMAisar package. Here is its user guide :
https://esurf.eservglobal.com/dbfm_send/26744
The basic rule is that your system has one bottleneck, and only one. It can be the CPU, the memory, the disk or the network. It is useless to add 4 new
CPUs if your disk is the bottleneck. The FRMAisar package will help you find your bottleneck.
5.6.1.4 The syndrome of the long transaction after restart
The situation is the following. Somebody launched, for example, a DML statement such as :
Update voucher set voucherid=voucherid+1;
The voucher table has 35 million rows.
This transaction can last for 30 minutes.
But after 20 minutes, losing patience, the user decided to bounce the database. He will be punished twice, because the transaction will be
rolled back and he will have to wait another 20 minutes for the rollback to finish.
5.6.2 I see ORA-32004 when I start the database
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
This problem is minor. From one release to the next, Oracle manages some parameters differently. They always decide to accept the old syntax but issue a
warning to advise the user to choose the new syntax. For example, in 10gR2, the parameter that configures the commit behaviour of the engine
was : commit_write = "batch,nowait"
But in 11gR2 it has been replaced by two parameters, so you see in alert.log :
Deprecated system parameters with specified values: commit_write
In 11gR2, the new names are :
commit_wait = "nowait"
commit_logging = "batch"
But your system will work in any case!
5.6.3 I suspect my database to be corrupted
In case of corruption, there will always be a message (or many!) in the alert.log. You will probably also have an incident in the adrci repository. But you
can double-check that there is no corruption with rman. Here I have chosen to check for corruption in my SYSTEM tablespace.
oracle11@archi5> rman target /
RMAN> report schema;
List of Permanent Datafiles
===========================
File Size(MB) Tablespace RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1 800 SYSTEM *** +DATA1/thunder/datafile/system.261.856530393
2 5100 SYSAUX *** +DATA1/thunder/datafile/sysaux.262.856530409
3 890 UNDOTBS *** +DATA1/thunder/datafile/undotbs.263.856530421
RMAN> validate datafile 1;
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=590 device type=DISK
channel ORA_DISK_1: starting validation of datafile
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
1 OK 0 33199 102400 274619676
File Name: +DATA1/thunder/datafile/system.261.856530393
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 49708
Index 0 16219
Other 0 3274
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
including current SPFILE in backup set
channel ORA_DISK_1: validation complete, elapsed time: 00:00:04
List of Control File and SPFILE
===============================
File Type Status Blocks Failing Blocks Examined
------------ ------ -------------- ---------------
SPFILE OK 0 2
Control File OK 0 854
Another solution :
RMAN> list failure;
using target database control file instead of recovery catalog
Database Role: PRIMARY
no failures found that match specification
5.6.4 My file system is full of old logs
Here are some places where you can find huge logs :
- Go to the directory of alert.log and remove all files.
- Go to the directory of the listener logs (same).
- Go to the directory of the listener traces (same).
- Go to the directory of the ASM logs.
- Go to the directory of the admin server logs and remove all files.
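Before removing everything in those directories, it can be useful to list what is actually old. The sketch below only lists files and deletes nothing. list_old_logs is a hypothetical helper, and the set of file extensions is an assumption to adapt to the directory being cleaned.

```shell
#!/bin/sh
# List trace and log files under a directory that were last modified more
# than a given number of days ago. Nothing is deleted.
# Assumption: the files of interest end in .trc, .trm or .log.
list_old_logs() {
    find "$1" -type f \( -name '*.trc' -o -name '*.trm' -o -name '*.log' \) -mtime +"$2"
}

# Example on a database host (not run here), using the trace directory
# seen in this guide and a 7-day cutoff:
#   list_old_logs /ldata/oracle/diag/rdbms/thunder/thunder/trace 7
```

Once the list has been reviewed, the same find expression can be reused for the actual clean-up.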
5.6.5 My datafiles are huge because I ran load traffic for many days, I want to reduce them
Follow chapter 8.1.
The explanation is that when a database is modified by adding rows, updating rows, deleting tables and so on, the engine does not try to return the
storage to its exact previous state. On a live site, space can be reclaimed with the command : Alter table shrink space
But on a testbed there is a faster method: export a schema, drop it and reimport it.
This can also be done with the whole database. By doing that, the tables are recreated with minimal storage occupation.
6 SQL tips
6.9 I want to shrink my VirtualBox VM that is a Solaris 10 guest inside a Windows host
The procedure below must only be executed after careful preparation, because it goes deep into the system. It is :
https://esurf.eservglobal.com/dbfm_send/46154 It will shrink your VM to the smallest compacted size.