Advanced Database Systems
Transaction Processing
What is a transaction?
- A transaction is a mechanism for applying a desired set of modifications/operations to a database. After a successful manipulation of its content, the resulting database instance is the most up-to-date copy of the database.
- An action, or series of actions, carried out by a single user or application program, which accesses or changes the contents of the database (i.e., a logical unit of work on the database).
- A transaction could be a whole program, a part/module of a program, or a single command.
- Changes made in real time to a database are called transactions. Examples include ATM transactions, credit card approvals, flight reservations, hotel check-in, phone calls, supermarket scanning, academic registration and billing.
- A transaction could be composed of one or more database and non-database operations.
- A transaction transforms the database from one consistent state to another, although consistency may be violated during the transaction.
- A database transaction is a unit of interaction with a database management system or similar system that is treated in a coherent and reliable way, independent of other transactions.
Transaction processing system
- A system that manages transactions and controls their access to a DBMS is called a TP monitor. A transaction processing system (TPS) generally consists of a TP monitor, one or more DBMSs, and a set of application programs containing transactions.
- In the database field, a transaction is a group of logical operations that must all succeed or fail as a group. Systems dedicated to supporting such operations are known as transaction processing systems.
In comparison with a database transaction, an application program is a series of transactions with non-database processing in between.
Distinction between business transactions and online transactions:
- A business transaction is an interaction in the real world, usually between an enterprise and a person, where something is exchanged.
- An online transaction is the execution of a program that performs an administrative or real-time function, often by accessing shared data sources, usually on behalf of an online user (although some transactions are run offline in batch).
What we are interested in is the online transaction, which is the interaction
between the users of a database system and the shared data stored in the
database. This transaction program contains the steps involved in the business
transaction.
Transactions can be started, attempted, then committed or aborted via data
manipulation commands of SQL.
Any transaction can have one of two outcomes:
- Success: the transaction commits and the database reaches a new consistent state. A committed transaction cannot be aborted or rolled back. (How do you discard a committed transaction?)
- Failure: the transaction aborts, and the database must be restored to the consistent state it was in before the transaction started. Such a transaction is rolled back or undone. An aborted transaction that is rolled back can be restarted later.
In many data manipulation languages, there are keywords used to indicate
different states of a transaction.
A single transaction might require several queries, each reading and/or writing
information in the database. When this happens it is usually important to be sure
that the database is not left with only some of the queries carried out. For
example, when doing a money transfer, if the money was debited from one
account, it is important that it also be credited to the depositing account. Also,
transactions should not interfere with each other.
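The commit-or-rollback behaviour can be seen in a minimal sketch using Python's built-in sqlite3 module; the account table, names and amounts are invented for illustration, not part of the lecture. Either both the debit and the credit become permanent together, or a failure rolls both back:

```python
import sqlite3

# Hypothetical schema and balances, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one logical unit of work."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.commit()    # success: both updates become permanent together
    except sqlite3.Error:
        conn.rollback()  # failure: neither update survives
        raise

transfer(conn, "A", "B", 10)
print(conn.execute("SELECT * FROM account ORDER BY id").fetchall())
# [('A', 90), ('B', 60)]
```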
A transaction is expected to exhibit some basic features or properties to be
considered as a valid transaction. These features are:
A: Atomicity
C: Consistency
I: Isolation
D: Durability
This is referred to as the ACID property of transactions. Without the ACID properties, the integrity of the database cannot be guaranteed.
Atomicity
This is the "all or none" property.
Every transaction should be considered an atomic process which cannot be subdivided into smaller tasks. Due to this property, just like an atom which either exists or does not exist, a transaction has only two states: Done or Never Started.
Done - a transaction must complete successfully and its effect
should be visible in the database.
Never Started - if a transaction fails during execution, then all its modifications must be undone to bring the database back to its last consistent state, i.e., to remove the effect of the failed transaction.
No state between Done and Never Started
Consistency
If the transaction code is correct then a transaction, at the end of its execution,
must leave the database consistent. A transaction should transform a database from one consistent state to another consistent state.
Isolation
A transaction must execute without interference from other concurrent
transactions and its intermediate or partial modifications to data must not be
visible to other transactions.
Durability
The effect of a completed transaction must persist in the database, i.e., its updates must be available to other transactions immediately after the end of its execution, and it should not be affected by failures occurring after the completion of the transaction.
In practice, these properties are often relaxed somewhat to provide better
performance.
State of a Transaction
A transaction is an atomic operation from the users' perspective, but internally it is a collection of operations and it can pass through a number of states during its execution.
A transaction can end in three possible ways.
1. Successful Termination: when a transaction completes the execution of all
operations in it and reaches the COMMIT command.
2. Suicidal Termination: when the transaction detects an error during its processing and decides to abort itself before its end, performing a ROLLBACK.
3. Murderous Termination: when the DBMS or the system forces the execution to abort for some reason.
[Transaction state diagram: after Start, the transaction Modifies data. If no error occurs, it reaches "OK to Commit", Commits, and ends with the database modified, in a new consistent state. If the transaction itself detects an error, it initiates a Rollback; if the system detects an error, it initiates an Abort. In both cases the transaction ends with the database unmodified, in its previous consistent state.]
Most SQL statements seem very short and easy to execute, but the reverse is true if you consider each one as a single-command transaction. A database system actually interprets a transaction not as an application program but as a logical sequence of low-level read and write operations (referred to as primitives).
Ways of Transaction Execution
In a database system many transactions are executed. Basically there are two
ways of executing a set of transactions:
(a) Serially (serial execution): transactions are executed strictly one after another. Transaction Ti completes and writes its results to the database, and only then is the next transaction Tj scheduled for execution. At any one time there is only one transaction being executed in the system, and data is not shared between transactions at any specific time.
In serial transaction execution, a transaction being executed does not interfere with the execution of any other transaction.
Good things about serial execution:
- Correct execution, i.e., if the input is correct then the output will be correct.
- Fast execution, since all the resources are available to the active transaction.
The worst thing about serial execution is very inefficient resource utilization.
Example of a serial execution
Suppose data items X = 10, Y = 6, and N = 1, and T1 and T2 are transactions.

T1: read(X); X := X + N; write(X); read(Y); Y := Y + N; write(Y)
T2: read(X); X := X + N; write(X)
We execute this transaction serially as follows:
Time  T1                    T2
t1    read(X)    {X = 10}
t2    X := X+N   {X = 11}
t3    write(X)   {X = 11}
t4    read(Y)    {Y = 6}
t5    Y := Y+N   {Y = 7}
t6    write(Y)   {Y = 7}
t7                          read(X)    {X = 11}
t8                          X := X+N   {X = 12}
t9                          write(X)   {X = 12}
Final values of X, Y at the end of T1 and T2: X = 12 and Y = 7.
Thus we can witness that in serial execution of transaction, if we have two transactions
Ti and Ti+1, then Ti+1 will only be executed after the completion of Ti.
(b) Concurrently: the reverse of serial execution; in this scheme the individual operations of the transactions, i.e., reads and writes, are interleaved in some order.
Time  T1                    T2
t1    read(X)    {X = 10}
t2                          read(X)    {X = 10}
t3    X := X+N   {X = 11}
t4                          X := X+N   {X = 11}
t5    write(X)   {X = 11}
t6                          write(X)   {X = 11}
t7    read(Y)    {Y = 6}
t8    Y := Y+N   {Y = 7}
t9    write(Y)   {Y = 7}
Final values at the end of T1 and T2: X = 11 and Y = 7. This improves resource utilization but unfortunately gives an incorrect result.
The correct value of X is 12 but in concurrent execution X =11, which is
incorrect. The reason for this error is incorrect sharing of X by T1 and T2.
In serial execution T2 read the value of X written by T1 (i.e., 11) but in
concurrent execution T2 read the same value of X (i.e., 10) as T1 did and
the update made by T1 was overwritten by T2’s update.
This is the reason the final value of X is one less than what is produced by serial execution.
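The two schedules can be replayed with a small deterministic sketch in Python; the operation encoding and helper names below are ours, invented for illustration, not any DBMS API. Each transaction keeps a private workspace, and the interleaved schedule loses T1's update on X:

```python
def run(schedule):
    """Replay a schedule of (transaction, operation, item) steps."""
    db = {"X": 10, "Y": 6}          # initial database state
    local = {"T1": {}, "T2": {}}    # each transaction's private workspace
    N = 1
    for txn, op, item in schedule:
        if op == "read":
            local[txn][item] = db[item]   # copy the value into the workspace
        elif op == "inc":
            local[txn][item] += N         # update only the private copy
        elif op == "write":
            db[item] = local[txn][item]   # install the copy in the database
    return db

serial = [("T1", "read", "X"), ("T1", "inc", "X"), ("T1", "write", "X"),
          ("T1", "read", "Y"), ("T1", "inc", "Y"), ("T1", "write", "Y"),
          ("T2", "read", "X"), ("T2", "inc", "X"), ("T2", "write", "X")]

interleaved = [("T1", "read", "X"), ("T2", "read", "X"),
               ("T1", "inc", "X"), ("T2", "inc", "X"),
               ("T1", "write", "X"), ("T2", "write", "X"),
               ("T1", "read", "Y"), ("T1", "inc", "Y"), ("T1", "write", "Y")]

print(run(serial))       # {'X': 12, 'Y': 7}  -- correct
print(run(interleaved))  # {'X': 11, 'Y': 7}  -- T1's update on X is lost
```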
Problems Associated with Concurrent
Transaction Processing
Although two transactions may be correct in themselves, interleaving of operations may produce an incorrect result, which calls for control over access.
With concurrent transaction processing one can enhance the throughput of the system: since reading and writing are performed from and to secondary storage, the system will not be idle during these operations if there is concurrent processing.
Every transaction may be correct by itself, but this does not guarantee that the interleaving of these transactions will produce a correct result.
The three potential problems caused by concurrency are:
Lost Update Problem
Uncommitted Dependency Problem
Inconsistent Analysis Problem.
Lost Update Problem
A successfully completed update on a data item by one transaction is overridden by another transaction/user.
E.g. Account with balance A=100.
T1 reads the account A
T1 withdraws 10 from A
T1 makes the update in the Database
T2 reads the account A
T2 adds 100 on A
T2 makes the update in the Database
In the above case, if done one after the other (serially) then we have
no problem.
¾ If the execution is T1 followed by T2 then A=190
¾ If the execution is T2 followed by T1 then A=190
But if they start at the same time in the following sequence:
T1 reads the account A=100
T1 withdraws 10 making the balance A=90
T2 reads the account A=100
T2 adds 100 making A=200
T1 makes the update in the Database A=90
T2 makes the update in the Database A=200
After the successful completion of the operations in this schedule, the final value of A will be 200, which overrides the update made by the first transaction that changed the value from 100 to 90.
Uncommitted Dependency Problem
Occurs when one transaction can see intermediate results of another
transaction before it is committed.
E.g.
T2 adds 100, making the balance 200, but then aborts before committing. T1 reads 200, subtracts 10 and makes it 190. But the actual balance should be 90, since T2's update never committed.
Inconsistent Analysis Problem
Occurs when one transaction reads several values while a second transaction updates some of them before the first completes.
E.g.
T2 would like to add the values A = 10, B = 20 and C = 30. After the values are read by T2 and before its completion, T1 updates the value of B to 50. At the end of the execution of the two transactions, T2 will come up with a sum of 60, while it should be 90 since B was updated to 50.
As discussed above, the objective of Concurrency Control Protocol is to
schedule transactions in such a way as to avoid any interference between
them. This demands a new principle in transaction processing, which is
serializability of the schedule of execution of multiple transactions.
Serializability
In any transaction processing system where concurrent processing is implemented, there will be a concept called a schedule that determines the execution sequence of the operations of the different transactions.
Schedule: a time-ordered sequence of the important actions taken by one or more transactions. A schedule represents the chronological order in which instructions are executed in the system.
The scheduler component of a DBMS must ensure that the individual steps of different transactions preserve consistency.
1. Serial Schedule: a schedule where the operations of each
transaction are executed consecutively without any interleaved
operations from other transactions. No guarantee that results of
all serial executions of a given set of transactions will be identical.
2. Nonserial Schedule: Schedule where operations from a set of
concurrent transactions are interleaved. The objective of
serializability is to find nonserial schedules that allow transactions
to execute concurrently without interfering with one another. In
other words, want to find nonserial schedules that are equivalent to
some serial schedule. Such a schedule is called serializable.
If a nonserial schedule produces a result equivalent to that of some serial schedule, we call the schedule serializable: the sequence of execution of its operations yields values identical to those of some serially executed schedule, which is always correct. Thus serializability identifies those executions of transactions guaranteed to ensure consistency.
Serialization
Objective of serialization is to find schedules that allow transactions
to execute concurrently without interfering with one another.
If two transactions only read data, order is not important.
If two transactions either read or write completely separate data
items, they do not conflict and order is not important.
If one transaction writes a data item and another reads or writes the same data item, the order of execution is important.
Possible solution: Run all transactions serially.
This is often too restrictive as it limits degree of concurrency or
parallelism in system.
Test for Serializability
Constrained write rule: transaction updates data item based on its
old value, which is first read.
Under the constrained write rule we can use a precedence graph to test for serializability: a nonserial schedule is tested by constructing its precedence graph.
A precedence graph is composed of:
• Nodes representing the transactions, and
• Arcs connecting the nodes, indicating the order of execution for the transactions.
The rules of precedence graph:
Represent each transaction in the schedule with a node. Then draw a directed arc from Ti to Tj (that is, Ti → Tj) if:
1. Tj reads an item that has already been written by Ti
• Ti Write(x) then Tj Read(x)
2. Tj writes an item that has already been read by Ti, or
• Ti Read(x) then Tj Write(x)
3. Tj writes an item that has already been written by Ti
• Ti Write(x) then Tj Write(x)
After drawing the precedence graph, if there is any cycle in the
graph, the schedule is said to be Non Serializable. Otherwise, that is if
there is no cycle in the graph, the schedule is equivalent with some
serial schedule and will be called serializable.
Example 1:
Schedule: R1(x) W2(x) W1(x)
Check the conflicting pairs: R1(x) before W2(x) gives T1 → T2; W2(x) before W1(x) gives T2 → T1.
The graph contains the cycle T1 → T2 → T1, so the schedule is not serializable.
Example 2:
Schedule: R1(x) W2(x) W1(x) W3(x)
Check the conflicting pairs: T1 → T2 (R1 before W2), T2 → T1 (W2 before W1), T2 → T3 (W2 before W3), and T1 → T3 (R1 and W1 before W3).
The cycle T1 → T2 → T1 again makes the schedule non-serializable.
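The test can be automated. The following Python sketch (the schedule encoding and function names are ours, for illustration) derives the edges from conflicting operations under the constrained write rule and reports whether the graph has a cycle:

```python
def precedence_edges(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations, Ti first."""
    edges = set()
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            # A conflict: different transactions, same item, at least one write.
            if ti != tj and x == y and "W" in (op_i, op_j):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reaches(start, goal, seen=()):
        return any(n == goal or (n not in seen and
                                 reaches(n, goal, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reaches(t, t) for t in graph)

s1 = [("T1", "R", "x"), ("T2", "W", "x"), ("T1", "W", "x")]
e1 = precedence_edges(s1)
print(e1, "serializable:", not has_cycle(e1))
# {('T1', 'T2'), ('T2', 'T1')} serializable: False
```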
Concurrency Control Techniques
Concurrency Control is the process of managing simultaneous operations
on the database without having them interfere with one another. Prevents
interference when two or more users are accessing database
simultaneously and at least one is updating data. Although two
transactions may be correct in themselves, interleaving of operations may
produce an incorrect result.
Three basic concurrency control techniques:
1. Locking methods
2. Time stamping
3. Optimistic
Locking and time stamping are conservative (pessimistic) approaches, since they delay transactions in case they conflict with other transactions.
The optimistic approach assumes conflict is rare: it lets transactions proceed unsynchronized and only checks for conflicts at the end, when a transaction commits.
Locking Method
A LOCK is a mechanism for enforcing limits on access to a resource in an
environment where there are many threads of execution. Locks are one
way of enforcing concurrency control policies. A transaction uses locks to deny access to other transactions and so prevent incorrect updates.
Lock prevents another transaction from modifying item or even
reading it, in the case of a write lock.
Lock (X): If a transaction T1 applies Lock on data item X, then X is locked
and it is not available to any other transaction.
Unlock (X): T1 Unlocks X. X is available to other transactions.
Types of a Lock
Shared lock: a Read operation does not change the value of a data item. Hence a data item can be read by two different transactions simultaneously under shared lock mode. So, to only read a data item X, T1 will do: Shared lock (X), then Read (X), and finally Unlock (X).
Exclusive lock: A write operation changes the value of the data item.
Hence two write operations from two different transactions
or a write from T1 and a read from T2 are not allowed. A
data item can be modified only under Exclusive lock. To
modify a data item T1 will do: Exclusive lock (X), then Write
(X) and finally Unlock (X).
When these locks are applied, a transaction must behave in a special way. This special behavior of a transaction is referred to as well-formed.
Well-formed: a transaction is well-formed if it does not lock an already-locked data item and it does not try to unlock an unlocked data item.
Locking - Basic Rules
If transaction has shared lock on item, can read but not update item.
If transaction has exclusive lock on item, can both read and update
item.
Reads cannot conflict, so more than one transaction can hold shared
locks simultaneously on same item.
Exclusive lock gives transaction exclusive access to that item.
Some systems allow transaction to upgrade a shared lock to an
exclusive lock, or vice-versa.
Examples: T1 and T2 are two transactions. They are executed under locking as follows. T1 locks A in exclusive mode. When T2 wants to lock A, it finds it locked by T1, so T2 waits for T1 to unlock A. When A is released, T2 locks A and continues execution.
Suppose a lock on a data item is applied, the data item is processed and it
is unlocked immediately after reading/writing is completed as follows.
Initial values of A = 10 and B = 20.
Serial execution of T1 and then T2:
  T1: Lock(A); read(A) {A = 10}; A := A + 100; write(A) {A = 110}; Unlock(A);
      Lock(B); read(B) {B = 20}; B := B + 10; write(B) {B = 30}; Unlock(B)
  T2: Lock(B); read(B) {B = 30}; B := B * 5; write(B) {B = 150}; Unlock(B);
      Lock(A); read(A) {A = 110}; A := A + 20; write(A) {A = 130}; Unlock(A)
  Final result: A = 130, B = 150

Concurrent execution of T1 and T2:
  T1: Lock(A); read(A) {A = 10}; A := A + 100; write(A) {A = 110}; Unlock(A)
  T2: Lock(B); read(B) {B = 20}; B := B * 5; write(B) {B = 100}; Unlock(B)
  T1: Lock(B); read(B) {B = 100}; B := B + 10; write(B) {B = 110}; Unlock(B)
  T2: Lock(A); read(A) {A = 110}; A := A + 20; write(A) {A = 130}; Unlock(A)
  Final result: A = 130, B = 110
The final result of the two transactions using the two types of transaction
execution (serial and concurrent) is not the same. This indicates that the
above method of locking and unlocking is not correct. Thus, to preserve
consistency we have to use another approach to locking, the two-phase locking scheme.
Two-Phase Locking (2PL)
A transaction follows 2PL protocol if all locking operations precede the
first unlock operation in the transaction.
The 2PL protocol demands that the locking and unlocking activity of a transaction have two phases:
- Growing phase: the transaction acquires locks but cannot release any lock.
- Shrinking phase: the transaction releases locks but cannot acquire any new lock.
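The growing/shrinking discipline can be sketched as a small lock manager in Python. The class and its interface are hypothetical, meant only to show shared/exclusive compatibility and the 2PL rule that no lock may be acquired after the first release:

```python
class TwoPhaseLockManager:
    """Toy lock manager enforcing shared/exclusive modes and the 2PL rule."""

    def __init__(self):
        self.locks = {}         # item -> (mode, set of holding transactions)
        self.shrinking = set()  # transactions that have released a lock

    def acquire(self, txn, item, mode):       # mode is "S" or "X"
        if txn in self.shrinking:
            raise RuntimeError(f"{txn} violates 2PL: lock after unlock")
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)                  # shared locks are compatible
            return True
        return holders == {txn}               # otherwise the caller must wait

    def release(self, txn, item):
        self.shrinking.add(txn)               # txn enters its shrinking phase
        mode, holders = self.locks[item]
        holders.discard(txn)
        if not holders:
            del self.locks[item]

lm = TwoPhaseLockManager()
print(lm.acquire("T1", "A", "X"))  # True: T1 holds the exclusive lock
print(lm.acquire("T2", "A", "S"))  # False: T2 must wait for T1's unlock
```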
Locking methods: problems
Deadlock: a state that may result when two (or more) transactions are each waiting for locks held by the other to be released.
Deadlock - possible solutions
Only one way to break deadlock: abort one or more of the transactions in
the deadlock.
Deadlock should be transparent to user, so DBMS should restart
transaction(s).
Two general techniques for handling deadlock:
Deadlock prevention.
Deadlock detection and recovery.
Timeout
The deadlock detection could be done using the technique of TIMEOUT.
Every transaction will be given a time to wait in case of deadlock. If a
transaction waits for the predefined period of time in idle mode, the DBMS
will assume that deadlock occurred and it will abort and restart the
transaction.
Time-stamping Method
Timestamp: a unique identifier created by DBMS that indicates relative
starting time of a transaction.
A timestamp can be generated by:
- using the system clock at the time the transaction started, or
- incrementing a logical counter every time a new transaction starts.
Time-stamping is a concurrency control protocol that orders transactions globally based on their timestamps, so that older transactions, i.e., transactions with smaller (earlier) timestamps, get priority in the event of conflict.
Conflict is resolved by rolling back and restarting transaction.
Since there is no need to use locks, there will be no deadlock.
In timestamp ordering, the schedule is equivalent to the particular serial
order that corresponds to the order of the transaction timestamps.
To implement this scheme, every transaction will be given a
timestamp which is a unique identifier of a transaction.
If Ti entered processing prior to Tj, then TS(Ti) is smaller than TS(Tj).
Again each data item will have a timestamp for Read and Write.
• WTS(A) which denotes the largest timestamp of any transaction
that successfully executed Write(A)
• RTS(A) which denotes the largest timestamp of any transaction
that successfully executed Read(A)
These timestamps are updated whenever a new Read(A) or
Write(A) instruction is executed.
Read/write proceeds only if last update on that data item was carried out
by an older transaction. Otherwise, transaction requesting read/write is
restarted and given a new timestamp.
The timestamp ordering protocol ensures that any conflicting read
and write operations are executed in the timestamp order.
The protocol manages concurrent execution such that the time-
stamps determine the serializability order.
Rules for permitting execution of operations in Time-stamping
Method
Suppose that Transaction Ti issues Read(A)
• If TS(Ti) < WTS(A): this implies that Ti needs to read a value of
A which was already overwritten. Hence the read operation
must be rejected and Ti is rolled back.
• If TS(Ti) >= WTS(A): then the read is executed and RTS(A) is
set to the maximum of RTS(A) and TS(Ti).
Suppose that Transaction Ti issues Write(A)
• If TS(Ti) < RTS(A): this implies that the value of A that Ti is producing was previously needed, and it was assumed that it would never be produced. Hence, the Write operation must be rejected and Ti is rolled back.
• If TS(Ti) < WTS(A): this implies that Ti is attempting to write an obsolete value of A. Hence, this write operation can be ignored.
• Otherwise, the Write operation is executed and WTS(A) is set to the maximum of WTS(A) and TS(Ti).
A transaction that is rolled back due to conflict will be restarted and
be given a new timestamp.
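These rules translate almost directly into code. The following Python sketch (the data structures and names are ours, for illustration) implements the basic timestamp-ordering checks, including the case where an obsolete write is simply ignored:

```python
class Rollback(Exception):
    """Raised when a conflicting operation forces the transaction to restart."""

rts, wts = {}, {}    # RTS(A) and WTS(A) per data item, 0 if never accessed

def read(ts, item):
    if ts < wts.get(item, 0):                 # TS(Ti) < WTS(A)
        raise Rollback("value already overwritten by a younger transaction")
    rts[item] = max(rts.get(item, 0), ts)     # execute the read, bump RTS(A)

def write(ts, item):
    if ts < rts.get(item, 0):                 # TS(Ti) < RTS(A)
        raise Rollback("a younger transaction already read the old value")
    if ts < wts.get(item, 0):                 # obsolete write: ignore it
        return
    wts[item] = max(wts.get(item, 0), ts)     # execute the write, bump WTS(A)

write(5, "A")        # transaction with TS = 5 writes A
read(7, "A")         # TS = 7 >= WTS(A) = 5: allowed, RTS(A) becomes 7
try:
    write(6, "A")    # TS = 6 < RTS(A) = 7: rejected
except Rollback as e:
    print("rolled back:", e)
```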
Problem with the timestamp-ordering protocol:
- Suppose Ti aborts, but Tj has read a data item written by Ti.
  - Then Tj must abort; if Tj had been allowed to commit earlier, the schedule is not recoverable.
  - Further, any transaction that has read a data item written by Tj must also abort.
  - This can lead to cascading rollback.
Cascading Rollback
Whenever some transaction T tries to issue a Read_Item(X) or a
Write_Item(X) operation, the basic timestamp ordering algorithm
compares the timestamp of T with the read timestamp and the write
timestamp of X to ensure that the timestamp order of execution of the
transactions is not violated. If the timestamp order is violated by the
operation, then transaction T would violate the equivalent serial schedule, so T is aborted. T is then resubmitted to the system as a new transaction with a new timestamp. If T is aborted and rolled back, any transaction Ti that
may have used a value written by T must also be rolled back. Similarly,
any transaction Tj that may have used a value written by Ti must also be
rolled back, and so on. This effect is known as cascading rollback.
Optimistic Technique
- Locking and assigning and checking timestamp values may be unnecessary for some transactions.
- The technique is based on the assumption that conflict is rare and that it is more efficient to let transactions proceed without delays to ensure serializability.
- When a transaction reaches the point of executing commit, a check is performed to determine whether conflict has occurred. If there is a conflict, the transaction is rolled back and restarted.
- Potentially allows greater concurrency than traditional protocols.
Three phases:
1. Read
2. Validation
3. Write
1. Optimistic Techniques - Read Phase
- Extends from the start of the transaction until immediately before commit.
- The transaction reads values from the database and stores them in local variables. Updates are applied to a local copy of the data.
2. Optimistic Techniques - Validation Phase
- Follows the read phase.
- For a read-only transaction, checks that the data read are still the current values. If there is no interference, the transaction is committed, else it is aborted and restarted.
- For an update transaction, checks that the transaction leaves the database in a consistent state, with serializability maintained.
3. Optimistic Techniques - Write Phase
- Follows a successful validation phase for update transactions.
- Updates made to the local copy are applied to the database.
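The three phases can be sketched compactly in Python using a per-item version number for validation. The representation is invented for illustration; real systems track read/write sets against transaction numbers:

```python
db = {"A": (100, 0), "B": (200, 0)}   # item -> (value, version)

def run_optimistically(txn):
    # Read phase: take local copies, remembering the versions read.
    snapshot = {k: db[k] for k in txn["reads"]}
    updates = txn["compute"]({k: v for k, (v, _) in snapshot.items()})
    # Validation phase: fail if any item read has changed since.
    for k, (_, version) in snapshot.items():
        if db[k][1] != version:
            return False                     # conflict: roll back and restart
    # Write phase: apply the locally computed updates.
    for k, v in updates.items():
        db[k] = (v, db[k][1] + 1)
    return True

txn = {"reads": ["A", "B"],
       "compute": lambda vals: {"A": vals["A"] + vals["B"]}}
print(run_optimistically(txn), db["A"])   # True (300, 1) if no interference
```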
Granularity of data items
Granularity is the size of the data items chosen as the unit of protection
by a concurrency control protocol.
It could be:
- The entire database
- A file
- A page (a section of physical disk in which relations are stored)
- A record
- A field value of a record
The granularity has an effect on the performance of the system. Since locking prevents access to data, the size of the locked unit determines how much data other transactions are prevented from accessing. If the entire database is locked, consistency is easy to maintain but system performance suffers. If a single data item is locked, consistency may be harder to guarantee but concurrent processing and performance are enhanced. Thus, as one goes from the entire database down to a single value, performance and concurrency improve, but consistency is more at risk and needs a good concurrency control mechanism and strategy.
Transaction Subsystem
Transaction manager
Scheduler
Recovery manager
Buffer manager
Transaction Manager in DBMS coordinates transactions on behalf of
application programs. Communicates with scheduler (lock manager)
which is responsible for implementing a strategy for concurrency
control.
The Scheduler in the DBMS ensures that the individual steps of
different transactions preserve consistency.
Recovery Manager: ensures that the database is restored to the original state in case a failure occurs.
Buffer Manager: responsible for transfer of data between disk
storage and main memory.
Database Recovery
Database recovery is the process of restoring database to a correct state in
the event of a failure.
A database recovery is the process of eliminating the effects of a failure
from the database.
Recovery, in database systems terminology, means restoring the last consistent state of the data items.
- Need for Recovery Control
  o Two types of storage: volatile (main memory) and nonvolatile.
  o Volatile storage does not survive system crashes.
  o Stable storage represents information that has been replicated in several nonvolatile storage media with independent failure modes.
Types of failures
A failure is a state where data inconsistency is visible to transactions if
they are scheduled for execution. The kind of failure could be:
- System crashes, resulting in loss of main memory.
- Media failures, resulting in loss of parts of secondary storage.
- Application software errors.
- Natural physical disasters.
- Carelessness or unintentional destruction of data or facilities.
- Sabotage.
In databases usually a failure can generally be categorized as one of the
following major groups:
1. Transaction failure: a transaction cannot continue with its
execution, therefore, it is aborted and if desired it may be restarted
at some other time. Reasons: Deadlock, timeout, protection
violation, or system error.
2. System failure: the database system is unable to process any
transactions. Some of the common reasons of system failure are:
wrong data input, register overflow, addressing error, power failure,
memory failure, etc.
3. Media failure: failure of non-volatile storage media (mainly disk).
Some of the common reasons are: head crash, dust on the recording
surfaces, fire, etc.
To keep the database secure, one should formulate a "plan of attack" in advance. The plan will be used in case of database damage that may range from minor inconsistency to total loss of the data due to hazardous events.
The basic steps in performing a recovery are
1. Isolating the database from other users. Occasionally, you may need
to drop and re-create the database to continue the recovery.
2. Restoring the database from the most recent useable dump.
3. Applying transaction log dumps, in the correct sequence, to the
database to make the data as current as possible.
It is a good idea to test your backup and recovery plans periodically by
loading the backups and transaction logs into a test database and verifying
that your procedure really works.
One can recover databases after three basic types of problems: user error,
software failure, and hardware failure.
Each type of failure requires a recovery mechanism. In transaction recovery, the effects of the failed transaction, if any, are removed from the database. In system failure recovery, the effects of failed transactions have to be removed from the database and the effects of completed transactions have to be installed in the database.
The database recovery manager is responsible for guaranteeing the atomicity and durability properties of ACID.
The execution of a transaction T is correct if, given a consistent database, T leaves the database in the next consistent state. Let the initial consistent state of the database be S1; T changes the consistent state from S1 to S2. During this transition the database may pass through an inconsistent state, which is visible only to T. If the system or T fails, the database will not be able to reach S2, and consequently the visible state would be inconsistent. From this state we can go either forward (to state S2) or backward (to state S1).
Example:
The initial value of A=100, B=200 and C=300
The Required state after the execution of T1 is A=500, B=800 and C=700
Thus S1= (100,200,300)
S2= (500,800,700)
Transaction (T1)
Time  Operation
1     A = A + 200
2     B = B - 100
3     C = C - 200
      --- Failure ---
4     A = A + 200
5     B = B + 700
6     C = C + 600
Force writing: the explicit writing of the buffers to secondary storage.
Transactions and Recovery
- Transactions represent the basic unit of recovery.
- The recovery manager is responsible for atomicity and durability.
- If failure occurs between commit and the database buffers being flushed to secondary storage then, to ensure durability, the recovery manager has to redo (rollforward) the transaction's updates.
- If the transaction had not committed at failure time, the recovery manager has to undo (rollback) any effects of that transaction, for atomicity.
- Partial undo: only one transaction has to be undone.
- Global undo: all transactions have to be undone.

Transaction Log: execution history of concurrent transactions. (The following refers to a timeline figure of transactions T1 to T6 running between times t0 and tf.)
- The DBMS starts at time t0 but fails at time tf. Assume the data for transactions T2 and T3 have been written to secondary storage.
- T1 and T6 have to be undone. In the absence of any other information, the recovery manager has to redo T2, T3, T4, and T5.
- tc is the checkpoint time of the DBMS.
Recovery Facilities
- The DBMS should provide the following facilities to assist with recovery:
  o Backup mechanism: makes periodic backup copies of the database.
  o Logging facility: keeps track of the current state of transactions and of database changes.
  o Checkpoint facility: enables updates to the database that are in progress to be made permanent.
  o Recovery manager: allows the DBMS to restore the database to a consistent state following a failure.
Restoring the database means transforming the state of the database back to the most recent consistent state before the failure. To do this, the changes made on the database must be preserved. Such information is stored in a system log or transaction log file.
Log File
- Contains information about all updates to the database:
  o Transaction records.
  o Checkpoint records.
- Often used for other purposes (for example, auditing).
- Transaction records contain:
  o Transaction identifier.
  o Type of log record (transaction start, insert, update, delete, abort, commit).
  o Identifier of the data item affected by the database action (insert, delete, and update operations).
  o Before-image of the data item.
  o After-image of the data item.
  o Log management information.
- The log file is sometimes split into two separate random-access files.
- Potential bottleneck; critical in determining overall performance.
Checkpointing
Checkpoint: a point of synchronization between the database and the log file. All buffers are force-written to secondary storage.
- A checkpoint record is created, containing the identifiers of all active transactions.
- When a failure occurs, redo all transactions that committed since the checkpoint and undo all transactions active at the time of the crash.
- In the previous example, with a checkpoint at time tc, the changes made by T2 and T3 have been written to secondary storage.
- Thus:
  o only redo T4 and T5,
  o undo transactions T1 and T6.
Recovery Techniques
Damage to the database could be either physical, resulting in the loss of the stored data, or just an inconsistency of the database state after the failure. For each case we have a recovery mechanism:
1. If the database has been damaged:
- Need to restore the last backup copy of the database and reapply the updates of committed transactions using the log file.
- Extensive damage/catastrophic failure (physical media failure): the database is restored by using the backup copy and by re-executing the committed transactions from the log up to the time of failure.
2. If the database is only inconsistent:
- No physical damage/only inconsistency: restoring is done by reversing the changes made on the database, consulting the transaction log.
- Need to undo the changes that caused the inconsistency. May also need to redo some transactions to ensure their updates reach secondary storage.
- No backup is needed; the database can be restored using the before- and after-images in the log file.
Recovery Techniques for Inconsistent Database State
Recovery is required only if the database has been updated. The kind of recovery also depends on the kind of update made on the database.
Database update: a transaction's updates can be applied to the database in two ways, deferred or immediate. Together with shadow paging, this gives three main recovery techniques:
1. Deferred Update
2. Immediate Update
3. Shadow Paging
Deferred Update
- Updates are not written to the database until after the transaction has reached its commit point.
- If the transaction fails before commit, it will not have modified the database and so no undoing of changes is required.
- It may be necessary to redo the updates of committed transactions, as their effect may not have reached the database.
- A transaction first modifies all its data items and then writes all its updates to the final copy of the database. No change is recorded in the database before commit; the changes are made only in the local transaction workspace. The update to the actual database is made after commit and after the change is recorded in the log. Since there is no need to perform an undo operation, this is also called the NO-UNDO/REDO algorithm.
Immediate Update/ Update-In-Place
- Updates are applied to the database as they occur.
- Need to redo updates of committed transactions following a failure.
- May need to undo effects of transactions that had not committed at the time of failure.
- It is essential that log records are written before the write to the database (the write-ahead log protocol).
- If there is no "transaction commit" record in the log, then that transaction was active at failure and must be undone.
- Undo operations are performed in the reverse of the order in which they were written to the log.

As soon as a transaction updates a data item, it updates the final copy of the database on the database disk. While making the update, the change is recorded in the transaction log to permit a rollback operation in case of failure. Both UNDO and REDO may be required to make the database consistent, so this is called the UNDO/REDO algorithm: undo removes updates made in place before commit, and redo is required because some operations that completed and committed may not yet have reached the database. If the second scenario cannot occur, the variation of this algorithm is called UNDO/NO-REDO.
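A minimal sketch of immediate update with a write-ahead log, in Python (the log record layout is simplified to before- and after-images): the log record is written before the in-place update, so an uncommitted transaction can be undone in reverse log order:

```python
db = {"A": 100, "B": 200}
log = []    # write-ahead log: (txn, item, before-image, after-image)

def write_item(txn, item, value):
    log.append((txn, item, db[item], value))   # 1. log record first (WAL)
    db[item] = value                           # 2. then update in place

def undo(txn):
    """Restore before-images in the reverse of the order they were logged."""
    for t, item, before, _ in reversed(log):
        if t == txn:
            db[item] = before

write_item("T1", "A", 150)
write_item("T1", "B", 50)
undo("T1")     # T1 failed before commit: remove its effects
print(db)      # {'A': 100, 'B': 200}
```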
Shadow Paging
- Maintain two page tables during the life of a transaction: the current page table and the shadow page table.
- When the transaction starts, the two page tables are the same.
- The shadow page table is never changed thereafter and is used to restore the database in the event of failure.
- During the transaction, the current page table records all updates to the database.
- When the transaction completes, the current page table becomes the shadow page table.
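Shadow paging can be sketched with two small dictionaries acting as page tables. This is a toy model, not a real storage layout: updates write new pages and redirect only the current page table, so an abort costs nothing and a commit is a single table swap:

```python
pages = {0: "old-A", 1: "old-B"}    # page id -> page contents
shadow = {"A": 0, "B": 1}           # shadow page table: frozen at start
current = dict(shadow)              # current page table: tracks updates

def update(item, value):
    new_page = max(pages) + 1       # write the new version to a fresh page
    pages[new_page] = value
    current[item] = new_page        # only the current table is redirected

update("A", "new-A")
# Commit: atomically make the current table the new shadow table.
# Abort/crash: keep the shadow table; the old pages were never touched.
print([pages[p] for p in shadow.values()])   # ['old-A', 'old-B']
print([pages[p] for p in current.values()])  # ['new-A', 'old-B']
```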
Query Processing and Optimization
DB system features
- Crash Recovery
- Integrity Checking
- Security
- Concurrency Control
- Query Processing and Optimization
- File Organization and Optimization
Query Processing
The aim of query processing is to find information in one or more databases and
deliver it to the user quickly and efficiently. Traditional techniques work well for
databases with standard, single-site relational structures, but databases
containing more complex and diverse types of data demand new query
processing and optimization techniques.
Query Processing can be divided into four main phases:
1. Decomposition,
2. Optimization,
3. Code generation, and
4. Execution.
Query Decomposition
Query decomposition is the process of transforming a high-level query into a relational algebra query, and of checking that the query is syntactically and semantically correct. Query decomposition consists of parsing and validation; its output is a relational algebra query on global relations.
Typical stages in query decomposition are:
1. Analysis: lexical and syntactical analysis of the query (correctness). A query tree is built for the query, containing a leaf node for each base relation, one or more non-leaf nodes for relations produced by relational algebra operations, and a root node for the result of the query. The sequence of operations runs from the leaves to the root.
2. Normalization: convert the query into a normalized form. The predicate in the WHERE clause is converted to conjunctive (∧) or disjunctive (∨) normal form.
3. Semantic Analysis: reject normalized queries that are incorrectly formulated or contradictory. A query is incorrect if its components do not contribute to generating the result, and contradictory if its predicate cannot be satisfied by any tuple. Algorithms: relation connection graph and normalized attribute connection graph.
4. Simplification: detect redundant qualifications, eliminate common sub-expressions, and transform the query to a semantically equivalent but more easily and efficiently computed form.
5. Query Restructuring: more than one translation is possible; transformation rules are used to restructure the query.
Most real-world data is not well structured. Today's databases typically contain
much non-structured data such as text, images, video, and audio, often
distributed across computer networks. In this complex environment, efficient
and accurate query processing becomes quite challenging.
There could be tons of tricks (not only in storage and query processing, but also in concurrency control, recovery, etc.). Different tricks may work better in different usage scenarios, and the same tricks get used over and over again in different applications.
Query processing: execute transactions on behalf of the query and print the result. The steps in query processing follow the four phases listed above.
Example:  SELECT CustomerName
          FROM Customer, Invoice
          WHERE Region = 'Kansas City' AND Amount > 1000
Query Optimization
What is wrong with the ordinary query?
Everyone wants the performance of their database to be optimal. In particular, there is often a requirement for a specific query, or an object that is query based, to run faster.
The problem of query optimization is to find the sequence of steps that produces the answer to the user's request in the most efficient manner, given the database structure.
The performance of a query is affected by the tables or queries that underlie the query and by the complexity of the query.
When data/workload characteristics change:
- the best navigation strategy changes, and
- the best way of organizing the data changes.
Query optimizers are one of the main means by which modern database systems achieve their performance advantages. Given a request for data manipulation or retrieval, an optimizer will choose an optimal plan for evaluating the request from among the manifold alternative strategies; i.e., there are many ways (access paths) of accessing the desired file/record. The optimizer tries to select the most efficient (cheapest) access path for accessing the data. The DBMS is responsible for picking the best execution strategy based on various considerations.
Most efficient processing: Least amount of I/O and CPU resources.
Selection of the best method: in a non-procedural language the system does the optimization at the time of execution. On the other hand, in a procedural language, programmers have some flexibility in selecting the best method. For optimizing the execution of a query the programmer must know:
- File organization
- Record access mechanisms and primary or secondary keys
- Data location on disk
- Data access limitations
To write correct code, application programmers need to know how data is organized physically (e.g., which indexes exist). To write efficient code, they also need to worry about data/workload characteristics. One has to cope with change!
Example: Consider relations r(AB) and s(CD). We require r × s.
Method 1
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and concatenate each with the r record.
c. All records of r concatenated?
   NO: go to a.
   YES: exit (the result is in RAM or on disk).
Performance: too many disk accesses.
Method 2: Improvement
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
Performance: reduces the number of times the s blocks are loaded, by a factor equal to the number of r records that can fit in main memory.
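Both methods can be modeled in Python; this in-memory sketch mirrors the I/O pattern rather than real disk access, and the function names are ours. Method 1 rescans s once per record of r, while Method 2 rescans s once per memory-load of r:

```python
def nested_loop(r, s):
    for rec_r in r:            # Method 1: one r record at a time,
        for rec_s in s:        # re-reading all of s for each one
            yield rec_r + rec_s

def block_nested_loop(r, s, block_size):
    for i in range(0, len(r), block_size):  # Method 2: a block of r records
        block = r[i:i + block_size]         # per full pass over s
        for rec_s in s:
            for rec_r in block:
                yield rec_r + rec_s

r = [("a1", "b1"), ("a2", "b2")]   # r(A, B)
s = [("c1", "d1"), ("c2", "d2")]   # s(C, D)
print(sorted(nested_loop(r, s)) == sorted(block_nested_loop(r, s, 2)))  # True
```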
Considerations during query optimization:
- Narrow down intermediate result sets quickly: SELECT before JOIN.
- Use access structures (indexes).
Approaches to Query Optimization
Heuristics Approach
The heuristic approach uses the knowledge of the characteristics of the
relational algebra operations and the relationship between the operators
to optimize the query.
Thus the heuristic approach of optimization will make use of:
o Properties of individual operators
o Association between operators
o Query Tree: a graphical representation of the operators, relations,
attributes and predicates and processing sequence during query
processing.
A query tree is composed of three main parts:
i. The leaves: the base relations used for processing the query (extracting the required information).
ii. The root: the final result/relation produced as the output of the operations on the relations used for query processing.
iii. Nodes: intermediate results or relations produced before reaching the final result.
The sequence of execution of operations in a query tree starts from the leaves, continues through the intermediate nodes, and ends at the root.
The properties of each operation and the associations between operators are analyzed using a set of rules called TRANSFORMATION RULES. Applying the transformation rules transforms the query into a relatively good execution strategy.
Transformation Rules for Relational Algebra
1. Cascade of SELECTION: conjunctive SELECTION operations can cascade into individual SELECTION operations, and vice versa:
   σ<c1 ∧ c2 ∧ c3>(R) = σ<c1>(σ<c2>(σ<c3>(R))), where each ci is a predicate
2. Commutativity of SELECTION operations:
   σ<c1>(σ<c2>(R)) = σ<c2>(σ<c1>(R))
3. Cascade of PROJECTION: in a sequence of PROJECTION operations, only the last in the sequence is required:
   π<L1>(π<L2>(π<L3>(π<L4>(R)))) = π<L1>(R)
4. Commutativity of SELECTION with PROJECTION, and vice versa:
   a. If the predicate c1 involves only the attributes in the projection list L1, then the selection and projection operations commute:
      π<L1>(σ<c1>(R)) = σ<c1>(π<L1>(R))
5. Commutativity of THETA JOIN/Cartesian product:
   R × S is equivalent to S × R; this also holds for Equi-Join and Natural-Join:
   (R ⋈<c1> S) = (S ⋈<c1> R)
6. Commutativity of SELECTION with THETA JOIN:
   a. If the predicate c1 involves only attributes of one of the relations being joined (say R), then the selection and join operations commute:
      σ<c1>(R ⋈<c> S) = (σ<c1>(R)) ⋈<c> S
   b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of R and c2 involves only attributes of S, then the selection and theta join operations commute:
      σ<c1 ∧ c2>(R ⋈<c> S) = (σ<c1>(R)) ⋈<c> (σ<c2>(S))
7. Commutativity of PROJECTION and THETA JOIN:
   If the projection list is of the form L1, L2, where L1 involves only attributes of R and L2 involves only attributes of S, and the join predicate c involves only attributes in the projection list, then the PROJECTION and JOIN operations commute:
   π<L1, L2>(R ⋈<c> S) = (π<L1>(R)) ⋈<c> (π<L2>(S))
8. Commutativity of the set operations UNION and INTERSECTION, but not SET DIFFERENCE:
   R ∩ S = S ∩ R and R ∪ S = S ∪ R
9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION:
   (R θ S) θ T = R θ (S θ T), where θ is one of these operations
10. Commuting SELECTION with SET OPERATIONS:
   σ<c>(R θ S) = (σ<c>(R)) θ (σ<c>(S)), where θ is one of the set operations
11. Commuting PROJECTION with UNION:
   π<L1>(S ∪ R) = π<L1>(S) ∪ π<L1>(R)
The heuristic approach is implemented by applying the above transformation rules in the following sequence of steps.
Sequence for Applying Transformation Rules
1. Use
Rule-1: Cascade of SELECTION
2. Use
Rule-2: Commutativity of SELECTION
Rule-4: Commuting SELECTION with PROJECTION
Rule-6: Commuting SELECTION with JOIN and CARTESIAN
Rule-10: Commuting SELECTION with SET OPERATIONS
3. Use
Rule-9: Associativity of binary operations (JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION). Rearrange the nodes so that the most restrictive operations are performed first (moving them as far down the tree as possible).
4. Combine each Cartesian product operation with the subsequent selection operation (forming a join).
5. Use
Rule-3: Cascade of PROJECTION
Rule-4: Commuting PROJECTION with SELECTION
Rule-7: Commuting PROJECTION with JOIN and CARTESIAN
Rule-11: Commuting PROJECTION with UNION
Main Heuristic
The main heuristic is to first apply operations that reduce the size (the cardinality
and/or the degree) of the intermediate relation. That is:
a. Perform SELECTION as early as possible: that will reduce the
cardinality (number of tuples) of the relation.
b. Perform PROJECTION as early as possible: that will reduce the
degree (number of attributes) of the relation.
Both a and b will be accomplished by placing the
SELECT and PROJECT operations as far down the tree as
possible.
c. SELECT and JOIN operations with the most restrictive conditions, i.e., those resulting in relations with the smallest absolute size, should be executed before other similar operations. This is achieved by reordering the JOIN nodes.
Example: consider the following schemas and the query, where the EMPLOYEE
and the PROJECT relations are related by the WORKS_ON relation.
EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
WORKS_ON (WEmpID, WProjID)
WEmpID (referring to employee identification) and WProjID (referring to project identification) are foreign keys in the WORKS_ON relation, referencing the EMPLOYEE and PROJECT relations respectively.
Query: The manager of the company working on road construction would like to view the names of employees born before January 1, 1965 who are working on the project named Ring Road.
The relational algebra representation of the query is:
π<FName, LName>(σ<(DoB < Jan 1 1965) ∧ (WEmpID = EEmpID) ∧ (PProjID = WProjID) ∧ (PName = 'Ring Road')>(EMPLOYEE × WORKS_ON × PROJECT))
The SQL equivalent of the above query is:
SELECT FName, LName
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE DoB < 'Jan 1 1965' AND EEmpID = WEmpID AND WProjID = PProjID AND PName = 'Ring Road'
The initial query tree will be:

π<FName, LName>
  σ<(DoB < Jan 1 1965) ∧ (WEmpID = EEmpID) ∧ (PProjID = WProjID) ∧ (PName = 'Ring Road')>
    ×
      ×
        EMPLOYEE
        WORKS_ON
      PROJECT
By applying the first step (cascading the selection) we will come up with the
following structure.
σ<DoB < Jan 1 1965>(σ<WEmpID = EEmpID>(σ<PProjID = WProjID>(σ<PName = 'Ring Road'>(EMPLOYEE × WORKS_ON × PROJECT))))
By applying the second step, it can be seen that some conditions involve attributes that belong to a single relation (DoB belongs to EMPLOYEE and PName belongs to PROJECT); thus those selection operations can be commuted with the Cartesian product. Then, since the condition WEmpID=EEmpID is based on the EMPLOYEE and WORKS_ON relations, the selection with this condition can be pushed down to them.
σ<PProjID = WProjID>((σ<PName = 'Ring Road'>(PROJECT)) × (σ<WEmpID = EEmpID>(WORKS_ON × (σ<DoB < Jan 1 1965>(EMPLOYEE)))))
The query tree after this modification will be:
π<FName, LName>
  σ<PProjID = WProjID>
    ×
      σ<PName = 'Ring Road'>
        PROJECT
      σ<WEmpID = EEmpID>
        ×
          WORKS_ON
          σ<DoB < Jan 1 1965>
            EMPLOYEE
Using the third step, perform the most restrictive operations first. From the query we can see that the selection on PROJECT is more restrictive than the selection on EMPLOYEE. Thus, it is better to perform the selection on PROJECT before the one on EMPLOYEE; the nodes are rearranged to achieve this.
π<FName, LName>
  σ<WEmpID = EEmpID>
    ×
      σ<PProjID = WProjID>
        ×
          σ<PName = 'Ring Road'>
            PROJECT
          WORKS_ON
      σ<DoB < Jan 1 1965>
        EMPLOYEE
Using the fourth step, combine each Cartesian product with the subsequent selection operation into a join.
π<FName, LName>
  ⋈<WEmpID = EEmpID>
    ⋈<PProjID = WProjID>
      σ<PName = 'Ring Road'>
        PROJECT
      WORKS_ON
    σ<DoB < Jan 1 1965>
      EMPLOYEE
Using the fifth step, perform the projections as early as possible.
π<FName, LName>
  ⋈<WEmpID = EEmpID>
    π<WEmpID>
      ⋈<PProjID = WProjID>
        π<PProjID>
          σ<PName = 'Ring Road'>
            PROJECT
        WORKS_ON
    π<FName, LName, EEmpID>
      σ<DoB < Jan 1 1965>
        EMPLOYEE
Cost Estimation Approach to Query Optimization
The main idea is to minimize the cost of processing a query. The cost function comprises:
- I/O cost + CPU processing cost + communication cost + storage cost
These components may have different weights in different processing environments.
The DBMS uses information stored in the system catalogue for the purpose of estimating cost. The main target of query optimization is to minimize the size of the intermediate relations, since their size affects the cost of:
- Disk access
- Data transportation
- Storage space in primary memory
- Writing on disk
The statistics in the system catalogue used for cost estimation purposes are:
- Cardinality of a relation: the number of tuples the relation currently contains (r)
- Degree of a relation: the number of attributes of the relation
- Number of tuples of a relation that can be stored in one block of memory
- Total number of blocks used by a relation
- Number of distinct values of an attribute (d)
- Selection cardinality of an attribute (S): the average number of records that will satisfy an equality condition on that attribute, S = r/d
By using the above information one can calculate the cost of executing a query under each candidate strategy and select the one with the minimum processing cost.
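As a worked example with invented catalogue figures, the selection cardinality statistic immediately tells the optimizer which equality predicate is more restrictive and should be applied first:

```python
r = 10_000                       # cardinality of the relation (r)
d = {"Dept": 50, "Sex": 2}       # distinct values per attribute (d)

S = {attr: r / dv for attr, dv in d.items()}   # S = r / d per attribute
print(S)   # {'Dept': 200.0, 'Sex': 5000.0}
# An equality condition on Dept is expected to keep ~200 tuples versus
# ~5000 for Sex, so the Dept predicate is the more restrictive one.
```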
Cost Components for Query Optimization
The costs of query execution can be calculated for the following major processes involved in query processing.
1. Access Cost of Secondary Storage
Data is going to be accessed from secondary storage, as any query will need some part of the data stored in the database. The disk access cost can be analyzed in terms of:
- Searching
- Reading, and
- Writing
data blocks used to store some portion of a relation.
The disk access cost will vary depending on the file organization used and
the access method implemented for it. In addition to the file organization,
the data allocation scheme, i.e. whether the data is stored contiguously or
in a scattered manner, will affect the disk access cost.
2. Storage Cost
While processing a query, as any query would be composed of many
database operations, there could be one or more intermediate results before
reaching the final output. These intermediate results should be stored in
primary memory for further processing. The bigger the intermediate relation,
the larger the memory requirement, which will have impact on the limited
available space. This will be considered as a cost of storage.
3. Computation Cost
A query is composed of many operations. The operations could be database
operations like reading and writing to a disk, or mathematical and other
operations like:
¾ Searching
¾ Sorting
¾ Merging
¾ Computation on field values
4. Communication Cost
In most database systems the database resides at one station and the various
queries originate from different terminals. This will have an impact on the
performance of the system, adding cost to query processing. Thus, the cost of
transporting data between the database site and the terminal from where the
query originates should be analyzed.
Pipelining
Pipelining is another method used for query optimization. It is sometimes
referred to as on-the-fly processing of queries. Where query optimization tries
to reduce the size of the intermediate results, pipelining goes further by
applying several conditions to a single intermediate result in one continuous
pass. Thus the technique is said to reduce the number of intermediate relations
in query execution.
Pipelining performs multiple operations on a single relation in a pipeline.
Ex:
Let’s say we have a relation on employees with the following schema:
Employee(ID, FName, LName, DoB, Salary, Position, Dept)
If a query would like to extract supervisors with salary greater than 2000, the
relational algebra representation of the query will be:
σ (Salary>2000) ∧ (Position=Supervisor) (Employee)
After reading the relation from the memory, the system could perform the
operation by cascading the SELECT operation.
1. Approach One
σ (Salary>2000) (σ (Position=Supervisor) (Employee))
Using this approach we will have the following relations:
¾ Employee
¾ The relation created by the first operation:
R1 = σ (Position=Supervisor) (Employee)
¾ The resulting relation created by the second operation:
R2 = σ (Salary>2000) (R1)
2. Approach Two
One can select a single tuple from the relation Employee, perform both tests
in a pipeline, and create the final relation at once. This is what is called
PIPELINING.
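For reference, a sketch of the same request in SQL (using the schema above); a
pipelined evaluator checks both predicates on each tuple as it is read, instead
of first materializing R1:
SELECT *
FROM Employee
WHERE Salary > 2000
AND Position = 'Supervisor';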
Chapter three
Database Integrity and Security
Topics:
• Database Integrity Rules
• Database Security Issues
o Discretionary access control
o Mandatory access control
o Statistical database security
In today's society, information is a critical resource not only in the fields of
industry, commerce, education, or medicine, but also in the fields of
military, diplomacy, or government. Some information is so important
that it has to be protected. For example, data corruption or
fabrication in a hospital database could result in patients' receiving the
wrong medication. Disclosure or modification of military information
could endanger national security.
• Privacy – Ethical and legal rights that individuals have with regard
to control over the dissemination and use of their personal
information
• Database security – Protection of information contained in the
database against unauthorized access, modification or destruction
• Database integrity – Mechanism that is applied to ensure that the
data in the database is correct and consistent
A good database security management system has not only the following
characteristics: data independence, shared access, minimal redundancy,
data consistency, and data integrity but also the following characteristics:
privacy, integrity, and availability.
• Privacy signifies that an unauthorized user cannot disclose data
• Integrity ensures that an unauthorized user cannot modify data
• Availability ensures that data be made available to the authorized
user unfailingly
• Copyright ensures the native rights of individuals as creators of
information.
• Validity ensures that activities are accountable by law.
With strong enforcement and management of these, it is said that the
database system can effectively prevent accidental security and integrity
threats arising from system errors, improper authorization, and concurrent-usage
anomalies. In addition, an efficient system should prevent malicious or
intentional security and integrity threats, such as a computer system
operator or programmer bypassing security, or attacks by hackers.
There are certain security policy issues that we should recognize: we
should consider administrative control policies, decide which security
features offered by the DBMS are used to implement the system, and decide
whether the focus of security administration is left with the DBA and whether
it is centralized or decentralized. Besides, one should decide on ownership
of shared data as well.
When we talk about the levels of security protection, it may range from
organization and administrative security, through physical and personnel
security, to communication security and information systems security.
Database security and integrity are about protecting the database from
being made inconsistent or being disrupted. Such damage is also called
database misuse.
Database misuse could be Intentional or Accidental, where accidental
misuse is easier to cope with than intentional misuse.
Accidental inconsistency could occur due to:
¾ System crash during transaction processing
¾ Anomalies due to concurrent access
¾ Anomalies due to redundancy
¾ Logical errors
Likewise, even though there are various threats that could be categorized
in this group, intentional misuse could be:
¾ Unauthorized reading of data
¾ Unauthorized modification of data or
¾ Unauthorized destruction of data
Most systems implement good Database Integrity to protect the system
from accidental misuse, while there are many computer-based measures to
protect the system from intentional misuse; these are termed Database
Security measures.
Levels of Security Measures
Security measures can be implemented at several levels and for different
components of the system. These levels are:
1. Physical Level: concerned with securing the site containing the
computer system. The backup systems should also be physically
protected from access except for authorized users. In other words, the
site or sites containing the computer systems must be physically
secured against armed or sneaky entry by intruders.
2. Human Level: concerned with authorization of database users for
accessing the content at different levels and privileges.
3. Operating System: concerned with the weakness and strength of the
operating system’s security on data files. No matter how secure the
database system is, weakness in operating system security may serve as
a means of unauthorized access to the database. This also includes
protection of data in primary and secondary memory from
unauthorized access.
4. Database System: concerned with data access limits enforced by the
database system, such as passwords and isolated transactions. Some
database system users may be authorized to access only a limited
portion of the database. Other users may be allowed to issue queries
but may be forbidden to modify the data. It is the responsibility
of the database system to ensure that these authorization restrictions
are not violated.
5. Application Level: Since almost all database systems allow remote
access through terminals or networks, software-level security with the
network software is as important as physical security, both on the
Internet and networks private to an enterprise.
Even though we can have different levels of security and authorization on
data objects and users, who accesses which data is a policy matter rather
than a technical one.
These policies:
¾ should be known by the system: they should be encoded in the system
¾ should be remembered: they should be saved somewhere (the catalogue)
• Database Integrity constraints contribute to maintaining a
secure database system by preventing data from becoming
invalid and hence giving misleading or incorrect results
• Domain Integrity means that each column in any table will
have a set of allowed values and cannot assume any value
other than those specified in its domain.
• Entity Integrity means that in each table the primary key
(which may be composite) satisfies both of two conditions:
1. That the primary key is unique within the table and
2. That the primary key column(s) contains no null values.
• Referential Integrity means that in the database as a whole,
things are set up in such a way that if a column exists in two
or more tables in the database (typically as a primary key in
one table and as a foreign key in one or more other tables),
then any change to a value in that column in any one table
will be reflected in corresponding changes to that value where
it occurs in other tables. This means that the RDBMS must be
set up so as to take appropriate actions to spread a change—in
one table—from that table to the other tables where the
change must also occur.
The effect of the existence and maintenance of referential
integrity is, in short, that if a column exists in two or more
tables in the database, every occurrence of the column will
contain only values that are consistent across the database.
• Key constraints: in a relational database there should be some
collection of attributes with a special feature used to maintain
the integrity of the database. These attributes are named the
Primary Key, Candidate Key, Foreign Key, etc. These
Key(s) should obey some rules set by the relational data
model.
• Enterprise Constraint means some business rules set by the
enterprise on how to use, manage and control the database
• Database Security - the mechanisms that protect the database
against intentional or accidental threats. Database security
encompasses hardware, software, people and data.
Database Management Systems supporting multi-user database
systems must provide a database security and authorization
subsystem to enforce limits on individual and group access
rights and privileges.
Security Issues and general considerations
• Legal, ethical and social issues regarding the right to
access information
• Physical control issues regarding how to keep the database
physically secured.
• Policy issues regarding privacy of individual level at
enterprise and national level
• Operational consideration on the techniques used
(password, etc) to access and manipulate the database
• System level security including operating system and
hardware control
• Security levels and security policies in enterprise level
The designer and the administrator of a database should first
identify the possible threats that might be faced by the system in
order to take countermeasures.
A threat may be any situation or event, whether intentional or
accidental, that may adversely affect a system and consequently
the organization.
• A threat may be caused by a situation or event involving a
person, action, or circumstance that is likely to bring harm to
an organization
• The harm to an organization may be tangible or intangible
Tangible – loss of hardware, software, or data
Intangible – loss of credibility or client confidence
Examples of threats:
✓ Using another person’s means of access
✓ Unauthorized amendment/modification or copying of data
✓ Program alteration
✓ Inadequate policies and procedures that allow a mix of
confidential and normal output
✓ Wire-tapping
✓ Illegal entry by hackers
✓ Blackmail
✓ Theft of data, programs, and equipment
✓ Failure of security mechanisms, giving greater access than
normal
✓ Staff shortages or strikes
✓ Inadequate staff training
✓ Viewing and disclosing unauthorized data
✓ Electronic interference and radiation
✓ Data corruption owing to power loss or surge
✓ Fire (electrical fault, lightning strike, arson), flood, bomb
✓ Physical damage to equipment
✓ Breaking or disconnection of cables
✓ Introduction of viruses
• An organization deploying a database system needs to
identify the types of threat it may be subjected to and initiate
appropriate plans and countermeasures, bearing in mind the
costs of implementing each.
Countermeasures: Computer Based Controls
• The types of countermeasure to threats on computer systems
range from physical controls to administrative procedures
• Despite the range of computer-based controls that are available, it
is worth noting that, generally, the security of a DBMS is only as
good as that of the operating system, owing to their close
association
• The following are computer-based security controls for a
multi-user environment:
¾ Authorization
The granting of a right or privilege that enables a subject to
have legitimate access to a system or a system’s object
Authorization controls can be built into the software, and
govern not only what system or object a specified user can
access, but also what the user may do with it
Authorization controls are sometimes referred to as access
controls
The process of authorization involves authentication of
subjects (i.e. a user or program) requesting access to objects
(i.e. a database table, view, procedure, trigger, or any other
object that can be created within the system)
¾ Views
A view is the dynamic result of one or more relational
operations operating on the base relations to produce another
relation
A view is a virtual relation that does not actually exist in the
database, but is produced upon request by a particular user
The view mechanism provides a powerful and flexible
security mechanism by hiding parts of the database from
certain users
Using a view is more restrictive than simply having certain
privileges granted to a user on the base relation(s)
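As an illustration (a minimal sketch; the staff table, its columns, and the
clerk_role name here are hypothetical), a view can expose only the
non-sensitive columns of a base relation, and access can then be granted on
the view instead of the base table:
CREATE VIEW staff_public AS
SELECT staff_id, fname, lname, position
FROM staff;  -- salary and date of birth stay hidden behind the view
GRANT SELECT ON staff_public TO clerk_role;
Users holding only clerk_role can then query staff_public but never the
underlying staff table.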
¾ Backup and recovery
Backup is the process of periodically taking a copy of the
database and log file (and possibly programs) on to offline
storage media
A DBMS should provide backup facilities to assist with the
recovery of a database following failure
Database recovery is the process of restoring the database to a
correct state in the event of a failure
Journaling is the process of keeping and maintaining a log file
(or journal) of all changes made to the database to enable
recovery to be undertaken effectively in the event of a failure
The advantage of journaling is that, in the event of a failure,
the database can be recovered to its last known consistent
state using a backup copy of the database and the information
contained in the log file
If no journaling is enabled on a failed system, the only means
of recovery is to restore the database using the latest backup
version of the database
However, without a log file, any changes made after the last
backup to the database will be lost
¾ Integrity
Integrity constraints contribute to maintaining a secure
database system by preventing data from becoming invalid
and hence giving misleading or incorrect results
Domain Integrity: setting the allowed set of values
Entity integrity: demanding Primary key values not to assume
a NULL value
Referential integrity: enforcing Foreign Key values to have a
value that already exist in the corresponding Candidate Key
attribute(s) or be NULL.
Key constraints: the rules the Relational Data Model has on
different kinds of Key.
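A minimal sketch in SQL of how these constraints are typically declared
(the DEPARTMENT and EMPLOYEE tables here are hypothetical):
CREATE TABLE DEPARTMENT (
DeptID INT PRIMARY KEY,  -- entity integrity: unique, non-null key
DeptName VARCHAR(30) NOT NULL
);
CREATE TABLE EMPLOYEE (
EmpID INT PRIMARY KEY,  -- entity integrity
FName VARCHAR(15),
LName VARCHAR(15),
Salary DECIMAL(10,2) CHECK (Salary >= 0),  -- domain integrity
DeptID INT REFERENCES DEPARTMENT(DeptID)  -- referential integrity
);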
Encryption
Authorization may not be sufficient to protect data in database
systems, especially when there is a situation where data should be
moved from one location to the other using network facilities.
Encryption is used to protect information stored at a particular site or
transmitted between sites from being accessed by unauthorized
users.
Encryption is the encoding of the data by a special algorithm that
renders the data unreadable by any program without the decryption
key.
It is not possible for encrypted data to be read unless the reader
knows how to decipher/decrypt the encrypted data.
¾ If a database system holds particularly sensitive data, it may be
deemed necessary to encode it as a precaution against possible
external threats or attempts to access it
¾ The DBMS can access data after decoding it, although there is a
degradation in performance because of the time taken to
decode it
¾ Encryption also protects data transmitted over communication
lines
¾ To transmit data securely over insecure networks requires the
use of a Cryptosystem, which includes:
1. An encryption key to encrypt the data (plaintext)
2. An encryption algorithm that, with the encryption key,
transforms the plaintext into ciphertext
3. A decryption key to decrypt the ciphertext
4. A decryption algorithm that, with the decryption key,
transforms the ciphertext back into plaintext
¾ Data encryption standard (DES) is an approach which does
both a substitution of characters and a rearrangement of their
order based on an encryption key.
Types of Cryptosystems
Cryptosystems can be categorized into two
1. Symmetric encryption – uses the same key for both
encryption and decryption and relies on safe communication
lines for exchanging the key.
2. Asymmetric encryption – uses different keys for encryption
and decryption e.g. RSA
Generally, symmetric algorithms are much faster to execute on
a computer than asymmetric ones. On the other hand,
asymmetric algorithms are more secure than symmetric
algorithms.
¾ RAID technology (Redundant Array of Independent Disks)
The hardware that the DBMS is running on must be fault-
tolerant, meaning that the DBMS should continue to operate
even if one of the hardware components fails. This suggests
having redundant components that can be seamlessly
integrated into the working system whenever there is one or
more component failures. The main hardware components
that should be fault-tolerant include disk drives, disk
controllers, CPU, power supplies, and cooling fans. Disk
drives are the most vulnerable components with the shortest
times between failures of any of the hardware components.
RAID works by having a large disk array comprising an
arrangement of several independent disks that are organized
to improve reliability and at the same time increase performance.
Performance is increased through data striping
Data striping – the data is segmented into equal size
partitions (the striping unit) which are transparently
distributed across multiple disks.
Security at different Levels of Data
Almost all RDBMSs provide security at different levels and
formats of data. This includes:
1. Relation Level: permission to have access to a specific
relation.
2. View Level: permission to data included in the view
and not in the named relations
3. Hybrid (Relation/View): the case where only part of a
single relation is made available to users through View.
Any database access request will have the following three major
components
1. Requested Operation: what kind of operation is requested
by a specific query?
2. Requested Object: on which resource or data of the database
is the operation sought to be applied?
3. Requesting User: who is the user requesting the operation
on the specified object?
The database should be able to check for all the three
components before processing any request. The checking is
performed by the security subsystem of the DBMS.
Authentication
¾ All users of the database will have different access levels and
permission for different data objects, and authentication is the
process of checking whether the user is the one with the
privilege for the access level.
¾ It is the process of checking that users are who they say they are.
¾ Each user is given a unique identifier, which is used by the
operating system to determine who they are
¾ Thus the system will check whether the user with a specific
username and password is trying to use the resource.
¾ Associated with each identifier is a password, chosen by the
user and known to the operating system, which must be
supplied to enable the operating system to authenticate who
the user claims to be
Authorization/Privilege
Authorization refers to the process that determines the mode in
which a particular (previously authenticated) client is allowed to
access a specific resource controlled by a server.
Most of the time, authorization is implemented by using Views.
¾ Views are virtual relations containing part of one or more
base relations, creating a customized/personalized view for
different users.
¾ Views are used to hide data that a user needs not to see.
Forms of user authorization
There are different forms of user authorization on the resource of the
database. These forms are privileges on what operations are allowed
on a specific data object.
User authorization on the data/extension
1. Read Authorization: the user with this privilege is allowed
only to read the content of the data object.
2. Insert Authorization: the user with this privilege is allowed
only to insert new records or items to the data object.
3. Update Authorization: users with this privilege are allowed to
modify content of attributes but are not authorized to delete the
records.
4. Delete Authorization: users with this privilege are only
allowed to delete a record and not anything else.
¾ Different users, depending on the power of the user, can have one
or the combination of the above forms of authorization on
different data objects.
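In SQL these forms correspond to GRANT statements; a sketch (the user name
alem and the employee table are hypothetical):
GRANT SELECT ON employee TO alem;  -- read authorization
GRANT INSERT ON employee TO alem;  -- insert authorization
GRANT UPDATE ON employee TO alem;  -- update authorization
GRANT DELETE ON employee TO alem;  -- delete authorization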
User authorization on the database schema
1. Index Authorization: deals with permission to create as well as
delete an index for a relation.
2. Resource Authorization: deals with permission to add/create
a new relation in the database.
3. Alteration Authorization: deals with permission to add as
well as delete attributes.
4. Drop Authorization: deals with permission to delete an
existing relation.
Role of DBA in Database Security
The database administrator is responsible for making the database
as secure as possible. For this, the DBA should have more
powerful privileges than every other user. The DBA
provides capabilities for database users while accessing the
content of the database.
The major responsibilities of DBA in relation to authorization of
users are:
1. Account Creation: involves creating different accounts for
different USERS as well as USER GROUPS.
2. Security Level Assignment: involves assigning different
users to different categories of access levels.
3. Privilege Grant: involves giving different levels of privileges
for different users and user groups.
4. Privilege Revocation: involves denying or canceling
previously granted privileges for users due to various
reasons.
5. Account Deletion: involves deleting an existing account of
a user or user group. It is similar to denying all privileges of
the user on the database.
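These responsibilities map onto statements such as the following sketch
(the account name almaz is hypothetical, and the exact syntax varies between
DBMSs):
CREATE USER almaz IDENTIFIED BY 'secret';  -- account creation
GRANT SELECT, INSERT ON employee TO almaz;  -- privilege grant
REVOKE INSERT ON employee FROM almaz;  -- privilege revocation
DROP USER almaz;  -- account deletion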
Approaches to Database Security
There are two broader approaches to security. The two types
of database security mechanisms are:
1. Discretionary security mechanisms
¾ Grant different privileges to different users and
user groups on various data objects
¾ The privilege is to access different data objects
¾ The mode of the privilege could be
Read,
Insert,
Delete,
Update files, records or fields.
¾ Is more flexible
¾ One user can have A but not B and another user
can have B but not A
2. Mandatory security mechanisms
¾ Enforce multilevel security
¾ classifying data and users into various
security classes (or levels) and
implementing the appropriate security
policy of the organization.
¾ Each data object will have certain classification
level
¾ Each user is given certain clearance level
¾ Only users who can pass the clearance level can
access the data object
¾ Is comparatively inflexible/rigid
¾ If one user can have A but not B, then B is
accessible only to users with a higher clearance, and we
cannot have a user with B but not A
The ability to classify users into a hierarchy of groups provides
a powerful tool for administering large systems with
thousands of users and objects.
A database system can support one or both of the security
mechanisms to protect the data.
In most systems it is better to explicitly identify what is allowed
rather than what is not allowed, since anything that is not
explicitly authorized can then be treated as constrained.
Statistical Database Security
¾ Statistical databases contain information about
individuals which may not be permitted to be seen by
others as individual records.
¾ Such databases may contain information about various
populations.
¾ Example: Medical Records, Personal Data like address,
salary, etc
¾ Such databases should have special security
mechanisms so that confidential information about people
will not be disclosed to many users.
¾ Thus statistical databases should have additional security
techniques which protect against the retrieval of individual
records.
¾ Only queries with statistical aggregate functions like
Average, Sum, Min, Max, Standard Deviation, Median,
Count, etc. should be executed.
¾ Queries retrieving confidential attributes should be
prohibited.
¾ Not to let the user make inference on the retrieved data,
one can also implement constraint on the minimum
number of records or tuples in the resulting relation by
setting a threshold.
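For example (a sketch over a hypothetical employee table), the system would
execute the first query but reject the second, and could additionally refuse
to answer whenever the selected population is smaller than the threshold:
SELECT AVG(salary) FROM employee WHERE dept = 'Sales';  -- allowed: aggregate only
SELECT salary FROM employee WHERE emp_id = 1234;  -- rejected: individual record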
Chapter Four
Distributed Database Systems
o Before the development of database technology, every
application used to have its own data in the application logic.
o Database development facilitates the integration of data
available in an organization from a number of applications and
enforces security on data access at a single local site. But it is
not always the case that organizational data resides at one
central site. This demands that databases at different sites be
integrated and synchronized with all the facilities of the
database approach.
o This is made possible by computer networks and data
communication, boosted by the internet, mobile and wireless
computing, and intelligent devices.
o This leads to Distributed Database Systems.
o The decentralization approach to databases mirrors the natural
organizational structure of companies, which are logically
distributed into divisions, departments, projects and so on, and
physically distributed into offices, plants, and factories, each of
which maintains its own operational data.
o Distributed Database is not a centralized database.
Data Distribution Strategies
[Figure: Centralized DB vs. Distributed DB]
¾ Distributed DB stores logically related shared data and metadata
at several physically independent sites connected via network
¾ Distributed DBMS is the software system that permits the
management of a Distributed DB and makes the distribution
transparent to the user.
¾ Data allocation is the process of deciding where to allocate/store
particular data item.
¾ There are 3 data allocation strategies:
1. Centralized: the entire DB is located at a single
site, and remote computers access it through the
network; known as distributed processing.
2. Partitioned: the DB is split into several disjoint
parts (called partitions, segments or fragments)
and stored at several sites
3. Replicated: copies of one or more partitions are
stored at several sites
3.1 Selective: combines fragmentation (with locality
of reference for data that is rarely updated),
replication, and centralization, as appropriate for
the data.
3.2 Complete: a copy of the database is made
available at each site. Snapshot is one method
used here.
¾ In a distributed database system, the database is logically
stored as single database but physically fragmented on several
computers. The computers in a distributed system
communicate with each other through various communication
media, such as high speed buses or telephone line.
¾ A distributed database system has the following components.
o Local DBMS
o Distributed DBMS (DDBMS)
o Global System Catalog(GSC)
o Data communication (DC)
¾ A distributed database system consists of a collection of sites,
each of which maintains a local database system (Local DBMS)
but each local DBMS also participates in at least one global
transaction where different databases are integrated together.
o Local Transaction: transactions that access data only in
that single site
o Global Transaction: transactions that access data in
several sites.
¾ Parallel DBMS: a DBMS running across multiple processors
and disks that is designed to execute operations in parallel,
whenever possible, in order to improve performance.
Three architectures for parallel DBMS:
Shared Memory: fast data access for a limited
number of processors.
Shared Disk: for applications that are inherently centralized.
Shared Nothing: massively parallel.
¾ What makes DDBMS different is that
o The various sites are aware of each other
o Each site provides a facility for executing both local and
global transactions.
¾ The different sites can be connected physically in different
topologies.
o Fully Connected,
o Partially Connected,
o Tree Network,
o Star Network and
o Ring Network
¾ The differences between these topologies are based on:
o Installation Cost: cost of linking sites physically.
o Communication Cost: cost to send and receive messages
and data
o Reliability: resistance to failure
o Availability: degree to which data can be accessed
despite the failure.
¾ The distribution of the database sites could be:
1. Large Geographical Area: Long-Haul Network
¾ relatively slow
¾ less reliable
¾ uses telephone line, microwave, satellite
2. Small Geographical Area: Local Area Network
¾ higher speed
¾ lower rate of error
¾ use twisted pair, base band coaxial, broadband
coaxial, fiber optics
¾ Even though integration of data implies centralized storage and
control, in distributed database systems the intention is
different. Data is stored in different database systems in a
decentralized manner but act as if they are centralized through
development of computer networks.
¾ A distributed database system consists of loosely coupled sites
that share no physical component and database systems that
run on each site are independent of each other.
¾ Those which share physical components are known as Parallel
DBMS.
¾ Transactions may access data at one or more sites
¾ Organizations may implement their database systems on a
number of separate computer systems rather than a single,
centralized mainframe. Computer systems may be located at
each local branch office.
Functions of a DDBMS
¾ A DDBMS has the following functionality:
Extended Communication Services: to provide
access to remote sites.
Extended Data Dictionary: to store data distribution
details; a need for a global system catalog.
Distributed Query Processing: optimization of
queries involving remote data access.
Extended Security: access control for distributed
data.
Extended Concurrency Control: maintains
consistency of replicated data.
Extended Recovery Services: handles failures of individual
sites and of the communication lines.
Issues in DDBMS
How is data stored in DDBMS?
There are several ways of storing a single relation in distributed
database systems.
¾ Replication:
o System maintains multiple copies of similar data
(identical data)
o Stored in different sites, for faster retrieval and fault
tolerance.
o Duplicate copies of the tables can be kept on each system
(replicated). With this option, updates to the tables can
become involved (of course the copies of the tables can be
read-only).
o Advantage: Availability, Increased parallelism (if only
reading)
o Disadvantage: increased overhead of update
¾ Fragmentation:
o Relation is partitioned into several fragments stored in
distinct sites
The partitioning could be vertical, horizontal or
both.
o Horizontal Fragmentation
Systems can share the responsibility of storing
information from a single table with individual
systems storing groups of rows
Performed by the Selection Operation
The whole content of the relation is reconstructed
using the UNION operation
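A sketch of horizontal fragmentation in SQL (the table names and branch
values are hypothetical): each site stores the rows satisfying a selection
predicate, and the whole relation is rebuilt with UNION:
-- fragment stored at site 1
CREATE TABLE employee_addis AS
SELECT * FROM employee WHERE branch = 'Addis';
-- fragment stored at site 2
CREATE TABLE employee_awassa AS
SELECT * FROM employee WHERE branch = 'Awassa';
-- reconstruction of the whole relation
SELECT * FROM employee_addis
UNION
SELECT * FROM employee_awassa;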
o Vertical Fragmentation
Systems can share the responsibility of storing
particular attributes of a table.
Needs an attribute with the tuple number (the primary key
value must be repeated.)
Performed by the Projection Operation
The whole content of the relation is reconstructed
using the Natural JOIN operation using the
attribute with Tuple number (primary key values)
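A corresponding sketch of vertical fragmentation (column names are
hypothetical); each fragment repeats the primary key emp_id so the relation
can be rebuilt:
CREATE TABLE emp_personal AS
SELECT emp_id, fname, lname, dob FROM employee;
CREATE TABLE emp_job AS
SELECT emp_id, salary, position FROM employee;
-- reconstruction by natural join on the shared primary key
SELECT * FROM emp_personal NATURAL JOIN emp_job;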
o Both (hybrid fragmentation)
A system can share the responsibility of storing
particular attributes of a subset of records in a given
relation.
Performed by projection then selection or selection then
projection relational algebra operators.
Reconstruction is made by combined effect of Union and
natural join operators
Fragmentation is correct if it fulfils the following
1. Complete: - a data item must appear in at least one
fragment of a given relation R (R1, R2…Rn).
2. Reconstruction:- it must be possible to reconstruct a
relation from the fragments
3. Disjointness: - a data item should only be found in a
single fragment except for vertical fragmentation (the
primary key is repeated for reconstruction).
¾ Data transparency:
The degree to which system user may remain unaware of the
details of how and where the data items are stored in a
distributed system.
1. Distribution transparency Even though there are
many systems they appear as one- seen as a single,
logical entity.
2. Replication transparency Copies of data floating
around everywhere also seem like just one copy to
the developers and users
3. Fragmentation transparency A table that is actually
stored in parts across many sites may seem
like just a single table at a single site.
4. Location Transparency- the user doesn’t need to
know where a data item is physically located.
How does it work?
Distributed computing can be difficult to implement, particularly for
replicated data that can be updated from many systems. In order to
operate, a distributed database system has to take care of:
• Distributed Query Processing
• Distributed Transaction Management
• Replication Data Management: if you are going to have copies
of data on many machines, how often does the data get updated
if it is changed in another system? Who is in charge of
propagating the update to the data?
• Distributed Database Recovery: if one machine goes down,
how does that affect the others?
• Security: Just like any computer network, a distributed system
needs to have a common way to validate users entering from
any computer in the network of servers.
• Common Data-Dictionary: your schema now has to be
distributed and work in connection with schemas created on
many systems.
Homogeneous and Heterogeneous
Distributed Databases
In a homogeneous distributed database
¾ All sites have identical software (DBMS)
¾ Are aware of each other and agree to cooperate in
processing user requests.
¾ Each site surrenders part of its autonomy in terms of right
to change schemas or software
¾ Appears to the user as a single system
In a heterogeneous distributed database
¾ Different sites may use different schemas and software
(DBMS)
Difference in schema is a major problem for query
processing
Difference in software is a major problem for
transaction processing
¾ Sites may not be aware of each other and may provide
only limited facilities for cooperation in transaction
processing.
¾ May need gateways to interface one another.
Why DDBMS/Advantages
1. Many existing systems
¾ Maybe you have no choice.
¾ Possibly there are many different existing system, with
possible different kinds of systems (Oracle, Informix, others)
that need to be used together.
2. Data sharing and distributed control:
¾ A user at one site may be able to access data that is available at
another site.
¾ Each site can retain some degree of control over local data.
¾ We will have local as well as global database administrators.
3. Reliability and availability of data
¾ If one site fails the rest can continue operation, as long as a
transaction does not demand data from the failed site
or that data is replicated at other sites.
4. Speedup of query processing
¾ If a query involves data from several sites, it may be possible
to split the query into sub-queries that can be executed at
several sites in parallel.
¾ A query can be sent to the least heavily loaded sites.
5. Expansion (Scalability)
¾ In a distributed environment you can easily expand by
adding more machines to the network.
Disadvantages of DDBMS
1. Software Development Cost
¾ Is difficult to install, thus is costly
2. Greater Potential for Bugs
¾ Parallel processing may endanger correctness of algorithms
3. Increased Processing Overhead
¾ Exchange of messages between sites – high communication
latency
¾ Due to communication protocol overhead
4. Communication problems
5. Increased Complexity and Data Inconsistency Problems
¾ Since clients can read and modify closely related data stored
in different database instances concurrently.
6. Security Problems: network and replicated data security.
Query Processing and Transaction
Management in DDBMS
Query Processing
There are different strategies to process a specific query, which in
turn increase the performance of the system by minimizing
processing time and cost. In addition to the cost estimates we have
for a centralized database (disk access, relation size, etc), we have to
consider the following in distributed query processing:
o Cost of data transmission over the network
o Gain of parallel processing of a single query
For the case of Replicated data allocation, even though parallel
processing is used to increase performance, update will have a great
impact since all the sites containing the data item should be updated.
For the case of fragmentation, update works more like in the centralized
database, but reconstruction of the whole relation will require
accessing data from all sites containing part of the relation.
Let the distributed database have three sites (S1, S2, and S3) and two
relations, EMPLOYEE and DEPARTMENT, located at S1 and S2
respectively without any fragmentation. A query is initiated
from S3 to retrieve employees (First Name (15 bytes long), Last Name
(15 bytes long) and Department Name (10 bytes long), a total of 40 bytes)
together with the department they are working in.
Let:
For EMPLOYEE we have the following information
1. 10,000 records
2. each record is 100 bytes long
For DEPARTMENT we have the following information
3. 100 records
4. each record is 35 bytes long
There are three ways of executing this query:
1. Transfer DEPARTMENT and EMPLOYEE to S3 and perform
the join there: needs transfer of 10,000*100 + 100*35 = 1,003,500
bytes.
2. Transfer DEPARTMENT to S1, perform the join there, which
will produce 40*10,000 = 400,000 bytes, and transfer the result to S3:
we need 3,500 + 400,000 = 403,500 bytes to be transferred.
3. Transfer EMPLOYEE to S2, perform the join there, which
will again produce 40*10,000 = 400,000 bytes, and transfer the result to S3:
we need 1,000,000 + 400,000 = 1,400,000 bytes to be transferred.
Then one can select the strategy that will reduce the data transfer cost
for this specific query. Other steps of optimization may also be
included to make the processing more efficient by reducing the size of
the relations using projection.
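For instance (an illustrative calculation, assuming the join attribute fits
within the 40-byte result records described above), projecting EMPLOYEE onto
just the First Name, Last Name and join attribute before shipping it in
strategy 3 would shrink each record from 100 bytes to roughly 40 bytes,
cutting the transfer of the EMPLOYEE relation from about 1,000,000 bytes to
about 400,000 bytes.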
Transaction Management
Transaction is a logical unit of work constituted by one or more
operations executed by a single user. A transaction begins with the
user's first executable query statement and ends when it is committed
or rolled back.
A Distributed Transaction is a transaction that includes one or
more statements that, individually or as a group, update data on two
or more distinct nodes of a distributed database.
Representation of a Query in a Distributed Database
SQL Statement: SELECT * FROM dept@sales.midroc.telecom.et;
Object: dept
Database Domain: sales.midroc.telecom.et
There are two types of transaction in DDBMS to access data from
other sites:
1. Remote Transaction: contains only statements that access a
single remote node. Thus, Remote Query statement is a query
that selects information from one or more remote tables, all of
which reside at the same remote node or site.
o For example, the following query accesses data from the dept
table in the Addis schema (the site) of the remote sales database:
SELECT * FROM Addis.dept@sales.midroc.telecom.et;
o A remote update statement is an update that modifies data in
one or more tables, all of which are collocated at the same
remote node. For example, the following statement updates the
dept table in the Addis schema of the remote sales database:
UPDATE Addis.dept@sales.midroc.telecom.et
SET loc = 'Arada'
WHERE BranchNo = 5;
2. Distributed Transaction: contains statements that access more
than one node.
o A distributed query statement retrieves information from two
or more nodes.
o If all statements of a transaction reference only a single remote
node, the transaction is remote, not distributed.
o A database must guarantee that all statements in a transaction,
distributed or non-distributed, either commit or roll back as a
unit. The effects of an ongoing transaction should be invisible
to all other transactions at all nodes; this transparency should
be true for transactions that include any type of operation,
including queries, updates, or remote procedure calls.
o For example, the following query accesses data from the local
database as well as the remote sales database:
SELECT ename, dname
FROM Awassa.emp AW, Addis.dept@sales.midroc.telecom.et AD
WHERE AW.deptno = AD.deptno;
{Employee data is stored at Awassa and the department data of the sales
database is stored at Addis; there is an employee responsible for each sale.}
Different kinds of SQL statements (the remote object names here are just
examples):
• Remote query
select
client_nm
from
client@sales.midroc.telecom.et;
• Distributed query
select
project_name, student_nm
from
internship@sales.midroc.telecom.et i, student s
where
s.stu_id = i.stu_id;
• Remote Update
update student@sales.midroc.telecom.et set
stu_id = '242'
where
stu_id = '200';
• Distributed Update
update student@sales.midroc.telecom.et set
stu_id = '242'
where
stu_id = '200';
update student set
stu_id = '242'
where
stu_id = '200';
commit;
Concurrency Control
o There are various techniques used for concurrency control in
centralized database systems. The techniques in distributed
database system are similar with the centralized approach with
additional implementation requirements or modifications.
o The main difference or the change that should be incorporated is
the way the lock manager is implemented and how it functions.
o There are different schemes for concurrency control in DDBS:
1. Non-Replicated Scheme
o No data is replicated in the system
o All sites will maintain a local lock manager (local lock and
unlock)
o If site Si needs a lock on data at site Sj, it sends a message to the
lock manager of site Sj, and the locking will be handled by
site Sj
o All the locking and unlocking principles are handled by
the local lock manager in which the data object resides.
o Is simple to implement
o Need three message transfers
o To request a lock
o To notify grant of lock
o To request unlock
2. Single Coordinator Approach
o The system chooses one single lock manager that resides at
one of the sites (Si)
o All lock and unlock requests are made at site Si, where
the lock manager resides
o Is simple to implement
o Needs two message transfers
o To request a lock
o To request unlock
o Simple deadlock handling
o Could be a bottleneck since all processes are handled at
one site
o Is vulnerable/at risk if the site with the lock manager fails
Chapter Five
Data warehousing
Data warehouse is an integrated, subject-oriented, time-
variant, non-volatile database that provides support for
decision making.
✓ Integrated → a centralized, consolidated database that
integrates data derived from the entire organization.
¾ Consolidates data from multiple and diverse
sources with diverse formats.
¾ Helps managers to better understand the
company’s operations.
✓ Subject-Oriented → the data warehouse contains data
organized by topics, e.g. Sales, Marketing, Finance,
Customer, etc.
✓ Time variant: in contrast to operational data, which
focuses on current transactions, the warehouse data
represents the flow of data through time.
¾ The data warehouse contains data that reflect what
happened last week, last month, the past five years, and
so on.
¾ Snapshots of the organization’s data at different
points in time.
✓ Non-volatile → once data enters the data warehouse,
it is never changed, because the data in the
warehouse represents the company’s entire history, not
operational data.
✓ A DW is organized around major subjects, such as
customer, product, sales.
✓ A DW focuses on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
✓ A DW provides a simple and concise view around
particular subject issues by excluding data that are not
useful in the decision support process.
✓ Constructed by integrating multiple, heterogeneous data
sources:
relational databases, flat files, on-line transaction records.
✓ Data cleaning and data integration techniques are
applied.
✓ Ensures consistency in naming conventions, encoding
structures, attribute measures, etc. among the different data
sources.
Ultimately, information is created from data warehouses. Such
information becomes the basis for rational decision making.
The data found in a data warehouse is analyzed to discover
previously unknown data characteristics, relationships,
dependencies, or trends.
Differences between database and data warehouse
✓ Because data is added all the time, the warehouse is
continuously growing.
✓ The data warehouse and operational environments are
separated. The data warehouse receives its data from
operational databases.
✓ The data warehouse environment is characterized by read-only
transactions against very large data sets.
✓ The operational environment is characterized by numerous
update transactions to a few data entities at a time.
✓ The data warehouse contains historical data over a long time
horizon.
Data in a data warehouse is different from data in current
transactions. Current transaction data is called online data
and is processed using OnLine Transaction Processing
(OLTP).
Differences between a data warehouse and data in OLTP
The main difference between these two systems is that the data
warehouse contains historical data, whereas OLTP contains
current data used in day-to-day transaction processing.
OLTP System | Data Warehousing System
Current data | Historical data
Detailed data | Summarized data
Dynamic data | Largely static data
Repetitive, already-known operations | Ad hoc and unstructured operations
Predictable pattern of usage | Unpredictable pattern of usage
Transaction driven | Analysis driven
Application oriented | Subject oriented
Supports day-to-day decisions | Supports strategic decisions
Large number of operational users | Low number of managerial users
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical,
consolidated
Database design: ER + application vs. star schema +
subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex
queries
Different aspects of comparing OLTP with OLAP:
Aspect | OLTP System | OLAP System
User | Clerk, IT professional | Knowledge workers
Function | Day-to-day operation | Decision support
DB Design | Application oriented | Subject oriented
Data | Current, up-to-date, detailed | Historical, summarized, integrated
Usage | Repetitive | Ad-hoc
Access | Read/write | Lots of scans, read only
Unit of work | Short and simple transaction | Complex query
# of records accessed | Tens | Millions
# of users | Thousands | Hundreds
DB size | 100 MB-GB | 100 GB-TB
Performance metric | Transaction throughput | Query throughput
Benefits of Data Warehouse
If data is kept in data warehouse and used for high level
decision making and knowledge extraction, there are lots of
advantages that would be gained which include:
o Potential high returns on investment
o Competitive advantage on market
o Increased productivity
Data Mining
¾ In a simple definition, data mining is the extraction or ‘mining’ of
knowledge from large amounts of data. “Mining of gold from
sand or soil is not sand or soil mining but GOLD mining.”
Similarly, data mining is actually knowledge mining or
knowledge discovery that is useful for decision making.
¾ Most organizations, these days, are in a data-rich but
information-poor situation.
¾ Data mining tools predict future trends and behaviors, allowing
businesses to make proactive, knowledge-driven decisions.
¾ The availability of large amounts of data, coupled with the need for
powerful data analysis tools, motivates the development of
data mining tools.
¾ Data mining tools can answer business questions that
traditionally were too time-consuming to resolve.
¾ Evolution of Data Mining:
Data collection → data storage and retrieval → the need for
more information
¾ Similar terminologies:
• Knowledge Mining
• Knowledge extraction
• Knowledge discovery
• Pattern analysis
• Data archaeology
¾ Data in the real world is dirty
• incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes
or names
¾ No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of
quality data
¾ The first task in data mining is to preprocess the data
warehouse so that quality data is used for knowledge
extraction.
Major Tasks in Data Preprocessing
• Data cleaning: Fill in missing values, smooth noisy
data, identify or remove outliers, and resolve
inconsistencies
• Data integration: Integration of multiple databases,
data cubes, or files
• Data transformation: Normalization and
aggregation
• Data reduction: Obtains reduced representation in
volume but produces the same or similar analytical
results
• Data discretization: Part of data reduction but with
particular importance, especially for numerical data
Types of knowledge to be mined
• Characterization: to study the behavior or
characteristics of data object for further analysis.
• Discrimination: study patterns so that objects are
grouped for the purpose of implementing different
strategies; it discriminates between groups.
• Association: the process of analyzing existing
pattern of occurrence of events to see
interdependency between objects and events.
• Classification/ Clustering: The process of dividing
a dataset into mutually exclusive groups such that
the members of each group are as "close" as possible
to one another, and different groups are as "far" as
possible from one another, where distance is
measured with respect to specific variable(s) you
are trying to predict.
• Prediction: A structure and process for predicting
the values of specified variables in a dataset.
• Outlier analysis: preprocessing to deal with data
items found in extreme ends affecting knowledge
discovery
Given databases of sufficient size and quality, data mining
technology can generate new business opportunities by
providing these capabilities:
¾ Automated prediction of trends and behaviors.
¾ Automated discovery of previously unknown patterns.
The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models
that learn through training and resemble biological neural
networks in structure.
• Decision trees: Tree-shaped structures that represent sets of
decisions. These decisions generate rules for the classification of
a dataset. Specific decision tree methods include Classification
and Regression Trees (CART) and Chi-Square Automatic
Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use
processes such as genetic combination, mutation, and natural
selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each
record in a dataset based on a combination of the classes of the
k record(s) most similar to it. Sometimes called the k-nearest
neighbor technique.
• Rule induction: The extraction of useful if…then… rules from
data based on statistical significance.
Alternatively, the techniques of data mining can be categorized into the
following modeling techniques.
• Predictive models. These types of models predict how likely
an event is. Usually, the higher a score, the more likely the
event is. For example, how likely a credit card transaction is
to be fraudulent, or how likely an airline passenger is to be a
terrorist, or how likely a company is to go bankrupt.
• Summary models. These models summarize data. For
example, a cluster model can be used to divide credit card
transactions or airline passengers into different groups
depending upon their characteristics.
• Network models. These types of models uncover certain
structures in data represented by nodes and links. For
example, a network model for a terrorist cell might use nodes
representing individuals and links representing meetings.
• Association models. Sometimes certain events occur
frequently together. For example, purchases of certain items,
such as beer and pretzels, or a sequence of events associated
with component failure. Association models are used to find
and characterize these co-occurrences.
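
The co-occurrence idea behind association models can be sketched by counting how often pairs of items appear together across purchases (the baskets below are invented):

    from itertools import combinations
    from collections import Counter

    def pair_counts(transactions):
        """Count how often each pair of items occurs in the same basket."""
        counts = Counter()
        for basket in transactions:
            for pair in combinations(sorted(set(basket)), 2):
                counts[pair] += 1
        return counts

    baskets = [["beer", "pretzels", "chips"],
               ["beer", "pretzels"],
               ["milk", "bread"],
               ["beer", "pretzels", "milk"]]
    print(pair_counts(baskets).most_common(1))
    # [(('beer', 'pretzels'), 3)] -- the strongest co-occurrence
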
Chapter Six
Introduction to Object-Oriented
Database Systems
Outline
• What are the basic concepts governing OO systems?
• Drawbacks of Relational DBMS
• How does object orientation affect data modeling and design?
• Are OO features related to traditional relational and E-R
models?
• What are the basic features of an OO database management
system (OODBMS)?
• What are the advantages and disadvantages of OODBMSs?
Object Orientation
• Object Orientation
• Set of design and development principles
• Based on autonomous computer structures known as
objects
• OO Contribution areas
• Programming Languages
• Graphical User Interfaces
• Databases
• Design
• Operating Systems
Evolution of OO Concepts
• Concepts stem from object-oriented programming
languages (OOPLs)
• Ada, ALGOL, LISP, SIMULA
• OOPLs goals
• Easy-to-use development environment
• Powerful modeling tools for development
• Decrease in development time
• Make reusable code
• OO Attributes
• Data is not passive
• Data and procedures are bound together
• Objects can act on themselves
Advanced Database Applications
Databases are increasingly used in complex applications that
demand a different way of modeling and design.
Some of these applications are:
• Computer Aided Design (CAD)
• Computer Aided Manufacturing
• Computer Aided Software Engineering
• Network Management System
• Multimedia System
• Digital Publication
• Geographic Information System
• Interactive and Dynamic Web Sites
Drawbacks/Weaknesses of Relational DBMS
In addition to the emergence of many advanced database
application areas, there are a number of drawbacks to the
relational database management system.
1. Poor representation of ‘real world’ entities
Relations do not correspond directly to real-world objects
2. Semantic overloading (semantically overloaded)
Representation of complex relationships is difficult
E.g., an M:N relationship is represented by adding an
additional relation (turning the relationship into an entity)
One cannot distinguish a relation from a
relationship.
Difficult to distinguish different types of
relationships.
3. Poor support for integrity and enterprise constraints
Many commercial DBMSs do not support all integrity
constraints
The relational model does not support enterprise
constraints → they have to be implemented on top of the DBMS
4. Homogeneous data structure
Has vertical and horizontal homogeneity
Horizontal homogeneity: each tuple of a
relation has the same number and types of
attributes
Vertical homogeneity: the values of an attribute
must come from the same domain
Field values must be atomic
5. Limited operations
The relational model has a fixed set of operations on the data
It does not allow additional/new operations to be defined
6. Difficulty in handling recursive queries
Direct and indirect recursive queries over a single
relation cannot easily be expressed.
A query that extracts information from a recursive
relationship between tuples of the same entity is
difficult to formulate.
7. Impedance mismatch
Mixing of different programming paradigms
Mismatch between the declarative, set-oriented
database language (SQL) and the procedural host language
8. Poor navigational process
Access is based on navigating individual records from
different relations
OO Concepts
• An object is a uniquely identifiable entity that contains both the
attributes that describe the state of the ‘real world’ object and
the actions that are associated with it.
• OODBMS can manage complex, highly interrelated
information.
• Abstract representation of a real-world entity
• Unique identity
• Embedded properties
• Ability to interact with other objects and self
OID (Object Identity)
• Each object is unique in OO systems
• Unique to object
• Not a primary key (a PK is unique only within one relation,
and a PK is drawn from attributes, making it dependent on the
object's state)
• Is invariant (will not change)
• Independent of the values of attributes (two objects can have
the same state but will have different OIDs)
• Is invisible to users
• Entity integrity is enforced
• Relationship: embedding the OID of one object into the other
(e.g., embed the OID of a branch in an employee object)
• Advantage of using OID:
• Are efficient
• Are fast
• Cannot be modified by users (system generated)
• Independent of content
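
The point that an OID is system-generated, invariant, and independent of the object's state can be illustrated with a short Python sketch (uuid stands in here for a DBMS-generated OID; the class is hypothetical):

    import uuid

    class PersistentObject:
        """Sketch: every object receives a system-generated, immutable identifier."""
        def __init__(self, **attributes):
            self._oid = uuid.uuid4()       # generated once, never changes
            self.__dict__.update(attributes)

    a = PersistentObject(name="Abebe", city="Addis")
    b = PersistentObject(name="Abebe", city="Addis")   # identical state...
    print(a._oid == b._oid)                            # False: distinct identity
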
Attributes
• Called instance variables
• Each attribute has a domain (the set of values it may take)
Object state
• The object's values at any given time, i.e. the values of
its attributes at that point in time
Methods
• Code/function that performs operation on object’s data
• Has name and body
• Methods define the behavior of object
• Can be used to change the state of the object by
modifying attribute values
Messages
• Means by which objects communicate
• Request from one object to the other to activate one of its
methods
• Invokes method/calls method to be applied
• Sent to object from real world or other object
• Notation: Object.Method
• E.g.: StaffObject.updateSalary(salary)
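
In a programming language, sending a message is simply invoking a method on the object. A minimal sketch of the updateSalary example above (class and attribute names are illustrative):

    class Staff:
        def __init__(self, name, salary):
            self.name = name
            self.salary = salary

        def update_salary(self, new_salary):
            """The method activated when the object receives the message."""
            self.salary = new_salary

    staff_object = Staff("Haleluya", 5000)
    staff_object.update_salary(6500)   # Object.Method -- sending the message
    print(staff_object.salary)         # 6500
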
Classes
• Blueprint for defining similar objects
• Objects with similar attributes that respond to the same
messages are grouped together
• Defined only once and used by many objects
• Collection of similar objects
• Shares attributes and structure
• Objects in a class are called instances of the class
• E.g., class definition: defining the class BRANCH

BRANCH
  Attributes
    branchNo
    street
    city
    postcode
  Methods
    print()
    getPostCode()
    numberOfStaff()

Objects of the BRANCH class:
  branchNo=B005      branchNo=B007      branchNo=B003
  street=st1         street=st2         street=st2
  city=Addis         city=Dire          city=Bahirdar
  postcode=1425      postcode=452       postcode=85
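
The same class and its three instances can be rendered in code; the Python class below is only an analogy for the OODBMS class definition (numberOfStaff() is omitted, since it would need the related Staff objects):

    class Branch:
        def __init__(self, branch_no, street, city, postcode):
            self.branch_no = branch_no
            self.street = street
            self.city = city
            self.postcode = postcode

        def print(self):
            print(self.branch_no, self.street, self.city, self.postcode)

        def get_postcode(self):
            return self.postcode

    # The three instances (objects) of the class from the example above
    b005 = Branch("B005", "st1", "Addis", "1425")
    b007 = Branch("B007", "st2", "Dire", "452")
    b003 = Branch("B003", "st2", "Bahirdar", "85")
    b005.print()   # B005 st1 Addis 1425
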
Object Characteristics
Class Hierarchy
• Superclass
• Subclass
Inheritance
• Ability of an object to inherit the data structure and behavior
of the classes above it
• Single inheritance – a class has one immediate superclass
• Multiple inheritance – a class has more than one immediate superclass
Method Overriding
• Method redefined at subclass
Polymorphism
• Allows different objects to respond to the same message in
different ways (see the sketch below)
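
A short sketch ties inheritance, method overriding, and polymorphism together: Manager inherits from Employee (single inheritance), redefines pay() (overriding), and both respond to the same message in their own way (polymorphism). All names are illustrative.

    class Employee:
        def __init__(self, name, salary):
            self.name = name
            self.salary = salary

        def pay(self):
            return self.salary

    class Manager(Employee):              # single inheritance: one superclass
        def __init__(self, name, salary, bonus):
            super().__init__(name, salary)
            self.bonus = bonus

        def pay(self):                    # method overriding: redefined here
            return self.salary + self.bonus

    # Polymorphism: the same message, different responses
    for person in (Employee("Sara", 4000), Manager("Abel", 6000, 1500)):
        print(person.pay())               # 4000, then 7500
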
Object Classification
• Simple
  Only single-valued attributes
  No attribute refers to another object
• Composite
  At least one multi-valued attribute
  No attribute refers to another object
• Compound
  At least one attribute that references another object
• Hybrid
  A repeating group of attributes
  At least one attribute refers to another object
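
The four categories differ only in the shape of an object's attributes, as the hypothetical Python classes below show:

    class City:                    # a separate object referenced below
        def __init__(self, name):
            self.name = name

    class SimpleObject:            # only single-valued, non-object attributes
        def __init__(self):
            self.name = "Abebe"
            self.age = 30

    class CompositeObject:         # at least one multi-valued attribute
        def __init__(self):
            self.name = "Abebe"
            self.phones = ["0911000000", "0912000000"]   # multi-valued

    class CompoundObject:          # at least one attribute referencing another object
        def __init__(self):
            self.name = "Abebe"
            self.city = City("Addis")                    # object reference

    class HybridObject:            # repeating group, at least one object reference
        def __init__(self):
            self.name = "Abebe"
            self.previous_cities = [City("Dire"), City("Bahirdar")]

    print(CompoundObject().city.name)   # Addis: following the reference
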
Characteristics of OO Data Model
• Supports complex objects
• Must be extensible
• Supports encapsulation
• Exhibit inheritance
• Supports object identity
OO vs. E-R Model Components
OO Data Model        E-R Model
-----------------    -----------------
Type                 Entity definition
Object               Entity instance
Class                Entity set
Instance variable    Attribute
N/A                  Primary key
OID                  N/A
Class hierarchy      E-R diagram
OODBMS
• Object-oriented database technology is a marriage of object-
oriented programming and database technologies.
• Database management system integrates benefits of typical
database systems with OODM characteristics
• Handles a mix of data types (since an OODBMS permits new data
type definitions)
• Follows OO rules
• Follows DBMS rules
OO and Database Design
• Provides data identification and the procedures for data
manipulation
• Data and procedures self-contained entity
• Iterative and incremental
• DBA does more programming
• Lack of standards
OODBMS Advantages
• More semantic information
• Support for complex objects
• Extensibility of data types (user defined data types)
• May improve performance with efficient caching
• Versioning
• Polymorphism: one operation shared by many objects, with each
object responding differently
• Reusability
• Inheritance speeds development: new objects are defined in
terms of previously defined objects → incremental definition
• Potential to integrate DBMSs into single environment
• Relationship between objects is represented explicitly
supporting both navigational and associative access to
information.
OODBMS Disadvantages
• Strong opposition from the established RDBMSs
• Lack of theoretical foundation
• No standard
• No single data model
• Throwback to old pointer systems
• Lack of standard ad hoc query language
• Lack of business data design and management tools
• Steep learning curve
• Low market presence
• Lack of compatibility between different OODBMSs