Relational Database Modeling Syllabus
Syllabus (2/2)
1.3 Relational database design
1.3.1 Functional dependencies
1.3.2 First normal form
1.3.3 Second normal form
1.3.4 Third normal form
1.3.5 Other normal forms
1.4 SQL basics
1.4.1 Defining a relation schema
1.4.2 Database modifications
1.4.3 Simple queries
1.4.4 Subqueries
1.4.5 Aggregation operators
1.4.6 Grouping
1.4.7 Having clauses
1.4.8 Transactions
1.1 Relational model basics
● Dr. E. F. Codd proposed the relational model for database systems in 1970.
• It is the basis for the relational database management system (RDBMS).
• The relational model consists of the following:
– Collection of objects or relations
– Set of operators to act on the relations
– Data integrity for accuracy and consistency
● The relational model uses a collection of tables to represent both data and the
relationships among those data.
● Each table has multiple columns, and each column has a unique name.
○ Tables are also known as relations.
1.1 Relational model basics
● The relational model is an example of a record-based model.
○ Record-based models are so named because the database is structured in fixed-format
records of several types.
○ Each table contains records of a particular type.
○ Each record type defines a fixed number of fields, or attributes.
○ The columns of the table correspond to the attributes of the record type.
● The relational data model is the most widely used data model, and a vast
majority of current database systems are based on the relational model
1.1.1 Attributes
● A relation consists of a heading and a body.
● A heading is a set of attributes.
● An attribute is an ordered pair of attribute name and type name.
● An attribute value is a specific valid value for the type of the attribute.
○ This can be either a scalar value or a more complex type.
● In the relational model the term relation is used to refer to a table, while the
term tuple is used to refer to a row. Similarly, the term attribute refers to a
column of a table.
Structure of a relation
1.1.2 Domains
● For each attribute of a relation, there is a set of permitted values, called the
domain of that attribute.
● We require that, for all relations r, the domains of all attributes of r be atomic. A
domain is atomic if elements of the domain are considered to be indivisible
units.
● The important issue is not what the domain itself is, but rather how we use
domain elements in our database.
○ Suppose that a phone number attribute stores a single phone number. Even then, if we split
the value from the phone number attribute into a country code, an area code and a local
number, we would be treating it as a nonatomic value. If we treat each phone number as a
single indivisible unit, then the attribute phone number would have an atomic domain.
1.1.3 Schemas
● A relation schema consists of a list of attributes and their corresponding
domains. The database schema as a whole is the logical design of the database.
● The concept of a relation instance corresponds to the
programming-language notion of a value of a variable. The value of a given
variable may change with time; similarly, the contents of a relation instance
may change with time as the relation is updated. In contrast, the schema of a
relation does not generally change.
schema notation
● Consider the department relation.
Standard Data Types
Data type                             Description
CHARACTER(n)                          Character string. Fixed length n
VARCHAR(n) or CHARACTER VARYING(n)    Character string. Variable length. Maximum length n
BINARY(n)                             Binary string. Fixed length n
BOOLEAN                               Stores TRUE or FALSE values
VARBINARY(n) or BINARY VARYING(n)     Binary string. Variable length. Maximum length n
INTEGER(p)                            Integer numerical (no decimal). Precision p
SMALLINT                              Integer numerical (no decimal). Precision 5
INTEGER                               Integer numerical (no decimal). Precision 10
BIGINT                                Integer numerical (no decimal). Precision 19
DECIMAL(p,s)                          Exact numerical, precision p, scale s
NUMERIC(p,s)                          Exact numerical, precision p, scale s (same as DECIMAL)
FLOAT(p)                              Approximate numerical, mantissa precision p. A floating number in base 10 exponential notation
REAL                                  Approximate numerical, mantissa precision 7
FLOAT                                 Approximate numerical, mantissa precision 16
DOUBLE PRECISION                      Approximate numerical, mantissa precision 16
DATE                                  Stores year, month, and day values
TIME                                  Stores hour, minute, and second values
TIMESTAMP                             Stores year, month, day, hour, minute, and second values
INTERVAL                              Composed of a number of integer fields, representing a period of time
ARRAY                                 A set-length and ordered collection of elements
MULTISET                              A variable-length and unordered collection of elements
XML                                   Stores XML data
1.1.4 Keys
● In the relational model, no two tuples in a relation are allowed to have exactly the
same value for all attributes.
● Formally, let R denote the set of attributes in the schema of relation r.
● A superkey is a set of one or more attributes that, taken collectively, allow us
to identify uniquely a tuple in the relation.
○ A subset K of R is a superkey for r if no two distinct tuples of r have the same values on all
attributes in K. If K is a superkey, then so is any superset of K.
● Minimal superkeys are called candidate keys.
○ It is possible that several distinct sets of attributes could serve as a candidate key.
● We shall use the term primary key to denote a candidate key that is chosen
by the database designer as the principal means of identifying tuples within a
relation.
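The definitions above can be checked mechanically on a concrete relation instance. The following is a minimal sketch, assuming a small hypothetical instance of instructor(ID, name, dept_name); the helper names `is_superkey` and `candidate_keys` are illustrative. Note that a key is a constraint on every legal instance of the schema, so a check on one instance can refute a key but cannot prove it.

```python
from itertools import combinations

# Hypothetical instance of instructor(ID, name, dept_name); illustrative rows only.
ATTRS = ("ID", "name", "dept_name")
ROWS = [
    ("10101", "Srinivasan", "Comp. Sci."),
    ("45565", "Katz", "Comp. Sci."),
    ("32343", "El Said", "History"),
]

def is_superkey(attrs, rows, key):
    """True if no two rows of this instance agree on every attribute in `key`."""
    idx = [attrs.index(a) for a in key]
    projected = [tuple(row[i] for i in idx) for row in rows]
    return len(set(projected)) == len(projected)

def candidate_keys(attrs, rows):
    """Minimal superkeys of this instance, smallest attribute sets first."""
    found = []
    for size in range(1, len(attrs) + 1):
        for key in combinations(attrs, size):
            # Skip any set that contains an already-found (smaller) key: it is not minimal.
            if any(set(k) <= set(key) for k in found):
                continue
            if is_superkey(attrs, rows, key):
                found.append(key)
    return found

keys = candidate_keys(ATTRS, ROWS)
```

In this instance both `("ID",)` and `("name",)` come out as candidate keys; choosing ID as the primary key is a design decision, since two instructors could later share a name.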
Foreign Key and Referential Integrity
● A key (whether primary, candidate, or super) is a property of the entire
relation, rather than of the individual tuples. Any two individual tuples in the
relation are prohibited from having the same value on the key attributes at the
same time.
● A relation, say r1, may include among its attributes the primary key of another
relation, say r2. This attribute is called a foreign key from r1, referencing r2.
● A referential integrity constraint requires that the values appearing in
specified attributes of any tuple in the referencing relation also appear in
specified attributes of at least one tuple in the referenced relation
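A referential integrity constraint can be observed with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the two tables are reduced versions of department and instructor, the helper `try_insert` is hypothetical, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE department (dept_name TEXT PRIMARY KEY, building TEXT)")
conn.execute(
    "CREATE TABLE instructor ("
    " id TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " dept_name TEXT REFERENCES department(dept_name))"  # foreign key from instructor
)
conn.execute("INSERT INTO department VALUES ('Physics', 'Watson')")
conn.execute("INSERT INTO instructor VALUES ('22222', 'Einstein', 'Physics')")

def try_insert(row):
    """Return True if the instructor row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO instructor VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(("33456", "Gold", "Physics"))        # referenced tuple exists
rejected = not try_insert(("99999", "Nobody", "Chemistry"))  # no such department tuple
```

The second insert fails because 'Chemistry' does not appear in the referenced relation, which is exactly the condition the constraint states.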
Keys in relations
1.1.5 Tuples
● In general, a row in a table represents a relationship among a set of values.
● In mathematical terminology, a tuple is simply a sequence (or list) of values.
● A relationship between n values is represented mathematically by an n-tuple
of values, i.e., a tuple with n values, which corresponds to a row in a table.
1.2 Relational algebra
● A query language is a language in which a user requests information from
the database.
● Query languages can be categorized as either procedural or
nonprocedural.
○ In a procedural language, the user instructs the system to perform a sequence of operations
on the database to compute the desired result.
○ In a nonprocedural language, the user describes the desired information without giving a
specific procedure for obtaining that information.
● The relational algebra is procedural, whereas the tuple relational calculus
and domain relational calculus are nonprocedural.
● The relational algebra consists of a set of operations that take one or two
relations as input and produce a new relation as their result.
Relational Operations
● Basic relational operations are:
○ Unary
■ Selection σ (sigma)
■ Projection π (pi)
○ Unary extended
■ Rename ρ (rho)
■ Duplicate elimination 𝛿 (delta)
■ Ordering 𝜏 (tau)
■ Aggregation 𝛾 (gamma)
○ Binary (set theory)
■ Union ∪
■ Intersection ∩
■ Difference -
○ Binary
■ Cartesian product ⨯
■ Join ⨝
■ Natural join *
■ Outer join
● Right ⟖
● Left ⟕
● Full ⟗
■ Division ÷
Relational unary operations (1/2)
● Selection: σ budget >= 90000 (department)
dept_name    building   budget
Biology      Watson     90000
Comp. Sci.   Taylor     100000
Finance      Painter    120000
● Duplicate elimination over a projection: 𝛿 building (π building (department))
building
Watson
Taylor
Painter
Packard
● Aggregation grouped by building: building 𝛾 average(budget) (department)
building   average(budget)
Packard    65000
Painter    120000
Taylor     92500
Watson     80000
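The unary operators can be sketched in a few lines of Python over the department relation that appears in these slides. This is an illustrative model, not a database engine: relations are lists of dicts, and the function names `select`, `project`, `distinct` and `group_average` are assumptions made for this example.

```python
from collections import defaultdict

# The department relation, one dict per tuple.
department = [
    {"dept_name": "Biology", "building": "Watson", "budget": 90000},
    {"dept_name": "Comp. Sci.", "building": "Taylor", "budget": 100000},
    {"dept_name": "Elec. Eng.", "building": "Taylor", "budget": 85000},
    {"dept_name": "Finance", "building": "Painter", "budget": 120000},
    {"dept_name": "History", "building": "Packard", "budget": 50000},
    {"dept_name": "Music", "building": "Packard", "budget": 80000},
    {"dept_name": "Physics", "building": "Watson", "budget": 70000},
]

def select(relation, predicate):
    """sigma: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attrs):
    """pi: keep only the listed attributes (duplicates are kept here)."""
    return [{a: t[a] for a in attrs} for t in relation]

def distinct(relation):
    """delta: eliminate duplicate tuples, keeping first-seen order."""
    seen, out = set(), []
    for t in relation:
        key = tuple(sorted(t.items()))
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out

def group_average(relation, group_attr, value_attr):
    """gamma: average of value_attr for each distinct value of group_attr."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[group_attr]].append(t[value_attr])
    return {g: sum(v) / len(v) for g, v in groups.items()}

rich = select(department, lambda t: t["budget"] >= 90000)
buildings = distinct(project(department, ["building"]))
averages = group_average(department, "building", "budget")
```

Each function takes a relation and returns a new relation (or, for the aggregation, a mapping from group to value), which mirrors the closure property of the algebra: results can be fed into further operators.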
1.2.1 Set operations on relations
department
dept_name    building   budget
Biology      Watson     90000
Comp. Sci.   Taylor     100000
Elec. Eng.   Taylor     85000
Finance      Painter    120000
History      Packard    50000
Music        Packard    80000
Physics      Watson     70000
instructor
ID      name         dept_name    salary
22222   Einstein     Physics      95000
12121   Wu           Finance      90000
32343   El Said      History      60000
45565   Katz         Comp. Sci.   75000
98345   Kim          Elec. Eng.   80000
10101   Srinivasan   Comp. Sci.   65000
58583   Califieri    History      62000
83821   Brandt       Comp. Sci.   92000
33456   Gold         Physics      87000
76543   Singh        Finance      80000
● Union: π dept_name (department) ∪ π dept_name (instructor)
dept_name
Biology
Comp. Sci.
Elec. Eng.
Finance
History
Music
Physics
● Intersection: π dept_name (department) ∩ π dept_name (instructor)
dept_name
Comp. Sci.
Elec. Eng.
Finance
History
Physics
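Because a projection under set semantics is just a set of values, the set operations map directly onto Python's `set` type. A small sketch using the dept_name values of the two relations in this slide (the variable names are illustrative):

```python
# dept_name values of department (one per tuple).
department_names = {
    "Biology", "Comp. Sci.", "Elec. Eng.", "Finance", "History", "Music", "Physics",
}
# dept_name values of instructor, in tuple order; duplicates are still present.
instructor_column = [
    "Physics", "Finance", "History", "Comp. Sci.", "Elec. Eng.",
    "Comp. Sci.", "History", "Comp. Sci.", "Physics", "Finance",
]
instructor_names = set(instructor_column)  # projection as a set removes duplicates

union = department_names | instructor_names          # ∪
intersection = department_names & instructor_names   # ∩
difference = department_names - instructor_names     # departments with no instructor
```

The difference is not shown in the slide; here it yields the departments that have no instructor tuple.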
1.2.3 Naming and renaming
● Unlike relations in the database, the results of relational-algebra expressions
do not have a name that we can use to refer to them.
○ Assume that a relational-algebra expression E has arity n.
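○ Then, the expression ρ x(A1,A2,...,An) (E) returns the result of expression E under the name x,
and with the attributes renamed to A1,A2,...,An.
○ Example: the salaries that are lower than at least one other instructor's salary, using d as a
second name for instructor:
π instructor.salary (σ instructor.salary < d.salary (instructor × ρ d (instructor)))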
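The renaming expression above pairs every instructor tuple with every tuple of a second, renamed copy of the same relation. A minimal Python sketch, assuming the instructor tuples shown in section 1.2.1; the constant `SALARY` and the variable names are illustrative:

```python
from itertools import product

# (ID, name, dept_name, salary) tuples of the instructor relation.
instructor = [
    ("22222", "Einstein", "Physics", 95000),
    ("12121", "Wu", "Finance", 90000),
    ("32343", "El Said", "History", 60000),
    ("45565", "Katz", "Comp. Sci.", 75000),
    ("98345", "Kim", "Elec. Eng.", 80000),
    ("10101", "Srinivasan", "Comp. Sci.", 65000),
    ("58583", "Califieri", "History", 62000),
    ("83821", "Brandt", "Comp. Sci.", 92000),
    ("33456", "Gold", "Physics", 87000),
    ("76543", "Singh", "Finance", 80000),
]
SALARY = 3  # position of the salary attribute in each tuple

# Cartesian product of instructor with its renamed copy: the loop variable `d`
# plays the role of the name d introduced by the rename operator.
lower_salaries = {
    i[SALARY]
    for i, d in product(instructor, instructor)
    if i[SALARY] < d[SALARY]
}

# Subtracting from all salaries leaves only the highest one.
highest = {t[SALARY] for t in instructor} - lower_salaries
```

Without the rename there would be no way to tell the two copies of salary apart in the selection condition; in Python the two loop variables serve that purpose.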
1.3 Relational database design
● In general, the goal of relational database design is to generate a set of
relation schemas that allows us to store information without unnecessary
redundancy, yet also allows us to retrieve information easily.
● A real-world database has a large number of schemas and an even larger
number of attributes.
○ The number of tuples can be in the millions or higher.
○ Discovering repetition would be costly.
1.3.1 Functional dependencies
● A method for designing a relational database is to use a process commonly
known as normalization. The approach is to design schemas that are in an
appropriate normal form.
● In a specification of functional requirements, users describe the kinds of
operations (or transactions) that will be performed on the data.
● Therefore, we need to allow the database designer to specify rules such as
“each specific value for deptname corresponds to at most one budget”.
● These rules are specified as functional dependencies:
deptname → budget
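Whether a functional dependency is satisfied by a given instance can be tested directly from its definition: two tuples that agree on the left side must agree on the right side. The sketch below assumes a few hypothetical rows combining instructor and department data; `fd_holds` is an illustrative helper, and passing the test on one instance does not prove that the dependency holds on the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether lhs -> rhs is satisfied by this instance (rows are dicts)."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # setdefault stores the first right-hand value seen for this left-hand value.
        if seen.setdefault(left, right) != right:
            return False
    return True

# Hypothetical rows with the attributes ID, deptname, salary, budget.
rows = [
    {"ID": "45565", "deptname": "Comp. Sci.", "salary": 75000, "budget": 100000},
    {"ID": "10101", "deptname": "Comp. Sci.", "salary": 65000, "budget": 100000},
    {"ID": "22222", "deptname": "Physics", "salary": 95000, "budget": 70000},
]

dept_determines_budget = fd_holds(rows, ["deptname"], ["budget"])
dept_determines_salary = fd_holds(rows, ["deptname"], ["salary"])
```

Here deptname → budget is satisfied, while deptname → salary is not, because the two Comp. Sci. rows carry different salaries.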
Normalization process
1.3.2 First normal form
● In the relational model, we formalize the idea that attributes do not have any
substructure. A domain is atomic if elements of the domain are considered to
be indivisible units.
● We say that a relation schema R is in first normal form (1NF) if the domains
of all attributes of R are atomic.
● The use of set-valued attributes can lead to designs with redundant storage of
data, which in turn can result in inconsistencies.
Example
1.3.3 Second normal form
● Some of the most commonly used types of real-world constraints can be
represented formally as keys (superkeys, candidate keys and primary keys),
or as functional dependencies.
● Using the functional-dependency notation, we say that K is a superkey of r(R)
if the functional dependency K→R holds on r(R).
● Functional dependencies allow us to express constraints that we cannot
express with superkeys. For example, consider the schema:
instdept(ID,name,salary,deptname,building,budget)
● in which the functional dependency deptname→budget holds because for
each department (identified by deptname) there is a unique budget amount.
Example
1.3.4 Third normal form
● A relation schema R is in third normal form with respect to a set F of
functional dependencies if, for all functional dependencies of the form x → y,
where x ⊆ R and y ⊆ R, at least one of the following holds:
○ x → y is a trivial functional dependency.
○ x is a superkey for R.
○ Each attribute A in y − x is contained in a candidate key for R.
Example
1.3.5 Other normal forms
● One of the more desirable normal forms that we can obtain is Boyce–Codd
normal form (BCNF). It eliminates all redundancy that can be discovered
based on functional dependencies.
● A relation schema R is in BCNF with respect to a set F of functional
dependencies if, for all functional dependencies of the form x → y, where x ⊆
R and y ⊆ R, at least one of the following holds:
○ x → y is a trivial functional dependency (that is, y ⊆ x).
○ x is a superkey for schema R.
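The BCNF condition can be tested with the attribute-closure algorithm: x is a superkey exactly when the closure of x under F contains every attribute of R. The following is a sketch, assuming the instdept schema from section 1.3.3 and a set F made of deptname → building, budget plus an assumed dependency ID → name, salary, deptname (not stated in these slides). Checking only the dependencies listed in F is enough to decide whether R itself is in BCNF.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs whenever lhs is already in the closure and rhs adds something.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Dependencies in `fds` that are nontrivial and whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not set(schema) <= closure(lhs, fds)
    ]

R = {"ID", "name", "salary", "deptname", "building", "budget"}
F = [
    (frozenset({"ID"}), frozenset({"name", "salary", "deptname"})),  # assumed dependency
    (frozenset({"deptname"}), frozenset({"building", "budget"})),
]

violations = bcnf_violations(R, F)
```

The only violation reported is deptname → building, budget: deptname is not a superkey of instdept, which is the redundancy that decomposing the schema into instructor and department removes.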
1.4 SQL basics
● IBM developed the original version of SQL, originally called Sequel, as part of
the System R project in the early 1970s.
● The Sequel language has evolved since then, and its name has changed to
SQL (Structured Query Language).
● In 1986, the American National Standards Institute (ANSI) and the
International Organization for Standardization (ISO) published an SQL
standard, called SQL-86.
● ANSI published an extended standard for SQL, SQL-89, in 1989. The next
version of the standard was SQL-92 standard, followed by SQL:1999,
SQL:2003, SQL:2006, and most recently SQL:2008.
SQL issues
● The SQL language has several parts:
○ Data-definition language (DDL). The SQL DDL provides commands for defining relation
schemas, deleting relations, and modifying relation schemas.
○ Data-manipulation language (DML). The SQL DML provides the ability to query information
from the database and to insert tuples into, delete tuples from, and modify tuples in the
database.
○ Integrity. The SQL DDL includes commands for specifying integrity constraints that the data
stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
○ View definition. The SQL DDL includes commands for defining views.
○ Transaction control. SQL includes commands for specifying the beginning and ending of
transactions.
○ Authorization. The SQL DDL includes commands for specifying access rights to relations and
views.
1.4.1 Defining a relation schema
● The SQL DDL allows specification of not only a set of relations, but also
information about each relation, including:
○ The schema for each relation.
○ The types of values associated with each attribute.
○ The integrity constraints.
○ The set of indices to be maintained for each relation.
○ The security and authorization information for each relation.
○ The physical storage structure of each relation on disk.
Create table DDL
● SQL defines a relation by using the create table command:
○ CREATE TABLE [schema.]table
(col_name datatype [DEFAULT expr][column_constraint],
...
[table_constraint][,...]);
○ column_constraint -> NOT NULL | [CONSTRAINT name] UNIQUE | PRIMARY
KEY | CHECK (condition) | REFERENCES table_ref[(col_ref)] [ ON
{DELETE | UPDATE} {CASCADE | SET NULL | NO ACTION | SET DEFAULT}]
○ table_constraint -> [CONSTRAINT name] UNIQUE (col_name[,
col_name...]) | PRIMARY KEY (col_name[, col_name...]) | CHECK
(condition) | FOREIGN KEY (col_name[, col_name...]) REFERENCES
table_ref[(col_ref[, col_ref...])] [ ON {DELETE | UPDATE} {CASCADE |
SET NULL | NO ACTION | SET DEFAULT}]
Basic Types
● The SQL standard supports a variety of built-in types, including:
○ char[acter](n): A fixed-length character string with user-specified length n.
○ varchar(n): A variable-length character string with user-specified maximum length n.
○ int[eger]: An integer (a finite subset of the integers that is machine dependent).
○ smallint: A small integer (a machine-dependent subset of the integer type).
○ numeric (p,d): A fixed-point number with user-specified precision. The number consists of p
digits (plus a sign), and d of the p digits are to the right of the decimal point.
○ real, double precision: Floating-point and double-precision floating-point numbers with
machine-dependent precision.
○ float(n): A floating-point number, with precision of at least n digits.
Constraints
● DEFAULT. Specify a default value for a column during an insert.
● NOT NULL. Ensures that null values are not permitted for the column.
● UNIQUE. Requires that every value in a column or set of columns (key)
be unique.
● PRIMARY KEY. Creates a primary key for the table. Only one primary key can be created for each
table.
● CHECK. Defines a condition that each row must satisfy
● FOREIGN KEY. Designates a column or combination of columns as a foreign key and establishes a
relationship with a primary key or a unique key in the same table or a different table.
○ ON DELETE | UPDATE CASCADE: Deletes or updates the dependent rows in the child table when the referenced row in
the parent table is deleted or updated.
○ ON DELETE | UPDATE SET NULL: Sets the dependent foreign key values to null.
○ ON DELETE | UPDATE SET DEFAULT: Sets the dependent foreign key values to the default value of the column.
○ The default behavior is called the restrict rule, which disallows the update or deletion of referenced data.
Example
create table department (
    deptname varchar(20),
    building varchar(15),
    budget numeric(12,2),
    constraint chk_budg check(budget > 0.0),
    primary key(deptname)
);
create table instructor(
    ID varchar(5),
    name varchar(20) not null,
    deptname varchar(20),
    salary numeric(8,2) default 100.00,
    primary key(ID),
    unique(name),
    foreign key(deptname) references department
        on delete no action
        on update cascade
);
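The two table definitions in this example can be tried out with Python's built-in `sqlite3` module, which accepts this DDL as written. The sketch below is one way to observe the constraints in action; the inserted rows and the helper `violates` are hypothetical, and other database systems may report violations differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce foreign keys
conn.executescript("""
create table department (
    deptname varchar(20),
    building varchar(15),
    budget numeric(12,2),
    constraint chk_budg check (budget > 0.0),
    primary key (deptname)
);
create table instructor (
    ID varchar(5),
    name varchar(20) not null,
    deptname varchar(20),
    salary numeric(8,2) default 100.00,
    primary key (ID),
    unique (name),
    foreign key (deptname) references department
        on delete no action
        on update cascade
);
insert into department values ('Physics', 'Watson', 70000);
insert into instructor (ID, name, deptname) values ('22222', 'Einstein', 'Physics');
""")

def violates(sql):
    """Return True if executing `sql` is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

bad_budget = violates("insert into department values ('Music', 'Packard', -1)")    # CHECK
dup_name = violates("insert into instructor (ID, name) values ('33333', 'Einstein')")  # UNIQUE

# ON UPDATE CASCADE: renaming the department also updates the referencing row.
conn.execute("update department set deptname = 'Physics Dept.' where deptname = 'Physics'")
row = conn.execute("select deptname, salary from instructor where ID = '22222'").fetchone()
```

After the update, the instructor row references 'Physics Dept.' and its salary is the DEFAULT value, since no salary was supplied on insert.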
1.4.2 Database modifications
● You may need to change the table structure for any of the following reasons:
• A column was omitted.
• A column definition needs to be changed.
• A column needs to be removed.
● Use the ALTER TABLE statement:
○ ALTER TABLE [schema.]table
[ADD col_name col_constraint]
[MODIFY col_name type col_constraint]
[ADD table_constraint]
[DROP PRIMARY KEY | UNIQUE | CONSTRAINT constraint_name [ CASCADE]]
Drop a table
● When you drop a table:
• All data and structure in the table are deleted.
• Any pending transactions are committed.
• All indexes are dropped.
• All constraints are dropped.
Data Manipulation Language operations
● A DML statement is executed when you:
○ Add new rows to a table
○ Modify existing rows in a table
○ Remove existing rows from a table
● INSERT Statement Syntax
INSERT INTO table [(column [, column...])]
VALUES (value [, value...]);
● Insert a new row containing values for each column.
○ List values in the default order of the columns in the table.
○ Optionally, list the columns in the INSERT clause.
○ Enclose character and date values in single quotation marks
Example
INSERT INTO departments(department_id,
department_name, manager_id, location_id)
VALUES (70, 'Public Relations', 100, 1700);
● Inserting Rows with Null Value
INSERT INTO departments (department_id, department_name )
VALUES (30, 'Purchasing');
Changing Data in a Table
● Modify existing rows with the UPDATE statement:
UPDATE table
SET column = value [, column = value, ...]
[WHERE condition];
● Example:
UPDATE employees
SET department_id = 70
WHERE employee_id = 113;
Removing a Row from a Table
● Remove existing rows from a table by using the DELETE statement:
DELETE [FROM] table
[WHERE condition];
● Example
DELETE FROM departments
WHERE department_name = 'Finance';
● TRUNCATE Statement
● Removes all rows from a table, leaving the table empty
and the table structure intact
TRUNCATE TABLE table_name;
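The INSERT, UPDATE and DELETE statements from the preceding slides can be run end to end with Python's `sqlite3` module. This is a sketch under assumptions: the column types of departments are guessed from the example values, and the UPDATE targets departments rather than employees to keep the example to one table. SQLite has no TRUNCATE statement; a DELETE without a WHERE clause empties a table there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema for the departments table used in the examples.
conn.execute(
    "CREATE TABLE departments ("
    " department_id INTEGER PRIMARY KEY,"
    " department_name TEXT NOT NULL,"
    " manager_id INTEGER,"
    " location_id INTEGER)"
)

# Insert a full row, then a row whose unlisted columns become NULL.
conn.execute(
    "INSERT INTO departments (department_id, department_name, manager_id, location_id)"
    " VALUES (70, 'Public Relations', 100, 1700)"
)
conn.execute(
    "INSERT INTO departments (department_id, department_name) VALUES (30, 'Purchasing')"
)

# Modify one row, then remove another; rowcount reports how many rows were affected.
conn.execute("UPDATE departments SET location_id = 1800 WHERE department_id = 30")
deleted = conn.execute(
    "DELETE FROM departments WHERE department_name = 'Public Relations'"
).rowcount

remaining = conn.execute("SELECT * FROM departments").fetchall()
```

The remaining row shows `None` for manager_id, which is how the module returns the NULL left by the second INSERT.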
1.4.3 Simple queries
● Basic SELECT Statement
SELECT *|{[DISTINCT] column|expression [alias],...}
FROM table;
● SELECT identifies the columns to be displayed.
● FROM identifies the table containing those columns
● Example:
SELECT department_id, location_id
FROM departments;
Writing SQL Statements
● SQL statements are not case sensitive.
● SQL statements can be on one or more lines.
● Keywords cannot be abbreviated or split across lines.
● Clauses are usually placed on separate lines
● Indents are used to enhance readability.
● Semicolons (;) are required if you execute multiple SQL statements
Arithmetic Expressions
● Create expressions with number and date data by using arithmetic operators.
* Multiply
/ Divide
- Subtract
+ Add
● Example:
SELECT last_name, salary, salary + 300
FROM employees;
Defining a Column Alias
● A column alias:
○ Renames a column heading
○ Is useful with calculations
○ Immediately follows the column name (There can also be the optional AS keyword between
the column name and alias.)
○ Requires double quotation marks if it contains spaces or special characters or if it is case
sensitive
● Example:
SELECT last_name "Name" , salary*12 "Annual Salary"
FROM employees;
SELECT last_name AS name, commission_pct comm
FROM employees;
Duplicate Rows
● Use keyword DISTINCT to avoid duplicate rows:
SELECT DISTINCT department_id
FROM employees;
Limiting the Rows That Are Selected
● Restrict the rows that are returned by using the WHERE clause:
• The WHERE clause follows the FROM clause.
SELECT *|{[DISTINCT] column|expression [alias],...}
FROM table
[WHERE condition(s)];
● Example:
SELECT employee_id, last_name, job_id, department_id
FROM employees
WHERE department_id = 90 ;
Comparison Conditions
< Less than
<= Less than or equal to
>= Greater than or equal to
> Greater than
= Equal to
<> Not equal to
BETWEEN ... AND ... Between two values (inclusive)
IN(set) Match any of a list of values
LIKE Match a character pattern
IS NULL Is a null value
Logical Conditions
NOT Returns TRUE if the following condition is false
OR Returns TRUE if either component condition is true
AND Returns TRUE if both component conditions are true
Example:
SELECT employee_id, last_name, job_id, salary
FROM employees
WHERE salary >=10000
AND job_id LIKE '%MAN%' ;
Using the ORDER BY Clause
● Sort retrieved rows with the ORDER BY clause:
○ ASC: ascending order, default
○ DESC: descending order
● The ORDER BY clause comes last in the SELECT statement
● Example:
SELECT last_name, job_id, department_id, hire_date
FROM employees
ORDER BY hire_date ;
SELECT last_name, department_id, salary
FROM employees
ORDER BY department_id, salary DESC;
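The WHERE, logical-condition and ORDER BY slides can be combined in one runnable query. A minimal sketch with Python's `sqlite3`, assuming a cut-down employees table with hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    " employee_id INTEGER, last_name TEXT, job_id TEXT,"
    " salary INTEGER, department_id INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
    [
        (100, "King", "AD_PRES", 24000, 90),
        (145, "Russell", "SA_MAN", 14000, 80),
        (149, "Zlotkey", "SA_MAN", 10500, 80),
        (124, "Mourgos", "ST_MAN", 5800, 50),
        (174, "Abel", "SA_REP", 11000, 80),
        (201, "Hartstein", "MK_MAN", 13000, 20),
    ],
)

# Both conditions must hold (AND); rows are then sorted by department,
# and by descending salary within each department.
rows = conn.execute(
    "SELECT last_name, department_id, salary"
    " FROM employees"
    " WHERE salary >= 10000 AND job_id LIKE '%MAN%'"
    " ORDER BY department_id, salary DESC"
).fetchall()
```

Mourgos matches the LIKE pattern but fails the salary condition, and King and Abel pass the salary condition but fail the pattern, so only three rows remain.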
1.4.4 Subqueries
● The subquery (inner query) executes once before the main query (outer
query).
• The result of the subquery is used by the main query.
● Syntax:
SELECT select_list
FROM table
WHERE expr operator
(SELECT select_list
FROM table);
Example
SELECT last_name, salary
FROM employees
WHERE salary >
(SELECT salary
FROM employees
WHERE last_name = 'Abel');
● Enclose subqueries in parentheses.
● Place subqueries on the right side of the comparison condition.
● Use single-row operators with single-row subqueries, and use multiple-row
operators with multiple-row subqueries.
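The single-row subquery from this example can be executed as follows. The sketch assumes a two-column employees table with hypothetical salaries, and adds an ORDER BY only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, salary INTEGER)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("King", 24000), ("Russell", 14000), ("Abel", 11000), ("Zlotkey", 10500)],
)

# The inner query returns exactly one value (Abel's salary), so the
# single-row operator > is appropriate in the outer WHERE clause.
rows = conn.execute(
    "SELECT last_name, salary"
    " FROM employees"
    " WHERE salary > (SELECT salary FROM employees WHERE last_name = 'Abel')"
    " ORDER BY salary DESC"
).fetchall()
```

If several employees were named Abel, the inner query would return more than one row and a multiple-row operator such as IN or > ANY would be the appropriate choice.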
1.4.5 Aggregation operators
● Functions that give results over some column
○ AVG
○ COUNT
○ MAX
○ MIN
○ SUM
● Functions AVG and SUM work only over numeric data
● Example:
SELECT AVG(salary), MAX(salary),
MIN(salary), SUM(salary)
FROM employees
WHERE job_id LIKE '%REP%';
Functions
● MIN and MAX work for numeric, character, and date data types
SELECT MIN(hire_date), MAX(hire_date)
FROM employees;
● COUNT(DISTINCT expr) returns the number of distinct non-null values of the
expr
SELECT COUNT(DISTINCT department_id)
FROM employees;
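The way the aggregate functions treat null values and duplicates is easiest to see on a small table. A sketch with Python's `sqlite3`, assuming hypothetical rows in which one employee has no department and hire dates are stored as ISO-8601 text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, department_id INTEGER, hire_date TEXT)")
# Hypothetical sample rows; Grant has a NULL department_id.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("King", 90, "2003-06-17"),
        ("Kochhar", 90, "2005-09-21"),
        ("Russell", 80, "2004-10-01"),
        ("Abel", 80, "2004-05-11"),
        ("Grant", None, "2007-05-24"),
    ],
)

total, with_dept, distinct_depts, first_hire, last_hire = conn.execute(
    "SELECT COUNT(*),"                        # counts rows
    "       COUNT(department_id),"            # counts non-null values
    "       COUNT(DISTINCT department_id),"   # counts distinct non-null values
    "       MIN(hire_date), MAX(hire_date)"
    " FROM employees"
).fetchone()
```

COUNT(*) counts every row, COUNT(department_id) skips the null, and COUNT(DISTINCT department_id) additionally collapses the repeated 80 and 90.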
1.4.6 Grouping
● Divide the rows in a table into smaller groups by using the GROUP BY
clause:
SELECT column, group_function(column)
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[ORDER BY column];
● All columns in the SELECT list that are not in group functions must be in the
GROUP BY clause
● The GROUP BY column does not have to be in the SELECT list
Example
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id ;
● Using the GROUP BY Clause on Multiple Columns
SELECT department_id dept_id, job_id, SUM(salary)
FROM employees
GROUP BY department_id, job_id;
1.4.7 Having clause
● Restrict Group Results with the HAVING Clause
1. Rows are grouped.
2. The group function is applied.
3. Groups matching the HAVING clause are displayed.
● Syntax:
SELECT column, group_function
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[HAVING group_condition]
[ORDER BY column]
Example
SELECT department_id, MAX(salary)
FROM employees
GROUP BY department_id
HAVING MAX(salary)>10000 ;
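The difference between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) can be seen by running the example with and without a WHERE clause. A sketch with Python's `sqlite3`, assuming hypothetical salary data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department_id INTEGER, salary INTEGER)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(80, 14000), (80, 10500), (80, 11000), (90, 24000), (90, 17000), (50, 5800), (50, 8200)],
)

# HAVING only: groups whose maximum salary exceeds 10000.
having_only = conn.execute(
    "SELECT department_id, MAX(salary)"
    " FROM employees"
    " GROUP BY department_id"
    " HAVING MAX(salary) > 10000"
    " ORDER BY department_id"
).fetchall()

# WHERE removes rows first, so department 90 has no rows left to group.
where_then_having = conn.execute(
    "SELECT department_id, MAX(salary)"
    " FROM employees"
    " WHERE salary < 15000"
    " GROUP BY department_id"
    " HAVING MAX(salary) > 10000"
    " ORDER BY department_id"
).fetchall()
```

Department 50 is dropped by HAVING in both queries, while department 90 disappears only in the second one because all of its rows were removed by WHERE before any group was formed.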
Commit and Rollback operations
● With the COMMIT and ROLLBACK statements, you can:
○ Ensure data consistency
○ Preview data changes before making changes permanent
○ Group logically related operations
● Commit the changes:
DELETE FROM employees WHERE employee_id = 99999;
1 row deleted.
INSERT INTO departments VALUES (290, 'Corporate Tax',
NULL, 1700);
1 row created
COMMIT;
Commit complete
Rollback operation
● Discard all pending changes by using the ROLLBACK statement:
○ Data changes are undone.
○ Previous state of the data is restored.
○ Locks on the affected rows are released
DELETE FROM copy_emp;
20 rows deleted.
ROLLBACK ;
Rollback complete
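The COMMIT and ROLLBACK behaviour shown above can be reproduced with Python's `sqlite3` module, which by default opens a transaction implicitly before a data-modifying statement. A minimal sketch, assuming a hypothetical three-row copy_emp table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copy_emp (employee_id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany(
    "INSERT INTO copy_emp VALUES (?, ?)",
    [(1, "King"), (2, "Abel"), (3, "Grant")],  # hypothetical rows
)
conn.commit()  # make the three rows permanent

def count_rows():
    return conn.execute("SELECT COUNT(*) FROM copy_emp").fetchone()[0]

conn.execute("DELETE FROM copy_emp")  # pending change inside an open transaction
during = count_rows()                 # the same session already sees the table as empty
conn.rollback()                       # discard the pending DELETE
after_rollback = count_rows()         # previous state is restored

conn.execute("DELETE FROM copy_emp WHERE employee_id = 3")
conn.commit()                         # this deletion is now permanent
after_commit = count_rows()
```

The session that issued the DELETE can preview its effect before deciding between COMMIT and ROLLBACK, which is the "preview data changes" point from the previous slide.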
Properties of a transaction
● A database transaction, by definition, must be atomic, consistent, isolated
and durable. Database practitioners often refer to these properties of
database transactions using the acronym ACID.
● Transactions provide an "all-or-nothing" proposition, stating that each
work-unit performed in a database must either complete in its entirety or have
no effect whatsoever. Further, the system must isolate each transaction from
other transactions, results must conform to existing constraints in the
database, and transactions that complete successfully must get written to
durable storage.
Isolation in a DBMS
● Isolation determines how transaction integrity is visible to other users and
systems.
● When attempting to maintain the highest level of isolation, a DBMS usually
acquires locks on data which may result in a loss of concurrency
○ A lower isolation level increases the ability of many users to access the same data at the
same time, but increases the number of concurrency effects (such as dirty reads or lost
updates) users might encounter.
○ Conversely, a higher isolation level reduces the types of concurrency effects that users may
encounter, but requires more system resources and increases the chances that one
transaction will block another
Isolation levels
● The isolation levels defined by the ANSI/ISO SQL standard are:
○ Serializable. This is the highest isolation level. A serializable execution is defined to be an
execution of the operations of concurrently executing SQL-transactions that produces the
same effect as some serial execution of those same SQL-transactions. A serial execution is
one in which each SQL-transaction executes to completion before the next SQL-transaction
begins.
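○ Repeatable reads. Data that a transaction has already read keeps the same values if it is read
again within that transaction, but phantom reads can still occur. Write skew is also possible at
this isolation level in some systems: a phenomenon where two writes are allowed to the same
column(s) in a table by two different writers (who have previously read the columns they are
updating), resulting in the column having data that is a mix of the two transactions.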
○ Read committed. Guarantees that any data read is committed at the moment it is read. It
simply restricts the reader from seeing any intermediate, uncommitted, 'dirty' read.
○ Read uncommitted. This is the lowest isolation level. In this level, dirty reads are allowed, so
one transaction may see not-yet-committed changes made by other transactions.
Read phenomena (1/3)
● The ANSI/ISO standard SQL 92 refers to three different read phenomena:
○ A dirty read (also known as an uncommitted dependency) occurs when a transaction is allowed to
read data from a row that has been modified by another running transaction and not yet committed.
Transaction 1                             Transaction 2
/* Query 1 */
SELECT age FROM users WHERE id = 1;
/* will read 20 */
                                          /* Query 2 */
                                          UPDATE users SET age = 21 WHERE id = 1;
                                          /* No commit here */
/* Query 1 */
SELECT age FROM users WHERE id = 1;
/* will read 21 */
                                          ROLLBACK; /* lock-based DIRTY READ */
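For contrast, the same interleaving can be replayed against a system that does not permit dirty reads. The sketch below uses two `sqlite3` connections to one temporary database file; it assumes SQLite's default rollback-journal mode, in which an uncommitted write made through one connection is not visible through another. The file name and variable names are illustrative.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    t1 = sqlite3.connect(path)  # plays Transaction 1 (the reader)
    t2 = sqlite3.connect(path)  # plays Transaction 2 (the writer)

    t2.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
    t2.execute("INSERT INTO users VALUES (1, 20)")
    t2.commit()

    # Query 1, first execution.
    first_read = t1.execute("SELECT age FROM users WHERE id = 1").fetchall()

    # Query 2: the update is executed but not committed.
    t2.execute("UPDATE users SET age = 21 WHERE id = 1")

    # Query 1 again: the uncommitted value 21 is not visible to the reader.
    second_read = t1.execute("SELECT age FROM users WHERE id = 1").fetchall()

    t2.rollback()
    t1.close()
    t2.close()
```

Under READ UNCOMMITTED the second read would return 21, a value that never existed in any committed state of the database once Transaction 2 rolls back.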
Read phenomena (2/3)
● A non-repeatable read occurs when, during the course of a transaction, a row is retrieved twice
and the values within the row differ between reads.
Transaction 1                             Transaction 2
/* Query 1 */
SELECT * FROM users WHERE id = 1;
                                          /* Query 2 */
                                          UPDATE users SET age = 21 WHERE id = 1;
                                          COMMIT; /* in multiversion concurrency
                                          control, or lock-based READ COMMITTED */
/* Query 1 */
SELECT * FROM users WHERE id = 1;
COMMIT; /* lock-based REPEATABLE READ */
● At the SERIALIZABLE and REPEATABLE READ isolation levels, the DBMS must return the old
value for the second SELECT. At READ COMMITTED and READ UNCOMMITTED, the DBMS may
return the updated value; this is a non-repeatable read.
Read phenomena (3/3)
● A phantom read occurs when, in the course of a transaction, new rows are added by another
transaction to the records being read.
Transaction 1                             Transaction 2
/* Query 1 */
SELECT * FROM users
WHERE age BETWEEN 10 AND 30;
                                          /* Query 2 */
                                          INSERT INTO users(id,name,age) VALUES (3,'Bob',27);
                                          COMMIT;
/* Query 1 */
SELECT * FROM users
WHERE age BETWEEN 10 AND 30;
COMMIT;
● In REPEATABLE READ mode, the range would not be locked, allowing the record to be inserted.
Isolation levels vs read phenomena
● The following table shows how a DBMS deals with different read phenomena:
Unit 2
Semistructured Data-model Basics
Syllabus
2.1 The semistructured data-model
2.1.1 Semistructured data
2.1.2 XML
2.1.3 Document Type Definitions (DTD)
2.1.4 XML schema
//@lang
Selects all attributes that are named lang
Examples
/bookstore/book[1]
Selects the first book element that is the child of the bookstore element.
/bookstore/book[last()]
Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]
Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]
Selects the first two book elements that are children of the bookstore element
//title[@lang]
Selects all the title elements that have an attribute named lang
Examples
//title[@lang='en']
Selects all the title elements that have a "lang" attribute with a value of "en"
/bookstore/book[price>35.00]
Selects all the book elements of the bookstore element that have a price element with a value greater
than 35.00
/bookstore/book[price>35.00]/title
Selects all the title elements of the book elements of the bookstore element that have a price element with
a value greater than 35.00
/bookstore/*
Selects all the child element nodes of the bookstore element
Examples
//*
Selects all elements in the document
//title[@*]
Selects all title elements which have at least one attribute of any kind
//book/title | //book/price
Selects all the title AND price elements of all book elements
//title | //price
Selects all the title AND price elements in the document
/bookstore/book/title | //price
Selects all the title elements of the book element of the bookstore element AND all the price elements in
the document
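Several of the path expressions above can be tried with Python's `xml.etree.ElementTree`, which implements only a subset of XPath. The sketch assumes a hypothetical bookstore document (the slides do not show books.xml itself) and uses paths relative to the root element, since ElementTree does not accept absolute paths on an element. Numeric predicates such as price>35.00 and the | operator are outside the supported subset, so the price filter is written in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document.
BOOKS = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(BOOKS)  # root is the bookstore element

first = root.find("book[1]/title").text                  # /bookstore/book[1]
last = root.find("book[last()]/title").text              # /bookstore/book[last()]
second_to_last = root.find("book[last()-1]/title").text  # /bookstore/book[last()-1]

with_lang = [t.text for t in root.findall(".//title[@lang]")]      # //title[@lang]
english = [t.text for t in root.findall(".//title[@lang='en']")]   # //title[@lang='en']

# /bookstore/book[price>35.00]/title, with the comparison done in Python.
expensive = [
    b.find("title").text
    for b in root.findall("book")
    if float(b.find("price").text) > 35.00
]
```

A full XPath engine would evaluate every expression in these slides directly; the standard-library subset is enough to check the positional and attribute predicates.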
2.2.2 XQuery
● XQuery is a language for finding and extracting elements and attributes from
XML documents.
○ XQuery for XML is like SQL for databases
○ XQuery is built on XPath expressions
○ XQuery is supported by all major databases
○ XQuery is a W3C Recommendation
● XQuery can be used to:
○ Extract information to use in a Web Service
○ Generate summary reports
○ Transform XML data to XHTML
○ Search Web documents for relevant information
XQuery processing
● XQuery uses path expressions to navigate through elements in an XML
document.
● XQuery uses predicates to limit the extracted data from XML documents.
● Example. The following predicate is used to select all the book elements
under the bookstore element that have a price element with a value that is
less than 30:
doc("books.xml")/bookstore/book[price<30]
● books.xml is the file to be used, and the doc() function opens it.
FLWOR Expressions
● FLWOR (pronounced "flower") is an acronym for "For, Let, Where, Order by,
Return".
For - selects a sequence of nodes
Let - binds a sequence to a variable
Where - filters the nodes
Order by - sorts the nodes
Return - what to return (gets evaluated once for every node)
● Example. The following FLWOR expression selects the title elements of the books whose price
is greater than 30:
for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title
XQuery Basic Syntax Rules
● Some basic syntax rules:
○ XQuery is case-sensitive
○ XQuery elements, attributes, and variables must be valid XML names
○ An XQuery string value can be in single or double quotes
○ An XQuery variable is defined with a $ followed by a name, e.g. $bookstore
○ XQuery comments are delimited by (: and :), e.g. (: XQuery Comment :)
● XQuery Conditional Expressions
○ "If-Then-Else" expressions are allowed in XQuery.
for $x in doc("books.xml")/bookstore/book
return if ($x/@category="CHILDREN")
then <child>{data($x/title)}</child>
else <adult>{data($x/title)}</adult>
XQuery Comparisons
● In XQuery there are two ways of comparing values.
1. General comparisons: =, !=, <, <=, >, >=
2. Value comparisons: eq, ne, lt, le, gt, ge
● The two kinds of comparison differ as follows:
○ The following expression returns true if any q attributes have a value greater than 10:
$bookstore//book/@q > 10
○ The following expression returns true if there is only one q attribute returned by the
expression, and its value is greater than 10. If more than one q is returned, an error occurs:
$bookstore//book/@q gt 10
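The difference can be pictured with plain lists. The Python sketch below is only an analogy, not XQuery itself: general_gt and value_gt are hypothetical helper names, and a Python list stands in for the sequence of q attribute values.

```python
def general_gt(values, threshold):
    """Mimics `seq > threshold`: true if ANY item is greater."""
    return any(v > threshold for v in values)

def value_gt(values, threshold):
    """Mimics `seq gt threshold`: the operand may hold at most one item."""
    if len(values) > 1:
        # XQuery raises an error when more than one item is returned.
        raise TypeError("value comparison needs a single item")
    if not values:
        return None  # stands in for XQuery's empty-sequence result
    return values[0] > threshold

print(general_gt([5, 12, 7], 10))  # at least one value exceeds 10
print(value_gt([12], 10))          # exactly one value, compared directly
```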
XQuery Selecting and Filtering
● The for Clause
○ The for clause binds a variable to each item returned by the in expression.
○ The for clause results in iteration.
○ There can be multiple for clauses in the same FLWOR expression.
○ To loop a specific number of times in a for clause, you may use the to keyword.
for $x in (1 to 5)
return <test>{$x}</test>
○ The at keyword can be used to count the iteration
for $x at $i in doc("books.xml")/bookstore/book/title
return <book>{$i}. {data($x)}</book>
○ More than one in expression is allowed in the for clause. Use a comma to separate
the in expressions
for $x in (10,20), $y in (100,200)
return <test>x={$x} and y={$y}</test>
FLWOR expressions
● The let Clause
○ The let clause allows variable assignments and it avoids repeating the same expression many
times. The let clause does not result in iteration.
let $x := (1 to 5)
return <test>{$x}</test>
● The where Clause
○ The where clause is used to specify one or more criteria for the result
where $x/price>30 and $x/price<100
● The order by Clause
○ The order by clause is used to specify the sort order of the result.
for $x in doc("books.xml")/bookstore/book
order by $x/@category, $x/title
return $x/title
Generating results
● The return Clause
○ The return clause specifies what is to be returned
<html>
<body>
<h1>Bookstore</h1>
<ul>
{
for $x in doc("books.xml")/bookstore/book
order by $x/title
return <li>{data($x/title)}. Category: {data($x/@category)}</li>
}
</ul>
</body>
</html>
2.2.3 Extensible Stylesheet Language
● XSL (eXtensible Stylesheet Language) is a styling language for XML.
● XSLT stands for XSL Transformations.
● XSLT is used to transform an XML document into another XML document, or
another type of document that is recognized by a browser, like HTML and
XHTML. Normally XSLT does this by transforming each XML element into an
(X)HTML element.
Declaration
● The correct way to declare an XSL style sheet according to the W3C XSLT
Recommendation is:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
● Link the XSL Style Sheet to the XML Document
Add the XSL style sheet reference to your XML document
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="myxslt.xsl"?>
XSLT tags (1/3)
● An XSL style sheet consists of one or more sets of rules that are called
templates.
● A template contains rules to apply when a specified node is matched.
● The <xsl:template> element is used to build templates.
○ The match attribute is used to associate a template with an XML element.
○ The value of the match attribute is an XPath expression
○ <xsl:template match="XPath">
● The <xsl:value-of> element can be used to extract the value of an XML
element and add it to the output stream of the transformation
○ <xsl:value-of select="XPath"/>
● The XSL <xsl:for-each> element can be used to select every XML
element of a specified node-set
○ <xsl:for-each select="XPath">
XSLT tags (2/3)
● The <xsl:sort> element is used to sort the output.
○ The select attribute indicates what XML element to sort on.
○ <xsl:sort select="element"/>
● The <xsl:if> element is used to put a conditional test against the content of
the XML file.
○ The value of the required test attribute contains the expression to be evaluated
○ <xsl:if test="expression">
...some output if the expression is true...
</xsl:if>
XSLT tags (3/3)
● The <xsl:choose> element is used in conjunction with <xsl:when> and
<xsl:otherwise> to express multiple conditional tests.
○ <xsl:choose>
<xsl:when test="expression">
... some output ...
</xsl:when>
<xsl:otherwise>
... some output ....
</xsl:otherwise>
</xsl:choose>
● The <xsl:apply-templates> element applies a template to the current
element or to the current element's child nodes.
○ <xsl:template match="XPath">
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
Example
SELECT ... FROM R WHERE a = 30 AND b = 'x'
● search key = (30, x)
● read a-dimension
● search for 30, find corresponding b-dimension index
● search for x, read corresponding disk block and get record
● select requested attributes
[Figure: index on a whose entries point to indexes on b, which point to the records of R(a, b, c)]
Useful indexes
● For which queries is this index good?
○ find records where a = 10 AND b = ‘x’ -> good
○ find records where a = 10 AND b ≥ ‘x’ -> good
○ find records where a = 10 -> bad
○ find records where b = ‘x’ -> bad
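A two-level index of this kind can be sketched as a dictionary of dictionaries. The Python code below is a minimal illustration under assumptions: the rows of R(a, b, c) are made up, list positions play the role of record pointers, and build_index, lookup and lookup_b_only are hypothetical names.

```python
from collections import defaultdict

# Hypothetical rows of R(a, b, c); the values are made up for illustration.
ROWS = [(10, "x", 2), (20, "y", 1), (30, "z", 1), (30, "x", 1), (20, "x", 2)]

def build_index(rows):
    """Outer level is keyed by a; each entry holds an inner index keyed by b."""
    index = defaultdict(lambda: defaultdict(list))
    for pos, (a, b, _c) in enumerate(rows):
        index[a][b].append(pos)  # pos plays the role of a record pointer
    return index

def lookup(index, rows, a, b):
    """Answer WHERE a = ? AND b = ? with two direct index probes."""
    inner = index.get(a, {})
    return [rows[pos] for pos in inner.get(b, [])]

def lookup_b_only(index, rows, b):
    """Answer WHERE b = ?: every inner index has to be visited."""
    hits = []
    for inner in index.values():
        hits.extend(rows[pos] for pos in inner.get(b, []))
    return hits

idx = build_index(ROWS)
print(lookup(idx, ROWS, 30, "x"))
print(lookup_b_only(idx, ROWS, "x"))
```

The second function shows why a condition on b alone gains little from this structure: it cannot skip any part of the outer level.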
3.2 Query execution
● SQL processing is the parsing, optimization, row
source generation, and execution of a SQL
statement.
○ The parsing stage involves separating the pieces of a SQL
statement into a data structure that other routines can
process.
○ During the optimization stage, the database must perform a hard
parse at least once for every unique DML statement, and it
performs the optimization during this parse.
○ The row source generator receives the optimal execution
plan from the optimizer and produces an iterative execution
plan that is usable by the rest of the database.
○ During execution, the SQL engine executes each row source
in the tree produced by the row source generator.
Overview of query execution
● Operations (steps) of query plan are represented using relational algebra
(with bag semantics)
● Describe efficient algorithms to implement the relational algebra operations
● Major approaches are scanning, hashing, sorting and indexing
● Algorithms differ depending on how much main memory is available
3.2.1 Scanning
● Reads entire contents of relation R
● Needed for doing join, union, etc.
● To find all tuples of R:
○ Table scan: if addresses of blocks
containing R are known and contiguous,
easy to retrieve the tuples
○ Index scan: if there is an index on any
attribute of R, use it to retrieve the
tuples
3.2.2 Hashing
● A bucket is a unit of storage containing one or more records (a
bucket is typically a disk block)
○ In a hash file organization, we obtain the bucket of a record directly from its search-key value
using a hash function
○ Hash function h is a function from the set of all search-key values K to the set of all bucket
addresses B
○ Hash function is used to locate records for access, insertion as well as deletion
○ Records with different search-key values may be mapped to the same bucket; thus entire
bucket has to be searched sequentially to locate a record
Example
● There are 8 buckets
○ The binary representation of the
ith letter of the alphabet is assumed
to be the integer i
○ The hash function returns
the sum of the binary representations
of the characters modulo 8
• e.g., h(Music) = 1 h(History) = 2
h(Physics) = 3 h(Elec. Eng.) = 3
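The example hash function can be checked with a few lines of Python. This is a sketch under two assumptions: letters are mapped to their position in the alphabet regardless of case, and spaces and punctuation are ignored. The hash values listed above are reproduced with 8 buckets (sum modulo 8), which is the default used here.

```python
def h(key, buckets=8):
    """Sum the alphabet positions of the letters in key, modulo buckets."""
    total = sum(ord(ch) - ord("a") + 1 for ch in key.lower() if ch.isalpha())
    return total % buckets

for name in ["Music", "History", "Physics", "Elec. Eng."]:
    print(name, h(name))
```

Physics and Elec. Eng. land in the same bucket, which illustrates why a bucket has to be searched sequentially.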
Hash Index
● Hashing can be used not only
for file organization, but also for
index-structure creation
○ A hash index organizes the search
keys, with their associated record
pointers, into a hash file structure
3.2.3 Sorting
● Two steps:
1) Create partially sorted data chunks
2) Merge the partially sorted chunks
● First step:
● Let M be the memory capacity
● Create sorted runs. Let i be 0 initially
Repeatedly do the following till the end of the relation:
(a) Read M blocks of relation into memory
(b) Sort the in-memory blocks
(c) Write sorted data to run Ri; increment i
Let the final value of i be N
Sorting (2)
● Second step: merge the runs
● Merge the runs (N-way merge). We assume (for now) that N < M.
● Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of
each run into its buffer page
repeat
Select the first record (in sort order) among all buffer pages
Write the record to the output buffer. If the output buffer is full write it to disk.
Delete the record from its input buffer page.
If the buffer page becomes empty then read the next block (if any) of the run into the buffer.
until all input buffer pages are empty
● If N ≥ M, several merge passes are required
○ In each pass, contiguous groups of M - 1 runs are merged
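The two steps can be sketched in Python as follows. This is a simplified model, not a real external sort: the relation is a list of blocks, each block is a list of records, runs are kept as in-memory lists instead of disk files, and the function names are hypothetical. heapq.merge plays the role of the N-way merge over the input buffers.

```python
import heapq

def create_runs(blocks, m):
    """Step 1: read m blocks at a time, sort them in memory, emit one run."""
    runs = []
    for start in range(0, len(blocks), m):
        chunk = [rec for block in blocks[start:start + m] for rec in block]
        runs.append(sorted(chunk))
    return runs

def merge_pass(runs, fan_in):
    """Merge contiguous groups of fan_in runs into longer runs."""
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def external_sort(blocks, m):
    """Sort with m memory blocks: m - 1 input buffers and 1 output buffer."""
    if m < 3:
        raise ValueError("need at least 3 memory blocks to merge 2 runs")
    runs = create_runs(blocks, m)
    while len(runs) > 1:          # several passes if there are too many runs
        runs = merge_pass(runs, m - 1)
    return runs[0] if runs else []

blocks = [[9, 3], [7, 1], [8, 2], [6, 4], [5, 0], [11, 10], [13, 12]]
print(external_sort(blocks, 3))
```

With 7 blocks and M = 3, step 1 produces 3 runs, so two merge passes with M - 1 = 2 input buffers are needed.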
Use sorting
3.2.4 Indexing
● Basic idea
○ Search in index is O(log₂ N)
○ Following link is O(1)
○ Each index can remain sorted
○ Create an index for each attribute which you may
use in a query
● Trade-off
○ Faster queries
○ Some redundancy
■ But this is handled by the DBMS!
■ i.e., mainly a storage capacity problem, not so
much a consistency problem
Index basics
● Indexing mechanisms used to speed up access to desired data
○ e.g., searching by a specific attribute
○ but also: joins!
■ Search Key - attribute or set of attributes used to look up records in a file
○ An index file consists of records (called index entries) of the form:
search-key pointer
○ Two basic kinds of indices:
■ Ordered indices: search keys are stored in sorted order
■ Hash indices: search keys are distributed uniformly across “buckets”
using a “hash function”
Sparse Index
● Sparse Index: contains index records for only some values
○ Applicable when records are sequentially ordered on search-key
■ To locate a record with search-key value K we:
○ Find index record with largest search-key value ≤ K
○ Search file sequentially starting at that record
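A sparse-index lookup can be sketched in Python with the bisect module. The file layout is an assumption made for illustration: records are sorted on the search key and grouped into blocks, and the index holds one entry per block (the first key of that block). The sketch picks the largest index entry that does not exceed K, so a key that appears in the index itself is found directly.

```python
from bisect import bisect_right

# Hypothetical file: records sorted on the search key, grouped into blocks.
BLOCKS = [[(10, "a"), (20, "b")], [(30, "c"), (40, "d")], [(50, "e"), (60, "f")]]
# Sparse index: one entry (the first key of the block) per block.
INDEX_KEYS = [block[0][0] for block in BLOCKS]

def lookup(key):
    """Return the payload stored under key, or None if it is absent."""
    pos = bisect_right(INDEX_KEYS, key) - 1   # largest index entry <= key
    if pos < 0:
        return None                           # key precedes the whole file
    for block in BLOCKS[pos:]:                # sequential search from there
        for k, payload in block:
            if k == key:
                return payload
            if k > key:
                return None                   # passed the place where it would be
    return None

print(lookup(40), lookup(35))
```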
Secondary Index
● Secondary index: an index whose search key is not the attribute on which the file is sequentially ordered
○ Index record points to a bucket that contains pointers to all the actual records with that
particular search-key value
○ Secondary indices have to be dense
3.3 Query optimization
3.3.1 Algebraic laws for improving query plan
● An evaluation plan defines exactly what algorithm is used for each operation,
and how the execution of the operations is coordinated
Estimating costs
● Cost difference between evaluation plans for a query can be enormous
– e.g., seconds vs. days in some cases
● Steps in cost-based query optimization
– Generate logically equivalent expressions using equivalence rules
– Annotate resultant expressions to get alternative query plans
– Choose the cheapest plan based on estimated cost
● Estimation of plan cost based on:
– Statistical information about relations. Examples:
• number of tuples, number of distinct values for an attribute
– Statistics estimation for intermediate results
• to compute cost of complex expressions
– Cost formulae for algorithms, computed using statistics
Equivalence in relational algebra
● Two relational algebra expressions are said to be equivalent if the two
expressions generate the same set of tuples on every legal database instance
– order of tuples is irrelevant
– they may yield different results on databases that violate integrity
constraints
● Equivalent results must not be a result of chance, e.g.
– SELECT name FROM employee WHERE id = '12345' → “Smith”
– SELECT name FROM employee WHERE birthday = '30.10.1974' → “Smith”
● Those results could be different on a different database instance
Equivalence rules (1)
● (1) Conjunctive selection operations can be deconstructed into a sequence of
individual selections.
σθ1∧θ2(E) = σθ1(σθ2(E))
● The selection operation distributes over the join operation under the following two conditions:
(a) If all the attributes in θ0 involve only the attributes of one of the
expressions (E1) being joined
σθ0(E1 ⨝ E2) = (σθ0(E1)) ⨝ E2
(b) If θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2
σθ1∧θ2(E1 ⨝ E2) = (σθ1(E1)) ⨝ (σθ2(E2))
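Condition (b) can be checked on a small instance. The Python sketch below is an illustration under assumptions: the relations instructor-like E1 and teaches-like E2 and their contents are made up, relations are lists of dictionaries, and the join is a naive nested-loop natural join. It confirms the equivalence on this one instance only; it is not a proof.

```python
def select(pred, rel):
    """Selection: keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]

def natural_join(r, s):
    """Naive nested-loop natural join on all shared attribute names."""
    out = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            if all(t[k] == u[k] for k in common):
                out.append({**t, **u})
    return out

def as_bag(rel):
    """Order-independent form for comparing two results."""
    return sorted(tuple(sorted(t.items())) for t in rel)

# Hypothetical data (assumption): E1(id, name, dept), E2(id, course).
E1 = [{"id": 1, "name": "Mozart", "dept": "Music"},
      {"id": 2, "name": "Einstein", "dept": "Physics"}]
E2 = [{"id": 1, "course": "MU-199"},
      {"id": 1, "course": "HIS-351"},
      {"id": 2, "course": "PHY-101"}]

theta1 = lambda t: t["dept"] == "Music"            # uses only attributes of E1
theta2 = lambda t: t["course"].startswith("MU")    # uses only attributes of E2

lhs = select(lambda t: theta1(t) and theta2(t), natural_join(E1, E2))
rhs = natural_join(select(theta1, E1), select(theta2, E2))
print(as_bag(lhs) == as_bag(rhs))
```

The right-hand side joins 1 tuple with 1 tuple instead of 2 with 3, which is the reason optimizers push selections below joins.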
Example
Equivalence rules (8)
● (9) The set operations union and intersection are commutative
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1 (but: set difference is not commutative)
● (10) Set union and intersection are associative
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
Scan Primitive
● Reads entire contents of relation R
● Needed for doing join, union, etc.
● To find all tuples of R:
● Table scan: if addresses of blocks containing R are known and contiguous,
easy to retrieve the tuples
● Index scan: if there is an index on any attribute of R, use it to retrieve the
tuples
Costs of Scan Operators
● Notation: B(R) is the number of blocks holding R, and T(R) is the number of tuples in R.
● Table scan:
○ if R is clustered, then number of disk I/Os is approx. B(R).
○ if R is not clustered, number of disk I/Os could be as large as T(R).
● Index scan: approx. same as for table scan, since the number of disk I/Os to
examine the entire index is usually much smaller than B(R).
Sort-Scan Primitive
● Produces tuples of R in sorted order w.r.t. attribute a
● Needed for sorting operator as well as helping in other algorithms
● Approaches:
○ If there is an index on a or if R is stored in sorted order of a, then use index or table scan.
○ If R fits in main memory, retrieve all tuples with table or index scan and then sort
○ Otherwise can use a secondary storage sorting algorithm
Costs of Sort-Scan
● See earlier slide for costs of table and index scans in case of clustered and
unclustered files
● Cost of secondary sorting algorithm is:
○ approx. 3B disk I/Os if R is clustered
○ approx. T + 2B disk I/Os if R is not
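The cost estimates for scan and sort-scan can be written down as small functions. This is a sketch with made-up numbers: b stands for the number of blocks of R, t for the number of tuples of R, and the function names are hypothetical.

```python
def table_scan_cost(b, t, clustered):
    """Approximate disk I/Os for reading all of R once."""
    return b if clustered else t

def sort_scan_cost(b, t, clustered):
    """Approximate disk I/Os for the secondary-storage sorting algorithm."""
    return 3 * b if clustered else t + 2 * b

# Hypothetical relation: 1,000 blocks holding 20,000 tuples.
B, T = 1_000, 20_000
print(table_scan_cost(B, T, clustered=True), table_scan_cost(B, T, clustered=False))
print(sort_scan_cost(B, T, clustered=True), sort_scan_cost(B, T, clustered=False))
```

For this hypothetical relation the unclustered case costs 20,000 instead of 1,000 I/Os for a scan, and 22,000 instead of 3,000 for a sort-scan.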
One-Pass, Tuple-at-a-Time
● These are for SELECT and PROJECT
● Algorithm:
○ read the blocks of R sequentially into an input buffer
○ perform the operation
○ move the selected/projected tuples to an output buffer
○ Requires only M ≥ 1
○ I/O cost is that of a scan (either B or T, depending on if R is clustered or not)
○ Exception! Selecting tuples that satisfy some condition on an indexed attribute can be done
faster!
One Pass, Binary Operations
● Bag union:
○ copy every tuple of R to the output, then copy every tuple of S to the output
○ only needs M ≥ 1
○ disk I/O cost is B(R) + B(S)
● For set union, set intersection, set difference, bag intersection, bag difference,
product, and natural join:
○ read smaller relation into main memory
○ use main memory search structure D to allow tuples to be inserted and found quickly
○ needs approx. min(B(R),B(S)) buffers
○ disk I/O cost is B(R) + B(S)
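For the natural join, the one-pass approach can be sketched in Python as below. The sketch assumes that the two relations share exactly one attribute (passed as key), that relations are lists of dictionaries, and that the smaller one fits in memory. A dictionary serves as the search structure D, and one_pass_join is a hypothetical name.

```python
from collections import defaultdict

def one_pass_join(r, s, key):
    """Natural join of r and s on a single shared attribute `key`."""
    small, large = (r, s) if len(r) <= len(s) else (s, r)
    table = defaultdict(list)          # the in-memory search structure D
    for t in small:                    # read the smaller relation completely
        table[t[key]].append(t)
    out = []
    for u in large:                    # stream the larger relation once
        for t in table.get(u[key], []):
            out.append({**t, **u})
    return out

# Hypothetical data for illustration.
R = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
S = [{"id": 1, "c": "x"}, {"id": 1, "c": "y"}, {"id": 3, "c": "z"}]
print(one_pass_join(R, S, "id"))
```

Each relation is read exactly once, which matches the disk I/O cost of B(R) + B(S).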
3.3.3 Cost-based plan selection
● Cost is generally measured as total elapsed time for answering query
● Many factors contribute to time cost
○ disk accesses, CPU, or even network communication
○ Typically disk access is the predominant cost, and is also relatively easy to estimate
○ Measured by taking into account
– Number of seeks * average-seek-cost
– Number of blocks read * average-block-read-cost
– Number of blocks written * average-block-write-cost
○ Cost to write a block is greater than cost to read a block
– data is read back after being written to ensure that the write was successful
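The disk-cost measure above is a weighted sum, which the following Python sketch spells out. All numbers are made up for illustration (costs in microseconds), and disk_cost is a hypothetical name.

```python
def disk_cost(seeks, blocks_read, blocks_written,
              seek_cost, read_cost, write_cost):
    """Estimated disk cost as the sum of seek, read and write components."""
    return (seeks * seek_cost
            + blocks_read * read_cost
            + blocks_written * write_cost)

# Hypothetical plan: 10 seeks, 1,000 blocks read, 100 blocks written.
# Hypothetical device: 4,000 us per seek, 100 us per read, 200 us per write.
print(disk_cost(10, 1_000, 100, seek_cost=4_000, read_cost=100, write_cost=200))
```

Here the write cost per block is set to twice the read cost to reflect that written data is read back for verification.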
3.3.4 Order of joins
● For all relations r1, r2, and r3,
(r1 ⨝ r2) ⨝ r3 = r1 ⨝ (r2 ⨝ r3 ) (Rule 6)
(r1 ⨝ r2) ⨝ r3