
3621559 THJ

Tietokannanhallintajärjestelmät
Database Management Systems
Matti Nykänen
School of Computing, University of Eastern Finland
e-mail: [email protected]
Academic year 2011-12, IV quarter

Contents
1 Introduction 1

2 The Relational Model 2


2.1 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Keys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Integrity Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Different Viewpoints to Data . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Relational Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7 Structured Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.7.1 Data Definition Language . . . . . . . . . . . . . . . . . . . . . . . 44
2.7.2 Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.7.3 Data Manipulation Language . . . . . . . . . . . . . . . . . . . . . 49

3 Client-Server Database Architecture 50


3.1 Installing and Running SimpleDB . . . . . . . . . . . . . . . . . . . . . . . 51
3.2 Using a Relational Database from Java . . . . . . . . . . . . . . . . . . . . 53
3.3 JDBC Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4 JDBC Transaction Handling . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5 Impedance Mismatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4 The Structure of the SimpleDB RDBMS Engine 64


4.1 File Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Log Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Buffer Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Transaction Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1 Database Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.2 Concurrency Control . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5 Record Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.6 Metadata Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.7 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.7.1 Query Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

4.7.2 Update Scans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.7.3 Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
4.7.4 Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.8 Parsing SQL Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
4.9 Query Execution Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.10 The Remote Database Server . . . . . . . . . . . . . . . . . . . . . . . . . 212

5 Indexing 225
5.1 Extendable Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
5.2 B+ -trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
5.3 Using an Index in a Relational Algebra Operation . . . . . . . . . . . . . . 257
5.4 Updating Indexed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

6 Query Optimization 265


6.1 Heuristic Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.2 On Cost-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 288

References
Thomas M. Connolly and Carolyn E. Begg. Database Systems: A Practical Approach to
Design, Implementation, and Management. Addison Wesley, fifth edition, 2010.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Intro-
duction to Algorithms. The MIT Press, third edition, 2009.

Ramez Elmasri and Shamkant B. Navathe. Database Systems: Models, Languages, Design,
and Application Programming. Pearson, sixth edition, 2011.

John R. Levine, Tony Mason, and Doug Brown. Lex & Yacc. O’Reilly, second edition,
1992.

Simon Peyton Jones. Beautiful concurrency. In Andy Oram and Greg Wilson, editors,
Beautiful Code, chapter 24, pages 385–406. O’Reilly, 2007.

Edward Sciore. Database Design and Implementation. Wiley, 2008.

Peter Sestoft. Java Precisely. The MIT Press, second edition, 2005.

Gerhard Weikum and Gottfried Vossen. Transactional Information Systems: Theory,
Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann,
2001.

1 Introduction
• The main questions of this course are:
What features must a Relational Database Management System (RDBMS)
have?
How can these features be implemented in an RDBMS?
• This course discusses general design principles, not the vendor-specific design issues
of RDBMSs such as MySQL, Oracle, or Microsoft Access.

The course book and software


• This course is based mainly on the following book:
Sciore Edward: Database Design and Implementation. Wiley, 2008.
It illustrates these principles via a small and simplified RDBMS called SimpleDB
developed by its author.
– It can be downloaded from http://www.cs.bc.edu/~sciore/simpledb/.
– Unfortunately Amazon offers another, non-relational DBMS with the same
name: http://aws.amazon.com/simpledb/

The wiki page for the course


• Direct your browser to the web address https://wiki.uef.fi and click the following
link chain:
tkt-wiki
Kurssien kotisivuja - Course homepages
THJ - Tietokannanhallintajärjestelmät - Database Management Systems (3621559).
• Or add its direct link https://wiki.uef.fi/pages/viewpage.action?pageId=20054083
into your bookmarks.
• This page contains these lecture handouts in weekly installments, as well as other
course material.
• Knowledge of RDBMS internals is interesting to, for instance:
Software developers whose programs communicate with RDBMSs.
For instance, they need to understand the concept of transactions and its role
in this communication.
Database Administrators (DBAs), the IT specialists responsible for keeping
the RDBMS of an organization up and running smoothly and efficiently.
If the organization is large, it has dedicated DBAs, since nowadays
organizations rely heavily on their databases; if it is small, its IT
staff also acts as DBAs.
For instance, a DBA must know what RDBMS parameters like transaction
checkpoint frequency mean and how they affect the performance level.

Figure 1: The Class Diagram for the University Database. (Sciore, 2008)

2 The Relational Model


(Sciore, 2008, part 1)

• The earlier course “Data Management” (“Tiedonhallinta” (THA) in Finnish) has
discussed the following:

1. What is the relational data model?

2. How can the class (or Entity-Relationship, or . . . ) diagram gained from information
system design be cast into relational table form?
Figure 1 shows the class diagram for an American university database, our
running example.
(We will omit its parking PERMIT class later.)

3. How can this form be implemented using a Relational Database Management
System (RDBMS) such as MySQL, Oracle, MS Access, . . . ?

• Here we revisit the relational model briefly from another viewpoint:

4. What does it expect that the RDBMS can do?

Then the rest of this course discusses how the RDBMS can do these things.

2.1 Tables
(Sciore, 2008, Chapters 2.1–2.2)

• The central feature of the relational data model is to organize data into tables.

• Moreover, the result of a query in the relational data model is always another table,
built from the stored tables.

• Each table has its own collection of columns called its attributes.

Figure 2: The Schema for the University Database. (Sciore, 2008)

• This collection is called the schema of the table.

• The collection of all the schemas of all the tables in a database is also called the
schema of this database.
(Some texts use the correct but old-fashioned plural “schemata”.)

• Figure 2 shows the schema for our example database, where each table scheme is in
the form

TABLENAME(AttrName1, AttrName2, AttrName3, . . . , AttrNamen)

which gives each of its Attributes its own Name.

• The schema is one example of metadata: Data about data.

Data is originally Latin for “the given things”.


(It is plural for “datum”, or “one given thing”.)
Meta (µετα) is originally Greek for “after”.
– When later scholars compiled Aristotle’s writings, they did not know where
to put his “first philosophy”, so they put it after physics, and started calling
it “metaphysics” instead.
– This gave “meta-” the new meaning of “what you must read first, before
you can understand the rest”.
– Hence in computer science “metadata” means the extra data which tells
how the actual data is structured. (For instance, XML.)

Requirement 1 (data and metadata). The RDBMS must be able to maintain both the
data itself and its metadata.
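Requirement 1 can be observed in most real systems, which store their metadata in ordinary tables (the system catalog). A small sketch using stock SQLite via Python's sqlite3 module, standing in for the course's SimpleDB: the schema of a freshly created table can itself be queried like data.

```python
import sqlite3

# SQLite keeps its metadata (the schema) in an ordinary table called
# sqlite_master, so metadata can be queried like any other data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (SId INTEGER, SName TEXT, GradYear INTEGER, MajorId INTEGER)"
)

# Ask the metadata: which tables exist, and what are their definitions?
for name, sql in conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
):
    print(name, "->", sql)
```

SimpleDB keeps comparable information in catalog tables of its own, as the metadata management section later discusses.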

Rows
• A table contains zero or more rows.

• Each row represents the data corresponding to some specific individual.

• Each row r of table T representing some specific individual x then tells what is
stored in the database about x with respect to the attributes a1 , a2 , a3 , . . . , an of T ,
so that the value r.ai on column ai of row r tells what x “is like” with respect to ai .

• Intuitively, such a row r says that “there is some individual x of kind T whose a1
is r.a1 and its a2 is r.a2 and its a3 is r.a3 and . . . and its an is r.an”.

• Figure 3 shows our example tables with some example rows. For instance, the first
row of STUDENT says that “there is some student whose student ID number
(“opiskelijanumero” in Finnish) is 1, whose name is Joe, whose graduation year
is 2004, and whose major subject is computer science”.

• The rows within a table are unordered. When we said that “the first row of
STUDENT” is Joe’s, we meant that the STUDENT rows were shown in a particular
order, here by student ID.

Requirement 2 (no order). The RDBMS must be able to sort the rows before showing
them to the user. The user can determine the order in which (s)he wants to see them.
However, row ordering must not affect anything other than output.

Null Values
• However, some attribute values may come later. For instance:

1. The new STUDENT is registered in the university database and assigned
his/her own student ID number at the beginning of his/her studies.

2. But the graduation year is only known at the end of his/her studies.

What is the value of the graduation year attribute during his/her studies?

• A natural solution is to mark the year as “not known (yet)”.

• The relational model provides special NULL values for such purposes.

• These NULLs behave differently than any actual values:

– Let r be a STUDENT table row of a currently studying student, so that
r.GradYear = NULL.
– Then the answer to every one of these three questions

r.GradYear < 2015
r.GradYear = 2015
r.GradYear > 2015

must be “No!” because we don’t know the actual graduation year yet. In
addition, if s is another STUDENT row, then also the three questions

r.GradYear < s.GradYear
r.GradYear = s.GradYear
r.GradYear > s.GradYear

all get the same answer “No!” too, whether s.GradYear is a known year or
NULL.
– However, this holds even when row r is s! The relational model has the concept
of NULL values in general, but not “the NULL value(s) specifically for row r ”.
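This three-way “No!” behaviour can be tried out in any SQL implementation. A small sketch with SQLite standing in for a full RDBMS (table shape simplified from Figure 3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (SId INTEGER, GradYear INTEGER)")
conn.execute("INSERT INTO STUDENT VALUES (1, NULL)")  # still studying

# Each comparison with NULL evaluates to UNKNOWN, and a WHERE clause keeps
# only the rows for which the condition is true - so no row ever matches.
for op in ("<", "=", ">"):
    hits = conn.execute(
        f"SELECT COUNT(*) FROM STUDENT WHERE GradYear {op} 2015"
    ).fetchone()[0]
    print(op, hits)  # each prints 0

# Even comparing the row against itself gives "No!": NULL = NULL is not true.
print(conn.execute(
    "SELECT COUNT(*) FROM STUDENT r, STUDENT s WHERE r.GradYear = s.GradYear"
).fetchone()[0])  # prints 0
```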

Figure 3: Some Contents of the University Database. (Sciore, 2008)

• Since NULL value behaviour is so different, some database theorists want to get rid
of them altogether. However, they are sometimes the best practical way to represent
that the information must exist, but is (yet) unknown.
• In contrast, attribute values which might or might not exist should be represented
in some other way.
– Suppose we added student mobile phone numbers into our university database.
If we added another attribute TelNo to the STUDENT table, and allowed
NULL values in it, then we would be implicitly claiming that “every student
does have a phone, but some students have kept their numbers secret”.
– A better design choice would be to add instead a new table with schema
MOBILE (SId , TelNo). (1)
∗ Then a student (represented by the ID) without a phone would have no
rows in this table. . .
∗ . . . whereas a student with many phones would have several.
∗ Moreover, since the university cannot use the information that a student
has a phone but its number is secret, the TelNo attribute can be declared
to be non-NULL.
That is, this new table represents the known mobile numbers.

Requirement 3 (NULL value constraints). The RDBMS must permit the table definition
to declare whether a particular attribute can contain NULL values or not. It must enforce
such a constraint by rejecting the insertion of a new row which would have a NULL value
for an attribute which has been declared non-NULL.

• The RDBMS maintains these declarations in its metadata alongside the table defi-
nition.
• However, we shall largely bypass NULL values and their problems in this course.
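A sketch of requirement 3 with SQLite standing in for the RDBMS: the MOBILE table of schema (1) declares TelNo non-NULL, and an insertion violating the declaration is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The known-mobile-numbers table of schema (1): TelNo is declared non-NULL.
conn.execute("CREATE TABLE MOBILE (SId INTEGER, TelNo TEXT NOT NULL)")

conn.execute("INSERT INTO MOBILE VALUES (1, '040-1234567')")  # accepted
try:
    conn.execute("INSERT INTO MOBILE VALUES (2, NULL)")       # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```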

2.2 Keys
(Sciore, 2008, Chapters 2.3–2.5)
• Intuitively, some attributes of a table identify or “name” uniquely the individual x
described by a row r whereas its other attributes describe the other qualities of
this x.
• The database table T with attributes a1 , a2 , a3 , . . . , am , b1 , b2 , b3 , . . . , bn satisfies the
functional dependency (FD)
a1 , a2 , a3 , . . . , am → bj (2)
if for all possible rows r and s that might be in T we have the following:

if r.a1 = s.a1 and r.a2 = s.a2 and r.a3 = s.a3 and . . . and r.am = s.am then also
r.bj = s.bj.

That is, the values for the attributes a1, a2, a3, . . . , am on the left-hand side (LHS) of
the FD determine what the value for the attribute bj on its right-hand side (RHS)
must be.
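This definition translates almost verbatim into code. The following helper (satisfies_fd is our own illustrative name, not from Sciore) checks whether a given set of rows, represented as dictionaries, satisfies an FD:

```python
def satisfies_fd(rows, lhs, rhs):
    """Check whether the rows satisfy the FD lhs -> rhs: whenever two rows
    agree on all the lhs attributes, they must agree on the rhs attribute."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        if key in seen and seen[key] != r[rhs]:
            return False
        seen[key] = r[rhs]
    return True

mobile = [
    {"SId": 1, "TelNo": "040-111"},
    {"SId": 1, "TelNo": "040-222"},  # one student, two phones: allowed
]
print(satisfies_fd(mobile, ["TelNo"], "SId"))   # True:  TelNo -> SId holds
print(satisfies_fd(mobile, ["SId"], "TelNo"))   # False: SId -> TelNo fails
```

Note that such a check can only refute an FD from the rows at hand, never confirm it: an FD is a statement about all rows the table might ever contain.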

• Note that this FD concerns the intended meaning of table T in the database schema,
not only the rows which T happens to contain just now.

• For instance, the MOBILE table in Eq. (1) satisfies the FD

TelNo → SId

because

if r.TelNo = s.TelNo then also r.SId = s.SId

since the mobile phone company will not give two different students r and s the same
mobile number (if we assume that two students do not share a common mobile).

• Trivially
a1 , a2 , a3 , . . . , am → ai
for every ai on its LHS.

• Transitively, if

a1 , a2 , a3 , . . . , am → b1
a1 , a2 , a3 , . . . , am → b2
a1 , a2 , a3 , . . . , am → b3
..
.
a1 , a2 , a3 , . . . , am → bn and
b1 , b2 , b3 , . . . , bn → c then also
a1 , a2 , a3 , . . . , am → c.

We can introduce vector notation in FDs to shorten such indexed sequences into

if ~a → ~b and ~b → c then also ~a → c. (3)

Candidate and Primary Keys

• Attributes a1, a2, a3, . . . , am form a candidate key of a table T if both of the following
properties hold:

1. T satisfies the FD (2) for every attribute bj of T .

2. If any of the ai is taken away from its LHS then property 1 no longer holds.
(That is, every ai on its LHS is really needed.)

• If two rows r and s share the same values for all the LHS attributes a1, a2, a3, . . . , am,
then the database cannot tell them apart:

– They share the same values also for all the other attributes bj as well, by
property 1.
– Their order in T does not matter, by requirement 2.

• Therefore we see that

– table T should really have just one copy of this row, not two; and

– each stored table should have candidate keys, to eliminate such duplicate rows.

• Once the database designer has determined the candidate keys for a new table T ,
(s)he chooses one of them as its primary key.

• None of the attributes ai of this chosen primary key is allowed to contain NULLs
by requirement 3 because they would make it impossible to check whether two rows
are two copies of the same row or not.

• What if T does not have any such “natural” candidate keys to choose?

– One solution is to say that T is “all key” and take all its attributes as the key.
– Another is to add an artificial “identifier” field to be the key.
∗ This is how the STUDENT, SECTION and ENROLL of our university
database got their Id fields.
∗ Its DEPT and COURSE have these Id fields as well, even though they
are not necessary: the department name and course title could have been
chosen as keys instead.
∗ However, UEF must have course ids, because we have both English and
Finnish titles for the same course.

Requirement 4 (key constraints). The RDBMS must permit the table definition to also
state which attributes shall be its primary key. It must enforce such a constraint as follows:

1. It must not permit any of these primary key attributes to have NULL values, via
requirement 3.

2. It must reject adding another row with identical values for all these key attributes
as an already stored row.

• The RDBMS maintains this primary key information in its metadata alongside the
table definition.

• The RDBMS can also generate unique values for artificial identifiers.
It can for instance maintain counters in its metadata.

• The chosen primary key attributes are often shown underlined in the schema of the
table.

• Only an actually stored database table has a primary key, but a table which the
RDBMS computes as an answer to show to the user does not.

• For instance, if we ask for just the students’ names in the university database, then
the answer will have duplicates, since several students can have the same name.
Hence this answer table cannot have a key – it is not even “all key”.
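A sketch of requirement 4 with SQLite: a second row with an already-used primary key value is rejected, and for an artificial identifier the system can generate fresh key values from a counter it maintains itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPT (DId INTEGER PRIMARY KEY, DName TEXT)")
conn.execute("INSERT INTO DEPT VALUES (10, 'compsci')")

# A second row with the same key value is rejected.
try:
    conn.execute("INSERT INTO DEPT VALUES (10, 'math')")
except sqlite3.IntegrityError as e:
    print("duplicate key rejected:", e)

# Artificial identifiers: omit the key and let the RDBMS generate a fresh one.
cur = conn.execute("INSERT INTO DEPT (DName) VALUES ('drama')")
print("generated DId:", cur.lastrowid)
```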

8
Figure 4: Foreign Keys for the University Database. (Sciore, 2008)

Foreign Keys

• An attribute a of a table T is a foreign key referencing another table U if

– its value r.a for a row r in table T names the row s in table U which
corresponds to this row r . . .
– . . . so that r.a = s.b where attribute b is the primary key chosen for table U .

If the primary key chosen for table U consists of multiple attributes b1, b2, b3, . . . , bn,
then the foreign key in table T consists of corresponding attributes a1, a2, a3, . . . , an
so that r.a1 = s.b1, r.a2 = s.b2, r.a3 = s.b3, . . . , r.an = s.bn.

• Intuitively, “this s is the U of this r ”.

• Foreign keys are the central tool to “glue together” the two individuals x and y
represented by the two rows r and s in the two relational tables T and U .

• Figure 4 shows the foreign keys in our university database example.

• For instance, attribute MajorId of table STUDENT is a foreign key referencing
table DEPT.

– Hence the attribute r.MajorId of a row r in STUDENT contains the primary
key value s.DId of a certain row s in DEPT.
– That is, this department s is the department for the major subject of this
student r .
– Hence “Joe’s major is computer science” in Figure 3.

Requirement 5 (referential integrity). The RDBMS must permit defining the foreign key
attribute(s) from one table into another. Moreover, it must enforce that if an attribute a
of a table T is defined to be a foreign key of table U , and its value r.a in a row r in T
is not NULL, then table U must contain a row whose primary key value equals this r.a.

• In other words, if a row r of table T claims that there is some corresponding row s
in table U , then this row s must indeed exist in table U .

• It is part 1 of requirement 4 applied to foreign keys.

• Compare to programming: a valid pointer is either NULL or it must point to some
valid object.

• The RDBMS must react somehow, if the user attempts to delete from table U the
row s referenced by some rows r via the foreign key in table T , since it would violate
requirement 5.

• The corresponding SQL definition ON DELETE. . .

IGNORE means that this attempt to delete row s from U will be rejected, because
something must be done to the rows r in T first.
The other reactions automate some common ways to “do something” to these
rows r first.
CASCADE means the following:
1. this row s is deleted from its table U ;
2. every such row r will be deleted from table T ; and
3. the RDBMS reacts to each of these deletions in step 2 as defined.
This continues until requirement 5 is restored.
SET NULL first sets the foreign key attributes in table T into NULL. That is,
each modified row r will now say that “there is no corresponding row s in
table U ”.
Of course, all these attributes must permit NULL values, via requirement 3.
SET DEFAULT constant first sets the foreign key attributes in table T into the
given constant instead of NULL. It must be the key for some row s′ still in
table U . That is, each modified row r will now refer to row s′ instead of s.

• The RDBMS can maintain these foreign key definitions and their ON DELETE. . .
definitions (if any) alongside the definition of the referring table T in the metadata.
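Requirement 5 and ON DELETE CASCADE can be sketched in SQLite. Two SQLite quirks are assumptions of this sketch: foreign-key enforcement must be switched on with a PRAGMA, and the rejecting reaction is spelled RESTRICT or NO ACTION there rather than IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.execute("CREATE TABLE DEPT (DId INTEGER PRIMARY KEY, DName TEXT)")
conn.execute("""CREATE TABLE STUDENT (
                  SId INTEGER PRIMARY KEY,
                  SName TEXT,
                  MajorId INTEGER REFERENCES DEPT(DId) ON DELETE CASCADE)""")
conn.execute("INSERT INTO DEPT VALUES (10, 'compsci')")
conn.execute("INSERT INTO STUDENT VALUES (1, 'joe', 10)")

# Requirement 5: a row referring to a nonexistent department is rejected.
try:
    conn.execute("INSERT INTO STUDENT VALUES (2, 'amy', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# ON DELETE CASCADE: deleting the department deletes its students too.
conn.execute("DELETE FROM DEPT WHERE DId = 10")
print(conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0])  # prints 0
```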

Normalization (Sciore, 2008, Chapter 3.6)

• A central tool in systematic database design is normalization theory.

• This approach has developed normal forms (NFs) to guide the design of database
tables.

• Each NF is designed to prevent some kinds of update anomalies: strange behaviour
when data is updated.

• We shall now review FD-based NFs, which already prevent the most common update
anomalies.

• Database theory literature has many more NFs based on generalizations of FDs,
which prevent other less often encountered anomalies.

• The insight in FD-based normalization is the following:

– Suppose we have a database table with the schema T (~a, ~b).


– Table T is normalized, if its FDs are exactly ~a → ~b.
– Otherwise the design of table T still has some redundancy left, causing update
anomalies, so further normalization is needed.
– This further normalization consists of splitting table T into two new tables
connected together with foreign keys.

The Oath of the Relational Database Designer

“I swear to construct my tables so that every non-key attribute depends on
(provides a fact about) the key, the whole key, and nothing but the key — so
help me Codd!”

Depending on the key ensures the First Normal Form (1NF),

the whole key ensures the Second Normal Form (2NF), and

nothing but the key ensures the Third Normal Form (3NF).

In BCNF the “non-key” condition is extended into “functionally dependent”.

1NF

• Table T is in 1NF if its chosen primary key does indeed satisfy property 1 of
candidate keys.

• Consider as an example maintaining the following contact information:

Person as the key
Address for that person
Phone numbers for that person – the same person can have zero or more phone
numbers.

• A natural schema would be

CONTACTS (Person,Address,SET OF PhoneNo)

since each person does indeed determine some set of corresponding phone numbers.

– However, our basic relational data model does not permit this:
Each attribute permits only a single indivisible value, and not a compound
value with inner structure.
– They would be permitted in so-called non-first normal form (NFNF) data models,
which extend the basic model.

• A possibility within the basic model might be to fix some upper limit p on phone
numbers/person, and use the schema

CONTACTS (Person, Address, PhoneNo1, PhoneNo2, PhoneNo3, . . . , PhoneNop)

where each attribute PhoneNoi would be permitted to have NULLs to mean “this
person does not have an ith phone number”. However:

– Managing these separate attributes and their NULLs would be tedious.
– What if some well-connected person has more than p phone numbers?
– That is, this table design would add a new technological restriction not present
in the original situation we are trying to model into tables.

• Another possibility might be to give up property 1 and use the schema

CONTACTS (Person,Address,PhoneNo)

with a duplicate row for each phone number for a given person. However:

– This table design would implicitly allow the same person to have many different
addresses.
– That is, it would not enforce even the FD
Person → Address
in the original situation.
– This is an example of an update anomaly:
The RDBMS would not be able to reject an update which would violate the
intended meaning.

• A solution in 1NF is to split the table into two with schemas

CONTACTS (Person,Address) and PHONES (Person,PhoneNo)

where the new table is all key.

• In practice, this solution needs also a fast way to find the phone numbers of a given
person. The primary key index of the PHONES table does not help, so we must
create a new clustering index too.

Requirement 6 (extra indexes). The RDBMS must permit defining new indexes, and it
must maintain these defined indexes automatically as the database contents are modified.
Unique indexes associate a single value to each key.

Clustered indexes associate a group of values to each key.

• The RDBMS maintains these new index definitions in the metadata.
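A sketch of requirement 6 in SQLite: after an extra index on PHONES.Person is defined (the index name here is our own), the system maintains it automatically, and its query planner uses it to find all numbers of a given person without scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CONTACTS (Person TEXT PRIMARY KEY, Address TEXT)")
conn.execute("CREATE TABLE PHONES (Person TEXT NOT NULL, PhoneNo TEXT NOT NULL)")

# The extra index of requirement 6: maintained automatically from now on.
conn.execute("CREATE INDEX phones_by_person ON PHONES (Person)")
conn.execute("INSERT INTO PHONES VALUES ('joe', '040-111')")
conn.execute("INSERT INTO PHONES VALUES ('joe', '040-222')")

# The query plan shows an index search instead of a full table scan.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT PhoneNo FROM PHONES WHERE Person = 'joe'"
):
    print(row)
```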

2NF
• 2NF considers tables whose chosen primary key consists of two or more attributes,
and requires that all the other attributes must depend on all of them.

• The schema

T (~a, ~b, c, ~d)

and its FD
~b → c (4)
show how 2NF is violated: attribute c depends only on the part ~b of the whole key
~a, ~b.

• Its solution is to split the table into two tables

T (~a, ~b, ~d) and U (~b, c)

where FD (4) has moved into the new table U . These two tables are connected by
stating that the attributes ~b of the old table T are a foreign key referencing the new
table U .

• As an example, consider the schema

WORKED(Employee, Project, Department, Hours) (5)

so that “this employee has worked this number of hours on that project which is
one of the projects of that department”.

• The violation of 2NF stems from the FD

Project → Department (6)

which represents the “which” part.

• A corresponding update anomaly is that the WORKED table permits the same
project to belong to many different departments, despite FD (6).

• Its solution is two tables

WORKED(Employee, Project, Hours) and
PROJECTS (Project, Department) (7)

or “this employee has worked this number of hours on that project” and “it is one
of the projects of that department” where

– FD (6) has moved to the new PROJECTS table, and


– the attribute WORKED .Project is a foreign key referencing it.
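The 2NF split can be checked mechanically on a toy instance: because PROJECTS keeps exactly one department per project (FD (6)), joining the two halves back on Project reconstructs the original table losslessly. A small sketch in plain Python:

```python
# A toy instance of the unnormalized WORKED table (5).
worked = [
    ("ann", "p1", "sales", 10),
    ("bob", "p1", "sales", 5),
    ("ann", "p2", "hr",    7),
]

# Split per (7): the FD Project -> Department moves into its own table.
worked2 = sorted({(e, p, h) for (e, p, d, h) in worked})
projects = sorted({(p, d) for (e, p, d, h) in worked})

# The split is lossless: a natural join on Project restores the original,
# because projects stores exactly one department per project.
dept = dict(projects)
rejoined = sorted((e, p, dept[p], h) for (e, p, h) in worked2)
print(rejoined == sorted(worked))  # True
print(projects)                    # [('p1', 'sales'), ('p2', 'hr')]
```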

3NF

• 3NF precludes transitive dependencies (3):
Each attribute must depend directly on the key.

• The schema

T (~a, b, c)

and its FDs

~a → b
b→c (8)

show how 3NF is violated: attribute c does depend on the key ~a as it should, but
only via the non-key intermediate attribute b.

• Its solution is to split table T into two tables

T (~a, b) and U (b, c)

where FD (8) has moved into the new table U . These two tables are connected by
stating that the attribute b of the old table T is a foreign key referencing the new
table U .

• As an example, consider the schema

WORKS (Person,Department,Address)

so that “this person works in that department which is located at that address”.

• The violation of 3NF stems from the FD chain

Person → Department
Department → Address (9)

where FD (9) represents the “which” part.

• A corresponding update anomaly is that the WORKS table permits the same de-
partment to be located at many different locations, despite FD (9).

• Its solution is two tables

WORKS (Person,Department) and LOCATION (Department,Address)

or “this person works in that department” and “this department is located at that
address” where

– FD (9) has moved into the new table LOCATION , and


– attribute WORKS .Department is a foreign key referencing it.

2.3 Integrity Constraints


(Sciore, 2008, Chapters 2.5–2.6 and 5.2–5.3)
• From the RDBMS perspective, integrity constraints are conditions which the
database contents must satisfy to be in a consistent state.

• Requirement 5 is an example of such an integrity constraint:


It states that updating the database must not be permitted to break the intended
connections from one table into another.

• The database is only allowed to change

from its current consistent state


into another consistent state, as defined by the change operation (insertion, dele-
tion, update) and the integrity constraints of the database.

• From the database design perspective, integrity constraints describe what it means
for the database to reflect the reality (whatever that means. . . ) of its intended
application area.

• The specified integrity constraints are a part of the metadata.

Figure 5: Checking assertions in a table definition. (Sciore, 2008)

Assertions
• Assertions are conditions which the database state must satisfy.
• If the result of a change operation would violate any constraint, then the RDBMS
rejects the operation.
• In SQL, such an assertion can be expressed with
check condition
which tests the given truth-valued condition.
• Such a check can appear in an SQL table definition, where it states a condition
which the attribute values for each row must satisfy, as in Figure 5.

• The condition to check often has the form


not exists query
which states that the result of this query must be empty.
• This lets us express constraints of the form
– “the database must never be allowed to contain any rows which would satisfy
this query” or
– “the database contents must never be allowed to be as described in this query”
as shown in Figures 6–8.
• These three examples show how we can define and name assertions outside tables:
create assertion ItsName
check condition

• Figure 8 shows a named assertion involving two tables.
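A row-level check in a table definition, in the spirit of Figure 5, can be tried out in SQLite (the condition below is our own illustrative one; free-standing create assertion, by contrast, is not supported by most systems, SQLite included):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A check in the table definition: every row must satisfy the condition.
conn.execute("""CREATE TABLE STUDENT (
                  SId INTEGER PRIMARY KEY,
                  SName TEXT,
                  GradYear INTEGER CHECK (GradYear >= 1900))""")

conn.execute("INSERT INTO STUDENT VALUES (1, 'joe', 2004)")      # accepted
try:
    conn.execute("INSERT INTO STUDENT VALUES (2, 'amy', 1800)")  # violates check
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```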

Figure 6: One example of a named SQL assertion. (Sciore, 2008)

Figure 7: Another example of a named SQL assertion. (Sciore, 2008)

Figure 8: An example of a named SQL assertion over two tables. (Sciore, 2008)

Triggers

• Sometimes we do not want to reject a change operation, as assertions do, but con-
tinue with other operations until the database is again in the kind of state we want
it to be.

• Examples are the ON DELETE IGNORE, CASCADE, SET NULL and
SET DEFAULT constant options to handle deletions of rows referenced by
foreign keys in section 2.2.

• They tell how to continue until the referential integrity of the database is restored
again.

• Triggers are similar rules defined by the DBA.

• Triggers are also called

Event- because a trigger waits for a certain modification operation like insert,
delete or update to happen
Condition- because a trigger has a condition which the RDBMS tests when its
event happens, and this condition determines whether this trigger will fire or
not
Action- because if the trigger fires, then the RDBMS performs these other opera-
tions

rules, because they have these three parts.

• So for ON DELETE these parts would be:

Event is a delete operation to the table which is referenced by some foreign key
from another table.
Condition is to test whether this would delete rows which are referenced by rows
of this other table.
Action is the given option ON DELETE IGNORE, CASCADE, SET NULL
or SET constant.

• In Figure 9, the university wants to permit several persons in its staff to modify
course grades, but also wants to maintain a GRADE LOG table recording who changed
what and when, for auditing purposes.

• The trigger in Figure 10 enforces the American university policy that when a new
student is inserted into the database, his/her forthcoming expected graduation year
is no more than 4 years from now.

• However, note that the expected graduation year for an existing student can still
be updated to violate this policy, because the trigger in Figure 10 applies only to
insertion events.
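As a concrete sketch of the event-condition-action pattern, the following Python session uses an SQLite trigger to log every grade change, in the spirit of Figure 9. The ENROLL and GRADE_LOG schemas below are invented for illustration; they are not the ones in Sciore's figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ENROLL (EId INTEGER PRIMARY KEY, Grade TEXT);
    CREATE TABLE GRADE_LOG (EId INTEGER, OldGrade TEXT, NewGrade TEXT);

    -- Event: an UPDATE of Grade on ENROLL.
    -- Condition: the WHEN clause - the grade actually changed.
    -- Action: append an audit row to GRADE_LOG.
    CREATE TRIGGER log_grade_change
    AFTER UPDATE OF Grade ON ENROLL
    WHEN OLD.Grade <> NEW.Grade
    BEGIN
        INSERT INTO GRADE_LOG VALUES (OLD.EId, OLD.Grade, NEW.Grade);
    END;
""")
conn.execute("INSERT INTO ENROLL VALUES (1, 'B')")
conn.execute("UPDATE ENROLL SET Grade = 'A' WHERE EId = 1")
log = conn.execute("SELECT * FROM GRADE_LOG").fetchall()
```

Note how the three rule parts map directly onto the trigger syntax: the event is the UPDATE, the condition is the WHEN clause, and the action is the INSERT into the log.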

Figure 9: An example of an SQL trigger. (Sciore, 2008)

Figure 10: Another example of an SQL trigger. (Sciore, 2008)

Figure 11: The Schemata.

2.4 Different Viewpoints to Data


(Sciore, 2008, Chapters 1.7 and 4.5)

• Although normalization makes sense from the DBA’s viewpoint, because it avoids
update anomalies, the resulting new table structure might make less sense than the
original from the user’s viewpoint.

• For instance, the user may well prefer the original WORKED table (5) over its
normalized form (7) with the separate PROJECTS table, because (s)he may wish
to be reminded of the department when reading the hours listing.

• Hence we differentiate three separate levels of schemas for a database, as shown in
  Figure 11.

Conceptual schema consists of the normalized tables derived from the class diagrams
describing the application area for which this database has been designed.

Physical schema implements the conceptual schema with concrete database table and
index files.

• The RDBMS is responsible for maintaining them.


• A DBMS supports physical data independence if its users do not need to
interact with it at this level.

External schemas implement the user’s various views to the stored data on top of the
conceptual schema.

• A DBMS supports logical data independence if its users can be given their
own external schemas, so that they do not need to know the conceptual schema.

• Data independence is desirable, because it shields the upper levels from changes in
the lower levels.

Requirement 7 (views). To support logical data independence, the RDBMS must support
defining views: virtual tables on top of actual tables.
• A view can be either
purely virtual so that it exists only as a query Q which accesses the actual tables
in the desired way, or
materialized so that the RDBMS maintains its current contents also in a separate
actual table V .
– This information is redundant, since the contents of this table V could be
created by the defining query Q from the database instead.
– In the “good ol’ times”, already normalized tables were later denormalized
by hand to provide such redundancy.
– Views are a better alternative, since the RDBMS can manage them auto-
matically.
– But there is not (yet) any standard vendor-independent way to define a
materialized view. . .
• The user of an external schema should see its views just like ordinary tables. How-
ever, there is a difference: It might not be clear how a view can be updated – that
is, how the RDBMS should handle insertions, deletions and updates to its rows,
because these rows might not ”really” exist.
• The intuition is that only those views are updatable, whose defining query Q is
so “simple” that the affected rows of its underlying actual tables can be deter-
mined (Connolly and Begg, 2010, Chapter 4.4.3) (Elmasri and Navathe, 2011, Chap-
ter 5.3.3):
– If Q uses grouping or aggregation operations (explained next), then it is not
updatable – since one row in the view is a combination of several rows of the
underlying table(s).
– If Q uses more than one table, then the view is in general not updatable –
since one row in the view is a combination of several rows, each from a different
underlying table.
– If Q contains nested queries, then the view is not updatable – since the update
might have to affect these nested queries too.
– If Q does not mention all the non-null attributes without default values of its
only table, then it is not updatable – since the update would not specify the
required values for these missing attributes.
Otherwise the view can be updatable.
• Another alternative which is becoming common in RDBMSs is to use stored proce-
dures instead of view updates.
– A stored procedure is a combination of programming language and query lan-
guage constructs – an “RDBMS subroutine”.
– It is stored in the view metadata.
– The user can invoke such a procedure, which the DBA has programmed to “do
the right thing” when the user wants to update his/her view.
They offer more flexibility than plain view updates, where the RDBMS has to guess
what would be “the right thing” to do when handling a view update.
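A minimal sketch of a purely virtual view, using SQLite from Python (the WORKED contents below are invented): the view exists only as its defining query Q, yet the user of the external schema queries it like an ordinary table. SQLite views are in fact read-only unless the DBA adds INSTEAD OF triggers, which illustrates the view-update problem discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WORKED (Person TEXT, Project TEXT, Hours REAL)")
conn.executemany("INSERT INTO WORKED VALUES (?, ?, ?)",
                 [("ann", "alpha", 3), ("bob", "alpha", 2), ("ann", "beta", 5)])
# The view is stored only as its defining query Q, not as a separate table.
conn.execute("""CREATE VIEW ALPHA_HOURS AS
                SELECT Person, Hours FROM WORKED WHERE Project = 'alpha'""")
# The user queries the view exactly like an ordinary table.
rows = conn.execute(
    "SELECT Person, Hours FROM ALPHA_HOURS ORDER BY Person").fetchall()
```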

Grouping Data

• The WORKED table example (5) also suggests another viewpoint to data:
  The user may also wish to list the total number of hours spent per project.

• The RDBMS could fulfill this wish as follows:

¬ Sort the rows of the WORKED table according to its WORKED .Project at-
tribute, via the sorting requirement 2.
­ For each distinct WORKED .Project attribute value p, add together all the
t .Hours values for all the rows t with t .Project = p. All these rows t are now
adjacent to each other, by step ¬.
® Report to the user each value p and the corresponding sum computed in step ­.

Requirement 8 (grouping). The RDBMS must be able to group together related rows
and summarize each group into a single representative accumulated value.
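The three steps ¬–® above collapse into a single GROUP BY query in SQL; a sketch with SQLite from Python, where the WORKED(Person, Project, Hours) shape and its contents are assumed for illustration (the Person attribute is invented, not taken from table (5)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WORKED (Person TEXT, Project TEXT, Hours REAL)")
conn.executemany("INSERT INTO WORKED VALUES (?, ?, ?)",
                 [("ann", "alpha", 3), ("bob", "alpha", 2), ("ann", "beta", 5)])
# Sorting, grouping and summing are all expressed by one declarative query;
# the RDBMS chooses how to execute it.
totals = conn.execute(
    "SELECT Project, SUM(Hours) FROM WORKED GROUP BY Project ORDER BY Project"
).fetchall()
```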

2.5 Transactions
• The data grouping scenario in the end of section 2.4 shows that an RDBMS must
control concurrent access to its contents:

– One employee x has asked the RDBMS to give the listing of total hours per
each project. . .
– . . . while other employees y, z, u, . . . insert their own hours into the database at
the same time.
– Which of these new hours will be included in the listing?

• Note that even when no concurrency is permitted, the RDBMS must still somehow
  be able to enforce that restriction.

• Hence RDBMS implementation includes aspects of concurrent programming.

• The RDBMS must also be able to recover properly after a crash. Consider the
following scenario:

– Suppose that a user deletes some row, which is referenced by a foreign key, and
this starts many other CASCADEd deletions in other parts of the database.
– Then the computer running the RDBMS crashes in the middle of these CAS-
CADEd deletions.
– When the computer and RDBMS are restarted after the crash, the RDBMS
must first somehow undo all those CASCADEd deletions which it managed
to perform before the crash.
(Or carry out the rest of them too, but this would be even harder.)
– Otherwise some of the CASCADEd deletions would be done while others
would be left undone – and so the referential integrity of the database might
be in danger!

• The general problem is not in CASCADE: Consider instead a company payroll, a


2% raise to all, and a crash in the middle of computing the new salaries. . .

• Hence RDBMS implementation includes aspects of fault-tolerant computing.

• These two scenarios show that the RDBMS must maintain its consistent state in
successive “snapshots”:

¬ The first grouping scenario showed that a query must be evaluated in some
static “snapshot” of the database, and updating it cannot be permitted at the
same time.
­ The second crash scenario showed that an update must take the database all
the way from one snapshot into the next, even though this may mean many
lengthy individual operations.

• One concept subsumes both of these concurrency and recovery requirements for an
RDBMS:

Requirement 9 (transactions). The RDBMS must permit defining transactions: sequences
of operations which satisfy the 4 ACID properties.

Atomicity

• A transaction must be an atomic (Greek: ἄτομος, “atomos”, “indivisible”) unit of
  work: either

  every operation in the transaction is executed successfully, or

  none of them is.

• Accordingly, a transaction ends in either a

  commit which means that it has managed to execute all its operations successfully,
      or an
  abort (also called rollback) which means to undo all the operations which it did
      manage to execute successfully, so that afterwards everything looks as if
      the transaction had never started at all.

• Hence the abort operation is a very convenient abstraction for cleaning everything
up after an error occurred in the middle of a transaction – a very common pro-
gramming pattern in fault-tolerant computing.

• Atomicity solves the problem in our second crash scenario ­ as follows:

¶ The deletion issued by the user starts a new transaction t.


All the CASCADEd operations it causes are also executed in this same trans-
action t.
· When the RDBMS is restarted after a crash, it detects that this transaction t
did not manage to commit before the crash, so it aborts t.
In this way the RDBMS recovers from the crash back into consistent state in
which it was before step ¶ started.

• Hence atomicity is part of the recovery requirement.

Correctness
• A transaction must be correct, in the sense that the state of the database after
  it has committed must again be consistent, as defined by its integrity constraints
  in section 2.3. . .

• . . . although it can be inconsistent during the transaction:

– For instance, a deletion and all its cascaded operations in our second crash
scenario ­ are executed in the same transaction t in step ¶.
– The referential integrity requirement 5 is temporarily broken during t. . .
– . . . but is restored after committing or aborting t.

• In this way, the RDBMS uses transactions internally for its own operations like
these CASCADEd deletions.

• Hence correctness is part of the concurrency requirement:


The RDBMS must ensure that the database always appears to be in a consistent
state to the outside world, regardless of its internal state.

• The RDBMS must also permit external application programs which use the database
to specify their own transactions:

– The canonical example is: “Transfer X € from bank account Y into Z if Y has
  enough money.”
– In pseudocode:

  xfer(X,Y,Z):
    1  SELECT Balance
       FROM Bank
       WHERE Account = Y
    2  if Balance ≥ X
    3    UPDATE Bank
         SET Balance = Balance − X
         WHERE Account = Y;
    4    UPDATE Bank
         SET Balance = Balance + X
         WHERE Account = Z
    5  else what?
• The RDBMS can run each of the three SQL statements on lines 1, 3 and 4 in its
  own internal transaction – but that would not be enough!

• Instead, lines 1–3 must be executed in the same transaction:

– Suppose some other execution of xfer(A,Y,B) executes concurrently between


lines 1 and 2.
– Then the value of Balance retrieved on line 1 is out-of-date on line 2.
  This other execution has transferred another B € out of account Y. . .
– . . . which might leave less than X €, and line 3 makes the Balance of account Y
  negative – even though this was exactly what we tried to avoid!

Thus we would have a concurrency problem otherwise.

• Lines 3–4 must be executed in the same transaction too:

– Suppose that line 4 fails for some reason.


– Then we must abort line 3 too, because. . .
– . . . if we do not, then X € would disappear from account Y into nowhere.

Thus we would have a recovery problem otherwise.

• Hence lines 1–4 must all be executed in the same transaction.

• Consider finally line 5. How do we want to report the error that “account Y has
  less than X €”?

– A good choice would be to abort the transaction (even though it has changed
nothing in the Bank) because then an abort means “the money was not trans-
ferred for some reason”.
– Otherwise the transaction could commit in two ways:
either with “the money was transferred”
or with “there was not enough money to transfer”
and the caller of xfer would then have to find out which of these two possi-
bilities actually happened.
– When this xfer code is used as a small part of a large program which imple-
ments the “business logic” of the organization, this choice to abort becomes
more and more attractive to the programmer.

• Hence the RDBMS must permit external application programs to begin, commit
and abort their own transactions, which may consist of several database and non-
database operations.

• This is required, because the database might have also more complex integrity con-
straints like “money should not just disappear” which cannot be stated with just
the RDBMS assertions and triggers.
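A single-connection sketch of xfer using SQLite from Python (the account data is invented): commit and rollback give the atomicity side of the requirement, and the insufficient-funds branch chooses to abort, as argued above. The isolation side – another xfer running between lines 1 and 3 – cannot be shown with a single connection and is left to the RDBMS's concurrency control.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bank (Account TEXT PRIMARY KEY, Balance INTEGER)")
conn.executemany("INSERT INTO Bank VALUES (?, ?)", [("Y", 100), ("Z", 50)])
conn.commit()

def xfer(conn, x, y, z):
    """Move x euros from account y to account z as one transaction."""
    try:
        (balance,) = conn.execute(
            "SELECT Balance FROM Bank WHERE Account = ?", (y,)).fetchone()
        if balance < x:
            raise ValueError("not enough money")        # line 5: choose to abort
        conn.execute("UPDATE Bank SET Balance = Balance - ? "
                     "WHERE Account = ?", (x, y))       # line 3
        conn.execute("UPDATE Bank SET Balance = Balance + ? "
                     "WHERE Account = ?", (x, z))       # line 4
        conn.commit()                                   # both updates or neither
        return True
    except Exception:
        conn.rollback()                                 # undo any partial work
        return False

ok = xfer(conn, 30, "Y", "Z")     # commits: Y 100 -> 70, Z 50 -> 80
bad = xfer(conn, 1000, "Y", "Z")  # aborts: Y has only 70 left
balances = dict(conn.execute("SELECT Account, Balance FROM Bank"))
```

Note the design choice from the discussion of line 5: the caller learns of a failed transfer only through the abort, so a committed xfer always means “the money was transferred”.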

Isolation

• Transactions must be isolated from each other, in the sense that a transaction
must not notice any of the other concurrently running transactions – instead, each
transaction must “see” the database as if it were the only transaction using it.

• Hence isolation is the other part of the concurrency requirement (besides correct-
ness).

• Consider again our first grouping scenario ¬.

– Employee x will get a listing of exactly those hours of the other employ-
  ees y, z, u, . . . whose insertion transactions ty , tz , tu , . . . have already been com-
  mitted before x starts the listing transaction tx .
– If such an insertion transaction is running at the same time as the listing trans-
  action, they are isolated from each other. So tx does not see the effects of those
  transactions ty , tz , tu , . . . which are still running – because they might yet abort,
  and must therefore not be listed!

Figure 12: Transaction isolation levels. (Sciore, 2008)

• Isolation is the one ACID property which the user can relax, if (s)he. . .

– tolerates possible inaccuracies in the answer to the query, and


– wants the query to run faster.

In other words, the user can play “fast and loose” by altering the transaction isola-
tion level of his/her query, and accept the risks involved.

• These 4 levels are shown in Figure 12. Its middle column discusses a possible
implementation, and we shall return to that column later.

Serializable is the full risk-free isolation of the ACID properties.

• Every transaction should run at this level by default, and in most RDBMSs
they do.
• In our first grouping scenario ¬, the listing would contain effects of only those
of transactions ty , tz , tu , . . . which committed before transaction tx started.

Repeatable read is the next riskier level.

• The risk involved is phantoms:
  New rows which may appear in the database while the current transaction
  is running.

• In our first grouping scenario ¬, the listing might also contain some rows added
by those transactions ty , tz , tu , . . . which committed during transaction tx . . .
• . . . but user x would not know which, because that depends on the concurrent
execution order of these transactions tx , ty , tz , tu , . . ..
• This level is useful for transactions which modify an already existing row in the
database, because phantoms do not affect that.
• This is why some RDBMSs (notably Oracle and Sybase) use it as the default
  isolation level instead of full Serializable.

Read committed level is riskier still.

• The new risk (in addition to phantoms) is nonrepeatable reads:


If a transaction reads the same value twice from the database, then it may
get different results – because another transaction has changed the value in-
between these two reads and committed.
• Note that the RDBMS may need to reread the same value repeatedly during
query evaluation.
• In our first grouping scenario ¬, the listing would include the Hours of some
rows modified by those transactions ty , tz , tu , . . . which committed during trans-
action tx . . .
• . . . but again user x would not know which.
• This level would be OK for a transaction whose operations are “unrelated”
to each other, in the sense that they could be executed in parallel as well as
sequentially.

Read uncommitted is the riskiest level.

• The new risk (in addition to phantoms and nonrepeatable reads) is dirty reads:
A transaction can read data as soon as another transaction writes it – even
when this other writing transaction later aborts, and its writings should not
have happened at all.
• This is also very fast, because this transaction does not have to stop and wait
for any other transactions.
• In our first grouping scenario ¬, the listing would contain whatever was in the
WORKED table when transaction tx happened to read it.
• However, this level would be OK for read-only transactions whose results do
not have to be exactly accurate.
• For instance, user x can run the listing transaction tx in this level, if (s)he
just wants to compute quickly some rough statistics about approximately how
many Hours have people WORKED on each project.

Durability

• Durability means that when a transaction commits, then the changes it has made
to the data are now stored permanently, so that even a computer crash does not
wipe them out.

• Hence durability is the other part of the recovery requirement (besides atomicity).

              Java programming language              RDBMS

source:       Java source code in a .java file       an SQL statement from the user
                                                     (who might also be an application
                                                     program) – declarative approach:
                                                     what the result must be

                            which gets compiled into

intermediate: corresponding Java object code in      a corresponding Relational Algebra
              a .class file by the Java compiler     expression by the SQL parser of the
                                                     RDBMS, optimized by its query
                                                     optimizer – procedural approach:
                                                     how the result can be formed

                            which gets executed by

runtime:      the Java virtual machine (JVM)         internal algorithms chosen by the
                                                     RDBMS for each operation in the
                                                     expression

Table 1: Java vs. RDBMS

2.6 Relational Algebra


(Sciore, 2008, Chapter 4.2)

• The THA course has already presented the Relational Algebra from its own view-
point.

• Here we present it from the RDBMS viewpoint, as the intermediate language
  between

– the user-level Structured Query Language (SQL) and


– the internal algorithms with which the RDBMS can execute each operator of
Relational Algebra

as shown in Table 1.

• Hence we present here a variant of the Relational Algebra which may be closer to
the internals of the RDBMS than the one presented in THA.

• We also assume that the idea of expressions as trees is familiar from the course
“Basic Models of Computation” (”Laskennan perusmallit” or LAP in Finnish).

• In particular, when we say here that an argument of a Relational Algebra operator
  is a table, then it can be

– either an actual table saved in a database


– or another Relational Algebra expression whose value is the table.

In both cases, this argument table has an associated schema.

• Recall that the result of a Relational Algebra operation is another table, and that
this result has its own schema.

• In mathematical presentations of Relational Algebra, these tables are considered to
  be sets of rows. Here we consider them to be bags or multi-sets of rows instead,
  because the results computed by RDBMSs have in general duplicate rows, unless
  they are explicitly suppressed.

Select (Sciore, 2008, Chapter 4.2.1)
• The select operator takes 2 arguments:

Table from which rows are selected.


Predicate which is any Boolean combination of Terms.
– We assume that Boolean operations and (‘∧’), or (‘∨’) and not (‘¬’) are
familiar from the course “Discrete Structures” (”Diskreetit rakenteet” or
DSR in Finnish).
– Here each Term is
Expression Comparison Expression
where
Comparison is ‘=’, ‘<’, ‘>’,. . .
Expression consists of attribute names from the schema of the table argu-
ment, constants, and operations like ‘+’, ‘-’,. . .
– Another kind of Term is
AttributeName IS [NOT] NULL.

• Its result consists of those rows of its table argument for which the predicate is
true.

• Hence its result has the same schema as the table argument.

• For instance the Relational Algebra expression

Q3 = select(select(STUDENT
,GradYear=2004) ¬
,MajorId=10 or MajorId=20) ­

of our university example

¬ first selects those rows of the STUDENT table where the GradYear attribute
equals 2004 – as the inner operation – and
­ then selects from them those rows where the MajorId attribute equals either
10 or 20 – as the outer operation.

In this way, it selects the students who graduated in 2004 from either computer
science or mathematics.

• Its corresponding expression tree is shown in Figure 13.

• Note how the result is conceptually computed “inside out”

starting at the leaf nodes representing the actually stored tables – here STUDENT
– and
moving up towards the root, and doing the Relational Algebra operator at each
internal node.

• This operation is often written as σpredicate (table) in the database literature – ‘σ’
  being the Greek ‘s’.
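To make the bag semantics concrete, select can be sketched in Python over tables modeled as lists of dicts; the toy STUDENT rows below are invented and carry only a few of the attributes of the real table:

```python
# A table is a bag of rows; each row is a dict from attribute name to value.
STUDENT = [
    {"SId": 1, "SName": "joe", "MajorId": 10, "GradYear": 2004},
    {"SId": 2, "SName": "amy", "MajorId": 20, "GradYear": 2004},
    {"SId": 3, "SName": "max", "MajorId": 10, "GradYear": 2005},
]

def select(table, predicate):
    """Keep exactly the rows for which the predicate is true;
    the result has the same schema as the table argument."""
    return [t for t in table if predicate(t)]

# Q3: the inner select tests GradYear, the outer one tests MajorId.
q3 = select(select(STUDENT, lambda t: t["GradYear"] == 2004),
            lambda t: t["MajorId"] in (10, 20))
```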

Figure 13: The Relational Algebra expression tree for Q3. (Sciore, 2008)

Project (Sciore, 2008, Chapter 4.2.2)

• The project operation takes 2 arguments:

Table whose rows are projected.


Attributes into which they are projected. Here we represent them as sets of at-
tribute names from the schema of the table argument.

• Its result has the same rows as its table argument, but its schema is restricted to
consist only of these particular attributes – that is, we forget that the table argument
has any other attributes than these.

• This result can contain duplicate rows, by the bag semantics.

• For instance, the Relational Algebra expression

Q6 = project(select(STUDENT
,MajorId=10)
,{SName})

¬ first selects all computer science students, and


­ then keeps only their names.

• Its corresponding expression tree is given as Figure 14.

• Its result will in general have duplicate rows – one for each computer science student
with that particular name.

• This operation is often written as πattributes (table) in the database literature – ’π’
being the Greek ’p’.
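A Python sketch of project over a table modeled as a list of dicts (toy rows assumed) shows the bag semantics: two different students who share a name yield two identical result rows, as in Q6.

```python
def project(table, attributes):
    """Keep every row but restrict its schema to the given attributes;
    duplicates are kept (bag semantics)."""
    return [{a: t[a] for a in attributes} for t in table]

def select(table, predicate):
    return [t for t in table if predicate(t)]

# Two different computer science students who happen to share a name.
STUDENT = [
    {"SId": 1, "SName": "joe", "MajorId": 10},
    {"SId": 2, "SName": "joe", "MajorId": 10},
]
q6 = project(select(STUDENT, lambda t: t["MajorId"] == 10), {"SName"})
```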

Figure 14: The tree for Q6. (Sciore, 2008)

Sort (Sciore, 2008, Chapter 4.2.3)

• The sort operator takes 2 arguments:

Table whose rows are sorted.


List of attributes from its schema according to which they are sorted.
– We write this list in brackets and separated with commas [like,this].
– Such a list [a1 , a2 , a3 , . . . , ap ] defines a lexicographic sorting order:
∗ To compare two rows t and u, find the smallest index i such that
  t .ai ≠ u .ai , and let that decide which of them appears before the
  other in the result.
∗ If there is no such i then either row can appear before the other.
– Furthermore, this order can be either ascending (or normal) or descending
(or reversed, “largest-first”).

• The result is sorted according to this order. That is, the result is now an ordered
bag. It has the same schema as the table argument.

• Because row order does not matter to the other operators, sort is usually the last
  (topmost, root) operator in the expression, and it is used only for displaying the
  result to the user.

• It fulfills one part of requirement 2.

• For instance the expression

Q8 = sort(STUDENT
,[GradYear,Sname])

sorts and displays the students

¬ first ordered by their graduation year, earliest first, and


­ then alphabetically by name within each year.
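The lexicographic order above is exactly what sorting by a tuple of attribute values gives; a Python sketch over a table modeled as a list of dicts (toy rows assumed):

```python
def sort(table, attrs, descending=False):
    """Order the bag lexicographically by the attribute list: the first
    attribute where two rows differ decides their relative order."""
    return sorted(table, key=lambda t: tuple(t[a] for a in attrs),
                  reverse=descending)

STUDENT = [
    {"SName": "max", "GradYear": 2005},
    {"SName": "joe", "GradYear": 2004},
    {"SName": "amy", "GradYear": 2004},
]
# Q8: by graduation year first, then alphabetically within each year.
q8 = sort(STUDENT, ["GradYear", "SName"])
```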

Rename (Sciore, 2008, Chapter 4.2.4)

• The rename operator takes 3 arguments:

Table to have one of its attributes renamed.


Attribute from its schema that is renamed.
New name for this attribute.

• Its result is the same table argument, except that the attribute argument is now
called by this new name in its schema.

• Relational Algebra also contains operators with two table arguments, as we shall
  soon see.

• We must sometimes rename their attributes apart from each other first, to make
clear which of these two table arguments contains a particular attribute.

Extend (Sciore, 2008, Chapter 4.2.5)

• The extend operator takes 3 arguments:

Table to extend with a new attribute.


Expression to compute the value for this new attribute, as in selection.
New name for this new attribute – one which does not yet appear in the schema
for the table argument.

• For instance the expression

Q11 = extend(STUDENT
,GradYear-1863
,GradClass)

¬ goes through the STUDENT table, row by row, and
­ for each row t, computes ct = t .GradYear − 1863, the number of class years
  since the founding of the university in 1863, and
® adds this ct as the value for the new GradClass attribute in the result schema
  into row t.
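A Python sketch of extend over a table modeled as a list of dicts (a toy row assumed); the new value is computed individually for each row:

```python
def extend(table, expression, new_name):
    """Add a new attribute whose value is computed row by row;
    all old attributes are kept."""
    return [{**t, new_name: expression(t)} for t in table]

STUDENT = [{"SName": "joe", "GradYear": 2004}]
# Q11: the graduating class, counted from the founding year 1863.
q11 = extend(STUDENT, lambda t: t["GradYear"] - 1863, "GradClass")
```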

GroupBy (Sciore, 2008, Chapter 4.2.6)

• The groupby operator takes 3 arguments:

Table whose rows are grouped together.


Attributes according to which they are grouped, as a set of names from the schema
    of the table argument.
Expressions whose values are computed as summaries for each group.
– This is a set of expressions.
– The functions on attributes in these Expressions are ones that make sense
  for a whole (nonempty) group of values, such as their Sum, Maximum,. . .

• This operation handles Requirement 8.

• The schema of its result consists of

– every grouping attribute mentioned in that argument, and


– for each expression mentioned in that argument, a new attribute named in
some implementation-dependent way.
For instance, if the expression is Max(AttrName) then this new attribute could
get the name MaxOfAttrName, and so on.

• The contents of its result can be formed as follows:

¬ The rows in the table argument are partitioned into groups, so that two rows t
and u are in the same group exactly when t .a = u .a for every attribute a
mentioned in the attribute argument.
­ Each such group g generates one tuple tg into the result. The value tg .a will
be this common attribute value of g for every attribute a mentioned in the
attribute argument.
® This tuple tg will also be “extended” with the values for each of the expressions
mentioned in that argument.
– Here these values are now computed by considering all the rows in g to-
gether.
– Hence they summarize the whole group g.
– In contrast, extend computed its new values individually row by row.

• For instance the expression

Q12 = groupby(STUDENT
,{MajorId} ¬
,{Min(GradYear),Max(GradYear)}) ­

¬ groups together every student with the same major, and
­ computes the minimum and maximum graduation year for each group.

Its output is in Figure 15.

• On the other hand, the expression

Q13 = groupby(STUDENT
,{MajorId,GradYear}
,{Count(SId)})

specifies two grouping attributes MajorId and GradYear, and so its result in Fig-
ure 16 tabulates how many graduates each major subject has had each year.

• If the attribute argument is empty, then the whole table argument forms a single
group, which gets summarized into a single row:

Figure 15: The output for Q12. (Sciore, 2008)

Figure 16: The output for Q13. (Sciore, 2008)

Q14 = groupby(STUDENT
,{}
,{Min(GradYear)})

computes the earliest graduation year of any student.

• If the expression argument is empty, then groupby groups the rows of the table
argument and removes duplicates:

Q15 = groupby(STUDENT
,{MajorId}
,{})

lists all the distinct majors of all students.

• The functions in the expression argument come in two flavours. For instance:

Q16 counts how many students there are with known major subjects – aggregation
    ignores NULL values, because it is not clear to which group they should belong.
Q17 counts instead how many distinct major subjects the students have – each
    major subject is now counted only once, whereas Q16 added 1 to the count for
    each student.

Q16 = groupby(STUDENT
,{}
,{Count(MajorId)})
Q17 = groupby(STUDENT
,{}
,{CountDistinct(MajorId)})
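A Python sketch of groupby over a table modeled as a list of dicts (toy rows assumed). The result names MinOfGradYear and MaxOfGradYear follow the implementation-dependent naming convention mentioned above, and an empty expression set removes duplicates as in Q15:

```python
from collections import defaultdict

def groupby(table, attrs, aggregates):
    """Partition the rows on the grouping attributes; 'aggregates' maps a
    result attribute name to a function summarizing one group of rows."""
    groups = defaultdict(list)
    for t in table:
        groups[tuple(t[a] for a in attrs)].append(t)
    result = []
    for key, rows in groups.items():
        out = dict(zip(attrs, key))          # the common grouping values
        for name, fn in aggregates.items():  # one summary value per expression
            out[name] = fn(rows)
        result.append(out)
    return result

# Toy rows, for illustration only.
STUDENT = [
    {"SId": 1, "MajorId": 10, "GradYear": 2004},
    {"SId": 2, "MajorId": 10, "GradYear": 2005},
    {"SId": 3, "MajorId": 20, "GradYear": 2004},
]
# Q12: minimum and maximum graduation year per major.
q12 = groupby(STUDENT, ["MajorId"],
              {"MinOfGradYear": lambda g: min(t["GradYear"] for t in g),
               "MaxOfGradYear": lambda g: max(t["GradYear"] for t in g)})
# Q15-style duplicate removal: an empty set of expressions.
q15 = groupby(STUDENT, ["MajorId"], {})
```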

Product (Sciore, 2008, Chapter 4.2.7)

• The fundamental tool for combining tables is the product operator.

• It takes 2 arguments:

one table and


another table such that their schemas have no attribute names in common – so
we rename them apart first, if necessary.

• The result of

product(T ,U )

consists of all these combinations:

Figure 17: The result of Q22 = product(STUDENT,DEPT). (Sciore, 2008)

1 let the result be initially empty;
2 for each row r in table T
3     for each row s in table U
4         form a new row by combining rows r and s into one;
5         add this new row into the result.

• Figure 17 shows an example.

• The schema of its result consists of the schemas of its two table arguments
  combined – because they are assumed to be renamed apart from each other.

• This operation is often written as


T ×U

Figure 18: The expression tree for Q23. (Sciore, 2008)

in the database literature – because if the tables T and U are sets, then the result
is their Cartesian product.

• The product operator is on the one hand

fundamental because with it we can combine tables in every way we may want
to, but on the other hand
impractical because
– it is very slow to compute, because its result is so big, and
– (almost) always we want to combine tables with much more precision than
“all rows r from table T to all rows s from table U ”.

Join

• For instance, if attributes b1 , b2 , . . . , bn of the table with schema T (~a, b1 , b2 , . . . , bn )
  are a foreign key referencing another table with schema U (~c, ~d), then we almost
  always want for each ~a only the corresponding ~d:

  select(product(T
                ,U )
        ,b1 = c1 and b2 = c2 and...and bn = cn ).

• In our university example, we may want to combine students and their majors in
this way:

Q23 = select(product(STUDENT
,DEPT)
,MajorId=DId)

Then its result also contains the attribute DName, which gives the name of the major
– MajorId carried the same information only as an artificial ID.

• Figure 18 shows the expression as a tree.

• These are examples of join operations. They have 3 arguments:

one table and


another table and
a predicate as in selection.

They are so common and useful that they warrant their own shorthand notation:

join(T ,U ,φ) ≡
select(product(T
,U )
,φ)

• Conversely, a product is a join whose comparison predicate φ is always true.

• When the comparison predicate φ compares attributes for equality, as in

  b1 = c1 and b2 = c2 and...and bn = cn

  here, the join is called an equijoin. We focus mainly on equijoins here.

• When an equijoin is used to traverse the foreign key from table T into table U , as
in here, it is called a relationship join.

• As an example of joining multiple tables together, let us find out the grades Joe
received in 2004:

Q25 = select(STUDENT
,Sname=’joe’)
Q26 = join(Q25
,ENROLL
,SId=StudentId)
Q27 = select(SECTION
,YearOffered=2004)
Q28 = join(Q26
,Q27
,SectionId=SectId)
Q29 = project(Q28
,{Grade})

Q26 finds the courses to which Joe has ENROLLed. This needed his student ID via
Q25.
Q28 finds his ENROLLments during 2004. This needed the SECTIONs offered then via
Q27.

• Figure 19 is its expression tree.

• In general, if we must combine m tables, then we need m − 1 joins.
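A Python sketch of product and join over tables modeled as lists of dicts (toy STUDENT and DEPT rows assumed; their schemas share no attribute names here, so no renaming is needed):

```python
def product(t, u):
    """Combine every row of t with every row of u; the schemas are
    assumed to have been renamed apart already."""
    return [{**r, **s} for r in t for s in u]

def select(table, predicate):
    return [t for t in table if predicate(t)]

def join(t, u, predicate):
    """join(T, U, phi) is just shorthand for select(product(T, U), phi)."""
    return select(product(t, u), predicate)

STUDENT = [{"SId": 1, "SName": "joe", "MajorId": 10}]
DEPT = [{"DId": 10, "DName": "compsci"}, {"DId": 20, "DName": "math"}]
# Q23: the relationship join along the MajorId foreign key.
q23 = join(STUDENT, DEPT, lambda t: t["MajorId"] == t["DId"])
```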

Figure 19: The expression tree for Q25–Q29. (Sciore, 2008)

Semijoin (Sciore, 2008, Chapter 4.2.8)

• A semijoin has the same 3 arguments as join.

• However, its result is different:

– It consists of those rows r of the first table T for which there exists some
matching row s in the second table U .
– That is, so that r and s together satisfy the join predicate φ.
– But none of the attributes of this matching row s are included in the result.

• It is similar to the selection operation

except that now rows r of table T are chosen into the result based on the other
table U
whereas selection chose rows r based on the attribute values in each row r itself.

• This semijoin(T ,U ,φ) can be implemented with other operations:

project(join(T
,U
,φ)
,the attributes of T ).

• As an example, let us find the students taught by prof. Einstein:

Q38 = select(SECTION
,Prof=’einstein’)
Q39 = semijoin(ENROLL
,Q38

Figure 20: The expression tree for Q38–Q40. (Sciore, 2008)

,SectionId=SectId)
Q40 = semijoin(STUDENT
,Q39
,SId=StudentId)

Q39 chooses those ENROLLments whose section IDs are found among the SECTIONs
    taught by him, retrieved as Q38.
Q40 chooses those STUDENTs whose student IDs are found in Q39.

Figure 20 shows it as an expression tree.

Antijoin

• The antijoin operator is the complement of the semijoin operator:


Its result consists of those rows r of table T for which there does not exist any
matching row s in table U .

• In contrast to semijoin, this antijoin operation cannot be implemented with our


previous operations:

– All of them can be shown to be monotone in their table argument(s):
  If we add more rows into the argument(s), then the result is at least as large as
  before.
– Moreover, it can be shown that combining monotone operations always yields
a monotone result as well.
– But antijoin(T ,U ,φ) is antimonotone in its second table argument U :
  If we add more rows into U , then the result can get smaller than before!
– Hence we cannot produce this antimonotone result by any combination of our
other monotone operations.

• We need this antijoin for queries whose form is “there does not exist any x such
that. . . ”.

– For instance, a SECTION of a course was easy if no ENROLLed student got the
failing grade ‘F’.
– In other words: if there does not exist any ENROLLed student who got an F in
this SECTION.
– In our Relational Algebra this is
Q42 = select(ENROLL
,Grade=’F’)
Q43 = antijoin(SECTION
,Q42
,SectionId=SectId)

or “keep only those SECTIONs which do not appear in the table Q42 of ENROLLments
which got an ‘F’”.
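In SQL the antijoin Q42–Q43 is typically written with NOT EXISTS. A minimal sketch of the "easy sections" query, again using Python's sqlite3 in place of SimpleDB with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SECTION (SectId INT, Prof TEXT);
    CREATE TABLE ENROLL  (EId INT, StudentId INT, SectionId INT, Grade TEXT);
    INSERT INTO SECTION VALUES (13, 'turing'), (23, 'einstein');
    INSERT INTO ENROLL  VALUES (14, 1, 13, 'A'), (24, 1, 23, 'F');
""")

# Q43 as SQL: keep the SECTIONs with no ENROLLment graded 'F'.
easy_sections = conn.execute("""
    SELECT SectId FROM SECTION s
    WHERE NOT EXISTS (SELECT * FROM ENROLL e
                      WHERE e.SectionId = s.SectId AND e.Grade = 'F')
""").fetchall()
print(easy_sections)   # section 23 got an 'F', so only section 13 is easy
```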

• We need antijoin also for queries whose form is “something holds for every x”.

– This is because “φ is true for every x” is logically equivalent to “there exists


no x for which φ is false”. . .
– . . . or symbolically: ∀x.φ is logically equivalent to ¬∃x.¬φ.
– For instance a professor is stern if (s)he has given at least one grade ‘F’ in
every SECTION (s)he has ever taught.
– In our Relational Algebra this is
Q49 = rename(Q43
,Prof
,BadProf)
Q50 = antijoin(SECTION
,Q49
,Prof=BadProf)
Q51 = groupby(Q50
,{Prof}
,{})

or “keep only the professors of those SECTIONs whose professor has never taught
an easy SECTION (where the previous query Q43 retrieved the easy sections)”.
– Figure 21 shows its expression tree.

• Note: These double negations can be tricky to read and write! It helps to know
something about logic.
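The "stern professors" double negation can also be checked concretely. The sketch below, using sqlite3 with invented rows, nests one NOT EXISTS inside another: the inner one finds a section without an 'F' (an easy section), the outer one keeps professors for whom no such section exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SECTION (SectId INT, Prof TEXT);
    CREATE TABLE ENROLL  (EId INT, StudentId INT, SectionId INT, Grade TEXT);
    INSERT INTO SECTION VALUES (13, 'turing'), (33, 'turing'), (23, 'einstein');
    INSERT INTO ENROLL  VALUES (1, 1, 13, 'F'), (2, 2, 33, 'A'), (3, 1, 23, 'F');
""")

# "an 'F' exists in every section taught by p" becomes
# "there exists no section taught by p without an 'F'".
stern = conn.execute("""
    SELECT DISTINCT Prof FROM SECTION s1
    WHERE NOT EXISTS (
        SELECT * FROM SECTION s2
        WHERE s2.Prof = s1.Prof
          AND NOT EXISTS (SELECT * FROM ENROLL e
                          WHERE e.SectionId = s2.SectId AND e.Grade = 'F'))
""").fetchall()
print(stern)   # turing's section 33 had no 'F', so only einstein is stern
```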

Figure 21: The tree for stern professors. (Sciore, 2008)

Union (Sciore, 2008, Chapter 4.2.9)


• The union operator takes 2 arguments:
one table and
another table which has the same schema – we can rename appropriately first if
needed.
Its result has also the same schema, and consists of all the rows which appear in at
least one of these two table arguments.
• Hence union(T ,U ) is similar to T ∪ U in mathematics.
• However, union is not needed very often, because we rarely want to know what
information table T or table U contains – in most situations, we want to know
what information we can get by joining them somehow instead.

• One use is to coalesce similar values together. For instance

Q52 = rename(project(STUDENT
,{Sname})
,SName
,Person)
Q53 = rename(project(SECTION
,{Prof})
,Prof
,Person)
Q54 = union(Q52
,Q53)

Figure 22: The result of Q55. (Sciore, 2008)

combines both STUDENTs (in Q52) and professors (in Q53) together as Persons, be-
cause here a person is either a student or a professor.
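The rename-then-union pattern of Q52–Q54 maps directly to SQL's column aliases and UNION. A runnable sketch with sqlite3 and invented rows (an ORDER BY is added only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (SId INT, SName TEXT);
    CREATE TABLE SECTION (SectId INT, Prof TEXT);
    INSERT INTO STUDENT VALUES (1, 'joe'), (2, 'amy');
    INSERT INTO SECTION VALUES (13, 'turing'), (23, 'einstein');
""")

# Project and rename both name columns to Person, then union:
# both schemas now match, as the union operator requires.
persons = conn.execute("""
    SELECT SName AS Person FROM STUDENT
    UNION
    SELECT Prof AS Person FROM SECTION
    ORDER BY Person
""").fetchall()
print(persons)
```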

Outer Join
• The union operator is most commonly used as part of the outer join operator.
• This outerjoin operator has the same 3 arguments as the join operator.
• Its result consists of
– the result of the corresponding join operation, together with (here is the
union)
– all the rows from the two argument tables which did not match the join pred-
icate. . .
– . . . with their missing attribute values filled with NULLs (which of course must
be permitted by requirement 3).
That is, an outerjoin is a join which does include NULLs because their unknown
actual values might have matched the join predicate.

• For instance, we may want to see all the current ENROLLments together with all the
STUDENTs who have not ENROLLed into anything yet:

Q55 = outerjoin(STUDENT,ENROLL,SId=StudentId)

• From this we can count the number of ENROLLments for each STUDENT:

Q58 = groupby(Q55
,{SId}
,{Count(EId)})

– Now a STUDENT with no ENROLLments yet is alone in his/her own group. . .
– . . . and since the Count aggregation function ignores the NULL EId value in
his/her own group, its value will be 0 as it should.
– If we had used just ENROLL instead of Q55 in Q58, then we would have missed
these STUDENTs with 0 ENROLLments.
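The Q55/Q58 combination corresponds to SQL's LEFT JOIN plus COUNT. This sqlite3 sketch (invented rows) shows the key point: COUNT over the joined attribute skips the NULL produced for the unmatched student, so her count is 0 rather than missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (SId INT, SName TEXT);
    CREATE TABLE ENROLL  (EId INT, StudentId INT, SectionId INT);
    INSERT INTO STUDENT VALUES (1, 'joe'), (2, 'amy');
    INSERT INTO ENROLL  VALUES (14, 1, 13), (24, 1, 23);  -- amy has none
""")

# Left outer join keeps amy with NULL ENROLL attributes;
# COUNT(e.EId) ignores that NULL, yielding 0 for her group.
counts = conn.execute("""
    SELECT s.SId, COUNT(e.EId)
    FROM STUDENT s LEFT JOIN ENROLL e ON s.SId = e.StudentId
    GROUP BY s.SId
    ORDER BY s.SId
""").fetchall()
print(counts)   # joe has 2 enrollments, amy 0
```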

• In general, there are 3 kinds of outer joins:

Full outerjoins as described here, whose result consists of all rows from both table
arguments, with NULLs for those attributes for which no matching row existed
in the other table argument.
Left outer joins, whose result consists of all rows from the first table argument,
with NULLs for those attributes for which no matching row existed in the
second table argument.
– This Q55 is such a leftouterjoin, because. . .
– . . . it follows the foreign key from STUDENT into ENROLL. . .
– . . . and so each NULL is for a STUDENT without any ENROLLments, and they
are all at the “right end” of the result in Figure 22. . .
– . . . whereas there are no ENROLLments without STUDENTs, which would
cause NULLs at the “left end” of the result.
Right outer joins, symmetrically.

2.7 Structured Query Language


• The Structured Query Language (SQL) is the standard language for interacting with
an RDBMS.

• It (like all proper DBMS languages) contains 3 main sublanguages:

Data Definition Language (DDL) for defining the elements of the current data-
base schema.
Data Manipulation Language (DML) for populating the tables of the defined
schema with rows.
Query Language (QL) for retrieving the information stored in these database
table rows in various ways.

2.7.1 Data Definition Language


(Connolly and Begg, 2010, Chapter 7.3) (Sciore, 2008, Chapter 2.6)

• The CREATE command adds into the database schema new elements, like

tables Figure 5
integrity constraints like assertions in Figures 6–8 and triggers in Figures 9–10
views whose creation consists essentially of giving the defining query Q, and
indexes on a table and its attributes (in parentheses, separated by commas) like
in Figure 23.

Figure 23: Index creation commands. (Sciore, 2008)

• The SQL DDL user can ALTER these CREATEd tables and VIEWs (by ADDing
and DROPping COLUMNs and integrity constraint ASSERTIONs) later, and
DROPping them altogether when they are no longer needed.

• The SQL DDL user can also CREATE and DROP whole SCHEMAs, because
the same RDBMS offers different schemas for different users.

2.7.2 Query Language


(Sciore, 2008, Chapter 4.3)

• Let us review the main (but not nearly all!) query features of SQL, and relate
them to our Relational Algebra presented in section 2.6, because here our aim is to
understand how an SQL query gets executed by the RDBMS.

• The SQL query statement has the form


SELECT [DISTINCT] attributes
FROM tables
[WHERE predicate]
[GROUP BY grouping [HAVING predicate]]
[ORDER BY ordering]

where each [bracketed] part is optional.

The SELECT Part (Sciore, 2008, Chapter 4.3.3)

• SQL SELECT is the projection operator of Relational Algebra – not selection


despite its name.

• Its optional DISTINCT qualifier removes duplicate rows from the result – using
the appropriate groupby operator.

• Its attributes are a comma-separated list of FullNames having the form

RangeVar .AttrName

where

RangeVar is the range variable for some table T declared in the FROM part to
be explained next.
AttrName is the name of some attribute in this table T .
Or it can be ‘*’ instead. This shorthand expands into all the attributes of
table T .

Such a FullName stands for the attribute value r .AttrName for the current row r
of table T .
• Besides these names, the attributes can also contain
Expression AS NewAttrName
forms. These denote in turn extending the result with this new named attribute,
whose value for each row r is obtained by evaluating this Expression.
• A common use for this form is
OldAttrName AS NewAttrName
which essentially renames an old attribute.

The FROM Part (Connolly and Begg, 2010, Chapter 6.3.7) (Sciore, 2008, Chap-
ter 4.3.4)
• The tables in the FROM part are a comma-separated list of
TableName RangeVar
forms. Such a form declares that this RangeVar stands for the current row r of
TableName.
• If none of the other TableNames in this FROM part have any attribute names
in common with this one, then this RangeVar (and ‘.’) can be omitted from
FullNames, because then their AttrNames are enough to determine that they mean
this table.
• This TableName can also be another nested SELECT. . . FROM. . . WHERE. . .
query (in parentheses). Then its RangeVar ranges over the result rows of this nested
query.
• These nested queries permit one possible implementation for the view from sec-
tion 2.4:
If the TableName is a view, then put its defining query (Q) in its place.
• The corresponding Relational Algebra expression is the product of all TableNames
and nested queries in this FROM part.
• It is also possible to write different kinds of joins in this FROM part with the
syntax
first table [FULL or LEFT or RIGHT or NATURAL or CROSS or. . . ] JOIN
second table ON predicate
so Q55 could be written in SQL for instance like
SELECT *
FROM STUDENT s
LEFT JOIN ENROLL e
ON s.SId = e.StudentId
whose result would then use a row of NULLs for those STUDENT rows s which do
not possess any matching ENROLLment rows e.

The WHERE Part (Sciore, 2008, Chapters 4.3.5 and 4.3.8)

• The optional WHERE part corresponds to the selection operation on this pred-
icate from the big product of the FROM part.

• If this part is missing, then WHERE true is assumed instead.

• A particularly common special case is when the predicate is a conjunction (that is,
all ands but no ors) of Terms with the form

one FullName = another FullName

because this is an equijoin of the FROM part.

• An example of such a query is “the grades Joe received during his graduation year”:

SELECT e.Grade
FROM STUDENT s,ENROLL e,SECTION k
WHERE s.SId=e.StudentId AND e.SectionId=k.SectId
AND k.YearOffered=s.GradYear AND s.SName=’Joe’

Its direct corresponding Relational Algebra expression is

project(select(product(product(STUDENT
,ENROLL)
,SECTION)
,s.SId=e.StudentId
AND e.SectionId=k.SectId
AND k.YearOffered=s.GradYear
AND s.SName=’Joe’)
,{e.Grade})

but the RDBMS query optimizer can improve it further into Figure 24.

• This optimization has consisted of

¬ considering each Term of the selection predicate separately – this is permit-


ted, because it is a conjunction – and
­ moving each Term down towards the actual tables for as far as it will go, and
® using each moved Term as a join predicate.
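The unoptimized three-table query already produces the right answer; the optimizer only changes how it is computed. This sqlite3 sketch runs the "Joe's graduation-year grades" query verbatim over invented rows (SQLite performs an optimization of this kind internally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (SId INT, SName TEXT, GradYear INT);
    CREATE TABLE SECTION (SectId INT, Prof TEXT, YearOffered INT);
    CREATE TABLE ENROLL  (EId INT, StudentId INT, SectionId INT, Grade TEXT);
    INSERT INTO STUDENT VALUES (1, 'joe', 2004);
    INSERT INTO SECTION VALUES (13, 'turing', 2004), (23, 'einstein', 2003);
    INSERT INTO ENROLL  VALUES (14, 1, 13, 'A'), (24, 1, 23, 'C');
""")

# The conjunction of equality Terms in WHERE becomes equijoins of the
# FROM product; only the 2004 section matches joe's GradYear.
grades = conn.execute("""
    SELECT e.Grade
    FROM STUDENT s, ENROLL e, SECTION k
    WHERE s.SId = e.StudentId AND e.SectionId = k.SectId
      AND k.YearOffered = s.GradYear AND s.SName = 'joe'
""").fetchall()
print(grades)   # only the grade from the section offered in 2004
```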

• The predicate can also contain another nested query with

FullName [NOT] IN (Query)

which is true if the current value of FullName is in the result of this nested Query.
That is,

– this kind of Term specifies a semijoin, while. . .

Figure 24: The Relational Algebra tree for Joe’s final year grades. (Sciore, 2008)

– its optional NOT specifies an antijoin instead


with the result of this Query on this FullName.
• Another kind of nested query is

[NOT] EXISTS (Query)

which is true if the result of this Query has [no] rows.


– It is also a semi- or antijoin, but without the FullName.
– We have already used it in our assertions in Figures 6–8.

The GROUP BY Part (Sciore, 2008, Chapters 4.3.6–4.3.7)


• The optional GROUP BY part turns the SELECT from a projection into a
groupby operation whose 3 arguments come from the following places:
Table comes from the FROM. . . WHERE. . . parts – which cannot therefore use
any values produced by groupby, because it takes place only after this joining.
Attributes come from the comma-separated grouping list of FullNames from the
table argument.
Expressions come from the SELECT part – which can therefore contain only
– grouping attributes and
– aggregate function calls on attributes of the table arguments AS new at-
tributes, with optional DISTINCTness directives.
• The optional HAVING part permits testing a WHERE-like condition after the
groupby operations, so it can use the produced values.

The ORDER BY Part (Sciore, 2008, Chapter 4.3.10)
• The optional ORDER BY part specifies a sorting operation as the very last step
of the whole query.

• Its ordering is a comma-separated list of AttrNames from the SELECT part –


without their possible RangeVars, because the output of the whole query is sorted,
not the tables in its FROM part.

• An AttrName in this list can be optionally followed by DESCending to indicate


that it must be sorted in the opposite order.

Combining SELECTion Statements (Connolly and Begg, 2010, Chapter 6.3.9)


(Sciore, 2008, Chapter 4.3.9)
• SQL also permits the set-theoretical operations

UNION for T ∪ U which corresponds to the Relational Algebra union operator


INTERSECT for T ∩ U which can be expressed with a suitable semijoin in
Relational Algebra
EXCEPT for T \ U (set difference, or “the part of T which does not belong to U ”,
sometimes denoted as R − U instead) which can be expressed with a suitable
antijoin in Relational Algebra

between the two (parenthesized) SELECT. . . FROM. . . WHERE. . . queries which


produce the two result tables T and U .
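EXCEPT, the set-difference form, is the one most easily recognized as an antijoin. A sqlite3 sketch with invented rows, finding the student IDs that appear in no ENROLLment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (SId INT, SName TEXT);
    CREATE TABLE ENROLL  (EId INT, StudentId INT, SectionId INT);
    INSERT INTO STUDENT VALUES (1, 'joe'), (2, 'amy');
    INSERT INTO ENROLL  VALUES (14, 1, 13);
""")

# T \ U as EXCEPT: an antijoin in disguise.
not_enrolled = conn.execute("""
    SELECT SId FROM STUDENT
    EXCEPT
    SELECT StudentId FROM ENROLL
""").fetchall()
print(not_enrolled)   # only amy (SId 2) has no enrollments
```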

2.7.3 Data Manipulation Language


(Connolly and Begg, 2010, Chapter 6.3.10) (Sciore, 2008, Chapter 4.4)
• The SQL statement to insert one new row into a table is

INSERT INTO TableName [(AttributeList)]


VALUES (ValueList)

whose

AttributeList lists the names a1 , a2 , a3 , . . . , an of some attributes of TableName. A


missing list means all its attributes.
ValueList lists the values v1 , v2 , v3 , . . . , vn given to these named attributes. The
other attributes of the new row receive NULL or default values, as prescribed
by the table definition.

• SQL can also insert many new rows by replacing the VALUES part with a database
Query.

• The SQL command to delete rows from a table is

DELETE FROM TableName


WHERE predicate

whose

predicate chooses the rows to delete, based on their attribute values, as in a Query.

• The SQL command to update rows in a table is

UPDATE TableName
SET AssignmentList
WHERE predicate

whose

predicate chooses the rows r to update as before, and


AssignmentList is a comma-separated list of
AttrName = Expression

forms. Such a form means that r .AttrName is updated into the value of its
Expression.
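The three DML statements in action, as a sqlite3 sketch with invented rows. The driver reports how many rows each UPDATE or DELETE touched, just as JDBC's executeUpdate does (section 3.2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (SId INT, SName TEXT, GradYear INT)")

# INSERT one row at a time, naming the attributes explicitly.
conn.execute("INSERT INTO STUDENT (SId, SName, GradYear) VALUES (1, 'joe', 2004)")
conn.execute("INSERT INTO STUDENT (SId, SName, GradYear) VALUES (2, 'amy', 2004)")

# UPDATE and DELETE choose their rows with a predicate;
# rowcount tells how many rows each statement affected.
updated = conn.execute(
    "UPDATE STUDENT SET GradYear = 2005 WHERE SName = 'amy'").rowcount
deleted = conn.execute(
    "DELETE FROM STUDENT WHERE GradYear = 2004").rowcount
print(updated, deleted)   # one row updated, one row deleted
```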

3 Client-Server Database Architecture


(Sciore, 2008, Chapter 7)

• An RDBMS is usually organized as a client-server architecture:

clients which are other computers. A client

– connects to the server across the network
– runs some application program which interacts with the RDBMS on the
server by sending SQL commands and receiving their results

server which is a separate computer running the actual RDBMS as one operating
system (OS) process. It

– handles concurrent communication with its clients via transactions, where
each transaction runs as its own OS thread within the RDBMS process
– executes the SQL commands it receives from its clients in these threads
– is the component which manages the actual database at the physical level
of files
– keeps the database in a consistent state via transaction recovery.

• This Client-Server architecture is also used on a single computer, so that the clients
are other processes running in the same computer as the RDBMS process.

• This architecture separates the

front end of a database application program (which handles the user interface and
the part of the “business logic” of the organization which cannot be represented
with database integrity constraints) in the client from its

back end in the server which provides the common database part for all such ap-
plications.

• There are also distributed (R)DBMSs:

– The database is divided among several servers, which serve the clients
together.
– They are very important, especially on the web.
– However, this course concentrates only on the “classical” one-server RDBMSs.

• Non-client-server architectures can be used instead when a single application “owns”


the whole database privately.

3.1 Installing and Running SimpleDB


(Sciore, 2008, Chapter 7)

• Here are the general steps for getting the SimpleDB RDBMS up and running on
your computer.

• How each step is carried out in a particular OS is left as an exercise to the reader. . .

¬ Download its latest version from http://www.cs.bc.edu/~sciore/simpledb/


and unzip it. The version used here is 2.9.

­ Move the unzipped simpledb subdirectory into the serverdirectory where you want
the server-side software to be.

® Ensure that this serverdirectory is in your CLASSPATH environment variable, so that


your java can find it. This SimpleDB version assumes java version 1.6.

¯ Ensure that the current working directory ‘.’ is in CLASSPATH too (it may already
be).

• The SimpleDB server-side software should now be installed. The server process can
be started as follows:

° Start the

rmiregistry

program as another process.

• This program is part of Java SDK, which you should already have.
• It is the Remote Method Invocation (RMI) registry – the “phone directory”
for Java methods which can be called from other processes, even across the
network.
• The SimpleDB server registers its public methods there, so that its client pro-
cesses can invoke them to ask the server to perform database operations.

± Start the server process with the

java simpledb.server.Startup databasename

command.

• If your home directory contains a subdirectory named databasename, then the


server will continue using the already created database there. If the server
starts OK, then you will see the message
recovering existing database
database server ready

where the server first recovers databasename into a consistent state, because it
may have ended abnormally.
(For instance, its previous server process may have been killed.)
• Otherwise databasename will be created as a new empty database. If the server
starts OK, then you will see the message
creating new database
new transaction: 1
transaction 1 committed
database server ready

where this 1st transaction created the empty database.


• This databasename determines the only schema the server will use now – Sim-
pleDB does not support multiple schemas at the same time.
• Note: If you want to kill and restart the SimpleDB server process then kill
rmiregistry first and wait a while before restarting it – otherwise you might
get an RMI error instead.

• The SimpleDB server process should now be running.

² You can try it out for instance with the example client programs in the unzipped
studentClient/simpledb/ subdirectory:

CreateStudentDB.java creates the university database, our running example. It


shows how to CREATE tables and INSERT rows into them.
FindMajors.java lists all the STUDENTs majoring in the given department and their
graduation years. It can be run with the command:
java FindMajors department

StudentMajor.java lists all STUDENTs and their majors.


ChangeMajor.java UPDATEs Amy’s major subject into ’drama’.
SQLInterpreter.java is a simple interactive SQL shell for SELECTion queries
and row UPDATEs.

• SimpleDB implements only a very small subset of SQL:

– The SELECT part of a query has just an attribute name list – no ‘*’, AS
nor DISTINCT.
– Its FROM part is just a table name list – no RangeVariables, JOINs nor
nested queries (but views are supported).
Hence attribute names must determine tables.
– Its WHERE part is just a conjunction of equality comparisons ‘=’ of attribute
names and constants – no other comparisons nor expressions.
– The only 2 supported attribute types are
INT for Java 32-bit integers, and
VARCHAR(N ) for ASCII strings of at most N characters
without NULLs.
– There is no UNION, GROUP nor ORDER BY.
– There are no keys or integrity constraints.
– An INSERT takes only VALUES – not queries.
– An UPDATE has only one assignment – not many.
– An INDEX can have only one attribute – not many. Moreover, index support
must be enabled separately.
– Entities CREATEd in the current schema cannot be DROPped.

Its grammar is in Figure 25.

3.2 Using a Relational Database from Java


(Sciore, 2008, Chapter 8)
• We shall consider the SimpleDB server side structure later in this course.

• Let us consider here the structure for a simple client.

• There is a family of client-server database communication protocols called Open Data


Base Connectivity (ODBC).

• There are now ODBC binding libraries for many programming languages. They
permit application programs written in that language to communicate with any
ODBC-compliant database server.

• The Java binding is called JDBC – which does not mean “Java DBC” according to
Sun’s legal position. . .

• The SimpleDB supports enough of the JDBC specification to allow writing simple
clients – but not nearly all the features of the whole specification.

• This basic JDBC API is shown as Figure 26.

• We shall use the SimpleDB studentClient/simpledb/FindMajors.java client as


our example.

• Such a batch-oriented client has 4 main phases:

Figure 25: A small SQL language dialect. (Sciore, 2008)

Figure 26: The basic JDBC Application Programming Interface. (Sciore, 2008)

¬ The client opens a connection to the server.

• The central Java code is


Driver d = new theRightDriver ();
String url = "jdbc:system://server /path";
Connection conn = d.connect(url,properties);

where
theRightDriver () is supplied by the RDBMS JDBC binding, and imported
into the client code.
For SimpleDB, it is simpledb.remote.SimpleDriver.
system is the RDBMS used.
For SimpleDB, it is simpledb.
server is the machine running the rmiregistry and the RDBMS processes to
which this client wants to connect.
If this server is in the same machine as this client, then this is localhost.
/path leads to the databasename to use within the server .
For SimpleDB it is not needed, because it stores its databasename subdirectories directly in its users’ home directories.
properties is an RDBMS-specific string giving extra options for the connection.
For instance, if the RDBMS has mandatory access control, then this string
can contain the required username and password.
SimpleDB does not support any properties so it is the null pointer.
• The vendor-independent parts of JDBC are imported from java.sql.*.
• The method calls of this created connection
¶ happen remotely via the rmiregistry process running on the server . . .
· which in turn forwards them to the RDBMS process.
• Unfortunately this old way to form the connection is not very portable, because
the client contains theRightDriver which is vendor-dependent.
• Java supports also new ways, where the server can send theRightDriver to its
clients based on the system in the url (Sciore, 2008, Chapter 8.2.1).
+ Now the client is vendor-independent, but. . .
− the server-side setup gets more complicated, and so we continue using the
old way here instead.
­ The client sends an SQL statement to the server.

• The central Java code for querying the database is


Statement stmt = conn.createStatement();
String qry = statement;
ResultSet rs = stmt.executeQuery(qry);

where
statement is an SQL SELECT. . . FROM. . . WHERE. . . statement as text.
rs gives the results of the query as a result set to be processed in the next
phase ®.

• Other SQL statements can be issued with
int howMany = stmt.executeUpdate(qry);

whose return value tells howMany records were affected instead of a result set.
• The RDBMS server
¶ first compiles this statement into Relational Algebra and optimizes it into
a form. . .
· which it then executes.
• A statement can also be prepared beforehand:
– The compilation step ¶ happens only once.
– The same compiled statement can be executed in step · many times with
different parameter values each time.
This is useful, because we shall see during this course that step ¶ is not trivial.
• These parameter positions are marked with question marks ‘?’ within the
statement to prepare, while the value for the nth ‘?’ can be set with the
method
setType(int n,Type value)

for each SQL Type.


• Figure 27 shows an example.
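The same prepare-once, execute-many idea exists outside JDBC too. As a language-neutral illustration, Python's DB-API uses the identical '?' placeholder style; this sqlite3 sketch (invented table and rows) fills the placeholders per execution instead of splicing values into the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (SId INT, SName TEXT)")

# executemany reuses one parameterized statement for every value tuple,
# analogous to a JDBC PreparedStatement executed in a loop.
conn.executemany("INSERT INTO STUDENT (SId, SName) VALUES (?, ?)",
                 [(1, 'joe'), (2, 'amy')])

# The same '?' style parameterizes queries as well.
row = conn.execute("SELECT SName FROM STUDENT WHERE SId = ?", (2,)).fetchone()
print(row)
```

Besides avoiding recompilation, parameterized statements also prevent SQL injection, since parameter values are never parsed as SQL.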

® The client receives the result from the server.

• The result set of a query consists of the corresponding rows. One of them is
the current row – a reading position within the result set.
– Initially this current row is just before the first row of the result set – so
it is not valid yet.
– Method next moves this current row to the next row of the result set. It
returns false if it moved past the last row of the result set – so it is no
longer valid.
– If the current row is valid, then the value for its named attribute can be
extracted with the method
Type getType(String name)

for each SQL Type.


– Note: Figure 26 did not mention this name parameter.
• SimpleDB will use this “current row model” also for its internal intermediate
result sets of individual Relational Algebra operators.

• Besides these basic “read forward” result sets, JDBC also supports
scrollable result sets, whose current row can move also backwards, and
updatable result sets, which permit updating the attribute values of the
current row
(Sciore, 2008, Chapter 8.2.5) which are especially useful in clients with graph-
ical user interfaces (GUIs).
• Such a result set is an example of a lazy data structure:
– it does not exist as a whole, but. . .
– its elements are constructed one by one, as the client asks for the next
one.
• Once the
while(rs.next())

loop processing the result set rs finishes, the client should call
rs.close()

as soon as possible, because the RDBMS maintains each open result set, and
they reserve its limited resources.

¯ The client closes its connection to the server.

• Similarly, the client should call


conn.close()

as soon as it no longer needs this connection to the RDBMS, because open


connections are a limited resource too.

SimpleDB source file studentClient/simpledb/FindMajors.java

• The symbol ‘&’ denotes a long source code line which had to be divided into many
lines on the pages.
import java.sql.*;
import simpledb.remote.SimpleDriver;

public class FindMajors {
  public static void main(String[] args) {
    String major = args[0];
    System.out.println("Here are the " + major + " majors");
    System.out.println("Name\tGradYear");

    Connection conn = null;
    try {
      // Step 1: connect to database server
      Driver d = new SimpleDriver();
      conn = d.connect("jdbc:simpledb://localhost", null);

      // Step 2: execute the query
      Statement stmt = conn.createStatement();
      String qry = "select sname, gradyear "
                 + "from student, dept "
                 + "where did = majorid "
                 + "and dname = '" + major + "'";
      ResultSet rs = stmt.executeQuery(qry);

      // Step 3: loop through the result set
      while (rs.next()) {
        String sname = rs.getString("sname");
        int gradyear = rs.getInt("gradyear");
        System.out.println(sname + "\t" + gradyear);
      }
      rs.close();
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    finally {
      // Step 4: close the connection
      try {
        if (conn != null)
          conn.close();
      }
      catch (SQLException e) {
        e.printStackTrace();
      }
    }
  }
}

Figure 27: Preparing an SQL statement and using it. (Sciore, 2008)

3.3 JDBC Error Handling


• The FindMajors client code performed its phases ¬–® in a Java try block, be-
cause JDBC reports errors by throwing exceptions (Sestoft, 2005, Chapters 12.6.5–
12.6.6).

• These exceptions may arise for various phases and reasons:

¶ The client might not be able to connect to the server in phase ¬.


· There may be something wrong in the SQL statement which the client sends
to the server in phase ­.
¸ The server or network might crash during the result set processing loop of
phase ®.
¹ The RDBMS may have to abort the transaction of the client because the
RDBMS is running out of resources.

The client may choose to retry its operation later, especially if the reason for its
failure was ¹.

• The FindMajors client

– gives up trying, and


– closes the connection, and
– prints the Java stack trace as the diagnostic information.

Table 2: With and without autocommit mode.

With AutoCommit still true:
– The RDBMS executes each SQL statement as its own transaction.
– The RDBMS commits (or aborts) them internally and automatically – this is
what “autocommit” means.

After setting AutoCommit to false via the API in Figure 28:
– The RDBMS continues the same transaction when the client sends its next
SQL statement into the connection.
– The client must commit or abort this transaction by hand at the end.

Figure 28: JDBC transactions. (Sciore, 2008)

• This takes place in the finally part, so it is executed whether the try part executed
correctly or caused an exception to catch.

• This finally part closes the connection if phase ¬ managed to open it. It may
raise an exception too, and is therefore in its own try block.

3.4 JDBC Transaction Handling


• An RDBMS operates in autocommit mode by default, as described by Table 2.

• An RDBMS operates in its default transaction isolation level, unless the client sets
this level explicitly for its connection. For instance,
conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE)

sets it to the full serializable level.

• When a client turns off autocommit mode with


conn.setAutoCommit(false)

it might (but hopefully never!) encounter the following pathological situation:

¶ Suppose that the client calls


conn.rollback()
for some reason – for instance, if some SQL statement it sent to the server
caused an exception as in section 3.3.
· Then this attempt to abort the transaction itself fails with an(other) exception.

What should the client do then? Neither committing nor aborting its transaction
is possible!
• Then the database may have become corrupted because it may not be possible to
recover it to the last consistent state before this transaction started. Hence the
client should somehow alert the DBA about this danger if possible.

3.5 Impedance Mismatch


(Sciore, 2008, Chapter 9)
• A JDBC client like FindMajors represents a rather procedural style of programming:
– An attribute value STUDENT .MajorId yields the key of the row in the DEPT
which corresponds to the department of the student represented by this row of
the STUDENT table.
– For instance, Joe’s row in the STUDENT table has the ID of the computer
science department in the DEPT table, and so on.
• A more object-oriented design would have instead
– objects like Joe of class STUDENT . . .
– with attributes like Joe .majorOf which points to another object compsci of
class DEPT , and so on.
• It is possible to build an Object-Relational Mapping (ORM) which builds the latter
design on top of the former.
• The Java Persistence Architecture (JPA) is one tool for generating such an ORM.
It uses Java code annotated with the related relational table design, as in Figure 29.

• From a “programming philosophy” (whatever that is. . . ) viewpoint, this impedance


mismatch between these models stems from their origins:
Relational model is built on first-order predicate logic, whose structure is
flat because it has just indivisible values and their relations, but
flexible because these relations can be combined freely in formulas/queries.
Object-oriented model is built instead on representing information with
structured entities with their own identity and individual properties, but
with
specific access paths between them, encoded as these per-object properties,
such as “this student’s major”.
Each is more useful than the other in some situations.

• It is possible to develop a data model based on the object-oriented philosophy.


– This leads to Object-Oriented Data Base Management Systems (OODBMSs)
like O2 .
– However, their market share has remained much smaller than that of RDBMSs,
even though object-oriented programming languages have become very com-
mon.

Figure 29: JPA annotations combining the STUDENT table and class. (Sciore, 2008)
(Continues in Figure 30.)

Figure 30: Rest of Figure 29. (Sciore, 2008)

• Moreover, there are other programming philosophies than object-orientation, such
as functional and logic programming.

– They are based on the concept of “value” instead of “(object) identity” and so
the relational model is more natural for them.
– However, despite their long history they are still niche programming languages.

4 The Structure of the SimpleDB RDBMS Engine


(Sciore, 2008, part 3)

• Now we examine how an RDBMS server can be implemented, using SimpleDB as
our example.

• Although SimpleDB is a restricted RDBMS written and made available for teaching
purposes, it does contain the most important components of a full RDBMS. These
components are shown in Figure 31.

• SimpleDB has chosen straightforward implementations for these components. We
shall mention some alternatives too.

• We can trace the execution of an SQL query in the SimpleDB RDBMS server process
down these components:

¶ The Remote manager handles the communication with the client. The server
process allocates a separate thread for each connection via the RMI mechanism.
· When a client sends an SQL statement over its open connection, this Remote
manager passes it to the Planner component.
– This component plans how the statement will be executed.
– It first invokes the Parser component, which turns the statement into a
syntax tree containing the tables, attributes, constants,. . . mentioned in it.
– This Parser component in turn invokes the Metadata manager, which
keeps track of information about the tables, attributes, indexes,. . . CRE-
ATEd in the database, to check that the things mentioned in the syntax
tree do exist and have the right type.
– The resulting plan is a Relational Algebra expression, which the Planner
sends to the Query component.
¸ The Query component turns the plan it received from the Planner component
into a scan and executes it.
– It forms this scan by choosing an implementation for each operation in the
expression. For instance, if the expression contains a sort operation, then
this Query component chooses a particular sorting algorithm to use.
– The RDBMS can choose from several algorithms for the same operation,
because different algorithms suit different situations, improving performance.
– This component uses the Metadata manager too, because its information
helps in making these choices.

Figure 31: The Components of an RDBMS Engine. (Sciore, 2008, page 310)

– This scan is executed using the same “current row” approach as the client
uses for processing the result in its phase ® in section 3.2.
¹ Each of these rows processed by the Query component is stored on disk as a
record handled by the Record manager.
– These records are stored in disk blocks held in files managed by the File
manager.
– The Buffer manager is in turn responsible for those disk blocks which have
been read into RAM for accessing the records in them.
º Each (scan for a) statement is executed as (in autocommit mode) or within
(otherwise) a Transaction. Transactions are managed by a manager responsible
for concurrency control and for recovery, which uses a designated Log file
managed by its own component.
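The “current row” protocol mentioned in step ¸ can be sketched as a small Java interface. The names below are simplified assumptions in the spirit of SimpleDB’s scans, not its exact API, and the in-memory implementation exists only to show the protocol.

```java
import java.util.*;

// The caller repeatedly calls next() to advance the scan, then reads
// fields of the row the scan is currently positioned on.
interface RowScan {
   boolean next();                  // advance; false when no rows remain
   String getString(String field);  // read a field of the current row
}

// A trivial in-memory implementation, just to show the protocol.
class ListScan implements RowScan {
   private final List<Map<String, String>> rows;
   private int current = -1;        // positioned before the first row
   ListScan(List<Map<String, String>> rows) { this.rows = rows; }
   public boolean next() { return ++current < rows.size(); }
   public String getString(String field) { return rows.get(current).get(field); }
}
```

The same next()/get pattern appears on the client side as a JDBC ResultSet in phase ® of section 3.2.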

• The relative order of these components may vary according to architecture:

– SimpleDB handles concurrency in the Buffer level, so its Transaction manager


is located just above it.
– Other databases handle it in the Record level instead, and so their Transaction
managers are above it instead.

• However, we will go upwards in Figure 31, so that each component

– uses services provided by the components below it, and
– provides services to the components above it.

4.1 File Management


(Sciore, 2008, Chapter 12)

• This lowest level of an RDBMS is the component which handles interaction with
the underlying disk drive(s).

• The RDBMS can do this with

raw disk(s) so that the database resides on dedicated drives (or partitions) with
nothing else.
+ This is as fast as possible, but. . .
− such disks need dedicated special support from the DBA.
This is used only for very high performance requirements.
OS file(s) so that the database is in normal files in normal file systems.
+ They need only the same support as file systems in general, but. . .
− the OS layer overhead impairs performance.
This is currently the most common choice.

• This OS file choice can be divided further into

single file architecture, where the whole database is stored in a single (possibly
very) big file, like for instance the .mdb files of Microsoft Access.

multifile architecture, where each database is in a separate subdirectory containing
separate files for its tables, indexes,. . . like for instance Oracle and SimpleDB
do.

• The RDBMS treats its files internally like raw disks:

– It consults the OS only for opening and closing its files, and extending them
with more blocks, but. . .
– manages these blocks, their buffering, and their allocation by itself.

The reason is not only better performance but even more importantly ensuring
durability:
The RDBMS must know precisely which of its data is

already stored on disk, and which is


still only in RAM, and vanishes if the computer crashes.

Disks are persistent storage.

• In order to guarantee durability, the RDBMS needs some memory whose contents
do not disappear when the computer crashes.

• A disk drive provides such persistent storage.

• A disk drive consists of sectors, which the OS groups further into blocks.

• Big databases require big disks.

− Big disks are more expensive than small disks.


+ It is possible to connect many small disks into one unit, which looks like a big
disk to the OS, because the controller of the unit takes care of spreading the
stored data among these disks.

• Disk striping builds such a big disk out of many smaller disks. For performance
reasons, it spreads the sectors of the big disk evenly across the sectors of the smaller
disks, as in Figure 32.
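The striping map of Figure 32 is plain modular arithmetic. The two methods below are an illustrative sketch, not SimpleDB or controller code:

```java
// Striping over n small disks: logical sector s of the "big" disk is
// physical sector s / n of small disk s % n, so consecutive logical
// sectors land on different disks and can be accessed in parallel.
public class Striping {
   static int diskOf(int s, int n)   { return s % n; }
   static int sectorOf(int s, int n) { return s / n; }
}
```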

• The RDBMS relies on the disk drive to function properly.

• However, the DBA must also be prepared for disk failures.

• One countermeasure is to make regular backups (on tape). . .

• Another countermeasure is to use a Redundant Array of Inexpensive Disks (RAID)


as the drive.

– A RAID unit adds extra error-correcting information into a striped disk unit.
– If one of the smaller disks breaks, the RAID unit can inform the DBA about
which of them broke.
– The DBA can then change the broken disk and reconstruct its contents from
the other disks and this extra information.

Figure 32: Two-disk striping. (Sciore, 2008)

– The only problem is if another disk breaks during this reconstruction. . . but
this is unlikely.
– Moreover, adding more error-correcting information makes it possible to recon-
struct more than one disk at a time.

• There are now 7 levels of RAID, depending on what extra information the unit holds
and where.

• The simplest is RAID-0, which is plain striping without any extra error-correcting
information. Therefore it does not offer any protection against failures.

• The next level is RAID-1, where the extra error-correcting information is a mirror
of the data disk into another identical disk, as in Figure 33.

• The DBA can reconstruct the contents of the data disk simply by copying this
mirror disk into the replacement disk.

• Another kind of error-correcting extra data is parity:

– The RAID unit consists of N + 1 small disks.


– N of these disks hold the data.
– The extra (N + 1)st disk holds the parity blocks of the data blocks.
∗ That is, sector s of this extra (N + 1)st disk holds the exclusive-or of
sectors s of the N data disks.
∗ In other words, bit b of sector s on this extra (N + 1)st disk is 1 if and
only if an odd number of bits b of the sectors s on the N data disks are 1,
otherwise 0.
– If any one of these N data disks breaks, the DBA can reconstruct its contents
from the other (N − 1) still functioning data disks and this extra (N + 1)st
disk.

Figure 33: Mirroring. (Sciore, 2008)

– This is more compact than mirroring, because there is only one extra block
per N data blocks, whereas mirroring had one extra block per data block.
– In fact, mirroring could be viewed as parity with N = 1.

This parity idea is RAID-4. It is in Figure 34.
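The parity computation and reconstruction described above are plain exclusive-ors over the sectors. The following is a sketch of the idea in Java (treating each “disk” as one sector of bytes), not real device or SimpleDB code:

```java
// RAID-4 parity sketch: dataSectors[i] is the sector of data disk i.
public class ParityDemo {
   // The parity sector is the exclusive-or of the N data sectors.
   static byte[] parity(byte[][] dataSectors) {
      byte[] p = new byte[dataSectors[0].length];
      for (byte[] sector : dataSectors)
         for (int b = 0; b < p.length; b++)
            p[b] ^= sector[b];
      return p;
   }

   // Reconstruct broken disk d from the parity sector and the sectors
   // of the other still functioning data disks (data[d] is not read).
   static byte[] reconstruct(byte[][] dataSectors, byte[] paritySector, int d) {
      byte[] r = paritySector.clone();
      for (int i = 0; i < dataSectors.length; i++)
         if (i != d)
            for (int b = 0; b < r.length; b++)
               r[b] ^= dataSectors[i][b];
      return r;
   }
}
```

This works because XORing a value twice cancels it out: parity XOR (all sectors except d) leaves exactly sector d.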

• However, the dedicated extra (N + 1)st parity disk becomes a bottleneck for the
whole RAID unit, because whenever a data disk sector changes, the corresponding
sector of the parity disk must be updated too.

• RAID-5 solves this bottleneck by distributing these parity sectors evenly among the
data sectors.

– Every (N + 1)st sector of a small disk is a parity sector, its other sectors are
data.
– A parity sector s on a small disk d contains the parity of the corresponding
sectors s of the other small disks

1, 2, 3, . . . , d − 1, d + 1, d + 2, d + 3, . . . , N + 1

other than d itself.
– Then the extra work of updating parity sectors is divided evenly among all the
other disks, and so no one disk is a bottleneck any longer.
– The DBA can still reconstruct the contents of any one broken disk from the
other still functioning N disks.

• The two most common levels are RAID-1 and RAID-5.

– RAID-2 used bit instead of sector striping and an error-correcting code instead
of parity, but it was hard to implement and performed poorly, and so is no
longer used.

Figure 34: Parity. (Sciore, 2008)

– RAID-3 is like RAID-4 but with the less efficient byte instead of sector striping.
– RAID-6 is like RAID-5 but with two kinds of parity information, so it tolerates
two disk failures at the same time.

• For instance, the current cs.uef.fi server has:


– one fast RAID-1 unit for the OS and temporary files, and
– two RAID-5 units with N = 4 for user files, and
– two hot-swap drives, which allow the IT support to reconstruct a broken disk
“on the fly” without having to shut down the server.
• The IT support (including the DBAs) recommends which RAID to buy based on
the required levels of
protection against downtime and loss of work caused by disk failures – in theory,
by determining a low enough expected value

disk failure probability · cost of disk failure

of the cost involved – and


performance requirements for the system – based on
statistics collected about its current use, and
estimates about its future use.

Disks are slow.


• Disk storage is much slower than RAM: About
100 000 times slower for mechanical disk drives, but “only” about
1 000 times slower for flash drives.
Requirement 10 (little I/O). The RDBMS must strive to avoid unnecessary disk I/O
whenever possible.

• This is one reason why the RDBMS executes queries concurrently:
If one query running in one thread must stop and wait for disk I/O, other queries
running in other threads which already have the data they need in RAM may
continue.

Disks are block devices.


• Disk storage is different from RAM also in that its addressing operates in much
larger units than single bytes.

• Each disk drive / file system / OS has its own block size constant, so that block
k = 0, 1, 2, . . . of a file consists of the bytes from

k · block size to (k + 1) · block size − 1

within that file, and reading/writing the value of any byte within that area copies
the whole block between disk and RAM.
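The byte-to-block arithmetic above can be sketched directly in Java, here assuming a 4-kilobyte block size (the typical value; the class and method names are illustrative):

```java
// Block k of a file covers bytes k*BLOCK_SIZE .. (k+1)*BLOCK_SIZE - 1.
public class BlockMath {
   static final int BLOCK_SIZE = 4096;
   static int blockOf(long byteOffset) { return (int) (byteOffset / BLOCK_SIZE); }
   static long firstByteOf(int k)      { return (long) k * BLOCK_SIZE; }
   static long lastByteOf(int k)       { return (long) (k + 1) * BLOCK_SIZE - 1; }
}
```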

• This constant is usually between 512 bytes and 16 kilobytes; 4 kilobytes is a
typical value.

• One way the RDBMS can meet requirement 10 is to ensure that if a block must
be read from the disk, then the information in it is used as fully as possible.

• On the one hand, the application programmer does not have to be aware of this
buffering because the OS handles it.
But (s)he may want to be, for performance reasons.

• On the other hand, the RDBMS wants to be aware of it, and bypasses this OS
buffering altogether with its own Buffer manager, for both performance and
durability.

• That Buffer manager will use the services offered by this File manager for the actual
disk I/O operations.

SimpleDB source file simpledb/file/Block.java


• A Block object represents a logical block number: a block number k within a particular
OS file.

• The OS converts it internally into a physical block number, which identifies the
particular sectors of the disk drive holding that block.
package simpledb.file;

/**
 * A reference to a disk block.
 * A Block object consists of a filename and a block number.
 * It does not hold the contents of the block;
 * instead, that is the job of a {@link Page} object.
 * @author Edward Sciore
 */
public class Block {
   private String filename;
   private int blknum;

   /**
    * Constructs a block reference
    * for the specified filename and block number.
    * @param filename the name of the file
    * @param blknum the block number
    */
   public Block(String filename, int blknum) {
      this.filename = filename;
      this.blknum = blknum;
   }

   /**
    * Returns the name of the file where the block lives.
    * @return the filename
    */
   public String fileName() {
      return filename;
   }

   /**
    * Returns the location of the block within the file.
    * @return the block number
    */
   public int number() {
      return blknum;
   }

   public boolean equals(Object obj) {
      Block blk = (Block) obj;
      return filename.equals(blk.filename) && blknum == blk.blknum;
   }

   public String toString() {
      return "[file " + filename + ", block " + blknum + "]";
   }

   public int hashCode() {
      return toString().hashCode();
   }
}

SimpleDB source file simpledb/file/Page.java


• A Page object is a Block -sized chunk of memory.

• It is implemented with the library class java.nio.ByteBuffer.

• This library class provides also a reading/writing position within the chunk.

• Moreover, a Page object allocates the chunk directly (with ByteBuffer.allocateDirect):

– This means that Java uses one of its OS I/O buffers as the chunk.
– This is a good idea in an RDBMS (but not in most other programming
situations!) because it will manage its own Buffers.
– In this way, the RDBMS can “recycle” the same memory which the OS would
have used for the same purpose.

• All these methods (like many others) are synchronized (Sestoft, 2005, Chap-
ter 16.2):

– That is, only one thread can execute the methods of a Page object at the same
time.
– Because the RDBMS process handles each connection with a client in its own
thread, this ensures that two clients cannot manipulate the same Page at the
same time – one must wait until the other is finished instead.
– This is important for the get. . . and set. . . methods, which
¬ first move the position where they want it to be, and
­ then read or write the data starting at that position.
package simpledb.file;

import simpledb.server.SimpleDB;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/**
 * The contents of a disk block in memory.
 * A page is treated as an array of BLOCK_SIZE bytes.
 * There are methods to get/set values into this array,
 * and to read/write the contents of this array to a disk block.
 *
 * For an example of how to use Page and
 * {@link Block} objects,
 * consider the following code fragment.
 * The first portion increments the integer at offset 792 of block 6 of file junk.
 * The second portion stores the string "hello" at offset 20 of a page,
 * and then appends it to a new block of the file.
 * It then reads that block into another page
 * and extracts the value "hello" into variable s.
 * <pre>
 * Page p1 = new Page();
 * Block blk = new Block("junk", 6);
 * p1.read(blk);
 * int n = p1.getInt(792);
 * p1.setInt(792, n+1);
 * p1.write(blk);
 *
 * Page p2 = new Page();
 * p2.setString(20, "hello");
 * blk = p2.append("junk");
 * Page p3 = new Page();
 * p3.read(blk);
 * String s = p3.getString(20);
 * </pre>
 * @author Edward Sciore
 */
public class Page {
   /**
    * The number of bytes in a block.
    * This value is set unreasonably low, so that it is easier
    * to create and test databases having a lot of blocks.
    * A more realistic value would be 4K.
    */
   public static final int BLOCK_SIZE = 400;

   /**
    * The size of an integer in bytes.
    * This value is almost certainly 4, but it is
    * a good idea to encode this value as a constant.
    */
   public static final int INT_SIZE = Integer.SIZE / Byte.SIZE;

   /**
    * The maximum size, in bytes, of a string of length n.
    * A string is represented as the encoding of its characters,
    * preceded by an integer denoting the number of bytes in this encoding.
    * If the JVM uses the US-ASCII encoding, then each char
    * is stored in one byte, so a string of n characters
    * has a size of 4+n bytes.
    * @param n the size of the string
    * @return the maximum number of bytes required to store a string of size n
    */
   public static final int STR_SIZE(int n) {
      float bytesPerChar = Charset.defaultCharset().newEncoder().maxBytesPerChar();
      return INT_SIZE + (n * (int) bytesPerChar);
   }

   private ByteBuffer contents = ByteBuffer.allocateDirect(BLOCK_SIZE);
   private FileMgr filemgr = SimpleDB.fileMgr();

   /**
    * Creates a new page. Although the constructor takes no arguments,
    * it depends on a {@link FileMgr} object that it gets from the
    * method {@link simpledb.server.SimpleDB#fileMgr()}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until either
    * {@link simpledb.server.SimpleDB#init(String)} or
    * {@link simpledb.server.SimpleDB#initFileMgr(String)} or
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * {@link simpledb.server.SimpleDB#initFileLogAndBufferMgr(String)}
    * is called first.
    */
   public Page() {}

   /**
    * Populates the page with the contents of the specified disk block.
    * @param blk a reference to a disk block
    */
   public synchronized void read(Block blk) {
      filemgr.read(blk, contents);
   }

   /**
    * Writes the contents of the page to the specified disk block.
    * @param blk a reference to a disk block
    */
   public synchronized void write(Block blk) {
      filemgr.write(blk, contents);
   }

   /**
    * Appends the contents of the page to the specified file.
    * @param filename the name of the file
    * @return the reference to the newly-created disk block
    */
   public synchronized Block append(String filename) {
      return filemgr.append(filename, contents);
   }

   /**
    * Returns the integer value at a specified offset of the page.
    * If an integer was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset within the page
    * @return the integer value at that offset
    */
   public synchronized int getInt(int offset) {
      contents.position(offset);
      return contents.getInt();
   }

   /**
    * Writes an integer to the specified offset on the page.
    * @param offset the byte offset within the page
    * @param val the integer to be written to the page
    */
   public synchronized void setInt(int offset, int val) {
      contents.position(offset);
      contents.putInt(val);
   }

   /**
    * Returns the string value at the specified offset of the page.
    * If a string was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset within the page
    * @return the string value at that offset
    */
   public synchronized String getString(int offset) {
      contents.position(offset);
      int len = contents.getInt();
      byte[] byteval = new byte[len];
      contents.get(byteval);
      return new String(byteval);
   }

   /**
    * Writes a string to the specified offset on the page.
    * @param offset the byte offset within the page
    * @param val the string to be written to the page
    */
   public synchronized void setString(int offset, String val) {
      contents.position(offset);
      byte[] byteval = val.getBytes();
      contents.putInt(byteval.length);
      contents.put(byteval);
   }
}
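The string layout used by getString/setString (a 4-byte length followed by the encoded bytes) can be tried out standalone with a plain ByteBuffer. The class below is a self-contained sketch of just that layout, not the Page class itself, which needs a running FileMgr:

```java
import java.nio.ByteBuffer;

// Strings on the page are stored as [4-byte length][string bytes],
// so getString first reads the length, then that many bytes.
public class MiniPage {
   private final ByteBuffer contents = ByteBuffer.allocate(400);

   public void setString(int offset, String val) {
      contents.position(offset);
      byte[] byteval = val.getBytes();
      contents.putInt(byteval.length);   // length prefix
      contents.put(byteval);             // then the encoded characters
   }

   public String getString(int offset) {
      contents.position(offset);
      byte[] byteval = new byte[contents.getInt()];
      contents.get(byteval);
      return new String(byteval);
   }
}
```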

SimpleDB source file simpledb/file/FileMgr.java

• The SimpleDB process has just one global File Manager object. It handles all disk
I/O operations

read the contents of Block from disk into a ByteBuffer – for instance, into a Page
object
write a ByteBuffer into an already existing disk Block
append a new Block into the end of a file
get size of a file as the number of disk blocks in it

for the other components.

• It also opens all requested files and keeps them in openFiles to avoid reopening
them.

• These files are opened in binary random access mode, which is

read and

write and

synchronous so that when a write is executed without errors, then the operation
has really modified this block of this file on disk – this is where the RDBMS
takes over Buffering from the OS.

package simpledb.file;

import static simpledb.file.Page.BLOCK_SIZE;

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.*;

/**
 * The SimpleDB file manager.
 * The database system stores its data as files within a specified directory.
 * The file manager provides methods for reading the contents of
 * a file block to a Java byte buffer,
 * writing the contents of a byte buffer to a file block,
 * and appending the contents of a byte buffer to the end of a file.
 * These methods are called exclusively by the class {@link simpledb.file.Page Page},
 * and are thus package-private.
 * The class also contains two public methods:
 * Method {@link #isNew() isNew} is called during system initialization by
 * {@link simpledb.server.SimpleDB#init}.
 * Method {@link #size(String) size} is called by the log manager and transaction manager to
 * determine the end of the file.
 * @author Edward Sciore
 */
public class FileMgr {
   private File dbDirectory;
   private boolean isNew;
   private Map<String,FileChannel> openFiles = new HashMap<String,FileChannel>();

   /**
    * Creates a file manager for the specified database.
    * The database will be stored in a folder of that name
    * in the user's home directory.
    * If the folder does not exist, then a folder containing
    * an empty database is created automatically.
    * Files for all temporary tables (i.e. tables beginning with "temp") are deleted.
    * @param dbname the name of the directory that holds the database
    */
   public FileMgr(String dbname) {
      String homedir = System.getProperty("user.home");
      dbDirectory = new File(homedir, dbname);
      isNew = !dbDirectory.exists();

      // create the directory if the database is new
      if (isNew && !dbDirectory.mkdir())
         throw new RuntimeException("cannot create " + dbname);

      // remove any leftover temporary tables
      for (String filename : dbDirectory.list())
         if (filename.startsWith("temp"))
            new File(dbDirectory, filename).delete();
   }

   /**
    * Reads the contents of a disk block into a bytebuffer.
    * @param blk a reference to a disk block
    * @param bb  the bytebuffer
    */
   synchronized void read(Block blk, ByteBuffer bb) {
      try {
         bb.clear();
         FileChannel fc = getFile(blk.fileName());
         fc.read(bb, blk.number() * BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot read block " + blk);
      }
   }

   /**
    * Writes the contents of a bytebuffer into a disk block.
    * @param blk a reference to a disk block
    * @param bb  the bytebuffer
    */
   synchronized void write(Block blk, ByteBuffer bb) {
      try {
         bb.rewind();
         FileChannel fc = getFile(blk.fileName());
         fc.write(bb, blk.number() * BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot write block " + blk);
      }
   }

   /**
    * Appends the contents of a bytebuffer to the end
    * of the specified file.
    * @param filename the name of the file
    * @param bb  the bytebuffer
    * @return a reference to the newly-created block.
    */
   synchronized Block append(String filename, ByteBuffer bb) {
      int newblknum = size(filename);
      Block blk = new Block(filename, newblknum);
      write(blk, bb);
      return blk;
   }

   /**
    * Returns the number of blocks in the specified file.
    * @param filename the name of the file
    * @return the number of blocks in the file
    */
   public synchronized int size(String filename) {
      try {
         FileChannel fc = getFile(filename);
         return (int) (fc.size() / BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot access " + filename);
      }
   }

   /**
    * Returns a boolean indicating whether the file manager
    * had to create a new database directory.
    * @return true if the database is new
    */
   public boolean isNew() {
      return isNew;
   }

   /**
    * Returns the file channel for the specified filename.
    * The file channel is stored in a map keyed on the filename.
    * If the file is not open, then it is opened and the file channel
    * is added to the map.
    * @param filename the specified filename
    * @return the file channel associated with the open file.
    * @throws IOException
    */
   private FileChannel getFile(String filename) throws IOException {
      FileChannel fc = openFiles.get(filename);
      if (fc == null) {
         File dbTable = new File(dbDirectory, filename);
         RandomAccessFile f = new RandomAccessFile(dbTable, "rws");
         fc = f.getChannel();
         openFiles.put(filename, fc);
      }
      return fc;
   }
}

4.2 Log Management


(Sciore, 2008, Chapters 13.1–13.3)
• The RDBMS has two kinds of files:

Data files (and their supporting files like indexes, metadata,. . . ) – the RDBMS
has only partial control over their access patterns, because they depend on the
users’ queries too
Log file – which the RDBMS controls fully. It is. . .
– an extremely important special file, because it is the central concept to
implement database recovery after a crash!
– a “diary” (or “ship’s log” or “journal”) of all the operations which the
RDBMS has performed recently.

• You have (most likely. . . ) already encountered these log files implicitly in your daily
work:

– For instance, when Microsoft Word crashes, and is restarted, then it may ask
“Do you want to recover your file?”
– It can do this, because it has kept a log of all operations since the last “Save”
operation, and so it can redo them.

• Because this Log file is so important, and the RDBMS processes it differently from
its other files, it has its own manager.

• The Log file consists of log records.

– Each log record is identified with a Log Sequence Number (LSN).


– Each RDBMS operation generates its own kind of log record.

Figure 35: The SimpleDB log management algorithm. (Sciore, 2008)

– These log records are written at the end of the log in the order in which the
RDBMS executes their operations – that is, “forward in time”.
– However, recovery needs to read the Log file not only forward but also backwards
in time – also from the most recently written log record at the end towards the
older log records at the beginning.
– Hence the Log file is a linked list of log records, where each record contains
also a backwards pointer to the previous log record.

• The RDBMS allocates a specific Page which represents the last block of the Log file
(step 1 in Figure 35).

– All the previous blocks of the Log file have already been written onto the disk.
– This last block may or may not have been written onto the disk yet.

• The log grows with the operations

append a new log record at the end of the Log file (step 2 in Figure 35) and give it
an LSN
flush a given LSN (step 3 in Figure 35) – that is, make sure that it is really written
onto the disk, and not just on the last log Page in RAM

which write the last log Page onto the disk if necessary.

• Since only the last log Page is still in RAM, flushing an LSN implies flushing all
the log records before it as well.

• The algorithm in Figure 35 is optimal in the sense that it writes the last log Page
onto the disk only when

append finds that it is already full, or


flush commands it to write an LSN in it.

• However, the algorithm in Figure 35 may write the same last log Page many times:

– once for each flush in it, and


– once more for the last append which finds it full.

• The algorithm in Figure 35 can be further improved to write each last log Page
(almost) just once with some concurrent programming:

– When a thread flushes an LSN in the last log Page, then it goes to sleep
waiting for some other thread to write the Page. It is namely enough to have
the LSN on disk when this flushing thread continues.
– When another thread tries to append a new log record into the Log file but
finds its last log Page full, it writes the Page onto the disk and wakes up all
the other threads which have gone to sleep waiting for it to be written.
– However, there is a problem:
∗ What if all threads go to sleep waiting for some other thread to write the
last log Page onto the disk?
∗ A general solution to such problems is to have a separate thread which
the RDBMS executes only if it has nothing else to do. This thread then
performs such “housekeeping” tasks as saving the log Page if no other
thread has done it.
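The sleeping scheme above can be sketched with a small monitor class (hypothetical names, not SimpleDB code): flushers sleep on the shared lock until some other thread – an appender or the housekeeping thread – has written the last log page and woken them up.

```java
// Hedged sketch of the improved scheme: flushers wait, the writer notifies.
class LazyLogWriter {
    private boolean pageWritten = false;   // "my LSN's page is on disk"

    // A flushing thread sleeps until another thread writes the page.
    synchronized void flushWaiting() throws InterruptedException {
        while (!pageWritten)
            wait();
    }

    // An appender (or the housekeeping thread) writes the page once,
    // then wakes every sleeping flusher.
    synchronized void writePageAndNotify() {
        // ... the actual disk write of the last log page would happen here ...
        pageWritten = true;
        notifyAll();
    }
}
```

The while-loop around wait() is the standard Java idiom: a woken thread re-checks the condition before continuing.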

• Figure 36 shows how SimpleDB handles the last log Page.

– Each SimpleDB Log record is small enough to fit into a Page, so that a record
does not have to be split over a Page boundary.
– The 4 bytes after a log record give the end of the previous record on this Page
(and hence where the 4 bytes after it are), except that. . .
– the first 4 bytes on a Page give instead where the last 4 bytes on this Page
are, because appending a new log record needs to know this.
– Moving backwards across a Page boundary means in turn reading the previous
disk block of the Log file.
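The pointer chain of Figure 36 can be illustrated with a small model that treats the Page as an int array of 4-byte slots (an illustrative stand-in, not SimpleDB code): slot 0 plays the role of the LAST_POS header, each record is followed by a trailer slot pointing at the previous record's trailer, and payloads are elided.

```java
// Builds the trailer chain for records with the given payload sizes
// (measured in 4-byte slots) and returns the resulting "page".
class LogPageModel {
    static int[] build(int[] payloads, int pageSlots) {
        int[] page = new int[pageSlots];
        int current = 1;                  // slot 0 is the LAST_POS header
        for (int payload : payloads) {
            current += payload;           // skip over the record's values
            page[current] = page[0];      // trailer -> previous trailer
            page[0] = current;            // header -> this (newest) trailer
            current += 1;                 // move past the trailer
        }
        return page;
    }
}
```

Walking backwards is then: read the header, jump to the newest trailer, and keep following trailer pointers until 0, the start of the page.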

SimpleDB source file simpledb/log/LogMgr.java

• This is the SimpleDB Log Manager.

• The running RDBMS has always exactly one active Log file which grows with
appending new log records. This Manager handles its growth.

• SimpleDB implements the LSN of a log record simply as its disk block number in
the active Log file.

• SimpleDB hides all the details of reading the Log file backwards behind a log record
iterator:

LogMgr defines that BasicLogRecords are iterable, where. . .


LogIterator defines their actual iterator, which walks the Log file backwards,
whereas. . .

Figure 36: SimpleDB last log file page and records. (Sciore, 2008)

BasicLogRecord defines (only) the core functionality of a log record.
– This core functionality consists of methods for reading the next field of a
given type of the current log record.
– It does not know what kinds of log records the RDBMS has.
– Instead, the recovery part of the Transaction manager will define the var-
ious log records it will need. It will use this core functionality for imple-
menting them.
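The Iterable/Iterator split that these three classes follow can be seen in miniature on a plain array (a stand-alone illustration, not SimpleDB code): the "manager" is Iterable, and its iterator hands elements back newest-first, just as LogIterator walks the Log file backwards.

```java
// A toy reverse-order iterable, mirroring the LogMgr/LogIterator pattern.
class ReverseLog implements Iterable<String> {
    private final String[] records;   // oldest first, like the log on disk

    ReverseLog(String... records) { this.records = records; }

    public java.util.Iterator<String> iterator() {
        return new java.util.Iterator<String>() {
            private int pos = records.length;        // just past the newest
            public boolean hasNext() { return pos > 0; }
            public String next() { return records[--pos]; }  // newest first
        };
    }
}
```

Client code can then use a plain for-each loop, just as recovery code does with a LogMgr.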

package simpledb.log;

import simpledb.server.SimpleDB;
import simpledb.file.*;
import static simpledb.file.Page.*;
import java.util.*;

/**
 * The low-level log manager.
 * This log manager is responsible for writing log records
 * into a log file.
 * A log record can be any sequence of integer and string values.
 * The log manager does not understand the meaning of these
 * values, which are written and read by the
 * {@link simpledb.tx.recovery.RecoveryMgr recovery manager}.
 * @author Edward Sciore
 */
public class LogMgr implements Iterable<BasicLogRecord> {
   /**
    * The location where the pointer to the last integer in the page is.
    * A value of 0 means that the pointer is the first value in the page.
    */
   public static final int LAST_POS = 0;

   private String logfile;
   private Page mypage = new Page();
   private Block currentblk;
   private int currentpos;

   /**
    * Creates the manager for the specified log file.
    * If the log file does not yet exist, it is created
    * with an empty first block.
    * This constructor depends on a {@link FileMgr} object
    * that it gets from the method
    * {@link simpledb.server.SimpleDB#fileMgr()}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileMgr(String)}
    * is called first.
    * @param logfile the name of the log file
    */
   public LogMgr(String logfile) {
      this.logfile = logfile;
      int logsize = SimpleDB.fileMgr().size(logfile);
      if (logsize == 0)
         appendNewBlock();
      else {
         currentblk = new Block(logfile, logsize-1);
         mypage.read(currentblk);
         currentpos = getLastRecordPosition() + INT_SIZE;
      }
   }

   /**
    * Ensures that the log records corresponding to the
    * specified LSN has been written to disk.
    * All earlier log records will also be written to disk.
    * @param lsn the LSN of a log record
    */
   public void flush(int lsn) {
      if (lsn >= currentLSN())
         flush();
   }

   /**
    * Returns an iterator for the log records,
    * which will be returned in reverse order starting with the most recent.
    * @see java.lang.Iterable#iterator()
    */
   public synchronized Iterator<BasicLogRecord> iterator() {
      flush();
      return new LogIterator(currentblk);
   }

   /**
    * Appends a log record to the file.
    * The record contains an arbitrary array of strings and integers.
    * The method also writes an integer to the end of each log record whose value
    * is the offset of the corresponding integer for the previous log record.
    * These integers allow log records to be read in reverse order.
    * @param rec the list of values
    * @return the LSN of the final value
    */
   public synchronized int append(Object[] rec) {
      int recsize = INT_SIZE;  // 4 bytes for the integer that points to the previous log record
      for (Object obj : rec)
         recsize += size(obj);
      if (currentpos + recsize >= BLOCK_SIZE) { // the log record doesn't fit,
         flush();                               // so move to the next block.
         appendNewBlock();
      }
      for (Object obj : rec)
         appendVal(obj);
      finalizeRecord();
      return currentLSN();
   }

   /**
    * Adds the specified value to the page at the position denoted by
    * currentpos.  Then increments currentpos by the size of the value.
    * @param val the integer or string to be added to the page
    */
   private void appendVal(Object val) {
      if (val instanceof String)
         mypage.setString(currentpos, (String)val);
      else
         mypage.setInt(currentpos, (Integer)val);
      currentpos += size(val);
   }

   /**
    * Calculates the size of the specified integer or string.
    * @param val the value
    * @return the size of the value, in bytes
    */
   private int size(Object val) {
      if (val instanceof String) {
         String sval = (String) val;
         return STR_SIZE(sval.length());
      }
      else
         return INT_SIZE;
   }

   /**
    * Returns the LSN of the most recent log record.
    * As implemented, the LSN is the block number where the record is stored.
    * Thus every log record in a block has the same LSN.
    * @return the LSN of the most recent log record
    */
   private int currentLSN() {
      return currentblk.number();
   }

   /**
    * Writes the current page to the log file.
    */
   private void flush() {
      mypage.write(currentblk);
   }

   /**
    * Clear the current page, and append it to the log file.
    */
   private void appendNewBlock() {
      setLastRecordPosition(0);
      currentpos = INT_SIZE;
      currentblk = mypage.append(logfile);
   }

   /**
    * Sets up a circular chain of pointers to the records in the page.
    * There is an integer added to the end of each log record
    * whose value is the offset of the previous log record.
    * The first four bytes of the page contain an integer whose value
    * is the offset of the integer for the last log record in the page.
    */
   private void finalizeRecord() {
      mypage.setInt(currentpos, getLastRecordPosition());
      setLastRecordPosition(currentpos);
      currentpos += INT_SIZE;
   }

   private int getLastRecordPosition() {
      return mypage.getInt(LAST_POS);
   }

   private void setLastRecordPosition(int pos) {
      mypage.setInt(LAST_POS, pos);
   }
}

SimpleDB source file simpledb/log/LogIterator.java


package simpledb.log;

import static simpledb.file.Page.INT_SIZE;
import simpledb.file.*;
import java.util.Iterator;

/**
 * A class that provides the ability to move through the
 * records of the log file in reverse order.
 * @author Edward Sciore
 */
class LogIterator implements Iterator<BasicLogRecord> {
   private Block blk;
   private Page pg = new Page();
   private int currentrec;

   /**
    * Creates an iterator for the records in the log file,
    * positioned after the last log record.
    * This constructor is called exclusively by
    * {@link LogMgr#iterator()}.
    */
   LogIterator(Block blk) {
      this.blk = blk;
      pg.read(blk);
      currentrec = pg.getInt(LogMgr.LAST_POS);
   }

   /**
    * Determines if the current log record
    * is the earliest record in the log file.
    * @return true if there is an earlier record
    */
   public boolean hasNext() {
      return currentrec > 0 || blk.number() > 0;
   }

   /**
    * Moves to the next log record in reverse order.
    * If the current log record is the earliest in its block,
    * then the method moves to the next oldest block,
    * and returns the log record from there.
    * @return the next earliest log record
    */
   public BasicLogRecord next() {
      if (currentrec == 0)
         moveToNextBlock();
      currentrec = pg.getInt(currentrec);
      return new BasicLogRecord(pg, currentrec + INT_SIZE);
   }

   public void remove() {
      throw new UnsupportedOperationException();
   }

   /**
    * Moves to the next log block in reverse order,
    * and positions it after the last record in that block.
    */
   private void moveToNextBlock() {
      blk = new Block(blk.fileName(), blk.number() - 1);
      pg.read(blk);
      currentrec = pg.getInt(LogMgr.LAST_POS);
   }
}

SimpleDB source file simpledb/log/BasicLogRecord.java


package simpledb.log;

import static simpledb.file.Page.*;
import simpledb.file.Page;

/**
 * A class that provides the ability to read the values of
 * a log record.
 * The class has no idea what values are there.
 * Instead, the methods {@link #nextInt() nextInt}
 * and {@link #nextString() nextString} read the values
 * sequentially.
 * Thus the client is responsible for knowing how many values
 * are in the log record, and what their types are.
 * @author Edward Sciore
 */
public class BasicLogRecord {
   private Page pg;
   private int pos;

   /**
    * A log record located at the specified position of the specified page.
    * This constructor is called exclusively by
    * {@link LogIterator#next()}.
    * @param pg the page containing the log record
    * @param pos the position of the log record
    */
   public BasicLogRecord(Page pg, int pos) {
      this.pg = pg;
      this.pos = pos;
   }

   /**
    * Returns the next value of the current log record,
    * assuming it is an integer.
    * @return the next value of the current log record
    */
   public int nextInt() {
      int result = pg.getInt(pos);
      pos += INT_SIZE;
      return result;
   }

   /**
    * Returns the next value of the current log record,
    * assuming it is a string.
    * @return the next value of the current log record
    */
   public String nextString() {
      String result = pg.getString(pos);
      pos += STR_SIZE(result.length());
      return result;
   }
}

4.3 Buffer Management


(Sciore, 2008, Chapters 13.4–13.8)

• The Buffer Manager is the component responsible for the Pages that hold user data
– that is, for the disk Block s that hold user data which have been read into RAM
to be processed.

• A Buffer is a combination of a

Page in RAM and a


Block on disk

such that this Page holds the current contents of this Block .

• This Manager allocates and manages a large fixed pool of these Buffer s. Initially
they have only their Pages but not yet any Block s.

• This pool reserves much of the RAM of the server computer running the RDBMS
process. This RAM is well spent, because it is the central tool for improving disk
I/O in the RDBMS.

• When a client t wants to access some disk Block d, it behaves as in Figure 37 to


bring it into a Buffer in the pool.

• This Buffer Manager allows the same Buffer to be pinned and accessed by many
clients at the same time.

– It just counts how many pins each Buffer has now – that is, how many clients
are accessing it now.
– If none, then this Buffer is said to be unpinned. This Manager recycles un-
pinned Buffer s.

• The Concurrency part of the Transaction Manager will be responsible for coordi-
nating their concurrent accesses.

• This Buffer Manager grants this request as follows:

(1) If the pool already contains a Buffer b for the requested disk Block d, then the
requesting client t can just add another pin into b. This ensures that a disk
Block has at most one Buffer.
(2) If the RDBMS server process has been started only recently, then some Buffers
may still have no disk Blocks yet. This case is almost as easy:
(a) Take some such Buffer b and
(b) read the contents of Block d from the disk into b.Page and let client t pin
this Buffer b.
(3) If some Buffers in the pool are currently unpinned, then this Manager may
have to write before it can read:
(a) Select some such unpinned Buffer b.
(b) If the contents of b.Page are now different from b.Block – b is dirty – then
write these current contents from b.Page back into b.Block before this
Manager recycles b for d.
(c) Continue as in step (b) of the preceding case (2).
(4) If all the Buffers in the pool are currently pinned, then this thread t must
sleep waiting for a Buffer to become unpinned before it can continue as in the
previous case (3).

Figure 37: Pinning and unpinning. (Sciore, 2008)

• The RDBMS will write a disk Block only if

– step (b) of case (3) does, or


– the Transaction manager does to ensure database recoverability.
This includes shutting down the RDBMS server process, because restarting it
begins with recovery.

• Selecting an unpinned Buffer from the pool in step (a) of case (3) is similar to what
the OS does with physical vs. virtual memory.

• Some selection strategies are

Naïve:
– Since no thread is using an unpinned Buffer, it does not really matter
which one we select. . . does it?
– It does, because we prefer the fast case (1) without disk I/O to the slow
case (3) with disk I/O.
– But then the RDBMS must guess which Buffers in its pool are assigned to
disk Blocks which client threads might need in the future.
– This is not a good selection strategy. However, SimpleDB uses it, because
it is very simple to implement.
FIFO or First In First Out:
– One such guess is that the disk Block s which were read into the pool long
ago are no longer needed.

– That is, the Buffer s in the pool form a queue.
– The first unpinned Buffer from the front of this queue is selected. . .
– and added to the back of this queue when it is pinned to another disk
Block .
– This is a reasonable idea.
– However, it does not take into account that some disk Block s (like meta-
data) may be needed very often, and if such a Buffer happens to be un-
pinned even for a brief moment, then it will get selected.
LRU or Least Recently Used:
– Another guess which solves this problem with FIFO is to remember in each
Buffer the time when it became unpinned, and. . .
– select the Buffer with the earliest time.
– Here the reasoning is that if a Buffer has not been used for a long time,
then its disk Block will not be used soon in the future either.
Clock:
– Another idea is to use the unpinned Buffer s of the pool as evenly as pos-
sible.
– Suppose that the pool is an array bufferpool[0 . . . PoolSize −1] of Buffer s.
– This strategy remembers the latest index at which it found the previous
unpinned Buffer.
– When another unpinned Buffer is needed, this index moves forward in the
array with
latest = (latest + 1) mod PoolSize (10)
until one is found.
– The name comes from considering an analog clock whose
face is the bufferpool array
hours are 0, 1, 2, . . . , PoolSize − 1 instead of 1, 2, 3, . . . , 12
hand is the latest index.
– This strategy has some flavour of
FIFO since Equation (10) uses the bufferpool array as if it implemented
a queue
LRU since it skips over pinned Buffer s and reconsiders them only when
the latest index has gone a full circle around the whole bufferpool.

FIFO would select buffers 0 and 2 next in Figure 38, whereas

LRU would select 3 and 0 instead, and

Clock would select 2 and 3.
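As an illustration, the Clock strategy can be sketched as follows (hypothetical names; SimpleDB itself uses the naïve strategy): the boolean array models which pool slots are currently pinned, and latest is the clock hand of Equation (10).

```java
// A minimal sketch of the Clock buffer-selection strategy.
class ClockSelector {
    private final boolean[] pinned;   // pinned[i]: is bufferpool[i] pinned?
    private int latest = -1;          // index of the previously chosen buffer

    ClockSelector(boolean[] pinned) { this.pinned = pinned; }

    // Returns the index of the chosen unpinned buffer,
    // or -1 if every buffer is pinned after a full sweep of the pool.
    int choose() {
        int n = pinned.length;
        for (int i = 0; i < n; i++) {
            latest = (latest + 1) % n;   // Equation (10): advance the hand
            if (!pinned[latest])
                return latest;           // first unpinned buffer wins
        }
        return -1;
    }
}
```

Because the hand keeps its position between calls, pinned buffers are reconsidered only after the hand has gone a full circle, which is what gives Clock its LRU-like flavour.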

SimpleDB source file simpledb/buffer/Buffer.java

• Here is the definition of the Buffer objects.

• A client t can modify the Page of a Buffer object by calling its setInt or setString
method. These methods require the following two additional parameters, which the Buffer remembers:

Figure 38: Buffer pool example. (Sciore, 2008)

The Transaction which t is now running. This Buffer Manager namely offers a
method to flush all the Buffer s modified by a given Transaction t into the
disk.
The LSN of the last modification to this Buffer by Transaction t. To get this
LSN, Transaction t must have Logged its intention to modify this Buffer before
actually modifying it.

The Recovery part of the Transaction Manager will use these remembered param-
eters.

Requirement 11 (write-ahead logging). Whenever the RDBMS writes a modified Buffer
back to its disk Block, it must

(1) first flush the log records for these modifications, and

(2) then write its disk Block.

(Sciore, 2008, Chapter 14.3.5)

• Otherwise the following might happen:

(a) The RDBMS does step (2) first. This overwrites the original contents of the
disk Block.
(b) Then it tries to do step (1) but fails. Now the Log file does not have the original
contents of the disk Block either – and so recovery becomes impossible!

Hence the order in Requirement 11 is the correct choice, because it overwrites the
disk Block only after its original contents have been successfully flushed into the
Log.

• The LSN parameter of the setInt and setString methods ensures this order.
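Requirement 11 can be made concrete with a toy model (illustrative names only) that records the order of disk writes; the point is simply that a buffer flush pushes the log to disk before overwriting the data block, just as the flush() method in the Buffer listing below does.

```java
// A toy model of write-ahead logging: the events list makes the
// required ordering of the two disk writes observable.
class WalModel {
    java.util.List<String> events = new java.util.ArrayList<>();

    void flushLog()   { events.add("log"); }     // push log records to disk
    void writeBlock() { events.add("block"); }   // overwrite the data block

    // The only safe order: log first, data block second.
    void flushBuffer() {
        flushLog();
        writeBlock();
    }
}
```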

package simpledb.buffer;

import simpledb.server.SimpleDB;
import simpledb.file.*;

/**
 * An individual buffer.
 * A buffer wraps a page and stores information about its status,
 * such as the disk block associated with the page,
 * the number of times the block has been pinned,
 * whether the contents of the page have been modified,
 * and if so, the id of the modifying transaction and
 * the LSN of the corresponding log record.
 * @author Edward Sciore
 */
public class Buffer {
   private Page contents = new Page();
   private Block blk = null;
   private int pins = 0;
   private int modifiedBy = -1;  // negative means not modified
   private int logSequenceNumber = -1; // negative means no corresponding log record

   /**
    * Creates a new buffer, wrapping a new
    * {@link simpledb.file.Page page}.
    * This constructor is called exclusively by the
    * class {@link BasicBufferMgr}.
    * It depends on the
    * {@link simpledb.log.LogMgr LogMgr} object
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    */
   public Buffer() {}

   /**
    * Returns the integer value at the specified offset of the
    * buffer's page.
    * If an integer was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset of the page
    * @return the integer value at that offset
    */
   public int getInt(int offset) {
      return contents.getInt(offset);
   }

   /**
    * Returns the string value at the specified offset of the
    * buffer's page.
    * If a string was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset of the page
    * @return the string value at that offset
    */
   public String getString(int offset) {
      return contents.getString(offset);
   }

   /**
    * Writes an integer to the specified offset of the
    * buffer's page.
    * This method assumes that the transaction has already
    * written an appropriate log record.
    * The buffer saves the id of the transaction
    * and the LSN of the log record.
    * A negative lsn value indicates that a log record
    * was not necessary.
    * @param offset the byte offset within the page
    * @param val the new integer value to be written
    * @param txnum the id of the transaction performing the modification
    * @param lsn the LSN of the corresponding log record
    */
   public void setInt(int offset, int val, int txnum, int lsn) {
      modifiedBy = txnum;
      if (lsn >= 0)
         logSequenceNumber = lsn;
      contents.setInt(offset, val);
   }

   /**
    * Writes a string to the specified offset of the
    * buffer's page.
    * This method assumes that the transaction has already
    * written an appropriate log record.
    * A negative lsn value indicates that a log record
    * was not necessary.
    * The buffer saves the id of the transaction
    * and the LSN of the log record.
    * @param offset the byte offset within the page
    * @param val the new string value to be written
    * @param txnum the id of the transaction performing the modification
    * @param lsn the LSN of the corresponding log record
    */
   public void setString(int offset, String val, int txnum, int lsn) {
      modifiedBy = txnum;
      if (lsn >= 0)
         logSequenceNumber = lsn;
      contents.setString(offset, val);
   }

   /**
    * Returns a reference to the disk block
    * that the buffer is pinned to.
    * @return a reference to a disk block
    */
   public Block block() {
      return blk;
   }

   /**
    * Writes the page to its disk block if the
    * page is dirty.
    * The method ensures that the corresponding log
    * record has been written to disk prior to writing
    * the page to disk.
    */
   void flush() {
      if (modifiedBy >= 0) {
         SimpleDB.logMgr().flush(logSequenceNumber);
         contents.write(blk);
         modifiedBy = -1;
      }
   }

   /**
    * Increases the buffer's pin count.
    */
   void pin() {
      pins++;
   }

   /**
    * Decreases the buffer's pin count.
    */
   void unpin() {
      pins--;
   }

   /**
    * Returns true if the buffer is currently pinned
    * (that is, if it has a nonzero pin count).
    * @return true if the buffer is pinned
    */
   boolean isPinned() {
      return pins > 0;
   }

   /**
    * Returns true if the buffer is dirty
    * due to a modification by the specified transaction.
    * @param txnum the id of the transaction
    * @return true if the transaction modified the buffer
    */
   boolean isModifiedBy(int txnum) {
      return txnum == modifiedBy;
   }

   /**
    * Reads the contents of the specified block into
    * the buffer's page.
    * If the buffer was dirty, then the contents
    * of the previous page are first written to disk.
    * @param b a reference to the data block
    */
   void assignToBlock(Block b) {
      flush();
      blk = b;
      contents.read(blk);
      pins = 0;
   }

   /**
    * Initializes the buffer's page according to the specified formatter,
    * and appends the page to the specified file.
    * If the buffer was dirty, then the contents
    * of the previous page are first written to disk.
    * @param filename the name of the file
    * @param fmtr a page formatter, used to initialize the page
    */
   void assignToNew(String filename, PageFormatter fmtr) {
      flush();
      fmtr.format(contents);
      blk = contents.append(filename);
      pins = 0;
   }
}

SimpleDB source file simpledb/buffer/PageFormatter.java


• Usually the contents of a Buffer are read from an existing disk Block .
• But when a new disk Block is allocated and appended into a file, where do we get
the initial contents for its Page?
• SimpleDB uses the concept of a Page formatter for this.

• Such a formatter is a function which initializes the Page in RAM appropriately.

• Each kind of a disk block will define its own kind of formatter.

• For instance, the Record Manager will define a formatter which initializes the Page
to consist of empty unused Record s.

• Client threads will then access this formatted Page, and eventually the Buffer Man-
ager will write it onto the disk, creating the new Block.
package simpledb.buffer;

import simpledb.file.Page;

/**
 * An interface used to initialize a new block on disk.
 * There will be an implementing class for each "type" of
 * disk block.
 * @author Edward Sciore
 */
public interface PageFormatter {
   /**
    * Initializes a page, whose contents will be
    * written to a new disk block.
    * This method is called only during the method
    * {@link Buffer#assignToNew}.
    * @param p a buffer page
    */
   public void format(Page p);
}

SimpleDB source file simpledb/buffer/BasicBufferMgr.java


• This basic Buffer Manager implements cases (1)–(3) of the Buffer granting algorithm.

• That is, it handles all the cases where the requesting client t can get a Buffer without
having to sleep first.
package simpledb.buffer;

import simpledb.file.*;

/**
 * Manages the pinning and unpinning of buffers to blocks.
 * @author Edward Sciore
 */
class BasicBufferMgr {
   private Buffer[] bufferpool;
   private int numAvailable;

   /**
    * Creates a buffer manager having the specified number
    * of buffer slots.
    * This constructor depends on both the {@link FileMgr} and
    * {@link simpledb.log.LogMgr LogMgr} objects
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    * @param numbuffs the number of buffer slots to allocate
    */
   BasicBufferMgr(int numbuffs) {
      bufferpool = new Buffer[numbuffs];
      numAvailable = numbuffs;
      for (int i=0; i<numbuffs; i++)
         bufferpool[i] = new Buffer();
   }

   /**
    * Flushes the dirty buffers modified by the specified transaction.
    * @param txnum the transaction's id number
    */
   synchronized void flushAll(int txnum) {
      for (Buffer buff : bufferpool)
         if (buff.isModifiedBy(txnum))
            buff.flush();
   }

   /**
    * Pins a buffer to the specified block.
    * If there is already a buffer assigned to that block
    * then that buffer is used;
    * otherwise, an unpinned buffer from the pool is chosen.
    * Returns a null value if there are no available buffers.
    * @param blk a reference to a disk block
    * @return the pinned buffer
    */
   synchronized Buffer pin(Block blk) {
      Buffer buff = findExistingBuffer(blk);
      if (buff == null) {
         buff = chooseUnpinnedBuffer();
         if (buff == null)
            return null;
         buff.assignToBlock(blk);
      }
      if (!buff.isPinned())
         numAvailable--;
      buff.pin();
      return buff;
   }

   /**
    * Allocates a new block in the specified file, and
    * pins a buffer to it.
    * Returns null (without allocating the block) if
    * there are no available buffers.
    * @param filename the name of the file
    * @param fmtr a pageformatter object, used to format the new block
    * @return the pinned buffer
    */
   synchronized Buffer pinNew(String filename, PageFormatter fmtr) {
      Buffer buff = chooseUnpinnedBuffer();
      if (buff == null)
         return null;
      buff.assignToNew(filename, fmtr);
      numAvailable--;
      buff.pin();
      return buff;
   }

   /**
    * Unpins the specified buffer.
    * @param buff the buffer to be unpinned
    */
   synchronized void unpin(Buffer buff) {
      buff.unpin();
      if (!buff.isPinned())
         numAvailable++;
   }

   /**
    * Returns the number of available (i.e. unpinned) buffers.
    * @return the number of available buffers
    */
   int available() {
      return numAvailable;
   }

   private Buffer findExistingBuffer(Block blk) {
      for (Buffer buff : bufferpool) {
         Block b = buff.block();
         if (b != null && b.equals(blk))
            return buff;
      }
      return null;
   }

   private Buffer chooseUnpinnedBuffer() {
      for (Buffer buff : bufferpool)
         if (!buff.isPinned())
            return buff;
      return null;
   }
}

SimpleDB source file simpledb/buffer/BufferMgr.java

• This full Buffer Manager adds the remaining case ¯ of the Buffer granting algorithm
into the basic Buffer Manager.

• That is, it handles the remaining case where the requesting client t must first go to
sleep waiting for a Buffer to become unpinned.

• SimpleDB implements this sleeping with the Java lock of the unique bufferMgr
object (Sestoft, 2005, Chapter 16.4) as follows:

¶ When a client thread executes a synchronized bufferMgr.pin. . . method, but
finds all its Buffer s pinned,. . .
· it calls bufferMgr.wait which puts it to sleep waiting for another thread to
call bufferMgr.notify.

¸ When another thread takes the last pin from a Buffer , it calls bufferMgr.notifyAll ,
which wakes up every thread which is bufferMgr.waiting for this to happen,
and. . .
¹ all these threads compete for this one unpinned Buffer . One of them wins, and
the others must bufferMgr.wait again.
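The four steps above can be sketched as a plain Java monitor. This is a minimal, self-contained model of the wait/notifyAll protocol; the Pool class and all its names are hypothetical illustrations, not SimpleDB code:

```java
// Minimal model of the wait/notifyAll protocol described above.
// Pool and its names are hypothetical; they are not SimpleDB's types.
class Pool {
    private int available;

    Pool(int n) { available = n; }

    // Blocks until a permit is free, like a client waiting for an unpinned Buffer.
    synchronized void acquire() throws InterruptedException {
        while (available == 0)
            wait();              // step 2: sleep until some thread calls notifyAll
        available--;
    }

    // Like taking the last pin off a Buffer: wake every waiting thread (step 3);
    // the winner of the race re-checks the condition, the losers wait again (step 4).
    synchronized void release() {
        available++;
        notifyAll();
    }

    synchronized int free() { return available; }
}

public class WaitNotifyDemo {
    public static void main(String[] args) throws InterruptedException {
        Pool pool = new Pool(1);
        pool.acquire();                       // main thread takes the only permit
        Thread t = new Thread(() -> {
            try { pool.acquire(); } catch (InterruptedException e) { /* ignore */ }
        });
        t.start();                            // t goes to sleep inside wait()
        Thread.sleep(100);
        pool.release();                       // wakes t, which grabs the permit
        t.join();
        System.out.println("free=" + pool.free());   // prints free=0
    }
}
```

Note that acquire re-checks the condition in a while loop after waking up; this is exactly why BufferMgr.pin below retries bufferMgr.pin(blk) in a loop instead of assuming a wakeup means success.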

• This implementation is simple but unfair:

– A waiting thread can experience livelock where it cannot get on with its work,
because it always loses in the competitions of step ¹.
– A fair implementation would grant the buffer requests in FIFO order instead.

Livelocking is rarely a problem in practice.

• However, this Buffer Manager can also cause a deadlock – and that is a problem!

– Let the pool consist of just 2 Buffer s for simplicity.


– One client A needs 2 Buffer s. Another client B needs 2 Buffer s too.
– First client A pins one of the Buffer s in the pool. Then client B pins the other
Buffer in the pool.
– Now client A is waiting for another Buffer , and so is client B.
– Both clients A and B are now stuck, because each is holding a Buffer which
the other needs before it can unpin the Buffer it already has.

• SimpleDB breaks such Buffer ing deadlocks in a simple way:

– If a client thread has been waiting for a Buffer for 10 seconds, then it is
assumed to be in a deadlock.
– Then SimpleDB raises a BufferAbortException in this client thread in the
RDBMS server process, which. . .
– aborts the client thread’s current transaction, and this in turn unpins all its
Buffer s, and. . .
– gets passed to the client process too.
– This is an example of where the RDBMS reports an “error” to the client process
because it is running low on resources, as in failure reason ¹ of section 3.3.

• A better way would be to avoid the deadlock:

¶ Since client A has already got one of the Buffer s, client B must not get the
other.
· Instead, client A can get both Buffer s, and execute.
¸ At the end client A unpins both its Buffer s, and then client B can execute.

We shall do so at the Transaction level. A more sophisticated RDBMS than
SimpleDB does so at both levels.
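One standard way to make the avoidance idea above concrete is ordered acquisition: every client acquires the Buffer s it needs in one globally agreed order, so a cycle of waits (A holds 0 and wants 1, B holds 1 and wants 0) can never form. A minimal sketch; the class and all its names are hypothetical illustrations, not SimpleDB code:

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Deadlock avoidance by ordered acquisition: if every client locks
// buffers in ascending index order, two clients can never each hold
// one buffer while waiting for the other's. Hypothetical illustration.
public class OrderedLocking {
    static final ReentrantLock[] buffers =
        { new ReentrantLock(), new ReentrantLock() };

    // Lock the requested buffers in ascending order,
    // regardless of the order the caller asked for them.
    static void pinAll(int... wanted) {
        int[] sorted = wanted.clone();
        Arrays.sort(sorted);
        for (int i : sorted)
            buffers[i].lock();
    }

    static void unpinAll(int... held) {
        for (int i : held)
            buffers[i].unlock();
    }

    public static void main(String[] args) throws InterruptedException {
        // Clients A and B as in the deadlock scenario above, but now safe:
        Thread a = new Thread(() -> { pinAll(0, 1); unpinAll(0, 1); });
        Thread b = new Thread(() -> { pinAll(1, 0); unpinAll(1, 0); });
        a.start(); b.start();
        a.join(); b.join();      // always terminates: no wait cycle can form
        System.out.println("no deadlock");
    }
}
```

This prevents the deadlock instead of detecting it by timeout, at the price of requiring each client to know (or sort) its requests up front.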

package simpledb.buffer;

import simpledb.file.*;

/**
 * The publicly-accessible buffer manager.
 * A buffer manager wraps a basic buffer manager, and
 * provides the same methods. The difference is that
 * the methods {@link #pin(Block) pin} and
 * {@link #pinNew(String, PageFormatter) pinNew}
 * will never return null.
 * If no buffers are currently available, then the
 * calling thread will be placed on a waiting list.
 * The waiting threads are removed from the list when
 * a buffer becomes available.
 * If a thread has been waiting for a buffer for an
 * excessive amount of time (currently, 10 seconds)
 * then a {@link BufferAbortException} is thrown.
 * @author Edward Sciore
 */
public class BufferMgr {
   private static final long MAX_TIME = 10000; // 10 seconds
   private BasicBufferMgr bufferMgr;

   /**
    * Creates a new buffer manager having the specified
    * number of buffers.
    * This constructor depends on both the {@link FileMgr} and
    * {@link simpledb.log.LogMgr LogMgr} objects
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    * @param numbuffers the number of buffer slots to allocate
    */
   public BufferMgr(int numbuffers) {
      bufferMgr = new BasicBufferMgr(numbuffers);
   }

   /**
    * Pins a buffer to the specified block, potentially
    * waiting until a buffer becomes available.
    * If no buffer becomes available within a fixed
    * time period, then a {@link BufferAbortException} is thrown.
    * @param blk a reference to a disk block
    * @return the buffer pinned to that block
    */
   public synchronized Buffer pin(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         Buffer buff = bufferMgr.pin(blk);
         while (buff == null && !waitingTooLong(timestamp)) {
            wait(MAX_TIME);
            buff = bufferMgr.pin(blk);
         }
         if (buff == null)
            throw new BufferAbortException();
         return buff;
      }
      catch (InterruptedException e) {
         throw new BufferAbortException();
      }
   }

   /**
    * Pins a buffer to a new block in the specified file,
    * potentially waiting until a buffer becomes available.
    * If no buffer becomes available within a fixed
    * time period, then a {@link BufferAbortException} is thrown.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the page
    * @return the buffer pinned to that block
    */
   public synchronized Buffer pinNew(String filename, PageFormatter fmtr) {
      try {
         long timestamp = System.currentTimeMillis();
         Buffer buff = bufferMgr.pinNew(filename, fmtr);
         while (buff == null && !waitingTooLong(timestamp)) {
            wait(MAX_TIME);
            buff = bufferMgr.pinNew(filename, fmtr);
         }
         if (buff == null)
            throw new BufferAbortException();
         return buff;
      }
      catch (InterruptedException e) {
         throw new BufferAbortException();
      }
   }

   /**
    * Unpins the specified buffer.
    * If the buffer's pin count becomes 0,
    * then the threads on the wait list are notified.
    * @param buff the buffer to be unpinned
    */
   public synchronized void unpin(Buffer buff) {
      bufferMgr.unpin(buff);
      if (!buff.isPinned())
         notifyAll();
   }

   /**
    * Flushes the dirty buffers modified by the specified transaction.
    * @param txnum the transaction's id number
    */
   public void flushAll(int txnum) {
      bufferMgr.flushAll(txnum);
   }

   /**
    * Returns the number of available (i.e. unpinned) buffers.
    * @return the number of available buffers
    */
   public int available() {
      return bufferMgr.available();
   }

   private boolean waitingTooLong(long starttime) {
      return System.currentTimeMillis() - starttime > MAX_TIME;
   }
}

SimpleDB source file simpledb/buffer/BufferAbortException.java


package simpledb.buffer;

/**
 * A runtime exception indicating that the transaction
 * needs to abort because a buffer request could not be satisfied.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class BufferAbortException extends RuntimeException {}

4.4 Transaction Management


(Sciore, 2008, Chapters 8.2.2–8.2.3 and 14)

• We have defined what Transactions are and their 4 ACID properties in section 2.5.

• Let us now consider how SimpleDB implements them, and some alternatives.

• Transactions serve 2 purposes:

Recovery of the database after its server process is restarted after a shutdown.
Concurrency Management for Buffer s and other resources which several client threads
and Transactions want to use at the same time.

4.4.1 Database Recovery


(Sciore, 2008, Chapter 14.3) (Weikum and Vossen, 2001, Chapters 12–13)

• Let us start with their Recovery purpose.

• Recovery takes place when the RDBMS server process is restarted after a shutdown
(for whatever reason).

• Recovery restores the database into some consistent state, where. . .

– some Transactions are completely done, while. . .


– other Transactions have not even started.

• Recovery uses the information in the Log file which was written before the shutdown.

• Based on this Log information, the Recovery Manager. . .

undoes the modifications made by Transactions that got started but never committed
before the shutdown, and

no-redo, no-undo: The database contains exactly the modifications by the
      committed Transactions.
   + No separate recovery needed at all!
   − All the same problems as the other three kinds below. . .
no-redo, with-undo: The database does contain all the modifications by the
      committed Transactions (and maybe more).
   + Recovery is faster than with redo.
   − When a Transaction commits, all its Buffer s must be flushed, as in
      Figure 40 – a lot of disk I/O in one burst! In all, this could mean up to
      10 × normal I/O!
with-redo, no-undo: The database contains no modifications by aborted
      Transactions.
   + A Transaction can be aborted quickly – compare Figures 42 and 41.
   − A Transaction must keep all its Buffer s pinned until it ends – so it
      needs a lot of RAM!
   − It restricts concurrency support only to the Page level – so it restricts
      an important design choice for another part of the RDBMS.
with-redo, with-undo: Algorithms like ARIES used in most commercial
      RDBMSs.
   − Recovery is slower.
   + Normal operation has many more commits than aborts or shutdowns,
      and it is faster in these algorithms.
   These algorithms perform both stages of Figure 39.

Table 3: 4 kinds of recovery algorithms. (Weikum and Vossen, 2001, Chapter 12.5)

redoes the modifications made by Transactions that committed before the shut-
down, but whose Buffer s might not have been flushed yet.

This gives the 4-way classification of Page-oriented recovery algorithms shown in
Table 3.

• A Transaction with a Start but neither a Commit nor a Rollback log record must
have been running while the shutdown happened.

• Committing a Transaction with redo can simplify step 1 of Figure 40 into just
unpinning these Buffer s:

– The Buffer manager will write them to their disk Block s later, when it recycles
them.
– If the RDBMS shuts down before it has written them, then their modifications
can be redone from the Log during recovery.

• Figure 41 reconstructs the original disk Block contents into RAM Buffer s.

Figure 39: The general recovery algorithm. (Sciore, 2008)

Figure 40: Committing without redo. (Sciore, 2008)

Figure 41: Aborting with undo. (Sciore, 2008)

Figure 42: Aborting without undo. (Sciore, 2008)

– One way to ensure that these original Buffer s are written to disk is to flush
them before adding the abort Log record.
This is similar to committing in the “with-undo-no-redo” approach in Table 3.
– Another way is to add Log records also in its step 2a.
This way is compatible with more design choices than the first.

• These are the main kinds of log records in an RDBMS:

Start: When a new Transaction starts, the RDBMS. . .


¬ assigns it a new unique ID ι, and
­ creates for it a new Start log record with this ID ι
so that recovery knows when each Transaction started.
Commit: When a Transaction with ID ι commits, then the RDBMS creates a corre-
sponding log record with this ID ι, so that recovery knows that this Transaction
ended before the shutdown.
Rollback: For the same reason, when it aborts instead, the RDBMS creates a log
record for it.
Updateτ for every type τ of attribute it supports. It represents modifying a Record
field of this type τ . This record consists of the following:
– Which Block was modified?
In SimpleDB, its filename and block number within that file.
– At which offset within that Block did the modification start?
The Record Manager will determine these offsets, because it handles how
data is represented inside disk Block s.
– What was the old value of type τ at that offset before the modification?
This is needed for undoing this modification in step 1c of Figure 39.
– What is the new value of type τ which overwrote that old value?
This is needed for redoing this modification in step 2 of Figure 39.

• SimpleDB has chosen attribute values as its Logging and recovery granularity.

– Other, coarser choices would have been to Log changes to whole Block s or even
files instead.
– Then the Log would contain fewer records, but each record would be larger.
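How the old and new values stored in such an Updateτ record drive undo and redo can be sketched with a simplified model. The class, its names, and the int-array stand-in for a disk Block are hypothetical illustrations, not SimpleDB's types:

```java
// Simplified model of an Update log record for τ = int: it remembers
// where the modification happened and both the old and the new value,
// so the modification can be replayed in either direction.
// Hypothetical illustration; not SimpleDB code.
class IntUpdateRecord {
    final String filename;       // which file...
    final int blockNum;          // ...and which block in it was modified
    final int offset;            // where in the block the value sits
    final int oldVal, newVal;    // before- and after-images of the value

    IntUpdateRecord(String f, int b, int off, int oldVal, int newVal) {
        this.filename = f; this.blockNum = b; this.offset = off;
        this.oldVal = oldVal; this.newVal = newVal;
    }

    // Undo (step 1c of Figure 39): restore the before-image.
    void undo(int[] page) { page[offset] = oldVal; }

    // Redo (step 2 of Figure 39): reinstall the after-image.
    void redo(int[] page) { page[offset] = newVal; }
}

public class UpdateRecordDemo {
    public static void main(String[] args) {
        int[] page = new int[8];          // stand-in for one disk block
        page[3] = 10;
        IntUpdateRecord rec = new IntUpdateRecord("student.tbl", 0, 3, 10, 42);
        rec.redo(page);                   // the modification itself
        System.out.println(page[3]);      // prints 42
        rec.undo(page);                   // aborted transaction: roll it back
        System.out.println(page[3]);      // prints 10
    }
}
```

Keeping both images is what lets one record type serve both phases of Figure 39; a no-undo or no-redo algorithm could drop the corresponding image.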

SimpleDB source file simpledb/tx/recovery/LogRecord.java
• Here is the definition of the LogRecord interface.

• Each of the 5 files after it implements one particular kind of a log record mentioned
before.
package simpledb.tx.recovery;

import simpledb.log.LogMgr;
import simpledb.server.SimpleDB;

/**
 * The interface implemented by each type of log record.
 * @author Edward Sciore
 */
public interface LogRecord {
   /**
    * The six different types of log record
    */
   static final int CHECKPOINT = 0, START = 1,
      COMMIT = 2, ROLLBACK = 3,
      SETINT = 4, SETSTRING = 5;

   static final LogMgr logMgr = SimpleDB.logMgr();

   /**
    * Writes the record to the log and returns its LSN.
    * @return the LSN of the record in the log
    */
   int writeToLog();

   /**
    * Returns the log record's type.
    * @return the log record's type
    */
   int op();

   /**
    * Returns the transaction id stored with
    * the log record.
    * @return the log record's transaction id
    */
   int txNumber();

   /**
    * Undoes the operation encoded by this log record.
    * The only log record types for which this method
    * does anything interesting are SETINT and SETSTRING.
    * @param txnum the id of the transaction that is performing the undo.
    */
   void undo(int txnum);
}

SimpleDB source file simpledb/tx/recovery/StartRecord.java


package simpledb.tx.recovery;

import simpledb.log.BasicLogRecord;

class StartRecord implements LogRecord {
   private int txnum;

   /**
    * Creates a new start log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public StartRecord(int txnum) {
      this.txnum = txnum;
   }

   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public StartRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }

   /**
    * Writes a start record to the log.
    * This log record contains the START operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {START, txnum};
      return logMgr.append(rec);
   }

   public int op() {
      return START;
   }

   public int txNumber() {
      return txnum;
   }

   /**
    * Does nothing, because a start record
    * contains no undo information.
    */
   public void undo(int txnum) {}

   public String toString() {
      return "<START " + txnum + ">";
   }
}

SimpleDB source file simpledb/tx/recovery/SetStringRecord.java

• SimpleDB calls its Updateτ log records Setτ instead.

• This file is its definition for τ = String.

• The next file is its definition for the other type τ = Int which SimpleDB supports.

package simpledb.tx.recovery;

import simpledb.server.SimpleDB;

import simpledb.buffer.*;
import simpledb.file.Block;
import simpledb.log.BasicLogRecord;

class SetStringRecord implements LogRecord {
   private int txnum, offset;
   private String val;
   private Block blk;

   /**
    * Creates a new setstring log record.
    * @param txnum the ID of the specified transaction
    * @param blk the block containing the value
    * @param offset the offset of the value in the block
    * @param val the new value
    */
   public SetStringRecord(int txnum, Block blk, int offset, String val) {
      this.txnum = txnum;
      this.blk = blk;
      this.offset = offset;
      this.val = val;
   }

   /**
    * Creates a log record by reading five other values from the log.
    * @param rec the basic log record
    */
   public SetStringRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
      String filename = rec.nextString();
      int blknum = rec.nextInt();
      blk = new Block(filename, blknum);
      offset = rec.nextInt();
      val = rec.nextString();
   }

   /**
    * Writes a setString record to the log.
    * This log record contains the SETSTRING operator,
    * followed by the transaction id, the filename, number,
    * and offset of the modified block, and the previous
    * string value at that offset.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {SETSTRING, txnum, blk.fileName(),
                                   blk.number(), offset, val};
      return logMgr.append(rec);
   }

   public int op() {
      return SETSTRING;
   }

   public int txNumber() {
      return txnum;
   }

   public String toString() {
      return "<SETSTRING " + txnum + " " + blk + " " + offset + " " + val + ">";
   }

   /**
    * Replaces the specified data value with the value saved in the log record.
    * The method pins a buffer to the specified block,
    * calls setString to restore the saved value
    * (using a dummy LSN), and unpins the buffer.
    * @see simpledb.tx.recovery.LogRecord#undo(int)
    */
   public void undo(int txnum) {
      BufferMgr buffMgr = SimpleDB.bufferMgr();
      Buffer buff = buffMgr.pin(blk);
      buff.setString(offset, val, txnum, -1);
      buffMgr.unpin(buff);
   }
}

SimpleDB source file simpledb/tx/recovery/SetIntRecord.java


package simpledb.tx.recovery;

import simpledb.server.SimpleDB;

import simpledb.buffer.*;
import simpledb.file.Block;
import simpledb.log.BasicLogRecord;

class SetIntRecord implements LogRecord {
   private int txnum, offset, val;
   private Block blk;

   /**
    * Creates a new setint log record.
    * @param txnum the ID of the specified transaction
    * @param blk the block containing the value
    * @param offset the offset of the value in the block
    * @param val the new value
    */
   public SetIntRecord(int txnum, Block blk, int offset, int val) {
      this.txnum = txnum;
      this.blk = blk;
      this.offset = offset;
      this.val = val;
   }

   /**
    * Creates a log record by reading five other values from the log.
    * @param rec the basic log record
    */
   public SetIntRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
      String filename = rec.nextString();
      int blknum = rec.nextInt();
      blk = new Block(filename, blknum);
      offset = rec.nextInt();
      val = rec.nextInt();
   }

   /**
    * Writes a setInt record to the log.
    * This log record contains the SETINT operator,
    * followed by the transaction id, the filename, number,
    * and offset of the modified block, and the previous
    * integer value at that offset.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {SETINT, txnum, blk.fileName(),
                                   blk.number(), offset, val};
      return logMgr.append(rec);
   }

   public int op() {
      return SETINT;
   }

   public int txNumber() {
      return txnum;
   }

   public String toString() {
      return "<SETINT " + txnum + " " + blk + " " + offset + " " + val + ">";
   }

   /**
    * Replaces the specified data value with the value saved in the log record.
    * The method pins a buffer to the specified block,
    * calls setInt to restore the saved value
    * (using a dummy LSN), and unpins the buffer.
    * @see simpledb.tx.recovery.LogRecord#undo(int)
    */
   public void undo(int txnum) {
      BufferMgr buffMgr = SimpleDB.bufferMgr();
      Buffer buff = buffMgr.pin(blk);
      buff.setInt(offset, val, txnum, -1);
      buffMgr.unpin(buff);
   }
}

SimpleDB source file simpledb/tx/recovery/CommitRecord.java


package simpledb.tx.recovery;

import simpledb.log.BasicLogRecord;

/**
 * The COMMIT log record.
 * @author Edward Sciore
 */
class CommitRecord implements LogRecord {
   private int txnum;

   /**
    * Creates a new commit log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public CommitRecord(int txnum) {
      this.txnum = txnum;
   }

   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public CommitRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }

   /**
    * Writes a commit record to the log.
    * This log record contains the COMMIT operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {COMMIT, txnum};
      return logMgr.append(rec);
   }

   public int op() {
      return COMMIT;
   }

   public int txNumber() {
      return txnum;
   }

   /**
    * Does nothing, because a commit record
    * contains no undo information.
    */
   public void undo(int txnum) {}

   public String toString() {
      return "<COMMIT " + txnum + ">";
   }
}

SimpleDB source file simpledb/tx/recovery/RollbackRecord.java


package simpledb.tx.recovery;

import simpledb.log.BasicLogRecord;

/**
 * The ROLLBACK log record.
 * @author Edward Sciore
 */
class RollbackRecord implements LogRecord {
   private int txnum;

   /**
    * Creates a new rollback log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public RollbackRecord(int txnum) {
      this.txnum = txnum;
   }

   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public RollbackRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }

   /**
    * Writes a rollback record to the log.
    * This log record contains the ROLLBACK operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {ROLLBACK, txnum};
      return logMgr.append(rec);
   }

   public int op() {
      return ROLLBACK;
   }

   public int txNumber() {
      return txnum;
   }

   /**
    * Does nothing, because a rollback record
    * contains no undo information.
    */
   public void undo(int txnum) {}

   public String toString() {
      return "<ROLLBACK " + txnum + ">";
   }
}

Checkpoints (Sciore, 2008, Chapters 14.3.6–14.3.7) (Weikum and Vossen, 2001, Chapter 13.3.3)
• The RDBMS can mark a checkpoint into its Log file at a moment where all its
Buffer s and Transactions are in some suitable known “quiet” state.

• These checkpoints speed up recovery by limiting the amount of Log information


required.

• Another, different meaning of the same word is a database state which its user
can save and later revert to. We do not consider that meaning here.

• The DBA can set how frequently the RDBMS takes these system checkpoints. Typ-
ical values are between 1 and 5 minutes.

• Heavyweight checkpointing flushes all modified Buffer s to get them into a quiet
state. It can be further divided into

quiescent ("uinuva" in Finnish) checkpointing, where the Transactions also enter
a quiet state.
– Its algorithm is given as Figure 43.
– Its main problem is that the RDBMS server process may become unre-
sponsive to its users for a long time – if one of the existing Transactions
runs for a long time before finishing.
nonquiescent checkpointing, where the Transactions continue afterwards.
– Its algorithm is given as Figure 44.
– Here too the RDBMS server process may become unresponsive, but for a
briefer time – only during the I/O burst caused by flushing all Buffer s in
step 3.

Log Truncation
• These system checkpoints allow the undo stage 1 of Figure 39 to stop reading the
Log file backwards sooner:
When the Recovery Manager encounters the first (that is, the most recent)

quiescent checkpoint, it can stop right there, because


– the database contents are known since all Buffer s were flushed then, and
– no modifications are pending, since no Transactions were running then.
nonquiescent checkpoint with Transactions T1 , . . . , Tk
– it can delete those which are on its committed or aborted lists, and
– stop as soon as it has seen the start records of the others, because
– it has then found all the relevant update records for step 2, similarly to
the quiescent case above.

• This is how the algorithm in Figure 39 behaves on the Log file in Figure 45:

¬ Because Transaction 3 is not in the initially empty list of committed Transactions,
it will undo the last Log record SETINT, 3. . . .

Figure 43: Quiescent checkpointing. (Sciore, 2008)

Figure 44: Nonquiescent checkpointing. (Sciore, 2008)

Figure 45: An example log with a nonquiescent checkpoint. (Sciore, 2008)

­ Going backwards in its undo phase 1, it does the same for the preceding record
SETINT, 2. . . too.
® It passes over the START, 3 record.
¯ The COMMIT, 0 record adds 0 to its committed list.
° This causes the SETSTRING, 0 record to be ignored.
± When it encounters the nonquiescent checkpoint NQCKPT, 0, 2, it
– ignores 0 because it is now in the committed list, and
– knows that it can stop its phase 1 as soon as it encounters the START, 2
record.
² It undoes the SETSTRING, 2 record, because 2 is neither in the committed
nor in the (initially empty) aborted list.
³ The COMMIT, 1 record adds 1 to the committed list.
´ Now it encounters the START, 2 record it has been looking for, and moves from
its undo phase 1 into its redo phase 2.
µ This redo phase moves forward in the log from this START, 2 record, and
redoes all SET. . . ,0 and SET. . . ,1 records, because Transactions 0 and 1 form
its committed list.
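The undo phase just traced can be sketched as a backward loop over simplified log records, collecting the committed IDs and undoing the updates of everyone else. The record representation and all names are hypothetical illustrations, not SimpleDB code:

```java
import java.util.*;

// Sketch of the undo phase illustrated above: walk the Log backwards,
// remember which Transactions committed, and undo the updates of all
// the others. Hypothetical illustration; not SimpleDB code.
public class UndoScanDemo {

   // A log record, reduced to its operator and transaction id.
   static class Rec {
      final String op;
      final int tx;
      Rec(String op, int tx) { this.op = op; this.tx = tx; }
   }

   // Returns the transaction ids whose updates the undo phase rolls back,
   // in the order it encounters them (most recent first).
   static List<Integer> undoScan(List<Rec> log) {
      Set<Integer> committed = new HashSet<Integer>();
      List<Integer> undone = new ArrayList<Integer>();
      for (int i = log.size() - 1; i >= 0; i--) {   // backwards: newest first
         Rec r = log.get(i);
         if (r.op.equals("COMMIT"))
            committed.add(r.tx);
         else if (r.op.equals("SETINT") || r.op.equals("SETSTRING")) {
            if (!committed.contains(r.tx))
               undone.add(r.tx);                    // uncommitted update: undo it
         }
         // START etc. carry no undo information
      }
      return undone;
   }

   public static void main(String[] args) {
      // The shape of the Figure 45 example: Transaction 0 committed,
      // Transactions 2 and 3 were still running at the shutdown.
      List<Rec> log = Arrays.asList(
         new Rec("START", 0), new Rec("SETSTRING", 0), new Rec("COMMIT", 0),
         new Rec("START", 2), new Rec("SETSTRING", 2),
         new Rec("START", 3), new Rec("SETINT", 3));
      System.out.println(undoScan(log));   // prints [3, 2]
   }
}
```

The real algorithm additionally stops early at a checkpoint record and then switches into the forward redo phase; this sketch shows only the backward bookkeeping.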

• In this way, the last checkpoint in the Log file determines its still relevant tail – all
earlier records can be ignored.

• Deleting these no longer relevant records is called truncating the old Log file.

– The RDBMS can do it during Figure 39:


¬ Create a new initially empty Log file between its 2 stages.
­ During its redo stage 2, copy each Log record from the current file into
this new file.
® In the end, switch to using this new file, and delete or archive the old file.
– The RDBMS can do it also during normal operation:
¶ Stop accepting new Transactions.
· Execute the algorithm above but without restoring anything. That is:
undo stage 1 is executed just to find out where it stops
redo stage 2 is executed just to copy the Log records.
¸ Start accepting new Transactions.
– SimpleDB does not seem to truncate its Log.
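The truncation idea can be sketched on a toy string-based log: everything before the last (most recent) quiescent checkpoint is irrelevant to recovery, so only the tail from that checkpoint onwards is copied into the new log file. All names here are hypothetical illustrations, not SimpleDB code:

```java
import java.util.*;

// Sketch of Log truncation as described above. The string-based log
// representation is a simplified, hypothetical illustration.
public class LogTruncationDemo {

   // Returns the still-relevant tail of the log: the records from the
   // last quiescent checkpoint onwards.
   static List<String> truncate(List<String> log) {
      int ckpt = log.lastIndexOf("CHECKPOINT");
      if (ckpt < 0)
         return new ArrayList<String>(log);   // no checkpoint: keep everything
      return new ArrayList<String>(log.subList(ckpt, log.size()));
   }

   public static void main(String[] args) {
      List<String> log = Arrays.asList(
         "START 1", "SETINT 1", "COMMIT 1",   // before the checkpoint: droppable
         "CHECKPOINT",
         "START 2", "SETSTRING 2");           // the relevant tail
      System.out.println(truncate(log));      // prints [CHECKPOINT, START 2, SETSTRING 2]
   }
}
```

A nonquiescent checkpoint would need the start records of its listed Transactions as well, so the cut point would move further back than the checkpoint record itself.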

• Lightweight checkpointing avoids response time problems by not flushing Buffer s.


However, this complicates recovery and Log truncation. We do not discuss it here.

SimpleDB source file simpledb/tx/recovery/CheckpointRecord.java

• Here is how SimpleDB implements quiescent checkpoint records.

package simpledb.tx.recovery;

import simpledb.log.BasicLogRecord;

/**
 * The CHECKPOINT log record.
 * @author Edward Sciore
 */
class CheckpointRecord implements LogRecord {

   /**
    * Creates a quiescent checkpoint record.
    */
   public CheckpointRecord() {}

   /**
    * Creates a log record by reading no other values
    * from the basic log record.
    * @param rec the basic log record
    */
   public CheckpointRecord(BasicLogRecord rec) {}

   /**
    * Writes a checkpoint record to the log.
    * This log record contains the CHECKPOINT operator,
    * and nothing else.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {CHECKPOINT};
      return logMgr.append(rec);
   }

   public int op() {
      return CHECKPOINT;
   }

   /**
    * Checkpoint records have no associated transaction,
    * and so the method returns a "dummy", negative txid.
    */
   public int txNumber() {
      return -1; // dummy value
   }

   /**
    * Does nothing, because a checkpoint record
    * contains no undo information.
    */
   public void undo(int txnum) {}

   public String toString() {
      return "<CHECKPOINT>";
   }
}

SimpleDB source file simpledb/tx/recovery/LogRecordIterator.java


• Here is the Log record iterator which takes into account the kinds of Log records
which SimpleDB has.

• The Log Manager defined the basic LogIterator with just moving backwards in
the Log file.
package simpledb.tx.recovery;

import static simpledb.tx.recovery.LogRecord.*;
import java.util.Iterator;
import simpledb.log.BasicLogRecord;
import simpledb.server.SimpleDB;

/**
 * A class that provides the ability to read records
 * from the log in reverse order.
 * Unlike the similar class
 * {@link simpledb.log.LogIterator LogIterator},
 * this class understands the meaning of the log records.
 * @author Edward Sciore
 */
class LogRecordIterator implements Iterator<LogRecord> {
   private Iterator<BasicLogRecord> iter = SimpleDB.logMgr().iterator();

   public boolean hasNext() {
      return iter.hasNext();
   }

   /**
    * Constructs a log record from the values in the
    * current basic log record.
    * The method first reads an integer, which denotes
    * the type of the log record.  Based on that type,
    * the method calls the appropriate LogRecord constructor
    * to read the remaining values.
    * @return the next log record, or null if no more records
    */
   public LogRecord next() {
      BasicLogRecord rec = iter.next();
      int op = rec.nextInt();
      switch (op) {
         case CHECKPOINT:
            return new CheckpointRecord(rec);
         case START:
            return new StartRecord(rec);
         case COMMIT:
            return new CommitRecord(rec);
         case ROLLBACK:
            return new RollbackRecord(rec);
         case SETINT:
            return new SetIntRecord(rec);
         case SETSTRING:
            return new SetStringRecord(rec);
         default:
            return null;
      }
   }

   public void remove() {
      throw new UnsupportedOperationException();
   }
}

SimpleDB source file simpledb/tx/recovery/RecoveryMgr.java


• Here is the Recovery Manager object for each Transaction.

• SimpleDB has chosen the “with-undo-no-redo” approach in Table 3.


package simpledb.tx.recovery;

import static simpledb.tx.recovery.LogRecord.*;
import simpledb.file.Block;
import simpledb.buffer.Buffer;
import simpledb.server.SimpleDB;
import java.util.*;

/**
 * The recovery manager.  Each transaction has its own recovery manager.
 * @author Edward Sciore
 */
public class RecoveryMgr {
   private int txnum;

   /**
    * Creates a recovery manager for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public RecoveryMgr(int txnum) {
      this.txnum = txnum;
      new StartRecord(txnum).writeToLog();
   }

   /**
    * Writes a commit record to the log, and flushes it to disk.
    */
   public void commit() {
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new CommitRecord(txnum).writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }

   /**
    * Writes a rollback record to the log, and flushes it to disk.
    */
   public void rollback() {
      doRollback();
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new RollbackRecord(txnum).writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }

   /**
    * Recovers uncompleted transactions from the log,
    * then writes a quiescent checkpoint record to the log and flushes it.
    */
   public void recover() {
      doRecover();
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new CheckpointRecord().writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }

   /**
    * Writes a setint record to the log, and returns its lsn.
    * Updates to temporary files are not logged; instead, a
    * "dummy" negative lsn is returned.
    * @param buff the buffer containing the page
    * @param offset the offset of the value in the page
    * @param newval the value to be written
    */
   public int setInt(Buffer buff, int offset, int newval) {
      int oldval = buff.getInt(offset);
      Block blk = buff.block();
      if (isTempBlock(blk))
         return -1;
      else
         return new SetIntRecord(txnum, blk, offset, oldval).writeToLog();
   }

   /**
    * Writes a setstring record to the log, and returns its lsn.
    * Updates to temporary files are not logged; instead, a
    * "dummy" negative lsn is returned.
    * @param buff the buffer containing the page
    * @param offset the offset of the value in the page
    * @param newval the value to be written
    */
   public int setString(Buffer buff, int offset, String newval) {
      String oldval = buff.getString(offset);
      Block blk = buff.block();
      if (isTempBlock(blk))
         return -1;
      else
         return new SetStringRecord(txnum, blk, offset, oldval).writeToLog();
   }

   /**
    * Rolls back the transaction.
    * The method iterates through the log records,
    * calling undo() for each log record it finds
    * for the transaction,
    * until it finds the transaction's START record.
    */
   private void doRollback() {
      Iterator<LogRecord> iter = new LogRecordIterator();
      while (iter.hasNext()) {
         LogRecord rec = iter.next();
         if (rec.txNumber() == txnum) {
            if (rec.op() == START)
               return;
            rec.undo(txnum);
         }
      }
   }

   /**
    * Does a complete database recovery.
    * The method iterates through the log records.
    * Whenever it finds a log record for an unfinished
    * transaction, it calls undo() on that record.
    * The method stops when it encounters a CHECKPOINT record
    * or the end of the log.
    */
   private void doRecover() {
      Collection<Integer> finishedTxs = new ArrayList<Integer>();
      Iterator<LogRecord> iter = new LogRecordIterator();
      while (iter.hasNext()) {
         LogRecord rec = iter.next();
         if (rec.op() == CHECKPOINT)
            return;
         if (rec.op() == COMMIT || rec.op() == ROLLBACK)
            finishedTxs.add(rec.txNumber());
         else if (!finishedTxs.contains(rec.txNumber()))
            rec.undo(txnum);
      }
   }

   /**
    * Determines whether a block comes from a temporary file or not.
    */
   private boolean isTempBlock(Block blk) {
      return blk.fileName().startsWith("temp");
   }
}

4.4.2 Concurrency Control


(Sciore, 2008, Chapter 14.4) (Weikum and Vossen, 2001, Chapters 3.8–4.3.4)

• Now we turn to Concurrency Manager, the other part of Transaction Manager


besides the Recovery Manager just described.

• The task of this Manager is to coordinate all the concurrently running Transaction
threads.

• The central thing we want to coordinate is concurrent access to disk Block s, or


equivalently their RAM Pages, because we must ensure that they do not “mess up”
the database.

– The Page get and set methods were synchronized to ensure that each call
is finished before the next starts.
– Here we coordinate which Transactions are permitted to make these calls, and
for which Pages.

Histories and Schedules


• The history Ht of a Transaction t is a string built from the following kinds of
characters:

Rt (b) stands for the operation “Transaction t reads Buffer b”


Wt (b) stands for the operation “Transaction t writes Buffer b”

The idea is that Ht traces all the relevant I/O operations performed by Transaction t
in the order in which they happen.

• For instance
H1 = R1 (p)W1 (q)
says that transaction 1

first reads something from Buffer p, and


then writes something into Buffer q.

We are not interested in what it reads and writes, just in the order in which these
operations happen.

• In this way, history Ht simplifies what one Transaction t does, restricted only to
what is interesting for Concurrency Management.

• Let us now extend histories to many concurrent Transactions.

• An interleaving of histories H1 , H2 , H3 , . . . , Hn is their schedule S.

– This S is a big string which consists of the characters in these smaller strings
shuffled in some way, but keeping the characters of each Hi in their original
order.
– In other words, if we delete from S all characters for the other Transactions
j ≠ i, then we get Hi .

• For instance, here are the 6 schedules of the 2 histories

H1 = R1 (p)W1 (q) and H2 = W2 (p)W2 (q) : (11)


S1 = R1 (p)W1 (q)W2 (p)W2 (q) S2 = W2 (p)W2 (q)R1 (p)W1 (q)
S3 = R1 (p)W2 (p)W1 (q)W2 (q) S4 = W2 (p)R1 (p)W2 (q)W1 (q)
S5 = R1 (p)W2 (p)W2 (q)W1 (q) S6 = W2 (p)R1 (p)W1 (q)W2 (q).
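The interleaving condition above can be checked mechanically: deleting from a schedule all operations of the other Transactions must give back each history Hi. Here is a minimal sketch in Java; the class, the method, and the string encoding of operations are ours (not part of SimpleDB), and transaction ids are assumed to be single digits, as in the examples above.

```java
import java.util.*;

public class ScheduleCheck {
   // Projects a schedule (a list of operation strings such as "R1(p)" or
   // "W2(q)") onto transaction t, keeping only t's operations in order.
   // The transaction id is the second character of each operation string.
   static List<String> project(List<String> schedule, int t) {
      List<String> h = new ArrayList<>();
      for (String op : schedule)
         if (op.charAt(1) - '0' == t)
            h.add(op);
      return h;
   }

   public static void main(String[] args) {
      // S3 = R1(p) W2(p) W1(q) W2(q) from Eq. (11).
      List<String> s3 = Arrays.asList("R1(p)", "W2(p)", "W1(q)", "W2(q)");
      System.out.println(project(s3, 1));   // recovers H1
      System.out.println(project(s3, 2));   // recovers H2
   }
}
```

Projecting S3 onto transaction 1 yields R1(p)W1(q) = H1 and onto transaction 2 yields W2(p)W2(q) = H2, confirming that S3 is indeed an interleaving of H1 and H2.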

• In this way, a schedule S simplifies what many Transactions 1, 2, 3, . . . , n with his-


tories H1 , H2 , H3 , . . . , Hn do, restricted only to what is interesting for Concurrency
Management.

• This simplification lets us define precisely what we mean by “correct” Concurrency


Management.

• A schedule S is serial if every history Hi appears as a consecutive substring. For
instance the serial schedules for the H1 and H2 in Eq. (11) are

S1 = R1 (p)W1 (q) W2 (p)W2 (q)          S2 = W2 (p)W2 (q) R1 (p)W1 (q)
     \____H1____/ \____H2____/               \____H2____/ \____H1____/

where

S1 = first H1 , then H2          S2 = vice versa.

• That is, in a serial schedule S the RDBMS executes each transaction i entirely from
its beginning to its end before beginning another.

• In other words, a serial schedule S represents the case with no concurrency among
its Transactions.

• Hence serial schedules S are obviously correct. What non-serial schedules S′ are
also correct? These S′ are exactly the concurrent executions of the RDBMS which
the Concurrency Manager can permit.

• By the Isolation property of Transactions, a non-serial schedule S′ is also correct
exactly when S′ is equivalent to some serial schedule S – that is, exactly when
although the RDBMS executes a non-serial schedule S′ it could have executed S
instead.

Conflict Serializability

• The most common concept of this “equivalence” between schedules is conflict equiv-
alence.

• Let us denote that schedules Γ and ∆ are conflict equivalent by Γ ∼ ∆, and define
this relation ‘∼’ with suitable rules.

• One rule is
     ΓRt (p)Ru (q)∆ ∼ ΓRu (q)Rt (p)∆    if t ≠ u    (12)
or “the order in which two adjacent reads by two different Transactions happen
does not matter” because each Transaction t or u reads the same contents from the
past Γ in both sides.

• However, rule (12) does not hold for just one Transaction t = u, because that would
change the history Ht of this Transaction t.

• Another rule is

ΓRt (p)Wu (q)∆ ∼ ΓWu (q)Rt (p)∆    if t ≠ u and p ≠ q    (13)

or “the order of adjacent reads and writes does not matter, if they use different
Buffer s”.

• However, rule (13) certainly does not hold if they use the same Buffer p = q:

Left side says that Transaction t reads the previous contents of this Buffer p before
Transaction u overwrites them.

Right side says that Transaction u overwrites the contents of this Buffer p and
Transaction t reads these new contents.

• We say that this situation is a read-write conflict between these two Transactions t
and u since these two sides disagree on what contents of this Buffer p Transaction t
saw.

• A third rule is

ΓWt (p)Wu (q)∆ ∼ ΓWu (q)Wt (p)∆    if t ≠ u and p ≠ q    (14)

or “the order of two adjacent writes does not matter, if they use different Buffer s”.

• Again, rule (14) does not hold if they use the same Buffer p = q:

Left side says that Transaction u writes the contents for the future ∆.
Right side says that Transaction t writes the contents for the future ∆.

• This situation is a write-write conflict between these two Transactions t and u


since the two sides disagree on what the contents of this Buffer p are in the future ∆.
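Rules (12)–(14) amount to a simple test for whether two adjacent operations may be swapped without changing the meaning of the schedule. The following sketch encodes that test; the class and the operation record are ours, not SimpleDB's.

```java
public class Commute {
   // One schedule operation: kind 'R' or 'W', transaction id, buffer name.
   record Op(char kind, int tx, String buf) {}

   // May the adjacent operations a and b be swapped under rules (12)-(14)?
   static boolean commutes(Op a, Op b) {
      if (a.tx() == b.tx())
         return false;      // swapping would change that transaction's history
      if (!a.buf().equals(b.buf()))
         return true;       // rules (13) and (14): different buffers always commute
      // Same buffer, different transactions: only two reads commute (rule (12));
      // any write makes this a read-write or write-write conflict.
      return a.kind() == 'R' && b.kind() == 'R';
   }
}
```

For instance, R1(p) and R2(p) commute, but R1(p) and W2(p) do not, since the latter pair is exactly the read-write conflict described above.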

• We can extend this ‘∼’ into an equivalence relation with the familiar (?) rules

Γ∼Γ (reflexivity)
Γ ∼ ∆ if and only if ∆ ∼ Γ (symmetry)

and

if Γ ∼ ∆ and ∆ ∼ Σ then also Γ ∼ Σ. (transitivity)

Such equivalence relations capture some notion of “similarity”.

• Here this notion is “these schedules Γ and ∆ perform their I/O operations in the
same order when that matters”.

• We have now reformulated the general but vague problem

“Is this schedule S correct?”


into the simplified but precise problem
“Does there exist a serial schedule S′ such that S ∼ S′ ?”

• For instance S3 ∼ S1 and S4 ∼ S2 if p ≠ q in Eq. (11).

• This reformulation is a formal language acceptance problem as in the course “Basic


Models of Computation” (”Laskennan perusmallit” in Finnish).

• This reformulation permits

designing a Concurrency control algorithm in a general setting which


includes the relevant aspects of the tasks but
excludes irrelevant details
proving that this designed algorithm behaves correctly – by proving that if it
accepts a schedule S, then there is some serial schedule S′ such that S ∼ S′.

• Proving correctness is especially important for algorithms whose testing is difficult
– and testing a Concurrency Manager is difficult!

• We develop an algorithm to this acceptance problem by reformulating it first as a


graph problem for which we have an algorithm.

• The conflict graph of a schedule S is as follows:

– Its nodes are the Transactions of S.
– There is a directed edge t → u from one node t into another node u exactly
  when some earlier operation of Transaction t is in conflict with some later
  operation of Transaction u in S:

       S = . . . Xt (p) . . . Yu (p) . . .    (15)

  where at least one of these operations X and Y is a write.
– Such an edge means that “the equivalence relation ‘∼’ does not allow reordering
  these two operations Xt (p) and Yu (p)”.

• In rules (12)–(14)

conflict graph of left side = conflict graph of right side

because although they turn the adjacent pair around, this does not turn any edges
around, since this pair did not produce any edges.

• When a schedule is serial, its conflict graph is acyclic. For instance, if

     S′ = H1 H2 H3 . . . Hn

then every edge i → j in its conflict graph has i < j.

• On the other hand, if G is any acyclic graph, then we can build a serial schedule S′
which has it as its conflict graph.

  – Each node of G creates one new Transaction into S′.
  – Each edge e = t → u of G creates one new Buffer e into S′.
    It also creates the 2 conflicting operations Wt (e) and Wu (e) for its Buffer e.
    They add the corresponding edge between nodes t and u into the conflict graph
    of S′, but the direction of this edge will be determined by the order of the
    Transactions t and u in S′.
  – These Wt for a node t appear together as one consecutive history Ht in S′
    because we are constructing a serial schedule.
  – Because G is acyclic, we can order its nodes topologically so that its edges
    point from left to right. If S′ lists these histories in the same order, then the
    corresponding edges in its conflict graph will also point from left to right.

• Hence we have translated the problem

     “Does there exist some serial schedule S′ such that S ∼ S′ ?”

into

     “Is the conflict graph of S acyclic?”

for which we have algorithms from the Data Structures II course (”Tietorakenteet II”
in Finnish).
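The whole translation can be sketched in a few dozen lines of Java: build the conflict graph of a schedule from the pairwise conflicts of Eq. (15), then test the graph for cycles with a depth-first search. The class and its operation encoding are ours, not SimpleDB's.

```java
import java.util.*;

public class ConflictGraph {
   // One schedule operation: kind 'R' or 'W', transaction id, buffer name.
   record Op(char kind, int tx, String buf) {}

   // Builds the conflict graph of schedule s and reports whether it is
   // acyclic, i.e. whether s is conflict equivalent to some serial schedule.
   static boolean isConflictSerializable(List<Op> s) {
      Map<Integer, Set<Integer>> edges = new HashMap<>();
      for (int i = 0; i < s.size(); i++)
         for (int j = i + 1; j < s.size(); j++) {
            Op x = s.get(i), y = s.get(j);
            // Edge x.tx -> y.tx for each conflict as in Eq. (15):
            // different transactions, same buffer, at least one write.
            if (x.tx() != y.tx() && x.buf().equals(y.buf())
                  && (x.kind() == 'W' || y.kind() == 'W'))
               edges.computeIfAbsent(x.tx(), k -> new HashSet<>()).add(y.tx());
         }
      // DFS cycle detection: 0 = unvisited, 1 = on the stack, 2 = finished.
      Map<Integer, Integer> state = new HashMap<>();
      for (Op op : s)
         if (state.getOrDefault(op.tx(), 0) == 0 && hasCycle(op.tx(), edges, state))
            return false;
      return true;
   }

   private static boolean hasCycle(int t, Map<Integer, Set<Integer>> edges,
                                   Map<Integer, Integer> state) {
      state.put(t, 1);
      for (int u : edges.getOrDefault(t, Collections.emptySet())) {
         int st = state.getOrDefault(u, 0);
         if (st == 1 || (st == 0 && hasCycle(u, edges, state)))
            return true;
      }
      state.put(t, 2);
      return false;
   }
}
```

Applied to Eq. (11), S3 = R1(p)W2(p)W1(q)W2(q) produces only the edge 1 → 2 and is accepted, whereas S6 = W2(p)R1(p)W1(q)W2(q) produces both 2 → 1 and 1 → 2, a cycle, and is rejected.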

• One way to summarize this 2-step translation is:

     An edge t → u in the conflict graph represents the decision that
     whenever two concurrent operations Xt (p) and Yu (p) of Transac-
     tions t and u conflict, this conflict is always resolved by executing
     Xt (p) first.                                                  (16)
     A path t → t′ → t″ → · · · represents the combined decision for
     many Transactions t, t′ , t″ , . . . which follows from these individual
     decisions by transitivity.

• Hence a correct Concurrency Manager can operate by building a serial schedule


incrementally as follows:

– This Manager maintains the conflict graph G of all Transactions.


(Fortunately this G needs only the currently running and recently terminated
Transactions, not all of them.)
– When a Transaction t wants to read or write a particular Buffer p, then this
Manager checks if adding the corresponding Rt (p) or Wt (p) as the next element
of the schedule would make a cycle into this G.
– If it does not, then this Manager permits the operation.
– But if it does, this Manager aborts Transactions to make G acyclic.
∗ It can do so by aborting this Transaction t.
∗ However, it can do so by choosing to abort some other Transactions than t
instead.
∗ Note how this flexibility arises from considering the problem in this more
abstract setting.
In practice this aborting can be implemented like the BufferAbortExceptions in
section 4.3.

Two-Phase Locking (Weikum and Vossen, 2001, Chapters 4.3.1–4.3.4) (Sciore, 2008,
Chapters 8.2.2–8.2.3 and 14)
• One common way to implement this summary (16) is by attaching Lock s on disk
Block s.

– Each Transaction has its own padlocks (”riippulukko” in Finnish).


– When a Transaction t wants to read or write a disk Block b, it attaches a
padlock into b.
– When other Transactions want to read or write the same disk Block b, they
see this padlock, and know that Transaction t is already using it.

• A Transaction t attaches a

shared lock (slock) if t only wants to read (but not write) Block b
exclusive lock (xlock) if t wants to (read and) write Block b.

These 2 basic kinds of locks are enough for correct Concurrency Management.

• An RDBMS can also have more kinds of locks to make its Concurrency Management
more flexible, but we concentrate only on these 2 basic kinds.

• A Block b can have

either many slocks


or just one xlock at the same time, but not both.

In other words,

many Transactions can read the same Block b at the same time, but if
one Transaction wants to write b, then it must be the only Transaction using b
at that time.
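These compatibility rules can be stated with the lock-value encoding that SimpleDB's LockTable (shown later in this section) also uses: a negative value means an xlock, a positive value n means n slocks, and 0 (or an absent entry) means unlocked. The following is a simplified sketch of ours; it deliberately ignores SimpleDB's lock-upgrade convention, in which a Transaction requesting an xlock already holds one slock on the same Block.

```java
public class LockCompat {
   // Lock compatibility on one Block:
   // lockVal < 0 is an xlock, lockVal n > 0 is n slocks, 0 is unlocked.

   // A new slock is compatible as long as there is no xlock.
   static boolean slockOk(int lockVal) {
      return lockVal >= 0;
   }

   // A new xlock is compatible only when there are no locks at all.
   // (SimpleDB's own xLock tests lockVal > 1 instead, because the
   // requesting Transaction already holds one slock when it upgrades.)
   static boolean xlockOk(int lockVal) {
      return lockVal == 0;
   }
}
```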

• Figure 46 shows an example.

• The conflict graph becomes the waits-for graph telling which Transactions are
now waiting for which other Transactions to unlock the Lock s for the Block they
need.
For instance, if a Transaction t holds an xlock on a Buffer b, another Transaction u
which needs b must wait, because its Ru (b) or Wu (b) operation conflicts with the Wt (b)
operation for which Transaction t attached its xlock on b.

• It turns out that the Concurrency Manager must coordinate also which way one
Transaction uses its own Lock s.

– This information is not in the waits-for graph.


– Consider for instance a Transaction t executing
. . . slock(x);R(x);unlock(x);slock(y);R(y). . .
and suppose that another Transaction u executes
xlock(x);xlock(y);W(x);W(y);unlock(x);unlock(y);commit
between its unlock and slock operations. This is not serializable even though
both lock Buffer s before using them.

Requirement 12 (two-phase locking (2PL)). After a Transaction has performed its first
unlock operation, it cannot perform any more locking operations.

• That is, the “life” of a Transaction consists of 2 phases:

¬ It sets the right Lock for each Block it needs, and processes their contents.
­ It starts unlocking them only when it is certain that it will not need any more
Block s to process.

• 2PL guarantees serializability.

• However, using full 2PL may cause cascading rollbacks.

– These are situations where aborting one Transaction causes aborting others
too:

Figure 46: Locking example. (Sciore, 2008)

Figure 47: Locking and unlocking rules. (Sciore, 2008)

1. One Transaction t writes and unlocks a Block z.
2. Another Transaction u slocks z and starts reading its contents.
3. Transaction t aborts – so Transaction u must be aborted too, because it
   has started reading a version of z which must not exist.
– These cascading rollbacks are possible but tedious in an RDBMS.
– Hence RDBMSs often restrict 2PL further to avoid them.

• 2 such restrictions of full 2PL are

Strict 2PL (S2PL) where a Transaction keeps all its xlocks until it terminates.
– In the scenario above, t would not unlock z, and so avoids this write-read
conflict with u.
– S2PL avoids also write-write conflicts between running Transactions.
Strong 2PL (SS2PL) where a Transaction keeps all its Lock s until it terminates.
– SS2PL avoids all conflicts between running Transactions – including also
read-write.
– SS2PL is also commit order preserving (unlike S2PL):
The order in which Transactions commit is also their serial schedule.
– SimpleDB uses SS2PL.
– SS2PL is given as Figure 47.

Isolation Levels and Locking (Sciore, 2008, Chapters 8.2.2–8.2.3 and 14.4.7)

• The Lock Usage column of Figure 12 explains the connections between transaction
isolation levels and a Lock ing implementation.

• These levels relax the rules how Transactions can use slocks for reading from
Block s – when they compute results for queries.

• In contrast, the RDBMS must not relax the rules how Transactions can use xlocks
when they write into Block s – otherwise they might corrupt the database!

• Phantoms were new rows which appeared into the database during the current
Transaction t:

– First it does not see them there, later it does.

– They can appear, because t cannot slock a new Block n before another
Transaction u has appended it into the database – and by then u may have
added new phantom rows into n.

• The RDBMS implementation can avoid these phantoms by saying that the end-of-
file (eof) marker is another “Block ” which must be Lock ed too. Then

write(eof) means changing it – by appending another Block into this file


read(eof) means reading it – by getting the file size in Block s
slock(eof) means that this Transaction reads it – as above
xlock(eof) means that this Transaction is going to write it – as above.
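As a concrete hint of this trick, the Transaction class shown later in this chapter defines the constant END_OF_FILE = -1, and a pseudo-Block with that block number can stand for the eof marker. The following self-contained sketch is hypothetical (it does not use SimpleDB's actual Block class):

```java
public class EofLock {
   // The same convention as the END_OF_FILE constant in
   // simpledb/tx/Transaction.java: block number -1 denotes the eof marker.
   static final int END_OF_FILE = -1;

   // A lockable unit: a file name plus a block number.
   record Blk(String file, int num) {}

   // The eof pseudo-"Block" of the given file, which a Transaction can
   // slock (to read the file size) or xlock (to append a new Block).
   static Blk eofBlock(String filename) {
      return new Blk(filename, END_OF_FILE);
   }
}
```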

• “Releasing slocks early” means unlocking old slocks before locking new slocks
– violating 2PL.

¶ Transaction t reads the contents of Block b and unlocks it immediately after-


wards.
· Another Transaction u can then write b following Figure 47.
¸ Transaction t slocks and reads b again later – and sees the changes made
by u.

If this Transaction u aborts instead of committing, then Transaction t does not


see its changes – hence the name

read committed if slocks are released early, and


repeatable read if they are held until the Transaction ends.

• But if a Transaction does not use slocks at all, then there are no guarantees on
what it sees – hence read uncommitted.
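On the JDBC side, these levels correspond to the standard constants of java.sql.Connection, which a client passes to Connection.setTransactionIsolation. The following small sketch lists them from weakest to strongest, matched to the slock usage described above.

```java
import java.sql.Connection;

public class IsolationLevels {
   // The four standard JDBC isolation levels, weakest first.
   static int[] weakestToStrongest() {
      return new int[] {
         Connection.TRANSACTION_READ_UNCOMMITTED, // no slocks at all
         Connection.TRANSACTION_READ_COMMITTED,   // slocks released early
         Connection.TRANSACTION_REPEATABLE_READ,  // slocks held to the end
         Connection.TRANSACTION_SERIALIZABLE      // also locks the eof "Block"
      };
   }
}
```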

Deadlock Handling

• The Concurrency Manager must abort Transactions if they would be deadlocked


waiting for each others’ Lock s.

• When we encountered the same problem in the Buffer Manager, the SimpleDB
solution was brutally straightforward:

– If a Transaction had been waiting for any Buffer to become unpinned for
10 seconds, it was assumed to be deadlocked and was aborted.
– This solution was appropriate there, because the DBA can make these aborts
less frequent simply by adding more RAM to the Buffer pool of the RDBMS
server process.

• In contrast, here Transactions are waiting for Lock s on specific disk Block s.

– This waiting depends on the queries and the data – it cannot be alleviated
by tuning the system parameters, because the bottleneck is the actual Block s
themselves.
– Hence this Concurrency Manager should spend more effort in choosing which
Transactions it aborts than the Buffer Manager did.

– This effort can be based on the waits-for graph.
– However, maintaining this waits-for graph is somewhat costly in terms of both
RAM and time.
– Therefore simpler ways which do not need this graph are often preferred.
– For instance, we may base them on Transaction timestamps which indicate
when they started instead of the waits-for graph:
If we always prefer the older Transaction, then this graph would clearly be
acyclic.

• The 2 main ways based on Transaction starting times are

Wait-Die: If this Transaction u requests a Lock which conflicts with another Lock
     already held by another Transaction t, then. . .

          if u started before t
             then u waits for t to release its Lock
             else abort u.

     That is, u either waits or dies by suicide.
Wound-Wait: We can also do the opposite to Wait-Die with. . .

          if u started before t
             then abort t so that u may get its Lock
             else u waits for t to release its Lock .

     That is, u either murders t or waits.

Both avoid aborting the Transaction which has been running longer, because that
would mean losing all the work which it has done so far.
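The two decisions above can be sketched as pure functions of the Transactions' starting timestamps (a smaller timestamp means an older Transaction). This is our illustration, not SimpleDB code:

```java
public class DeadlockPolicy {
   enum Action { WAIT, ABORT_REQUESTER, ABORT_HOLDER }

   // Wait-Die: an older requester u waits for the holder t;
   // a younger requester dies (aborts itself).
   static Action waitDie(long requesterStart, long holderStart) {
      return requesterStart < holderStart ? Action.WAIT : Action.ABORT_REQUESTER;
   }

   // Wound-Wait: an older requester u wounds (aborts) the holder t;
   // a younger requester waits.
   static Action woundWait(long requesterStart, long holderStart) {
      return requesterStart < holderStart ? Action.ABORT_HOLDER : Action.WAIT;
   }
}
```

In both policies the older Transaction never aborts, so every Transaction eventually becomes the oldest one still running and can finish; this rules out both deadlock and indefinite restarting.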

• The suicide of the currently running Transaction u is somewhat simpler to imple-


ment than murdering another Transaction t, because u can unlock all its Block s,
and unpin all its Buffer s, and . . . by itself.

• On the other hand, if the waits-for graph is used instead, then the chosen Transactions
will have to be murdered.

• Lock waiting should also be fair, so that no Transaction will wait for a Block
indefinitely, because it is always given to other waiting Transactions instead.

SimpleDB source file simpledb/tx/concurrency/LockTable.java


• Here is the SimpleDB Lock ing implementation.

• Despite its shortcomings, SimpleDB uses the same 10-second waiting approach with
a single Java lock to break deadlocks in both Buffer s and Lock ing. It is shown in
Figure 48.
Figure 48: The time-limit strategy. (Sciore, 2008)

package simpledb.tx.concurrency;

import simpledb.file.Block;
import java.util.*;

/**
 * The lock table, which provides methods to lock and unlock blocks.
 * If a transaction requests a lock that causes a conflict with an
 * existing lock, then that transaction is placed on a wait list.
 * There is only one wait list for all blocks.
 * When the last lock on a block is unlocked, then all transactions
 * are removed from the wait list and rescheduled.
 * If one of those transactions discovers that the lock it is waiting for
 * is still locked, it will place itself back on the wait list.
 * @author Edward Sciore
 */
class LockTable {
   private static final long MAX_TIME = 10000; // 10 seconds

   private Map<Block,Integer> locks = new HashMap<Block,Integer>();

   /**
    * Grants an SLock on the specified block.
    * If an XLock exists when the method is called,
    * then the calling thread will be placed on a wait list
    * until the lock is released.
    * If the thread remains on the wait list for a certain
    * amount of time (currently 10 seconds),
    * then an exception is thrown.
    * @param blk a reference to the disk block
    */
   public synchronized void sLock(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         while (hasXlock(blk) && !waitingTooLong(timestamp))
            wait(MAX_TIME);
         if (hasXlock(blk))
            throw new LockAbortException();
         int val = getLockVal(blk);  // will not be negative
         locks.put(blk, val+1);
      }
      catch (InterruptedException e) {
         throw new LockAbortException();
      }
   }

   /**
    * Grants an XLock on the specified block.
    * If a lock of any type exists when the method is called,
    * then the calling thread will be placed on a wait list
    * until the locks are released.
    * If the thread remains on the wait list for a certain
    * amount of time (currently 10 seconds),
    * then an exception is thrown.
    * @param blk a reference to the disk block
    */
   synchronized void xLock(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         while (hasOtherSLocks(blk) && !waitingTooLong(timestamp))
            wait(MAX_TIME);
         if (hasOtherSLocks(blk))
            throw new LockAbortException();
         locks.put(blk, -1);
      }
      catch (InterruptedException e) {
         throw new LockAbortException();
      }
   }

   /**
    * Releases a lock on the specified block.
    * If this lock is the last lock on that block,
    * then the waiting transactions are notified.
    * @param blk a reference to the disk block
    */
   synchronized void unlock(Block blk) {
      int val = getLockVal(blk);
      if (val > 1)
         locks.put(blk, val-1);
      else {
         locks.remove(blk);
         notifyAll();
      }
   }

   private boolean hasXlock(Block blk) {
      return getLockVal(blk) < 0;
   }

   private boolean hasOtherSLocks(Block blk) {
      return getLockVal(blk) > 1;
   }

   private boolean waitingTooLong(long starttime) {
      return System.currentTimeMillis() - starttime > MAX_TIME;
   }

   private int getLockVal(Block blk) {
      Integer ival = locks.get(blk);
      return (ival == null) ? 0 : ival.intValue();
   }
}

SimpleDB source file simpledb/tx/concurrency/LockAbortException.java


package simpledb.tx.concurrency;

/**
 * A runtime exception indicating that the transaction
 * needs to abort because a lock could not be obtained.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class LockAbortException extends RuntimeException {
   public LockAbortException() {
   }
}

SimpleDB source file simpledb/tx/Transaction.java

• The implementation for the Concurrency Manager maintains two kinds of
  information:

  Global information shared by all Transactions (in ConcurrencyMgr):

  – the LockTable, which tells what kind of a Lock a given Block has, and. . .
  – in case of SLocks, how many of them it has.

  Local information for each Transaction:

  – What kind of a Lock it has for a given Block, if any (in ConcurrencyMgr).
  – The list of Buffers which it has pinned (in BufferList):

package simpledb.tx;

import simpledb.server.SimpleDB;
import simpledb.file.Block;
import simpledb.buffer.*;
import simpledb.tx.recovery.RecoveryMgr;
import simpledb.tx.concurrency.ConcurrencyMgr;

/**
 * Provides transaction management for clients,
 * ensuring that all transactions are serializable, recoverable,
 * and in general satisfy the ACID properties.
 * @author Edward Sciore
 */
public class Transaction {
   private static int nextTxNum = 0;
   private static final int END_OF_FILE = -1;
   private RecoveryMgr recoveryMgr;
   private ConcurrencyMgr concurMgr;
   private int txnum;
   private BufferList myBuffers = new BufferList();

   /**
    * Creates a new transaction and its associated
    * recovery and concurrency managers.
    * This constructor depends on the file, log, and buffer
    * managers that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until either
    * {@link simpledb.server.SimpleDB#init(String)} or
    * {@link simpledb.server.SimpleDB#initFileLogAndBufferMgr(String)}
    * is called first.
    */
   public Transaction() {
      txnum = nextTxNumber();
      recoveryMgr = new RecoveryMgr(txnum);
      concurMgr = new ConcurrencyMgr();
   }

   /**
    * Commits the current transaction.
    * Flushes all modified buffers (and their log records),
    * writes and flushes a commit record to the log,
    * releases all locks, and unpins any pinned buffers.
    */
   public void commit() {
      recoveryMgr.commit();
      concurMgr.release();
      myBuffers.unpinAll();
      System.out.println("transaction " + txnum + " committed");
   }

   /**
    * Rolls back the current transaction.
    * Undoes any modified values,
    * flushes those buffers,
    * writes and flushes a rollback record to the log,
    * releases all locks, and unpins any pinned buffers.
    */
   public void rollback() {
      recoveryMgr.rollback();
      concurMgr.release();
      myBuffers.unpinAll();
      System.out.println("transaction " + txnum + " rolled back");
   }

   /**
    * Flushes all modified buffers.
    * Then goes through the log, rolling back all
    * uncommitted transactions. Finally,
    * writes a quiescent checkpoint record to the log.
    * This method is called only during system startup,
    * before user transactions begin.
    */
   public void recover() {
      SimpleDB.bufferMgr().flushAll(txnum);
      recoveryMgr.recover();
   }

   /**
    * Pins the specified block.
    * The transaction manages the buffer for the client.
    * @param blk a reference to the disk block
    */
   public void pin(Block blk) {
      myBuffers.pin(blk);
   }

   /**
    * Unpins the specified block.
    * The transaction looks up the buffer pinned to this block,
    * and unpins it.
    * @param blk a reference to the disk block
    */
   public void unpin(Block blk) {
      myBuffers.unpin(blk);
   }

   /**
    * Returns the integer value stored at the
    * specified offset of the specified block.
    * The method first obtains an SLock on the block,
    * then it calls the buffer to retrieve the value.
    * @param blk a reference to a disk block
    * @param offset the byte offset within the block
    * @return the integer stored at that offset
    */
   public int getInt(Block blk, int offset) {
      concurMgr.sLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      return buff.getInt(offset);
   }

   /**
    * Returns the string value stored at the
    * specified offset of the specified block.
    * The method first obtains an SLock on the block,
    * then it calls the buffer to retrieve the value.
    * @param blk a reference to a disk block
    * @param offset the byte offset within the block
    * @return the string stored at that offset
    */
   public String getString(Block blk, int offset) {
      concurMgr.sLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      return buff.getString(offset);
   }

   /**
    * Stores an integer at the specified offset
    * of the specified block.
    * The method first obtains an XLock on the block.
    * It then reads the current value at that offset,
    * puts it into an update log record, and
    * writes that record to the log.
    * Finally, it calls the buffer to store the value,
    * passing in the LSN of the log record and the transaction's id.
    * @param blk a reference to the disk block
    * @param offset a byte offset within that block
    * @param val the value to be stored
    */
   public void setInt(Block blk, int offset, int val) {
      concurMgr.xLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      int lsn = recoveryMgr.setInt(buff, offset, val);
      buff.setInt(offset, val, txnum, lsn);
   }

   /**
    * Stores a string at the specified offset
    * of the specified block.
    * The method first obtains an XLock on the block.
    * It then reads the current value at that offset,
    * puts it into an update log record, and
    * writes that record to the log.
    * Finally, it calls the buffer to store the value,
    * passing in the LSN of the log record and the transaction's id.
    * @param blk a reference to the disk block
    * @param offset a byte offset within that block
    * @param val the value to be stored
    */
   public void setString(Block blk, int offset, String val) {
      concurMgr.xLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      int lsn = recoveryMgr.setString(buff, offset, val);
      buff.setString(offset, val, txnum, lsn);
   }

   /**
    * Returns the number of blocks in the specified file.
    * This method first obtains an SLock on the
    * "end of the file", before asking the file manager
    * to return the file size.
    * @param filename the name of the file
    * @return the number of blocks in the file
    */
   public int size(String filename) {
      Block dummyblk = new Block(filename, END_OF_FILE);
      concurMgr.sLock(dummyblk);
      return SimpleDB.fileMgr().size(filename);
   }

   /**
    * Appends a new block to the end of the specified file
    * and returns a reference to it.
    * This method first obtains an XLock on the
    * "end of the file", before performing the append.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the new page
    * @return a reference to the newly-created disk block
    */
   public Block append(String filename, PageFormatter fmtr) {
      Block dummyblk = new Block(filename, END_OF_FILE);
      concurMgr.xLock(dummyblk);
      Block blk = myBuffers.pinNew(filename, fmtr);
      unpin(blk);
      return blk;
   }

   private static synchronized int nextTxNumber() {
      nextTxNum++;
      System.out.println("new transaction: " + nextTxNum);
      return nextTxNum;
   }
}

SimpleDB source file simpledb/tx/concurrency/ConcurrencyMgr.java


package simpledb.tx.concurrency;

import simpledb.file.Block;
import java.util.*;

/**
 * The concurrency manager for the transaction.
 * Each transaction has its own concurrency manager.
 * The concurrency manager keeps track of which locks the
 * transaction currently has, and interacts with the
 * global lock table as needed.
 * @author Edward Sciore
 */
public class ConcurrencyMgr {

   /**
    * The global lock table. This variable is static because all transactions
    * share the same table.
    */
   private static LockTable locktbl = new LockTable();
   private Map<Block,String> locks = new HashMap<Block,String>();

   /**
    * Obtains an SLock on the block, if necessary.
    * The method will ask the lock table for an SLock
    * if the transaction currently has no locks on that block.
    * @param blk a reference to the disk block
    */
   public void sLock(Block blk) {
      if (locks.get(blk) == null) {
         locktbl.sLock(blk);
         locks.put(blk, "S");
      }
   }

   /**
    * Obtains an XLock on the block, if necessary.
    * If the transaction does not have an XLock on that block,
    * then the method first gets an SLock on that block
    * (if necessary), and then upgrades it to an XLock.
    * @param blk a reference to the disk block
    */
   public void xLock(Block blk) {
      if (!hasXLock(blk)) {
         sLock(blk);
         locktbl.xLock(blk);
         locks.put(blk, "X");
      }
   }

   /**
    * Releases all locks by asking the lock table to
    * unlock each one.
    */
   public void release() {
      for (Block blk : locks.keySet())
         locktbl.unlock(blk);
      locks.clear();
   }

   private boolean hasXLock(Block blk) {
      String locktype = locks.get(blk);
      return locktype != null && locktype.equals("X");
   }
}

SimpleDB source file simpledb/tx/BufferList.java


package simpledb.tx;

import simpledb.server.SimpleDB;
import simpledb.file.Block;
import simpledb.buffer.*;
import java.util.*;

/**
 * Manages the transaction's currently-pinned buffers.
 * @author Edward Sciore
 */
class BufferList {
   private Map<Block,Buffer> buffers = new HashMap<Block,Buffer>();
   private List<Block> pins = new ArrayList<Block>();
   private BufferMgr bufferMgr = SimpleDB.bufferMgr();

   /**
    * Returns the buffer pinned to the specified block.
    * The method returns null if the transaction has not
    * pinned the block.
    * @param blk a reference to the disk block
    * @return the buffer pinned to that block
    */
   Buffer getBuffer(Block blk) {
      return buffers.get(blk);
   }

   /**
    * Pins the block and keeps track of the buffer internally.
    * @param blk a reference to the disk block
    */
   void pin(Block blk) {
      Buffer buff = bufferMgr.pin(blk);
      buffers.put(blk, buff);
      pins.add(blk);
   }

   /**
    * Appends a new block to the specified file
    * and pins it.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the new page
    * @return a reference to the newly-created block
    */
   Block pinNew(String filename, PageFormatter fmtr) {
      Buffer buff = bufferMgr.pinNew(filename, fmtr);
      Block blk = buff.block();
      buffers.put(blk, buff);
      pins.add(blk);
      return blk;
   }

   /**
    * Unpins the specified block.
    * @param blk a reference to the disk block
    */
   void unpin(Block blk) {
      Buffer buff = buffers.get(blk);
      bufferMgr.unpin(buff);
      pins.remove(blk);
      if (!pins.contains(blk))
         buffers.remove(blk);
   }

   /**
    * Unpins any buffers still pinned by this transaction.
    */
   void unpinAll() {
      for (Block blk : pins) {
         Buffer buff = buffers.get(blk);
         bufferMgr.unpin(buff);
      }
      buffers.clear();
      pins.clear();
   }
}

Multiversion Locking (Sciore, 2008, Chapter 14.4.6) (Weikum and Vossen, 2001,
Chapter 5)

• Many Transactions are read-only.

  – For instance, every SQL SELECT. . . FROM. . . WHERE. . . query in the
    (default) AutoCommit mode (outside INSERTions) only reads from the
    database but does not write into it.
  – A JDBC client can hint to the connected RDBMS server that it will be
    read-only by calling

      conn.setReadOnly(true);

• The RDBMS Concurrency Management can speed up these read-only Transactions
  with an alternative protocol for them.

  – This is not the same thing as relaxing the Transaction isolation level and
    accepting the risk of wrong answers.
  – Instead, this alternative protocol is still correct but faster than
    general Locking, because it knows that the Transaction is read-only.

• One such specialized protocol for read-only Transactions is Multiversion
  Locking.

• Logically its idea is as follows:

  1. Each read-write Transaction follows the normal Locking protocol, to
     ensure that each version of a Block has only one writer and timestamp.
  2. When a read-write Transaction t writes a Block b and commits, this
     creates a new version of b with the current timestamp.
  3. The RDBMS maintains many versions of the same Block b with different
     timestamps:
       (a) one version timestamped with the commit time of t,
       (b) another with the commit time of t′,
       (c) a third with the commit time of t″, . . .
     for its already committed writers t, t′, t″, . . .
  4. When a read-only Transaction u requests a Block b, the RDBMS gives it the
     version with the largest timestamp < the starting time of u – the version
     of b which was newest when u started.
  5. This read-only Transaction u can then read its own version of Block b
     without Locking: that version was made by the most recently committed
     writer of b before u started, so it no longer has any writer still
     running.
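The version lookup in step 4 – giving u the newest version committed strictly before it started – can be sketched in Java with a map sorted by commit timestamp. This sketch is illustrative only; the class and method names are made up, and the "contents" of a version are simplified to a string:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: the committed versions of one block, ordered by
// commit timestamp. A read-only transaction is served the newest version
// committed strictly before its own starting time.
class VersionedBlock {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // A read-write transaction installs a new version at its commit time.
    void commitVersion(long commitTime, String contents) {
        versions.put(commitTime, contents);
    }

    // The version with the largest timestamp < startTime,
    // or null if no version had been committed before startTime.
    String versionFor(long startTime) {
        Map.Entry<Long, String> entry = versions.lowerEntry(startTime);
        return (entry == null) ? null : entry.getValue();
    }
}
```

TreeMap.lowerEntry implements the "largest timestamp strictly below" rule directly, so a writer committing exactly at u's start time is not visible to u.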

Figure 49: Multiversioning example. (Sciore, 2008)

Figure 49 gives an example.

• Physically the RDBMS does not have to maintain each version in step 3
  explicitly as separate disk Blocks. It can do so, but then it must also
  garbage collect the Blocks for those versions which are no longer needed.

• Instead, the RDBMS can reconstruct the correct version of the Block b
  requested by the read-only Transaction u in step 4.

  – Recall that recovery reconstructs all the Buffers written by those
    Transactions which committed before the shutdown.
  – Here we reconstruct the Buffer f for this particular Block b written by
    those Transactions which committed before u started.

• Hence the RDBMS can reconstruct this f as follows:

  1. Allocate a new RAM Buffer f, which u pins, and initialize it to have the
     current contents of the requested Block b. This f does not have to be
     pinned to any Block, because it will not be saved to disk.
  2. Execute this variant of Figure 39:
     – Construct the list of Transactions which either have committed after
       the beginning of u or are still executing, and
     – undo into f what they have done to b.
  3. Now f is the version of Block b which u requested, so u can read f
     without locking in step 5, and unpin f afterwards.

• This timestamping approach can be generalized from read-only to all
  Transactions.

  – This needs no Locks at all, because the RDBMS can compare timestamps
    instead.
  – Locking overhead disappears – but version management overhead appears.
  – This approach is called multiversion timestamp ordering (MVTO). We do not
    discuss it further here.

Software Transactional Memory

• Transactions are a central concept in concurrent and/or fault-tolerant
  computing.

• DBMSs are the most common but not the only example of such programming.

• Software Transactional Memory (STM) is one approach for bringing
  Transactions into programming languages.

• Concurrent programs have

  – many OS threads executing within the same OS process, and
  – shared RAM memory which these OS threads access at the same time.

• In this STM approach, when an OS thread wants to access this shared RAM, it

  1. begins a new Transaction,
  2. accesses the shared RAM during this Transaction, and
  3. after these modifications, either
     commits, which makes its modifications visible to other OS threads, or
     aborts, and its modifications are ignored.

• The STM design philosophy encourages programming the threads to use many
  brief Transactions which focus on just modifying the shared RAM.

• The STM implementation coordinates access to shared RAM by coordinating
  these Transactions.

• These Transactions satisfy the Atomicity and Isolation properties, but not
  Durability – they coordinate using shared RAM, not a disk database.

• Many programming languages have (more or less supported. . . ) STM libraries.

  – However, there are things which the thread should not do inside a
    Transaction.
  – Especially it should not perform I/O actions – how would you “undo” them
    if the Transaction aborts later?
  – An external library cannot enforce this – its documentation can only ask
    programmers to follow these rules. . .

• If STM is integrated into the programming language implementation, then it
  can enforce these rules.

  – For instance, the Glasgow Haskell Compiler (GHC, www.haskell.org) has
    built-in STM support.
  – The type of I/O actions is not compatible with the type of STM code – so a
    Haskell program which tries to call an I/O action inside STM code will not
    compile.

• STM can even be included into the programming language specification.

  – For instance, Clojure (www.clojure.org) specifies STM to be a part of the
    language, so all implementations (there is only one. . . ) must include it.
  – Clojure is a LISP dialect, so it enforces these rules at run time.

• STM offers an elegant solution to composing concurrent programs:

  – If 2 individually correct Lock-based programs P and Q are composed
    sequentially into P ; Q, the result may no longer be correct – because P
    leaves its Locks in a state which Q cannot handle.
  – But if P and Q are STM code, then P ; Q is executed as one Transaction and
    this problem does not arise.

• The GHC STM offers the following additional Transaction programming
  primitives (Peyton Jones, 2007):

  – A third way to end a Transaction:
    ∗ The retry function aborts the current Transaction and begins it again
      later – because it might commit then.
    ∗ In most cases, the programmer wants his/her Transaction to commit
      eventually, and this is easily expressed with retrying.
  – A choice to control this retrying: P orElse Q
    1. first tries executing the STM code P, but if it would end in retrying,
    2. then executes the STM code Q instead.
    The programmer can then define more elaborate Transaction control
    strategies on top of orElse.

• The GHC STM implementation uses the following optimistic concurrency
  control strategy:

  1. When a Transaction begins, it creates its own private, initially empty
     Log.
  2. When the Transaction wants to read a variable x, it
     (a) first checks if it already has x in its own private Log, and
     (b) otherwise reads x from the shared RAM into its own private Log for
         later use.
  3. When the Transaction wants to write a variable y, it writes it into its
     own private Log.
  4. When the Transaction wants to commit, it compares its own private Log
     against the shared RAM.
     – If the shared RAM still has the same old Logged values, this
       Transaction writes its new Logged values into the shared RAM and
       commits.
     – Otherwise some other Transaction has committed and modified the
       original values of the Logged variables in shared RAM while this
       Transaction was running, so it must retry instead, because it has been
       using their outdated values from its own private Log.
     – This retrying just discards this private Log and goes back to step 1.

• This optimistic strategy is lightweight if retrying is. In

  STM it is, because retrying uses only RAM – STM does not have to satisfy
      the Transaction Durability property.
  a DBMS it is not, because the DBMS must use the disk to satisfy it.
      However, it can still be efficient if Log comparisons usually succeed.
      Oracle uses a variant of optimism by default.
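The validate-and-commit step of this optimistic strategy can be modelled in Java. This is a deliberately simplified, hypothetical sketch – all names are made up, and GHC's actual implementation validates per variable rather than under one global lock:

```java
import java.util.HashMap;
import java.util.Map;

// A shared transactional variable: its value plus a version counter
// that is bumped on every committed write.
class StmVar {
    long version = 0;
    int value;
    StmVar(int v) { value = v; }
}

// A transaction with a private read log (versions seen) and write log
// (values to publish). Commit validates the read log, then applies writes.
class StmTx {
    private static final Object commitLock = new Object();
    private final Map<StmVar, Long> readLog = new HashMap<>();
    private final Map<StmVar, Integer> writeLog = new HashMap<>();

    int read(StmVar v) {
        if (writeLog.containsKey(v))        // read your own pending write
            return writeLog.get(v);
        readLog.putIfAbsent(v, v.version);  // remember the version seen
        return v.value;
    }

    void write(StmVar v, int newValue) {
        writeLog.put(v, newValue);
    }

    // Returns true on commit; false means the caller must retry
    // with a fresh transaction (discarding this private log).
    boolean tryCommit() {
        synchronized (commitLock) {
            for (Map.Entry<StmVar, Long> e : readLog.entrySet())
                if (e.getKey().version != e.getValue())
                    return false;           // someone committed under us
            for (Map.Entry<StmVar, Integer> e : writeLog.entrySet()) {
                e.getKey().value = e.getValue();
                e.getKey().version++;       // make the new version visible
            }
            return true;
        }
    }
}
```

A transaction whose read set is untouched commits; one that raced with a committed writer gets false back and must be re-run, exactly the "discard the private Log and go back to step 1" behaviour described above.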
• Just for fun, here is the GHC STM solution to the Santa Claus Problem from
  concurrent programming literature:

    Santa repeatedly sleeps until woken up by either all of his 9 reindeer or
    3 of his 10 elves. If by the reindeer, he harnesses them to his sled,
    delivers toys, and unharnesses them. If by elves, he shows them into his
    office, talks with them, and shows them out. If by both, he deals with
    the reindeer first.

  It is not part of this course! If you get interested, Peyton Jones (2007)
  derives this solution step by step.

-- {-# OPTIONS -package stm #-}

module Main where

import Control.Concurrent.STM
import Control.Concurrent
import System.Random

main = do { elf_gp  <- newGroup 3
          ; sequence [ elf elf_gp n | n <- [1..10] ]
          ; rein_gp <- newGroup 9
          ; sequence [ reindeer rein_gp n | n <- [1..9] ]
          ; forever (santa elf_gp rein_gp) }
  where
    elf      gp id = forkIO (forever (do { elf1 gp id; randomDelay }))
    reindeer gp id = forkIO (forever (do { reindeer1 gp id; randomDelay }))

santa :: Group -> Group -> IO ()
santa elf_group rein_group
  = do { putStr "----------\n"
       ; choose [ (awaitGroup rein_group, run "deliver toys"),
                  (awaitGroup elf_group,  run "meet in my study") ] }
  where
    run :: String -> (Gate, Gate) -> IO ()
    run task (in_gate, out_gate)
      = do { putStr ("Ho! Ho! Ho! let's " ++ task ++ "\n")
           ; operateGate in_gate
           ; operateGate out_gate }

helper1 :: Group -> IO () -> IO ()
helper1 group do_task
  = do { (in_gate, out_gate) <- joinGroup group
       ; passGate in_gate
       ; do_task
       ; passGate out_gate }

elf1, reindeer1 :: Group -> Int -> IO ()
elf1      group id = helper1 group (meetInStudy id)
reindeer1 group id = helper1 group (deliverToys id)

meetInStudy id = putStr ("Elf " ++ show id ++ " meeting in the study\n")

deliverToys id = putStr ("Reindeer " ++ show id ++ " delivering toys\n")

---------------
data Group = MkGroup Int (TVar (Int, Gate, Gate))

newGroup :: Int -> IO Group
newGroup n = atomically (do { g1 <- newGate n
                            ; g2 <- newGate n
                            ; tv <- newTVar (n, g1, g2)
                            ; return (MkGroup n tv) })

joinGroup :: Group -> IO (Gate, Gate)
joinGroup (MkGroup n tv)
  = atomically (do { (n_left, g1, g2) <- readTVar tv
                   ; check (n_left > 0)
                   ; writeTVar tv (n_left-1, g1, g2)
                   ; return (g1, g2) })

awaitGroup :: Group -> STM (Gate, Gate)
awaitGroup (MkGroup n tv)
  = do { (n_left, g1, g2) <- readTVar tv
       ; check (n_left == 0)
       ; new_g1 <- newGate n
       ; new_g2 <- newGate n
       ; writeTVar tv (n, new_g1, new_g2)
       ; return (g1, g2) }

---------------
data Gate = MkGate Int (TVar Int)

newGate :: Int -> STM Gate
newGate n = do { tv <- newTVar 0; return (MkGate n tv) }

passGate :: Gate -> IO ()
passGate (MkGate n tv)
  = atomically (do { n_left <- readTVar tv
                   ; check (n_left > 0)
                   ; writeTVar tv (n_left-1) })

operateGate :: Gate -> IO ()
operateGate (MkGate n tv)
  = do { atomically (writeTVar tv n)
       ; atomically (do { n_left <- readTVar tv
                        ; check (n_left == 0) }) }

   The relational data model has. . .        Its RDBMS implementation is. . .
   a stored Table, which is a                a File of disk Blocks, which is
   collection of. . .                        a sequence of. . .
   Rows                                      Records
               which have one or more. . .
   Attributes. . .                           Fields. . .
         each of which has some Value of a known Type.

                       Table 4: Rows and records.

----------------

forever :: IO () -> IO ()
-- Repeatedly perform the action
forever act = do { act; forever act }

randomDelay :: IO ()
-- Delay for a random time between 1 and 1,000,000 microseconds
randomDelay = do { waitTime <- getStdRandom (randomR (1, 1000000))
                 ; threadDelay waitTime }

choose :: [(STM a, a -> IO ())] -> IO ()
choose choices = do { to_do <- atomically (foldr1 orElse stm_actions)
                    ; to_do }
  where
    stm_actions :: [STM (IO ())]
    stm_actions = [ do { val <- guard; return (rhs val) }
                  | (guard, rhs) <- choices ]

4.5 Record Management

(Sciore, 2008, Chapter 15)

• All the preceding RDBMS components have been managing uninterpreted “raw
  data” – just RAM Pages and Buffers for disk Blocks of bytes.

• Now we consider the RDBMS components which interpret this raw data as an
  implementation of the relational data model.

• The Record Manager is the first such component. It builds a stored Table on
  top of disk Blocks as in Table 4.

• That is, RAM Pages (in Buffers) have methods for getting and setting
  arbitrary Values at arbitrary offsets within them.

• This Record Manager specifies in turn the

  – offsets where Records start inside a Page,
  – offsets of the Fields inside each Record, and
  – Type of each Field of a Record,

  so that the RDBMS components above it can

  – access the nth Record on this Page, and
  – its Fields by their Names, and
  – get and set the Values for its Fields

  without having to know and calculate their actual offsets – this Record
  Manager takes care of that for them.

• This Record Manager also determines the Record IDentifier (RID) for each
  Record.

  – This RID identifies each Record uniquely within its File.
  – It is a pair ⟨p, q⟩ meaning “Record q in Block p of this File”.
  – For example, an index for the chosen key of a Table maps each key value
    to the corresponding RID, which is the “pointer” to the corresponding
    Record.
Homogeneous vs. Nonhomogeneous Files

• The first tradeoff in the design of a Record Manager is the File structure.
  A File is. . .

  homogeneous if all its Records belong to the same Table – and so they all
      have the same Fields too.
      + Simpler Record Manager design – each Block of the File can be treated
        as an array of structurally identical Records.
      − The database must be divided into many OS Files (as in Oracle,
        SimpleDB, . . . ).
  nonhomogeneous if its Records can belong to different Tables and can
      therefore have different Fields too.
      + The database can be in one OS File (like a MS Access .mdb file).
      − The Record Manager must keep structurally different kinds of Records
        together in the same File.

• The Log is an example of a nonhomogeneous File, because it contains many
  different kinds of Records.

• Nonhomogeneous Files can be efficient if they are designed around a
  particular way of joining Tables together. This is called clustering the
  data according to this particular join predicate.

• This organization is

  fast when the data is accessed in the same way as it was clustered, but
  slower when the data is accessed in other ways.

• Figure 50 shows the DEPT and STUDENT Tables clustered together in one file
  so that the student Records with the same major are clustered together
  after their common department Record.

• Then it is fast to retrieve and list students according to their major, but
  slower if they are accessed in some other way.

Figure 50: Nonhomogeneous blocks. (Sciore, 2008)

Figure 51: Records and block boundaries. (Sciore, 2008)

Spanned vs. Unspanned Records

• Another tradeoff is whether a Record can span over a Page boundary or not.

• If it can, then. . .

  + the Pages can be filled to maximum without having to waste the last part
    of a Page which is too small for another Record,
  + Record length has no upper limit, but
  − processing a Record which spans a Page boundary is more difficult,
    because it must consider both Pages.

• The SimpleDB Log uses unspanned Records, because it flushes the last Page
  when the next Record would not fit into it any more, and starts another
  Page.

• Figure 51 shows 2 1000-byte Blocks with 4 300-byte Records. The wasted
  100-byte part of the unspanned choice (b) is shaded.

• Figure 52 shows 2 ways to represent spanned Records, with an integer in the
  beginning of each Block telling how many bytes. . .

  (a) belong to the last Record of the preceding Block (= length of R2b), or
  (b) of the first Record are in the preceding Block (= length of R2a).

Figure 52: 2 ways to span a block boundary. (Sciore, 2008)
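The arithmetic behind Figure 51's unspanned layout can be checked with a tiny helper (the class and method names are hypothetical):

```java
// Illustrative helper: with unspanned records, each block holds
// floor(blockSize / recordSize) whole records, and the remainder
// of the block is wasted space.
class UnspannedLayout {
    static int recordsPerBlock(int blockSize, int recordSize) {
        return blockSize / recordSize;   // integer division = floor
    }

    static int wastedBytes(int blockSize, int recordSize) {
        return blockSize % recordSize;   // the part too small for a record
    }
}
```

For the figure's numbers, a 1000-byte block holds three 300-byte records and wastes the remaining 100 bytes.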

Fixed- vs. Variable-Length Fields


• There are 2 kinds of Field s, depending on whether the length of its Values is
fixed like Java ints, which are always 4 bytes long.
variable like Java Strings which can hold any number of characters. SQL has for
instance the types
char(n)= a string of exactly n characters
varchar(n)= a string of at most n characters, where n should be “small”
clob(n)= a string of at most n characters, where n can be “large” – a char-
acter large object block
blob(n)= a sequence of at most n bytes – a binary large object block.
• For instance, in our university example database
small strings are enough for names
large strings could be used for free-form text like course descriptions.
• Variable-length Field s are more difficult to implement than fixed-length Field s,
because
¶ modifying the Value of a variable-length Field can change (especially grow)
the length of its Record , and so
· the RIDs may change inside a Page.
• Problem ¶ can be solved by adding overflow Block s when needed, as in Figure 53.

• Problem · can be solved with an ID table as in part (c) of Figure 54. Then RID
hp, qi means “the record whose starting point within Block p is in its ID-TABLE[q]”.

• Character and binary large object blocks can be stored separately from their records,
as in part (b) of Figure 55.

Figure 53: Growing a variable-length field into an overflow block. (Sciore, 2008)

Figure 54: Using an ID table in a block. (Sciore, 2008)

Figure 55: Different ways to store variable-length strings. (Sciore, 2008)

The SimpleDB Record Manager

• SimpleDB uses

homogeneous Files containing


unspanned Record s with
fixed-length fields, where each varchar(n) reserves enough space for all its n
characters, as in part (c) of Figure 55.

• It stores each Java int as 4 bytes – including the length n of a varchar field.

• In addition, each Record begins with a Flag byte, which is

1 if this Record is already used for storing a row of the Table stored in this File,
and
0 if it is still unused.

(In fact, SimpleDB uses a 4-byte int as the Flag, but let us assume just 1 byte for
these examples.)

• This structure means that a File contains

Record Pages each containing the same number of


Slots each containing
one Record and its Flag byte, as in Figure 56.

• Then RID ⟨p, q⟩ means the Record stored in Slot q of Record Page p.

• Figure 57 shows the corresponding table information, which describes the Field
structure of the Record inside a Slot.

• The Schema of a Table consists of

Figure 56: A block of student records. (Sciore, 2008)

Figure 57: The 26 bytes of a student record. (Sciore, 2008)

its Attributes and for each of them


its Type and
the Length of its Values.

That is, it describes the logical structure of the Rows.

• This table information gives

the Length of the Field reserved for each Attribute, and


the Offset of each Field from the start of the Record .

That is, it describes the physical structure of the Records.

• For instance, the Length of SName is

10 characters for the Attribute in the Schema by the Table definition in Figure 5,
but
14 bytes for the Field in the table information in Figure 57 – because the Field
begins with the 4-byte int giving the actual length of the current Value.
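
The byte-length rule just described can be written out as a tiny sketch. This is not SimpleDB code; the class and method names are invented for illustration, but the constants (4-byte int, 4-byte length prefix before a varchar) match the text.

```java
// Sketch of the field-length rule: an int field takes 4 bytes, and a
// varchar(n) field takes 4 + n bytes (a 4-byte length prefix plus room
// for all n characters).
public class FieldLengths {
    static final int INT_SIZE = 4;

    public static int intFieldBytes()          { return INT_SIZE; }
    public static int varcharFieldBytes(int n) { return INT_SIZE + n; }

    public static void main(String[] args) {
        // The student record of Figure 57:
        // SId int, SName varchar(10), GradYear int, MajorId int.
        int recordLength = intFieldBytes()
                         + varcharFieldBytes(10)
                         + intFieldBytes()
                         + intFieldBytes();
        System.out.println(recordLength); // 26 bytes, as in Figure 57
    }
}
```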

• For instance, accessing the MajorId field of the Record with RID ⟨p, q⟩ means

¬ retrieving Block p of the Student File into the RAM Page of a Buffer

­ moving to the beginning of Slot q within this Buffer – that is, to position

q · slot length = q · (Flag byte + Record length) = q · (1 + 26) = q · 27

® moving to position

Flag byte + Field Offset = 1 + 22 = 23

within the Slot

¯ getting or setting the 4-byte integer Value starting at that position.

• The Record Manager handles this translation of an RID and an Attribute name into
a position within a Block .
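
The position arithmetic above can be captured in a few lines. This is not SimpleDB code; the constants come from the running example with a 1-byte flag (SimpleDB really uses a 4-byte int flag), and the names are invented.

```java
// Sketch of the RID-to-byte-position translation for the example above:
// a 1-byte flag, a 26-byte student record, so each slot is 27 bytes.
public class SlotPosition {
    static final int FLAG_BYTES = 1;
    static final int RECORD_LENGTH = 26;                     // bytes per record
    static final int SLOT_SIZE = FLAG_BYTES + RECORD_LENGTH; // 27

    /** Byte position where slot q begins within its block. */
    public static int slotStart(int q) {
        return q * SLOT_SIZE;
    }

    /** Byte position of a field with the given offset inside slot q. */
    public static int fieldPos(int q, int fieldOffset) {
        return slotStart(q) + FLAG_BYTES + fieldOffset;
    }

    public static void main(String[] args) {
        // MajorId lives at offset 22 inside the record, so in slot 1
        // it starts at 1 * 27 + 1 + 22 = 50.
        System.out.println(fieldPos(1, 22)); // 50
    }
}
```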

• The API for the Schema and TableInfo objects used in this translation is in Fig-
ure 58.

SimpleDB source file simpledb/record/Schema.java


package simpledb.record;

import static java.sql.Types.*;
import java.util.*;

/**
 * The record schema of a table.
 * A schema contains the name and type of
 * each field of the table, as well as the length
 * of each varchar field.
 * @author Edward Sciore
 */
public class Schema {
   private Map<String,FieldInfo> info = new HashMap<String,FieldInfo>();

   /**
    * Creates an empty schema.
    * Field information can be added to a schema
    * via the five addXXX methods.
    */
   public Schema() {}

   /**
    * Adds a field to the schema having a specified
    * name, type, and length.
    * If the field type is "integer", then the length
    * value is irrelevant.
    * @param fldname the name of the field
    * @param type the type of the field, according to the constants in simpledb.sql.types
    * @param length the conceptual length of a string field
    */
   public void addField(String fldname, int type, int length) {
      info.put(fldname, new FieldInfo(type, length));
   }

   /**
    * Adds an integer field to the schema.
    * @param fldname the name of the field
    */
   public void addIntField(String fldname) {
      addField(fldname, INTEGER, 0);
   }

   /**
    * Adds a string field to the schema.
    * The length is the conceptual length of the field.
    * For example, if the field is defined as varchar(8),
    * then its length is 8.
    * @param fldname the name of the field
    * @param length the number of chars in the varchar definition
    */
   public void addStringField(String fldname, int length) {
      addField(fldname, VARCHAR, length);
   }

   /**
    * Adds a field to the schema having the same
    * type and length as the corresponding field
    * in another schema.
    * @param fldname the name of the field
    * @param sch the other schema
    */
   public void add(String fldname, Schema sch) {
      int type = sch.type(fldname);
      int length = sch.length(fldname);
      addField(fldname, type, length);
   }

   /**
    * Adds all of the fields in the specified schema
    * to the current schema.
    * @param sch the other schema
    */
   public void addAll(Schema sch) {
      info.putAll(sch.info);
   }

   /**
    * Returns a collection containing the name of
    * each field in the schema.
    * @return the collection of the schema's field names
    */
   public Collection<String> fields() {
      return info.keySet();
   }

   /**
    * Returns true if the specified field
    * is in the schema.
    * @param fldname the name of the field
    * @return true if the field is in the schema
    */
   public boolean hasField(String fldname) {
      return fields().contains(fldname);
   }

   /**
    * Returns the type of the specified field, using the
    * constants in {@link java.sql.Types}.
    * @param fldname the name of the field
    * @return the integer type of the field
    */
   public int type(String fldname) {
      return info.get(fldname).type;
   }

   /**
    * Returns the conceptual length of the specified field.
    * If the field is not a string field, then
    * the return value is undefined.
    * @param fldname the name of the field
    * @return the conceptual length of the field
    */
   public int length(String fldname) {
      return info.get(fldname).length;
   }

   class FieldInfo {
      int type, length;
      public FieldInfo(int type, int length) {
         this.type = type;
         this.length = length;
      }
   }
}

Figure 58: The two kinds of objects in the Record Manager. (Sciore, 2008)

SimpleDB source file simpledb/record/TableInfo.java


package simpledb.record;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.*;
import java.util.*;

/**
 * The metadata about a table and its records.
 * @author Edward Sciore
 */
public class TableInfo {
   private Schema schema;
   private Map<String,Integer> offsets;
   private int recordlen;
   private String tblname;

   /**
    * Creates a TableInfo object, given a table name
    * and schema. The constructor calculates the
    * physical offset of each field.
    * This constructor is used when a table is created.
    * @param tblname the name of the table
    * @param schema the schema of the table's records
    */
   public TableInfo(String tblname, Schema schema) {
      this.schema = schema;
      this.tblname = tblname;
      offsets = new HashMap<String,Integer>();
      int pos = 0;
      for (String fldname : schema.fields()) {
         offsets.put(fldname, pos);
         pos += lengthInBytes(fldname);
      }
      recordlen = pos;
   }

   /**
    * Creates a TableInfo object from the
    * specified metadata.
    * This constructor is used when the metadata
    * is retrieved from the catalog.
    * @param tblname the name of the table
    * @param schema the schema of the table's records
    * @param offsets the already-calculated offsets of the fields within a record
    * @param recordlen the already-calculated length of each record
    */
   public TableInfo(String tblname, Schema schema, Map<String,Integer> offsets, int recordlen) {
      this.tblname   = tblname;
      this.schema    = schema;
      this.offsets   = offsets;
      this.recordlen = recordlen;
   }

   /**
    * Returns the filename assigned to this table.
    * Currently, the filename is the table name
    * followed by ".tbl".
    * @return the name of the file assigned to the table
    */
   public String fileName() {
      return tblname + ".tbl";
   }

   /**
    * Returns the schema of the table's records.
    * @return the table's record schema
    */
   public Schema schema() {
      return schema;
   }

   /**
    * Returns the offset of a specified field within a record.
    * @param fldname the name of the field
    * @return the offset of that field within a record
    */
   public int offset(String fldname) {
      return offsets.get(fldname);
   }

   /**
    * Returns the length of a record, in bytes.
    * @return the length in bytes of a record
    */
   public int recordLength() {
      return recordlen;
   }

   private int lengthInBytes(String fldname) {
      int fldtype = schema.type(fldname);
      if (fldtype == INTEGER)
         return INT_SIZE;
      else
         return STR_SIZE(schema.length(fldname));
   }
}

SimpleDB source file simpledb/record/RID.java


package simpledb.record;

/**
 * An identifier for a record within a file.
 * A RID consists of the block number in the file,
 * and the ID of the record in that block.
 * @author Edward Sciore
 */
public class RID {
   private int blknum;
   private int id;

   /**
    * Creates a RID for the record having the
    * specified ID in the specified block.
    * @param blknum the block number where the record lives
    * @param id the record's ID
    */
   public RID(int blknum, int id) {
      this.blknum = blknum;
      this.id = id;
   }

   /**
    * Returns the block number associated with this RID.
    * @return the block number
    */
   public int blockNumber() {
      return blknum;
   }

   /**
    * Returns the ID associated with this RID.
    * @return the ID
    */
   public int id() {
      return id;
   }

   public boolean equals(Object obj) {
      RID r = (RID) obj;
      return blknum == r.blknum && id == r.id;
   }

   public String toString() {
      return "[" + blknum + ", " + id + "]";
   }
}

SimpleDB source file simpledb/record/RecordPage.java


• Here is the implementation of one RecordPage.

• It maintains the current Slot within this Block .

• Its get and set methods take an Attribute name as an argument, and translate it
into the correct position within this current Slot.

• It also provides the next method, which moves this current Slot into the next Slot
in use within this Block , if any.
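
Before reading the listing, the flag-scanning logic behind next and searchFor can be illustrated with a small, self-contained toy. This is not SimpleDB code: here an int array stands in for the slot flags of one block, whereas the real code reads each flag through the transaction at byte position currentslot · slotsize.

```java
// Toy simulation of scanning slot flags for the next slot with a
// given flag value (INUSE for next(), EMPTY for insert()).
public class SlotScan {
    public static final int EMPTY = 0, INUSE = 1;

    /** Returns the first slot q >= from whose flag matches, or -1. */
    public static int searchFor(int[] flags, int from, int flag) {
        for (int q = from; q < flags.length; q++)
            if (flags[q] == flag)
                return q;
        return -1;
    }

    public static void main(String[] args) {
        int[] flags = { INUSE, EMPTY, INUSE, EMPTY };
        // Starting before the first slot, next() visits slots 0 and 2;
        // insert() would find the empty slot 1 first.
        System.out.println(searchFor(flags, 0, INUSE)); // 0
        System.out.println(searchFor(flags, 1, INUSE)); // 2
        System.out.println(searchFor(flags, 0, EMPTY)); // 1
    }
}
```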
package simpledb.record;

import static simpledb.file.Page.*;
import simpledb.file.Block;
import simpledb.tx.Transaction;

/**
 * Manages the placement and access of records in a block.
 * @author Edward Sciore
 */
public class RecordPage {
   public static final int EMPTY = 0, INUSE = 1;

   private Block blk;
   private TableInfo ti;
   private Transaction tx;
   private int slotsize;
   private int currentslot = -1;

   /** Creates the record manager for the specified block.
    * The current record is set to be prior to the first one.
    * @param blk a reference to the disk block
    * @param ti the table's metadata
    * @param tx the transaction performing the operations
    */
   public RecordPage(Block blk, TableInfo ti, Transaction tx) {
      this.blk = blk;
      this.ti = ti;
      this.tx = tx;
      slotsize = ti.recordLength() + INT_SIZE;
      tx.pin(blk);
   }

   /**
    * Closes the manager, by unpinning the block.
    */
   public void close() {
      if (blk != null) {
         tx.unpin(blk);
         blk = null;
      }
   }

   /**
    * Moves to the next record in the block.
    * @return false if there is no next record.
    */
   public boolean next() {
      return searchFor(INUSE);
   }

   /**
    * Returns the integer value stored for the
    * specified field of the current record.
    * @param fldname the name of the field.
    * @return the integer stored in that field
    */
   public int getInt(String fldname) {
      int position = fieldpos(fldname);
      return tx.getInt(blk, position);
   }

   /**
    * Returns the string value stored for the
    * specified field of the current record.
    * @param fldname the name of the field.
    * @return the string stored in that field
    */
   public String getString(String fldname) {
      int position = fieldpos(fldname);
      return tx.getString(blk, position);
   }

   /**
    * Stores an integer at the specified field
    * of the current record.
    * @param fldname the name of the field
    * @param val the integer value stored in that field
    */
   public void setInt(String fldname, int val) {
      int position = fieldpos(fldname);
      tx.setInt(blk, position, val);
   }

   /**
    * Stores a string at the specified field
    * of the current record.
    * @param fldname the name of the field
    * @param val the string value stored in that field
    */
   public void setString(String fldname, String val) {
      int position = fieldpos(fldname);
      tx.setString(blk, position, val);
   }

   /**
    * Deletes the current record.
    * Deletion is performed by just marking the record
    * as "deleted"; the current record does not change.
    * To get to the next record, call next().
    */
   public void delete() {
      int position = currentpos();
      tx.setInt(blk, position, EMPTY);
   }

   /**
    * Inserts a new, blank record somewhere in the page.
    * Return false if there were no available slots.
    * @return false if the insertion was not possible
    */
   public boolean insert() {
      currentslot = -1;
      boolean found = searchFor(EMPTY);
      if (found) {
         int position = currentpos();
         tx.setInt(blk, position, INUSE);
      }
      return found;
   }

   /**
    * Sets the current record to be the record having the
    * specified ID.
    * @param id the ID of the record within the page.
    */
   public void moveToId(int id) {
      currentslot = id;
   }

   /**
    * Returns the ID of the current record.
    * @return the ID of the current record
    */
   public int currentId() {
      return currentslot;
   }

   private int currentpos() {
      return currentslot * slotsize;
   }

   private int fieldpos(String fldname) {
      int offset = INT_SIZE + ti.offset(fldname);
      return currentpos() + offset;
   }

   private boolean isValidSlot() {
      return currentpos() + slotsize <= BLOCK_SIZE;
   }

   private boolean searchFor(int flag) {
      currentslot++;
      while (isValidSlot()) {
         int position = currentpos();
         if (tx.getInt(blk, position) == flag)
            return true;
         currentslot++;
      }
      return false;
   }
}

Figure 59: The record file operations. (Sciore, 2008)

SimpleDB source file simpledb/record/RecordFormatter.java


package simpledb.record;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.*;
import static simpledb.record.RecordPage.EMPTY;
import simpledb.file.Page;
import simpledb.buffer.PageFormatter;

/**
 * An object that can format a page to look like a block of
 * empty records.
 * @author Edward Sciore
 */
class RecordFormatter implements PageFormatter {
   private TableInfo ti;

   /**
    * Creates a formatter for a new page of a table.
    * @param ti the table's metadata
    */
   public RecordFormatter(TableInfo ti) {
      this.ti = ti;
   }

   /**
    * Formats the page by allocating as many record slots
    * as possible, given the record length.
    * Each record slot is assigned a flag of EMPTY.
    * Each integer field is given a value of 0, and
    * each string field is given a value of "".
    * @see simpledb.buffer.PageFormatter#format(simpledb.file.Page)
    */
   public void format(Page page) {
      int recsize = ti.recordLength() + INT_SIZE;
      for (int pos = 0; pos + recsize <= BLOCK_SIZE; pos += recsize) {
         page.setInt(pos, EMPTY);
         makeDefaultRecord(page, pos);
      }
   }

   private void makeDefaultRecord(Page page, int pos) {
      for (String fldname : ti.schema().fields()) {
         int offset = ti.offset(fldname);
         if (ti.schema().type(fldname) == INTEGER)
            page.setInt(pos + INT_SIZE + offset, 0);
         else
            page.setString(pos + INT_SIZE + offset, "");
      }
   }
}

SimpleDB source file simpledb/record/RecordFile.java
• Here is the implementation of a whole File of RecordPages – that is, of a stored
Table.

• Its API is in Figure 59.

• It implements the result set concept for a stored Table:

– It maintains the notion of the current Record – which it builds on top of the
current Slot within the current RecordPage.
– This current Record can be positioned “just before the first” actual Record in
the File.
– It can be moved to the next Record (if any) – which it does by moving
¬ to the next Slot in use within the current RecordPage, or
­ to the next RecordPage, if there are no more Slots in use within the current
RecordPage.
– It permits getting and setting the Attribute Values for this current Record .

• It also provides random access to these Records using RIDs as their addresses:

– The current Record can be positioned at a given RID.


– The RID of the current Record can be obtained.

Database indexes (which we will discuss later) use these operations.

• It can also delete the current Record – by setting the Flag of its Slot to 0.

• It can also insert a new Record somewhere in the File. Its contents can then be
set. SimpleDB

¬ starts in the current Block , and


­ scans the File forward until it finds the first unused Slot and takes it into use,
or
® if it reaches the end of the File, then it adds another RecordPage of unused
Slots there.

This linear scan for an unused Slot is not very efficient. Instead, an RDBMS
RecordFile could maintain, for instance, a list of the still-unused Slots linked by RIDs.
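
A minimal, self-contained sketch of that free-list idea follows. This is not SimpleDB code: the names are invented, and the list lives in an in-memory array, whereas a real RDBMS would link the empty slots by RIDs on disk; the point is that allocation becomes one pointer update instead of a scan.

```java
// Toy free list over the slots of one page: every empty slot stores
// the index of the next empty slot, so insert() pops the head of the
// list in constant time and delete() pushes the slot back.
public class FreeList {
    public static final int END = -1;
    private int[] nextFree; // nextFree[q] = next empty slot after q, or END
    private int head;       // first empty slot, or END if the page is full

    public FreeList(int slots) {
        nextFree = new int[slots];
        for (int q = 0; q < slots; q++)
            nextFree[q] = (q + 1 < slots) ? q + 1 : END;
        head = 0;
    }

    /** Takes an empty slot into use; returns its number, or END if full. */
    public int allocate() {
        int q = head;
        if (q != END)
            head = nextFree[q];
        return q;
    }

    /** Returns a deleted slot to the front of the free list. */
    public void free(int q) {
        nextFree[q] = head;
        head = q;
    }
}
```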
package simpledb.record;

import simpledb.file.Block;
import simpledb.tx.Transaction;

/**
 * Manages a file of records.
 * There are methods for iterating through the records
 * and accessing their contents.
 * @author Edward Sciore
 */
public class RecordFile {
   private TableInfo ti;
   private Transaction tx;
   private String filename;
   private RecordPage rp;
   private int currentblknum;

   /**
    * Constructs an object to manage a file of records.
    * If the file does not exist, it is created.
    * @param ti the table metadata
    * @param tx the transaction
    */
   public RecordFile(TableInfo ti, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      filename = ti.fileName();
      if (tx.size(filename) == 0)
         appendBlock();
      moveTo(0);
   }

   /**
    * Closes the record file.
    */
   public void close() {
      rp.close();
   }

   /**
    * Positions the current record so that a call to method next
    * will wind up at the first record.
    */
   public void beforeFirst() {
      moveTo(0);
   }

   /**
    * Moves to the next record. Returns false if there
    * is no next record.
    * @return false if there is no next record.
    */
   public boolean next() {
      while (true) {
         if (rp.next())
            return true;
         if (atLastBlock())
            return false;
         moveTo(currentblknum + 1);
      }
   }

   /**
    * Returns the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @return the integer value at that field
    */
   public int getInt(String fldname) {
      return rp.getInt(fldname);
   }

   /**
    * Returns the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @return the string value at that field
    */
   public String getString(String fldname) {
      return rp.getString(fldname);
   }

   /**
    * Sets the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @param val the new value for the field
    */
   public void setInt(String fldname, int val) {
      rp.setInt(fldname, val);
   }

   /**
    * Sets the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @param val the new value for the field
    */
   public void setString(String fldname, String val) {
      rp.setString(fldname, val);
   }

   /**
    * Deletes the current record.
    * The client must call next() to move to
    * the next record.
    * Calls to methods on a deleted record
    * have unspecified behavior.
    */
   public void delete() {
      rp.delete();
   }

   /**
    * Inserts a new, blank record somewhere in the file
    * beginning at the current record.
    * If the new record does not fit into an existing block,
    * then a new block is appended to the file.
    */
   public void insert() {
      while (!rp.insert()) {
         if (atLastBlock())
            appendBlock();
         moveTo(currentblknum + 1);
      }
   }

   /**
    * Positions the current record as indicated by the
    * specified RID.
    * @param rid a record identifier
    */
   public void moveToRid(RID rid) {
      moveTo(rid.blockNumber());
      rp.moveToId(rid.id());
   }

   /**
    * Returns the RID of the current record.
    * @return a record identifier
    */
   public RID currentRid() {
      int id = rp.currentId();
      return new RID(currentblknum, id);
   }

   private void moveTo(int b) {
      if (rp != null)
         rp.close();
      currentblknum = b;
      Block blk = new Block(filename, currentblknum);
      rp = new RecordPage(blk, ti, tx);
   }

   private boolean atLastBlock() {
      return currentblknum == tx.size(filename) - 1;
   }

   private void appendBlock() {
      RecordFormatter fmtr = new RecordFormatter(ti);
      tx.append(filename, fmtr);
   }
}

4.6 Metadata Management


(Sciore, 2008, Chapter 16)

• The Schema and table information of the Record Manager in section 4.5 is one
example of metadata:
data telling how to interpret the other data stored in the database.

• The Metadata manager handles its storage and retrieval.

• The SQL standard specifies > 50 different views an RDBMS must offer onto its
metadata.
In this way it avoids specifying how an RDBMS actually stores its metadata.

• SimpleDB stores its much simpler metadata in 4 Tables:

tblcat(TblName:varchar(16),RecLength:int)
fldcat(TblName:varchar(16),FldName:varchar(16)
,Type:int,Length:int,Offset:int)
viewcat(ViewName:varchar(16),ViewDef:varchar(100))
idxcat(tablename:varchar(16),fieldname:varchar(16)
,indexname:varchar(16))

• They can be queried with SELECT... FROM... WHERE... just like other
Tables.

• These metadata tables are often called the catalog of the RDBMS.

• The Table catalog tblcat has the name of each CREATEd Table as its key and
the length of its Records as its other attribute.

• The Field catalog fldcat tells which Fields such a Table has, as well as the

Type of its Values, as a java.sql.Types constant, where
4 denotes an int, and
12 denotes a varchar
Length of these Values – for varchars
Offset inside the Record

for each Field.

• Figure 60 shows them for our university database example.

SimpleDB source file simpledb/metadata/TableMgr.java

• Here is the implementation of these tblcat and fldcat metadata tables.

• Together they form the SimpleDB metadata for each CREATEd Table.

• Hence this implementation also provides getting the table information for a given
Table.

• Its constructor CREATEs the metadata of these two catalog tables into the tables
themselves, if it is constructing a new database from scratch.
package simpledb.metadata;

import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The table manager.
 * There are methods to create a table, save the metadata
 * in the catalog, and obtain the metadata of a
 * previously-created table.
 * @author Edward Sciore
 */
public class TableMgr {
   /**
    * The maximum number of characters in any
    * tablename or fieldname.
    * Currently, this value is 16.
    */
   public static final int MAX_NAME = 16;

   private TableInfo tcatInfo, fcatInfo;

   /**
    * Creates a new catalog manager for the database system.
    * If the database is new, then the two catalog tables
    * are created.
    * @param isNew has the value true if the database is new
    * @param tx the startup transaction
    */
   public TableMgr(boolean isNew, Transaction tx) {
      Schema tcatSchema = new Schema();
      tcatSchema.addStringField("tblname", MAX_NAME);
      tcatSchema.addIntField("reclength");
      tcatInfo = new TableInfo("tblcat", tcatSchema);

      Schema fcatSchema = new Schema();
      fcatSchema.addStringField("tblname", MAX_NAME);
      fcatSchema.addStringField("fldname", MAX_NAME);
      fcatSchema.addIntField("type");
      fcatSchema.addIntField("length");
      fcatSchema.addIntField("offset");
      fcatInfo = new TableInfo("fldcat", fcatSchema);

      if (isNew) {
         createTable("tblcat", tcatSchema, tx);
         createTable("fldcat", fcatSchema, tx);
      }
   }

   /**
    * Creates a new table having the specified name and schema.
    * @param tblname the name of the new table
    * @param sch the table's schema
    * @param tx the transaction creating the table
    */
   public void createTable(String tblname, Schema sch, Transaction tx) {
      TableInfo ti = new TableInfo(tblname, sch);
      // insert one record into tblcat
      RecordFile tcatfile = new RecordFile(tcatInfo, tx);
      tcatfile.insert();
      tcatfile.setString("tblname", tblname);
      tcatfile.setInt("reclength", ti.recordLength());
      tcatfile.close();

      // insert a record into fldcat for each field
      RecordFile fcatfile = new RecordFile(fcatInfo, tx);
      for (String fldname : sch.fields()) {
         fcatfile.insert();
         fcatfile.setString("tblname", tblname);
         fcatfile.setString("fldname", fldname);
         fcatfile.setInt("type",   sch.type(fldname));
         fcatfile.setInt("length", sch.length(fldname));
         fcatfile.setInt("offset", ti.offset(fldname));
      }
      fcatfile.close();
   }

   /**
    * Retrieves the metadata for the specified table
    * out of the catalog.
    * @param tblname the name of the table
    * @param tx the transaction
    * @return the table's stored metadata
    */
   public TableInfo getTableInfo(String tblname, Transaction tx) {
      RecordFile tcatfile = new RecordFile(tcatInfo, tx);
      int reclen = -1;
      while (tcatfile.next())
         if (tcatfile.getString("tblname").equals(tblname)) {
            reclen = tcatfile.getInt("reclength");
            break;
         }
      tcatfile.close();

      RecordFile fcatfile = new RecordFile(fcatInfo, tx);
      Schema sch = new Schema();
      Map<String,Integer> offsets = new HashMap<String,Integer>();
      while (fcatfile.next())
         if (fcatfile.getString("tblname").equals(tblname)) {
            String fldname = fcatfile.getString("fldname");
            int fldtype = fcatfile.getInt("type");
            int fldlen  = fcatfile.getInt("length");
            int offset  = fcatfile.getInt("offset");
            offsets.put(fldname, offset);
            sch.addField(fldname, fldtype, fldlen);
         }
      fcatfile.close();
      return new TableInfo(tblname, sch, offsets, reclen);
   }
}

Figure 60: Metadata for the University Database. (Sciore, 2008)

SimpleDB source file simpledb/metadata/ViewMgr.java

• The view catalog viewcat tells the definition of each named view.

  – In SimpleDB, this definition is the SQL SELECT. . . FROM. . . WHERE. . .
    query as text.
  – A more reasonable type for this text could be something like clob(9999) if
    SimpleDB supported it.

• Here is its implementation.

• Its constructor CREATEs internally this viewcat table and its Fields into the
  Table metadata, if it is constructing a new database from scratch.

• This implementation also retrieves the definition of a given named View.
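The essence of this catalog – a view is stored simply as its defining query text, keyed by name – can be sketched with a toy in-memory version (hypothetical code, not part of SimpleDB; the real ViewMgr below stores the same two strings in a Record File instead of a map):

```java
import java.util.*;

// Toy illustration of the viewcat idea: the "definition" of a view is
// nothing more than its query text, looked up by view name.
public class ViewCatalogSketch {
    private final Map<String, String> viewcat = new HashMap<>();

    public void createView(String viewname, String viewdef) {
        viewcat.put(viewname, viewdef);
    }

    public String getViewDef(String viewname) {
        return viewcat.get(viewname);  // null if no such view exists
    }
}
```

As in SimpleDB, looking up an unknown view name simply yields null; it is the caller's job to decide whether that is an error.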

package simpledb.metadata;

import simpledb.record.*;
import simpledb.tx.Transaction;

class ViewMgr {
   private static final int MAX_VIEWDEF = 100;
   TableMgr tblMgr;

   public ViewMgr(boolean isNew, TableMgr tblMgr, Transaction tx) {
      this.tblMgr = tblMgr;
      if (isNew) {
         Schema sch = new Schema();
         sch.addStringField("viewname", TableMgr.MAX_NAME);
         sch.addStringField("viewdef", MAX_VIEWDEF);
         tblMgr.createTable("viewcat", sch, tx);
      }
   }

   public void createView(String vname, String vdef, Transaction tx) {
      TableInfo ti = tblMgr.getTableInfo("viewcat", tx);
      RecordFile rf = new RecordFile(ti, tx);
      rf.insert();
      rf.setString("viewname", vname);
      rf.setString("viewdef", vdef);
      rf.close();
   }

   public String getViewDef(String vname, Transaction tx) {
      String result = null;
      TableInfo ti = tblMgr.getTableInfo("viewcat", tx);
      RecordFile rf = new RecordFile(ti, tx);
      while (rf.next())
         if (rf.getString("viewname").equals(vname)) {
            result = rf.getString("viewdef");
            break;
         }
      rf.close();
      return result;
   }
}

SimpleDB source file simpledb/metadata/IndexMgr.java


• The index catalog idxcat will be used when we discuss indexing later.

  – It tells the names of the indexes which have been CREATEd for a given
    named Table.
  – Each SimpleDB index can be built on just one Field of a Table, and that
    restriction simplifies this index metadata.
  – In general, an RDBMS index can be built on many fields of the same Table.

• Its constructor CREATEs internally this idxcat table and its Fields into the Table
  metadata, if it is constructing a new database from scratch.
package simpledb.metadata;

import static simpledb.metadata.TableMgr.MAX_NAME;
import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The index manager.
 * The index manager has similar functionality to the table manager.
 * @author Edward Sciore
 */
public class IndexMgr {
   private TableInfo ti;

   /**
    * Creates the index manager.
    * This constructor is called during system startup.
    * If the database is new, then the <i>idxcat</i> table is created.
    * @param isnew indicates whether this is a new database
    * @param tx the system startup transaction
    */
   public IndexMgr(boolean isnew, TableMgr tblmgr, Transaction tx) {
      if (isnew) {
         Schema sch = new Schema();
         sch.addStringField("indexname", MAX_NAME);
         sch.addStringField("tablename", MAX_NAME);
         sch.addStringField("fieldname", MAX_NAME);
         tblmgr.createTable("idxcat", sch, tx);
      }
      ti = tblmgr.getTableInfo("idxcat", tx);
   }

   /**
    * Creates an index of the specified type for the specified field.
    * A unique ID is assigned to this index, and its information
    * is stored in the idxcat table.
    * @param idxname the name of the index
    * @param tblname the name of the indexed table
    * @param fldname the name of the indexed field
    * @param tx the calling transaction
    */
   public void createIndex(String idxname, String tblname, String fldname, Transaction tx) {
      RecordFile rf = new RecordFile(ti, tx);
      rf.insert();
      rf.setString("indexname", idxname);
      rf.setString("tablename", tblname);
      rf.setString("fieldname", fldname);
      rf.close();
   }

   /**
    * Returns a map containing the index info for all indexes
    * on the specified table.
    * @param tblname the name of the table
    * @param tx the calling transaction
    * @return a map of IndexInfo objects, keyed by their field names
    */
   public Map<String,IndexInfo> getIndexInfo(String tblname, Transaction tx) {
      Map<String,IndexInfo> result = new HashMap<String,IndexInfo>();
      RecordFile rf = new RecordFile(ti, tx);
      while (rf.next())
         if (rf.getString("tablename").equals(tblname)) {
            String idxname = rf.getString("indexname");
            String fldname = rf.getString("fieldname");
            IndexInfo ii = new IndexInfo(idxname, tblname, fldname, tx);
            result.put(fldname, ii);
         }
      rf.close();
      return result;
   }
}

Figure 61: The information on each index. (Sciore, 2008)

• Getting the index information for a named Table returns

  – the names of the indexes CREATEd for this table, and
  – for each named index, the corresponding information.

• Figure 61 shows this information about a specific index.

  – An index must be opened before it can be used to search for the RIDs having
    the given Value in the indexed Field.
  – The blocksAccessed method estimates how many Blocks would be accessed during
    one such search, so that the RDBMS can decide which is faster in a given
    situation:
    ∗ reading the Records sequentially from the File, vs.
    ∗ searching the File using this index – which may read the same Block many
      times.
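This trade-off can be sketched as a back-of-the-envelope cost comparison (a hypothetical helper, not SimpleDB code; the probe cost mirrors the bucket-based estimate of a static hash index, with the bucket count as a parameter):

```java
// Compare the block-access cost of a full sequential scan against one
// index probe, the decision that blocksAccessed() supports.
public class AccessPathChoice {
    // A sequential scan touches every block of the table file once.
    static int sequentialCost(int tableBlocks) {
        return tableBlocks;
    }

    // A static hash index spreads its records over numBuckets buckets;
    // one probe reads (on average) only the blocks of a single bucket.
    static int hashProbeCost(int indexBlocks, int numBuckets) {
        return indexBlocks / numBuckets;
    }

    static boolean indexIsFaster(int tableBlocks, int indexBlocks, int numBuckets) {
        return hashProbeCost(indexBlocks, numBuckets) < sequentialCost(tableBlocks);
    }
}
```

For a large table the probe wins easily, but for a table of only a block or two the sequential scan can be cheaper than even one index lookup – which is why the planner asks for these estimates instead of always preferring the index.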

SimpleDB source file simpledb/metadata/IndexInfo.java


package simpledb.metadata;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.BLOCK_SIZE;
import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.record.*;
import simpledb.index.Index;
import simpledb.index.hash.HashIndex;
import simpledb.index.btree.BTreeIndex; // in case we change to btree indexing

/**
 * The information about an index.
 * This information is used by the query planner in order to
 * estimate the costs of using the index,
 * and to obtain the schema of the index records.
 * Its methods are essentially the same as those of Plan.
 * @author Edward Sciore
 */
public class IndexInfo {
   private String idxname, fldname;
   private Transaction tx;
   private TableInfo ti;
   private StatInfo si;

   /**
    * Creates an IndexInfo object for the specified index.
    * @param idxname the name of the index
    * @param tblname the name of the table
    * @param fldname the name of the indexed field
    * @param tx the calling transaction
    */
   public IndexInfo(String idxname, String tblname, String fldname,
                    Transaction tx) {
      this.idxname = idxname;
      this.fldname = fldname;
      this.tx = tx;
      ti = SimpleDB.mdMgr().getTableInfo(tblname, tx);
      si = SimpleDB.mdMgr().getStatInfo(tblname, ti, tx);
   }

   /**
    * Opens the index described by this object.
    * @return the Index object associated with this information
    */
   public Index open() {
      Schema sch = schema();
      // Create new HashIndex for hash indexing
      return new HashIndex(idxname, sch, tx);
   }

   /**
    * Estimates the number of block accesses required to
    * find all index records having a particular search key.
    * The method uses the table's metadata to estimate the
    * size of the index file and the number of index records
    * per block.
    * It then passes this information to the traversalCost
    * method of the appropriate index type,
    * which provides the estimate.
    * @return the number of block accesses required to traverse the index
    */
   public int blocksAccessed() {
      TableInfo idxti = new TableInfo("", schema());
      int rpb = BLOCK_SIZE / idxti.recordLength();
      int numblocks = si.recordsOutput() / rpb;
      // Call HashIndex.searchCost for hash indexing
      return HashIndex.searchCost(numblocks, rpb);
   }

   /**
    * Returns the estimated number of records having a
    * search key.  This value is the same as doing a select
    * query; that is, it is the number of records in the table
    * divided by the number of distinct values of the indexed field.
    * @return the estimated number of records having a search key
    */
   public int recordsOutput() {
      return si.recordsOutput() / si.distinctValues(fldname);
   }

   /**
    * Returns the distinct values for a specified field
    * in the underlying table, or 1 for the indexed field.
    * @param fname the specified field
    */
   public int distinctValues(String fname) {
      if (fldname.equals(fname))
         return 1;
      else
         return Math.min(si.distinctValues(fldname), recordsOutput());
   }

   /**
    * Returns the schema of the index records.
    * The schema consists of the dataRID (which is
    * represented as two integers, the block number and the
    * record ID) and the dataval (which is the indexed field).
    * Schema information about the indexed field is obtained
    * via the table's metadata.
    * @return the schema of the index records
    */
   private Schema schema() {
      Schema sch = new Schema();
      sch.addIntField("block");
      sch.addIntField("id");
      if (ti.schema().type(fldname) == INTEGER)
         sch.addIntField("dataval");
      else {
         int fldlen = ti.schema().length(fldname);
         sch.addStringField("dataval", fldlen);
      }
      return sch;
   }
}

Figure 62: Example Statistics for the University Database. (Sciore, 2008)

Table Statistics
• The blocksAccessed method in Figure 61 is an example of statistics which the
RDBMS uses to decide an efficient way to execute the given SQL query.
• Consider the following simple statistics:
B(T ): the number of Block s in the File storing this Table T – estimating the I/O
needed to list its contents
R(T ): the number of Record s in this Table T – estimating the size of this listing
V(T ,F ): the number of distinct Values in this Field F of this Table T – estimating
the size of select(T ,F = . . .), or how selective F is.
• A commercial RDBMS may use much more elaborate statistics than these.
• Figure 62 shows them for a university with about 900 students and 500 sections per
year, for the last 50 years.
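As a quick numeric illustration of how R(T) and V(T,F) combine (hypothetical helper code, not SimpleDB; the round numbers follow the spirit of this example):

```java
// The standard uniform-distribution estimate for select(T, F = c):
// if the values of F are evenly distributed, roughly R(T) / V(T,F)
// records match any one constant c.
public class StatEstimate {
    static int selectSize(int numRecords, int numDistinctValues) {
        return numRecords / numDistinctValues;
    }
}
```

With about 900 students per year over 50 years, R(STUDENT) ≈ 45000 and V(STUDENT, GradYear) = 50, so selecting one graduation year is estimated at 45000 / 50 = 900 records; for a key field, V(T,F) = R(T), and the estimate drops to a single record.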

• The RDBMS can maintain these statistics either in the

  catalog as Tables like
        tblstats(TblName,NumBlocks,NumRecords)
        fldstats(TblName,FldName,NumValues)
     and
        – write them when the database contents change – with xlocks, which
          reduces concurrency
        – read them when planning how to execute a given SQL query – without
          slocks, because the results do not have to be exact; or in

  RAM because these Tables are small, but then they must be
        – recalculated whenever the RDBMS process is started, and
        – maintained while it is running.

SimpleDB source file simpledb/metadata/StatMgr.java


• SimpleDB chooses to

– maintain its statistics in RAM, and


– recompute them every 100th time they are requested.
package simpledb.metadata;

import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The statistics manager, which is responsible for
 * keeping statistical information about each table.
 * The manager does not store this information in the database.
 * Instead, it calculates this information on system startup,
 * and periodically refreshes it.
 * @author Edward Sciore
 */
class StatMgr {
   private TableMgr tblMgr;
   private Map<String,StatInfo> tablestats;
   private int numcalls;

   /**
    * Creates the statistics manager.
    * The initial statistics are calculated by
    * traversing the entire database.
    * @param tx the startup transaction
    */
   public StatMgr(TableMgr tblMgr, Transaction tx) {
      this.tblMgr = tblMgr;
      refreshStatistics(tx);
   }

   /**
    * Returns the statistical information about the specified table.
    * @param tblname the name of the table
    * @param ti the table's metadata
    * @param tx the calling transaction
    * @return the statistical information about the table
    */
   public synchronized StatInfo getStatInfo(String tblname, TableInfo ti, Transaction tx) {
      numcalls++;
      if (numcalls > 100)
         refreshStatistics(tx);
      StatInfo si = tablestats.get(tblname);
      if (si == null) {
         si = calcTableStats(ti, tx);
         tablestats.put(tblname, si);
      }
      return si;
   }

   private synchronized void refreshStatistics(Transaction tx) {
      tablestats = new HashMap<String,StatInfo>();
      numcalls = 0;
      TableInfo tcatmd = tblMgr.getTableInfo("tblcat", tx);
      RecordFile tcatfile = new RecordFile(tcatmd, tx);
      while (tcatfile.next()) {
         String tblname = tcatfile.getString("tblname");
         TableInfo md = tblMgr.getTableInfo(tblname, tx);
         StatInfo si = calcTableStats(md, tx);
         tablestats.put(tblname, si);
      }
      tcatfile.close();
   }

   private synchronized StatInfo calcTableStats(TableInfo ti, Transaction tx) {
      int numRecs = 0;
      RecordFile rf = new RecordFile(ti, tx);
      int numblocks = 0;
      while (rf.next()) {
         numRecs++;
         numblocks = rf.currentRid().blockNumber() + 1;
      }
      rf.close();
      return new StatInfo(numblocks, numRecs);
   }
}

SimpleDB source file simpledb/metadata/StatInfo.java

• Here are the 3 statistics for a given Table.

• SimpleDB does not actually compute the true V(T ,F ) values – it just makes a wild
guess. . .
package simpledb.metadata;

/**
 * Holds three pieces of statistical information about a table:
 * the number of blocks, the number of records,
 * and the number of distinct values for each field.
 * @author Edward Sciore
 */
public class StatInfo {
   private int numBlocks;
   private int numRecs;

   /**
    * Creates a StatInfo object.
    * Note that the number of distinct values is not
    * passed into the constructor.
    * The object fakes this value.
    * @param numblocks the number of blocks in the table
    * @param numrecs the number of records in the table
    */
   public StatInfo(int numblocks, int numrecs) {
      this.numBlocks = numblocks;
      this.numRecs = numrecs;
   }

   /**
    * Returns the estimated number of blocks in the table.
    * @return the estimated number of blocks in the table
    */
   public int blocksAccessed() {
      return numBlocks;
   }

   /**
    * Returns the estimated number of records in the table.
    * @return the estimated number of records in the table
    */
   public int recordsOutput() {
      return numRecs;
   }

   /**
    * Returns the estimated number of distinct values
    * for the specified field.
    * In actuality, this estimate is a complete guess.
    * @param fldname the name of the field
    * @return a guess as to the number of distinct field values
    */
   public int distinctValues(String fldname) {
      return 1 + (numRecs / 3);
   }
}

SimpleDB source file simpledb/metadata/MetadataMgr.java

• This Metadata Manager just collects together these 4 smaller Managers:

  – Table,
  – View,
  – Index, and
  – Statistics.
package simpledb.metadata;

import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.Map;

public class MetadataMgr {
   private static TableMgr tblmgr;
   private static ViewMgr viewmgr;
   private static StatMgr statmgr;
   private static IndexMgr idxmgr;

   public MetadataMgr(boolean isnew, Transaction tx) {
      tblmgr  = new TableMgr(isnew, tx);
      viewmgr = new ViewMgr(isnew, tblmgr, tx);
      statmgr = new StatMgr(tblmgr, tx);
      idxmgr  = new IndexMgr(isnew, tblmgr, tx);
   }

   public void createTable(String tblname, Schema sch, Transaction tx) {
      tblmgr.createTable(tblname, sch, tx);
   }

   public TableInfo getTableInfo(String tblname, Transaction tx) {
      return tblmgr.getTableInfo(tblname, tx);
   }

   public void createView(String viewname, String viewdef, Transaction tx) {
      viewmgr.createView(viewname, viewdef, tx);
   }

   public String getViewDef(String viewname, Transaction tx) {
      return viewmgr.getViewDef(viewname, tx);
   }

   public void createIndex(String idxname, String tblname, String fldname, Transaction tx) {
      idxmgr.createIndex(idxname, tblname, fldname, tx);
   }

   public Map<String,IndexInfo> getIndexInfo(String tblname, Transaction tx) {
      return idxmgr.getIndexInfo(tblname, tx);
   }

   public StatInfo getStatInfo(String tblname, TableInfo ti, Transaction tx) {
      return statmgr.getStatInfo(tblname, ti, tx);
   }
}

Figure 63: Scan nodes. (Sciore, 2008)

4.7 Query Processing


(Sciore, 2008, Chapter 17)
• The Record Manager in section 4.5 described an implementation of a relational
  Table as a File of Records and their Fields.

• Now we build on it the next level of query processing by combining these Tables
  with the Relational Algebra operations in section 2.6, which compute answers by
  traversing these Files.

4.7.1 Query Scans


• A Scan is an executable Relational Algebra expression: each

  leaf node is a File of Records implementing one relational Table – which can be
      processed as a result set
  internal node is an implementation of a Relational Algebra operation which takes
      result set(s) as input and produces another result set as output – which can be
      an input into another internal node.

  That is, a Scan is a tree of Table and operation Scans, as in Figure 63.

• Figure 64 shows the interface for these Scan nodes. It is similar to Record Files,
  except that it. . .

  – is read-only, so it has no methods to insert or delete Records or set their
    Fields – only update Scans have them
  – does not need the whole Schema, but just the Fields in its own output
  – has a generic method for getting the Value of a Field regardless of its type.

• Figures 65 and 66 show examples of

(a) Relational Algebra query trees and their corresponding


(b) SimpleDB query Scans – with their Predicates postponed until later, to omit
their details which would be irrelevant to the general idea of Scanning.

• Both examples

  1. first construct the Scan (b) according to the Relational Algebra expression (a)

Figure 64: Scans. (Sciore, 2008)

  2. then call its next method while it still has another current row to print.

• However, this “current row” does not exist physically:

  – When next tells that it has another current row (by returning true), the
    printing loop can get its Attribute Values. . .
  – . . . but this actually gets the corresponding Field Values from the current
    Record(s) of their File(s).

  That is, these next and get requests go down in the Scan until its leaf Record
  Files, and their return values come back up in the Scan.

• Note also that the same Table can have many current Records at the same time.

  – Then it has many Record File objects which
      share the same underlying disk Blocks, but
      have their own current Blocks and Slots.
  – This is necessary for supporting for instance self-joins, where a table is joined
    with itself:
    SELECT x.*, y.*
    FROM STUDENT x, STUDENT y
    WHERE x.MajorId = y.MajorId AND x.SId <> y.SId
    gets all pairs of STUDENTs x and y with the same major.
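The need for two independent current rows over the same data can be sketched with a toy in-memory nested-loop self-join (hypothetical code, not SimpleDB; each int[] stands for a (SId, MajorId) record):

```java
import java.util.*;

// The outer and inner loops each keep their own cursor into the SAME
// list of records -- exactly the two independent "current Records"
// that a self-join needs from the Record Manager.
public class SelfJoinSketch {
    static List<int[]> sameMajorPairs(List<int[]> students) {
        List<int[]> result = new ArrayList<>();
        for (int[] x : students)          // outer cursor over the table
            for (int[] y : students)      // independent inner cursor
                if (x[1] == y[1] && x[0] != y[0])
                    result.add(new int[] { x[0], y[0] });
        return result;
    }
}
```

If both loops shared one cursor, advancing the inner iteration would lose the outer row; this is why each Record File object has its own current Block and Slot.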

SimpleDB source file simpledb/query/Scan.java

• Here is the definition for Figure 64.

• Let us next consider how each kind of query Scan implements its beforeFirst
and next methods.

Figure 65: One-table scan. (Sciore, 2008)

Figure 66: Two-table scan. (Sciore, 2008)

package simpledb.query;

/**
 * The interface will be implemented by each query scan.
 * There is a Scan class for each relational
 * algebra operator.
 * @author Edward Sciore
 */
public interface Scan {

   /**
    * Positions the scan before its first record.
    */
   public void beforeFirst();

   /**
    * Moves the scan to the next record.
    * @return false if there is no next record
    */
   public boolean next();

   /**
    * Closes the scan and its subscans, if any.
    */
   public void close();

   /**
    * Returns the value of the specified field in the current record.
    * The value is expressed as a Constant.
    * @param fldname the name of the field
    * @return the value of that field, expressed as a Constant.
    */
   public Constant getVal(String fldname);

   /**
    * Returns the value of the specified integer field
    * in the current record.
    * @param fldname the name of the field
    * @return the field's integer value in the current record
    */
   public int getInt(String fldname);

   /**
    * Returns the value of the specified string field
    * in the current record.
    * @param fldname the name of the field
    * @return the field's string value in the current record
    */
   public String getString(String fldname);

   /**
    * Returns true if the scan has the specified field.
    * @param fldname the name of the field
    * @return true if the scan has that field
    */
   public boolean hasField(String fldname);
}

SimpleDB source file simpledb/query/TableScan.java

• A Table Scan just redirects its beforeFirst and next methods into the same
  methods for its underlying Record File rf.

package simpledb.query;

import static java.sql.Types.INTEGER;
import simpledb.tx.Transaction;
import simpledb.record.*;

/**
 * The Scan class corresponding to a table.
 * A table scan is just a wrapper for a RecordFile object;
 * most methods just delegate to the corresponding
 * RecordFile methods.
 * @author Edward Sciore
 */
public class TableScan implements UpdateScan {
   private RecordFile rf;
   private Schema sch;

   /**
    * Creates a new table scan,
    * and opens its corresponding record file.
    * @param ti the table's metadata
    * @param tx the calling transaction
    */
   public TableScan(TableInfo ti, Transaction tx) {
      rf  = new RecordFile(ti, tx);
      sch = ti.schema();
   }

   // Scan methods

   public void beforeFirst() {
      rf.beforeFirst();
   }

   public boolean next() {
      return rf.next();
   }

   public void close() {
      rf.close();
   }

   /**
    * Returns the value of the specified field, as a Constant.
    * The schema is examined to determine the field's type.
    * If INTEGER, then the record file's getInt method is called;
    * otherwise, the getString method is called.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (sch.type(fldname) == INTEGER)
         return new IntConstant(rf.getInt(fldname));
      else
         return new StringConstant(rf.getString(fldname));
   }

   public int getInt(String fldname) {
      return rf.getInt(fldname);
   }

   public String getString(String fldname) {
      return rf.getString(fldname);
   }

   public boolean hasField(String fldname) {
      return sch.hasField(fldname);
   }

   // UpdateScan methods

   /**
    * Sets the value of the specified field, as a Constant.
    * The schema is examined to determine the field's type.
    * If INTEGER, then the record file's setInt method is called;
    * otherwise, the setString method is called.
    * @see simpledb.query.UpdateScan#setVal(java.lang.String, simpledb.query.Constant)
    */
   public void setVal(String fldname, Constant val) {
      if (sch.type(fldname) == INTEGER)
         rf.setInt(fldname, (Integer)val.asJavaVal());
      else
         rf.setString(fldname, (String)val.asJavaVal());
   }

   public void setInt(String fldname, int val) {
      rf.setInt(fldname, val);
   }

   public void setString(String fldname, String val) {
      rf.setString(fldname, val);
   }

   public void delete() {
      rf.delete();
   }

   public void insert() {
      rf.insert();
   }

   public RID getRid() {
      return rf.currentRid();
   }

   public void moveToRid(RID rid) {
      rf.moveToRid(rid);
   }
}

SimpleDB source file simpledb/query/SelectScan.java


• A Selection Scan corresponds to the Relational Algebra operation select(s, pred).

• Again, its beforeFirst method just redirects the call to the corresponding method
  of its subscan s.

• But here its next method does more: it moves to the next row of its s for which
  its predicate pred is true.

• This is in turn determined (in the source file Term.java, not here) so that whenever
  its predicate mentions some Attribute A, the method call s.getVal(A) retrieves
  the corresponding Field Value from the current row of its s.
package simpledb.query;

import simpledb.record.*;

/**
 * The scan class corresponding to the <i>select</i> relational
 * algebra operator.
 * All methods except next delegate their work to the
 * underlying scan.
 * @author Edward Sciore
 */
public class SelectScan implements UpdateScan {
   private Scan s;
   private Predicate pred;

   /**
    * Creates a select scan having the specified underlying
    * scan and predicate.
    * @param s the scan of the underlying query
    * @param pred the selection predicate
    */
   public SelectScan(Scan s, Predicate pred) {
      this.s = s;
      this.pred = pred;
   }

   // Scan methods

   public void beforeFirst() {
      s.beforeFirst();
   }

   /**
    * Move to the next record satisfying the predicate.
    * The method repeatedly calls next on the underlying scan
    * until a suitable record is found, or the underlying scan
    * contains no more records.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      while (s.next())
         if (pred.isSatisfied(s))
            return true;
      return false;
   }

   public void close() {
      s.close();
   }

   public Constant getVal(String fldname) {
      return s.getVal(fldname);
   }

   public int getInt(String fldname) {
      return s.getInt(fldname);
   }

   public String getString(String fldname) {
      return s.getString(fldname);
   }

   public boolean hasField(String fldname) {
      return s.hasField(fldname);
   }

   // UpdateScan methods

   public void setVal(String fldname, Constant val) {
      UpdateScan us = (UpdateScan) s;
      us.setVal(fldname, val);
   }

   public void setInt(String fldname, int val) {
      UpdateScan us = (UpdateScan) s;
      us.setInt(fldname, val);
   }

   public void setString(String fldname, String val) {
      UpdateScan us = (UpdateScan) s;
      us.setString(fldname, val);
   }

   public void delete() {
      UpdateScan us = (UpdateScan) s;
      us.delete();
   }

   public void insert() {
      UpdateScan us = (UpdateScan) s;
      us.insert();
   }

   public RID getRid() {
      UpdateScan us = (UpdateScan) s;
      return us.getRid();
   }

   public void moveToRid(RID rid) {
      UpdateScan us = (UpdateScan) s;
      us.moveToRid(rid);
   }
}
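The way such scans nest into a tree can be sketched with a self-contained miniature (hypothetical code, not SimpleDB; only next and a single int field are modeled, but the filtering loop is the same while/if pattern as in SelectScan's next):

```java
import java.util.*;
import java.util.function.Predicate;

// A leaf scan over an in-memory list, wrapped by a select-style scan
// whose next() keeps pulling from its subscan until the predicate holds.
public class ScanPipeline {
    interface MiniScan {
        void beforeFirst();
        boolean next();
        int getInt();
    }

    static class ListScan implements MiniScan {          // leaf node
        private final List<Integer> rows;
        private int current = -1;
        ListScan(List<Integer> rows) { this.rows = rows; }
        public void beforeFirst() { current = -1; }
        public boolean next() { return ++current < rows.size(); }
        public int getInt() { return rows.get(current); }
    }

    static class FilterScan implements MiniScan {        // internal node
        private final MiniScan s;
        private final Predicate<Integer> pred;
        FilterScan(MiniScan s, Predicate<Integer> pred) { this.s = s; this.pred = pred; }
        public void beforeFirst() { s.beforeFirst(); }
        public boolean next() {
            while (s.next())                 // pull rows from the subscan
                if (pred.test(s.getInt()))   // until one satisfies pred
                    return true;
            return false;
        }
        public int getInt() { return s.getInt(); }
    }

    // The standard consumption loop, applied to the root of the tree.
    static List<Integer> evaluate(MiniScan root) {
        List<Integer> result = new ArrayList<>();
        root.beforeFirst();
        while (root.next())
            result.add(root.getInt());
        return result;
    }
}
```

The caller only ever talks to the root; each next request travels down to the leaf, and the surviving rows travel back up, just as in the SimpleDB Scans above.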

SimpleDB source file simpledb/query/ProjectScan.java

• A Projection Scan corresponds to operation project(s, Attributes) of Relational
  Algebra.

• It does not actually compute anything about the rows of its subscan s, but just

  – redirects its beforeFirst and next methods into s
  – forgets, in its hasField method, all the other Attributes of its s which were
    not mentioned in this projection.

package simpledb.query;

import java.util.*;

/**
 * The scan class corresponding to the <i>project</i> relational
 * algebra operator.
 * All methods except hasField delegate their work to the
 * underlying scan.
 * @author Edward Sciore
 */
public class ProjectScan implements Scan {
   private Scan s;
   private Collection<String> fieldlist;

   /**
    * Creates a project scan having the specified
    * underlying scan and field list.
    * @param s the underlying scan
    * @param fieldlist the list of field names
    */
   public ProjectScan(Scan s, Collection<String> fieldlist) {
      this.s = s;
      this.fieldlist = fieldlist;
   }

   public void beforeFirst() {
      s.beforeFirst();
   }

   public boolean next() {
      return s.next();
   }

   public void close() {
      s.close();
   }

   public Constant getVal(String fldname) {
      if (hasField(fldname))
         return s.getVal(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   public int getInt(String fldname) {
      if (hasField(fldname))
         return s.getInt(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   public String getString(String fldname) {
      if (hasField(fldname))
         return s.getString(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   /**
    * Returns true if the specified field
    * is in the projection list.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return fieldlist.contains(fldname);
   }
}

SimpleDB source file simpledb/query/ProductScan.java

• A Product Scan corresponds to the Relational Algebra operation product(s1,s2).

• If the RDBMS constructed its result eagerly, it would use the following 2 nested for loops:

1 for each row r1 of the subScan s1
2    for each row r2 of the subScan s2
3       output the row r with the same Fields and Values as r1 and r2.

• However, since the RDBMS constructs its result lazily in a pipelined fashion, one r at a time, it unrolls these loops across successive next calls.

• Let us introduce the following 2 variables for this unrolling:

v1 = “Does s1 have a valid current row?”
v2 = the same for s2.

• Algorithm and program design principle 1:

  ① When you introduce a new variable, describe what it stands for.
  ② Then write your code to fulfill your description.

The competent programmer is fully aware of the strictly limited size of his
own skull; therefore he approaches the programming task in full humility,
and among other things he avoids clever tricks like the plague. —
E.W. Dijkstra

• The unrolled beforeFirst method becomes the initialization

1 s1 .beforeFirst( );
2 v1 = s1 .next( );
3 s2 .beforeFirst( ).

The first call to next will initialize v2 .

• This next method is

1 v2 = s2.next( );
2 if not v2
3    v1 = s1.next( );
4    s2.beforeFirst( );
5    v2 = s2.next( );
6 return v1 and v2.

• The lecturer (MN) suspects a bug in the SimpleDB next method!

– What if the very first call of s1 .next( ) already returns false in the beforeFirst
method?
– That is, what if s1 is empty?
– Adding this v1 variable handles that.

• The get and set methods redirect their calls to the correct subScan s1 or s2 depending on which of them contains this fieldname.
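The suspected bug can be demonstrated with a tiny in-memory model of the scans. The ListScan class and integer rows below are illustrative assumptions, not SimpleDB code; MiniProductScan transcribes only the beforeFirst/next logic of the listing that follows:

```java
import java.util.List;

public class ProductScanBugDemo {
    // A minimal scan over an in-memory list of int rows.
    static class ListScan {
        private final List<Integer> rows;
        private int pos = -1;
        ListScan(List<Integer> rows) { this.rows = rows; }
        void beforeFirst() { pos = -1; }
        boolean next() { return ++pos < rows.size(); }
    }

    // The beforeFirst/next logic of SimpleDB's ProductScan, transcribed.
    static class MiniProductScan {
        private final ListScan s1, s2;
        MiniProductScan(ListScan s1, ListScan s2) { this.s1 = s1; this.s2 = s2; }
        void beforeFirst() {
            s1.beforeFirst();
            s1.next();          // return value ignored -- the suspected bug
            s2.beforeFirst();
        }
        boolean next() {
            if (s2.next())
                return true;
            s2.beforeFirst();
            return s2.next() && s1.next();
        }
    }

    // Count the rows the product produces for the given inputs.
    public static int countRows(List<Integer> r1, List<Integer> r2) {
        MiniProductScan p = new MiniProductScan(new ListScan(r1), new ListScan(r2));
        p.beforeFirst();
        int n = 0;
        while (p.next())
            n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countRows(List.of(1, 2), List.of(10, 20))); // 4, as expected
        // An empty LHS should give an empty product (0 rows), but this logic yields 2:
        System.out.println(countRows(List.of(), List.of(10, 20)));
    }
}
```

With a nonempty s1 the row count is the correct 2 · 2 = 4, but with an empty s1 the first next() call still finds a row of s2 and wrongly returns true, confirming that tracking v1 explicitly is needed.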

package simpledb.query;

/**
 * The scan class corresponding to the <i>product</i> relational
 * algebra operator.
 * @author Edward Sciore
 */
public class ProductScan implements Scan {
   private Scan s1, s2;

   /**
    * Creates a product scan having the two underlying scans.
    * @param s1 the LHS scan
    * @param s2 the RHS scan
    */
   public ProductScan(Scan s1, Scan s2) {
      this.s1 = s1;
      this.s2 = s2;
      s1.next();
   }

   /**
    * Positions the scan before its first record.
    * In other words, the LHS scan is positioned at
    * its first record, and the RHS scan
    * is positioned before its first record.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      s1.beforeFirst();
      s1.next();
      s2.beforeFirst();
   }

   /**
    * Moves the scan to the next record.
    * The method moves to the next RHS record, if possible.
    * Otherwise, it moves to the next LHS record and the
    * first RHS record.
    * If there are no more LHS records, the method returns false.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      if (s2.next())
         return true;
      else {
         s2.beforeFirst();
         return s2.next() && s1.next();
      }
   }

   /**
    * Closes both underlying scans.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      s1.close();
      s2.close();
   }

   /**
    * Returns the value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (s1.hasField(fldname))
         return s1.getVal(fldname);
      else
         return s2.getVal(fldname);
   }

   /**
    * Returns the integer value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      if (s1.hasField(fldname))
         return s1.getInt(fldname);
      else
         return s2.getInt(fldname);
   }

   /**
    * Returns the string value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      if (s1.hasField(fldname))
         return s1.getString(fldname);
      else
         return s2.getString(fldname);
   }

   /**
    * Returns true if the specified field is in
    * either of the underlying scans.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return s1.hasField(fldname) || s2.hasField(fldname);
   }
}

Extending Scans

• SimpleDB does not have a Scan for the extend(s,Expr,AttrName) operation of Relational Algebra.

• Its main methods could be as follows:

beforeFirst method could simply call the beforeFirst method of its subScan s.
next method could simply call the next method of its subScan s.
hasField method could

1 return fldname = AttrName or
2        s.hasField(fldname)

to introduce this new named Field.
getVal methods could do the actual work with

1 if fldname = AttrName
2    return the current value of this Expression
3 else return s.getVal(fldname)

• The current value of this Expression on line 2 means its value on the current row of the subScan s: whenever this Expression mentions some fieldname, its value is retrieved with s.getVal(fieldname), as in Selection Scans.
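The methods sketched above can be made concrete with a small self-contained model. The MiniScan interface, the in-memory ListScan, and the integer-valued expression are illustrative assumptions standing in for SimpleDB's real Scan, TableScan and Expression types:

```java
import java.util.*;
import java.util.function.Function;

public class ExtendScanDemo {
    // Minimal stand-in for SimpleDB's Scan interface (illustration only).
    interface MiniScan {
        void beforeFirst();
        boolean next();
        int getVal(String fldname);
        boolean hasField(String fldname);
    }

    // A scan over an in-memory list of rows (field name -> int value).
    static class ListScan implements MiniScan {
        private final List<Map<String, Integer>> rows;
        private int pos = -1;
        ListScan(List<Map<String, Integer>> rows) { this.rows = rows; }
        public void beforeFirst() { pos = -1; }
        public boolean next() { return ++pos < rows.size(); }
        public int getVal(String f) { return rows.get(pos).get(f); }
        public boolean hasField(String f) { return !rows.isEmpty() && rows.get(0).containsKey(f); }
    }

    // Sketch of extend(s, Expr, AttrName): the new field is computed lazily
    // from the current row of the underlying scan; all else is delegated.
    static class ExtendScan implements MiniScan {
        private final MiniScan s;
        private final String attrName;
        private final Function<MiniScan, Integer> expr;
        ExtendScan(MiniScan s, Function<MiniScan, Integer> expr, String attrName) {
            this.s = s; this.expr = expr; this.attrName = attrName;
        }
        public void beforeFirst() { s.beforeFirst(); }
        public boolean next() { return s.next(); }
        public boolean hasField(String f) { return f.equals(attrName) || s.hasField(f); }
        public int getVal(String f) {
            return f.equals(attrName) ? expr.apply(s) : s.getVal(f);
        }
    }

    // Extend a two-row scan with a derived field grade2 = 2 * grade.
    public static List<Integer> doubledGrades() {
        MiniScan ext = new ExtendScan(
            new ListScan(List.of(Map.of("grade", 3), Map.of("grade", 5))),
            sc -> 2 * sc.getVal("grade"), "grade2");
        List<Integer> out = new ArrayList<>();
        ext.beforeFirst();
        while (ext.next())
            out.add(ext.getVal("grade2"));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(doubledGrades()); // [6, 10]
    }
}
```

Note how the extended field costs nothing until someone actually asks for it with getVal, which is exactly the pipelined discipline of the other Scans.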

Sorting Scans

• SimpleDB does not contain a Scan for the sort(s,AttrList) operation of Relational Algebra.

• This pipelined query execution would not be very good for sorting its output:

  – The first next call must produce the smallest row in the output of its subScan s (wrt. the lexicographic order on AttrList). . .
  – . . . but how can it know which is its smallest row without examining all its rows?
  – There are special cases where it can be known, but in general it cannot.

• One solution would be to rescan the whole output of s whenever the next row is requested, to find the next row larger than the most recently returned one.

• A better solution is to trade space for time and materialize the whole output of s once and for all:

s.materialize( ): 1 temp = CREATE a new initially empty database Table
                     with the same Schema as s;
                  2 s.beforeFirst( );
                  3 while s.next( )
                  4    insert a copy of the current row of s into temp;
                  5 s.close( );
                  6 return temp.

Filling this new temporary table on line 4. . .

  – does not need Locking, because other Transactions do not know that it exists.
  – needs only its currently last Block in one RAM Buffer, because its earlier Blocks can be stored in the corresponding temporary disk File.

• Then this sorting Scan could have the methods

beforeFirst:

  1 if its temporary Table t does not exist yet
  2    t = s.materialize( );
  3    sort the Records in Table t
       lexicographically according to AttrList;
  4 t.beforeFirst( ).

  The sorting algorithm on line 3 is external:
  – It sorts the big disk File containing t. . .
  – by using only a small number of RAM Buffers (as many as the RDBMS can afford without slowing down other Transactions too much). . .
  – while optimizing disk I/O.

next: return t.next( ).
getVal(fldname): return t.getVal(fldname).
close: DROP Table t and delete its disk File.
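The materialize-then-sort idea can be sketched in memory. A real sorting Scan would run an external sort over disk blocks; the list-of-maps rows and the integer-only fields below are illustrative assumptions that keep the lexicographic comparison on AttrList visible:

```java
import java.util.*;

public class SortScanSketch {
    // Sort materialized rows lexicographically on the given attribute list,
    // mirroring sort(s, AttrList); a row is a field-name -> value map.
    public static List<Map<String, Integer>> sortRows(
            List<Map<String, Integer>> materialized, List<String> attrList) {
        List<Map<String, Integer>> temp = new ArrayList<>(materialized); // "temp table"
        temp.sort((r1, r2) -> {
            for (String a : attrList) {              // compare attribute by attribute
                int c = Integer.compare(r1.get(a), r2.get(a));
                if (c != 0) return c;
            }
            return 0;                                // equal on all sort attributes
        });
        return temp;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> rows = List.of(
            Map.of("year", 2004, "grade", 2),
            Map.of("year", 2003, "grade", 5),
            Map.of("year", 2003, "grade", 1));
        // Sorted on (year, grade): (2003,1), (2003,5), (2004,2).
        System.out.println(sortRows(rows, List.of("year", "grade")));
    }
}
```

After this one-time sort, next and getVal become trivial delegations to the sorted temporary table, just as the pseudocode above describes.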

Antijoin Scans

• Another complicated Scan missing from SimpleDB is antijoin(s1,s2,pred) – those rows r1 of subScan s1 which do not have any row r2 of subScan s2 such that r1 and r2 would make this predicate true.

• Its naïve eager algorithm would then be

1 for each row r1 of s1
2    match = false;
3    for each row r2 of s2
4       match = match or pred;
5    if not match
6       output row r1.

• The beforeFirst method in its unrolled pipelined form becomes

1 s1.beforeFirst( ).

• The next method becomes

1 repeat
2    v1 = s1.next( );
3    match = false;
4    s2.beforeFirst( );
5    repeat
6       v2 = v1 and s2.next( );
7       match = v2 and pred
8    until (not v2) or match
9 until (not v1) or (not match);
10 return v1.

• Algorithm and program design principle 2: When you design a loop, describe its

invariant: the defining property of the loop, which holds whenever the loop test is checked
bound: how the execution of the loop body progresses towards its termination

so that after the loop its invariant and the current status of its loop test together give what we wanted to achieve with it.

• The inner repeat loop on lines 3–8 has the

invariant:
  – v2 = “Are the current rows of s1 (as told by v1 on line 6) and s2 valid?”
  – match = “Do these valid rows satisfy this predicate?” (where this matching predicate is evaluated on line 7 by getting the appropriate Values from the current rows of s1 or s2).
  – none of the previous rows of s2 has matched.
bound: the current row of s2 advances towards its end, where it is no longer valid.

• This invariant must be verified to hold

both before the first execution of the loop test on line 8
and during its execution, by showing the implication “if it holds in the beginning of the loop body, then it holds also in its end”.

• Hence we can verify this loop to achieve

    match if and only if s2 contains a matching row for the current row of s1.    (17)

• The outer repeat loop on lines 1–9 has in turn the

invariant: all of the rows of s1
  – after the current row at the beginning of this next call, but
  – before the current row now
are valid (as told by v1) and have a matching row in s2 – for this we use the result (17) of the inner loop as a lemma.
bound: the current row of s1 advances towards its end

so after the loop we have achieved that

either the current row of s1 is no longer valid (as told by v1)
or it is the next row without a matching row in s2.

• Because s2 is Scanned repeatedly, it makes sense to

① first materialize it with

    if its temporary Table t does not exist yet
       t = s2.materialize( )

  during beforeFirst, and
② then Scan this t instead of s2 in the inner repeat loop on lines 3–8.
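The eager algorithm and its pipelined unrolling can be checked against each other on in-memory data. The integer rows and the equality predicate below are illustrative assumptions; the nested do-while loops transcribe the repeat-until structure above (repeat ... until C becomes do {...} while (!C)):

```java
import java.util.*;

public class AntijoinSketch {
    // Naive eager antijoin: keep each r1 of s1 that matches no r2 of s2.
    // The predicate here is plain equality, chosen for illustration.
    public static List<Integer> antijoinEager(List<Integer> s1, List<Integer> s2) {
        List<Integer> out = new ArrayList<>();
        for (int r1 : s1) {
            boolean match = false;
            for (int r2 : s2)
                match = match || (r1 == r2);
            if (!match)
                out.add(r1);
        }
        return out;
    }

    // The same antijoin unrolled into pipelined next() steps (v1, v2, match).
    static class AntijoinScan {
        private final List<Integer> s1, s2;
        private int i1 = -1, i2 = -1;
        AntijoinScan(List<Integer> s1, List<Integer> s2) { this.s1 = s1; this.s2 = s2; }
        private boolean next1() { return ++i1 < s1.size(); }
        private boolean next2() { return ++i2 < s2.size(); }
        boolean next() {
            boolean v1, v2, match;
            do {
                v1 = next1();
                match = false;
                i2 = -1;                                         // s2.beforeFirst()
                do {
                    v2 = v1 && next2();
                    match = v2 && s1.get(i1).equals(s2.get(i2)); // pred, only if v2
                } while (v2 && !match);   // until (not v2) or match
            } while (v1 && match);        // until (not v1) or (not match)
            return v1;
        }
        int current() { return s1.get(i1); }
    }

    public static List<Integer> antijoinPipelined(List<Integer> s1, List<Integer> s2) {
        AntijoinScan a = new AntijoinScan(s1, s2);
        List<Integer> out = new ArrayList<>();
        while (a.next())
            out.add(a.current());
        return out;
    }

    public static void main(String[] args) {
        List<Integer> s1 = List.of(1, 2, 3, 4), s2 = List.of(2, 4);
        System.out.println(antijoinEager(s1, s2));     // [1, 3]
        System.out.println(antijoinPipelined(s1, s2)); // [1, 3]
    }
}
```

Both versions agree, which is exactly what the invariant argument above establishes in general.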

4.7.2 Update Scans


• An SQL UPDATE command like part (a) of Figure 67 defines an Update Scan.
(Note that it should use the enrollment Table instead of the student Table.)

• The previously discussed Query Scans provided methods for getting the named
Attribute Values of the current row.

• These Update Scans add methods for

– setting these Values


– manipulating the RID of the current row.

• A Query is updatable only if this concept of “the RID of my current row” makes
sense – if the RDBMS knows the exact Record s to modify.

• This is the technical meaning of an “updatable view” discussed in section 2.4 – a single Query can be considered as a temporary view definition.

• In SimpleDB, a

Table Scan is always updatable (because it always has a current RID) and
Selection Scan select(s,p) is updatable, if its subScan s is too (because then its
next RID is the next RID of s satisfying p, if any)

but no other kinds of Scans are updatable.

• For instance, a Projection Scan project(s,...) could be made updatable too, if s is.

• In contrast, it would be difficult to make Product Scans updatable:

  – Each row r in the output of product(s1,s2) combines two rows: r1 from its subScan s1 and r2 from s2.
  – Even if these r1 and r2 had RIDs, what would be the RID of r?
  – If we wanted to update some Attribute Value r1.a to have a value which depends on another Attribute Value r2.b (which is why we would like to UPDATE this product at all), what would this mean? Which Value(s) of b would we use for this r1.a?

Figure 67: SQL update command and scan. (Sciore, 2008)

SimpleDB source file simpledb/query/UpdateScan.java


package simpledb.query;

import simpledb.record.RID;

/**
 * The interface implemented by all updateable scans.
 * @author Edward Sciore
 */
public interface UpdateScan extends Scan {
   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new value, expressed as a Constant
    */
   public void setVal(String fldname, Constant val);

   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new integer value
    */
   public void setInt(String fldname, int val);

   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new string value
    */
   public void setString(String fldname, String val);

   /**
    * Inserts a new record somewhere in the scan.
    */
   public void insert();

   /**
    * Deletes the current record from the scan.
    */
   public void delete();

   /**
    * Returns the RID of the current record.
    * @return the RID of the current record
    */
   public RID getRid();

   /**
    * Positions the scan so that the current record has
    * the specified RID.
    * @param rid the RID of the desired record
    */
   public void moveToRid(RID rid);
}

4.7.3 Plans
• Each Scan tells one way how a particular Query can be executed.
• A Plan is otherwise similar to a Scan, but it tells instead roughly how much it would
cost to execute this Scan.
• Keeping Plans and Scans separate in this way makes it easier for the RDBMS to
offer many alternative implementations for the same Relational Algebra operation.
• The Planner component of the RDBMS builds many different Plans for the user’s
Query Q and compares their costs.
• Once this component finds a cheap Plan P for Q, the RDBMS opens this P into
the corresponding Scan S and executes S.

Figure 68: Some cost formulas. (Sciore, 2008)

• The “currency” of these cost estimations is essentially the amount of disk I/O in
the Scan – because that is the central measure of RDBMS performance.

• SimpleDB uses the following cost estimates for a Scan s:

B(s): the number of Block accesses required to construct the output of s
R(s): the number of Records in this output
V(s,F): the number of distinct Values in this Field F of this output.

• Their computation proceeds inductively/recursively on the structure of the expression tree of s:

  – If s Scans a stored database Table T, then its B(T), R(T) and V(T,F) values are the statistical metadata which SimpleDB has collected about T in section 4.6.
  – If s is another kind of Scan like product(s1,s2), then we can compute its B(s), R(s) and V(s,F) values from the B(s1), R(s1), V(s1,F), B(s2), R(s2) and V(s2,F) obtained by recursion from its two subScans s1 and s2, using the cost equations for the product Relational Algebra operation.

• Figure 68 gives these cost equations for the 3 main Relational Algebra operations.

SimpleDB source file simpledb/query/Plan.java

• Here is the Plan interface, where the cost component

B is the method blocksAccessed
R is the method recordsOutput
V is the method distinctValues.
package simpledb.query;

import simpledb.record.Schema;

/**
 * The interface implemented by each query plan.
 * There is a Plan class for each relational algebra operator.
 * @author Edward Sciore
 */
public interface Plan {

   /**
    * Opens a scan corresponding to this plan.
    * The scan will be positioned before its first record.
    * @return a scan
    */
   public Scan open();

   /**
    * Returns an estimate of the number of block accesses
    * that will occur when the scan is read to completion.
    * @return the estimated number of block accesses
    */
   public int blocksAccessed();

   /**
    * Returns an estimate of the number of records
    * in the query's output table.
    * @return the estimated number of output records
    */
   public int recordsOutput();

   /**
    * Returns an estimate of the number of distinct values
    * for the specified field in the query's output table.
    * @param fldname the name of a field
    * @return the estimated number of distinct field values in the output
    */
   public int distinctValues(String fldname);

   /**
    * Returns the schema of the query.
    * @return the query's schema
    */
   public Schema schema();
}

SimpleDB source file simpledb/query/TablePlan.java

• A Table Plan implements the Plan interface by querying its own metadata.

• This leads to the equations in Figure 68.


package simpledb.query;

import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.metadata.*;
import simpledb.record.*;

/** The Plan class corresponding to a table.
 * @author Edward Sciore
 */
public class TablePlan implements Plan {
   private Transaction tx;
   private TableInfo ti;
   private StatInfo si;

   /**
    * Creates a leaf node in the query tree corresponding
    * to the specified table.
    * @param tblname the name of the table
    * @param tx the calling transaction
    */
   public TablePlan(String tblname, Transaction tx) {
      this.tx = tx;
      ti = SimpleDB.mdMgr().getTableInfo(tblname, tx);
      si = SimpleDB.mdMgr().getStatInfo(tblname, ti, tx);
   }

   /**
    * Creates a table scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      return new TableScan(ti, tx);
   }

   /**
    * Estimates the number of block accesses for the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return si.blocksAccessed();
   }

   /**
    * Estimates the number of records in the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return si.recordsOutput();
   }

   /**
    * Estimates the number of distinct field values in the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return si.distinctValues(fldname);
   }

   /**
    * Determines the schema of the table,
    * which is obtainable from the catalog manager.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return ti.schema();
   }
}

SimpleDB source file simpledb/query/SelectPlan.java

• A Selection Plan implements the Plan interface as follows, where the corresponding Scan is s = select(s1,pred).

• Consider first the equation for B(s) in Figure 68.

  – Computing the answer for s requires executing the subScan s1 and selecting those rows which satisfy this predicate.
  – Hence it involves accessing the same Blocks as s1, and therefore

      B(s) = B(s1)

    as in Figure 68.

• Consider then the equation for R(s) when the predicate compares the Value of an Attribute A with a constant c.

  – Assume for simplicity that each of the V(s1,A) distinct Values for A occurs roughly equally often.
  – We are calculating estimates, because exact values would be about as hard to obtain as executing the query itself.
  – This gives the equation in Figure 68.

• Consider then the equation for R(s) when the predicate compares the Values of 2 Attributes A and B.

  – Assume also for B the same simplicity as for A.
  – It makes sense to assume that the Values of A and B are related:
    ∗ We assume that if V(s1,A) < V(s1,B) then every Value of A does appear in B too, and vice versa.
    ∗ This holds for instance when A is a foreign key referencing B.
  – This and the simplicity assumption above lead to the equation in Figure 68.

• Consider then the equation for V(s,F) for the constant c.

  – If F = A then this selection reduces its Values into just 1, namely this c.
  – If F ≠ A then we can use the inductive count V(s1,F) directly. . .
  – . . . but if the output of s has fewer rows than this, then it has only as many distinct Values left.
  – This leads to the equation in Figure 68.

• Consider finally the equation for V(s,F) for 2 Attributes A and B.

  – If F ≠ A and F ≠ B then we can reason as in the preceding case.
  – If F = A or F = B then the result has at most as many distinct values as the smaller of the two.
  – This leads to the equation in Figure 68.

package simpledb.query;

import simpledb.record.Schema;

/** The Plan class corresponding to the <i>select</i>
 * relational algebra operator.
 * @author Edward Sciore
 */
public class SelectPlan implements Plan {
   private Plan p;
   private Predicate pred;

   /**
    * Creates a new select node in the query tree,
    * having the specified subquery and predicate.
    * @param p the subquery
    * @param pred the predicate
    */
   public SelectPlan(Plan p, Predicate pred) {
      this.p = p;
      this.pred = pred;
   }

   /**
    * Creates a select scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p.open();
      return new SelectScan(s, pred);
   }

   /**
    * Estimates the number of block accesses in the selection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p.blocksAccessed();
   }

   /**
    * Estimates the number of output records in the selection,
    * which is determined by the
    * reduction factor of the predicate.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p.recordsOutput() / pred.reductionFactor(p);
   }

   /**
    * Estimates the number of distinct field values
    * in the projection.
    * If the predicate contains a term equating the specified
    * field to a constant, then this value will be 1.
    * Otherwise, it will be the number of the distinct values
    * in the underlying query
    * (but not more than the size of the output table).
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (pred.equatesWithConstant(fldname) != null)
         return 1;
      else {
         String fldname2 = pred.equatesWithField(fldname);
         if (fldname2 != null)
            return Math.min(p.distinctValues(fldname),
                            p.distinctValues(fldname2));
         else
            return Math.min(p.distinctValues(fldname),
                            recordsOutput());
      }
   }

   /**
    * Returns the schema of the selection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return p.schema();
   }
}
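The selection estimates can be checked with small numbers. The formulas below follow the reasoning above; the table sizes are made-up illustration values, not SimpleDB statistics:

```java
public class SelectEstimates {
    // R(select(s1, A = c)) = R(s1) / V(s1, A): each of the V(s1,A) distinct
    // values is assumed to occur equally often.
    public static int recordsEqConstant(int r1, int vA) {
        return r1 / vA;
    }

    // R(select(s1, A = B)) = R(s1) / max(V(s1,A), V(s1,B)): every value of the
    // smaller-domain attribute is assumed to appear in the other attribute too.
    public static int recordsEqField(int r1, int vA, int vB) {
        return r1 / Math.max(vA, vB);
    }

    public static void main(String[] args) {
        // A hypothetical student table: 45,000 records, 50 distinct major ids.
        System.out.println(recordsEqConstant(45000, 50)); // 900 students per major
        // Comparing the major id (50 values) with a dept id (40 values):
        System.out.println(recordsEqField(45000, 50, 40)); // 1125
    }
}
```

Such uniformity assumptions are crude, but they only need to be accurate enough to rank alternative Plans against each other.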

SimpleDB source file simpledb/query/ProjectPlan.java

• A Projection Plan implements the Plan interface by redirecting the cost methods
into the subPlan.

• The equations in Figure 68 do the same.

• This is because the projection Relational Algebra operation just modifies the
Schema but the actual rows stay the same.

package simpledb.query;

import simpledb.record.Schema;
import java.util.Collection;

/** The Plan class corresponding to the <i>project</i>
 * relational algebra operator.
 * @author Edward Sciore
 */
public class ProjectPlan implements Plan {
   private Plan p;
   private Schema schema = new Schema();

   /**
    * Creates a new project node in the query tree,
    * having the specified subquery and field list.
    * @param p the subquery
    * @param fieldlist the list of fields
    */
   public ProjectPlan(Plan p, Collection<String> fieldlist) {
      this.p = p;
      for (String fldname : fieldlist)
         schema.add(fldname, p.schema());
   }

   /**
    * Creates a project scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p.open();
      return new ProjectScan(s, schema.fields());
   }

   /**
    * Estimates the number of block accesses in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p.blocksAccessed();
   }

   /**
    * Estimates the number of output records in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p.recordsOutput();
   }

   /**
    * Estimates the number of distinct field values
    * in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return p.distinctValues(fldname);
   }

   /**
    * Returns the schema of the projection,
    * which is taken from the field list.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return schema;
   }
}

SimpleDB source file simpledb/query/ProductPlan.java

• A Product Plan implements the Plan interface as in the equations in Figure 68, where the corresponding Scan is s = product(s1,s2).

• The value V(s,F) is V(si,F) for the subPlan si whose Schema has this F.

• The output consists of R(s1) · R(s2) Records.

• The equation

      B(s) = B(s1) + R(s1) · B(s2)    (18)

(the first term counts the outer loop over s1, the second the repeated inner scans of s2) for this resource-consuming Relational Algebra product operation motivates these cost functions:

  – We are interested in the B-estimates, but. . .
  – this Eq. (18) requires R-estimates,. . .
  – which in turn benefit from better V-estimates.

• Eq. (18) is not symmetric even though product itself is:

      B(product(s1,s2)) ≠ B(product(s2,s1)).

• Rewriting

      R(s1) = RPB(s1) · B(s1)

where

      RPB(s1) = the number of rows output per Block read in s1

reveals that B(product(s1,s2)) is smaller when

      RPB(s1) < RPB(s2).    (19)

• If s1 and s2 are Tables, then Eq. (19) says that their product is cheaper if the Table with larger Records comes first.
package simpledb.query;

import simpledb.record.Schema;

/** The Plan class corresponding to the <i>product</i>
 * relational algebra operator.
 * @author Edward Sciore
 */
public class ProductPlan implements Plan {
   private Plan p1, p2;
   private Schema schema = new Schema();

   /**
    * Creates a new product node in the query tree,
    * having the two specified subqueries.
    * @param p1 the left-hand subquery
    * @param p2 the right-hand subquery
    */
   public ProductPlan(Plan p1, Plan p2) {
      this.p1 = p1;
      this.p2 = p2;
      schema.addAll(p1.schema());
      schema.addAll(p2.schema());
   }

   /**
    * Creates a product scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s1 = p1.open();
      Scan s2 = p2.open();
      return new ProductScan(s1, s2);
   }

   /**
    * Estimates the number of block accesses in the product.
    * The formula is:
    * <pre> B(product(p1,p2)) = B(p1) + R(p1)*B(p2) </pre>
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p1.blocksAccessed() + (p1.recordsOutput() * p2.blocksAccessed());
   }

   /**
    * Estimates the number of output records in the product.
    * The formula is:
    * <pre> R(product(p1,p2)) = R(p1)*R(p2) </pre>
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p1.recordsOutput() * p2.recordsOutput();
   }

   /**
    * Estimates the distinct number of field values in the product.
    * Since the product does not increase or decrease field values,
    * the estimate is the same as in the appropriate underlying query.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (p1.schema().hasField(fldname))
         return p1.distinctValues(fldname);
      else
         return p2.distinctValues(fldname);
   }

   /**
    * Returns the schema of the product,
    * which is the union of the schemas of the underlying queries.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return schema;
   }
}

• Figures 69 and 70 give an example of calculating the cost of retrieving the math
  majors’ names.

  (a) gives the Query tree which determines the Scan and Plan to consider.
  (b) gives the SimpleDB client method calls which it would execute.
  (c) gives its cost in our University example.

• Eq. (19) says that it would have been better to swap s1 and s3 in s4.

4.7.4 Predicates

• We have until now skipped the SimpleDB implementation of selection predicates.

• SimpleDB supports only conjunctions (that is, ANDs) of Terms, where each Term
  is one of

  – AttrName = AttrName or
  – AttrName = constant.

• Full SQL offers much more detailed predicates in its WHERE parts.

Figure 69: Cost estimation example. (Sciore, 2008)

Figure 70: Figure 69 continued. (Sciore, 2008)

• This predicate handling involves a lot of code, which the

  – Parser component of the RDBMS invokes when it parses the WHERE part of an
    SQL statement into the corresponding predicate, and which the
  – Query and Planner components invoke when they process this predicate
    constructed by the Parser.

SimpleDB source file simpledb/query/Constant.java


package simpledb.query;

/**
 * The interface that denotes values stored in the database.
 * @author Edward Sciore
 */
public interface Constant extends Comparable<Constant> {

   /**
    * Returns the Java object corresponding to this constant.
    * @return the Java value of the constant
    */
   public Object asJavaVal();
}

SimpleDB source file simpledb/query/StringConstant.java


package simpledb.query;

/**
 * The class that wraps Java strings as database constants.
 * @author Edward Sciore
 */
public class StringConstant implements Constant {
   private String val;

   /**
    * Create a constant by wrapping the specified string.
    * @param s the string value
    */
   public StringConstant(String s) {
      val = s;
   }

   /**
    * Unwraps the string and returns it.
    * @see simpledb.query.Constant#asJavaVal()
    */
   public String asJavaVal() {
      return val;
   }

   public boolean equals(Object obj) {
      StringConstant sc = (StringConstant) obj;
      return sc != null && val.equals(sc.val);
   }

   public int compareTo(Constant c) {
      StringConstant sc = (StringConstant) c;
      return val.compareTo(sc.val);
   }

   public int hashCode() {
      return val.hashCode();
   }

   public String toString() {
      return val;
   }
}

SimpleDB source file simpledb/query/IntConstant.java


package simpledb.query;

/**
 * The class that wraps Java ints as database constants.
 * @author Edward Sciore
 */
public class IntConstant implements Constant {
   private Integer val;

   /**
    * Create a constant by wrapping the specified int.
    * @param n the int value
    */
   public IntConstant(int n) {
      val = new Integer(n);
   }

   /**
    * Unwraps the Integer and returns it.
    * @see simpledb.query.Constant#asJavaVal()
    */
   public Object asJavaVal() {
      return val;
   }

   public boolean equals(Object obj) {
      IntConstant ic = (IntConstant) obj;
      return ic != null && val.equals(ic.val);
   }

   public int compareTo(Constant c) {
      IntConstant ic = (IntConstant) c;
      return val.compareTo(ic.val);
   }

   public int hashCode() {
      return val.hashCode();
   }

   public String toString() {
      return val.toString();
   }
}

SimpleDB source file simpledb/query/ConstantExpression.java


package simpledb.query;

import simpledb.record.Schema;

/**
 * An expression consisting entirely of a single constant.
 * @author Edward Sciore
 */
public class ConstantExpression implements Expression {
   private Constant val;

   /**
    * Creates a new expression by wrapping a constant.
    * @param c the constant
    */
   public ConstantExpression(Constant c) {
      val = c;
   }

   /**
    * Returns true.
    * @see simpledb.query.Expression#isConstant()
    */
   public boolean isConstant() {
      return true;
   }

   /**
    * Returns false.
    * @see simpledb.query.Expression#isFieldName()
    */
   public boolean isFieldName() {
      return false;
   }

   /**
    * Unwraps the constant and returns it.
    * @see simpledb.query.Expression#asConstant()
    */
   public Constant asConstant() {
      return val;
   }

   /**
    * This method should never be called.
    * Throws a ClassCastException.
    * @see simpledb.query.Expression#asFieldName()
    */
   public String asFieldName() {
      throw new ClassCastException();
   }

   /**
    * Returns the constant, regardless of the scan.
    * @see simpledb.query.Expression#evaluate(simpledb.query.Scan)
    */
   public Constant evaluate(Scan s) {
      return val;
   }

   /**
    * Returns true, because a constant applies to any schema.
    * @see simpledb.query.Expression#appliesTo(simpledb.record.Schema)
    */
   public boolean appliesTo(Schema sch) {
      return true;
   }

   public String toString() {
      return val.toString();
   }
}

SimpleDB source file simpledb/query/Expression.java


package simpledb.query;

import simpledb.record.Schema;

/**
 * The interface corresponding to SQL expressions.
 * @author Edward Sciore
 */
public interface Expression {

   /**
    * Returns true if the expression is a constant.
    * @return true if the expression is a constant
    */
   public boolean isConstant();

   /**
    * Returns true if the expression is a field reference.
    * @return true if the expression denotes a field
    */
   public boolean isFieldName();

   /**
    * Returns the constant corresponding to a constant expression.
    * Throws an exception if the expression does not
    * denote a constant.
    * @return the expression as a constant
    */
   public Constant asConstant();

   /**
    * Returns the field name corresponding to a field name expression.
    * Throws an exception if the expression does not
    * denote a field.
    * @return the expression as a field name
    */
   public String asFieldName();

   /**
    * Evaluates the expression with respect to the
    * current record of the specified scan.
    * @param s the scan
    * @return the value of the expression, as a Constant
    */
   public Constant evaluate(Scan s);

   /**
    * Determines if all of the fields mentioned in this expression
    * are contained in the specified schema.
    * @param sch the schema
    * @return true if all fields in the expression are in the schema
    */
   public boolean appliesTo(Schema sch);
}

SimpleDB source file simpledb/query/FieldNameExpression.java


package simpledb.query;

import simpledb.record.Schema;

/**
 * An expression consisting entirely of a single field.
 * @author Edward Sciore
 */
public class FieldNameExpression implements Expression {
   private String fldname;

   /**
    * Creates a new expression by wrapping a field.
    * @param fldname the name of the wrapped field
    */
   public FieldNameExpression(String fldname) {
      this.fldname = fldname;
   }

   /**
    * Returns false.
    * @see simpledb.query.Expression#isConstant()
    */
   public boolean isConstant() {
      return false;
   }

   /**
    * Returns true.
    * @see simpledb.query.Expression#isFieldName()
    */
   public boolean isFieldName() {
      return true;
   }

   /**
    * This method should never be called.
    * Throws a ClassCastException.
    * @see simpledb.query.Expression#asConstant()
    */
   public Constant asConstant() {
      throw new ClassCastException();
   }

   /**
    * Unwraps the field name and returns it.
    * @see simpledb.query.Expression#asFieldName()
    */
   public String asFieldName() {
      return fldname;
   }

   /**
    * Evaluates the field by getting its value in the scan.
    * @see simpledb.query.Expression#evaluate(simpledb.query.Scan)
    */
   public Constant evaluate(Scan s) {
      return s.getVal(fldname);
   }

   /**
    * Returns true if the field is in the specified schema.
    * @see simpledb.query.Expression#appliesTo(simpledb.record.Schema)
    */
   public boolean appliesTo(Schema sch) {
      return sch.hasField(fldname);
   }

   public String toString() {
      return fldname;
   }
}

SimpleDB source file simpledb/query/Term.java


package simpledb.query;

import simpledb.record.Schema;

/**
 * A term is a comparison between two expressions.
 * @author Edward Sciore
 */
public class Term {
   private Expression lhs, rhs;

   /**
    * Creates a new term that compares two expressions
    * for equality.
    * @param lhs the LHS expression
    * @param rhs the RHS expression
    */
   public Term(Expression lhs, Expression rhs) {
      this.lhs = lhs;
      this.rhs = rhs;
   }

   /**
    * Calculates the extent to which selecting on the term reduces
    * the number of records output by a query.
    * For example if the reduction factor is 2, then the
    * term cuts the size of the output in half.
    * @param p the query's plan
    * @return the integer reduction factor.
    */
   public int reductionFactor(Plan p) {
      String lhsName, rhsName;
      if (lhs.isFieldName() && rhs.isFieldName()) {
         lhsName = lhs.asFieldName();
         rhsName = rhs.asFieldName();
         return Math.max(p.distinctValues(lhsName),
                         p.distinctValues(rhsName));
      }
      if (lhs.isFieldName()) {
         lhsName = lhs.asFieldName();
         return p.distinctValues(lhsName);
      }
      if (rhs.isFieldName()) {
         rhsName = rhs.asFieldName();
         return p.distinctValues(rhsName);
      }
      // otherwise, the term equates constants
      if (lhs.asConstant().equals(rhs.asConstant()))
         return 1;
      else
         return Integer.MAX_VALUE;
   }

   /**
    * Determines if this term is of the form "F=c"
    * where F is the specified field and c is some constant.
    * If so, the method returns that constant.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the constant or null
    */
   public Constant equatesWithConstant(String fldname) {
      if (lhs.isFieldName() &&
          lhs.asFieldName().equals(fldname) &&
          rhs.isConstant())
         return rhs.asConstant();
      else if (rhs.isFieldName() &&
               rhs.asFieldName().equals(fldname) &&
               lhs.isConstant())
         return lhs.asConstant();
      else
         return null;
   }

   /**
    * Determines if this term is of the form "F1=F2"
    * where F1 is the specified field and F2 is another field.
    * If so, the method returns the name of that field.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the name of the other field, or null
    */
   public String equatesWithField(String fldname) {
      if (lhs.isFieldName() &&
          lhs.asFieldName().equals(fldname) &&
          rhs.isFieldName())
         return rhs.asFieldName();
      else if (rhs.isFieldName() &&
               rhs.asFieldName().equals(fldname) &&
               lhs.isFieldName())
         return lhs.asFieldName();
      else
         return null;
   }

   /**
    * Returns true if both of the term's expressions
    * apply to the specified schema.
    * @param sch the schema
    * @return true if both expressions apply to the schema
    */
   public boolean appliesTo(Schema sch) {
      return lhs.appliesTo(sch) && rhs.appliesTo(sch);
   }

   /**
    * Returns true if both of the term's expressions
    * evaluate to the same constant,
    * with respect to the specified scan.
    * @param s the scan
    * @return true if both expressions have the same value in the scan
    */
   public boolean isSatisfied(Scan s) {
      Constant lhsval = lhs.evaluate(s);
      Constant rhsval = rhs.evaluate(s);
      return rhsval.equals(lhsval);
   }

   public String toString() {
      return lhs.toString() + "=" + rhs.toString();
   }
}
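The symmetric null-or-value pattern of equatesWithConstant is worth isolating, since the planning code later relies on it to recognise selection Terms of the form F = c. The sketch below is self-contained and hypothetical: a String stands in for a field-name Expression and an Integer for an IntConstant, so the representation is an assumption made for illustration:

```java
// Toy version of Term.equatesWithConstant: a String stands in for a
// field-name Expression and an Integer for an IntConstant. Returns the
// constant equated with fldname, or null if the term has no such form.
public class TermSketch {

    public static Integer equatesWithConstant(Object lhs, Object rhs, String fldname) {
        if (lhs instanceof String && lhs.equals(fldname) && rhs instanceof Integer)
            return (Integer) rhs;
        if (rhs instanceof String && rhs.equals(fldname) && lhs instanceof Integer)
            return (Integer) lhs;
        return null;  // not of the form "fldname = constant"
    }

    public static void main(String[] args) {
        System.out.println(equatesWithConstant("gradyear", 2012, "gradyear")); // 2012
        System.out.println(equatesWithConstant(2012, "gradyear", "gradyear")); // 2012 (symmetric)
        System.out.println(equatesWithConstant("sid", "studentid", "sid"));    // null
    }
}
```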

SimpleDB source file simpledb/query/Predicate.java


package simpledb.query;

import simpledb.record.Schema;
import java.util.*;

/**
 * A predicate is a Boolean combination of terms.
 * @author Edward Sciore
 */
public class Predicate {
   private List<Term> terms = new ArrayList<Term>();

   /**
    * Creates an empty predicate, corresponding to "true".
    */
   public Predicate() {}

   /**
    * Creates a predicate containing a single term.
    * @param t the term
    */
   public Predicate(Term t) {
      terms.add(t);
   }

   /**
    * Modifies the predicate to be the conjunction of
    * itself and the specified predicate.
    * @param pred the other predicate
    */
   public void conjoinWith(Predicate pred) {
      terms.addAll(pred.terms);
   }

   /**
    * Returns true if the predicate evaluates to true
    * with respect to the specified scan.
    * @param s the scan
    * @return true if the predicate is true in the scan
    */
   public boolean isSatisfied(Scan s) {
      for (Term t : terms)
         if (!t.isSatisfied(s))
            return false;
      return true;
   }

   /**
    * Calculates the extent to which selecting on the predicate
    * reduces the number of records output by a query.
    * For example if the reduction factor is 2, then the
    * predicate cuts the size of the output in half.
    * @param p the query's plan
    * @return the integer reduction factor.
    */
   public int reductionFactor(Plan p) {
      int factor = 1;
      for (Term t : terms)
         factor *= t.reductionFactor(p);
      return factor;
   }

   /**
    * Returns the subpredicate that applies to the specified schema.
    * @param sch the schema
    * @return the subpredicate applying to the schema
    */
   public Predicate selectPred(Schema sch) {
      Predicate result = new Predicate();
      for (Term t : terms)
         if (t.appliesTo(sch))
            result.terms.add(t);
      if (result.terms.size() == 0)
         return null;
      else
         return result;
   }

   /**
    * Returns the subpredicate consisting of terms that apply
    * to the union of the two specified schemas,
    * but not to either schema separately.
    * @param sch1 the first schema
    * @param sch2 the second schema
    * @return the subpredicate whose terms apply to the union of the two schemas but not either schema separately.
    */
   public Predicate joinPred(Schema sch1, Schema sch2) {
      Predicate result = new Predicate();
      Schema newsch = new Schema();
      newsch.addAll(sch1);
      newsch.addAll(sch2);
      for (Term t : terms)
         if (!t.appliesTo(sch1) &&
             !t.appliesTo(sch2) &&
             t.appliesTo(newsch))
            result.terms.add(t);
      if (result.terms.size() == 0)
         return null;
      else
         return result;
   }

   /**
    * Determines if there is a term of the form "F=c"
    * where F is the specified field and c is some constant.
    * If so, the method returns that constant.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the constant or null
    */
   public Constant equatesWithConstant(String fldname) {
      for (Term t : terms) {
         Constant c = t.equatesWithConstant(fldname);
         if (c != null)
            return c;
      }
      return null;
   }

   /**
    * Determines if there is a term of the form "F1=F2"
    * where F1 is the specified field and F2 is another field.
    * If so, the method returns the name of that field.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return the name of the other field, or null
    */
   public String equatesWithField(String fldname) {
      for (Term t : terms) {
         String s = t.equatesWithField(fldname);
         if (s != null)
            return s;
      }
      return null;
   }

   public String toString() {
      Iterator<Term> iter = terms.iterator();
      if (!iter.hasNext())
         return "";
      String result = iter.next().toString();
      while (iter.hasNext())
         result += " and " + iter.next().toString();
      return result;
   }
}

Figure 71: An example of a parse tree. (Sciore, 2008)
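The reduction-factor arithmetic of Term and Predicate above can be seen in isolation. This self-contained sketch mimics how Predicate.reductionFactor multiplies the factors of its Terms, and how a Term of the form F = c contributes the factor V(s, F); the field names and V-values are invented for illustration:

```java
import java.util.*;

// Sketch of the reduction-factor arithmetic: a Term "F = c" divides the
// estimated output size by V(s, F), and a conjunctive Predicate multiplies
// the factors of its Terms. All statistics here are invented.
public class ReductionDemo {

    /** The factor of a Term "F = c" is V(s, F), the number of distinct F-values. */
    public static int termFactor(Map<String, Integer> distinctValues, String fldname) {
        return distinctValues.get(fldname);
    }

    public static void main(String[] args) {
        Map<String, Integer> v = new HashMap<>();
        v.put("majorid", 40);    // V(STUDENT, MajorId) = 40
        v.put("gradyear", 50);   // V(STUDENT, GradYear) = 50

        // Predicate "majorid = 10 and gradyear = 2012" multiplies the factors:
        int factor = termFactor(v, "majorid") * termFactor(v, "gradyear");
        System.out.println(factor);          // 2000
        System.out.println(45000 / factor);  // estimated records left: 22
    }
}
```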

4.8 Parsing SQL Statements

(Sciore, 2008, Chapter 18)

• Conceptually, the Parser component of the RDBMS translates an SQL statement
  from a string into a parse tree, as in Figure 71 for a predicate in a WHERE part.

• The “Basic Models of Computation” (“Laskennan perusmallit”, or LAP, in Finnish)
  course gave the principles of such translation.

• In particular, SQL has been designed so that hand-written recursive descent LL(1)
  parsing is enough.

• However, full SQL is so large that using a dedicated parser generator tool like yacc
  instead would be a good idea. (Levine et al., 1992, Appendix J)

• The SimpleDB subset of SQL was given in Figure 25. Its recursive descent parser
  is listed here.
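Recursive descent LL(1) parsing decides which grammar rule to apply by inspecting only the current token. The toy parser below shows these mechanics on the predicate rule <Predicate> := <Term> [ AND <Predicate> ] alone; it is an illustration that merely counts the Terms rather than building them, not the SimpleDB Parser, though it uses the same java.io.StreamTokenizer machinery as the Lexer listed below:

```java
import java.io.*;

// A toy recursive-descent parser for the rule
//     <Predicate> := <Term> [ AND <Predicate> ]
// where <Term> := <Token> '=' <Token>. It counts the Terms instead of
// building Term objects, but the one-token lookahead is the LL(1) idea.
public class PredicateSketch {
    private final StreamTokenizer tok;

    public PredicateSketch(String s) {
        tok = new StreamTokenizer(new StringReader(s));
        tok.lowerCaseMode(true);  // keywords are compared in lower case
        advance();                // load the first token (the lookahead)
    }

    private void advance() {
        try { tok.nextToken(); }
        catch (IOException e) { throw new RuntimeException(e); }
    }

    /** term := token '=' token (consumed blindly in this toy version). */
    private void term() {
        advance();  // consume the left-hand side
        advance();  // consume the '='
        advance();  // consume the right-hand side
    }

    /** predicate := term [ "and" predicate ]; returns the number of terms. */
    public int predicate() {
        term();
        if (tok.ttype == StreamTokenizer.TT_WORD && "and".equals(tok.sval)) {
            advance();               // eat the "and" keyword
            return 1 + predicate();  // recurse on the rest of the conjunction
        }
        return 1;
    }

    public static void main(String[] args) {
        System.out.println(new PredicateSketch("majorid = did and dname = 'math'").predicate()); // 2
    }
}
```

Note how the choice between stopping and recursing is made by looking at the single lookahead token, exactly as the matchXXX/eatXXX method pairs of the SimpleDB Lexer do.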
SimpleDB source file simpledb/parse/Lexer.java
package simpledb.parse;

import java.util.*;
import java.io.*;

/**
 * The lexical analyzer.
 * @author Edward Sciore
 */
public class Lexer {
   private Collection<String> keywords;
   private StreamTokenizer tok;

   /**
    * Creates a new lexical analyzer for SQL statements.
    * @param s the SQL statement
    */
   public Lexer(String s) {
      initKeywords();
      tok = new StreamTokenizer(new StringReader(s));
      tok.ordinaryChar('.');
      tok.lowerCaseMode(true); // ids and keywords are converted
      nextToken();
   }

// Methods to check the status of the current token

   /**
    * Returns true if the current token is
    * the specified delimiter character.
    * @param d a character denoting the delimiter
    * @return true if the delimiter is the current token
    */
   public boolean matchDelim(char d) {
      return d == (char)tok.ttype;
   }

   /**
    * Returns true if the current token is an integer.
    * @return true if the current token is an integer
    */
   public boolean matchIntConstant() {
      return tok.ttype == StreamTokenizer.TT_NUMBER;
   }

   /**
    * Returns true if the current token is a string.
    * @return true if the current token is a string
    */
   public boolean matchStringConstant() {
      return '\'' == (char)tok.ttype;
   }

   /**
    * Returns true if the current token is the specified keyword.
    * @param w the keyword string
    * @return true if that keyword is the current token
    */
   public boolean matchKeyword(String w) {
      return tok.ttype == StreamTokenizer.TT_WORD && tok.sval.equals(w);
   }

   /**
    * Returns true if the current token is a legal identifier.
    * @return true if the current token is an identifier
    */
   public boolean matchId() {
      return tok.ttype == StreamTokenizer.TT_WORD && !keywords.contains(tok.sval);
   }

// Methods to "eat" the current token

   /**
    * Throws an exception if the current token is not the
    * specified delimiter.
    * Otherwise, moves to the next token.
    * @param d a character denoting the delimiter
    */
   public void eatDelim(char d) {
      if (!matchDelim(d))
         throw new BadSyntaxException();
      nextToken();
   }

   /**
    * Throws an exception if the current token is not
    * an integer.
    * Otherwise, returns that integer and moves to the next token.
    * @return the integer value of the current token
    */
   public int eatIntConstant() {
      if (!matchIntConstant())
         throw new BadSyntaxException();
      int i = (int) tok.nval;
      nextToken();
      return i;
   }

   /**
    * Throws an exception if the current token is not
    * a string.
    * Otherwise, returns that string and moves to the next token.
    * @return the string value of the current token
    */
   public String eatStringConstant() {
      if (!matchStringConstant())
         throw new BadSyntaxException();
      String s = tok.sval; // constants are not converted to lower case
      nextToken();
      return s;
   }

   /**
    * Throws an exception if the current token is not the
    * specified keyword.
    * Otherwise, moves to the next token.
    * @param w the keyword string
    */
   public void eatKeyword(String w) {
      if (!matchKeyword(w))
         throw new BadSyntaxException();
      nextToken();
   }

   /**
    * Throws an exception if the current token is not
    * an identifier.
    * Otherwise, returns the identifier string
    * and moves to the next token.
    * @return the string value of the current token
    */
   public String eatId() {
      if (!matchId())
         throw new BadSyntaxException();
      String s = tok.sval;
      nextToken();
      return s;
   }

   private void nextToken() {
      try {
         tok.nextToken();
      }
      catch (IOException e) {
         throw new BadSyntaxException();
      }
   }

   private void initKeywords() {
      keywords = Arrays.asList("select", "from", "where", "and",
                               "insert", "into", "values", "delete", "update", "set",
                               "create", "table", "int", "varchar", "view", "as", "index", "on");
   }
}

SimpleDB source file simpledb/parse/Parser.java


package s i m p l e d b . p a r s e ;

import j a v a . u t i l . ∗ ;
import s i m p l e d b . q u e r y . ∗ ;
import s i m p l e d b . r e c o r d . Schema ;

/∗ ∗
∗ The SimpleDB p a r s e r .
∗ @ a u t h o r Edward S c i o r e
∗/
public c l a s s P a r s e r {
private Lexer l e x ;

public P a r s e r ( S t r i n g s ) {
l e x = new L e x e r ( s ) ;
}

// M e t h o d s for parsing predicates , terms , expressions , c o n s t a n t s , and fields

public S t r i n g f i e l d ( ) {
return l e x . e a t I d ( ) ;
}

public C o n s t a n t c o n s t a n t ( ) {
i f ( l e x . matchStringConstant ( ) )
return new S t r i n g C o n s t a n t ( l e x . e a t S t r i n g C o n s t a n t ( ) ) ;
else
return new I n t C o n s t a n t ( l e x . e a t I n t C o n s t a n t ( ) ) ;
}

public E x p r e s s i o n e x p r e s s i o n ( ) {
i f ( l e x . matchId ( ) )
return new F i e l d N a m e E x p r e s s i o n ( f i e l d ( ) ) ;
else
return new C o n s t a n t E x p r e s s i o n ( c o n s t a n t ( ) ) ;
}

public Term term ( ) {


Expression lhs = expression () ;
l e x . e a t D e l i m ( ’= ’ ) ;
Expression rhs = expression () ;
return new Term ( l h s , r h s ) ;
}

189
public P r e d i c a t e p r e d i c a t e ( ) {
P r e d i c a t e p r e d = new P r e d i c a t e ( term ( ) ) ;
i f ( l e x . matchKeyword ( ” and ” ) ) {
l e x . eatKeyword ( ” and ” ) ;
pred . conjoinWith ( p r e d i c a t e ( ) ) ;
}
return p r e d ;
}

// M e t h o d s for parsing queries

public QueryData q u e r y ( ) {
l e x . eatKeyword ( ” s e l e c t ” ) ;
C o l l e c t i o n <S t r i n g > f i e l d s = s e l e c t L i s t ( ) ;
l e x . eatKeyword ( ” from ” ) ;
C o l l e c t i o n <S t r i n g > t a b l e s = t a b l e L i s t ( ) ;
P r e d i c a t e p r e d = new P r e d i c a t e ( ) ;
i f ( l e x . matchKeyword ( ” where ” ) ) {
l e x . eatKeyword ( ” where ” ) ;
pred = p r e d i c a t e ( ) ;
}
return new QueryData ( f i e l d s , t a b l e s , p r e d ) ;
}

p r i v a t e C o l l e c t i o n <S t r i n g > s e l e c t L i s t ( ) {
C o l l e c t i o n <S t r i n g > L = new A r r a y L i s t <S t r i n g >() ;
L . add ( f i e l d ( ) ) ;
i f ( l e x . matchDelim ( ’ , ’ ) ) {
l e x . eatDelim ( ’ , ’ ) ;
L . addAll ( s e l e c t L i s t ( ) ) ;
}
return L ;
}

p r i v a t e C o l l e c t i o n <S t r i n g > t a b l e L i s t ( ) {
C o l l e c t i o n <S t r i n g > L = new A r r a y L i s t <S t r i n g >() ;
L . add ( l e x . e a t I d ( ) ) ;
i f ( l e x . matchDelim ( ’ , ’ ) ) {
l e x . eatDelim ( ’ , ’ ) ;
L . addAll ( t a b l e L i s t ( ) ) ;
}
return L ;
}

// M e t h o d s for parsing the various u p d a t e commands

public O b j e c t updateCmd ( ) {
i f ( l e x . matchKeyword ( ” i n s e r t ” ) )
return i n s e r t ( ) ;
e l s e i f ( l e x . matchKeyword ( ” d e l e t e ” ) )
return d e l e t e ( ) ;
e l s e i f ( l e x . matchKeyword ( ” u p d a t e ” ) )
return m o d i f y ( ) ;
else
return c r e a t e ( ) ;
}

private Object c r e a t e ( ) {
l e x . eatKeyword ( ” c r e a t e ” ) ;
i f ( l e x . matchKeyword ( ” t a b l e ” ) )
return c r e a t e T a b l e ( ) ;
e l s e i f ( l e x . matchKeyword ( ” v i e w ” ) )
return c r e a t e V i e w ( ) ;
else
return c r e a t e I n d e x ( ) ;
}

// Method for parsing d e l e t e commands

public D e l e t e D a t a d e l e t e ( ) {
l e x . eatKeyword ( ” d e l e t e ” ) ;
l e x . eatKeyword ( ” from ” ) ;
S t r i n g tblname = l e x . e a t I d ( ) ;
P r e d i c a t e p r e d = new P r e d i c a t e ( ) ;
i f ( l e x . matchKeyword ( ” where ” ) ) {
l e x . eatKeyword ( ” where ” ) ;
pred = p r e d i c a t e ( ) ;
}
return new D e l e t e D a t a ( tblname , p r e d ) ;
}

// M e t h o d s for parsing i n s e r t commands

public I n s e r t D a t a i n s e r t ( ) {
l e x . eatKeyword ( ” i n s e r t ” ) ;
l e x . eatKeyword ( ” i n t o ” ) ;
S t r i n g tblname = l e x . e a t I d ( ) ;
l e x . eatDelim ( ’ ( ’ ) ;
L i s t <S t r i n g > f l d s = f i e l d L i s t ( ) ;
l e x . eatDelim ( ’ ) ’ ) ;
l e x . eatKeyword ( ” v a l u e s ” ) ;
l e x . eatDelim ( ’ ( ’ ) ;
L i s t <Constant> v a l s = c o n s t L i s t ( ) ;
l e x . eatDelim ( ’ ) ’ ) ;
return new I n s e r t D a t a ( tblname , f l d s , vals ) ;
}

p r i v a t e L i s t <S t r i n g > f i e l d L i s t ( ) {
L i s t <S t r i n g > L = new A r r a y L i s t <S t r i n g >() ;
L . add ( f i e l d ( ) ) ;
i f ( l e x . matchDelim ( ’ , ’ ) ) {
l e x . eatDelim ( ’ , ’ ) ;
L . addAll ( f i e l d L i s t ( ) ) ;

190
}
return L ;
}

p r i v a t e L i s t <Constant> c o n s t L i s t ( ) {
L i s t <Constant> L = new A r r a y L i s t <Constant >() ;
L . add ( c o n s t a n t ( ) ) ;
i f ( l e x . matchDelim ( ’ , ’ ) ) {
l e x . eatDelim ( ’ , ’ ) ;
L . addAll ( c o n s t L i s t ( ) ) ;
}
return L ;
}

// Method for parsing modify commands

public ModifyData modify() {
   lex.eatKeyword("update");
   String tblname = lex.eatId();
   lex.eatKeyword("set");
   String fldname = field();
   lex.eatDelim('=');
   Expression newval = expression();
   Predicate pred = new Predicate();
   if (lex.matchKeyword("where")) {
      lex.eatKeyword("where");
      pred = predicate();
   }
   return new ModifyData(tblname, fldname, newval, pred);
}

// Method for parsing create table commands

public CreateTableData createTable() {
   lex.eatKeyword("table");
   String tblname = lex.eatId();
   lex.eatDelim('(');
   Schema sch = fieldDefs();
   lex.eatDelim(')');
   return new CreateTableData(tblname, sch);
}

private Schema fieldDefs() {
   Schema schema = fieldDef();
   if (lex.matchDelim(',')) {
      lex.eatDelim(',');
      Schema schema2 = fieldDefs();
      schema.addAll(schema2);
   }
   return schema;
}

private Schema fieldDef() {
   String fldname = field();
   return fieldType(fldname);
}

private Schema fieldType(String fldname) {
   Schema schema = new Schema();
   if (lex.matchKeyword("int")) {
      lex.eatKeyword("int");
      schema.addIntField(fldname);
   }
   else {
      lex.eatKeyword("varchar");
      lex.eatDelim('(');
      int strLen = lex.eatIntConstant();
      lex.eatDelim(')');
      schema.addStringField(fldname, strLen);
   }
   return schema;
}

// Method for parsing create view commands

public CreateViewData createView() {
   lex.eatKeyword("view");
   String viewname = lex.eatId();
   lex.eatKeyword("as");
   QueryData qd = query();
   return new CreateViewData(viewname, qd);
}

// Method for parsing create index commands

public CreateIndexData createIndex() {
   lex.eatKeyword("index");
   String idxname = lex.eatId();
   lex.eatKeyword("on");
   String tblname = lex.eatId();
   lex.eatDelim('(');
   String fldname = field();
   lex.eatDelim(')');
   return new CreateIndexData(idxname, tblname, fldname);
}
}

SimpleDB source file simpledb/parse/BadSyntaxException.java

package simpledb.parse;

/**
 * A runtime exception indicating that the submitted query
 * has incorrect syntax.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class BadSyntaxException extends RuntimeException {
   public BadSyntaxException() {
   }
}

SimpleDB source file simpledb/parse/QueryData.java


package simpledb.parse;

import simpledb.query.*;
import java.util.*;

/**
 * Data for the SQL <i>select</i> statement.
 * @author Edward Sciore
 */
public class QueryData {
   private Collection<String> fields;
   private Collection<String> tables;
   private Predicate pred;

   /**
    * Saves the field and table list and predicate.
    */
   public QueryData(Collection<String> fields, Collection<String> tables, Predicate pred) {
      this.fields = fields;
      this.tables = tables;
      this.pred = pred;
   }

   /**
    * Returns the fields mentioned in the select clause.
    * @return a collection of field names
    */
   public Collection<String> fields() {
      return fields;
   }

   /**
    * Returns the tables mentioned in the from clause.
    * @return a collection of table names
    */
   public Collection<String> tables() {
      return tables;
   }

   /**
    * Returns the predicate that describes which
    * records should be in the output table.
    * @return the query predicate
    */
   public Predicate pred() {
      return pred;
   }

   public String toString() {
      String result = "select ";
      for (String fldname : fields)
         result += fldname + ", ";
      result = result.substring(0, result.length()-2); // remove final comma
      result += " from ";
      for (String tblname : tables)
         result += tblname + ", ";
      result = result.substring(0, result.length()-2); // remove final comma
      String predstring = pred.toString();
      if (!predstring.equals(""))
         result += " where " + predstring;
      return result;
   }
}

SimpleDB source file simpledb/parse/InsertData.java


package simpledb.parse;

import simpledb.query.Constant;
import java.util.*;

/**
 * Data for the SQL <i>insert</i> statement.
 * @author Edward Sciore
 */
public class InsertData {
   private String tblname;
   private List<String> flds;
   private List<Constant> vals;

   /**
    * Saves the table name and the field and value lists.
    */
   public InsertData(String tblname, List<String> flds, List<Constant> vals) {
      this.tblname = tblname;
      this.flds = flds;
      this.vals = vals;
   }

   /**
    * Returns the name of the affected table.
    * @return the name of the affected table
    */
   public String tableName() {
      return tblname;
   }

   /**
    * Returns a list of fields for which
    * values will be specified in the new record.
    * @return a list of field names
    */
   public List<String> fields() {
      return flds;
   }

   /**
    * Returns a list of values for the specified fields.
    * There is a one-one correspondence between this
    * list of values and the list of fields.
    * @return a list of Constant values.
    */
   public List<Constant> vals() {
      return vals;
   }
}

SimpleDB source file simpledb/parse/DeleteData.java


package simpledb.parse;

import simpledb.query.*;

/**
 * Data for the SQL <i>delete</i> statement.
 * @author Edward Sciore
 */
public class DeleteData {
   private String tblname;
   private Predicate pred;

   /**
    * Saves the table name and predicate.
    */
   public DeleteData(String tblname, Predicate pred) {
      this.tblname = tblname;
      this.pred = pred;
   }

   /**
    * Returns the name of the affected table.
    * @return the name of the affected table
    */
   public String tableName() {
      return tblname;
   }

   /**
    * Returns the predicate that describes which
    * records should be deleted.
    * @return the deletion predicate
    */
   public Predicate pred() {
      return pred;
   }
}

SimpleDB source file simpledb/parse/ModifyData.java


package simpledb.parse;

import simpledb.query.*;

/**
 * Data for the SQL <i>update</i> statement.
 * @author Edward Sciore
 */
public class ModifyData {
   private String tblname;
   private String fldname;
   private Expression newval;
   private Predicate pred;

   /**
    * Saves the table name, the modified field and its new value, and the predicate.
    */
   public ModifyData(String tblname, String fldname, Expression newval, Predicate pred) {
      this.tblname = tblname;
      this.fldname = fldname;
      this.newval = newval;
      this.pred = pred;
   }

   /**
    * Returns the name of the affected table.
    * @return the name of the affected table
    */
   public String tableName() {
      return tblname;
   }

   /**
    * Returns the field whose values will be modified.
    * @return the name of the target field
    */
   public String targetField() {
      return fldname;
   }

   /**
    * Returns an expression.
    * Evaluating this expression for a record produces
    * the value that will be stored in the record's target field.
    * @return the target expression
    */
   public Expression newValue() {
      return newval;
   }

   /**
    * Returns the predicate that describes which
    * records should be modified.
    * @return the modification predicate
    */
   public Predicate pred() {
      return pred;
   }
}

SimpleDB source file simpledb/parse/CreateTableData.java


package simpledb.parse;

import simpledb.record.Schema;

/**
 * Data for the SQL <i>create table</i> statement.
 * @author Edward Sciore
 */
public class CreateTableData {
   private String tblname;
   private Schema sch;

   /**
    * Saves the table name and schema.
    */
   public CreateTableData(String tblname, Schema sch) {
      this.tblname = tblname;
      this.sch = sch;
   }

   /**
    * Returns the name of the new table.
    * @return the name of the new table
    */
   public String tableName() {
      return tblname;
   }

   /**
    * Returns the schema of the new table.
    * @return the schema of the new table
    */
   public Schema newSchema() {
      return sch;
   }
}

SimpleDB source file simpledb/parse/CreateViewData.java


package simpledb.parse;

/**
 * Data for the SQL <i>create view</i> statement.
 * @author Edward Sciore
 */
public class CreateViewData {
   private String viewname;
   private QueryData qrydata;

   /**
    * Saves the view name and its definition.
    */
   public CreateViewData(String viewname, QueryData qrydata) {
      this.viewname = viewname;
      this.qrydata = qrydata;
   }

   /**
    * Returns the name of the new view.
    * @return the name of the new view
    */
   public String viewName() {
      return viewname;
   }

   /**
    * Returns the definition of the new view.
    * @return the definition of the new view
    */
   public String viewDef() {
      return qrydata.toString();
   }
}

SimpleDB source file simpledb/parse/CreateIndexData.java


package simpledb.parse;

/**
 * The parser for the <i>create index</i> statement.
 * @author Edward Sciore
 */
public class CreateIndexData {
   private String idxname, tblname, fldname;

   /**
    * Saves the table and field names of the specified index.
    */
   public CreateIndexData(String idxname, String tblname, String fldname) {
      this.idxname = idxname;
      this.tblname = tblname;
      this.fldname = fldname;
   }

   /**
    * Returns the name of the index.
    * @return the name of the index
    */
   public String indexName() {
      return idxname;
   }

   /**
    * Returns the name of the indexed table.
    * @return the name of the indexed table
    */
   public String tableName() {
      return tblname;
   }

   /**
    * Returns the name of the indexed field.
    * @return the name of the indexed field
    */
   public String fieldName() {
      return fldname;
   }
}

4.9 Query Execution Planner


(Sciore, 2008, Chapter 19)
• The basic Planner component
takes the parse tree produced for the SQL statement by the Parser component, and
gives an initial Plan for performing the action it specifies – either modifying
the database or answering the Query.
• Here we concentrate on how the RDBMS can build this initial Plan for an SQL
Query.
– The Planner also optimizes this initial Plan to lower its estimated cost. . .
– . . . but because clever use of indexes is central to this Query optimization, it
must wait until we have discussed indexing.

– The Plans to INSERT, UPDATE or DELETE Record s involve these Query
Plans too.

• The SQL standard specifies the following structure for this initial Plan, from the
leaf nodes towards the root in its Relational Algebra expression trees:

¬ stored Tables
­ products and joins
® outerjoins
¯ selections, semijoins and antijoins
° extend operations
± projections
² unions
³ sort operation.

Simple Queries

• The simplest queries have the form

SELECT A1 , A2 , A3 , . . . , Ap
FROM T1 , T2 , T3 , . . . , Tq
WHERE P1
  AND P2
  AND P3
  . . .
  AND Pr

such that each

Ai is an attribute name
Tj is a Table name
Pk is a Term.

• This is the form supported by SimpleDB.

• Its translation is in Figure 72.

• Figure 73 shows an example whose WHERE part has not been split into its Terms
yet.

• This splitting will be used in

– translating [NOT] IN. . . and EXISTS. . . in the WHERE part


– optimizing this initial translation.
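The shape of this translation can be sketched with plain strings – a toy illustration only, not SimpleDB code; the class and method names (SimpleQueryShape, plan) are made up for this sketch. It folds the FROM tables into a left-deep product tree, wraps one select per WHERE Term (Pr innermost, as in Figure 72), and finally projects:

```java
// Toy sketch of the translation shape for a simple query:
// products for the FROM part, one select per WHERE term, one final project.
public class SimpleQueryShape {
    public static String plan(String[] tables, String[] terms, String[] fields) {
        String p = tables[0];
        for (int i = 1; i < tables.length; i++)      // FROM part: left-deep products
            p = "product(" + p + "," + tables[i] + ")";
        for (int i = terms.length - 1; i >= 0; i--)  // WHERE part: Pr ends up innermost
            p = "select(" + p + "," + terms[i] + ")";
        return "project(" + p + ",{" + String.join(",", fields) + "})"; // SELECT part
    }
}
```

Splitting the WHERE part into separate selects, instead of one big predicate, is exactly what makes the later per-Term optimizations possible.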

answer
  project {A1 , A2 , A3 , . . . , Ap }          (SELECT part ±)
    select P1
      select P2
        select P3
          . . .
            select Pr                           (WHERE part ¯)
              product( . . . product(product(T1 , T2 ), T3 ) . . . , Tq )   (FROM part ­)

Figure 72: Relational Algebra expression for a simple query.

Figure 73: Basic query translation example. (Sciore, 2008)

Views and Nested Queries in the FROM Part
• View s can be added into this translation:
– Suppose that some Tj is the name of a View (and not a Table).
– The definition of this Tj is another Query Qj .
– This Qj has its own translation Rj into Relational Algebra.
– This Rj can replace its name Tj within the translation of the whole Query.
• Figure 74 shows an example, where
(a) is the whole Query and the definition of a View named EINSTEIN used in it
(b) is the translation for this EINSTEIN
(c) is the translation of the whole Query with (b) as its subexpression.
• The same idea can be used also for queries nested into the FROM part – as if they
were unnamed queries whose definitions are nested within the Query itself.

AS in the SELECT Part


• If an Ai in the SELECT part is Ei AS Bi (and not just an attribute name) then
it generates the corresponding extend(. . . ,Ei ,Bi ) Relational Algebra operation
into °, as in Figure 75.

Range Variables in the FROM Part


• If a Table, View or a nested Query Tj has a range variable vj – that is, if the FROM
part has Tj vj – then all the attributes
C1 , C2 , C3 , . . . , Cs
of Tj get renamed into
vj .C1 , vj .C2 , vj .C3 , . . . , vj .Cs
with the corresponding chain

rename(. . .(rename(rename(rename(Tj ,
C1 ,vj .C1 ),C2 ,vj .C2 ),C3 ,vj .C3 ),. . .),Cs ,vj .Cs )

instead of Tj in ¬.
• This rename Relational Algebra operation can be implemented at essentially no cost
by just changing the name of the attribute in the Schema used for Tj .
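As a toy illustration of this renaming chain (the helper class and method here are hypothetical, not part of SimpleDB), each attribute of Tj is wrapped in one more rename that prefixes it with the range variable:

```java
// Toy sketch: qualify every attribute C1..Cs of a table with range variable v
// by nesting one rename(...) per attribute, as in the chain above.
public class RenameChain {
    public static String qualify(String table, String var, String[] attrs) {
        String r = table;
        for (String c : attrs)
            r = "rename(" + r + "," + c + "," + var + "." + c + ")";
        return r;
    }
}
```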

IN-Subqueries in the WHERE Part


• If some Pk in the WHERE part is

      x IN (SELECT y FROM . . .)          – call this subQuery φ

(and not just a Term) then it generates a semijoin instead of a selection, as in
Figure 76.
• NOT IN is translated in the same way, except that it generates an antijoin instead
of a semijoin.

Figure 74: View translation example. (Sciore, 2008)

project {. . . , Ai−1 , Ei AS Bi , Ai+1 , . . .}             project {. . . , Ai−1 , Bi , Ai+1 , . . .}
                                                 becomes      extend(. . . , Ei , Bi )   (°)
           the subtree below                                    the subtree below

Figure 75: Translating AS in the SELECT part.

select (x IN φ)                       semijoin (x = y)   (¯)
                          becomes
  the subtree below                     the subtree below     translation of φ
Figure 76: Translating IN in the WHERE part.
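The record-level meaning of semijoin and antijoin can be sketched over plain integer lists – a simplified model with a single join column, not SimpleDB code: semijoin keeps exactly the rows of the left input whose value appears in the subquery's result, antijoin keeps the rest.

```java
import java.util.*;

// Simplified single-column model of semijoin and antijoin:
// semijoin keeps rows of r whose value appears in s, antijoin the others.
public class SemiAntiJoin {
    public static List<Integer> semijoin(List<Integer> r, Set<Integer> s) {
        List<Integer> out = new ArrayList<>();
        for (int x : r)
            if (s.contains(x)) out.add(x);
        return out;
    }
    public static List<Integer> antijoin(List<Integer> r, Set<Integer> s) {
        List<Integer> out = new ArrayList<>();
        for (int x : r)
            if (!s.contains(x)) out.add(x);
        return out;
    }
}
```

Note that each left row is emitted at most once, regardless of how many matching y values the subquery produces – which is why IN translates to a semijoin rather than an ordinary join.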

EXISTS-Subqueries in the WHERE Part

• The translation in Figure 76 was straightforward, because it could assume its nested
subQuery φ to be closed:

– In other words, that φ did not mention any attribute or range variable names
defined outside it – in the FROM part of a Query containing φ.
– Hence this φ
gives values of y to the Query containing it, but
takes nothing from it.
– In other words, that all communication between φ and its enclosing Query was
by φ giving its values for y and the enclosing Query comparing them with its
values for x.

• But if the WHERE part of the enclosing Query contains an EXISTS ψ subQuery,
then we can no longer assume this subQuery ψ to be closed, and this makes its
translation somewhat more intricate.

• In our University example, call two SECTIONs of the same COURSE adjacent if
there has been no third SECTION of the same COURSE between them.
SELECT *
FROM SECTION s1,
     SECTION s2
WHERE s1.CourseID = s2.CourseID
  AND s1.YearOffered < s2.YearOffered
  AND NOT EXISTS (SELECT *
                  FROM SECTION s3
                  WHERE s3.CourseID = s1.CourseID
                    AND s1.YearOffered < s3.YearOffered
                    AND s3.YearOffered < s2.YearOffered)

Here we assume that we can compare years with '<'.

• It seems difficult to express this SQL Query in a way which would not use
SECTIONs s1 and s2 in the WHERE part of the subQuery for s3.

select ψ                 becomes      semijoin (public)
  the subtree below                     the subtree below     translation of
                                                              SELECT *
                                                              FROM . . .
                                                              WHERE private

Figure 77: Relational Algebra expression for an EXISTS subquery.

• Instead, this subQuery

takes the values of s1 and s2 from its enclosing Query, and


gives back an indication whether or not there is some value s3 between them.

• This kind of a subQuery ψ can be reordered into

SELECT *
FROM ...
WHERE private
AND public

such that its

private part mentions only attributes and range variables defined inside (the FROM
part of) this ψ
public part mentions also attributes and range variables defined outside this ψ in
the enclosing Query – so that ψ would be closed without it.

• This partition of the WHERE part of the subQuery ψ into a private and a public
part permits its translation as in Figure 77.

• This partitioning is possible, because AND is commutative:

Q AND R means the same as R AND Q. (20)

• The translation in Figure 77 works also for NOT EXISTS. . . with antijoin
instead of semijoin.

• In the “adjacent courses” example, the whole WHERE part of the subQuery is
public.

– Hence its subQuery reduces to

semijoin (. . . AND χ)               becomes      semijoin (. . .)
  the left subtree below                            the left subtree below
  the right subtree below                           semijoin (public part of χ)
                                                      the right subtree below
                                                      translation of the private part of χ
Figure 78: Relational Algebra expression for nested subqueries.

SELECT *
FROM SECTION s3
WHERE TRUE

because an empty AND is true.


– This is because true is the neutral element of AND:

Q AND true means the same as Q. (21)

– In other words, it reduces to just the SECTION Table with range variable s3.

Nested EXISTS Queries

• What if the public part in Figure 77 has the form “. . . AND χ” for another nested
subQuery χ?

• This χ restricts the output from the private part, so its semi- or antijoin must be
added into the right subtree below, as in Figure 78.

• It shows a χ of the form EXISTS. . . – a NOT would again produce an antijoin
instead.

• However, Figure 78 reveals a small problem:


If χ mentions an attribute x defined in the left subtree below, then this x is no
longer available in its translation.

• This happens for instance in the Query

SELECT ...
FROM T x
WHERE EXISTS(SELECT *
             FROM α
             WHERE EXISTS(SELECT *
                          FROM β
                          WHERE γ))

where γ mentions x:

– The left subtree of this whole Query has the translation of T.
– The right subtree of this whole Query is the translation of its outer EXISTS. . .
subQuery.
– Its inner EXISTS. . . subQuery mentions x but is within that right subtree
of this whole Query.

• Fortunately this can be fixed by copying the definition of x into α as well:

SELECT ...
FROM T x
WHERE EXISTS(SELECT *
             FROM α,
                  T y
             WHERE y = x
               AND EXISTS(SELECT *
                          FROM β
                          WHERE γ′))

y is a new range variable for this self-join of T.


γ′ is γ where every mention of x has been replaced with this new y.
y = x AND. . . is added into the public part of its enclosing subQuery so that y is
indeed a local copy of the current x.

• After this copying, the translation of the inner EXISTS. . . subQuery

mentions y instead of x
gets this y from the translation of the outer EXISTS. . . subQuery – which is the
right subtree below in Figure 78.

• In general, we can

call a subQuery almost closed if its public part mentions only those attributes
which are defined in the nearest enclosing FROM part, and
assume that every subQuery is at least almost closed, before we start translating
the whole Query

because a subQuery can always be made almost closed with suitable copying before
translating it – and the RDBMS can perform this copying internally.

• Moreover, such copied definitions are good candidates for materialization, because
they are used in many places of the whole Query.

Disjunctions

• Our translations have assumed only ANDs but not ORs in their WHERE parts.

• ANDs are much more common than ORs.

• ORs can be readily added into the translation of a simple Query shown in Figure 72:

– Because a selection operation permits ANDs and ORs in its predicate,
we could in fact have used just one big selection operation for the whole
WHERE part.
– However, the Query will turn out to be easier to optimize if we still split its
WHERE part into several selection operations, but now each Pi is an OR
of Terms.
– This splitting is (or should be!) familiar from propositional logic:
It is the Conjunctive Normal Form (CNF) of the logical formula which is the
WHERE part of this simple Query.

• Recall that a formula in propositional logic is in CNF if it has the form (using our
notation and omitting NOTs):

(Term OR Term OR Term OR . . .)


AND (Term OR Term OR Term OR . . .)
AND (Term OR Term OR Term OR . . .)
..
.

and that ANDs can be lifted above ORs in this way by applying the equivalence

      (Q AND R) OR S means the same as (Q OR S) AND (R OR S)          (22)

appropriately.
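Equivalence (22) can be verified exhaustively over its truth table – a quick sanity check, not part of SimpleDB:

```java
// Exhaustive check of equivalence (22):
// (Q AND R) OR S  ==  (Q OR S) AND (R OR S)  for all truth values.
public class Distribute {
    public static boolean holds() {
        boolean[] tf = {false, true};
        for (boolean q : tf)
            for (boolean r : tf)
                for (boolean s : tf)
                    if (((q && r) || s) != ((q || s) && (r || s)))
                        return false;
        return true;
    }
}
```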

• Recall moreover that OR commutes just like AND does in Eq. (20).

• Our translation for a closed subQuery of the form “x [NOT] IN. . . ” in Figure 76
assumes that it is alone in its selection operation – and hence that ORs have been
eliminated first from the WHERE part containing this subQuery.

• For this, recall another normal form from propositional logic:

– A formula is in Disjunctive Normal Form (DNF) if its ORs are above its ANDs
– the “other way around” than in CNF.
– It too can be reached with an equivalence similar to Eq. (22):

      (Q OR R) AND S means the same as (Q AND S) OR (R AND S)         (23)

• Suppose then that we have a Query

SELECT α FROM β
WHERE γ

whose γ contains such a subQuery.

¬ First we can convert γ towards DNF to get

SELECT α FROM β
WHERE γ1
OR γ2

which exposes one OR to eliminate.


­ This OR can be eliminated by turning it into a UNION instead:
(SELECT α FROM β
WHERE γ1 )
UNION
(SELECT α FROM β
WHERE γ2 )

• Repeating this conversion of ORs into UNIONs will eventually lead into a Query
in which the WHERE parts containing this subQuery no longer have ORs.

• However, the whole Query can get much larger, because its FROM β part gets
repeated.

• This repeating FROM β part is a natural candidate for materialization.

• Moreover, if this UNION is inside an EXISTS. . . query, then we must be able to
extract its public parts. Since this requires ANDs, we must continue further:

® First pull this UNION from under the EXISTS by turning


EXISTS ((SELECT α FROM β
WHERE γ1 )
UNION
(SELECT α FROM β
WHERE γ2 ))

into
(EXISTS (SELECT α FROM β
         WHERE γ1))
OR
(EXISTS (SELECT α FROM β
         WHERE γ2)).

¯ Then continue by eliminating this new OR in the enclosing WHERE part as
before.

• This elimination in ¯ turns out to be simple, if this is a NOT EXISTS. . . subQuery:

(NOT EXISTS (SELECT α FROM β
             WHERE γ1))
AND
(NOT EXISTS (SELECT α FROM β
             WHERE γ2))

by (the logical versions of) de Morgan's laws:

      NOT(Q OR R)  means the same as (NOT Q) AND (NOT R)
      NOT(Q AND R) means the same as (NOT Q) OR (NOT R)               (24)
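Both directions of de Morgan's laws (24) can likewise be checked over the truth table – again just a sanity check, not part of SimpleDB:

```java
// Exhaustive check of de Morgan's laws (24):
// NOT(Q OR R) == (NOT Q) AND (NOT R)  and  NOT(Q AND R) == (NOT Q) OR (NOT R).
public class DeMorgan {
    public static boolean holds() {
        boolean[] tf = {false, true};
        for (boolean q : tf)
            for (boolean r : tf) {
                if (!(q || r) != (!q && !r)) return false;
                if (!(q && r) != (!q || !r)) return false;
            }
        return true;
    }
}
```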

• We can apply these laws (24) also to a

x NOT IN (Q1
UNION
Q2 )

which turns it into

(x NOT IN (Q1 ))
AND
(x NOT IN (Q2 ))

because this AND

– is faster to execute
– permits further optimizations

than that UNION.

• These transformations can get rid of the unwanted ORs before the whole Query is
translated into Relational Algebra.

Postprocessing
• If an SQL (sub)Query ends with

      . . .
      GROUP BY grouping HAVING pred

then its Relational Algebra translation has a

      select(groupby(translation of . . . , grouping, computing), pred)

below the translation of its SELECTion part.

• Recall that it can compute summaries of its groups.

Figure 79: A big SQL query to translate. (Sciore, 2008)

• If it omits its optional HAVING part, then its predicate can be taken to be true,
and its selection omitted.

• Similarly, the whole Query (but not a subQuery) can end with

      . . .
      ORDER BY attributes

and then its translation has

      sort(translation of . . . , attributes)

on top of everything.

• However, note that we have skipped many details here. . .

• Figures 79 and 80 show an example of this GROUPing and ORDERing.

Figure 80: Translation of Figure 79. (Sciore, 2008)

SimpleDB source file simpledb/planner/Planner.java

• Here is the SimpleDB Planner object.

• It consists of two subPlanners:

  QueryPlanner which translates each SQL Query into a Plan as outlined earlier.
  UpdatePlanner which executes the SQL INSERT, UPDATE and DELETE Statements.
      It also handles the CREATE (and DROP and ALTER, if SimpleDB supported
      them) Statements, because they are similar updates of the catalog metadata.

• These subPlanners are interfaces, to permit replacing how SimpleDB implements
  them with something more powerful.

package simpledb.planner;

import simpledb.tx.Transaction;
import simpledb.parse.*;
import simpledb.query.*;

/**
 * The object that executes SQL statements.
 * @author sciore
 */
public class Planner {
   private QueryPlanner qplanner;
   private UpdatePlanner uplanner;

   public Planner(QueryPlanner qplanner, UpdatePlanner uplanner) {
      this.qplanner = qplanner;
      this.uplanner = uplanner;
   }

   /**
    * Creates a plan for an SQL select statement, using the supplied planner.
    * @param qry the SQL query string
    * @param tx the transaction
    * @return the scan corresponding to the query plan
    */
   public Plan createQueryPlan(String qry, Transaction tx) {
      Parser parser = new Parser(qry);
      QueryData data = parser.query();
      return qplanner.createPlan(data, tx);
   }

   /**
    * Executes an SQL insert, delete, modify, or
    * create statement.
    * The method dispatches to the appropriate method of the
    * supplied update planner,
    * depending on what the parser returns.
    * @param cmd the SQL update string
    * @param tx the transaction
    * @return an integer denoting the number of affected records
    */
   public int executeUpdate(String cmd, Transaction tx) {
      Parser parser = new Parser(cmd);
      Object obj = parser.updateCmd();
      if (obj instanceof InsertData)
         return uplanner.executeInsert((InsertData)obj, tx);
      else if (obj instanceof DeleteData)
         return uplanner.executeDelete((DeleteData)obj, tx);
      else if (obj instanceof ModifyData)
         return uplanner.executeModify((ModifyData)obj, tx);
      else if (obj instanceof CreateTableData)
         return uplanner.executeCreateTable((CreateTableData)obj, tx);
      else if (obj instanceof CreateViewData)
         return uplanner.executeCreateView((CreateViewData)obj, tx);
      else if (obj instanceof CreateIndexData)
         return uplanner.executeCreateIndex((CreateIndexData)obj, tx);
      else
         return 0;
   }
}

SimpleDB source file simpledb/planner/QueryPlanner.java


package simpledb.planner;

import simpledb.tx.Transaction;
import simpledb.query.Plan;
import simpledb.parse.QueryData;

/**
 * The interface implemented by planners for
 * the SQL select statement.
 * @author Edward Sciore
 */
public interface QueryPlanner {

   /**
    * Creates a plan for the parsed query.
    * @param data the parsed representation of the query
    * @param tx the calling transaction
    * @return a plan for that query
    */
   public Plan createPlan(QueryData data, Transaction tx);
}

SimpleDB source file simpledb/planner/UpdatePlanner.java


package simpledb.planner;

import simpledb.tx.Transaction;
import simpledb.parse.*;

/**
 * The interface implemented by the planners
 * for SQL insert, delete, and modify statements.
 * @author Edward Sciore
 */
public interface UpdatePlanner {

   /**
    * Executes the specified insert statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the insert statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeInsert(InsertData data, Transaction tx);

   /**
    * Executes the specified delete statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the delete statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeDelete(DeleteData data, Transaction tx);

   /**
    * Executes the specified modify statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the modify statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeModify(ModifyData data, Transaction tx);

   /**
    * Executes the specified create table statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the create table statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeCreateTable(CreateTableData data, Transaction tx);

   /**
    * Executes the specified create view statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the create view statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeCreateView(CreateViewData data, Transaction tx);

   /**
    * Executes the specified create index statement, and
    * returns the number of affected records.
    * @param data the parsed representation of the create index statement
    * @param tx the calling transaction
    * @return the number of affected records
    */
   public int executeCreateIndex(CreateIndexData data, Transaction tx);
}

SimpleDB source file simpledb/planner/BasicQueryPlanner.java

• Here is the SimpleDB basic Query Planner.

• It performs the translation in Figure 73.
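The four-step translation can be seen in miniature with plain Java collections — a hypothetical stand-alone sketch, not SimpleDB's actual Plan classes; the table and field names are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone miniature of the four steps (invented table and
// field names; plain collections stand in for SimpleDB's Plan classes).
public class NaivePipeline {
   public static List<String> run() {
      // Step 1: a "plan" for each mentioned table, here just rows in memory
      List<Map<String,Object>> student = List.of(
         Map.<String,Object>of("sname", "joe", "majorid", 10),
         Map.<String,Object>of("sname", "amy", "majorid", 20));
      List<Map<String,Object>> dept = List.of(
         Map.<String,Object>of("did", 10, "dname", "compsci"),
         Map.<String,Object>of("did", 20, "dname", "math"));

      // Step 2: the product of all table plans: every pairing of rows
      List<Map<String,Object>> product = new ArrayList<>();
      for (Map<String,Object> s : student)
         for (Map<String,Object> d : dept) {
            Map<String,Object> row = new HashMap<>(s);
            row.putAll(d);
            product.add(row);
         }

      // Step 3: select on the predicate majorid = did
      // Step 4: project on the field list [sname, dname]
      List<String> result = new ArrayList<>();
      for (Map<String,Object> row : product)
         if (row.get("majorid").equals(row.get("did")))
            result.add(row.get("sname") + " " + row.get("dname"));
      return result;
   }

   public static void main(String[] args) {
      System.out.println(run());
   }
}
```

The product step is why this planner is "naive": it materializes every row pairing before the selection throws most of them away.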

package simpledb.planner;

import simpledb.tx.Transaction;
import simpledb.query.*;
import simpledb.parse.*;
import simpledb.server.SimpleDB;
import java.util.*;

/**
 * The simplest, most naive query planner possible.
 * @author Edward Sciore
 */
public class BasicQueryPlanner implements QueryPlanner {

   /**
    * Creates a query plan as follows.  It first takes
    * the product of all tables and views; it then selects on the predicate;
    * and finally it projects on the field list.
    */
   public Plan createPlan(QueryData data, Transaction tx) {
      // Step 1: Create a plan for each mentioned table or view
      List<Plan> plans = new ArrayList<Plan>();
      for (String tblname : data.tables()) {
         String viewdef = SimpleDB.mdMgr().getViewDef(tblname, tx);
         if (viewdef != null)
            plans.add(SimpleDB.planner().createQueryPlan(viewdef, tx));
         else
            plans.add(new TablePlan(tblname, tx));
      }

      // Step 2: Create the product of all table plans
      Plan p = plans.remove(0);
      for (Plan nextplan : plans)
         p = new ProductPlan(p, nextplan);

      // Step 3: Add a selection plan for the predicate
      p = new SelectPlan(p, data.pred());

      // Step 4: Project on the field names
      p = new ProjectPlan(p, data.fields());
      return p;
   }
}

SimpleDB source file simpledb/planner/BasicUpdatePlanner.java

• Here is the SimpleDB basic Update Planner.

• For SQL DELETE and UPDATE statements, it...

  1. first extracts the updatable Plan,
  2. then opens it into an updatable Scan, and
  3. finally executes the corresponding operation on each Record in its result set.
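The delete case of these steps can be sketched in miniature on a plain Java list, with a ListIterator standing in for the updatable Scan — a hypothetical illustration, not SimpleDB code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

// Hypothetical miniature of executeDelete's loop: a ListIterator stands in
// for the updatable Scan, and removing an element stands in for us.delete().
public class DeleteLoopDemo {
   public static int deleteWhere(List<Integer> gradYears, int year) {
      int count = 0;
      ListIterator<Integer> us = gradYears.listIterator();
      while (us.hasNext()) {        // like: while (us.next())
         if (us.next() == year) {   // the selection predicate
            us.remove();            // like: us.delete()
            count++;
         }
      }
      return count;                 // the number of affected records
   }

   public static void main(String[] args) {
      List<Integer> years = new ArrayList<>(List.of(2004, 2005, 2004));
      System.out.println(deleteWhere(years, 2004));
      System.out.println(years);
   }
}
```

Like executeDelete below, the method reports how many records it touched, which is exactly the count a JDBC executeUpdate call returns to the client.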

package simpledb.planner;

import java.util.Iterator;
import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.parse.*;
import simpledb.query.*;

/**
 * The basic planner for SQL update statements.
 * @author sciore
 */
public class BasicUpdatePlanner implements UpdatePlanner {

   public int executeDelete(DeleteData data, Transaction tx) {
      Plan p = new TablePlan(data.tableName(), tx);
      p = new SelectPlan(p, data.pred());
      UpdateScan us = (UpdateScan) p.open();
      int count = 0;
      while (us.next()) {
         us.delete();
         count++;
      }
      us.close();
      return count;
   }

   public int executeModify(ModifyData data, Transaction tx) {
      Plan p = new TablePlan(data.tableName(), tx);
      p = new SelectPlan(p, data.pred());
      UpdateScan us = (UpdateScan) p.open();
      int count = 0;
      while (us.next()) {
         Constant val = data.newValue().evaluate(us);
         us.setVal(data.targetField(), val);
         count++;
      }
      us.close();
      return count;
   }

   public int executeInsert(InsertData data, Transaction tx) {
      Plan p = new TablePlan(data.tableName(), tx);
      UpdateScan us = (UpdateScan) p.open();
      us.insert();
      Iterator<Constant> iter = data.vals().iterator();
      for (String fldname : data.fields()) {
         Constant val = iter.next();
         us.setVal(fldname, val);
      }
      us.close();
      return 1;
   }

   public int executeCreateTable(CreateTableData data, Transaction tx) {
      SimpleDB.mdMgr().createTable(data.tableName(), data.newSchema(), tx);
      return 0;
   }

   public int executeCreateView(CreateViewData data, Transaction tx) {
      SimpleDB.mdMgr().createView(data.viewName(), data.viewDef(), tx);
      return 0;
   }

   public int executeCreateIndex(CreateIndexData data, Transaction tx) {
      SimpleDB.mdMgr().createIndex(data.indexName(), data.tableName(), data.fieldName(), tx);
      return 0;
   }
}

4.10 The Remote Database Server


(Sciore, 2008, Chapters 7–8 and 20)
• The Remote component of SimpleDB provides...

  – on the server machine side, the initialization of the SimpleDB server process;
  – communication between client processes and this server process.
    Each client Connection runs as its own separate OS thread within this server
    process;
  – on the client side, a subset of the JDBC standard for this communication.

SimpleDB source file simpledb/remote/RemoteDriverImpl.java


• On the server side, Remote Method Invocation (RMI) needs a Driver whose stub is
  the (only) object published in the RMI registry – its “phone book”.

• Its job is to provide remote Connections.

• The programmer just writes these Implementation classes – Java supplies their
  stubs.
package simpledb.remote;

import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

/**
 * The RMI server-side implementation of RemoteDriver.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class RemoteDriverImpl extends UnicastRemoteObject implements RemoteDriver {
   public RemoteDriverImpl() throws RemoteException {
   }

   /**
    * Creates a new RemoteConnectionImpl object and
    * returns it.
    * @see simpledb.remote.RemoteDriver#connect()
    */
   public RemoteConnection connect() throws RemoteException {
      return new RemoteConnectionImpl();
   }
}

SimpleDB source file simpledb/remote/RemoteDriver.java


package simpledb.remote;

import java.rmi.*;

/**
 * The RMI remote interface corresponding to Driver.
 * The method is similar to that of Driver,
 * except that it takes no arguments and
 * throws RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteDriver extends Remote {
   public RemoteConnection connect() throws RemoteException;
}

SimpleDB source file simpledb/remote/SimpleDriver.java


• On the client side, this SimpleDB driver gets the RemoteDriver via RMI, so that
  it can form a Connection.
package simpledb.remote;

import java.sql.*;
import java.rmi.*;
import java.util.Properties;

/**
 * The SimpleDB database driver.
 * @author Edward Sciore
 */
public class SimpleDriver extends DriverAdapter {

   /**
    * Connects to the SimpleDB server on the specified host.
    * The method retrieves the RemoteDriver stub from
    * the RMI registry on the specified host.
    * It then calls the connect method on that stub,
    * which in turn creates a new connection and
    * returns the RemoteConnection stub for it.
    * This stub is wrapped in a SimpleConnection object
    * and is returned.
    * <P>
    * The current implementation of this method ignores the
    * properties argument.
    * @see java.sql.Driver#connect(java.lang.String, Properties)
    */
   public Connection connect(String url, Properties prop) throws SQLException {
      try {
         String newurl = url.replace("jdbc:simpledb", "rmi") + "/simpledb";
         RemoteDriver rdvr = (RemoteDriver) Naming.lookup(newurl);
         RemoteConnection rconn = rdvr.connect();
         return new SimpleConnection(rconn);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}

SimpleDB source file simpledb/remote/DriverAdapter.java
package simpledb.remote;

import java.sql.*;
import java.util.*;

/**
 * This class implements all of the methods of the Driver interface,
 * by throwing an exception for each one.
 * Subclasses (such as SimpleDriver) can override those methods that
 * they want to implement.
 * @author Edward Sciore
 */
public abstract class DriverAdapter implements Driver {
   public boolean acceptsURL(String url) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Connection connect(String url, Properties info) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public int getMajorVersion() {
      return 0;
   }

   public int getMinorVersion() {
      return 0;
   }

   public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
      return null;
   }

   public boolean jdbcCompliant() {
      return false;
   }
}

• In this way, each Service has 4 Java source files:

  RemoteServiceImpl.java provides the server-side implementation of this Service.
  RemoteService.java is the interface which this server-side Service implements.
  SimpleService.java is the client-side wrapper for this Service provided remotely
    by the server.
  ServiceAdapter.java is a “do nothing” client-side wrapper.
    – The actual wrapper overrides some of its methods to do something instead.
    – SimpleDB implements a subset of the JDBC standard via these Adapters.
    – Most of these “do nothing” methods just throw an SQLException saying
      "operation not implemented".

• SimpleDB implements 5 such Services:

  Driver for RMI, so that the clients and the server can establish Connections
    between them.
  Connection for this client-server communication.
  Statement for passing SQL statements from a client to the server via these
    Connections.
  ResultSet for passing the result rows of an SQL query from the server back to
    its client.
  MetaData for passing the metadata for these result rows.

SimpleDB source file simpledb/remote/RemoteConnectionImpl.java

• SimpleDB Remote Connections ensure that each Query gets executed as its own
Transaction.

• In this way, SimpleDB supports only the SQL AUTOCOMMIT mode.
package simpledb.remote;

import simpledb.tx.Transaction;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

/**
 * The RMI server-side implementation of RemoteConnection.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteConnectionImpl extends UnicastRemoteObject implements RemoteConnection {
   private Transaction tx;

   /**
    * Creates a remote connection
    * and begins a new transaction for it.
    * @throws RemoteException
    */
   RemoteConnectionImpl() throws RemoteException {
      tx = new Transaction();
   }

   /**
    * Creates a new RemoteStatement for this connection.
    * @see simpledb.remote.RemoteConnection#createStatement()
    */
   public RemoteStatement createStatement() throws RemoteException {
      return new RemoteStatementImpl(this);
   }

   /**
    * Closes the connection.
    * The current transaction is committed.
    * @see simpledb.remote.RemoteConnection#close()
    */
   public void close() throws RemoteException {
      tx.commit();
   }

   // The following methods are used by the server-side classes.

   /**
    * Returns the transaction currently associated with
    * this connection.
    * @return the transaction associated with this connection
    */
   Transaction getTransaction() {
      return tx;
   }

   /**
    * Commits the current transaction,
    * and begins a new one.
    */
   void commit() {
      tx.commit();
      tx = new Transaction();
   }

   /**
    * Rolls back the current transaction,
    * and begins a new one.
    */
   void rollback() {
      tx.rollback();
      tx = new Transaction();
   }
}

SimpleDB source file simpledb/remote/RemoteConnection.java


package simpledb.remote;

import java.rmi.*;

/**
 * The RMI remote interface corresponding to Connection.
 * The methods are identical to those of Connection,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteConnection extends Remote {
   public RemoteStatement createStatement() throws RemoteException;
   public void close() throws RemoteException;
}

SimpleDB source file simpledb/remote/SimpleConnection.java


package simpledb.remote;

import java.sql.*;

/**
 * An adapter class that wraps RemoteConnection.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleConnection extends ConnectionAdapter {
   private RemoteConnection rconn;

   public SimpleConnection(RemoteConnection c) {
      rconn = c;
   }

   public Statement createStatement() throws SQLException {
      try {
         RemoteStatement rstmt = rconn.createStatement();
         return new SimpleStatement(rstmt);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public void close() throws SQLException {
      try {
         rconn.close();
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}
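Every Simple* wrapper method above follows the same exception-translation idiom; here it is in a self-contained miniature (method names invented for illustration):

```java
import java.rmi.RemoteException;
import java.sql.SQLException;

// Self-contained miniature of the wrapping idiom (invented method names):
// a checked RemoteException from the RMI stub is caught and re-thrown as
// the SQLException that JDBC callers expect, keeping the original as cause.
public class WrapDemo {
   static String remoteCall() throws RemoteException {
      throw new RemoteException("network down");  // simulated RMI failure
   }

   public static String clientCall() throws SQLException {
      try {
         return remoteCall();
      }
      catch (Exception e) {
         throw new SQLException(e);  // wrap, preserving the cause
      }
   }

   public static void main(String[] args) {
      try {
         clientCall();
      }
      catch (SQLException e) {
         System.out.println(e.getCause().getMessage());
      }
   }
}
```

Because SQLException(Throwable) keeps the wrapped exception as its cause, a JDBC client can still inspect the underlying RMI failure when debugging.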

SimpleDB source file simpledb/remote/ConnectionAdapter.java


• Here is an example of an Adapter which just throws exceptions.

• The other such Adapters are omitted for brevity.


package simpledb.remote;

import java.sql.*;
import java.util.*;

/**
 * This class implements all of the methods of the Connection interface,
 * by throwing an exception for each one.
 * Subclasses (such as SimpleConnection) can override those methods that
 * they want to implement.
 * @author Edward Sciore
 */
public abstract class ConnectionAdapter implements Connection {
   public void clearWarnings() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void close() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void commit() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Array createArrayOf(String typeName, Object[] elements) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Blob createBlob() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Clob createClob() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public NClob createNClob() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public SQLXML createSQLXML() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Statement createStatement() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Statement createStatement(int resultSetType, int resultSetConcurrency) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Statement createStatement(int resultSetType, int resultSetConcurrency,
                                    int resultSetHoldability) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Struct createStruct(String typeName, Object[] attributes) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public boolean getAutoCommit() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public String getCatalog() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Properties getClientInfo() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public String getClientInfo(String name) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public int getHoldability() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public DatabaseMetaData getMetaData() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public int getTransactionIsolation() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Map<String,Class<?>> getTypeMap() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public SQLWarning getWarnings() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public boolean isClosed() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public boolean isReadOnly() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public boolean isValid(int timeout) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public String nativeSQL(String sql) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public CallableStatement prepareCall(String sql) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public CallableStatement prepareCall(String sql, int resultSetType, int resultSetConcurrency)
                            throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public CallableStatement prepareCall(String sql, int resultSetType, int resultSetConcurrency,
                            int resultSetHoldability) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql, int autoGeneratedKeys) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql, int[] columnIndexes) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql, int resultSetType, int resultSetConcurrency)
                            throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql, int resultSetType, int resultSetConcurrency,
                            int resultSetHoldability) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public PreparedStatement prepareStatement(String sql, String[] columnNames) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void releaseSavepoint(Savepoint savepoint) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void rollback() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void rollback(Savepoint savepoint) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setAutoCommit(boolean autoCommit) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setCatalog(String catalog) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setClientInfo(String name, String value) {
   }

   public void setClientInfo(Properties properties) {
   }

   public void setHoldability(int holdability) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setReadOnly(boolean readOnly) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Savepoint setSavepoint() throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public Savepoint setSavepoint(String name) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setTransactionIsolation(int level) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public void setTypeMap(Map<String,Class<?>> map) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public boolean isWrapperFor(Class<?> iface) throws SQLException {
      throw new SQLException("operation not implemented");
   }

   public <T> T unwrap(Class<T> iface) throws SQLException {
      throw new SQLException("operation not implemented");
   }
}

SimpleDB source file simpledb/remote/RemoteStatementImpl.java


package simpledb.remote;

import simpledb.tx.Transaction;
import simpledb.query.Plan;
import simpledb.server.SimpleDB;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

/**
 * The RMI server-side implementation of RemoteStatement.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteStatementImpl extends UnicastRemoteObject implements RemoteStatement {
   private RemoteConnectionImpl rconn;

   public RemoteStatementImpl(RemoteConnectionImpl rconn) throws RemoteException {
      this.rconn = rconn;
   }

   /**
    * Executes the specified SQL query string.
    * The method calls the query planner to create a plan
    * for the query.  It then sends the plan to the
    * RemoteResultSetImpl constructor for processing.
    * @see simpledb.remote.RemoteStatement#executeQuery(java.lang.String)
    */
   public RemoteResultSet executeQuery(String qry) throws RemoteException {
      try {
         Transaction tx = rconn.getTransaction();
         Plan pln = SimpleDB.planner().createQueryPlan(qry, tx);
         return new RemoteResultSetImpl(pln, rconn);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }

   /**
    * Executes the specified SQL update command.
    * The method sends the command to the update planner,
    * which executes it.
    * @see simpledb.remote.RemoteStatement#executeUpdate(java.lang.String)
    */
   public int executeUpdate(String cmd) throws RemoteException {
      try {
         Transaction tx = rconn.getTransaction();
         int result = SimpleDB.planner().executeUpdate(cmd, tx);
         rconn.commit();
         return result;
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
}

SimpleDB source file simpledb/remote/RemoteStatement.java


package simpledb.remote;

import java.rmi.*;

/**
 * The RMI remote interface corresponding to Statement.
 * The methods are identical to those of Statement,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteStatement extends Remote {
   public RemoteResultSet executeQuery(String qry) throws RemoteException;
   public int executeUpdate(String cmd) throws RemoteException;
}

SimpleDB source file simpledb/remote/SimpleStatement.java


package simpledb.remote;

import java.sql.*;

/**
 * An adapter class that wraps RemoteStatement.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleStatement extends StatementAdapter {
   private RemoteStatement rstmt;

   public SimpleStatement(RemoteStatement s) {
      rstmt = s;
   }

   public ResultSet executeQuery(String qry) throws SQLException {
      try {
         RemoteResultSet rrs = rstmt.executeQuery(qry);
         return new SimpleResultSet(rrs);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public int executeUpdate(String cmd) throws SQLException {
      try {
         return rstmt.executeUpdate(cmd);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}

SimpleDB source file simpledb/remote/RemoteResultSetImpl.java


package simpledb.remote;

import simpledb.record.Schema;
import simpledb.query.*;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

/**
 * The RMI server-side implementation of RemoteResultSet.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteResultSetImpl extends UnicastRemoteObject implements RemoteResultSet {
   private Scan s;
   private Schema sch;
   private RemoteConnectionImpl rconn;

   /**
    * Creates a RemoteResultSet object.
    * The specified plan is opened, and the scan is saved.
    * @param plan the query plan
    * @param rconn TODO
    * @throws RemoteException
    */
   public RemoteResultSetImpl(Plan plan, RemoteConnectionImpl rconn) throws RemoteException {
      s = plan.open();
      sch = plan.schema();
      this.rconn = rconn;
   }

   /**
    * Moves to the next record in the result set,
    * by moving to the next record in the saved scan.
    * @see simpledb.remote.RemoteResultSet#next()
    */
   public boolean next() throws RemoteException {
      try {
         return s.next();
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }

   /**
    * Returns the integer value of the specified field,
    * by returning the corresponding value on the saved scan.
    * @see simpledb.remote.RemoteResultSet#getInt(java.lang.String)
    */
   public int getInt(String fldname) throws RemoteException {
      try {
         fldname = fldname.toLowerCase(); // to ensure case-insensitivity
         return s.getInt(fldname);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }

   /**
    * Returns the string value of the specified field,
    * by returning the corresponding value on the saved scan.
    * @see simpledb.remote.RemoteResultSet#getString(java.lang.String)
    */
   public String getString(String fldname) throws RemoteException {
      try {
         fldname = fldname.toLowerCase(); // to ensure case-insensitivity
         return s.getString(fldname);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }

   /**
    * Returns the result set's metadata,
    * by passing its schema into the RemoteMetaData constructor.
    * @see simpledb.remote.RemoteResultSet#getMetaData()
    */
   public RemoteMetaData getMetaData() throws RemoteException {
      return new RemoteMetaDataImpl(sch);
   }

   /**
    * Closes the result set by closing its scan.
    * @see simpledb.remote.RemoteResultSet#close()
    */
   public void close() throws RemoteException {
      s.close();
      rconn.commit();
   }
}

SimpleDB source file simpledb/remote/RemoteResultSet.java

package simpledb.remote;

import java.rmi.*;

/**
 * The RMI remote interface corresponding to ResultSet.
 * The methods are identical to those of ResultSet,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteResultSet extends Remote {
   public boolean next()                   throws RemoteException;
   public int getInt(String fldname)       throws RemoteException;
   public String getString(String fldname) throws RemoteException;
   public RemoteMetaData getMetaData()     throws RemoteException;
   public void close()                     throws RemoteException;
}

SimpleDB source file simpledb/remote/SimpleResultSet.java

package simpledb.remote;

import java.sql.*;

/**
 * An adapter class that wraps RemoteResultSet.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleResultSet extends ResultSetAdapter {
   private RemoteResultSet rrs;

   public SimpleResultSet(RemoteResultSet s) {
      rrs = s;
   }

   public boolean next() throws SQLException {
      try {
         return rrs.next();
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public int getInt(String fldname) throws SQLException {
      try {
         return rrs.getInt(fldname);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public String getString(String fldname) throws SQLException {
      try {
         return rrs.getString(fldname);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public ResultSetMetaData getMetaData() throws SQLException {
      try {
         RemoteMetaData rmd = rrs.getMetaData();
         return new SimpleMetaData(rmd);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public void close() throws SQLException {
      try {
         rrs.close();
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}

SimpleDB source file simpledb/remote/RemoteMetaDataImpl.java

• This RemoteMetaData assigns a column number for each Attribute of the Result Set.

• This helps when this set is printed out as rows of fixed-width columns, as in the
  SQL interpreter example client.

package simpledb.remote;

import simpledb.record.Schema;
import static java.sql.Types.INTEGER;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.*;

/**
 * The RMI server-side implementation of RemoteMetaData.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class RemoteMetaDataImpl extends UnicastRemoteObject implements RemoteMetaData {
   private Schema sch;
   private List<String> fields = new ArrayList<String>();

   /**
    * Creates a metadata object that wraps the specified schema.
    * The method also creates a list to hold the schema's
    * collection of field names,
    * so that the fields can be accessed by position.
    * @param sch the schema
    * @throws RemoteException
    */
   public RemoteMetaDataImpl(Schema sch) throws RemoteException {
      this.sch = sch;
      fields.addAll(sch.fields());
   }

   /**
    * Returns the size of the field list.
    * @see simpledb.remote.RemoteMetaData#getColumnCount()
    */
   public int getColumnCount() throws RemoteException {
      return fields.size();
   }

   /**
    * Returns the field name for the specified column number.
    * In JDBC, column numbers start with 1, so the field
    * is taken from position (column-1) in the list.
    * @see simpledb.remote.RemoteMetaData#getColumnName(int)
    */
   public String getColumnName(int column) throws RemoteException {
      return fields.get(column-1);
   }

   /**
    * Returns the type of the specified column.
    * The method first finds the name of the field in that column,
    * and then looks up its type in the schema.
    * @see simpledb.remote.RemoteMetaData#getColumnType(int)
    */
   public int getColumnType(int column) throws RemoteException {
      String fldname = getColumnName(column);
      return sch.type(fldname);
   }

   /**
    * Returns the number of characters required to display the
    * specified column.
    * For a string-type field, the method simply looks up the
    * field's length in the schema and returns that.
    * For an int-type field, the method needs to decide how
    * large integers can be.
    * Here, the method arbitrarily chooses 6 characters,
    * which means that integers over 999,999 will
    * probably get displayed improperly.
    * @see simpledb.remote.RemoteMetaData#getColumnDisplaySize(int)
    */
   public int getColumnDisplaySize(int column) throws RemoteException {
      String fldname = getColumnName(column);
      int fldtype = sch.type(fldname);
      int fldlength = sch.length(fldname);
      if (fldtype == INTEGER)
         return 6; // accommodate 6-digit integers
      else
         return fldlength;
   }
}

SimpleDB source file simpledb/remote/RemoteMetaData.java

package simpledb.remote;

import java.rmi.*;

/**
 * The RMI remote interface corresponding to ResultSetMetaData.
 * The methods are identical to those of ResultSetMetaData,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteMetaData extends Remote {
   public int getColumnCount()                 throws RemoteException;
   public String getColumnName(int column)     throws RemoteException;
   public int getColumnType(int column)        throws RemoteException;
   public int getColumnDisplaySize(int column) throws RemoteException;
}

SimpleDB source file simpledb/remote/SimpleMetaData.java

package simpledb.remote;

import java.sql.*;

/**
 * An adapter class that wraps RemoteMetaData.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleMetaData extends ResultSetMetaDataAdapter {
   private RemoteMetaData rmd;

   public SimpleMetaData(RemoteMetaData md) {
      rmd = md;
   }

   public int getColumnCount() throws SQLException {
      try {
         return rmd.getColumnCount();
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public String getColumnName(int column) throws SQLException {
      try {
         return rmd.getColumnName(column);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public int getColumnType(int column) throws SQLException {
      try {
         return rmd.getColumnType(column);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }

   public int getColumnDisplaySize(int column) throws SQLException {
      try {
         return rmd.getColumnDisplaySize(column);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}

SimpleDB source file studentClient/simpledb/SQLInterpreter.java

import java.sql.*;
import simpledb.remote.SimpleDriver;
import java.io.*;

public class SQLInterpreter {
   private static Connection conn = null;

   public static void main(String[] args) {
      try {
         Driver d = new SimpleDriver();
         conn = d.connect("jdbc:simpledb://localhost", null);

         Reader rdr = new InputStreamReader(System.in);
         BufferedReader br = new BufferedReader(rdr);

         while (true) {
            // process one line of input
            System.out.print("\nSQL> ");
            String cmd = br.readLine().trim();
            System.out.println();
            if (cmd.startsWith("exit"))
               break;
            else if (cmd.startsWith("select"))
               doQuery(cmd);
            else
               doUpdate(cmd);
         }
      }
      catch (Exception e) {
         e.printStackTrace();
      }
      finally {
         try {
            if (conn != null)
               conn.close();
         }
         catch (Exception e) {
            e.printStackTrace();
         }
      }
   }

   private static void doQuery(String cmd) {
      try {
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(cmd);
         ResultSetMetaData md = rs.getMetaData();
         int numcols = md.getColumnCount();
         int totalwidth = 0;

         // print header
         for (int i=1; i<=numcols; i++) {
            int width = md.getColumnDisplaySize(i);
            totalwidth += width;
            String fmt = "%" + width + "s";
            System.out.format(fmt, md.getColumnName(i));
         }
         System.out.println();
         for (int i=0; i<totalwidth; i++)
            System.out.print("-");
         System.out.println();

         // print records
         while (rs.next()) {
            for (int i=1; i<=numcols; i++) {
               String fldname = md.getColumnName(i);
               int fldtype = md.getColumnType(i);
               String fmt = "%" + md.getColumnDisplaySize(i);
               if (fldtype == Types.INTEGER)
                  System.out.format(fmt + "d", rs.getInt(fldname));
               else
                  System.out.format(fmt + "s", rs.getString(fldname));
            }
            System.out.println();
         }
         rs.close();
      }
      catch (SQLException e) {
         System.out.println("SQL Exception: " + e.getMessage());
         e.printStackTrace();
      }
   }

   private static void doUpdate(String cmd) {
      try {
         Statement stmt = conn.createStatement();
         int howmany = stmt.executeUpdate(cmd);
         System.out.println(howmany + " records processed");
      }
      catch (SQLException e) {
         System.out.println("SQL Exception: " + e.getMessage());
         e.printStackTrace();
      }
   }
}
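The doQuery method above builds its format strings at run time, concatenating the column's display size into a pattern such as "%6d" before formatting each value. The self-contained sketch below isolates that trick; the column names, widths, and values are made-up examples rather than anything taken from SimpleDB:

```java
// Demonstrates building fixed-width format strings at run time,
// as SQLInterpreter.doQuery does with getColumnDisplaySize().
public class FormatDemo {
   // Formats a value into a column of the given width;
   // "%6d" / "%6s" right-justify within 6 characters.
   static String cell(int width, Object value, boolean isInt) {
      String fmt = "%" + width + (isInt ? "d" : "s");
      return String.format(fmt, value);
   }

   public static void main(String[] args) {
      // A made-up two-column layout: sid (int, width 6), sname (string, width 10)
      String header = cell(6, "sid", false) + cell(10, "sname", false);
      String row    = cell(6, 42, true)     + cell(10, "joe", false);
      System.out.println(header);
      System.out.println(row);
   }
}
```

Because "%Ns" and "%Nd" both pad on the left by default, the header produced by the first loop lines up with the record rows produced by the second.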

SimpleDB source file simpledb/server/Startup.java

• Here is the SimpleDB server startup code.

  1. First the data structures of this server process are initialized.

  2. Then its RemoteDriver implementation is added to the RMI registry running
     on this same server machine under the service name "simpledb".

  3. When a client calls this server machine with the service name "simpledb",
     RMI redirects its Connection into a new thread within this server process.

package simpledb.server;

import simpledb.remote.*;
import java.rmi.*;

public class Startup {
   public static void main(String args[]) throws Exception {
      // configure and initialize the database
      SimpleDB.init(args[0]);

      // post the server entry in the rmi registry
      RemoteDriver d = new RemoteDriverImpl();
      Naming.rebind("simpledb", d);

      System.out.println("database server ready");
   }
}

SimpleDB source file simpledb/server/SimpleDB.java

package simpledb.server;

import simpledb.file.FileMgr;
import simpledb.buffer.*;
import simpledb.tx.Transaction;
import simpledb.log.LogMgr;
import simpledb.metadata.MetadataMgr;
import simpledb.planner.*;
import simpledb.opt.HeuristicQueryPlanner;
import simpledb.index.planner.IndexUpdatePlanner;

/**
 * The class that provides system-wide static global values.
 * These values must be initialized by the method
 * {@link #init(String) init} before use.
 * The methods {@link #initFileMgr(String) initFileMgr},
 * {@link #initFileAndLogMgr(String) initFileAndLogMgr},
 * {@link #initFileLogAndBufferMgr(String) initFileLogAndBufferMgr},
 * and {@link #initMetadataMgr(boolean, Transaction) initMetadataMgr}
 * provide limited initialization, and are useful for
 * debugging purposes.
 * @author Edward Sciore
 */
public class SimpleDB {
   public static int BUFFER_SIZE = 8;
   public static String LOG_FILE = "simpledb.log";

   private static FileMgr     fm;
   private static BufferMgr   bm;
   private static LogMgr      logm;
   private static MetadataMgr mdm;

   /**
    * Initializes the system.
    * This method is called during system startup.
    * @param dirname the name of the database directory
    */
   public static void init(String dirname) {
      initFileLogAndBufferMgr(dirname);
      Transaction tx = new Transaction();
      boolean isnew = fm.isNew();
      if (isnew)
         System.out.println("creating new database");
      else {
         System.out.println("recovering existing database");
         tx.recover();
      }
      initMetadataMgr(isnew, tx);
      tx.commit();
   }

   // The following initialization methods are useful for
   // testing the lower-level components of the system
   // without having to initialize everything.

   /**
    * Initializes only the file manager.
    * @param dirname the name of the database directory
    */
   public static void initFileMgr(String dirname) {
      fm = new FileMgr(dirname);
   }

   /**
    * Initializes the file and log managers.
    * @param dirname the name of the database directory
    */
   public static void initFileAndLogMgr(String dirname) {
      initFileMgr(dirname);
      logm = new LogMgr(LOG_FILE);
   }

   /**
    * Initializes the file, log, and buffer managers.
    * @param dirname the name of the database directory
    */
   public static void initFileLogAndBufferMgr(String dirname) {
      initFileAndLogMgr(dirname);
      bm = new BufferMgr(BUFFER_SIZE);
   }

   /**
    * Initializes the metadata manager.
    * @param isnew an indication of whether a new
    *        database needs to be created.
    * @param tx the transaction performing the initialization
    */
   public static void initMetadataMgr(boolean isnew, Transaction tx) {
      mdm = new MetadataMgr(isnew, tx);
   }

   public static FileMgr     fileMgr()   { return fm; }
   public static BufferMgr   bufferMgr() { return bm; }
   public static LogMgr      logMgr()    { return logm; }
   public static MetadataMgr mdMgr()     { return mdm; }

   /**
    * Creates a planner for SQL commands.
    * To change how the planner works, modify this method.
    * @return the system's planner for SQL commands
    */
   public static Planner planner() {
      QueryPlanner qplanner = new BasicQueryPlanner();
      UpdatePlanner uplanner = new BasicUpdatePlanner();
      return new Planner(qplanner, uplanner);
   }
}

5 Indexing
(Sciore, 2008, Chapters 6.3 and 21)

• The basic SimpleDB design described in section 4 did not include indexes although
they are a central part of any realistic RDBMS – without them, processing larger
databases would soon become infeasible.

• RDBMS indexes use data structures similar to in-RAM dictionaries:

  hashing and/or
  search trees.

• Their basic ideas are already familiar from the courses “Data Structures I&II” (”Tie-
torakenteet I&II” (TRAI&II) in Finnish).

• However, here we must take into account that the storage medium is not RAM but
  a disk – which is much slower and is accessed in big blocks.

• The index for an Attribute A of a stored RDBMS Table T (A, . . .) is a disk-based

  dictionary

  from key values k of the same type as A

  into RIDs within T whose Field A has the same value as this given key k.

  Let us call such a key-RID pair an index record.

• This index is

unique if there is at most 1 RID for each key, and


nonunique if there can be more than 1.

• In general, RDBMSs also support indexes on many Attributes A1, A2, A3, . . . , Am of
  T (A1, A2, A3, . . . , Am, . . .).

  – Then the key values can be thought to be m-tuples

    ⟨k1, k2, k3, . . . , km⟩ ∈ A1 × A2 × A3 × · · · × Am.

  – However, SimpleDB supports only m = 1.
  – These lectures concentrate on m = 1 too, for simplicity.

• An RDBMS builds a unique index for the chosen primary key Attributes of each
NF1 stored Table.

• In particular, if the Attribute B of another stored Table U (B, . . .) is defined to be a
  foreign key of T , then

  – the row of T corresponding to the row r of U can be fetched quickly by asking
    for the RID corresponding to the key value r.B, and
  – requirement 5 ensures that this RID does indeed exist.

• SQL permits the user to CREATE other INDEXes too.


They can in turn be used for speeding up other connections between Tables in this
way.

• However, this requires that the Planner is aware of these indexes.

             Hashing                                   Balanced search trees

Applies to   Those index record types whose key        All kinds of keys.
             component type is either already an
             integer or is easy to transform into
             an integer.

Unique       Works the same way for both unique        Works well for unique indexes. Can be
             and nonunique indexes.                    extended for nonunique indexes, but
                                                       they need more work.

Order        Does not take the ordering of the         Keeps the index records ordered
             data into account.                        according to their keys.

Speed        Quite fast but unpredictable:             Slightly slower but predictable:
             operations usually run in about           each operation is guaranteed to take
             constant time – but sometimes they        logarithmic time.
             reorganize the whole hash table, and
             this takes linear time wrt. the number
             of index records stored.
             Some key combinations can even cause
             bad performance for most operations –
             because they cause many collisions.

Size         Very little extra storage required        Each index record requires several
             per index record stored on disk.          bytes of extra storage – for the
                                                       pointers which hold the tree together.

Table 5: Hashing vs. search trees.

• Table 5 summarizes the differences between typical hashing and search trees.

• If an RDBMS has only one kind of index, then it is usually search-tree-based.

• SimpleDB does the opposite: it provides hash indexes by default, but search trees
must be turned on separately.

SimpleDB source file simpledb/index/Index.java

• This SimpleDB interface for Indexes follows the same beforeFirst. . . next. . . access
  pattern as its Scans.

• However, here the beforeFirst method takes an argument:


The key value k to search for in this index.

• Similarly to selection Scans, here the next method moves to the next index record
with this key value k, or returns false if there are no (more) such index records.

• There is a method for getting the RID of the current index record. . .

• . . . but no method for getting its key, because that would always be k.

package simpledb.index;

import simpledb.record.RID;
import simpledb.query.Constant;

/**
 * This interface contains methods to traverse an index.
 * @author Edward Sciore
 */
public interface Index {

   /**
    * Positions the index before the first record
    * having the specified search key.
    * @param searchkey the search key value.
    */
   public void beforeFirst(Constant searchkey);

   /**
    * Moves the index to the next record having the
    * search key specified in the beforeFirst method.
    * Returns false if there are no more such index records.
    * @return false if no other index records have the search key.
    */
   public boolean next();

   /**
    * Returns the dataRID value stored in the current index record.
    * @return the dataRID stored in the current index record.
    */
   public RID getDataRid();

   /**
    * Inserts an index record having the specified
    * dataval and dataRID values.
    * @param dataval the dataval in the new index record.
    * @param datarid the dataRID in the new index record.
    */
   public void insert(Constant dataval, RID datarid);

   /**
    * Deletes the index record having the specified
    * dataval and dataRID values.
    * @param dataval the dataval of the deleted index record
    * @param datarid the dataRID of the deleted index record
    */
   public void delete(Constant dataval, RID datarid);

   /**
    * Closes the index.
    */
   public void close();
}
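The beforeFirst. . . next. . . getDataRid access pattern above can be illustrated with a toy in-memory stand-in for this interface. ToyIndex, with its plain int keys and RIDs, is invented for this sketch and is far simpler than SimpleDB's disk-based, Constant-typed implementations:

```java
import java.util.*;

// A minimal in-memory stand-in for SimpleDB's Index interface,
// illustrating the beforeFirst...next...getDataRid access pattern.
public class ToyIndex {
   // each index record is a (key, rid) pair of ints
   private final List<int[]> records = new ArrayList<>();
   private int searchkey;
   private int pos;

   public void insert(int key, int rid) { records.add(new int[]{key, rid}); }

   // Position before the first record with the given search key.
   public void beforeFirst(int key) { searchkey = key; pos = -1; }

   // Advance to the next record with the saved key; false when exhausted.
   public boolean next() {
      for (pos++; pos < records.size(); pos++)
         if (records.get(pos)[0] == searchkey)
            return true;
      return false;
   }

   public int getDataRid() { return records.get(pos)[1]; }

   public static void main(String[] args) {
      ToyIndex idx = new ToyIndex();
      idx.insert(7, 100);
      idx.insert(3, 200);
      idx.insert(7, 300);   // nonunique: key 7 occurs twice

      // the standard lookup loop
      idx.beforeFirst(7);
      while (idx.next())
         System.out.println("RID " + idx.getDataRid());
   }
}
```

Note that, as in the real interface, there is no method for reading the key of the current record: inside the loop it is always the key passed to beforeFirst.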

5.1 Extendable Hashing
(Elmasri and Navathe, 2011, Chapter 16.8.3), (Sciore, 2008, Chapters 21.2–21.3)
• Now we study extendable hashing as an index implementation technique.

• We assume that we have a function hash(k) which maps each key value k into a
  “small” (32-bit, say) unsigned integer.
  (If the keys k are already such integers, then this function is not needed.)

• Recall the basic idea of hashing:

  – The hash table is an array of buckets.
  – An index record x is stored in its bucket number hash(x.key).
    If this number would be too large, then it is truncated to fit into the table.
  – Then each bucket should store only a few index records, so that simple linear
    search within it would be fast enough.
    On disk, each bucket should occupy only one Block.

• An extendable hash table consists of 2 files:

  Bucket directory with

  – its current directory.globalDepth, and
  – an array directory.bucket[0 . . . 2^directory.globalDepth − 1] of
    2^directory.globalDepth disk Block pointers into the other file.
    They point to the actual buckets.
    This hashing is extendable, because it can grow as needed.

  Bucket file whose bucket b has

  – a directory.bucket[b].localDepth ≤ directory.globalDepth.
  – an array directory.bucket[b].slot[0 . . .] of index entries. It is long enough
    to fill this disk Block, so its length depends on their size.
  – a disk Block pointer directory.bucket[b].overflow to its overflow chain.
    Each Block in this chain is also in this bucket file, and is otherwise similar
    but does not have the localDepth field.

• Figure 81 shows an example, where

  – each bucket file Block can contain up to 3 index records
  – the directory has 8 buckets – so globalDepth = 3 bits
  – their localDepths are 2, 1, and 2 bits
  – hash(SId) = SId mod 8

• Note that many directory entries share the same bucket to save disk space:

  – Each bucket b stores those index records r whose hash(r.key) has the same
    directory.bucket[b].localDepth lowest bits as b.
  – For instance, bucket 0 stores those index records where these 2 lowest bits are
    . . . 00, bucket 1 with . . . 1, and bucket 2 with . . . 10.
Figure 81: An example of an extendable hash table. (Sciore, 2008)

• Finding the first index entry with the given key value k is:

  1  b = the directory.globalDepth lowest bits of hash(k);
  2  c = bucket file block number directory.bucket[b];
  3  i = 0;
  4  found = false;
  5  while not found and c ≠ NoBlock
  6      if i is so big that c has no slot[i]
  7          c = c.overflow;
  8          i = 0
  9      elseif slot[i].RID = NoRID or slot[i].key ≠ k
  10         i = i + 1
  11     else found = true.

  NoBlock is the number of a block which cannot exist – a “NULL pointer” on disk.
  NoRID is the RID of a Record which cannot exist – it marks an unused slot.

  (They could be for instance −1.)
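Steps 1-2 of this lookup amount to a bit mask followed by an array access. The minimal sketch below uses an in-memory int array in place of the real on-disk directory file; the array contents follow the bucket-sharing pattern of the example above:

```java
// Sketches steps 1-2 of the extendable-hashing lookup:
// truncate hash(k) to its globalDepth lowest bits, then use
// the result to index the bucket directory.
public class DirLookup {
   // Keeps only the globalDepth lowest bits of the hash value.
   static int lowestBits(int hash, int globalDepth) {
      return hash & ((1 << globalDepth) - 1);
   }

   public static void main(String[] args) {
      int globalDepth = 3;
      // bucket[b] = block number of the bucket for directory entry b;
      // entries ...00 share block 0, ...1 share block 1, ...10 share block 2
      int[] bucket = {0, 1, 2, 1, 0, 1, 2, 1};
      int k = 13;                          // with hash(k) = k for simplicity
      int b = lowestBits(k, globalDepth);  // 13 = 0b1101, so b = 0b101 = 5
      System.out.println("directory entry " + b + " -> block " + bucket[b]);
   }
}
```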

• Inserting a new index record x is:

  1  repeat
  2      b = the directory.globalDepth lowest bits of hash(x.key);
  3      c = bucket file block number directory.bucket[b];
  4      if c or its overflow chain has an unused slot
  5          store x there
  6      elseif c.localDepth < directory.globalDepth and
             it is OK to split this bucket c
  7          d = the c.localDepth lowest bits of b;
  8          c.localDepth = c.localDepth + 1;
  9          c′ = a new bucket with the same localDepth as c;
  10         for every other bucket number b′ such that directory.bucket[b′] = c
  11             directory.bucket[b′] = c′;
  12         rehash all the index records in c
  13     elseif c.localDepth = directory.globalDepth and
             it is OK to double the directory
  14         directory.globalDepth = directory.globalDepth + 1;
  15         double the length of the bucket array;
  16         fill its new half with a copy of its old half
  17     else add a Block into the overflow chain of c;
  18         store x
  19 until x has been stored.

• Line 10. . .

  – considers those directory.buckets which point to this old bucket c
  – redirects the 2nd, 4th, 6th, . . . of them into pointing to this new bucket c′
    instead
  – can be optimized with suitable bit arithmetic:
    b′ = d′1d for d′ = 0, 1, 2, . . . .

• Line 12 splits the index records r in c so that if hash(r.key) ends in the bit pattern. . .

  . . . 1d then r goes into the new bucket (or into its overflow chain, if necessary)
  . . . 0d then r stays in c

  which may permit shortening the overflow chain of c.

• In theory, it is always OK to split on line 6 and to double on line 13.

• In practice, they can be used to fine-tune this basic algorithm.
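The bit arithmetic b′ = d′1d behind line 10 can be checked with a small program. The helper redirected below is invented for this illustration: it enumerates the directory entries whose binary form is d′ followed by 1 followed by the old bucket's d bits, which are exactly the entries that must be redirected to the new bucket:

```java
// Enumerates the directory entries b' = d'1d (in binary) that are
// redirected to the new bucket when bucket d splits. Here localDepth
// is the bucket's depth *before* the split, so d has localDepth bits.
public class SplitDemo {
   static java.util.List<Integer> redirected(int d, int localDepth, int globalDepth) {
      java.util.List<Integer> result = new java.util.ArrayList<>();
      for (int dPrime = 0; ; dPrime++) {
         // b' = d' shifted past the new "1" bit and the old d bits
         int b = (dPrime << (localDepth + 1)) | (1 << localDepth) | d;
         if (b >= (1 << globalDepth))       // past the end of the directory
            break;
         result.add(b);
      }
      return result;
   }

   public static void main(String[] args) {
      // Matches the example's Figure 82(c): with globalDepth 3, bucket 0
      // of localDepth 1 splits, so entries ...10 (i.e. 2 and 6) move over.
      System.out.println(redirected(0, 1, 3));   // [2, 6]
   }
}
```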
• Figure 82 shows how this data structure could have evolved into Figure 81:

  1. In the beginning (not shown), directory.globalDepth is set so that the array
     fills the only block of the directory file as well as possible.
     This example does not involve doubling the directory.

  2. In (a), the only block of the bucket file is now full.
     The ‘L’ marks its localDepth, which starts out as 0.

  3. In (b), this only bucket 0 has been split into 2 buckets 0 and 1.
     Note how every other 0 turns into 1 in the directory.

  4. In (c), bucket 0 splits again into 0 and 2.
     Note how every other remaining 0 turns into 2 in the directory.

Figure 82: How Figure 81 could have been built. (Sciore, 2008)

• Splitting the current bucket c is almost always OK on line 6.

  – The unlikely exception is when every index record in c has the same hash value
    as x.
  – Then they will always stay together after every split, and all we get is a much
    larger directory.

• Similarly, doubling the current directory on line 13 might not be OK, if its file is
  already large.

• In these situations the insertion algorithm extends the overflow chain of c, even
  though it slows down the performance.

• Deletion of an index record r could be a variant of finding x:

  1  b = the directory.globalDepth lowest bits of hash(r.key);
  2  c = bucket file block number directory.bucket[b];
  3  i = 0;
  4  found = false;
  5  while not found and c ≠ NoBlock
  6      if i is so big that c has no slot[i]
  7          c = c.overflow;
  8          i = 0
  9      elseif slot[i].RID ≠ r.RID or slot[i].key ≠ r.key
  10         i = i + 1
  11     else found = true;
  12 if found
  13     mark it deleted with slot[i].RID = NoRID.

  However, this does not try to shorten the overflow chain of c if now possible – but
  the insertion algorithm extended it only in rare situations, so this might be enough.

SimpleDB source file simpledb/index/hash/HashIndex.java

• Here is the SimpleDB hash index implementation.

• It is not extendable:
Instead, it allocates a fixed number of buckets as its directory, which it does not
double.

• Hence its performance does not scale well when the number of index records to store
grows.

• Index operations take place within a transaction tx so that they can be ABORTed
or recovered if needed.

• Indexes also have a function searchCost which estimates the number of disk Blocks
  read when this index is used for looking up the RID corresponding to a given key.

• This function is used by the blocksAccessed function of the Index Metadata Man-
ager to calculate the I/O costs of using this index.

package simpledb.index.hash;

import simpledb.tx.Transaction;
import simpledb.record.*;
import simpledb.query.*;
import simpledb.index.Index;

/**
 * A static hash implementation of the Index interface.
 * A fixed number of buckets is allocated (currently, 100),
 * and each bucket is implemented as a file of index records.
 * @author Edward Sciore
 */
public class HashIndex implements Index {
   public static int NUM_BUCKETS = 100;
   private String idxname;
   private Schema sch;
   private Transaction tx;
   private Constant searchkey = null;
   private TableScan ts = null;

   /**
    * Opens a hash index for the specified index.
    * @param idxname the name of the index
    * @param sch the schema of the index records
    * @param tx the calling transaction
    */
   public HashIndex(String idxname, Schema sch, Transaction tx) {
      this.idxname = idxname;
      this.sch = sch;
      this.tx = tx;
   }

   /**
    * Positions the index before the first index record
    * having the specified search key.
    * The method hashes the search key to determine the bucket,
    * and then opens a table scan on the file
    * corresponding to the bucket.
    * The table scan for the previous bucket (if any) is closed.
    * @see simpledb.index.Index#beforeFirst(simpledb.query.Constant)
    */
   public void beforeFirst(Constant searchkey) {
      close();
      this.searchkey = searchkey;
      int bucket = searchkey.hashCode() % NUM_BUCKETS;
      String tblname = idxname + bucket;
      TableInfo ti = new TableInfo(tblname, sch);
      ts = new TableScan(ti, tx);
   }

   /**
    * Moves to the next record having the search key.
    * The method loops through the table scan for the bucket,
    * looking for a matching record, and returning false
    * if there are no more such records.
    * @see simpledb.index.Index#next()
    */
   public boolean next() {
      while (ts.next())
         if (ts.getVal("dataval").equals(searchkey))
            return true;
      return false;
   }

   /**
    * Retrieves the dataRID from the current record
    * in the table scan for the bucket.
    * @see simpledb.index.Index#getDataRid()
    */
   public RID getDataRid() {
      int blknum = ts.getInt("block");
      int id = ts.getInt("id");
      return new RID(blknum, id);
   }

   /**
    * Inserts a new record into the table scan for the bucket.
    * @see simpledb.index.Index#insert(simpledb.query.Constant, simpledb.record.RID)
    */
   public void insert(Constant val, RID rid) {
      beforeFirst(val);
      ts.insert();
      ts.setInt("block", rid.blockNumber());
      ts.setInt("id", rid.id());
      ts.setVal("dataval", val);
   }

   /**
    * Deletes the specified record from the table scan for
    * the bucket.  The method starts at the beginning of the
    * scan, and loops through the records until the
    * specified record is found.
    * @see simpledb.index.Index#delete(simpledb.query.Constant, simpledb.record.RID)
    */
   public void delete(Constant val, RID rid) {
      beforeFirst(val);
      while (next())
         if (getDataRid().equals(rid)) {
            ts.delete();
            return;
         }
   }

   /**
    * Closes the index by closing the current table scan.
    * @see simpledb.index.Index#close()
    */
   public void close() {
      if (ts != null)
         ts.close();
   }

   /**
    * Returns the cost of searching an index file having the
    * specified number of blocks.
    * The method assumes that all buckets are about the
    * same size, and so the cost is simply the size of
    * the bucket.
    * @param numblocks the number of blocks of index records
    * @param rpb the number of records per block (not used here)
    * @return the cost of traversing the index
    */
   public static int searchCost(int numblocks, int rpb) {
      return numblocks / HashIndex.NUM_BUCKETS;
   }
}
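The bucket arithmetic used by HashIndex above can be sketched in isolation. This is a minimal illustration, not SimpleDB code: the class and method names are invented, and Math.floorMod is used where HashIndex uses a plain %, so that the sketch also works for negative hash codes.

```java
// A minimal sketch of static hashing as used by HashIndex above.
// The class and method names here are illustrative, not part of SimpleDB.
public class StaticHashSketch {
    public static final int NUM_BUCKETS = 100;

    // Maps a search key to the file holding its bucket, like
    // "idxname + bucket" in HashIndex.beforeFirst().
    // (HashIndex uses a plain %; floorMod keeps the sketch safe
    //  for keys whose hashCode happens to be negative.)
    public static String bucketFile(String idxname, Object searchkey) {
        int bucket = Math.floorMod(searchkey.hashCode(), NUM_BUCKETS);
        return idxname + bucket;
    }

    // Same cost estimate as HashIndex.searchCost: a lookup scans one
    // bucket, which holds about 1/NUM_BUCKETS of the index blocks.
    public static int searchCost(int numblocks) {
        return numblocks / NUM_BUCKETS;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself, so key 41 lands in bucket 41.
        System.out.println(bucketFile("studentid", 41)); // studentid41
        System.out.println(searchCost(1000));            // 10 blocks per lookup
    }
}
```

Note how the cost estimate ignores the key entirely: every lookup pays roughly the same price, which is why static hashing answers equality queries well but cannot help with range queries.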

5.2 B+ -trees
For B-trees, see Cormen et al. (2009, Chapter 18) or Elmasri and Navathe (2011, Chap-
ter 17.3.1). For B+ -trees, see Elmasri and Navathe (2011, Chapter 17.3.2) or Sciore (2008,
Chapter 21.4).
• Table 6 summarizes the differences between RAM-based and disk-based search trees.

• RAM-based width balanced trees have also been developed, such as 2-3 and 2-3-4
trees, where these numbers tell how many subtrees they permit.
However, height-balanced trees are preferred to them in RAM.

• B+ -trees are the most popular tree-based index implementation data structure in
RDBMSs.

• They are often called just B-trees for simplicity, but this is slightly inaccurate:

          in RAM                                  on disk
Node      Contains 1 key value and 2 pointers     Contains many more than 1 key, and
          to (possibly empty) subtrees.           1 more subtree pointer than keys.
          Designed to be small to save RAM.       Designed to fill a Block to save I/O.
Keys      The node.key redirects each operation   Similarly, the node.keys redirect each
          to the node.left or node.right          operation to the appropriate subtree.
          subtree, as appropriate, based on
          the input key.
Balance   By subtree height: for instance, AVL    By node width: each subtree has
          trees require that the heights of the   exactly the same height. This is
          2 subtrees of a node differ from each   achieved by allowing different nodes
          other by at most 1.                     to have very different numbers of
                                                  actual subtrees.
Speed     Balance ensures that operations take    Same logarithmic time, but here the
          logarithmic time wrt. the number of     branching factor of the tree is > 2,
          index records stored in the tree.       and so is the base of the logarithm.

Table 6: RAM vs. disk search trees.

B-trees were discovered first.
   – They are endogenous: their internal nodes contain RIDs too.
   – Here a node is a leaf, if it has no subtrees.
B+ -trees are their modification.
   – Here there are 2 separate kinds of nodes: leaf and internal (= non-leaf)
     nodes.
   – They are exogenous: only their leaf nodes contain RIDs.
     Their internal nodes contain only keys, whose job is only to guide the
     operations into the correct leaf.
   – This modification makes them better suited for DBMSs, because it
     optimizes search operation I/O.

• Consider first for simplicity unique B+ -tree indexes.

• These 2 kinds of B+ -tree nodes contain the following.

Leaf:
   – An array node.slot[1. . . ] of index records.
     ∗ Its length is chosen so that the disk Block is as fully used as possible –
       to maximize I/O utilization.
     ∗ This length depends on how much space must be reserved for each
       key in this Block.
   – The counter node.last indicating that only the prefix node.slot[1. . . node.last]
     is currently used, while the suffix node.slot[node.last+1. . . ] is still unused.
   – A disk Block pointer node.next to the next leaf (if any).
     In theory these are not needed; in practice they are extremely useful in
     many situations, and are therefore included.
Internal:
– An array node .key[1. . . ] of keys.

Figure 83: An example B+ -tree. (Sciore, 2008)

   – Another array node.subtree[0. . . ] of disk Block pointers to subtrees.
   – Their lengths are again chosen to fill a disk Block depending on key size.
   – The same counter node.last indicating that the prefix node.key[1. . . node.last]
     is currently used with the rest unused.
     Then also only the prefix node.subtree[0. . . node.last] is used, but the rest
     is unused.
• These array lengths can easily be ≈ 100 on today’s disks with their big Block s.

• Figure 83 shows an example of a B+ -tree of height 1.

(a) shows a sorted file of index records, where
      the keys are the SName attribute values, and
      their RIDs point to their corresponding STUDENT Table records.
    The B+ -tree can be thought of as providing a “directory” for searching this
    “phone book” file.
(b) shows a first attempt for such a directory:
    a disk Block of index records, where
      the key is the first (= smallest) key in a Block of the sorted file (a), and
      its RID is the number of this Block.
    It helps us locate the STUDENT record corresponding to a given name n
    quickly:
    1. Find in this Block (b) the last record r whose r.key ≤ n by binary search.
    2. Move directly to the correct Block of (a) using this r.RID found in (b).
    3. Find in that Block of (a) the record s whose s.key = n, again by binary
       search.
    4. Move directly to the correct record in the STUDENT Table using this
       s.RID found in (a).
(c) shows the abstraction of this directory Block (b) into a B+ -tree internal node
    on top of the leaf nodes from (a):
    – The RIDs of (b) are now shown as arrows/pointers.
    – The 1st key (here ’Amy’) can be omitted, because we know that if n <
      the 2nd key (here ’Bob’) then n must be in the 1st subtree.
    – This leads to the idea that there is 1 more subtree pointer than keys.
    – In addition its leaf nodes from (a) would be linked together into an ordered
      chain of next pointers (not shown).

Key Order Condition and Lookup

• In 1-key RAM-based search trees, we required that

      (the largest key in the node.left subtree)
         < (the node.key between them)
         < (the smallest key in the node.right subtree).        (25)

• Eq. (25) takes the following form in the internal nodes of B+ -trees:

      (the largest key in the preceding node.subtree[i − 1])
         < (the node.key[i] between them)
         ≤ (the largest key in the following node.subtree[i])        (26)

  for every i = 1, 2, . . . , node.last currently used.

• By Eq. (26),

      node.key[1 . . . node.last]        (27)

  is in strictly ascending order.

• The same holds also for leaf nodes: the keys

      node.slot[1 . . . node.last].key        (28)

  are in strictly ascending order too.

• The algorithm to look up a key k from a B+ -tree is:

  1 node = the root of the B+ -tree;
  2 while this node is internal
  3     determine (with binary search, by Eq. (27))
            the only node.subtree[child] where this key k could be;
  4     node = child;
  5 determine (with binary search, by Eq. (28))
        whether key k appears in this leaf node or not;
  6 if it does
  7     return the corresponding RID
  8 else return the constant NoRID.
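The subtree-selection step on line 3 can be sketched on its own. The sketch below is illustrative (the class name and the tiny two-key node are invented, not SimpleDB code): by Eq. (26), subtree[i] holds exactly the keys x with node.key[i] ≤ x < node.key[i+1], so the child to descend into is the one indexed by the number of node keys that are ≤ k.

```java
import java.util.Arrays;

// A sketch of line 3 of the lookup algorithm: given the sorted key array
// of an internal node with keys.length + 1 subtrees, pick the only
// subtree index (0 .. keys.length) that can contain k under Eq. (26).
public class ChildSearchSketch {
    public static int childIndex(String[] keys, String k) {
        int pos = Arrays.binarySearch(keys, k);
        if (pos >= 0)
            return pos + 1;   // exact match: key[pos] <= k, so go right of it
        return -pos - 1;      // insertion point = number of keys < k
    }

    public static void main(String[] args) {
        String[] keys = {"bob", "max"};  // an internal node with 3 subtrees
        System.out.println(childIndex(keys, "amy")); // 0
        System.out.println(childIndex(keys, "bob")); // 1, since "bob" <= "bob"
        System.out.println(childIndex(keys, "zoe")); // 2
    }
}
```

The whole lookup then just repeats this step from the root down, with one binary search (and, on disk, one Block read) per tree level.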

Balance Condition and Insertion
• Recall that the height of
    a leaf node in a tree is 0, and
    an internal node is 1 + the maximum of the heights of its subtrees.

• The balance condition of B+ -trees is:

  – All the subtrees of an internal node have exactly the same height.
  – Every non-root node is at least half full:

        node.last ≥ (how many keys would fit into this node) / 2.        (29)

  – If the root is internal, then its root.last ≥ 1.

  Hence a B+ -tree is balanced by keeping (the disk Blocks storing) its non-root nodes
  between half and totally full.

• In particular, the initially empty B+ -tree is just a leaf node as its root, and it has

      root.last = 0   and   root.next = NoBlock.

• This balance condition is maintained by the algorithm which inserts a given index
record into a given B+ -tree.

• Describing this algorithm is simpler, if we assume that each node can in fact become
  overfull while it is in RAM:

  – Then it has 1 more key and RID/subtree than would fit into its disk Block.
  – When this node gets written back into its disk Block, it will no longer be
    overfull – because the algorithm will have rebalanced the B+ -tree first.
  – We leave the details to the programmer. . .

• The central insertion algorithm is recursive. Its call insert(r, T ) returns. . .

either OK if it could insert this new index record r into the B+ -(sub)tree T without
its height growing – that is, it will modify its parameter T
(which some consider to be a bad programming habit, but here we are trying
to save disk Block s)
or a pair ⟨m, U⟩ if this could not be done.
   – Instead, consider a new tree V whose root has just 1 key m with the
     modified T as its left and this new U as its right subtree.
   – This new V would be a correct B+ -tree for r and the index records originally
     in T .

– However, the height of this new V would also be 1+ the height of the
original T . . .

This fairly elaborate explanation of its return value is needed for arguing that this
recursive algorithm is correct.

• If this return value is a pair ⟨m, U⟩ and its caller was. . .

  the root node, then the required rebalancing is easy:

      1 if insert(r, root) = ⟨m, U⟩
      2     root = this new node V .

    – This is the main insertion subroutine, which begins the recursion.
    – Here we can use this new tree V directly, because there are no other subtrees
      whose heights we must take into account.
    – B+ -trees grow higher this way at the root.
      (In contrast, height-balanced RAM-based search trees grow higher at their
      leaves instead.)
    – In practice, we must know where in the file this root is, and the easiest
      way is to keep it always in Block 0.
      This requires some more copying in this then branch.

  an internal node, then we cannot use V directly, because T has siblings which still
  have their original height.

    1. Instead, we splice m and U into this node.
    2. If this node now overflows (that is, if it was already full before this
       insertion) then we split it into 2 half full nodes – and we must return
       another pair ⟨m′, U′⟩ to the caller of this caller.

  a leaf node, then we can use this same splice-split approach.

Figure 84: A B+ -tree with height 2. (Sciore, 2008)

insert(r, node):
 1 if this node is internal
 2    determine (by binary search) the only
          node.subtree[child] where r.key could be;
 3    if insert(r, child) returned ⟨m, U⟩
 4       splice m between node.key[1 . . . child] and node.key[child + 1 . . .];
 5       splice U into the corresponding position within node.subtree;
 6       if this splicing overflowed this node
 7          U′ = a new initially empty internal node;
 8          move the top half of the node.subtree array
                and the node.keys between them into U′;
 9          m′ = detach the last node.key which was not moved
                (and which can no longer stay in node);
10          return ⟨m′, U′⟩
11       else return OK
12    else return OK
13 else determine (by binary search) whether r.key appears in some node.slot;
14    if it does
15       change the RID of that slot into r.RID;
16       return OK
17    else splice r into its correct place within node.slot;
18       if this splicing overflowed this node
19          U′ = a new initially empty leaf node;
20          U′.next = node.next;
21          node.next = U′;
22          move the top half of the node.slot array into U′;
23          m′ = U′.slot[1].key;
24          return ⟨m′, U′⟩
25       else return OK.
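The leaf-node splice-split of lines 17–24 can be sketched with an in-memory sorted list standing in for a disk Block. This is an illustrative sketch under invented assumptions (the class name, the CAPACITY of 4 keys, and returning the new upper leaf whose first element plays the role of m′ are all choices made here, not SimpleDB code).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A sketch of the leaf splice-split (lines 17-24 of insert above),
// using a sorted in-memory list instead of a disk Block.
public class LeafSplitSketch {
    static final int CAPACITY = 4;  // how many keys fit into one "Block"

    // Splices key into the sorted leaf. If the leaf overflows, moves its
    // top half into a new leaf and returns that leaf; its first element
    // is the separator key m' that gets spliced into the parent.
    // Returns null ("OK") when no split was needed.
    public static List<String> insert(List<String> leaf, String key) {
        int pos = Collections.binarySearch(leaf, key);
        if (pos < 0) pos = -pos - 1;
        leaf.add(pos, key);                       // splice into sorted place
        if (leaf.size() <= CAPACITY)
            return null;                          // no overflow: "return OK"
        int mid = leaf.size() / 2;                // split: detach the top half
        List<String> upper = new ArrayList<>(leaf.subList(mid, leaf.size()));
        leaf.subList(mid, leaf.size()).clear();
        return upper;
    }

    public static void main(String[] args) {
        List<String> leaf = new ArrayList<>(List.of("amy", "bob", "eli", "max"));
        List<String> newLeaf = insert(leaf, "hal"); // 5 keys > CAPACITY: split
        System.out.println(leaf);                   // [amy, bob]
        System.out.println(newLeaf);                // [eli, hal, max], m' = "eli"
    }
}
```

Both halves of the split satisfy the half-full condition of Eq. (29), which is exactly why the split keeps the tree balanced.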

• Figure 84 shows an example B+ -tree. Assume that all its leaf nodes are already full.

• Figure 85 then shows what happens when a new key “hal” is inserted:

  1. It goes into the leaf starting with “eli”. . .

Figure 85: Splitting a leaf node. (Sciore, 2008)

Figure 86: Splitting an internal node. (Sciore, 2008)

  2. This leaf splits into 2 leaf nodes. The new leaf node starts with “jim”. . .
  3. This starting key gets copied into its parent.
  4. This parent still has space for it, but becomes full too.

• Figure 86 shows what happens when another new key “zoe” is inserted:

  1. It goes into the leaf starting with “sue”. . .
  2. This leaf splits into 2 leaf nodes. The new leaf node starts with “tom”. . .
  3. This starting key gets copied into its parent. . .
  4. . . . but its parent is now full, and splits too.

• The grandparent is also full, and so it must split too.


Because it is the root of the whole tree, it grows in height (by 1) as shown in
Figure 87.

• These B+ -trees scale very well when the amount of data grows:

  – Such a tree can store up to

        (how many subtrees fit into an internal node)^(its height)
            · (how many slots fit into its leaf)

    index records.

Figure 87: Splitting the root node. (Sciore, 2008)

  – On today’s disks with their big Blocks, these internal nodes can have over 100
    subtrees.
  – Then even the largest indexes have a height of only about 6.
  – For instance, the lookup and insertion algorithms access only height + 1 disk
    Blocks.
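The capacity formula above is easy to check numerically. The sketch below is a back-of-the-envelope illustration; the fanout of 100 and the 100 records per leaf are assumed example figures, not properties of any particular DBMS.

```java
// A back-of-the-envelope check of the B+-tree capacity formula:
// fanout^height root-to-leaf paths, each ending in a leaf of leafSlots records.
public class BTreeCapacitySketch {
    public static long capacity(long fanout, int height, long leafSlots) {
        long leaves = 1;
        for (int i = 0; i < height; i++)
            leaves *= fanout;        // number of leaves = fanout^height
        return leaves * leafSlots;
    }

    public static void main(String[] args) {
        // With fanout 100 and 100 records per leaf, height 4 already
        // indexes 10 billion records, while a lookup reads only
        // height + 1 = 5 disk Blocks.
        System.out.println(capacity(100, 4, 100)); // 10000000000
    }
}
```

This is the practical point of the wide, flat shape of B+ -trees: capacity grows exponentially in the height, so the I/O cost of a lookup stays essentially constant as the data grows.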

On Deletion

• Usually an RDBMS deletes an index record from a B+ -tree simply by erasing it


from the leaf node containing it.

• This restricts Eq. (29) only to non-root internal nodes – leaf nodes are now permitted
to be even less than half full.

• This is because. . .

– It is possible to develop a deletion algorithm which merges less than half full
nodes together, but its details turn out to be intricate.
– Moreover, this deletion algorithm would not perform very well when Transactions
are executing it concurrently with other operations.
– The disk space saving would be unlikely to be worth these troubles, because
databases which have already grown big rarely get much smaller in the future
either. . .

• Another compromise is to delete a non-root leaf or internal node when it becomes


totally empty.

Nonunique B+ -Tree Indexes

• We have assumed until now that our B+ -tree index is unique:


Each key appears only once in its index records.

Figure 88: Splitting in a nonunique B+ -tree index. (Sciore, 2008)

• When we extend our B+ -trees to nonunique indexes, where the same key can appear
in many index records, Eq. (26) forces us to keep all the index records with the same
key in the same node .subtree too.
• Hence we must keep all of them in the same leaf node too.
• Figure 88 shows what this means when splitting nodes:
We may have to split them unevenly.

• Because there can be more index records with the same key than fit into one leaf
node, we must also allow a chain of overflow Block s in our leaf nodes, as in Figure 89.
• Hence a nonunique B+ -tree has 2 kinds of leaf nodes:
One-Block leaf nodes. They are as before, except that. . .
– the same key is allowed to repeat
– they can be less than half full.
Many-Block leaf nodes with an overflow chain, where
– all index records must share the same key.

Figure 89: Overflow chain in a nonunique B+ -tree index leaf. (Sciore, 2008)

Range Queries

• A range query asks for all the records where some Attribute falls into a specified
interval with a lower and upper limit.

• For instance, we can ask for all the STUDENTs of our university example whose
  names begin with ’b’:

      SELECT *
      FROM STUDENT
      WHERE 'b' <= SName
        AND SName < 'c'

  (Full SQL would offer a special “SName LIKE 'b%'” operation for such queries.)

  Lower limit is that the student’s name must be alphabetically at least ’b’.
  Upper limit is that it must be less than ’c’.

• Because a search tree like B+ -tree retains the order of its keys, a search tree based
index on this Attribute can answer such queries efficiently:

1 c = the first index record which satisfies the lower limit test;
2 while this current index record c exists
and its c .key satisfies the upper limit test
3 fetch the corresponding data record via c .RID;
4 report it as the next row of the query result;
5 move c to the next index record.

– Line 1 is quick with a slightly modified index lookup algorithm:


Look up the lower limit or the index record with the next larger key value.
– The node .next pointers are very handy on lines 1 and 5.

• The I/O cost of this algorithm is

      (the logarithmic cost of the lookup on line 1)
         + (the number of rows in the range) · (the unit cost of line 3).

• If we have built an extra (nonunique) B+ -tree index on the (non-key) SName
  Attribute of the STUDENT Table, then the RDBMS can. . .

  1. first find quickly the alphabetically first student whose name is ’b’ or greater,
     using our extra index, as on line 1,
  2. then move directly to the next index record on each iteration of line 5, and
  3. finally stop this while loop when the name of this next student becomes ’c’ or
     greater.
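The range-scan loop above can be sketched with java.util.TreeMap standing in for the ordered leaf level of a B+ -tree. This is an illustrative stand-in only: TreeMap keeps its keys sorted like the leaf chain does, but of course it is a RAM structure, and the key-to-int mapping playing the role of key → RID is invented here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A sketch of the range-query algorithm above, with a TreeMap as a
// stand-in for the ordered leaf level of a B+-tree (key -> "RID").
public class RangeScanSketch {
    public static List<Integer> rangeScan(TreeMap<String, Integer> index,
                                          String low, String high) {
        List<Integer> rids = new ArrayList<>();
        // line 1: jump to the first key satisfying the lower limit;
        // lines 2-5: follow the ordered keys until the upper limit fails
        for (Map.Entry<String, Integer> e
                 : index.subMap(low, true, high, false).entrySet())
            rids.add(e.getValue());  // line 3 would fetch the data record via this RID
        return rids;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> sname = new TreeMap<>();
        sname.put("amy", 1); sname.put("ben", 2);
        sname.put("bob", 3); sname.put("cal", 4);
        System.out.println(rangeScan(sname, "b", "c")); // [2, 3]: ben and bob
    }
}
```

A hash index could not support this loop at all: hashing scatters adjacent keys into unrelated buckets, so only an order-preserving structure such as a B+ -tree answers range predicates efficiently.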

Tree Locking (Weikum and Vossen, 2001, Chapter 9)

• Let us now consider B+ -trees in a concurrent setting, where many Transactions


want to use the same index at the same time.

• One correct solution (taken by SimpleDB) would be to use the same 2PL developed
  for Table file Blocks also for these index file Blocks:

    slock(node) when an algorithm reads the contents of a node,
    xlock(node) when it modifies them,

  and hold all these locks until this Transaction ends.

• But this would not be a very good solution for concurrency:

– Every tree operation starts at its root, so every Transaction needs the corre-
sponding lock(root).
– Then for instance a Transaction t which modifies an index takes xlock(root)
– which forces all later Transactions to wait until t ends before they can use
this index.

• This is why B+ -trees use their own specific non-2PL locking mechanisms, which
allow multiple Transactions to access different parts of the same tree at the same
time.

• Consider first a single lookup operation performed by some Transaction t.

– Because this operation does not modify the tree, t needs only slocks on the
nodes of tree.
– Transaction t needs an slock(node) where it currently is – so that no other
Transaction u can modify this node.
– To guarantee this, Transaction t must also take another slock(child ) of this
current node before t can descend into it on line 4.
– However, t can release(node) as soon as t has descended to its child .
This increases concurrency for the later Transactions u who want to use the
same tree.
– When this u walks the same path as t down the tree, u cannot go faster than t
because of these locks.
Hence this lookup operation by t will happen before the tree operation by u,
and these 2 operations remain serializable.

• This locking mechanism is called lock coupling because the locks of the node and
its child are considered together.
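The lookup-side lock coupling just described can be sketched with java.util.concurrent.locks. This is a single-threaded illustration of the hand-over-hand pattern only (the class name and the array of per-node locks standing for one root-to-leaf path are invented here, not SimpleDB code): the child's slock is always taken before the parent's is released, so a later Transaction can never overtake us on the same path.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A sketch of lock coupling during a lookup: hand-over-hand read locks
// down one root-to-leaf path, represented by an array of per-node locks.
public class LockCouplingSketch {
    public static void descend(ReentrantReadWriteLock[] path) {
        path[0].readLock().lock();                  // slock(root)
        for (int i = 1; i < path.length; i++) {
            path[i].readLock().lock();              // slock(child) first...
            path[i - 1].readLock().unlock();        // ...only then release(node)
        }
        path[path.length - 1].readLock().unlock();  // done with the leaf
    }

    public static void main(String[] args) {
        ReentrantReadWriteLock[] path = new ReentrantReadWriteLock[3];
        for (int i = 0; i < path.length; i++)
            path[i] = new ReentrantReadWriteLock();
        descend(path);
        // At most 2 locks were ever held at once, and none remain afterwards.
        for (ReentrantReadWriteLock l : path)
            System.out.println(l.getReadLockCount()); // 0 for every node
    }
}
```

Note the contrast with 2PL: here locks are released long before the operation (let alone the Transaction) ends, which is exactly why the extra high-level key-range locks discussed below are still needed for serializability.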

• This lock coupling becomes slightly more complicated for a single insertion opera-
tion.

– Now t needs xlocks on the nodes of the path it takes, because it will modify
some of those nodes.
– Moreover, since these modifications happen when the recursion returns back
along this path, it seems that Transaction t must hold these xlocks during the
whole operation. . .
– However, closer reading of the recursive insertion subroutine reveals that the
operation is done as soon as the first OK is returned:
Its caller will just return OK to its own caller, and so on.
  – Hence when the insertion subroutine is descending from its current node into
    its child, it can use the following mechanism:

        xlock(child);
        if child.last < how many keys fit into this kind of node
            release all the other xlocks taken during this operation
            (including xlock(node) in particular);

    This is because this child will always return OK, since it is not yet completely
    full.
  – In this way Transaction t releases its xlock on a node n already while it is
    descending into n’s children, as soon as it is certain that it will not modify n.
– Again, this serializes all the other later Transactions u to reach this part of
the tree which t might modify to execute only after t.

• However, lock coupling is not yet enough, because it considers only a single opera-
tion, but Transactions consist of many.

– Let our B+ -tree index contain 2 index records x and y located on the same
disk Block b.
  – Consider the following 2 concurrent Transactions:

        Transaction t:                  Transaction u:
        1 look up x;                    1 look up y;
        2 modify y based on x.          2 modify x based on y.

  – Lock coupling permits the following interleaved schedule:

    1 step 1 of t (taking slock(b) and releasing it);
    2 step 1 of u (taking slock(b) and releasing it);
    3 step 2 of t (taking xlock(b) and releasing it);
    4 step 2 of u (taking xlock(b) and releasing it).
– However, this schedule is not serializable:
∗ The final value of x is based on the initial value of y and vice versa.
∗ So which of these 2 Transactions t and u would execute first in any equiv-
alent serial schedule?

• The solution is to add another level of locking on top of the Page-level locks we
have had so far.

• When a Transaction t wants to perform a B+ -tree index operation, it. . .

– first takes a high-level lock on the key range associated with this operation.
∗ Intuitively, t takes a lock for a range of index records in the leaf nodes of
the tree.
∗ These high-level locks do obey 2PL, so t holds them until it ends.
– then performs the operation, using lock coupling on the low-level Page locks
for the disk file Block s where the affected index records are stored.

Then another Transaction u can perform a concurrent operation, if its key
range does not overlap the key range(s) locked by t – that is, if u operates on
different index records than t.

• In the previous unserializable interleaved schedule

1 step 1 of t takes
a high-level slock(x) which it keeps, and
a low-level slock(b) which it releases soon;
2 step 1 of u takes
a high-level slock(y) which it keeps, and
a low-level slock(b) which it releases soon;
3 step 2 of t tries to take a high-level xlock(y),
but must wait for u;
4 step 2 of u tries to take a high-level xlock(x),
but must wait for t;
5 the scheduler detects this deadlock

so either t or u executes completely first instead, as it should to ensure serializability.

• The key range corresponding to a high-level lock(k)

    begins at k itself, and
    extends until the next larger key l than k in the index, but does not include this l.

• This choice stems from range queries:

– Suppose that k is the key for the current row of a range query result set.
– Then this slock(k) range extends (almost) until the next l in this result set.

• Such a high-level lock(k) tells the other concurrently running Transactions that

“I have noticed that the part of this B+ -tree corresponding to this range
has no other keys than k, so if you are going to change that (by taking an
xlock within this range) then you must wait until I have done everything
that I am going to do first.”

This ensures serializability among them.

• But we do not know this l yet when we are taking this high-level lock(k) – so how
  can we compare it against the other high-level locks already taken?

  1. Instead, Transaction t simply starts executing its B+ -tree operation without
     taking its lock(k) yet!
     Low-level lock coupling handles concurrent Transactions during this step.
  2. This continues until t reaches a leaf node.
     Note that this does not modify the B+ -tree.
  3. The next key l for this key k can be found in this leaf.
     Or if k is the last key in this leaf, then follow its leaf.next link.
  4. Now t can finally ask “Can I now take my high-level lock(k) which I did not
     take in step 1?”
  5. If the lock table says. . .
     “yes”, then t completes its B+ -tree operation normally.
     “no”, then t must wait for its high-level lock(k) first:
        (a) First t releases all the low-level locks it has taken during step 1,
            so that other concurrent Transactions can access the B+ -tree freely
            while t waits.
        (b) Then t waits until it gets its high-level lock(k).
        (c) Finally t re-executes its B+ -tree operation from its beginning – but
            this time it does already have its high-level lock(k), so it does not
            have to ask for it again.

• Now we can state the high-level key range locking rules, which ensure serializability
  for Transactions performing many B+ -tree operations:

  Looking up a key k takes. . .
    – either slock(k) if k is in the B+ -tree,
    – or slock(the previous smaller key j than k in the B+ -tree) otherwise.
    – This j can be found in either the leaf node or among the internal keys
      which guided this lookup into it.
    – These are called previous key locking rules because of this j.
  Range queries
    1. start with such a lookup operation, and
    2. continue by repeatedly taking the slock for the next key in the B+ -tree.
  Insertion takes xlocks on both the newly inserted key k and its previous key j.
  Deletion takes the same locks as insertion, when performed in the easy but not
    very space-efficient way.

• RDBMSs often use this kind of 2-level locking not only for their B+ -tree indexes
  but also for their stored Tables.

– There the items with 2PL high-level locks are (not keys k but) the RIDs of
their stored Record s.
– Early-release low-level locks synchronize in turn access to the Pages where
these Record s are stored.

– This allows a Transaction to lock only some of the Record s in a Page, and the
other concurrently running Transactions can still access its other Record s.
– This increased concurrency is not possible, if we only have the Page locks (like
SimpleDB does).

SimpleDB source file simpledb/index/btree/BTreeIndex.java

• Here is the SimpleDB B+ -tree index implementation.

• It differs slightly from what we have discussed:

  – It splits a node already when it becomes full – it does not wait for it to
    become overfull.
  + This avoids having to program the support of the overflowing part of a node
    in RAM, but. . .
  − it also means that the nodes on disk can never fill a Block completely, because
    they always have at least one unused node.slot.

• It keeps the internal and leaf nodes of the same B+ -tree in 2 separate files.

• This allows it to treat each file as consisting of just one kind of Page.
package simpledb.index.btree;

import static java.sql.Types.INTEGER;
import simpledb.file.Block;
import simpledb.tx.Transaction;
import simpledb.record.*;
import simpledb.query.*;
import simpledb.index.Index;

/**
 * A B-tree implementation of the Index interface.
 * @author Edward Sciore
 */
public class BTreeIndex implements Index {
   private Transaction tx;
   private TableInfo dirTi, leafTi;
   private BTreeLeaf leaf = null;
   private Block rootblk;

   /**
    * Opens a B-tree index for the specified index.
    * The method determines the appropriate files
    * for the leaf and directory records,
    * creating them if they did not exist.
    * @param idxname the name of the index
    * @param leafsch the schema of the leaf index records
    * @param tx the calling transaction
    */
   public BTreeIndex(String idxname, Schema leafsch, Transaction tx) {
      this.tx = tx;
      // deal with the leaves
      String leaftbl = idxname + "leaf";
      leafTi = new TableInfo(leaftbl, leafsch);
      if (tx.size(leafTi.fileName()) == 0)
         tx.append(leafTi.fileName(), new BTPageFormatter(leafTi, -1));

      // deal with the directory
      Schema dirsch = new Schema();
      dirsch.add("block", leafsch);
      dirsch.add("dataval", leafsch);
      String dirtbl = idxname + "dir";
      dirTi = new TableInfo(dirtbl, dirsch);
      rootblk = new Block(dirTi.fileName(), 0);
      if (tx.size(dirTi.fileName()) == 0)
         // create new root block
         tx.append(dirTi.fileName(), new BTPageFormatter(dirTi, 0));
      BTreePage page = new BTreePage(rootblk, dirTi, tx);
      if (page.getNumRecs() == 0) {
         // insert initial directory entry
         int fldtype = dirsch.type("dataval");
         Constant minval = (fldtype == INTEGER) ?
            new IntConstant(Integer.MIN_VALUE) :
            new StringConstant("");
         page.insertDir(0, minval, 0);
      }
      page.close();
   }

   /**
    * Traverses the directory to find the leaf block corresponding
    * to the specified search key.
    * The method then opens a page for that leaf block, and
    * positions the page before the first record (if any)
    * having that search key.
    * The leaf page is kept open, for use by the methods next
    * and getDataRid.
    * @see simpledb.index.Index#beforeFirst(simpledb.query.Constant)
    */
   public void beforeFirst(Constant searchkey) {
      close();
      BTreeDir root = new BTreeDir(rootblk, dirTi, tx);
      int blknum = root.search(searchkey);
      root.close();
      Block leafblk = new Block(leafTi.fileName(), blknum);
      leaf = new BTreeLeaf(leafblk, leafTi, searchkey, tx);
   }

   /**
    * Moves to the next leaf record having the
    * previously-specified search key.
    * Returns false if there are no more such leaf records.
    * @see simpledb.index.Index#next()
    */
   public boolean next() {
      return leaf.next();
   }

   /**
    * Returns the dataRID value from the current leaf record.
    * @see simpledb.index.Index#getDataRid()
    */
   public RID getDataRid() {
      return leaf.getDataRid();
   }

   /**
    * Inserts the specified record into the index.
    * The method first traverses the directory to find
    * the appropriate leaf page; then it inserts
    * the record into the leaf.
    * If the insertion causes the leaf to split, then
    * the method calls insert on the root,
    * passing it the directory entry of the new leaf page.
    * If the root node splits, then makeNewRoot is called.
    * @see simpledb.index.Index#insert(simpledb.query.Constant, simpledb.record.RID)
    */
   public void insert(Constant dataval, RID datarid) {
      beforeFirst(dataval);
      DirEntry e = leaf.insert(datarid);
      leaf.close();
      if (e == null)
         return;
      BTreeDir root = new BTreeDir(rootblk, dirTi, tx);
      DirEntry e2 = root.insert(e);
      if (e2 != null)
         root.makeNewRoot(e2);
      root.close();
   }

   /**
    * Deletes the specified index record.
    * The method first traverses the directory to find
    * the leaf page containing that record; then it
    * deletes the record from the page.
    * @see simpledb.index.Index#delete(simpledb.query.Constant, simpledb.record.RID)
    */
   public void delete(Constant dataval, RID datarid) {
      beforeFirst(dataval);
      leaf.delete(datarid);
      leaf.close();
   }

   /**
    * Closes the index by closing its open leaf page,
    * if necessary.
    * @see simpledb.index.Index#close()
    */
   public void close() {
      if (leaf != null)
         leaf.close();
   }

   /**
    * Estimates the number of block accesses
    * required to find all index records having
    * a particular search key.
    * @param numblocks the number of blocks in the B-tree directory
    * @param rpb the number of index entries per block
    * @return the estimated traversal cost
∗/
public s t a t i c i n t s e a r c h C o s t ( i n t numblocks , i n t rpb ) {
return 1 + ( i n t ) ( Math . l o g ( numblocks ) / Math . l o g ( rpb ) ) ;
}
}
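The searchCost method estimates one leaf access plus the number of directory levels, i.e. 1 + ⌊log_rpb(numblocks)⌋. The same arithmetic can be checked in isolation with a minimal sketch (the class name is illustrative, not part of SimpleDB):

```java
public class BTreeCostDemo {
    // 1 leaf access + one access per directory level; the number of
    // levels grows logarithmically in the directory size, with the
    // entries-per-block count rpb as the logarithm base.
    public static int searchCost(int numblocks, int rpb) {
        return 1 + (int) (Math.log(numblocks) / Math.log(rpb));
    }

    public static void main(String[] args) {
        // A directory of 1,000,000 blocks with 500 entries per block
        // needs only 2 directory levels, so 3 block accesses in total.
        System.out.println(searchCost(1_000_000, 500)); // prints 3
    }
}
```

Even a very large index is thus traversed in a handful of block accesses, which is why this estimate is nearly constant in practice.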

SimpleDB source file simpledb/index/btree/BTreePage.java


package simpledb.index.btree;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.*;
import simpledb.file.Block;
import simpledb.record.*;
import simpledb.query.*;
import simpledb.tx.Transaction;

/**
 * B-tree directory and leaf pages have many commonalities:
 * in particular, their records are stored in sorted order,
 * and pages split when full.
 * A BTreePage object contains this common functionality.
 * @author Edward Sciore
 */
public class BTreePage {
   private Block currentblk;
   private TableInfo ti;
   private Transaction tx;
   private int slotsize;

   /**
    * Opens a page for the specified B-tree block.
    * @param currentblk a reference to the B-tree block
    * @param ti the metadata for the particular B-tree file
    * @param tx the calling transaction
    */
   public BTreePage(Block currentblk, TableInfo ti, Transaction tx) {
      this.currentblk = currentblk;
      this.ti = ti;
      this.tx = tx;
      slotsize = ti.recordLength();
      tx.pin(currentblk);
   }

   /**
    * Calculates the position where the first record having
    * the specified search key should be, then returns
    * the position before it.
    * @param searchkey the search key
    * @return the position before where the search key goes
    */
   public int findSlotBefore(Constant searchkey) {
      int slot = 0;
      while (slot < getNumRecs() && getDataVal(slot).compareTo(searchkey) < 0)
         slot++;
      return slot - 1;
   }

   /**
    * Closes the page by unpinning its buffer.
    */
   public void close() {
      if (currentblk != null)
         tx.unpin(currentblk);
      currentblk = null;
   }

   /**
    * Returns true if the block is full.
    * @return true if the block is full
    */
   public boolean isFull() {
      return slotpos(getNumRecs() + 1) >= BLOCK_SIZE;
   }

   /**
    * Splits the page at the specified position.
    * A new page is created, and the records of the page
    * starting at the split position are transferred to the new page.
    * @param splitpos the split position
    * @param flag the initial value of the flag field
    * @return the reference to the new block
    */
   public Block split(int splitpos, int flag) {
      Block newblk = appendNew(flag);
      BTreePage newpage = new BTreePage(newblk, ti, tx);
      transferRecs(splitpos, newpage);
      newpage.setFlag(flag);
      newpage.close();
      return newblk;
   }

   /**
    * Returns the dataval of the record at the specified slot.
    * @param slot the integer slot of an index record
    * @return the dataval of the record at that slot
    */
   public Constant getDataVal(int slot) {
      return getVal(slot, "dataval");
   }

   /**
    * Returns the value of the page's flag field.
    * @return the value of the page's flag field
    */
   public int getFlag() {
      return tx.getInt(currentblk, 0);
   }

   /**
    * Sets the page's flag field to the specified value.
    * @param val the new value of the page flag
    */
   public void setFlag(int val) {
      tx.setInt(currentblk, 0, val);
   }

   /**
    * Appends a new block to the end of the specified B-tree file,
    * having the specified flag value.
    * @param flag the initial value of the flag
    * @return a reference to the newly-created block
    */
   public Block appendNew(int flag) {
      return tx.append(ti.fileName(), new BTPageFormatter(ti, flag));
   }

   // Methods called only by BTreeDir

   /**
    * Returns the block number stored in the index record
    * at the specified slot.
    * @param slot the slot of an index record
    * @return the block number stored in that record
    */
   public int getChildNum(int slot) {
      return getInt(slot, "block");
   }

   /**
    * Inserts a directory entry at the specified slot.
    * @param slot the slot of an index record
    * @param val the dataval to be stored
    * @param blknum the block number to be stored
    */
   public void insertDir(int slot, Constant val, int blknum) {
      insert(slot);
      setVal(slot, "dataval", val);
      setInt(slot, "block", blknum);
   }

   // Methods called only by BTreeLeaf

   /**
    * Returns the dataRID value stored in the specified leaf index record.
    * @param slot the slot of the desired index record
    * @return the dataRID value stored at that slot
    */
   public RID getDataRid(int slot) {
      return new RID(getInt(slot, "block"), getInt(slot, "id"));
   }

   /**
    * Inserts a leaf index record at the specified slot.
    * @param slot the slot of the desired index record
    * @param val the new dataval
    * @param rid the new dataRID
    */
   public void insertLeaf(int slot, Constant val, RID rid) {
      insert(slot);
      setVal(slot, "dataval", val);
      setInt(slot, "block", rid.blockNumber());
      setInt(slot, "id", rid.id());
   }

   /**
    * Deletes the index record at the specified slot.
    * @param slot the slot of the deleted index record
    */
   public void delete(int slot) {
      for (int i = slot + 1; i < getNumRecs(); i++)
         copyRecord(i, i - 1);
      setNumRecs(getNumRecs() - 1);
   }

   /**
    * Returns the number of index records in this page.
    * @return the number of index records in this page
    */
   public int getNumRecs() {
      return tx.getInt(currentblk, INT_SIZE);
   }

   // Private methods

   private int getInt(int slot, String fldname) {
      int pos = fldpos(slot, fldname);
      return tx.getInt(currentblk, pos);
   }

   private String getString(int slot, String fldname) {
      int pos = fldpos(slot, fldname);
      return tx.getString(currentblk, pos);
   }

   private Constant getVal(int slot, String fldname) {
      int type = ti.schema().type(fldname);
      if (type == INTEGER)
         return new IntConstant(getInt(slot, fldname));
      else
         return new StringConstant(getString(slot, fldname));
   }

   private void setInt(int slot, String fldname, int val) {
      int pos = fldpos(slot, fldname);
      tx.setInt(currentblk, pos, val);
   }

   private void setString(int slot, String fldname, String val) {
      int pos = fldpos(slot, fldname);
      tx.setString(currentblk, pos, val);
   }

   private void setVal(int slot, String fldname, Constant val) {
      int type = ti.schema().type(fldname);
      if (type == INTEGER)
         setInt(slot, fldname, (Integer) val.asJavaVal());
      else
         setString(slot, fldname, (String) val.asJavaVal());
   }

   private void setNumRecs(int n) {
      tx.setInt(currentblk, INT_SIZE, n);
   }

   private void insert(int slot) {
      for (int i = getNumRecs(); i > slot; i--)
         copyRecord(i - 1, i);
      setNumRecs(getNumRecs() + 1);
   }

   private void copyRecord(int from, int to) {
      Schema sch = ti.schema();
      for (String fldname : sch.fields())
         setVal(to, fldname, getVal(from, fldname));
   }

   private void transferRecs(int slot, BTreePage dest) {
      int destslot = 0;
      while (slot < getNumRecs()) {
         dest.insert(destslot);
         Schema sch = ti.schema();
         for (String fldname : sch.fields())
            dest.setVal(destslot, fldname, getVal(slot, fldname));
         delete(slot);
         destslot++;
      }
   }

   private int fldpos(int slot, String fldname) {
      int offset = ti.offset(fldname);
      return slotpos(slot) + offset;
   }

   private int slotpos(int slot) {
      return INT_SIZE + INT_SIZE + (slot * slotsize);
   }
}
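The private method slotpos encodes the page layout: the flag and the record count each occupy INT_SIZE bytes at the front of the page, and fixed-length slots follow. A quick check of the offsets (assuming 4-byte integers, as SimpleDB's Page uses; the class name is illustrative):

```java
public class SlotPosDemo {
    static final int INT_SIZE = 4; // assumption: 4-byte integers

    // Byte offset of a slot: skip the flag and record-count header
    // integers, then advance past the preceding whole slots.
    static int slotpos(int slot, int slotsize) {
        return INT_SIZE + INT_SIZE + (slot * slotsize);
    }

    public static void main(String[] args) {
        // With 26-byte records, slot 0 starts right after the 8-byte header,
        // and slot 3 starts 3 * 26 bytes further on.
        System.out.println(slotpos(0, 26)); // prints 8
        System.out.println(slotpos(3, 26)); // prints 86
    }
}
```

The method isFull applies the same arithmetic: the page is full when slotpos(getNumRecs() + 1) would run past BLOCK_SIZE.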

SimpleDB source file simpledb/index/btree/BTPageFormatter.java


package simpledb.index.btree;

import static simpledb.file.Page.*;
import static java.sql.Types.INTEGER;
import simpledb.file.Page;
import simpledb.buffer.PageFormatter;
import simpledb.record.TableInfo;

/**
 * An object that can format a page to look like an
 * empty B-tree block.
 * @author Edward Sciore
 */
public class BTPageFormatter implements PageFormatter {
   private TableInfo ti;
   private int flag;

   /**
    * Creates a formatter for a new page of the
    * specified B-tree index.
    * @param ti the index's metadata
    * @param flag the page's initial flag value
    */
   public BTPageFormatter(TableInfo ti, int flag) {
      this.ti = ti;
      this.flag = flag;
   }

   /**
    * Formats the page by initializing as many index-record slots
    * as possible to have default values.
    * Each integer field is given a value of 0, and
    * each string field is given a value of "".
    * The location that indicates the number of records
    * in the page is also set to 0.
    * @see simpledb.buffer.PageFormatter#format(simpledb.file.Page)
    */
   public void format(Page page) {
      page.setInt(0, flag);
      page.setInt(INT_SIZE, 0);  // #records = 0
      int recsize = ti.recordLength();
      for (int pos = 2 * INT_SIZE; pos + recsize <= BLOCK_SIZE; pos += recsize)
         makeDefaultRecord(page, pos);
   }

   private void makeDefaultRecord(Page page, int pos) {
      for (String fldname : ti.schema().fields()) {
         int offset = ti.offset(fldname);
         if (ti.schema().type(fldname) == INTEGER)
            page.setInt(pos + offset, 0);
         else
            page.setString(pos + offset, "");
      }
   }
}

SimpleDB source file simpledb/index/btree/BTreeLeaf.java


package simpledb.index.btree;

import simpledb.file.Block;
import simpledb.tx.Transaction;
import simpledb.record.*;
import simpledb.query.Constant;

/**
 * An object that holds the contents of a B-tree leaf block.
 * @author Edward Sciore
 */
public class BTreeLeaf {
   private TableInfo ti;
   private Transaction tx;
   private Constant searchkey;
   private BTreePage contents;
   private int currentslot;

   /**
    * Opens a page to hold the specified leaf block.
    * The page is positioned immediately before the first record
    * having the specified search key (if any).
    * @param blk a reference to the disk block
    * @param ti the metadata of the B-tree leaf file
    * @param searchkey the search key value
    * @param tx the calling transaction
    */
   public BTreeLeaf(Block blk, TableInfo ti, Constant searchkey, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      this.searchkey = searchkey;
      contents = new BTreePage(blk, ti, tx);
      currentslot = contents.findSlotBefore(searchkey);
   }

   /**
    * Closes the leaf page.
    */
   public void close() {
      contents.close();
   }

   /**
    * Moves to the next leaf record having the
    * previously-specified search key.
    * Returns false if there are no more such records.
    * @return false if there are no more leaf records for the search key
    */
   public boolean next() {
      currentslot++;
      if (currentslot >= contents.getNumRecs())
         return tryOverflow();
      else if (contents.getDataVal(currentslot).equals(searchkey))
         return true;
      else
         return tryOverflow();
   }

   /**
    * Returns the dataRID value of the current leaf record.
    * @return the dataRID of the current record
    */
   public RID getDataRid() {
      return contents.getDataRid(currentslot);
   }

   /**
    * Deletes the leaf record having the specified dataRID.
    * @param datarid the dataRID whose record is to be deleted
    */
   public void delete(RID datarid) {
      while (next())
         if (getDataRid().equals(datarid)) {
            contents.delete(currentslot);
            return;
         }
   }

   /**
    * Inserts a new leaf record having the specified dataRID
    * and the previously-specified search key.
    * If the record does not fit in the page, then
    * the page splits and the method returns the
    * directory entry for the new page;
    * otherwise, the method returns null.
    * If all of the records in the page have the same dataval,
    * then the block does not split; instead, all but one of the
    * records are placed into an overflow block.
    * @param datarid the dataRID value of the new record
    * @return the directory entry of the newly-split page, if one exists.
    */
   public DirEntry insert(RID datarid) {
      currentslot++;
      contents.insertLeaf(currentslot, searchkey, datarid);
      if (!contents.isFull())
         return null;
      // else page is full, so split it
      Constant firstkey = contents.getDataVal(0);
      Constant lastkey = contents.getDataVal(contents.getNumRecs() - 1);
      if (lastkey.equals(firstkey)) {
         // create an overflow block to hold all but the first record
         Block newblk = contents.split(1, contents.getFlag());
         contents.setFlag(newblk.number());
         return null;
      }
      else {
         int splitpos = contents.getNumRecs() / 2;
         Constant splitkey = contents.getDataVal(splitpos);
         if (splitkey.equals(firstkey)) {
            // move right, looking for the next key
            while (contents.getDataVal(splitpos).equals(splitkey))
               splitpos++;
            splitkey = contents.getDataVal(splitpos);
         }
         else {
            // move left, looking for first entry having that key
            while (contents.getDataVal(splitpos - 1).equals(splitkey))
               splitpos--;
         }
         Block newblk = contents.split(splitpos, -1);
         return new DirEntry(splitkey, newblk.number());
      }
   }

   private boolean tryOverflow() {
      Constant firstkey = contents.getDataVal(0);
      int flag = contents.getFlag();
      if (!searchkey.equals(firstkey) || flag < 0)
         return false;
      contents.close();
      Block nextblk = new Block(ti.fileName(), flag);
      contents = new BTreePage(nextblk, ti, tx);
      currentslot = 0;
      return true;
   }
}

SimpleDB source file simpledb/index/btree/BTreeDir.java


package simpledb.index.btree;

import simpledb.file.Block;
import simpledb.tx.Transaction;
import simpledb.record.TableInfo;
import simpledb.query.Constant;

/**
 * A B-tree directory block.
 * @author Edward Sciore
 */
public class BTreeDir {
   private TableInfo ti;
   private Transaction tx;
   private String filename;
   private BTreePage contents;

   /**
    * Creates an object to hold the contents of the specified
    * B-tree block.
    * @param blk a reference to the specified B-tree block
    * @param ti the metadata of the B-tree directory file
    * @param tx the calling transaction
    */
   BTreeDir(Block blk, TableInfo ti, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      filename = blk.fileName();
      contents = new BTreePage(blk, ti, tx);
   }

   /**
    * Closes the directory page.
    */
   public void close() {
      contents.close();
   }

   /**
    * Returns the block number of the B-tree leaf block
    * that contains the specified search key.
    * @param searchkey the search key value
    * @return the block number of the leaf block containing that search key
    */
   public int search(Constant searchkey) {
      Block childblk = findChildBlock(searchkey);
      while (contents.getFlag() > 0) {
         contents.close();
         contents = new BTreePage(childblk, ti, tx);
         childblk = findChildBlock(searchkey);
      }
      return childblk.number();
   }

   /**
    * Creates a new root block for the B-tree.
    * The new root will have two children:
    * the old root, and the specified block.
    * Since the root must always be in block 0 of the file,
    * the contents of the old root will get transferred to a new block.
    * @param e the directory entry to be added as a child of the new root
    */
   public void makeNewRoot(DirEntry e) {
      Constant firstval = contents.getDataVal(0);
      int level = contents.getFlag();
      Block newblk = contents.split(0, level); // ie, transfer all the records
      DirEntry oldroot = new DirEntry(firstval, newblk.number());
      insertEntry(oldroot);
      insertEntry(e);
      contents.setFlag(level + 1);
   }

   /**
    * Inserts a new directory entry into the B-tree block.
    * If the block is at level 0, then the entry is inserted there.
    * Otherwise, the entry is inserted into the appropriate
    * child node, and the return value is examined.
    * A non-null return value indicates that the child node
    * split, and so the returned entry is inserted into
    * this block.
    * If this block splits, then the method similarly returns
    * the entry information of the new block to its caller;
    * otherwise, the method returns null.
    * @param e the directory entry to be inserted
    * @return the directory entry of the newly-split block, if one exists; otherwise, null
    */
   public DirEntry insert(DirEntry e) {
      if (contents.getFlag() == 0)
         return insertEntry(e);
      Block childblk = findChildBlock(e.dataVal());
      BTreeDir child = new BTreeDir(childblk, ti, tx);
      DirEntry myentry = child.insert(e);
      child.close();
      return (myentry != null) ? insertEntry(myentry) : null;
   }

   private DirEntry insertEntry(DirEntry e) {
      int newslot = 1 + contents.findSlotBefore(e.dataVal());
      contents.insertDir(newslot, e.dataVal(), e.blockNumber());
      if (!contents.isFull())
         return null;
      // else page is full, so split it
      int level = contents.getFlag();
      int splitpos = contents.getNumRecs() / 2;
      Constant splitval = contents.getDataVal(splitpos);
      Block newblk = contents.split(splitpos, level);
      return new DirEntry(splitval, newblk.number());
   }

   private Block findChildBlock(Constant searchkey) {
      int slot = contents.findSlotBefore(searchkey);
      if (contents.getDataVal(slot + 1).equals(searchkey))
         slot++;
      int blknum = contents.getChildNum(slot);
      return new Block(filename, blknum);
   }
}

SimpleDB source file simpledb/index/btree/DirEntry.java


package simpledb.index.btree;

import simpledb.query.Constant;

/**
 * A directory entry has two components: the number of the child block,
 * and the dataval of the first record in that block.
 * @author Edward Sciore
 */
public class DirEntry {
   private Constant dataval;
   private int blocknum;

   /**
    * Creates a new entry for the specified dataval and block number.
    * @param dataval the dataval
    * @param blocknum the block number
    */
   public DirEntry(Constant dataval, int blocknum) {
      this.dataval = dataval;
      this.blocknum = blocknum;
   }

   /**
    * Returns the dataval component of the entry.
    * @return the dataval component of the entry
    */
   public Constant dataVal() {
      return dataval;
   }

   /**
    * Returns the block number component of the entry.
    * @return the block number component of the entry
    */
   public int blockNumber() {
      return blocknum;
   }
}

5.3 Using an Index in a Relational Algebra Operation


(Sciore, 2008, Chapter 21.5)

• Certain Relational Algebra operations can be executed more efficiently if a suitable index is available.

• Consider first the operation select(T, A = c) where

  T is a stored Table
  A is an Attribute of T such that there is an index on T.A
  c is a constant.

• Its index-aware implementation is simply to look up the Record(s) r of T having r.A = c using this index.

• It is called an indexselect.
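The idea can be sketched independently of SimpleDB: with a HashMap standing in for the index on T.A, the selection becomes a single index probe instead of a full table scan (the names here are illustrative, not the SimpleDB API):

```java
import java.util.*;

public class IndexSelectSketch {
    // select(T, A = c): return the record identifiers of the rows
    // whose A-value equals c, found by one index probe.
    static List<Integer> indexSelect(Map<String, List<Integer>> indexOnA, String c) {
        return indexOnA.getOrDefault(c, List.of());
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = new HashMap<>();
        idx.put("math", List.of(3, 7));  // "math" rows live at RIDs 3 and 7
        idx.put("art", List.of(5));
        System.out.println(indexSelect(idx, "math")); // prints [3, 7]
    }
}
```

The probe costs the index traversal plus one data-block access per matching record, which is exactly the estimate that blocksAccessed computes below.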

SimpleDB source file simpledb/index/query/IndexSelectPlan.java

• Here is the SimpleDB Plan and Scan for this indexselect.

package simpledb.index.query;

import simpledb.tx.Transaction;
import simpledb.record.Schema;
import simpledb.metadata.IndexInfo;
import simpledb.query.*;
import simpledb.index.Index;

/** The Plan class corresponding to the <i>indexselect</i>
 * relational algebra operator.
 * @author Edward Sciore
 */
public class IndexSelectPlan implements Plan {
   private Plan p;
   private IndexInfo ii;
   private Constant val;

   /**
    * Creates a new indexselect node in the query tree
    * for the specified index and selection constant.
    * @param p the input table
    * @param ii information about the index
    * @param val the selection constant
    * @param tx the calling transaction
    */
   public IndexSelectPlan(Plan p, IndexInfo ii, Constant val, Transaction tx) {
      this.p = p;
      this.ii = ii;
      this.val = val;
   }

   /**
    * Creates a new indexselect scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      // throws an exception if p is not a table plan.
      TableScan ts = (TableScan) p.open();
      Index idx = ii.open();
      return new IndexSelectScan(idx, val, ts);
   }

   /**
    * Estimates the number of block accesses to compute the
    * index selection, which is the same as the
    * index traversal cost plus the number of matching data records.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return ii.blocksAccessed() + recordsOutput();
   }

   /**
    * Estimates the number of output records in the index selection,
    * which is the same as the number of search key values
    * for the index.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return ii.recordsOutput();
   }

   /**
    * Returns the distinct values as defined by the index.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return ii.distinctValues(fldname);
   }

   /**
    * Returns the schema of the data table.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return p.schema();
   }
}

SimpleDB source file simpledb/index/query/IndexSelectScan.java


package simpledb.index.query;

import simpledb.record.RID;
import simpledb.query.*;
import simpledb.index.Index;

/**
 * The scan class corresponding to the select relational
 * algebra operator.
 * @author Edward Sciore
 */
public class IndexSelectScan implements Scan {
   private Index idx;
   private Constant val;
   private TableScan ts;

   /**
    * Creates an index select scan for the specified
    * index and selection constant.
    * @param idx the index
    * @param val the selection constant
    */
   public IndexSelectScan(Index idx, Constant val, TableScan ts) {
      this.idx = idx;
      this.val = val;
      this.ts = ts;
      beforeFirst();
   }

   /**
    * Positions the scan before the first record,
    * which in this case means positioning the index
    * before the first instance of the selection constant.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      idx.beforeFirst(val);
   }

   /**
    * Moves to the next record, which in this case means
    * moving the index to the next record satisfying the
    * selection constant, and returning false if there are
    * no more such index records.
    * If there is a next record, the method moves the
    * tablescan to the corresponding data record.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      boolean ok = idx.next();
      if (ok) {
         RID rid = idx.getDataRid();
         ts.moveToRid(rid);
      }
      return ok;
   }

   /**
    * Closes the scan by closing the index and the tablescan.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      idx.close();
      ts.close();
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      return ts.getVal(fldname);
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      return ts.getInt(fldname);
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      return ts.getString(fldname);
   }

   /**
    * Returns whether the data record has the specified field.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return ts.hasField(fldname);
   }
}

• Consider then the operation join(E, T, A = B) where


E can be any Relational Algebra expression
T is a stored Table
A is an Attribute of E
B is an Attribute of T such that there is an index on T .B .
• One common case is when E .A is a foreign key referencing T .
That is, the expression E has an Attribute U .A which came from some stored
Table U with this foreign key.
• This permits the following index-aware implementation:

1 for each row r in E
2    use the index to find quickly the row(s) s of T with s.B = r.A;
3    for each such s
4       add the combination of r and s into the (initially empty) result.

• This is called an indexjoin.
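The loop above can be sketched with plain in-memory Java (this is not SimpleDB code): rows are maps, and a hypothetical HashMap from each B-value to the matching rows of T stands in for the index on T.B.

```java
import java.util.*;

// A minimal in-memory sketch of the indexjoin loop above.
public class IndexJoinSketch {
   public static List<Map<String,Object>> indexJoin(
         List<Map<String,Object>> e, String a,
         Map<Object,List<Map<String,Object>>> indexOnB) {
      List<Map<String,Object>> result = new ArrayList<>();
      for (Map<String,Object> r : e) {                      // 1: each row r in E
         List<Map<String,Object>> matches =
            indexOnB.getOrDefault(r.get(a), List.of());     // 2: index lookup, s.B = r.A
         for (Map<String,Object> s : matches) {             // 3: each such s
            Map<String,Object> combined = new HashMap<>(r); // 4: combine r and s
            combined.putAll(s);
            result.add(combined);
         }
      }
      return result;
   }
}
```

The point of the sketch is step 2: rows of T that cannot match r are never even touched, unlike in a product.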


• Figure 90 shows an example, where

(a) is the example SQL query
(b) implements the join in (a) as
       select(product(ENROLL, STUDENT), StudentId = SId)
    by the definition of join
(c) reimplements (b) as an indexjoin.

Figure 90: An example of a join using an index. (Sciore, 2008)

SimpleDB source file simpledb/index/query/IndexJoinPlan.java

• Here is the indexjoin Plan.

package simpledb.index.query;

import simpledb.tx.Transaction;
import simpledb.record.Schema;
import simpledb.metadata.IndexInfo;
import simpledb.query.*;
import simpledb.index.Index;

/** The Plan class corresponding to the <i>indexjoin</i>
 * relational algebra operator.
 * @author Edward Sciore
 */
public class IndexJoinPlan implements Plan {
   private Plan p1, p2;
   private IndexInfo ii;
   private String joinfield;
   private Schema sch = new Schema();

   /**
    * Implements the join operator,
    * using the specified LHS and RHS plans.
    * @param p1 the left-hand plan
    * @param p2 the right-hand plan
    * @param ii information about the right-hand index
    * @param joinfield the left-hand field used for joining
    * @param tx the calling transaction
    */
   public IndexJoinPlan(Plan p1, Plan p2, IndexInfo ii, String joinfield, Transaction tx) {
      this.p1 = p1;
      this.p2 = p2;
      this.ii = ii;
      this.joinfield = joinfield;
      sch.addAll(p1.schema());
      sch.addAll(p2.schema());
   }

   /**
    * Opens an indexjoin scan for this query
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p1.open();
      // throws an exception if p2 is not a table plan
      TableScan ts = (TableScan) p2.open();
      Index idx = ii.open();
      return new IndexJoinScan(s, idx, joinfield, ts);
   }

   /**
    * Estimates the number of block accesses to compute the join.
    * The formula is:
    * <pre> B(indexjoin(p1,p2,idx)) = B(p1) + R(p1)*B(idx)
    *       + R(indexjoin(p1,p2,idx)) </pre>
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p1.blocksAccessed()
         + (p1.recordsOutput() * ii.blocksAccessed())
         + recordsOutput();
   }

   /**
    * Estimates the number of output records in the join.
    * The formula is:
    * <pre> R(indexjoin(p1,p2,idx)) = R(p1)*R(idx) </pre>
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p1.recordsOutput() * ii.recordsOutput();
   }

   /**
    * Estimates the number of distinct values for the
    * specified field.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (p1.schema().hasField(fldname))
         return p1.distinctValues(fldname);
      else
         return p2.distinctValues(fldname);
   }

   /**
    * Returns the schema of the indexjoin.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return sch;
   }
}

SimpleDB source file simpledb/index/query/IndexJoinScan.java

• Here is the SimpleDB implementation of a pipelined indexjoin Scan.

package simpledb.index.query;

import simpledb.query.*;
import simpledb.index.Index;

/**
 * The scan class corresponding to the indexjoin relational
 * algebra operator.
 * The code is very similar to that of ProductScan,
 * which makes sense because an indexjoin is essentially
 * the product of each LHS record with the matching RHS index records.
 * @author Edward Sciore
 */
public class IndexJoinScan implements Scan {
   private Scan s;
   private TableScan ts;  // the data table
   private Index idx;
   private String joinfield;

   /**
    * Creates an indexjoin scan for the specified LHS scan and
    * RHS index.
    * @param s the LHS scan
    * @param idx the RHS index
    * @param joinfield the LHS field used for joining
    */
   public IndexJoinScan(Scan s, Index idx, String joinfield, TableScan ts) {
      this.s = s;
      this.idx = idx;
      this.joinfield = joinfield;
      this.ts = ts;
      beforeFirst();
   }

   /**
    * Positions the scan before the first record.
    * That is, the LHS scan will be positioned at its
    * first record, and the index will be positioned
    * before the first record for the join value.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      s.beforeFirst();
      s.next();
      resetIndex();
   }

   /**
    * Moves the scan to the next record.
    * The method moves to the next index record, if possible.
    * Otherwise, it moves to the next LHS record and the
    * first index record.
    * If there are no more LHS records, the method returns false.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      while (true) {
         if (idx.next()) {
            ts.moveToRid(idx.getDataRid());
            return true;
         }
         if (!s.next())
            return false;
         resetIndex();
      }
   }

   /**
    * Closes the scan by closing its LHS scan and its RHS index.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      s.close();
      idx.close();
      ts.close();
   }

   /**
    * Returns the Constant value of the specified field.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (ts.hasField(fldname))
         return ts.getVal(fldname);
      else
         return s.getVal(fldname);
   }

   /**
    * Returns the integer value of the specified field.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      if (ts.hasField(fldname))
         return ts.getInt(fldname);
      else
         return s.getInt(fldname);
   }

   /**
    * Returns the string value of the specified field.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      if (ts.hasField(fldname))
         return ts.getString(fldname);
      else
         return s.getString(fldname);
   }

   /** Returns true if the field is in the schema.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return ts.hasField(fldname) || s.hasField(fldname);
   }

   private void resetIndex() {
      Constant searchkey = s.getVal(joinfield);
      idx.beforeFirst(searchkey);
   }
}

5.4 Updating Indexed Data
(Sciore, 2008, Chapter 21.6)

• The Planner Component of the RDBMS must also be aware of the existing indexes.

• In particular, when the contents of a stored Table T are updated, it must also
change the indexes defined on T to reflect the update.

• It gets the information about these indexes on T from the Metadata.

SimpleDB source file simpledb/index/planner/IndexUpdatePlanner.java

• Here is the SimpleDB index-aware UpdatePlanner .

• To turn on indexing, this must be substituted for the previous index-unaware
  BasicUpdatePlanner.
package simpledb.index.planner;

import java.util.Iterator;
import java.util.Map;

import simpledb.record.RID;
import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.index.Index;
import simpledb.metadata.IndexInfo;
import simpledb.parse.*;
import simpledb.planner.*;
import simpledb.query.*;

/**
 * A modification of the basic update planner.
 * It dispatches each update statement to the corresponding
 * index planner.
 * @author Edward Sciore
 */
public class IndexUpdatePlanner implements UpdatePlanner {

   public int executeInsert(InsertData data, Transaction tx) {
      String tblname = data.tableName();
      Plan p = new TablePlan(tblname, tx);

      // first, insert the record
      UpdateScan s = (UpdateScan) p.open();
      s.insert();
      RID rid = s.getRid();

      // then modify each field, inserting an index record if appropriate
      Map<String,IndexInfo> indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);
      Iterator<Constant> valIter = data.vals().iterator();
      for (String fldname : data.fields()) {
         Constant val = valIter.next();
         s.setVal(fldname, val);

         IndexInfo ii = indexes.get(fldname);
         if (ii != null) {
            Index idx = ii.open();
            idx.insert(val, rid);
            idx.close();
         }
      }
      s.close();
      return 1;
   }

   public int executeDelete(DeleteData data, Transaction tx) {
      String tblname = data.tableName();
      Plan p = new TablePlan(tblname, tx);
      p = new SelectPlan(p, data.pred());
      Map<String,IndexInfo> indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);

      UpdateScan s = (UpdateScan) p.open();
      int count = 0;
      while (s.next()) {
         // first, delete the record's RID from every index
         RID rid = s.getRid();
         for (String fldname : indexes.keySet()) {
            Constant val = s.getVal(fldname);
            Index idx = indexes.get(fldname).open();
            idx.delete(val, rid);
            idx.close();
         }
         // then delete the record
         s.delete();
         count++;
      }
      s.close();
      return count;
   }

   public int executeModify(ModifyData data, Transaction tx) {
      String tblname = data.tableName();
      String fldname = data.targetField();
      Plan p = new TablePlan(tblname, tx);
      p = new SelectPlan(p, data.pred());

      IndexInfo ii = SimpleDB.mdMgr().getIndexInfo(tblname, tx).get(fldname);
      Index idx = (ii == null) ? null : ii.open();

      UpdateScan s = (UpdateScan) p.open();
      int count = 0;
      while (s.next()) {
         // first, update the record
         Constant newval = data.newValue().evaluate(s);
         Constant oldval = s.getVal(fldname);
         s.setVal(data.targetField(), newval);

         // then update the appropriate index, if it exists
         if (idx != null) {
            RID rid = s.getRid();
            idx.delete(oldval, rid);
            idx.insert(newval, rid);
         }
         count++;
      }
      if (idx != null) idx.close();
      s.close();
      return count;
   }

   public int executeCreateTable(CreateTableData data, Transaction tx) {
      SimpleDB.mdMgr().createTable(data.tableName(), data.newSchema(), tx);
      return 0;
   }

   public int executeCreateView(CreateViewData data, Transaction tx) {
      SimpleDB.mdMgr().createView(data.viewName(), data.viewDef(), tx);
      return 0;
   }

   public int executeCreateIndex(CreateIndexData data, Transaction tx) {
      SimpleDB.mdMgr().createIndex(data.indexName(), data.tableName(), data.fieldName(), tx);
      return 0;
   }
}

6 Query Optimization
(Sciore, 2008, Chapter 24)

• A good query optimizer is an essential part of every real-world RDBMS, because


otherwise its performance would not scale well when the amount of data grows.

• The overall approach has 2 stages:

¬ Start with the initial translation of the SQL query Q into a Relational Algebra
expression E.
– The purpose of E is to express in Relational Algebra what this query Q
means in SQL – intuitively, they both yield the same answer.
– Then we call them equivalent and denote this symbolically as E ≡ Q.
– However, this E would be much too slow to execute.
­ Therefore the Optimizer Component of the RDBMS first constructs another
Relational Algebra expression F which is much faster to execute than E but
still F ≡ E.

• More precisely:

Tables T ≡ U if no SQL query can tell them apart.


That is, T and U consist of the same rows, but they can appear in a different
order.

Queries F ≡ E if
the output of F ≡ the output of E
whenever they are executed on the same database contents.

• Each part of optimization is designed to preserve this ‘≡’ – so that its final result F
preserves the meaning of the original query Q.

• However, stage ­ is often split further into 2 phases:

¶ First E is optimized into a promising Plan P.


· Then this P is refined into the final Plan F.

• This separates different concerns in optimizer design:

¶ is concerned with the overall aspects of how to evaluate the query Q:


– It specifies what Relational Algebra operations to use, and in what order.
– It does not specify how each operation should be executed.
– For instance, in this phase the optimizer may decide to place a join
operation somewhere in the Plan.
· is in turn concerned with how each individual operation should be executed:
– For instance, in this phase the optimizer may decide that an indexjoin
would be a fast way to execute this Planned join operation.
– However, it makes these decisions locally, without thinking about the over-
all Plan any more.

• Figure 91 shows why query optimization is crucial (and not just nice):

(a) is the SQL query – which is quite realistic.


(b) is its initial translation E to Relational Algebra.
Executing E would cause 56 250 112 504 500 disk Block accesses with the
statistics of Figure 62 – far too many to be realistic!
(c) is one possible P ≡ E which phase ¶ might produce.
Executing P directly drops it to 139 500 Blocks – from millennia into minutes.
If ENROLL has an index on StudentId (which it might well have) then phase ·
can choose indexjoin as the implementation of its lower join.
Executing P this way takes only seconds.

• There are 2 approaches for stage ­:

Heuristic approach uses rules for finding F.


– Each such rule transforms the current P into another P′.
– They are designed so that P ≡ P′.
– They are “rules of thumb” such that their designer knows/believes that P′
is better than P.
– Conceptually, this approach is

Figure 91: Why optimize? (Sciore, 2008)

1 P = the initial translation E of Q;
2 repeat
3 ρ = choose a rule which applies into P;
4 P = apply ρ into P
5 until no ρ applies into P;
6 F = P.
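The rule-application loop above can be sketched in Java. The `Rule` interface and the string representation of a Plan below are hypothetical illustrations, not SimpleDB names.

```java
import java.util.List;
import java.util.Optional;

// A Rule transforms a Plan into an equivalent Plan, or reports "does not apply".
// A Plan is just a String here, purely for illustration.
interface Rule {
   Optional<String> apply(String p);
}

public class HeuristicLoop {
   // 1: P = initial translation E; 2-5: repeat until no rule applies; 6: F = P.
   public static String optimize(String e, List<Rule> rules) {
      String p = e;
      boolean changed = true;
      while (changed) {
         changed = false;
         for (Rule rho : rules) {            // 3: choose a rule which applies into P
            Optional<String> next = rho.apply(p);
            if (next.isPresent()) {
               p = next.get();               // 4: P = apply rho into P
               changed = true;
               break;
            }
         }
      }
      return p;                              // 6: the final F
   }
}
```

Note that the loop keeps only the single current P; it never backtracks, which is exactly the weakness discussed next.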
– This heuristic approach can compare different rules ρ′, ρ″, ρ‴, . . . in its
step 3, maybe by using cost information.
– However, it does not remember different Plans P′, P″, P‴, . . . – it just
improves the one current P.
– Hence it can end up with a bad final F by making
early choices which looked good at the time, but
later turned out to be bad, because they force it to make much worse
choices to complete F.
– Here splitting stage ­ lets each phase ¶ or · to have its own rules to
consider.
Cost-based approach uses cost estimates for finding F.
– B(s), R(s) and V(s, F ) are these estimates in SimpleDB.
– Conceptually, this approach remembers several Plans P′, P″, P‴, . . . so
that it can choose the one with the lowest cost as the final F.
– This is more tedious than the heuristic approach.
– This can avoid getting stuck with good-looking early choices that lead into
a bad final F by always keeping in mind several choices at the same time.
– Here splitting stage ­ lets phase ¶ to consider fewer Plans – otherwise
it must consider also all their implementations from phase · at the same
time.

• A practical RDBMS planner can combine these 2 approaches, and use

cost-based planning for those parts of E which are


crucial in the sense that making a wrong choice here would make the final F
much worse and
complex in the sense that the right choice is difficult to detect – if it was easy,
then we could write a heuristic rule to detect it instead!
An example is choosing the order in which joins should be performed.
heuristics for the other parts of E, so that the optimizer does not spend too much
effort in them, because it would not pay off.

• A practical RDBMS planner might not split stage ­ into phases ¶ and ·.

– Namely, this split loses information which could be useful in finding a good
final F.
– For instance, phase ¶ can decide to put a join somewhere – but does not know
yet what algorithm it will use, because phase · will decide that only later.
– But then phase ¶ cannot yet use the cost estimate of this still unknown join
algorithm.

– Instead, it must use some other estimate which applies to all joins – and this
is coarser.
– Moreover, the measure of this coarser estimate cannot be the number of Blocks,
because that would require knowing the particular algorithm – so phase ¶
cannot use the measure which we want the final F to optimize!
– We do not want to use B(select(product(T, U ), . . .)) because this is not how
we want to implement the join!

• One such coarser estimate is

cost(join(T, U )) = (number of rows in T ) + (number of rows in U ).    (30)

– The cost estimate of a whole Plan is the sum of these costs of all its joins.
– The intuition is that joins determine most of the performance of the whole
Plan, because they can. . .
∗ read their input Tables many times, and
∗ generate large output Tables from them.
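Eq. (30) and its summation over a Plan can be sketched as a tiny cost model. Everything below is an illustration: row counts are supplied directly, and the selectivity divisor `v` is a hypothetical stand-in for the usual estimates (in SimpleDB they would come from `recordsOutput()`).

```java
// A sketch of Eq. (30): each join contributes R(left) + R(right), and the
// cost of a whole Plan is the sum of these costs over all of its joins.
public class JoinCostSketch {
   interface Node { long rows(); long cost(); }

   // A stored table: known row count, contributes no join cost itself.
   record Table(long rows) implements Node {
      public long cost() { return 0; }
   }

   // join(left, right): Eq. (30) for this node, plus the joins below it.
   // Its output size is estimated as rows(left)*rows(right)/v, where v is
   // a hypothetical selectivity divisor.
   record Join(Node left, Node right, long v) implements Node {
      public long rows() { return left.rows() * right.rows() / v; }
      public long cost() {
         return left.rows() + right.rows()      // Eq. (30) for this join
              + left.cost() + right.cost();     // plus the joins below it
      }
   }
}
```

Because a join's output rows feed the next join's Eq. (30) term, a bad early join order inflates every cost above it – which is why the join order matters so much.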

• Selinger-style optimization avoids this loss of information by adopting a cost-based


approach without this split of stage ­ into phases ¶ and ·

– Such an optimizer is complicated to build, because it must build and compare


“almost final” Plans which also include choices of how each operation could be
implemented.
– But its final Plans F are very good, because it can use the precise cost estimates
of these algorithms, measured as the number of Blocks they use.
– Hence many optimizers in commercial RDBMSs have chosen this style.
– This style is named after the lead designer of the optimizer in IBM’s early and
influential System R.
– This System R evolved later into the DB2 product.

• We adopt this split and Eq. (30) as our cost measure.

6.1 Heuristic Optimization


(Sciore, 2008, Chapters 24.4.2–24.4.5)

• Phase ¶ rearranges the Relational Algebra expression tree by substituting one
subtree with another.

• These substitutions are such that

old subtree ≡ new subtree

to preserve the meaning of the whole tree.

• First we consider what kinds of tree transformations could be used in general.

• Later we consider heuristic rules which suggest how they should be used.

Figure 92: A group of products can be reordered freely. (Sciore, 2008)

• The product operation is both commutative and associative:

product(T1, T2) ≡ product(T2, T1) (31)


product(product(T1, T2), T3) ≡ product(T1, product(T2, T3)). (32)

• Figure 92 shows these Eqs. (31) and (32) as tree transformations.


• Together they mean that the optimizer can rearrange the products in the translation
of the SQL FROM part in any way that it wants to, like in Figure 93.

• Let us next consider selection nodes.


• They arose in the translation of the SQL WHERE part.
• One tree transformation is

select(T, p1 AND p2) ≡ select(select(T, p1), p2). (33)

• We have already used Eq. (33) implicitly in our basic translation of SQL into Rela-
tional Algebra.
• Figure 94 illustrates this Eq. (33).

• We can also swap the places of adjacent selections:

select(select(T, p1), p2) ≡ select(T, p1 AND p2)


≡ select(T, p2 AND p1)
≡ select(select(T, p2), p1).

Figure 93: Rearranging products freely. (Sciore, 2008)

Figure 94: Splitting one selection node into two. (Sciore, 2008)

Figure 95: Moving a selection past a product. (Sciore, 2008)

• Since the SQL translation placed the WHERE part with its selections on top of
the FROM part with its products, we need a transformation to rearrange them:

select(product(T1, T2), p) ≡ product(select(T1, p), T2) (34)

if the selection Predicate p does not mention any of the Attributes of T2 –
otherwise they would no longer be defined on the right-hand side!

• Figure 95 illustrates this transformation.

• We also have the symmetric case, where p applies only to T2 instead:

select(product(T1, T2), p) ≡ select(product(T2, T1), p)


≡ product(select(T2, p), T1)
≡ product(T1, select(T2, p)).

• Figure 96 shows how this transformation allows the optimizer to move a selection
as far down the tree as it will go.

• Figures 96 and 97 show the joint effect of Eqs. (33) and (34).

Figure 96: Pushing one selection down. (Sciore, 2008)

• Together, Eqs. (31)–(34) permit reorganizing the FROM and WHERE parts from
the SQL translation quite freely.

• This freedom lets the optimizer turn a select-product pair into a join with

join(T1, T2, p) ≡ select(product(T1, T2), p). (35)

• This is how the optimizer can find out what joins it should perform, even though
the user has written this information only implicitly into the FROM and WHERE
parts of the SQL query.

• Figure 99 shows an example of Eq. (35).

• However, we have not given transformations for the semi- and antijoins arising
from [NOT] IN. . . and EXISTS. . . subqueries in the WHERE part.

– In general, their 1st arguments behave similarly to products.


– For instance, Eq. (34) becomes

select(semijoin(T1, T2, q), p) ≡ semijoin(select(T1, p), T2, q)

where the extra condition is no longer needed, because p is already guaranteed


not to mention any of the Attributes of T2.
– We omit transformations for their 2nd arguments, and optimize each of them
as a separate subquery instead.
– Commercial RDBMSs can perform more elaborate transformations, for in-
stance merging a subquery into its enclosing query, so that they can be opti-
mized together.

• Consider finally projections.

Figure 97: Pushing selections past products. (Sciore, 2008)

Figure 98: Figure 97 continued. (Sciore, 2008)

Figure 99: Adding joins into Figure 98. (Sciore, 2008)

• SQL translation generates just one projection node on top of the Relational Algebra
expression, whose task is to output only those Attributes which the user asked for.

• The optimizer can

transform any node N into project(N, PN ) where the


Attributes PN contain all the Attributes of N which are mentioned (36)
on the path from the root of the whole tree into N
because the expression containing N uses only these Attributes PN from the result
of N but ignores all its other Attributes.

• Figure 100 shows an example with all possible projections.

• Analogous transformations like these are available also for other operations like
groupby, extend, union,. . . but we omit them here.

Heuristic 1 (selections down). Push selections as far down as possible.

• The intuition is that if we are going to drop a row from the result, then we should
do it as early as possible, before we have unnecessarily joined it with other rows.

• If all the Attributes in a selection Predicate φi come from the same stored Table T
then this select(. . . , φi ) lands just on top of T . . .

• . . . or on top of another such φj if T has several.

Figure 100: Adding projections into Figure 99. (Sciore, 2008)

• They can be recombined with Eq. (33) into

select(T, φ1 AND φ2 AND φ3 AND . . .)                                    (37)

where the conjunction φ1 AND φ2 AND φ3 AND . . . forms this ψT called the
selection Predicate of Table T .

Heuristic 2 (introduce joins). Convert each select-product pair into a join.

• The intuition is that the

former generates all combinations of rows from its 2 input Tables, and selects
some of them as its output, but the
latter can avoid generating the other combinations.

• Hence it is easy to see that performing joins instead of select-product pairs


is always a good idea – recall the dramatic improvement from millennia (b) to
minutes (c) in Figure 91.

• But it is not so easy to see what joins should be performed and when.

• However, this is perhaps the single most important question in query optimization!

Heuristic 3 (left-deep joins). Concentrate only on left-deep join trees.

• A single join node is left-deep, if its right subtree does not contain join nodes.

• Or if we want to consider also View s and other nested subqueries within the SQL
FROM part, then amend this into “. . . unless they came from the nested subquery”.

• Here we consider a product(left, right) to be a join(left, right, true).

• A tree of join nodes is left-deep, if all its join nodes are.

• Such a left-deep join tree. . .

1. begins with some Table T1 as the leftmost node


2. then joins another Table T2 to its right
3. then continues by joining a third Table T3 to the right of that result, and so
on. . .

(Here each Ti can have its ψTi or be a nested subquery.)

• Figures 101–102 show different shapes of join trees for the same query.

• However, Figure 104 shows that the best choice in Figures 101–102 would be (f)
which is not left-deep.

– The cost of each join is by Eq. (30).


– The number of rows output by Table T is estimated with R(select(T, ψT )).

• Many optimizers consider only left-deep join trees, even though they can lead to
worse Plans, because. . .

Figure 101: Different join tree shapes for the same query. (Sciore, 2008)

Figure 102: Figure 101 continued. (Sciore, 2008)

Figure 103: Figure 102 continued. (Sciore, 2008)

Figure 104: Costs of join trees in Figures 101–102. (Sciore, 2008)

– the best such Plan is usually not much worse – compare (d) to (f).
– the more general and difficult problem
“What is a good join tree?”
turns into the simpler but still difficult problem
“What is a good join order ?”
• A heuristic solution to this simpler problem consists of rules for deciding which
Table (or subquery) should. . .
1. start the left-deep join tree as its leftmost leaf T1 ?
2. be added to the current left-deep join tree as its next leaf Ti+1 to the right?
Heuristic 4 (start with the smallest Table). Start the join order with the Table having
the smallest output.
• The intuition of Heuristic 4 is to start with the smallest intermediate result, and
hope that this causes the intermediate results of later joins to stay small too.
• In Figure 101(a) this heuristic recommends starting the left-deep join tree with
COURSE, because ψCOURSE reduces the estimate of its output size to 12.5 Records.
Heuristic 5 (start with most restrictive). Start the join order with the Table T whose
selection predicate ψT is most restrictive.
• The intuition of Heuristic 5 is that ψT is most effective when it appears early in the
joins.
• The corresponding expression usually has the form

select(T, A1 = c1 AND A2 = c2 AND A3 = c3 AND . . .)

where the conjunction A1 = c1 AND A2 = c2 AND A3 = c3 AND . . . is ψT ,
A1 , A2 , A3 , . . . are Attributes of the stored table T , and c1 , c2 , c3 , . . . are
constants.

• This leads by Figure 68 to the estimate

R(T ) / (V(T, A1 ) · V(T, A2 ) · V(T, A3 ) · . . .)                      (38)

for the size of its result – maximize this denominator!

• In Figure 101(a) this heuristic recommends starting the left-deep join tree with
STUDENT instead of COURSE, because its output size reduction factor is
1/50 < 1/40 for COURSE.

• These 2 starting heuristics 4 and 5 can therefore have different opinions.

• The designer of a heuristic optimizer decides which one (s)he will include into the
optimizer.
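The two starting heuristics can be contrasted with a small sketch. The `Stats` record and all numbers below are hypothetical stand-ins for SimpleDB's R(T) and V(T, A) statistics; they only loosely mirror the lecture's example, where COURSE wins by output size and STUDENT by reduction factor.

```java
import java.util.Comparator;
import java.util.List;

// Contrasting Heuristic 4 (smallest output) and Heuristic 5 (most restrictive).
public class StartHeuristics {
   record Stats(String name, double rows, List<Double> distinctValues) {
      // Heuristic 5 ranks by this Eq. (38) factor 1/(V(T,A1)*V(T,A2)*...)
      double reductionFactor() {
         double d = 1.0;
         for (double v : distinctValues) d *= v;
         return 1.0 / d;
      }
      // Heuristic 4 ranks by the Eq. (38) output estimate R(T)/(V(T,A1)*...)
      double outputSize() { return rows * reductionFactor(); }
   }

   // Heuristic 4: start with the smallest estimated output.
   static Stats smallestOutput(List<Stats> tables) {
      return tables.stream().min(Comparator.comparingDouble(Stats::outputSize)).get();
   }

   // Heuristic 5: start with the most restrictive selection predicate.
   static Stats mostRestrictive(List<Stats> tables) {
      return tables.stream().min(Comparator.comparingDouble(Stats::reductionFactor)).get();
   }
}
```

A small table with a weak predicate can beat a huge table with a strong one under Heuristic 4 and lose under Heuristic 5, which is exactly the disagreement described above.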

Heuristic 6 (avoid products). Choose the next Table in the join order so that it can
be connected to the preceding join order with an actual join if possible.

• That is, try to choose the next Table N so that there is some selection Predicate φ
which compares Attributes of N to Attributes in the preceding join order.

• Then we get

join(the preceding join order, N, φ)

with a nontrivial Predicate φ instead of

product(the preceding join order, N )

which would be joining with the trivial Predicate

φ = true.

• Its intuition is to. . .

– avoid the costly products if at all possible, and


– if they cannot be avoided, then place them as high in the tree as possible,
hoping that their inputs would have become small by then.
– They can be avoided in most queries, because usually a query does not ask for
all combinations of rows with no further connection between them.

• In Figure 101, this Heuristic 6 determines the rest of the join order, once its starting
Table has been chosen by Heuristic 4 or 5. Starting with. . .

COURSE leads into tree (d).


STUDENT leads into tree (b) instead.

• Let us then turn to heuristics for phase ·, which selects implementations for the
nodes of the plan P produced by phase ¶.

• Phase · starts at the leaves and works towards the root:

– This way it has already chosen algorithms for the children of its current node C.
– Then it can choose the algorithm for C based on their actual costs.

Heuristic 7 (use an index). Implement a select operation with the indexselect algo-
rithm whenever possible.

• The intuition is that if a stored Table T does have a suitable index, then use it.

• Note that if phase 1 has produced

select(T, A1 = c1 AND A2 = c2 AND A3 = c3 AND . . .)

and T has an index on A1 , then phase 2 must first use Eq. (33) to get

select(select(T, A1 = c1 ), A2 = c2 AND A3 = c3 AND . . .)

where the inner select(T, A1 = c1 ) is the indexselect.

• If T has many indexes Ai , then choose the one with the largest V(T, Ai ) by Eq. (38).
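This choice of the most selective index can be sketched as follows. The attribute names and V(T, A) estimates in the map, and the method name bestIndex, are hypothetical illustrations, not part of SimpleDB:

```java
import java.util.Map;

public class IndexChoice {
    // Returns the indexed attribute with the largest V(T, A) estimate,
    // since an equality predicate on it excludes the most rows.
    public static String bestIndex(Map<String, Integer> distinctValues) {
        return distinctValues.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        // Hypothetical V(T, A) values for three indexed attributes.
        Map<String, Integer> v = Map.of("SId", 900, "MajorId", 40, "GradYear", 50);
        System.out.println(bestIndex(v)); // prints SId
    }
}
```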

Heuristic 8 (how to join). Implement join with. . .


1. indexjoin if possible, or

2. hashjoin if one of its input Tables is small, or

3. mergejoin otherwise.

• We have considered the indexjoin algorithm in section 5.3.

• The hashjoin algorithm is in turn based on the insight that if we build in

join(T, U, T .A = U .B )

a hash table for Table U on its join Attribute U .B , then all the rows s of U to be
joined with a row r of T can be found in the bucket for key r .A.

– That is, hashing can be used to exclude the other rows s′ of U which are not
joined with r.
– This lets us split T and U into smaller bucket files, which can be joined
recursively.
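The build-and-probe idea above can be sketched for in-memory rows as follows. The row layout (string arrays) and the hashJoin signature are illustrative assumptions, not SimpleDB's hashjoin implementation:

```java
import java.util.*;

public class HashJoinSketch {
    // Joins t and u on t-row[aIdx] == u-row[bIdx]; each row is a string array.
    public static List<String[]> hashJoin(List<String[]> t, int aIdx,
                                          List<String[]> u, int bIdx) {
        // Build phase: bucket the rows of U by their join attribute U.B.
        Map<String, List<String[]>> buckets = new HashMap<>();
        for (String[] s : u)
            buckets.computeIfAbsent(s[bIdx], k -> new ArrayList<>()).add(s);

        // Probe phase: a row r of T only meets the bucket for key r.A,
        // so the rows s' of U with a different B value are never compared.
        List<String[]> result = new ArrayList<>();
        for (String[] r : t)
            for (String[] s : buckets.getOrDefault(r[aIdx], List.of())) {
                String[] joined = new String[r.length + s.length];
                System.arraycopy(r, 0, joined, 0, r.length);
                System.arraycopy(s, 0, joined, r.length, s.length);
                result.add(joined);
            }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> t = List.of(new String[]{"1", "Ann"}, new String[]{"2", "Bob"});
        List<String[]> u = List.of(new String[]{"db", "1"}, new String[]{"os", "1"});
        System.out.println(hashJoin(t, 0, u, 1).size()); // prints 2
    }
}
```

The real algorithm applies the same build-and-probe step to bucket files on disk, recursing when a bucket is still too large for memory.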

• The mergejoin algorithm is in turn based on the insight that if we first sort each
input Table on its join Predicate (that is, sort Table T on T .A and Table U
on U .B ) then

– the rows s of U to be joined with a row r of T appear as one consecutive
segment of the sorted U – because they all have s .B = r .A.

– the segment for the next row r′ of T is not far – because also T is sorted into
the same order as U .
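The sorted-segment insight can be sketched for integer join keys as follows. The method mergeJoin and its signature are illustrative, not SimpleDB's mergejoin implementation:

```java
import java.util.*;

public class MergeJoinSketch {
    // Joins the key columns of T and U after sorting both; returns (A, B) pairs.
    public static List<int[]> mergeJoin(int[] tKeys, int[] uKeys) {
        int[] a = tKeys.clone(), b = uKeys.clone();
        Arrays.sort(a);  // sort T on T.A
        Arrays.sort(b);  // sort U on U.B
        List<int[]> result = new ArrayList<>();
        int j = 0;
        for (int i = 0; i < a.length; i++) {
            // Since both inputs are sorted, the matching segment for the
            // next key of T starts at or after position j; skip below it.
            while (j < b.length && b[j] < a[i])
                j++;
            // Emit the consecutive segment of U whose keys equal a[i].
            for (int k = j; k < b.length && b[k] == a[i]; k++)
                result.add(new int[] { a[i], b[k] });
        }
        return result;
    }

    public static void main(String[] args) {
        int[] t = { 3, 1, 2 };
        int[] u = { 2, 2, 3, 5 };
        System.out.println(mergeJoin(t, u).size()); // prints 3: (2,2), (2,2), (3,3)
    }
}
```

Note that the probe index j never moves past an equal segment, so duplicate keys in T rescan the same segment of U, as the consecutive-segment argument requires.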

• These hash- and mergejoins are examples of operations M which must materialize
their input (by hashing or sorting).

Figure 105: Adding projections to Figure 91(c). (Sciore, 2008)

Heuristic 9 (waste no material). If the implementation chosen for a node M materializes
its input N , then transform N first by (36).

• Its intuition is that then M no longer has to store those Attributes of N which are
no longer needed.

• Figure 105 shows these projections added to the materialized arguments of the
topmost join.

SimpleDB source file simpledb/opt/HeuristicQueryPlanner.java

• Here is the SimpleDB optimizing Planner for its simple queries.

• It implements the heuristic design of phase 1 explained here.

• Its Step 2 implements Heuristic 4.

• Its Step 3 implements Heuristic 6.

• Whenever many different next Tables could be added into the current left-deep join
tree, this Planner makes a greedy choice:

Choose the Table which produces the next tree T whose R(T ) is smallest.

The TablePlanner produces these alternative trees T .

• This heuristic Planner uses cost information in this way to choose among the
possibilities permitted by its rules.

package simpledb.opt;

import simpledb.tx.Transaction;
import simpledb.query.*;
import simpledb.opt.TablePlanner;
import simpledb.parse.QueryData;
import simpledb.planner.QueryPlanner;
import java.util.*;

/**
 * A query planner that optimizes using a heuristic-based algorithm.
 * @author Edward Sciore
 */
public class HeuristicQueryPlanner implements QueryPlanner {
   private Collection<TablePlanner> tableplanners = new ArrayList<TablePlanner>();

   /**
    * Creates an optimized left-deep query plan using the following
    * heuristics.
    * H1. Choose the smallest table (considering selection predicates)
    *     to be first in the join order.
    * H2. Add the table to the join order which
    *     results in the smallest output.
    */
   public Plan createPlan(QueryData data, Transaction tx) {

      // Step 1: Create a TablePlanner object for each mentioned table
      for (String tblname : data.tables()) {
         TablePlanner tp = new TablePlanner(tblname, data.pred(), tx);
         tableplanners.add(tp);
      }

      // Step 2: Choose the lowest-size plan to begin the join order
      Plan currentplan = getLowestSelectPlan();

      // Step 3: Repeatedly add a plan to the join order
      while (!tableplanners.isEmpty()) {
         Plan p = getLowestJoinPlan(currentplan);
         if (p != null)
            currentplan = p;
         else  // no applicable join
            currentplan = getLowestProductPlan(currentplan);
      }

      // Step 4. Project on the field names and return
      return new ProjectPlan(currentplan, data.fields());
   }

   private Plan getLowestSelectPlan() {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeSelectPlan();
         if (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput()) {
            besttp = tp;
            bestplan = plan;
         }
      }
      tableplanners.remove(besttp);
      return bestplan;
   }

   private Plan getLowestJoinPlan(Plan current) {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeJoinPlan(current);
         if (plan != null && (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput())) {
            besttp = tp;
            bestplan = plan;
         }
      }
      if (bestplan != null)
         tableplanners.remove(besttp);
      return bestplan;
   }

   private Plan getLowestProductPlan(Plan current) {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeProductPlan(current);
         if (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput()) {
            besttp = tp;
            bestplan = plan;
         }
      }
      tableplanners.remove(besttp);
      return bestplan;
   }
}

SimpleDB source file simpledb/opt/TablePlanner.java


• Here is the implementation of phase 2, which uses heuristics to determine how this
Table can be joined into the current left-deep join tree.

package simpledb.opt;

import simpledb.tx.Transaction;
import simpledb.record.Schema;
import simpledb.query.*;
import simpledb.index.query.*;
import simpledb.metadata.IndexInfo;
import simpledb.multibuffer.MultiBufferProductPlan;
import simpledb.server.SimpleDB;
import java.util.Map;

/**
 * This class contains methods for planning a single table.
 * @author Edward Sciore
 */
class TablePlanner {
   private TablePlan myplan;
   private Predicate mypred;
   private Schema myschema;
   private Map<String,IndexInfo> indexes;
   private Transaction tx;

   /**
    * Creates a new table planner.
    * The specified predicate applies to the entire query.
    * The table planner is responsible for determining
    * which portion of the predicate is useful to the table,
    * and when indexes are useful.
    * @param tblname the name of the table
    * @param mypred the query predicate
    * @param tx the calling transaction
    */
   public TablePlanner(String tblname, Predicate mypred, Transaction tx) {
      this.mypred = mypred;
      this.tx = tx;
      myplan = new TablePlan(tblname, tx);
      myschema = myplan.schema();
      indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);
   }

   /**
    * Constructs a select plan for the table.
    * The plan will use an indexselect, if possible.
    * @return a select plan for the table.
    */
   public Plan makeSelectPlan() {
      Plan p = makeIndexSelect();
      if (p == null)
         p = myplan;
      return addSelectPred(p);
   }

   /**
    * Constructs a join plan of the specified plan
    * and the table. The plan will use an indexjoin, if possible.
    * (Which means that if an indexselect is also possible,
    * the indexjoin operator takes precedence.)
    * The method returns null if no join is possible.
    * @param current the specified plan
    * @return a join plan of the plan and this table
    */
   public Plan makeJoinPlan(Plan current) {
      Schema currsch = current.schema();
      Predicate joinpred = mypred.joinPred(myschema, currsch);
      if (joinpred == null)
         return null;
      Plan p = makeIndexJoin(current, currsch);
      if (p == null)
         p = makeProductJoin(current, currsch);
      return p;
   }

   /**
    * Constructs a product plan of the specified plan and
    * this table.
    * @param current the specified plan
    * @return a product plan of the specified plan and this table
    */
   public Plan makeProductPlan(Plan current) {
      Plan p = addSelectPred(myplan);
      return new MultiBufferProductPlan(current, p, tx);
   }

   private Plan makeIndexSelect() {
      for (String fldname : indexes.keySet()) {
         Constant val = mypred.equatesWithConstant(fldname);
         if (val != null) {
            IndexInfo ii = indexes.get(fldname);
            return new IndexSelectPlan(myplan, ii, val, tx);
         }
      }
      return null;
   }

   private Plan makeIndexJoin(Plan current, Schema currsch) {
      for (String fldname : indexes.keySet()) {
         String outerfield = mypred.equatesWithField(fldname);
         if (outerfield != null && currsch.hasField(outerfield)) {
            IndexInfo ii = indexes.get(fldname);
            Plan p = new IndexJoinPlan(current, myplan, ii, outerfield, tx);
            p = addSelectPred(p);
            return addJoinPred(p, currsch);
         }
      }
      return null;
   }

   private Plan makeProductJoin(Plan current, Schema currsch) {
      Plan p = makeProductPlan(current);
      return addJoinPred(p, currsch);
   }

   private Plan addSelectPred(Plan p) {
      Predicate selectpred = mypred.selectPred(myschema);
      if (selectpred != null)
         return new SelectPlan(p, selectpred);
      else
         return p;
   }

   private Plan addJoinPred(Plan p, Schema currsch) {
      Predicate joinpred = mypred.joinPred(currsch, myschema);
      if (joinpred != null)
         return new SelectPlan(p, joinpred);
      else
         return p;
   }
}

6.2 On Cost-Based Optimization


(Sciore, 2008, Chapters 24.4.6 and 24.6)

• Consider finally as an example of cost-based optimization how it could construct a
left-deep join order.

• The idea is that the lowest-cost left-deep join tree of the 4 Tables T1 , T2 , T3 , T4 is
one of these:

– the lowest-cost left-deep join tree for T2 , T3 , T4 , joined with T1

– the lowest-cost left-deep join tree for T1 , T3 , T4 , joined with T2

– the lowest-cost left-deep join tree for T1 , T2 , T4 , joined with T3

– the lowest-cost left-deep join tree for T1 , T2 , T3 , joined with T4 .

• This same idea can be used also for finding these 3-Table subtrees.

– It leads in turn into considering all such 2-Table subtrees.

– Moreover, the same 2-Table subtree will be used in many 4-Table trees.
For instance, T3 , T4 appears in the 1st and 2nd 4-Table trees.

• Dynamic programming is a general algorithm design technique for situations like
these:

– The problem is to optimize some global goodness measure – here the
lowest-cost join order.

– Its optimized solution builds on optimal solutions of smaller but otherwise
identical subproblems – here the lowest-cost join orders with all the other
Tables except 1.

– Small enough subproblems can be solved directly – here the lowest-cost join
orders with just 1 Table.

• Here this technique leads into calculating the array lowest[S] whose indexes S are
sets of Tables.

lowest[S].order = the lowest-cost join order for these Tables S
lowest[S].cost = that cost
lowest[S].size = the number of Record s in its output.

• The best solution will appear finally into lowest[all Tables].

• Its initialization considers each stored Table T to join:

lowest[{T }].order = T alone
lowest[{T }].size = R(select(T, ψT )) = lowest[{T }].cost

in preparation for Eq. (30).

• For larger |S| > 1, lowest[S] is calculated as follows:

– Consider each T ∈ S in turn.


– The necessary information about the best join order for the other Tables of S
except this T has already been gathered into the 3 fields of lowest[S \ {T }].
– Based on this information, this T suggests these candidates for the 3 fields:

order = lowest[S \ {T }].order followed by T
cost = lowest[S \ {T }].size + lowest[{T }].size via Eq. (30)
size = R(any tree corresponding to this order ).

– Choose these candidates for a T whose cost was lowest.
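A toy version of this lowest[S] calculation can be sketched with bitmask-encoded subsets. The cost and size formulas below are simplified stand-ins for Eq. (30) and the R(...) estimates, so the class and its numbers are illustrations, not the actual cost model of the notes:

```java
import java.util.*;

public class JoinOrderDP {
    // Computes lowest[all Tables].order for tables given by their row counts.
    // Each subset S is encoded as a bitmask over the table indices 0..n-1.
    public static List<Integer> lowestOrder(int[] sizes) {
        int n = sizes.length;
        double[] cost = new double[1 << n];
        double[] size = new double[1 << n];
        List<List<Integer>> order = new ArrayList<>();
        for (int s = 0; s < (1 << n); s++)
            order.add(null);

        // Initialization: join orders with just one table.
        for (int t = 0; t < n; t++) {
            cost[1 << t] = sizes[t];
            size[1 << t] = sizes[t];
            order.set(1 << t, new ArrayList<>(List.of(t)));
        }

        // Larger subsets: try each T in S as the last table joined. The
        // subproblem S \ {T} has a smaller bitmask, so it is already solved.
        for (int s = 1; s < (1 << n); s++) {
            if (Integer.bitCount(s) < 2)
                continue;
            cost[s] = Double.POSITIVE_INFINITY;
            for (int t = 0; t < n; t++) {
                if ((s & (1 << t)) == 0)
                    continue;
                int rest = s & ~(1 << t);
                double c = cost[rest] + size[rest] + sizes[t]; // toy join cost
                if (c < cost[s]) {
                    cost[s] = c;
                    size[s] = size[rest] * sizes[t] / 10.0;    // toy output estimate
                    List<Integer> o = new ArrayList<>(order.get(rest));
                    o.add(t);
                    order.set(s, o);
                }
            }
        }
        return order.get((1 << n) - 1);  // lowest[all Tables].order
    }

    public static void main(String[] args) {
        // Four tables with 50, 40, 1000 and 200 rows; under this toy model
        // the cheapest left-deep order begins with the small tables.
        System.out.println(lowestOrder(new int[] { 50, 40, 1000, 200 }));
    }
}
```

Note that the outer loop visits bitmasks in increasing numeric order, which suffices because every subset S \ {T} is numerically smaller than S.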

• Figures 106 and 107 show how this lowest array is calculated for Figures 101–103.

• Note how it remembers the best solutions to smaller subproblems in order to solve
larger subproblems.

• The recommendation for the join order turns out to be (d).

Figure 106: An example lowest array. (Sciore, 2008)

Figure 107: Figure 106 continued. (Sciore, 2008)

