Tietokannanhallintajärjestelmät
Database Management Systems
Matti Nykänen
School of Computing, University of Eastern Finland
e-mail: [email protected]
Academic year 2011-12, IV quarter
Contents
1 Introduction
4.7.2 Update Scans
4.7.3 Plans
4.7.4 Predicates
4.8 Parsing SQL Statements
4.9 Query Execution Planner
4.10 The Remote Database Server
5 Indexing
5.1 Extendable Hashing
5.2 B+-trees
5.3 Using an Index in a Relational Algebra Operation
5.4 Updating Indexed Data
References
Thomas M. Connolly and Carolyn E. Begg. Database Systems: A Practical Approach to
Design, Implementation, and Management. Addison Wesley, fifth edition, 2010.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms. The MIT Press, third edition, 2009.
Ramez Elmasri and Shamkant B. Navathe. Database Systems: Models, Languages, Design,
and Application Programming. Pearson, sixth edition, 2011.
John R. Levine, Tony Mason, and Doug Brown. Lex & Yacc. O’Reilly, second edition,
1992.
Simon Peyton Jones. Beautiful concurrency. In Andy Oram and Greg Wilson, editors,
Beautiful Code, chapter 24, pages 385–406. O’Reilly, 2007.
Peter Sestoft. Java Precisely. The MIT Press, second edition, 2005.
1 Introduction
• The main questions of this course are:
– What features must a Relational Database Management System (RDBMS)
have?
– How can these features be implemented in an RDBMS?
• This course discusses general design principles, not the vendor-specific design issues
of a particular RDBMS such as MySQL, Oracle, Microsoft Access, . . .
Figure 1: The Class Diagram for the University Database. (Sciore, 2008)
Then the rest of this course discusses how the RDBMS can do these things.
2.1 Tables
(Sciore, 2008, Chapters 2.1–2.2)
• The central feature of the relational data model is to organize data into tables.
• Moreover, the result of a query in the relational data model is always another table,
built from the stored tables.
• Each table has its own collection of columns called its attributes.
Figure 2: The Schema for the University Database. (Sciore, 2008)
• The collection of all the schemas of all the tables in a database is also called the
schema of this database.
(Some texts use the correct but old-fashioned plural “schemata”.)
• Figure 2 shows the schema for our example database, where each table schema is in
the form
Requirement 1 (data and metadata). The RDBMS must be able to maintain both the
data itself and its metadata.
Rows
• A table contains zero or more rows.
• Each row r of table T representing some specific individual x then tells what is
stored in the database about x with respect to the attributes a1 , a2 , a3 , . . . , an of T ,
so that the value r.ai on column ai of row r tells what x “is like” with respect to ai .
• Intuitively, such a row r says that “there is some individual x of kind T whose a1
is r.a1, its a2 is r.a2, its a3 is r.a3, and . . . and its an is r.an”.
• Figure 3 shows our example tables with some example rows. For instance, the first
row of STUDENT says that “there is some student whose student ID number
(“opiskelijanumero” in Finnish) is 1, whose name is Joe, whose graduation year
is 2004, and whose major subject is computer science”.
• The rows within a table are unordered. When we said that “the first row of
STUDENT” is Joe’s, we meant that the STUDENT rows were shown in a particular
order, here by student ID.
Requirement 2 (no order). The RDBMS must be able to sort the rows before showing
them to the user. The user can determine the order in which (s)he wants to see them.
However, row ordering must not affect anything other than the output.
Null Values
• However, some attribute values may become known only later. For instance:
What is the value of a student’s graduation year attribute during his/her studies?
• The relational model provides special NULL values for such purposes.
must be “No!” because we don’t know the actual graduation year yet. In
addition, if s is another STUDENT row, then also the three questions
all get the same answer “No!” too, whether s .GradYear is a known year or
NULL.
– However, this holds even when row r is s! The relational model has the concept
of NULL values in general, but not “the NULL value(s) specifically for row r ”.
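The behaviour of NULL in comparisons can be tried out directly. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; SQLite merely stands in for an arbitrary RDBMS, and the sample row is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (SId INTEGER, SName TEXT, GradYear INTEGER)")
# Hypothetical student whose graduation year is not yet known.
conn.execute("INSERT INTO STUDENT VALUES (1, 'joe', NULL)")

# Compare the row with itself: here r and s are the same row.
same_year = conn.execute(
    "SELECT COUNT(*) FROM STUDENT r, STUDENT s "
    "WHERE r.SId = s.SId AND r.GradYear = s.GradYear"
).fetchone()[0]

# An unknown value has to be tested with IS NULL instead of '='.
unknown_year = conn.execute(
    "SELECT COUNT(*) FROM STUDENT WHERE GradYear IS NULL"
).fetchone()[0]

print(same_year, unknown_year)
```

The equality test finds no matching pair even though the row is compared with itself, whereas the IS NULL test finds the row.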
Figure 3: Some Contents of the University Database. (Sciore, 2008)
• Since NULL value behaviour is so different, some database theorists want to get rid
of them altogether. However, they are sometimes the best practical way to represent
that the information must exist, but is (yet) unknown.
• In contrast, attribute values which might or might not exist should be represented
in some other way.
– Suppose we added student mobile phone numbers into our university database.
If we added another attribute TelNo to the STUDENT table, and allowed
NULL values in it, then we would be implicitly claiming that “every student
does have a phone, but some students have kept their numbers secret”.
– A better design choice would be to add instead a new table with schema
MOBILE (SId , TelNo). (1)
∗ Then a student (represented by the ID) without a phone would have no
rows in this table. . .
∗ . . . whereas a student with many phones would have several.
∗ Moreover, since the university cannot use the information that a student
has a phone but its number is secret, the TelNo attribute can be declared
to be non-NULL.
That is, this new table represents the known mobile numbers.
Requirement 3 (NULL value constraints). The RDBMS must permit the table definition
to declare whether a particular attribute can contain NULL values or not. It must enforce
such a constraint by rejecting the insertion of a new row which would have a NULL value
for an attribute which has been declared non-NULL.
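Requirement 3 can be observed with the MOBILE table of schema (1). This is a sketch only: it uses Python's built-in sqlite3 module with SQLite standing in for an arbitrary RDBMS, and the phone number is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# MOBILE (SId, TelNo) from schema (1), with both attributes declared non-NULL.
conn.execute("CREATE TABLE MOBILE (SId INTEGER NOT NULL, TelNo TEXT NOT NULL)")
conn.execute("INSERT INTO MOBILE VALUES (1, '555-0100')")  # made-up number

try:
    # A row claiming "student 2 has a phone with an unknown number".
    conn.execute("INSERT INTO MOBILE VALUES (2, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the RDBMS refused the insertion

row_count = conn.execute("SELECT COUNT(*) FROM MOBILE").fetchone()[0]
print(rejected, row_count)
```

Only the row with a known number remains in the table after the rejected insertion.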
• The RDBMS maintains these declarations in its metadata alongside the table
definition.
• However, we shall largely bypass NULL values and their problems in this course.
2.2 Keys
(Sciore, 2008, Chapters 2.3–2.5)
• Intuitively, some attributes of a table identify or “name” uniquely the individual x
described by a row r whereas its other attributes describe the other qualities of
this x.
• The database table T with attributes a1 , a2 , a3 , . . . , am , b1 , b2 , b3 , . . . , bn satisfies the
functional dependency (FD)
a1 , a2 , a3 , . . . , am → bj (2)
if for all possible rows r and s that might be in T we have the following:
if r .a1 = s .a1 and r .a2 = s .a2 and r .a3 = s .a3 and. . . and r .am = s .am then also
r .bj = s .bj .
That is, the values for the attributes a1 , a2 , a3 , . . . , am on the left-hand side (LHS) of
the FD determine what the value for the attribute bj on its right-hand side (RHS)
must be.
• Note that this FD concerns the intended meaning of table T in the database schema,
not only the rows which T happens to contain just now.
TelNo → SId
because
since the mobile phone company will not give two different students r and s the same
mobile number (if we assume that two students do not share a common mobile).
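The definition of an FD can be turned into a check over the rows which a table happens to contain. The sketch below is in plain Python; the function name `satisfies_fd` and the sample rows are hypothetical. Note that such a check can only refute an FD: passing it says nothing about all the rows that might be in T, which is what the FD is really about.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if no two rows agree on all lhs attributes but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # same LHS values, different RHS value
        seen.setdefault(key, row[rhs])
    return True

# Made-up contents for MOBILE (SId, TelNo): student 1 has two phones.
mobile = [
    {"SId": 1, "TelNo": "555-0100"},
    {"SId": 1, "TelNo": "555-0101"},
    {"SId": 2, "TelNo": "555-0102"},
]

telno_determines_sid = satisfies_fd(mobile, ["TelNo"], "SId")
sid_determines_telno = satisfies_fd(mobile, ["SId"], "TelNo")
print(telno_determines_sid, sid_determines_telno)
```

These rows are compatible with TelNo → SId, but they refute SId → TelNo because student 1 has two numbers.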
• Trivially
a1 , a2 , a3 , . . . , am → ai
for every ai on its LHS.
• Transitively, if
a1 , a2 , a3 , . . . , am → b1
a1 , a2 , a3 , . . . , am → b2
a1 , a2 , a3 , . . . , am → b3
⋮
a1 , a2 , a3 , . . . , am → bn and
b1 , b2 , b3 , . . . , bn → c then also
a1 , a2 , a3 , . . . , am → c.
We can introduce vector notation in FDs to shorten such indexed sequences into
• If two rows r and s share the same values for all the LHS attributes a1 , a2 , a3 , . . . , am ,
then the database cannot tell them apart:
– They share the same values also for all the other attributes bj as well, by
property ①.
– Their order in T does not matter, by requirement 2.
– table T should really have just one copy of this row, not two; and
– each stored table should have candidate keys, to eliminate such duplicate rows.
• Once the database designer has determined the candidate keys for a new table T ,
(s)he chooses one of them as its primary key.
• None of the attributes ai of this chosen primary key is allowed to contain NULLs
by requirement 3 because they would make it impossible to check whether two rows
are two copies of the same row or not.
• What if T does not have any such “natural” candidate keys to choose?
– One solution is to say that T is “all key” and take all its attributes as the key.
– Another is to add an artificial “identifier” field to be the key.
∗ This is how the STUDENT, SECTION and ENROLL of our university
database got their Id fields.
∗ Its DEPT and COURSE have these Id fields as well, even though they
are not necessary: the department name and course title could have been
chosen as keys instead.
∗ However, UEF must have course ids, because we have both English and
Finnish titles for the same course.
Requirement 4 (key constraints). The RDBMS must permit table definition to also state
which attributes shall be its primary key. It must enforce such a constraint as follows:
① It must not permit any of these primary key attributes to have NULL values, via
requirement 3.
② It must reject adding another row with identical values for all these key attributes
as an already stored row.
• The RDBMS maintains this primary key information in its metadata alongside the
table definition.
• The RDBMS can also generate unique values for artificial identifiers.
It can for instance maintain counters in its metadata.
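Both the rejection of duplicate keys and the generation of artificial identifiers can be sketched with Python's built-in sqlite3 module. The table below is a cut-down STUDENT with only two attributes, and the automatic numbering shown is SQLite's own behaviour for an INTEGER PRIMARY KEY column; other RDBMSs offer the same service through different syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (SId INTEGER PRIMARY KEY, SName TEXT NOT NULL)")
conn.execute("INSERT INTO STUDENT VALUES (1, 'joe')")

try:
    # Another row with the same key value as an already stored row.
    conn.execute("INSERT INTO STUDENT VALUES (1, 'amy')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# SQLite-specific: when the INTEGER PRIMARY KEY is omitted, SQLite picks
# an unused value for it (one more than the largest key in use).
cursor = conn.execute("INSERT INTO STUDENT (SName) VALUES ('max')")
generated_id = cursor.lastrowid
print(duplicate_rejected, generated_id)
```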
• The chosen primary key attributes are often shown underlined in the schema of the
table.
• Only an actually stored database table has a primary key, but a table which the
RDBMS computes as an answer to show to the user does not.
• For instance, if we ask for just the students’ names in the university database, then
the answer will have duplicates, since several students can have the same name.
Hence this answer table cannot have a key – it is not even “all key”.
Figure 4: Foreign Keys for the University Database. (Sciore, 2008)
Foreign Keys
– its value r.a for a row r in table T names the row s in table U which
corresponds to this row r . . .
– . . . so that r .a = s .b where attribute b is the primary key chosen for table U .
• Foreign keys are the central tool to “glue together” the two individuals x and y
represented by the two rows r and s in the two relational tables T and U .
• For instance, attribute MajorId of table STUDENT is a foreign key of table DEPT.
Requirement 5 (referential integrity). The RDBMS must permit defining the foreign key
attribute(s) from one table into another. Moreover, it must enforce that if an attribute a
of a table T is defined to be a foreign key of table U , and its value r .a in a row r in T
is not NULL, then table U must contain a row whose primary key value equals this r .a.
• In other words, if a row r of table T claims that there is some corresponding row s
in table U , then this row s must indeed exist in table U .
• The RDBMS must react somehow, if the user attempts to delete from table U the
row s referenced by some rows r via the foreign key in table T , since it would violate
requirement 5.
IGNORE means that this attempt to delete row s from U will be rejected, because
something must be done to the rows r in T first.
The other reactions automate some common ways to “do something” to these
rows r first.
CASCADE means the following:
① this row s is deleted from its table U;
② every such row r will be deleted from table T; and
③ the RDBMS reacts to each of these deletions in step ② as defined.
This continues until requirement 5 is restored.
SET NULL first sets the foreign key attributes in table T into NULL. That is,
each modified row r will now say that “there is no corresponding row s in
table U ”.
Of course, all these attributes must permit NULL values, via requirement 3.
SET DEFAULT constant first sets the foreign key attributes in table T into the
given constant instead of NULL. It must be the key for some row s′ still in
table U. That is, each modified row r will now refer to row s′ instead of s.
• The RDBMS can maintain these foreign key definitions and their ON DELETE. . .
definitions (if any) alongside the definition of the referring table T in the metadata.
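Requirement 5 and two of the ON DELETE reactions can be sketched as follows with Python's built-in sqlite3 module. The column names (DId, SId, MajorId, EId, StudentId) are assumed for this cut-down university schema, and the PRAGMA is an SQLite peculiarity: it checks foreign keys only when asked to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE DEPT (DId INTEGER PRIMARY KEY, DName TEXT);
    CREATE TABLE STUDENT (
        SId INTEGER PRIMARY KEY,
        SName TEXT,
        MajorId INTEGER REFERENCES DEPT (DId) ON DELETE SET NULL
    );
    CREATE TABLE ENROLL (
        EId INTEGER PRIMARY KEY,
        StudentId INTEGER NOT NULL REFERENCES STUDENT (SId) ON DELETE CASCADE
    );
    INSERT INTO DEPT VALUES (10, 'compsci');
    INSERT INTO STUDENT VALUES (1, 'joe', 10);
    INSERT INTO ENROLL VALUES (14, 1);
""")

# Requirement 5: a row referring to a non-existent department is rejected.
try:
    conn.execute("INSERT INTO STUDENT VALUES (2, 'amy', 99)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

# SET NULL: deleting the department leaves joe without a major.
conn.execute("DELETE FROM DEPT WHERE DId = 10")
major = conn.execute("SELECT MajorId FROM STUDENT WHERE SId = 1").fetchone()[0]

# CASCADE: deleting joe also deletes his enrollments.
conn.execute("DELETE FROM STUDENT WHERE SId = 1")
enrollments = conn.execute("SELECT COUNT(*) FROM ENROLL").fetchone()[0]
print(dangling_rejected, major, enrollments)
```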
• This approach has developed normal forms (NFs) to guide the design of database
tables.
• We shall now review FD-based NFs, which already prevent the most common update
anomalies.
• Database theory literature has many more NFs based on generalizations of FDs,
which prevent other less often encountered anomalies.
The Oath of the Relational Database Designer
the whole key ensures the Second Normal Form (2NF), and
nothing but the key ensures the Third Normal Form (3NF).
1NF
• Table T is in 1NF if its chosen primary key does indeed satisfy property ① of
candidate keys.
since each person does indeed determine some set of corresponding phone numbers.
– However, our basic relational data model does not permit this:
Each attribute permits only a single indivisible value, and not a compound
value with inner structure.
– They would be permitted in so-called non-first normal form (NFNF) data models,
which extend the basic model.
• A possibility within the basic model might be to fix some upper limit p on phone
numbers/person, and use the schema
where each attribute PhoneNo i would be permitted to have NULLs to mean “this
person does not have an ith phone number”. However:
• Another possibility might be to give up property ① and use the schema
CONTACTS (Person,Address,PhoneNo)
with a duplicate row for each phone number for a given person. However:
– This table design would implicitly allow the same person to have many different
addresses.
– That is, it would not enforce even the FD
Person → Address
in the original situation.
– This is an example of an update anomaly:
The RDBMS would not be able to reject an update which would violate the
intended meaning.
• In practice, this solution also needs a fast way to find the phone numbers of a given
person. The primary key index of the PHONES table does not help, so we must
create a new clustering index too.
Requirement 6 (extra indexes). The RDBMS must permit defining new indexes, and it
must maintain these defined indexes automatically as the database contents are modified.
Unique indexes associate a single value to each key.
2NF
• 2NF considers tables whose chosen primary key consists of two or more attributes,
and requires that all the other attributes must depend on all of them.
• The schema
$T(\vec{a}, \vec{b}, c, \vec{d})$
and its FD
$\vec{b} \to c$ (4)
show how 2NF is violated: attribute c depends only on the part $\vec{b}$ of the whole key
$\vec{a}, \vec{b}$.
• The remedy is to split T into the two tables
$T(\vec{a}, \vec{b}, \vec{d})$ and $U(\vec{b}, c)$
where FD (4) has moved into the new table U. These two tables are connected by
stating that the attributes $\vec{b}$ of the old table T are a foreign key referencing the new
table U.
so that “this employee has worked this number of hours on that project which is
one of the projects of that department”.
• A corresponding update anomaly is that the WORKED table permits the same
project to belong to many different departments, despite FD (6).
or “this employee has worked this number of hours on that project” and “it is one
of the projects of that department” where
3NF
• The schema
$T(\vec{a}, b, c)$
and its FDs
$\vec{a} \to b$
$b \to c$ (8)
show how 3NF is violated: attribute c does depend on the key $\vec{a}$ as it should, but
only via the non-key intermediate attribute b.
• The remedy is to split T into the two tables
$T(\vec{a}, b)$ and $U(b, c)$
where FD (8) has moved into the new table U. These two tables are connected by
stating that the attribute b of the old table T is a foreign key referencing the new
table U.
WORKS (Person,Department,Address)
so that “this person works in that department which is located at that address”.
Person → Department
Department → Address (9)
• A corresponding update anomaly is that the WORKS table permits the same
department to be located at many different addresses, despite FD (9).
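The update anomaly and its cure can be sketched with Python's built-in sqlite3 module. The table names LOCATED and WORKS_IN for the two normalized tables are hypothetical, as are the sample persons and addresses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: only Person is the key, so FD (9) is not enforced.
    CREATE TABLE WORKS (Person TEXT PRIMARY KEY, Department TEXT, Address TEXT);
    -- Normalized (hypothetical table names): FD (9) becomes a key constraint.
    CREATE TABLE LOCATED (Department TEXT PRIMARY KEY, Address TEXT NOT NULL);
    CREATE TABLE WORKS_IN (
        Person TEXT PRIMARY KEY,
        Department TEXT REFERENCES LOCATED (Department)
    );
""")

# The anomaly: WORKS accepts two different addresses for the same department.
conn.execute("INSERT INTO WORKS VALUES ('ann', 'sales', '1 Main St')")
conn.execute("INSERT INTO WORKS VALUES ('bob', 'sales', '9 Side Rd')")
addresses = conn.execute(
    "SELECT COUNT(DISTINCT Address) FROM WORKS WHERE Department = 'sales'"
).fetchone()[0]

# After normalization the RDBMS can reject the second address.
conn.execute("INSERT INTO LOCATED VALUES ('sales', '1 Main St')")
try:
    conn.execute("INSERT INTO LOCATED VALUES ('sales', '9 Side Rd')")
    second_address_rejected = False
except sqlite3.IntegrityError:
    second_address_rejected = True
print(addresses, second_address_rejected)
```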
or “this person works in that department” and “this department is located at that
address” where
• From the database design perspective, integrity constraints describe what it means
for the database to reflect the reality (whatever that means. . . ) of its intended
application area.
Figure 5: Checking assertions in a table definition. (Sciore, 2008)
Assertions
• Assertions are conditions which the database state must satisfy.
• If the result of a change operation would violate any constraint, then the RDBMS
rejects the operation.
• In SQL, such an assertion can be expressed with
check condition
which tests the given truth-valued condition.
• Such a check can appear in an SQL table definition, where it states a condition
which the attribute values for each row must satisfy, as in Figure 5.
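A check in a table definition can be sketched as follows with Python's built-in sqlite3 module. The two conditions are hypothetical examples and are not claimed to be those of Figure 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE STUDENT (
        SId INTEGER PRIMARY KEY,
        SName TEXT NOT NULL,
        GradYear INTEGER,
        CHECK (SId > 0),
        CHECK (GradYear >= 1900)  -- hypothetical lower bound
    )
""")
conn.execute("INSERT INTO STUDENT VALUES (1, 'joe', 2004)")
# A NULL graduation year passes: a check rejects a row only if its condition is false.
conn.execute("INSERT INTO STUDENT VALUES (2, 'amy', NULL)")

try:
    conn.execute("INSERT INTO STUDENT VALUES (3, 'max', 1066)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]
print(rejected, stored)
```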
Figure 6: One example of a named SQL assertion. (Sciore, 2008)
Figure 8: An example of a named SQL assertion over two tables. (Sciore, 2008)
Triggers
• Sometimes we do not want to reject a change operation, as assertions do, but
continue with other operations until the database is again in the kind of state we want
it to be.
• They tell how to continue until the referential integrity of the database is restored
again.
Event- because a trigger waits for a certain modification operation like insert,
delete or update to happen
Condition- because a trigger has a condition which the RDBMS tests when its
event happens, and this condition determines whether this trigger will fire or
not
Action- because if the trigger fires, then the RDBMS performs these other
operations
Event is a delete operation to the table which is referenced by some foreign key
from another table.
Condition is to test whether this would delete rows which are referenced by rows
of this other table.
Action is the given option ON DELETE IGNORE, CASCADE, SET NULL
or SET DEFAULT constant.
• In Figure 9, the university wants to permit several persons in its staff to modify
course grades, but also wants to maintain a GRADE_LOG table recording who changed
what and when for auditing purposes.
• The trigger in Figure 10 enforces the American university policy that when a new
student is inserted into the database, his/her expected graduation year is no
more than 4 years from now.
• However, note that the expected graduation year for an existing student can still
be updated to violate this policy, because the trigger in Figure 10 applies only to
insertion events.
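The event, condition and action parts of a trigger can be sketched with Python's built-in sqlite3 module. This is not the trigger of Figure 9: the trigger name, the column names and the sample grades are assumptions, and since SQLite has no notion of a current user, the log records only what changed and when.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ENROLL (EId INTEGER PRIMARY KEY, StudentId INTEGER, Grade TEXT);
    CREATE TABLE GRADE_LOG (EId INTEGER, OldGrade TEXT, NewGrade TEXT, ChangedAt TEXT);

    CREATE TRIGGER LogGradeChange
    AFTER UPDATE OF Grade ON ENROLL            -- event
    FOR EACH ROW
    WHEN OLD.Grade IS NOT NEW.Grade            -- condition
    BEGIN                                      -- action
        INSERT INTO GRADE_LOG
        VALUES (OLD.EId, OLD.Grade, NEW.Grade, datetime('now'));
    END;

    INSERT INTO ENROLL VALUES (14, 1, 'B');
""")

conn.execute("UPDATE ENROLL SET Grade = 'A' WHERE EId = 14")  # fires the trigger
conn.execute("UPDATE ENROLL SET Grade = 'A' WHERE EId = 14")  # condition is false
log = conn.execute("SELECT EId, OldGrade, NewGrade FROM GRADE_LOG").fetchall()
print(log)
```

Only the first update is logged, because the second one does not change the grade and so the condition stops the trigger from firing.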
Figure 9: An example of an SQL trigger. (Sciore, 2008)
Figure 11: The Schemata.
• Although normalization makes sense from the DBA’s viewpoint, because it avoids
update anomalies, the resulting new table structure might make less sense than the
original from the user’s viewpoint.
• For instance, the user may well prefer the original WORKED table (5) over its
normalized form (7) with the separate PROJECTS table, because (s)he may wish
to be reminded of the department when reading the hours listing.
Conceptual schema consists of the normalized tables derived from the class diagrams
describing the application area for which this database has been designed.
Physical schema implements the conceptual schema with concrete database table and
index files.
External schemas implement the user’s various views to the stored data on top of the
conceptual schema.
• A DBMS supports logical data independence if its users can be given their
own external schemas, so that they do not need to know the conceptual schema.
• Data independence is desirable, because it shields the upper levels from changes in
the lower levels.
Requirement 7 (views). To support logical data independence, the RDBMS must support
defining views: virtual tables on top of actual tables.
• A view can be either
purely virtual so that it exists only as a query Q which accesses the actual tables
in the desired way, or
materialized so that the RDBMS maintains its current contents also in a separate
actual table V .
– This information is redundant, since the contents of this table V could be
created by the defining query Q from the database instead.
– In the “good” ol’ times, already normalized tables were later denormalized
by hand to provide such redundancy.
– Views are a better alternative, since the RDBMS can manage them
automatically.
– But there is not (yet) any standard vendor-independent way to define a
materialized view. . .
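A purely virtual view can be sketched with Python's built-in sqlite3 module. The view below gives the user back a WORKED listing which includes the department, on top of normalized tables. The attribute names and the view name WORKED_WITH_DEPT are assumptions, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PROJECTS (Project TEXT PRIMARY KEY, Department TEXT);
    CREATE TABLE WORKED (
        Employee TEXT,
        Project TEXT REFERENCES PROJECTS (Project),
        Hours INTEGER,
        PRIMARY KEY (Employee, Project)
    );
    -- Purely virtual view: only its defining query Q is stored.
    CREATE VIEW WORKED_WITH_DEPT AS
        SELECT w.Employee, w.Project, p.Department, w.Hours
        FROM WORKED w JOIN PROJECTS p ON p.Project = w.Project;
    INSERT INTO PROJECTS VALUES ('db', 'compsci');
    INSERT INTO WORKED VALUES ('ann', 'db', 5);
""")

before = conn.execute("SELECT * FROM WORKED_WITH_DEPT").fetchall()
# A change in an underlying table is visible through the view at once,
# because the view is re-evaluated from its query each time it is used.
conn.execute("INSERT INTO WORKED VALUES ('bob', 'db', 3)")
after = conn.execute("SELECT * FROM WORKED_WITH_DEPT ORDER BY Employee").fetchall()
print(before, after)
```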
• The user of an external schema should see its views just like ordinary tables.
However, there is a difference: It might not be clear how a view can be updated – that
is, how the RDBMS should handle insertions, deletions and updates to its rows,
because these rows might not “really” exist.
• The intuition is that only those views are updatable, whose defining query Q is
so “simple” that the affected rows of its underlying actual tables can be
determined (Connolly and Begg, 2010, Chapter 4.4.3) (Elmasri and Navathe, 2011,
Chapter 5.3.3):
– If Q uses grouping or aggregation operations (explained next), then it is not
updatable – since one row in the view is a combination of several rows of the
underlying table(s).
– If Q uses more than one table, then the view is in general not updatable –
since one row in the view is a combination of several rows, each from a different
underlying table.
– If Q contains nested queries, then the view is not updatable – since the update
might have to affect these nested queries too.
– If Q does not mention all the non-null attributes without default values of its
only table, then it is not updatable – since the update would not specify the
required values for these missing attributes.
Otherwise the view can be updatable.
• Another alternative which is becoming common in RDBMSs is to use stored
procedures instead of view updates.
– A stored procedure is a combination of programming language and query
language constructs – an “RDBMS subroutine”.
– It is stored in the view metadata.
– The user can invoke such a procedure, which the DBA has programmed to “do
the right thing” when the user wants to update his/her view.
They offer more flexibility than plain view updates, where the RDBMS has to guess
what would be “the right thing” to do when handling a view update.
Grouping Data
• The WORKED table example (5) suggests also another viewpoint to data:
The user may also wish to list the total number of hours spent per project.
① Sort the rows of the WORKED table according to its WORKED.Project
attribute, via the sorting requirement 2.
② For each distinct WORKED.Project attribute value p, add together all the
t.Hours values for all the rows t with t.Project = p. All these rows t are now
adjacent to each other, by step ①.
③ Report to the user each value p and the corresponding sum computed in step ②.
Requirement 8 (grouping). The RDBMS must be able to group together related rows
and summarize each group into a single representative accumulated value.
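The sort-then-accumulate way of grouping can be sketched in plain Python as follows. The function name `hours_per_project`, the attribute names and the sample rows are made up for illustration. In SQL the same listing would be requested declaratively with something like SELECT Project, SUM(Hours) FROM WORKED GROUP BY Project.

```python
from itertools import groupby
from operator import itemgetter

# Made-up contents for the WORKED table.
worked = [
    {"Employee": "ann", "Project": "db", "Hours": 5},
    {"Employee": "bob", "Project": "web", "Hours": 2},
    {"Employee": "bob", "Project": "db", "Hours": 3},
]

def hours_per_project(rows):
    # Sort by Project, so that rows of the same project become adjacent.
    by_project = sorted(rows, key=itemgetter("Project"))
    # groupby only merges adjacent rows, which is why the sort is needed;
    # each group is then summarized into one (project, total) pair.
    return [
        (project, sum(row["Hours"] for row in group))
        for project, group in groupby(by_project, key=itemgetter("Project"))
    ]

totals = hours_per_project(worked)
print(totals)
```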
2.5 Transactions
• The data grouping scenario at the end of section 2.4 shows that an RDBMS must
control concurrent access to its contents:
– One employee x has asked the RDBMS to give the listing of total hours per
each project. . .
– . . . while other employees y, z, u, . . . insert their own hours into the database at
the same time.
– Which of these new hours will be included in the listing?
• Note that even when no concurrency is permitted, the RDBMS must somehow be
able to enforce it as well.
• The RDBMS must also be able to recover properly after a crash. Consider the
following scenario:
– Suppose that a user deletes some row, which is referenced by a foreign key, and
this starts many other CASCADEd deletions in other parts of the database.
– Then the computer running the RDBMS crashes in the middle of these
CASCADEd deletions.
– When the computer and RDBMS are restarted after the crash, the RDBMS
must first somehow undo all those CASCADEd deletions which it managed
to perform before the crash.
(Or carry out the rest of them too, but this would be even harder.)
– Otherwise some of the CASCADEd deletions would be done while others
would be left undone – and so the referential integrity of the database might
be in danger!
• Hence RDBMS implementation includes aspects of fault-tolerant computing.
• These two scenarios show that the RDBMS must maintain its consistent state in
successive “snapshots”:
① The first grouping scenario showed that a query must be evaluated in some
static “snapshot” of the database, and updating it cannot be permitted at the
same time.
② The second crash scenario showed that an update must take the database all
the way from one snapshot into the next, even though this may mean many
lengthy individual operations.
• One concept subsumes both of these concurrency and recovery requirements for an
RDBMS:
Atomicity
commit which means that it has managed to execute all its operations successfully,
or an
abort (also called rollback) which means to undo all the operations which it did
manage to execute successfully, so that afterwards everything looks as if
the transaction had never started at all.
• Hence the abort operation is a very convenient abstraction for cleaning everything
up after an error has occurred in the middle of a transaction – a very common
programming pattern in fault-tolerant computing.
Correctness
• A transaction must be correct, in the sense that the state of the database after
it has committed must again be consistent, as defined by its integrity constraints
in section 2.3, . . .
– For instance, a deletion and all its cascaded operations in our second crash
scenario are executed in the same transaction t in step ❶.
– The referential integrity requirement 5 is temporarily broken during t. . .
– . . . but is restored after committing or aborting t.
• In this way, the RDBMS uses transactions internally for its own operations like
these CASCADEd deletions.
• The RDBMS must also permit external application programs which use the database
to specify their own transactions:
– The canonical example is: “Transfer X € from bank account Y into Z if Y has
enough money.”
– In pseudocode:
xfer(X,Y,Z):
1 SELECT Balance
  FROM Bank
  WHERE Account = Y
2 if Balance ≥ X
3     UPDATE Bank
      SET Balance = Balance − X
      WHERE Account = Y;
4     UPDATE Bank
      SET Balance = Balance + X
      WHERE Account = Z
5 else what?
• The RDBMS can run each of the three SQL statements on lines 1, 3 and 4 in its
own internal transaction – but it would not be enough!
• Lines 3–4 must be executed in the same transaction too:
• Consider finally line 5. How do we want to report the error that “account Y has
less than X €”?
– A good choice would be to abort the transaction (even though it has changed
nothing in the Bank) because then an abort means “the money was not trans-
ferred for some reason”.
– Otherwise the transaction could commit in two ways:
either with “the money was transferred”
or with “there was not enough money to transfer”
and the caller of xfer would then have to find out which of these two
possibilities actually happened.
– When this xfer code is used as a small part of a large program which
implements the “business logic” of the organization, this choice to abort becomes
more and more attractive to the programmer.
• Hence the RDBMS must permit external application programs to begin, commit
and abort their own transactions, which may consist of several database and non-
database operations.
• This is required, because the database might also have more complex integrity
constraints like “money should not just disappear” which cannot be stated with just
the RDBMS assertions and triggers.
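The xfer pseudocode with application-controlled begin, commit and abort can be sketched with Python's built-in sqlite3 module. The exception class TransferAborted, the check that the receiving account exists, and the sample accounts are additions for illustration; BEGIN IMMEDIATE is SQLite's way of starting a transaction which intends to write.

```python
import sqlite3

class TransferAborted(Exception):
    """Raised when xfer has rolled back without moving any money."""

def xfer(conn, x, y, z):
    conn.execute("BEGIN IMMEDIATE")  # lines 1-4 share one transaction
    try:
        row = conn.execute(
            "SELECT Balance FROM Bank WHERE Account = ?", (y,)).fetchone()
        if row is None or row[0] < x:
            raise TransferAborted("account %s has less than %s" % (y, x))
        conn.execute(
            "UPDATE Bank SET Balance = Balance - ? WHERE Account = ?", (x, y))
        credited = conn.execute(
            "UPDATE Bank SET Balance = Balance + ? WHERE Account = ?", (x, z))
        if credited.rowcount != 1:  # money should not just disappear
            raise TransferAborted("no such account %s" % z)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # abort: undo whatever was already done
        raise

# isolation_level=None leaves transaction control to the explicit statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Bank (Account TEXT PRIMARY KEY, Balance INTEGER)")
conn.executemany("INSERT INTO Bank VALUES (?, ?)", [("Y", 100), ("Z", 0)])

def balances():
    return dict(conn.execute("SELECT Account, Balance FROM Bank"))

xfer(conn, 30, "Y", "Z")
print(balances())
```

A caller can then treat a raised TransferAborted as “the money was not transferred for some reason”, with the database left as it was before the call.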
Isolation
• Transactions must be isolated from each other, in the sense that a transaction
must not notice any of the other concurrently running transactions – instead, each
transaction must “see” the database as if it were the only transaction using it.
• Hence isolation is the other part of the concurrency requirement (besides correct-
ness).
– Employee x will get a listing of exactly those hours of the other employees
y, z, u, . . . whose insertion transactions ty, tz, tu, . . . have already been
committed before x starts the listing transaction tx.
– If such an insertion transaction is running at the same time as the listing
transaction, they are isolated from each other. So tx does not see the effects of those
transactions ty, tz, tu, . . . which are still running – because they might abort at the end,
and must therefore not be listed!
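The invisibility of uncommitted insertions can be sketched with two connections to the same database file, using Python's built-in sqlite3 module. This shows only SQLite's behaviour under its default settings and is not a general statement about every RDBMS. The file name, attribute names and sample rows are made up.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "worked.db")
writer = sqlite3.connect(path, isolation_level=None)  # plays an inserting employee
reader = sqlite3.connect(path, isolation_level=None)  # plays the listing employee x

writer.execute("CREATE TABLE WORKED (Employee TEXT, Project TEXT, Hours INTEGER)")
writer.execute("INSERT INTO WORKED VALUES ('ann', 'db', 5)")

def total_hours(conn):
    # fetchall() runs the query to completion, so no read lock is left behind.
    return conn.execute("SELECT SUM(Hours) FROM WORKED").fetchall()[0][0]

writer.execute("BEGIN")
writer.execute("INSERT INTO WORKED VALUES ('bob', 'db', 3)")  # not yet committed
seen_during = total_hours(reader)  # the listing does not include bob's hours

writer.execute("COMMIT")
seen_after = total_hours(reader)   # a later listing does include them

print(seen_during, seen_after)
writer.close()
reader.close()
```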
Figure 12: Transaction isolation levels. (Sciore, 2008)
• Isolation is the one ACID property which the user can relax, if (s)he. . .
In other words, the user can play “fast and loose” by altering the transaction
isolation level of his/her query, and accept the risks involved.
• These 4 levels are shown in Figure 12. Its middle column discusses a possible
implementation, and we shall return to that column later.
• Every transaction should run at this level by default, and in most RDBMSs
they do.
• In our first grouping scenario ①, the listing would contain the effects of only those
transactions ty, tz, tu, . . . which committed before transaction tx started.
• In our first grouping scenario ①, the listing might also contain some rows added
by those transactions ty, tz, tu, . . . which committed during transaction tx . . .
• . . . but user x would not know which, because that depends on the concurrent
execution order of these transactions tx, ty, tz, tu, . . .
• This level is useful for transactions which modify an already existing row in the
database, because phantoms do not affect that.
• This is why some RDBMSs (notably Oracle and Sybase) use it as the default
isolation level instead of fully Serializable.
• The new risk (in addition to phantoms and nonrepeatable reads) is dirty reads:
A transaction can read data as soon as another transaction writes it – even
when this other writing transaction later aborts, and its writings should not
have happened at all.
• This is also very fast, because this transaction does not have to stop and wait
for any other transactions.
• In our first grouping scenario ①, the listing would contain whatever was in the
WORKED table when transaction tx happened to read it.
• However, this level would be OK for read-only transactions whose results do
not have to be exactly accurate.
• For instance, user x can run the listing transaction tx in this level, if (s)he
just wants to compute quickly some rough statistics about approximately how
many Hours have people WORKED on each project.
Durability
• Durability means that when a transaction commits, then the changes it has made
to the data are now stored permanently, so that even a computer crash does not
wipe them out.
• Hence durability is the other part of the recovery requirement (besides atomicity).
              Java programming language          RDBMS
source        Java source code in a .java        SQL statement from the user (which
              file                               might also be an application program)
                                                 – declarative approach: what the result
                                                 must be
              which gets compiled into
intermediate  corresponding Java object code     a corresponding Relational Algebra
              in a .class file by the Java       expression by the SQL parser of the
              compiler                           RDBMS and optimized by its query
                                                 optimizer – procedural approach: how
                                                 the result can be formed
              which gets executed by
runtime       the Java virtual machine (JVM)     internal algorithms chosen by the
• The THA course has already presented the Relational Algebra from its own
viewpoint.
• Here we present it from the RDBMS viewpoint, as the intermediate language
between
as shown in Table 1.
• Hence we present here a variant of the Relational Algebra which may be closer to
the internals of the RDBMS than the one presented in THA.
• We also assume that the idea of expressions as trees is familiar from the course
“Basic Models of Computation” (”Laskennan perusmallit” or LAP in Finnish).
• Recall that the result of a Relational Algebra operation is another table, and that
this result has its own schema.
• In mathematical presentations of Relational Algebra, these tables are considered to
be sets of rows. Here we consider them to be bags or multi-sets of rows instead,
because the results computed by RDBMSs have in general duplicate rows, unless
they are explicitly suppressed.
Select (Sciore, 2008, Chapter 4.2.1)
• The select operator takes 2 arguments: a table and a predicate.
• Its result consists of those rows of its table argument for which the predicate is
true.
• Hence its result has the same schema as the table argument.
Q3 = select(select(STUDENT
,GradYear=2004) ①
,MajorId=10 or MajorId=20) ②
① first selects those rows of the STUDENT table where the GradYear attribute
equals 2004 – as the inner operation – and
② then selects from them those rows where the MajorId attribute equals either
10 or 20 – as the outer operation.
In this way, it selects the students who graduated in 2004 from either computer
science or mathematics.
starting at the leaf nodes representing the actually stored tables – here STUDENT
– and
moving up towards the root, and doing the Relational Algebra operator at each
internal node.
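• As a rough illustration of this bottom-up evaluation, the following Java sketch models a table as a list (bag) of rows and evaluates Q3 by applying select twice. The class name SelectSketch, the helper methods and the sample rows are assumptions made for this illustration only; they are not part of SimpleDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class SelectSketch {
    // select(table, predicate): keeps the rows for which the predicate is true.
    // The schema (the attribute names of each row) stays the same.
    static List<Map<String, Object>> select(List<Map<String, Object>> table,
                                            Predicate<Map<String, Object>> pred) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : table) {
            if (pred.test(row)) {
                result.add(row);
            }
        }
        return result;
    }

    // Hypothetical sample contents for the STUDENT table.
    static List<Map<String, Object>> student() {
        return List.of(
            Map.of("SName", "joe", "GradYear", 2004, "MajorId", 10),
            Map.of("SName", "amy", "GradYear", 2004, "MajorId", 20),
            Map.of("SName", "max", "GradYear", 2005, "MajorId", 10),
            Map.of("SName", "sue", "GradYear", 2004, "MajorId", 30));
    }

    // Q3: the inner select (nearest the leaf) runs first, then the outer one.
    static List<Map<String, Object>> q3() {
        List<Map<String, Object>> inner =
            select(student(), r -> r.get("GradYear").equals(2004));
        return select(inner,
            r -> r.get("MajorId").equals(10) || r.get("MajorId").equals(20));
    }

    public static void main(String[] args) {
        for (Map<String, Object> row : q3()) {
            System.out.println(row.get("SName"));
        }
    }
}
```

With these sample rows only joe and amy satisfy both predicates.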
Figure 13: The Relational Algebra expression tree for Q3. (Sciore, 2008)
• Its result has the same rows as its table argument, but its schema is restricted to
consist only of these particular attributes – that is, we forget that the table argument
has any other attributes than these.
Q6 = project(select(STUDENT
,MajorId=10)
,{SName})
• Its result will in general have duplicate rows – one for each computer science student
with that particular name.
• This operation is often written as π_attributes(table) in the database literature – ’π’
being the Greek ’p’.
Figure 14: The tree for Q6. (Sciore, 2008)
• The result is sorted according to this order. That is, the result is now an ordered
bag. It has the same schema as the table argument.
• Because this order does not matter to the other operators, sort is usually the last
(topmost, root) operator in the expression, and it is used only for displaying the
result to the user.
Q8 = sort(STUDENT
,[GradYear,SName])
Rename (Sciore, 2008, Chapter 4.2.4)
• Its result is the same table argument, except that the attribute argument is now
called by this new name in its schema.
• Relational Algebra contains also operators with two table arguments, as we shall
soon see.
• We must sometimes rename their attributes apart from each other first, to make
clear which of these two table arguments contains a particular attribute.
Q11 = extend(STUDENT
,GradYear-1863
,GradClass)
• This operation handles Requirement 8.
① The rows in the table argument are partitioned into groups, so that two rows t
and u are in the same group exactly when t.a = u.a for every attribute a
mentioned in the attribute argument.
② Each such group g generates one tuple t_g into the result. The value t_g.a will
be this common attribute value of g for every attribute a mentioned in the
attribute argument.
③ This tuple t_g will also be “extended” with the values for each of the expressions
mentioned in the expression argument.
– Here these values are now computed by considering all the rows in g together.
– Hence they summarize the whole group g.
– In contrast, extend computed its new values individually row by row.
Q12 = groupby(STUDENT
,{MajorId} ①
,{Min(GradYear),Max(GradYear)})
Q13 = groupby(STUDENT
,{MajorId,GradYear}
,{Count(SId)})
specifies two grouping attributes MajorId and GradYear, and so its result in Fig-
ure 16 tabulates how many graduates each major subject has had each year.
• If the attribute argument is empty, then the whole table argument forms a single
group, which gets summarized into a single row:
Figure 15: The output for Q12. (Sciore, 2008)
Q14 = groupby(STUDENT
,{}
,{Min(GradYear)})
• If the expression argument is empty, then groupby groups the rows of the table
argument and removes duplicates:
Q15 = groupby(STUDENT
,{MajorId}
,{})
• The functions in the expression argument come in two flavours. For instance:
Q16 counts how many students there are with known major subjects – aggregation
ignores NULL values, because it is not clear which group they should belong
to.
Q17 counts instead how many distinct major subjects the students have – each
major subject is now counted only once, whereas Q16 added 1 to the count for
each student.
Q16 = groupby(STUDENT
,{}
,{Count(MajorId)})
Q17 = groupby(STUDENT
,{}
,{CountDistinct(MajorId)})
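• The difference between Q16 and Q17 can be sketched in Java as follows. The class CountSketch and the sample MajorId column are assumptions for illustration; a Java null stands for an SQL NULL.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

public class CountSketch {
    // Count(MajorId): adds 1 for every row whose MajorId is not NULL.
    static long count(List<Integer> majorIds) {
        return majorIds.stream().filter(Objects::nonNull).count();
    }

    // CountDistinct(MajorId): counts each non-NULL value only once.
    static long countDistinct(List<Integer> majorIds) {
        Set<Integer> seen = new HashSet<>();
        for (Integer id : majorIds) {
            if (id != null) {
                seen.add(id);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical MajorId column of STUDENT; one student has no known major.
        List<Integer> majorIds = Arrays.asList(10, 20, 10, null, 30, 20);
        System.out.println(count(majorIds));          // 5
        System.out.println(countDistinct(majorIds));  // 3
    }
}
```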
• It takes 2 arguments:
• The result of
product(T ,U )
Figure 17: The result of Q22 = product(STUDENT,DEPT). (Sciore, 2008)
• The schema for its results consists of the schemas for its two table arguments to-
gether – because they are assumed to be renamed apart from each other.
Figure 18: The expression tree for Q23. (Sciore, 2008)
in the database literature – because if the tables T and U are sets, then the result
is their Cartesian product.
fundamental because with it we can combine tables in every way we may want
to, but on the other hand
impractical because
– it is very slow to compute, because its result is so big, and
– (almost) always we want to combine tables with much more precision than
“all rows r from table T to all rows s from table U ”.
Join
select(product(T
,U )
,b1 = c1 and b2 = c2 and...and bn = cn ).
• In our university example, we may want to combine students and their majors in
this way:
Q23 = select(product(STUDENT
,DEPT)
,MajorId=DId)
Then its result also contains the attribute DName which gives the name of the major
– the MajorId had the same information only as an artificial ID.
• These are examples of join operations. They have 3 arguments: the two tables T
and U, and a join predicate φ.
They are so common and useful that they warrant their own shorthand notation:
join(T ,U ,φ) ≡
select(product(T
,U )
,φ)
• When the join predicate φ has the form
b1 = c1 and b2 = c2 and...and bn = cn
as here, the join is called an equijoin. We focus mainly on them here.
• When an equijoin is used to traverse the foreign key from table T into table U, as
here, it is called a relationship join.
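• The definition of join as a select over a product can be sketched directly in Java. This is a naive illustration under stated assumptions (the class JoinSketch, rows as maps, the sample STUDENT and DEPT rows are all hypothetical); it is not how an RDBMS would implement a join efficiently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class JoinSketch {
    // product(t, u): pairs every row of t with every row of u.
    // The attribute names of t and u are assumed to be renamed apart.
    static List<Map<String, Object>> product(List<Map<String, Object>> t,
                                             List<Map<String, Object>> u) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> r : t) {
            for (Map<String, Object> s : u) {
                Map<String, Object> combined = new HashMap<>(r);
                combined.putAll(s);
                result.add(combined);
            }
        }
        return result;
    }

    // join(t, u, phi) = select(product(t, u), phi)
    static List<Map<String, Object>> join(List<Map<String, Object>> t,
                                          List<Map<String, Object>> u,
                                          Predicate<Map<String, Object>> phi) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> row : product(t, u)) {
            if (phi.test(row)) {
                result.add(row);
            }
        }
        return result;
    }

    // Q23 on hypothetical sample data: join STUDENT and DEPT on MajorId = DId.
    static List<Map<String, Object>> q23() {
        List<Map<String, Object>> student = List.of(
            Map.of("SName", "joe", "MajorId", 10),
            Map.of("SName", "amy", "MajorId", 20));
        List<Map<String, Object>> dept = List.of(
            Map.of("DId", 10, "DName", "compsci"),
            Map.of("DId", 20, "DName", "math"),
            Map.of("DId", 30, "DName", "drama"));
        return join(student, dept, r -> r.get("MajorId").equals(r.get("DId")));
    }

    public static void main(String[] args) {
        for (Map<String, Object> row : q23()) {
            System.out.println(row.get("SName") + "\t" + row.get("DName"));
        }
    }
}
```

The product here has 2 · 3 = 6 rows, of which the predicate keeps 2. This shows why computing the whole product first is wasteful.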
• As an example of joining multiple tables together, let us find out the grades Joe
received in 2004:
Q25 = select(STUDENT
,SName=’joe’)
Q26 = join(Q25
,ENROLL
,SId=StudentId)
Q27 = select(SECTION
,YearOffered=2004)
Q28 = join(Q26
,Q27
,SectionId=SectId)
Q29 = project(Q28
,{Grade})
Q26 finds the courses to which Joe has ENROLLed. This needed his student ID via
Q25.
Q28 finds his ENROLLments during 2004. This needed the SECTIONs offered then via
Q27.
Figure 19: The expression tree for Q25–Q29. (Sciore, 2008)
– It consists of those rows r of the first table T for which there exists some
matching row s in the second table U .
– That is, so that r and s together satisfy the join predicate φ.
– But none of the attributes of this matching row s are included in the result.
except that now rows r of table T are chosen into the result based on the other
table U
whereas selection chose rows r based on the attribute values in each row r itself.
project(join(T
,U
,φ)
,the attributes of T ).
Q38 = select(SECTION
,Prof=’einstein’)
Q39 = semijoin(ENROLL
,Q38
,SectionId=SectId)
Q40 = semijoin(STUDENT
,Q39
,SId=StudentId)
Figure 20: The expression tree for Q38–Q40. (Sciore, 2008)
Q39 chooses those ENROLLments whose section IDs are found in the SECTIONs taught
by Einstein, as computed by Q38.
Q40 chooses those STUDENTs whose student IDs are found in Q39.
Antijoin
• We need this antijoin for queries whose form is “there does not exist any x such
that. . . ”.
– For instance, a SECTION of a course was easy if no ENROLLed student got the
failing grade ‘F’.
– In other words: if there does not exist any ENROLLed student who got an F in
this SECTION.
– In our Relational Algebra this is
Q42 = select(ENROLL
,Grade=’F’)
Q43 = antijoin(SECTION
,Q42
,SectionId=SectId)
or “keep only those SECTIONs which do not appear in the table Q42 of ENROLLments
which got an ‘F’”.
• We need antijoin also for queries whose form is “something holds for every x”.
or “keep only the professors of those SECTIONs whose professor has never taught
an easy SECTION (where the previous query Q43 retrieved the easy sections)”.
– Figure 21 shows its expression tree.
• Note: These double negations can be tricky to read and write! It helps to know
something about logic.
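• Semijoin and antijoin differ only in whether a row of the first table is kept when a matching row exists in the second table or when none exists. A minimal Java sketch of this idea, evaluating Q42 and Q43, is given below. The class SemiAntiJoinSketch, its method names and the sample SECTION and ENROLL rows are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

public class SemiAntiJoinSketch {
    // True if some row s of u matches row r under the join predicate phi.
    static boolean hasMatch(Map<String, Object> r,
                            List<Map<String, Object>> u,
                            BiPredicate<Map<String, Object>, Map<String, Object>> phi) {
        for (Map<String, Object> s : u) {
            if (phi.test(r, s)) {
                return true;
            }
        }
        return false;
    }

    // keepMatching = true  gives semijoin(t, u, phi),
    // keepMatching = false gives antijoin(t, u, phi).
    // Either way the result has only the attributes of t.
    static List<Map<String, Object>> filterByMatch(
            List<Map<String, Object>> t, List<Map<String, Object>> u,
            BiPredicate<Map<String, Object>, Map<String, Object>> phi,
            boolean keepMatching) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> r : t) {
            if (hasMatch(r, u, phi) == keepMatching) {
                result.add(r);
            }
        }
        return result;
    }

    // Q43 on hypothetical data: the SECTIONs in which nobody got an 'F'.
    static List<Map<String, Object>> easySections() {
        List<Map<String, Object>> section = List.of(
            Map.of("SectId", 13), Map.of("SectId", 43));
        List<Map<String, Object>> enroll = List.of(
            Map.of("SectionId", 13, "Grade", "A"),
            Map.of("SectionId", 43, "Grade", "F"));
        // Q42 = select(ENROLL, Grade='F')
        List<Map<String, Object>> q42 = new ArrayList<>();
        for (Map<String, Object> e : enroll) {
            if (e.get("Grade").equals("F")) {
                q42.add(e);
            }
        }
        // Q43 = antijoin(SECTION, Q42, SectionId=SectId)
        return filterByMatch(section, q42,
            (s, e) -> e.get("SectionId").equals(s.get("SectId")), false);
    }

    public static void main(String[] args) {
        System.out.println(easySections()); // only section 13 remains
    }
}
```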
Figure 21: The tree for stern professors. (Sciore, 2008)
Q52 = rename(project(STUDENT
,{SName})
,SName
,Person)
Q53 = rename(project(SECTION
,{Prof})
,Prof
,Person)
Q54 = union(Q52
,Q53)
Figure 22: The result of Q55. (Sciore, 2008)
combines both STUDENTs (in Q52) and professors (in Q53) together as Persons, be-
cause here a person is either a student or a professor.
Outer Join
• The union operator is most commonly used as part of the outer join operator.
• This outerjoin operator has the same 3 arguments as the join operator.
• Its result consists of
– the result of the corresponding join operation, together with (here is the
union)
– all the rows from the two argument tables which did not match the join pred-
icate. . .
– . . . with their missing attribute values filled with NULLs (which of course must
be permitted by requirement 3).
That is, an outerjoin is a join which does include NULLs because their unknown
actual values might have matched the join predicate.
• For instance, we may want to see all the current ENROLLments together with all the
STUDENTs who have not ENROLLed into anything yet:
Q55 = outerjoin(STUDENT,ENROLL,SId=StudentId)
• From this we can count the number of ENROLLments for each STUDENT:
Q58 = groupby(Q55
,{SId}
,{Count(EId)})
– Now a STUDENT with no ENROLLments yet is alone in his/her own group. . .
– . . . and since the Count aggregation function ignores the NULL EId value in
his/her own group, its value will be 0 as it should.
– If we had used just ENROLL instead of Q55 in Q58, then we would have missed
these STUDENTs with 0 ENROLLments.
Full outerjoins as described here, whose result consists of all rows from both table
arguments, with NULLs for those attributes for which no matching row existed
in the other table argument.
Left outer joins, whose result consists of all rows from the first table argument,
with NULLs for those attributes for which no matching row existed in the
second table argument.
– This Q55 is such a leftouterjoin, because. . .
– . . . it follows the foreign key from STUDENT into ENROLL. . .
– . . . and so each NULL is for a STUDENT without any ENROLLments, and they
are all at the “right end” of the result in Figure 22. . .
– . . . whereas there are no ENROLLments without STUDENTs, which would
cause NULLs at the “left end” of the result.
Right outer joins, symmetrically.
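• The way Q55 and Q58 work together can be sketched in Java as below: a left outer join pads unmatched STUDENT rows with NULLs, and Count(EId) then ignores those NULLs. The class OuterJoinCountSketch and the sample rows are assumptions for illustration; a Java null stands for an SQL NULL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OuterJoinCountSketch {
    // Left outer join of STUDENT and ENROLL on SId = StudentId.
    // A STUDENT without ENROLLments yields one row whose ENROLL attributes are null.
    static List<Map<String, Object>> leftOuterJoin(List<Map<String, Object>> student,
                                                   List<Map<String, Object>> enroll) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> s : student) {
            boolean matched = false;
            for (Map<String, Object> e : enroll) {
                if (s.get("SId").equals(e.get("StudentId"))) {
                    Map<String, Object> row = new HashMap<>(s);
                    row.putAll(e);
                    result.add(row);
                    matched = true;
                }
            }
            if (!matched) {
                Map<String, Object> row = new HashMap<>(s);
                row.put("EId", null);
                row.put("StudentId", null);
                result.add(row);
            }
        }
        return result;
    }

    // groupby(rows, {SId}, {Count(EId)}): Count ignores NULL values.
    static Map<Object, Integer> countEnrollments(List<Map<String, Object>> rows) {
        Map<Object, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            int add = (row.get("EId") != null) ? 1 : 0;
            counts.merge(row.get("SId"), add, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: student 2 has not enrolled into anything.
        List<Map<String, Object>> student = List.of(Map.of("SId", 1), Map.of("SId", 2));
        List<Map<String, Object>> enroll = List.of(
            Map.of("EId", 14, "StudentId", 1),
            Map.of("EId", 24, "StudentId", 1));
        System.out.println(countEnrollments(leftOuterJoin(student, enroll)));
    }
}
```

Student 2 still gets a group of its own, and its count is 0 because its only row has a NULL EId.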
Data Definition Language (DDL) for defining the elements of the current data-
base schema.
Data Manipulation Language (DML) for populating the tables of the defined
schema with rows.
Query Language (QL) for retrieving the information stored in these database
table rows in various ways.
• The CREATE command adds into the database schema new elements, like
– tables, as in Figure 5,
– integrity constraints, like assertions in Figures 6–8 and triggers in Figures 9–10,
– views, whose creation consists essentially of giving the defining query Q, and
– indexes on a table and its attributes (in parentheses, separated by commas), like
in Figure 23.
Figure 23: Index creation commands. (Sciore, 2008)
• The SQL DDL user can ALTER these CREATEd tables and VIEWs (by ADDing
and DROPping COLUMNs and integrity constraint ASSERTIONs) later, and
DROP them altogether when they are no longer needed.
• The SQL DDL user can also CREATE and DROP whole SCHEMAs, because
the same RDBMS offers different schemas for different users.
• Let us review the main (but not nearly all!) query features of SQL, and relate
them to our Relational Algebra presented in section 2.6, because here our aim is to
understand how an SQL query gets executed by the RDBMS.
• Its optional DISTINCT qualifier removes duplicate rows from the result – using
the appropriate groupby operator.
RangeVar .AttrName
where
RangeVar is the range variable for some table T declared in the FROM part to
be explained next.
AttrName is the name of some attribute in this table T .
Or it can be ‘*’ instead. This shorthand expands into all the attributes of
table T .
Such a FullName stands for the attribute value r .AttrName for the current row r
of table T .
• Besides these names, the attributes can also contain
Expression AS NewAttrName
forms. These denote in turn extending the result with this new named attribute,
whose value for each row t is obtained by evaluating this Expression.
• A common use for this form is
OldAttrName AS NewAttrName
which essentially renames an old attribute.
The FROM Part (Connolly and Begg, 2010, Chapter 6.3.7) (Sciore, 2008, Chapter 4.3.4)
• The tables in the FROM part are a comma-separated list of
TableName RangeVar
forms. Such a form declares that this RangeVar stands for the current row r of
TableName.
• If none of the other TableNames in this FROM part have any attribute names
in common with this one, then this RangeVar (and ‘.’) can be omitted from
FullNames, because then their AttrNames are enough to determine that they mean
this table.
• This TableName can also be another nested SELECT. . . FROM. . . WHERE. . .
query (in parentheses). Then its RangeVar ranges over the result rows of this nested
query.
• These nested queries permit one possible implementation for the view from sec-
tion 2.4:
If the TableName is a view, then put its defining query (Q) in its place.
• The corresponding Relational Algebra expression is the product of all TableNames
and nested queries in this FROM part.
• It is also possible to write different kinds of joins in this FROM part with the
syntax
first table [FULL or LEFT or RIGHT or NATURAL or CROSS or. . . ] JOIN
second table ON predicate
so Q55 could be written in SQL for instance as
SELECT *
FROM STUDENT s
LEFT JOIN
ENROLL e
ON s.SId = e.StudentId
whose result would then use a row of NULLs for those STUDENT rows s which do
not possess any matching ENROLLment rows e.
The WHERE Part (Sciore, 2008, Chapters 4.3.5 and 4.3.8)
• The optional WHERE part corresponds to the selection operation on this pred-
icate from the big product of the FROM part.
• A particularly common special case is when the predicate is a conjunction (that is,
all ands but no ors) of Terms with the form
• An example of such a query is “the grades Joe received during his graduation year”:
SELECT e.Grade
FROM STUDENT s,ENROLL e,SECTION k
WHERE s.SId=e.StudentId AND e.SectionId=k.SectId
AND k.YearOffered=s.GradYear AND s.SName=’Joe’
project(select(product(product(STUDENT
,ENROLL)
,SECTION)
,s.SId=e.StudentId
AND e.SectionId=k.SectId
AND k.YearOffered=s.GradYear
AND s.SName=’Joe’)
,{e.Grade})
but the RDBMS query optimizer can improve it further into Figure 24.
which is true if the current value of FullName is in the result of this nested Query.
That is,
Figure 24: The Relational Algebra tree for Joe’s final year grades. (Sciore, 2008)
The ORDER BY Part (Sciore, 2008, Chapter 4.3.10)
• The optional ORDER BY part specifies a sorting operation as the very last step
of the whole query.
whose
• SQL can also insert many new rows by replacing the VALUES part with a database
Query.
whose
predicate chooses the rows to delete, based on their attribute values, as in a Query.
UPDATE TableName
SET AssignmentList
WHERE predicate
whose
forms. Such a form means that r .AttrName is updated into the value of its
Expression.
architecture.
• This Client-Server architecture is also used on a single computer, so that the clients
are other processes running in the same computer as the RDBMS process.
front end of a database application program (which handles the user interface and
the part of the “business logic” of the organization which cannot be represented
with database integrity constraints) in the client from its
back end in the server which provides the common database part for all such ap-
plications.
– The database is divided among more than one server, which serve the clients
together.
– They are very important, especially on the web.
– However, this course concentrates only on the “classical” one-server RDBMSs.
• Here are the general steps for getting the SimpleDB RDBMS up and running on
your computer.
• How each step is carried out in a particular OS is left as an exercise to the reader. . .
Move the unzipped simpledb subdirectory into the serverdirectory where you want
the server-side software to be.
④ Ensure that the current working directory ‘.’ is in CLASSPATH too (it may already
be).
• The SimpleDB server-side software should now be installed. The server process can
be started as follows:
⑤ Start the
rmiregistry
• This program is part of Java SDK, which you should already have.
• It is the Remote Method Invocation (RMI) registry – the “phone directory”
for Java methods which can be called from other processes, even across the
network.
• The SimpleDB server registers its public methods there, so that its client pro-
cesses can invoke them to ask the server to perform database operations.
java simpledb.server.Startup databasename
command.
where the server first recovers databasename into a consistent state, because it
may have ended abnormally.
(For instance, its previous server process may have been killed.)
• Otherwise databasename will be created as a new empty database. If the server
starts OK, then you will see the message
creating new database
new transaction: 1
transaction 1 committed
database server ready
⑦ You can try it out for instance with the example client programs in the unzipped
studentClient/simpledb/ subdirectory:
– The SELECT part of a query has just an attribute name list – no ‘*’, AS
nor DISTINCT.
– Its FROM part is just a table name list – no RangeVariables, JOINs nor
nested queries (but views are supported).
Hence attribute names must determine tables.
– Its WHERE part is just a conjunction of equality comparisons ‘=’ of attribute
names and constants – no other comparisons nor expressions.
– The only 2 supported attribute types are
INT for Java 32-bit integers, and
VARCHAR(N ) for ASCII strings of at most N characters
without NULLs.
– There is no UNION, GROUP nor ORDER BY.
– There are no keys or integrity constraints.
– An INSERT takes only VALUES – not queries.
– An UPDATE has only one assignment – not many.
– An INDEX can have only one attribute – not many. Moreover, index support
must be enabled separately.
– Entities CREATEd in the current schema cannot be DROPped.
• There are now ODBC binding libraries for many programming languages. They
permit application programs written in that language to communicate with any
ODBC-compliant database server.
• The Java binding is called JDBC – which does not mean “Java DBC” according to
Sun’s legal position. . .
• The SimpleDB supports enough of the JDBC specification to allow writing simple
clients – but not nearly all the features of the whole specification.
Figure 25: A small SQL language dialect. (Sciore, 2008)
Figure 26: The basic JDBC Application Programming Interface. (Sciore, 2008)
① The client opens a connection to the server.
where
theRightDriver () is supplied by the RDBMS JDBC binding, and imported
into the client code.
For SimpleDB, it is simpledb.remote.SimpleDriver.
system is the RDBMS used.
For SimpleDB, it is simpledb.
server is the machine running the rmiregistry and the RDBMS processes to
which this client wants to connect.
If this server is in the same machine as this client, then this is localhost.
/path leads to the databasename to use within the server .
For SimpleDB it is not needed, because it stores its databasename subdirectories
directly in its users’ home directories.
properties is an RDBMS-specific string giving extra options for the connection.
For instance, if the RDBMS has mandatory access control, then this string
can contain the required username and password.
SimpleDB does not support any properties so it is the null pointer.
• The vendor-independent parts of JDBC are imported from java.sql.*.
• The method calls of this created connection
❶ happen remotely via the rmiregistry process running on the server . . .
❷ which in turn forwards them to the RDBMS process.
• Unfortunately this old way to form the connection is not very portable, because
the client contains theRightDriver which is vendor-dependent.
• Java supports also new ways, where the server can send theRightDriver to its
clients based on the system in the url (Sciore, 2008, Chapter 8.2.1).
+ Now the client is vendor-independent, but. . .
− the server-side setup gets more complicated, and so we continue using the
old way here instead.
② The client sends an SQL statement to the server.
where
statement is an SQL SELECT. . . FROM. . . WHERE. . . statement as text.
rs gives the results of the query as a result set to be processed in the next
phase ③.
• Other SQL statements can be issued with
int howMany = stmt.executeUpdate(qry);
whose return value tells howMany records were affected instead of a result set.
• The RDBMS server
❶ first compiles this statement into Relational Algebra and optimizes it into
a form. . .
❷ which it then executes.
• A statement can also be prepared beforehand:
– The compilation step ❶ happens only once.
– The same compiled statement can be executed in step ❷ many times with
different parameter values each time.
This is useful, because we shall see during this course that step ❶ is not trivial.
• These parameter positions are marked with question marks ‘?’ within the
statement to prepare, while the value for the nth ‘?’ can be set with the
method
setType(int n,Type value)
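• A prepared statement might be used as in the following sketch, which relies only on the standard java.sql interfaces. The class PreparedFindMajors, its method and the query text are assumptions for illustration, and whether a particular RDBMS (such as SimpleDB) supports prepared statements has to be checked separately.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class PreparedFindMajors {
    // One '?' parameter position: the department name.
    static final String QUERY =
        "select sname, gradyear from student, dept "
      + "where did = majorid and dname = ?";

    // Runs the same prepared query once for each given department name
    // and collects the student names found.
    static List<String> findMajors(Connection conn, List<String> majors)
            throws SQLException {
        List<String> names = new ArrayList<>();
        // The statement is prepared once here (the server may compile it now) ...
        try (PreparedStatement pstmt = conn.prepareStatement(QUERY)) {
            for (String major : majors) {
                // ... and executed many times with different parameter values.
                pstmt.setString(1, major);   // value for the 1st '?'
                try (ResultSet rs = pstmt.executeQuery()) {
                    while (rs.next()) {
                        names.add(rs.getString("sname"));
                    }
                }
            }
        }
        return names;
    }
}
```

Passing the department name as a parameter also avoids building the SQL text by string concatenation.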
• The result set of a query consists of the corresponding rows. One of them is
the current row – a reading position within the result set.
– Initially this current row is just before the first row of the result set – so
it is not valid yet.
– Method next moves this current row to the next row of the result set. It
returns false if it moved past the last row of the result set – so it is no
longer valid.
– If the current row is valid, then the value for its named attribute can be
extracted with the method
Type getType(String name)
• Besides these basic “read forward” result sets, JDBC also supports
scrollable result sets, whose current row can move also backwards, and
updatable result sets, which permit updating the attribute values of the
current row
(Sciore, 2008, Chapter 8.2.5) which are especially useful in clients with graph-
ical user interfaces (GUIs).
• Such a result set is an example of a lazy data structure:
– it does not exist as a whole, but. . .
– its elements are constructed one by one, as the client asks for the next
one.
• Once the
while(rs.next())
loop processing the result set rs finishes, the client should call
rs.close()
as soon as possible, because the RDBMS maintains each open result set, and
they reserve its limited resources.
• The symbol ‘&’ denotes a long source code line which had to be divided into many
lines on the pages.
import java.sql.*;
import simpledb.remote.SimpleDriver;

public class FindMajors {
    public static void main(String[] args) {
        String major = args[0];
        System.out.println("Here are the " + major + " majors");
        System.out.println("Name\tGradYear");
        Connection conn = null;
        try {
            // Step 1: connect to database server
            Driver d = new SimpleDriver();
            conn = d.connect("jdbc:simpledb://localhost", null);
            // Step 2: execute the query
            Statement stmt = conn.createStatement();
            String qry = "select sname, gradyear "
                       + "from student, dept "
                       + "where did = majorid "
                       + "and dname = '" + major + "'";
            ResultSet rs = stmt.executeQuery(qry);
            // Step 3: loop through the result set
            while (rs.next()) {
                String sname = rs.getString("sname");
                int gradyear = rs.getInt("gradyear");
                System.out.println(sname + "\t" + gradyear);
            }
            rs.close();
        }
        catch (Exception e) {
            e.printStackTrace();
        }
        finally {
            // Step 4: close the connection
            try {
                if (conn != null)
                    conn.close();
            }
            catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }
}
Figure 27: Preparing an SQL statement and using it. (Sciore, 2008)
The client may choose to retry its operation later, especially if the reason for its
failure was ❹.
With AutoCommit still true              After setting it to false via the API in Figure 28
The RDBMS executes each SQL             The RDBMS continues the same transaction when the
statement as its own transaction.       client sends its next SQL statement into the
                                        connection.
The RDBMS commits (or aborts) them      The client must commit or abort this transaction
internally and automatically – this     by hand at the end.
is what “autocommit” means.
Table 2: With and without autocommit mode.
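• The right-hand column of Table 2 might look as follows in client code. This is a minimal sketch using only the standard java.sql methods setAutoCommit, commit and rollback; the class ManualCommitSketch and its method are hypothetical names chosen for this illustration.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ManualCommitSketch {
    // Runs all given SQL update statements inside one transaction.
    // Returns true if it committed, false if it rolled back.
    static boolean runAsOneTransaction(Connection conn, String... updates) {
        try {
            conn.setAutoCommit(false);       // leave autocommit mode
            try (Statement stmt = conn.createStatement()) {
                for (String sql : updates) {
                    stmt.executeUpdate(sql); // the same transaction continues
                }
            }
            conn.commit();                   // the client commits by hand
            return true;
        } catch (SQLException e) {
            try {
                conn.rollback();             // the client aborts by hand
            } catch (SQLException e2) {
                e2.printStackTrace();        // the rollback itself failed
            }
            return false;
        }
    }
}
```

Either all of the updates take effect or none of them does, which is what the atomicity requirement asks for.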
• This takes place in the finally part, so it is executed whether the try part executed
correctly or caused an exception to catch.
• This finally part closes the connection if phase ① managed to open it. It may
raise an exception too, and is therefore in its own try block.
• An RDBMS operates in its default transaction isolation level, unless the client sets
this level explicitly for its connection. For instance,
conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
What should the client do then? Neither committing nor aborting its transaction
is possible!
• Then the database may have become corrupted because it may not be possible to
recover it to the last consistent state before this transaction started. Hence the
client should somehow alert the DBA about this danger if possible.
Figure 29: JPA annotations combining the STUDENT table and class. (Sciore, 2008)
(Continues in Figure 30.)
Figure 30: Rest of Figure 29. (Sciore, 2008)
• Moreover, there are other programming philosophies than object-orientation, such
as functional and logic programming.
– They are based on the concept of “value” instead of “(object) identity” and so
the relational model is more natural for them.
– However, despite their long history they are still niche programming languages.
• Although SimpleDB is a restricted RDBMS written and made available for teaching
purposes, it does contain the most important components of a full RDBMS. These
components are shown in Figure 31.
• We can trace the execution of an SQL query in the SimpleDB RDBMS server process
down these components:
❶ The Remote manager handles the communication with the client. The server
process allocates a separate thread for each connection via the RMI mechanism.
❷ When a client sends an SQL statement to its open connection, this Remote
manager passes it to the Planner component.
– This component plans how the statement will be executed.
– This plan is a Relational Algebra expression which it sends to the Query
component.
– It invokes the Parser component, which turns the statement into a syntax
tree containing the tables, attributes, constants,. . . mentioned in it.
– This Parser component in turn invokes the Metadata manager, which
keeps track of information about the tables, attributes, indexes,. . . CRE-
ATEd in the database to check that the things mentioned in the syntax
tree do exist and have the right type.
❸ The Query component turns the plan it received from the Planner component
into a scan and executes it.
– It forms this scan by choosing an implementation for each operation in the
expression. For instance, if the expression contains a sort operation, then
this Query component chooses a particular sorting algorithm to use.
– The RDBMS can choose from several algorithms for the same operation,
because different algorithms suit different situations, improving perfor-
mance.
– This component uses the Metadata manager too, because its information
helps in making these choices.
Figure 31: The Components of an RDBMS Engine. (Sciore, 2008, page 310)
– This scan is executed using the same “current row” approach as the client
uses for processing the result in its phase ③ in section 3.2.
❹ Each of these rows processed by the Query component is stored on disk as a
record handled by the Record manager.
– These records are stored in disk blocks held in files managed by the File
manager.
– The Buffer manager is in turn responsible for those disk blocks which have
been read into RAM for accessing the records in them.
❺ Each (scan for a) statement is executed as (if in autocommit mode) or within
(otherwise) a Transaction. They are managed by a manager responsible for
concurrency control and
recovery using a designated Log file managed by its own component.
• This lowest level of an RDBMS is the component which handles interaction with
the underlying disk drive(s).
raw disk(s) so that the database resides on dedicated drives (or partitions) with
nothing else.
+ This is as fast as possible, but. . .
− such disks need dedicated special support from the DBA.
This is used only for very high performance requirements.
OS file(s) so that the database is in normal files in normal file systems.
+ They need only the same support as file systems in general, but. . .
− the OS layer overhead impairs performance.
This is currently the most common choice.
single file architecture, where the whole database is stored in a single (possibly
very) big file, like for instance the .mdb files of Microsoft Access.
multifile architecture, where each database is in a separate subdirectory containing
separate files for its tables, indexes,. . . like for instance Oracle and SimpleDB
do.
– It consults the OS only for opening and closing its files, and extending them with
more blocks, but. . .
– manages these blocks, their buffering, and their allocation by itself.
The reason is not only better performance but even more importantly ensuring
durability:
The RDBMS must know precisely which of its data is
• In order to guarantee durability, the RDBMS needs some memory whose contents
do not disappear when the computer crashes.
• A disk drive consists of sectors which the OS divides further into blocks.
• Disk striping builds such a big disk out of many smaller disks. For performance
reasons, it spreads the sectors of the big disk evenly across the sectors of the smaller
disks, as in Figure 32.
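• One simple way to spread the sectors evenly is round-robin placement, sketched below in Java. That the striping of Figure 32 uses exactly this layout is an assumption; the class StripingSketch and its method are hypothetical names for illustration.

```java
public class StripingSketch {
    // Maps sector number s of the big (virtual) disk onto numDisks small disks
    // in round-robin order.
    // Returns { index of the small disk, sector number on that small disk }.
    static int[] locate(int s, int numDisks) {
        return new int[] { s % numDisks, s / numDisks };
    }

    public static void main(String[] args) {
        // Two-disk striping: even sectors go to disk 0, odd sectors to disk 1.
        for (int s = 0; s < 6; s++) {
            int[] where = locate(s, 2);
            System.out.println("big sector " + s + " -> disk " + where[0]
                               + ", sector " + where[1]);
        }
    }
}
```

Consecutive sectors of the big disk then land on different small disks, so they can be read or written in parallel.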
– A RAID unit adds extra error-correcting information into a striped disk unit.
– If one of the smaller disks breaks, the RAID unit can inform the DBA about
which of them broke.
– The DBA can then change the broken disk and reconstruct its contents from
the other disks and this extra information.
Figure 32: Two-disk striping. (Sciore, 2008)
– The only problem is if another disk breaks during this reconstruction. . . but
this is unlikely.
– Moreover, adding more error-correcting information makes it possible to recon-
struct more than one disk at a time.
• There are now 7 levels of RAID, depending on what extra information the unit holds
and where.
• The simplest is RAID-0, which is plain striping without any extra error-correcting
information. Therefore it does not offer any protection against failures.
• The next level is RAID-1, where the extra error-correcting information is a mirror
of the data disk into another identical disk, as in Figure 33.
• The DBA can reconstruct the contents of the data disk simply by copying this
mirror disk into the replacement disk.
Figure 33: Mirroring. (Sciore, 2008)
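• RAID-4 uses parity as the extra error-correcting information, as in Figure 34: one
extra parity disk holds the parity of the corresponding sectors of the N data disks.
– This is more compact than mirroring, because there is only one extra block
per N data blocks, whereas mirroring had one extra block per data block.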
– In fact, mirroring could be viewed as parity with N = 1.
• However, the dedicated extra (N + 1)st parity disk becomes a bottleneck for the
whole RAID unit, because whenever a data disk sector changes, the corresponding
sector of the parity disk must be updated too.
• RAID-5 solves this bottleneck by distributing these parity sectors evenly among the
data sectors.
– Every (N + 1)st sector of a small disk is a parity sector; its other sectors are
data.
– A parity sector s on a small disk d contains the parity of the corresponding
sectors s of all the small disks other than d itself, that is, of the disks
1, 2, 3, ..., d − 1, d + 1, d + 2, d + 3, ..., N + 1.
– Then the extra work of updating parity sectors is divided evenly among all the
disks, and so no one disk is a bottleneck any longer.
– The DBA can still reconstruct the contents of any one broken disk from the
other still functioning N disks.
– RAID-2 used bit instead of sector striping and an error-correcting code instead
of parity, but it was hard to implement and performed poorly, and so is no
longer used.
Figure 34: Parity. (Sciore, 2008)
– RAID-3 is like RAID-4 but with the less efficient byte instead of sector striping.
– RAID-6 is like RAID-5 but with two kinds of parity information, so it tolerates
two disk failures at the same time.
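• Parity and reconstruction can be illustrated with a small sketch. It assumes that "parity" means the bytewise XOR of the corresponding sectors, and that RAID-5 rotates the parity sector so that sector s of disk (s mod (N + 1)) is the parity one. Both the rotation rule and all names here are assumptions for illustration only.

```java
import java.util.Arrays;

/** Sketch: XOR parity over equally long sectors, and rebuilding one lost sector. */
final class ParitySketch {
    /** Bytewise XOR of the given sectors. */
    static byte[] parity(byte[]... sectors) {
        byte[] p = new byte[sectors[0].length];
        for (byte[] s : sectors) {
            for (int i = 0; i < p.length; i++) {
                p[i] ^= s[i];
            }
        }
        return p;
    }

    /** Assumed RAID-5 rotation: which of the `disks` small disks holds parity for sector s. */
    static int parityDiskOf(long s, int disks) {
        return (int) (s % disks);
    }

    public static void main(String[] args) {
        byte[] d1 = {1, 2, 3}, d2 = {4, 5, 6}, d3 = {7, 8, 9};
        byte[] par = parity(d1, d2, d3);
        // Suppose the disk holding d2 breaks: XOR the survivors with the parity sector.
        byte[] rebuilt = parity(d1, d3, par);
        System.out.println(Arrays.equals(rebuilt, d2)); // prints true
    }
}
```

The same XOR rebuilds a lost parity sector too, which is why any one broken disk can be reconstructed from the other N.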
• This is one reason why the RDBMS executes queries concurrently:
If one query running in one thread must stop and wait for disk I/O, other queries
running in other threads which already have the data they need in RAM may
continue.
• Each disk drive / file system / OS has its own block size constant B, so that the block
k = 0, 1, 2, ... of a file consists of the bytes at positions
k · B, k · B + 1, k · B + 2, ..., (k + 1) · B − 1
within that file, and reading/writing the value of any byte within that area copies
the whole block between disk and RAM.
• One way in which the RDBMS can meet requirement 10 is to ensure that if a block must
be read from the disk, then the information in it is used as well as possible.
• The block size constant is usually between 512 bytes and 16 kilobytes; 4 kilobytes is a
typical value.
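• The relationship between byte positions and blocks is plain integer arithmetic. The following sketch assumes a block size of 4 kilobytes and blocks stored consecutively within the file; the names are hypothetical.

```java
/** Sketch: which block of a file a byte position falls into, and the bytes of block k. */
final class BlockArithmetic {
    static final int BLOCK_SIZE = 4096; // assumed block size in bytes

    /** Number of the block that contains the given byte position of the file. */
    static long blockOf(long bytePosition) {
        return bytePosition / BLOCK_SIZE;
    }

    /** Position of the first byte of block k. */
    static long firstByte(long k) {
        return k * BLOCK_SIZE;
    }

    /** Position of the last byte of block k. */
    static long lastByte(long k) {
        return (k + 1) * BLOCK_SIZE - 1;
    }

    public static void main(String[] args) {
        long pos = 10_000;
        long k = blockOf(pos);
        System.out.println("byte " + pos + " is in block " + k
                + " = bytes " + firstByte(k) + ".." + lastByte(k));
    }
}
```

Touching byte 10000 therefore transfers all the bytes of block 2, which is why it pays to place data that is used together into the same block.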
• On the one hand, the application programmer does not have to be aware of this
buffering because the OS handles it.
But (s)he may want to be, for performance reasons.
• On the other hand, the RDBMS wants to be aware of it, and bypasses this OS
buffering altogether with its own Buffer manager, for both performance and
durability.
• That Buffer manager will use the services offered by this File manager for the actual
disk I/O operations.
• The OS converts it internally into a physical block number, which identifies a par-
ticular block on a particular sector of the disk drive.
package simpledb.file;

/**
 * A reference to a disk block.
 * A Block object consists of a filename and a block number.
 * It does not hold the contents of the block;
 * instead, that is the job of a {@link Page} object.
 * @author Edward Sciore
 */
public class Block {
   private String filename;
   private int blknum;

   /**
    * Constructs a block reference
    * for the specified filename and block number.
    * @param filename the name of the file
    * @param blknum the block number
    */
   public Block(String filename, int blknum) {
      this.filename = filename;
      this.blknum = blknum;
   }

   /**
    * Returns the name of the file where the block lives.
    * @return the filename
    */
   public String fileName() {
      return filename;
   }

   /**
    * Returns the location of the block within the file.
    * @return the block number
    */
   public int number() {
      return blknum;
   }

   public boolean equals(Object obj) {
      Block blk = (Block) obj;
      return filename.equals(blk.filename) && blknum == blk.blknum;
   }

   public String toString() {
      return "[file " + filename + ", block " + blknum + "]";
   }

   public int hashCode() {
      return toString().hashCode();
   }
}
• This library class also provides a reading/writing position within the chunk.
– This means that Java uses one of its OS I/O buffers as the chunk.
– This is a good idea in an RDBMS (but not in most other programming
situations!) because it will manage its own Buffers.
– In this way, the RDBMS can “recycle” the same memory which the OS would
have used for the same purpose.
• All these methods (like many others) are synchronized (Sestoft, 2005, Chap-
ter 16.2):
– That is, only one thread can execute the methods of a Page object at the same
time.
– Because the RDBMS process handles each connection with a client in its own
thread, this ensures that two clients cannot manipulate the same Page at the
same time – one must wait until the other is finished instead.
– This is important for the get. . . and set. . . methods, which
1. first move the position to where they want it to be, and
2. then read or write the data starting at that position.
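• The two-step access pattern can be tried out on a bare java.nio.ByteBuffer. The sketch below is only an illustration with hypothetical names. It shows that a relative putInt/getInt depends on the shared position, so a second thread moving the position between the two steps would make the first thread read or write at the wrong place. For comparison, it also uses the absolute getInt(index) variant, which leaves the position untouched.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch: relative (position-based) versus absolute access to a ByteBuffer. */
final class PositionSketch {
    /** Returns {value read back, position after the write, absolute read, final position}. */
    static int[] demo() {
        ByteBuffer contents = ByteBuffer.allocateDirect(400);

        // Step 1: move the shared position; step 2: write there.
        contents.position(20);
        contents.putInt(7);
        int afterWrite = contents.position(); // the write advanced the position by 4 bytes

        // The same two steps for reading.
        contents.position(20);
        int value = contents.getInt();

        // Absolute access names the offset directly and does not move the position.
        contents.position(0);
        int same = contents.getInt(20);

        return new int[] {value, afterWrite, same, contents.position()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [7, 24, 7, 0]
    }
}
```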
package simpledb.file;

import simpledb.server.SimpleDB;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/**
 * The contents of a disk block in memory.
 * A page is treated as an array of BLOCK_SIZE bytes.
 * There are methods to get/set values into this array,
 * and to read/write the contents of this array to a disk block.
 *
 * For an example of how to use Page and
 * {@link Block} objects,
 * consider the following code fragment.
 * The first portion increments the integer at offset 792 of block 6 of file junk.
 * The second portion stores the string "hello" at offset 20 of a page,
 * and then appends it to a new block of the file.
 * It then reads that block into another page
 * and extracts the value "hello" into variable s.
 * <pre>
 * Page p1 = new Page();
 * Block blk = new Block("junk", 6);
 * p1.read(blk);
 * int n = p1.getInt(792);
 * p1.setInt(792, n+1);
 * p1.write(blk);
 *
 * Page p2 = new Page();
 * p2.setString(20, "hello");
 * blk = p2.append("junk");
 * Page p3 = new Page();
 * p3.read(blk);
 * String s = p3.getString(20);
 * </pre>
 * @author Edward Sciore
 */
public class Page {
   /**
    * The number of bytes in a block.
    * This value is set unreasonably low, so that it is easier
    * to create and test databases having a lot of blocks.
    * A more realistic value would be 4K.
    */
   public static final int BLOCK_SIZE = 400;

   /**
    * The size of an integer in bytes.
    * This value is almost certainly 4, but it is
    * a good idea to encode this value as a constant.
    */
   public static final int INT_SIZE = Integer.SIZE / Byte.SIZE;

   /**
    * The maximum size, in bytes, of a string of length n.
    * A string is represented as the encoding of its characters,
    * preceded by an integer denoting the number of bytes in this encoding.
    * If the JVM uses the US-ASCII encoding, then each char
    * is stored in one byte, so a string of n characters
    * has a size of 4+n bytes.
    * @param n the size of the string
    * @return the maximum number of bytes required to store a string of size n
    */
   public static final int STR_SIZE(int n) {
      float bytesPerChar = Charset.defaultCharset().newEncoder().maxBytesPerChar();
      return INT_SIZE + (n * (int) bytesPerChar);
   }

   private ByteBuffer contents = ByteBuffer.allocateDirect(BLOCK_SIZE);
   private FileMgr filemgr = SimpleDB.fileMgr();

   /**
    * Creates a new page. Although the constructor takes no arguments,
    * it depends on a {@link FileMgr} object that it gets from the
    * method {@link simpledb.server.SimpleDB#fileMgr()}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until either
    * {@link simpledb.server.SimpleDB#init(String)} or
    * {@link simpledb.server.SimpleDB#initFileMgr(String)} or
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * {@link simpledb.server.SimpleDB#initFileLogAndBufferMgr(String)}
    * is called first.
    */
   public Page() {}

   /**
    * Populates the page with the contents of the specified disk block.
    * @param blk a reference to a disk block
    */
   public synchronized void read(Block blk) {
      filemgr.read(blk, contents);
   }

   /**
    * Writes the contents of the page to the specified disk block.
    * @param blk a reference to a disk block
    */
   public synchronized void write(Block blk) {
      filemgr.write(blk, contents);
   }

   /**
    * Appends the contents of the page to the specified file.
    * @param filename the name of the file
    * @return the reference to the newly-created disk block
    */
   public synchronized Block append(String filename) {
      return filemgr.append(filename, contents);
   }

   /**
    * Returns the integer value at a specified offset of the page.
    * If an integer was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset within the page
    * @return the integer value at that offset
    */
   public synchronized int getInt(int offset) {
      contents.position(offset);
      return contents.getInt();
   }

   /**
    * Writes an integer to the specified offset on the page.
    * @param offset the byte offset within the page
    * @param val the integer to be written to the page
    */
   public synchronized void setInt(int offset, int val) {
      contents.position(offset);
      contents.putInt(val);
   }

   /**
    * Returns the string value at the specified offset of the page.
    * If a string was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset within the page
    * @return the string value at that offset
    */
   public synchronized String getString(int offset) {
      contents.position(offset);
      int len = contents.getInt();
      byte[] byteval = new byte[len];
      contents.get(byteval);
      return new String(byteval);
   }

   /**
    * Writes a string to the specified offset on the page.
    * @param offset the byte offset within the page
    * @param val the string to be written to the page
    */
   public synchronized void setString(int offset, String val) {
      contents.position(offset);
      byte[] byteval = val.getBytes();
      contents.putInt(byteval.length);
      contents.put(byteval);
   }
}
• The SimpleDB process has just one global File Manager object. It handles all disk
I/O operations:
– read the contents of a Block from disk into a ByteBuffer (for instance, into a Page
object)
– write a ByteBuffer into an already existing disk Block
– append a new Block to the end of a file
– get the size of a file as the number of disk blocks in it
• It also opens all requested files and keeps them in openFiles to avoid reopening
them.
• It opens each file in the "rws" mode, that is, for
– reading and
– writing and
– synchronous operation, so that when write is executed without errors, then the
operation has really modified this block of this file on disk. This is where the
RDBMS takes over buffering from the OS.
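• Block-wise file I/O with the standard Java library can be sketched as follows. This is not the SimpleDB File Manager itself but a minimal stand-alone illustration with hypothetical names: it opens a temporary file in the "rws" mode, writes one block at the byte position blknum · BLOCK_SIZE, reads it back, and reports the file size in blocks.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Sketch: positional block reads and writes through a FileChannel. */
final class BlockFileSketch {
    static final int BLOCK_SIZE = 400; // same small value as in the listings

    /** Writes value into block blknum of a fresh file; returns {value read back, size in blocks}. */
    static int[] roundTrip(int blknum, int value) {
        try {
            File f = File.createTempFile("blockfile", ".tbl");
            f.deleteOnExit();
            // "rws": read, write, and synchronous updates of content and metadata.
            try (RandomAccessFile raf = new RandomAccessFile(f, "rws");
                 FileChannel fc = raf.getChannel()) {
                ByteBuffer out = ByteBuffer.allocateDirect(BLOCK_SIZE);
                out.putInt(0, value);   // absolute put: the position stays at 0
                out.rewind();
                fc.write(out, (long) blknum * BLOCK_SIZE);

                ByteBuffer in = ByteBuffer.allocateDirect(BLOCK_SIZE);
                in.clear();
                fc.read(in, (long) blknum * BLOCK_SIZE);
                return new int[] {in.getInt(0), (int) (fc.size() / BLOCK_SIZE)};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = roundTrip(3, 42);
        System.out.println("read back " + r[0] + ", file has " + r[1] + " blocks");
    }
}
```

Writing block 3 of an empty file extends the file to four blocks, which is the same effect that appending relies on.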
package simpledb.file;

import static simpledb.file.Page.BLOCK_SIZE;
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.*;

/**
 * The SimpleDB file manager.
 * The database system stores its data as files within a specified directory.
 * The file manager provides methods for reading the contents of
 * a file block to a Java byte buffer,
 * writing the contents of a byte buffer to a file block,
 * and appending the contents of a byte buffer to the end of a file.
 * These methods are called exclusively by the class {@link simpledb.file.Page Page},
 * and are thus package-private.
 * The class also contains two public methods:
 * Method {@link #isNew() isNew} is called during system initialization by {@link simpledb.server.SimpleDB#init}.
 * Method {@link #size(String) size} is called by the log manager and transaction manager to
 * determine the end of the file.
 * @author Edward Sciore
 */
public class FileMgr {
   private File dbDirectory;
   private boolean isNew;
   private Map<String,FileChannel> openFiles = new HashMap<String,FileChannel>();

   /**
    * Creates a file manager for the specified database.
    * The database will be stored in a folder of that name
    * in the user's home directory.
    * If the folder does not exist, then a folder containing
    * an empty database is created automatically.
    * Files for all temporary tables (i.e. tables beginning with "temp") are deleted.
    * @param dbname the name of the directory that holds the database
    */
   public FileMgr(String dbname) {
      String homedir = System.getProperty("user.home");
      dbDirectory = new File(homedir, dbname);
      isNew = !dbDirectory.exists();

      // create the directory if the database is new
      if (isNew && !dbDirectory.mkdir())
         throw new RuntimeException("cannot create " + dbname);

      // remove any leftover temporary tables
      for (String filename : dbDirectory.list())
         if (filename.startsWith("temp"))
            new File(dbDirectory, filename).delete();
   }

   /**
    * Reads the contents of a disk block into a bytebuffer.
    * @param blk a reference to a disk block
    * @param bb the bytebuffer
    */
   synchronized void read(Block blk, ByteBuffer bb) {
      try {
         bb.clear();
         FileChannel fc = getFile(blk.fileName());
         fc.read(bb, blk.number() * BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot read block " + blk);
      }
   }

   /**
    * Writes the contents of a bytebuffer into a disk block.
    * @param blk a reference to a disk block
    * @param bb the bytebuffer
    */
   synchronized void write(Block blk, ByteBuffer bb) {
      try {
         bb.rewind();
         FileChannel fc = getFile(blk.fileName());
         fc.write(bb, blk.number() * BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot write block " + blk);
      }
   }

   /**
    * Appends the contents of a bytebuffer to the end
    * of the specified file.
    * @param filename the name of the file
    * @param bb the bytebuffer
    * @return a reference to the newly-created block.
    */
   synchronized Block append(String filename, ByteBuffer bb) {
      int newblknum = size(filename);
      Block blk = new Block(filename, newblknum);
      write(blk, bb);
      return blk;
   }

   /**
    * Returns the number of blocks in the specified file.
    * @param filename the name of the file
    * @return the number of blocks in the file
    */
   public synchronized int size(String filename) {
      try {
         FileChannel fc = getFile(filename);
         return (int) (fc.size() / BLOCK_SIZE);
      }
      catch (IOException e) {
         throw new RuntimeException("cannot access " + filename);
      }
   }

   /**
    * Returns a boolean indicating whether the file manager
    * had to create a new database directory.
    * @return true if the database is new
    */
   public boolean isNew() {
      return isNew;
   }

   /**
    * Returns the file channel for the specified filename.
    * The file channel is stored in a map keyed on the filename.
    * If the file is not open, then it is opened and the file channel
    * is added to the map.
    * @param filename the specified filename
    * @return the file channel associated with the open file.
    * @throws IOException
    */
   private FileChannel getFile(String filename) throws IOException {
      FileChannel fc = openFiles.get(filename);
      if (fc == null) {
         File dbTable = new File(dbDirectory, filename);
         RandomAccessFile f = new RandomAccessFile(dbTable, "rws");
         fc = f.getChannel();
         openFiles.put(filename, fc);
      }
      return fc;
   }
}
Data files (and their supporting files like indexes, metadata,. . . ) – the RDBMS
has only partial control over their access patterns, because they depend on the
users’ queries too
Log file – which the RDBMS controls fully. It is. . .
– an extremely important special file, because it is central to
implementing database recovery after a crash!
– a “diary” (or “ship’s log” or “journal”) of all the operations which the
RDBMS has performed recently.
• You have (most likely. . . ) already encountered these log files implicitly in your daily
work:
– For instance, when Microsoft Word crashes, and is restarted, then it may ask
“Do you want to recover your file?”
– It can do this, because it has kept a log of all operations since the last “Save”
operation, and so it can redo them.
• Because this Log file is so important, and the RDBMS processes it differently from
its other files, it has its own manager.
Figure 35: The SimpleDB log management algorithm. (Sciore, 2008)
– These log records are written at the end of the log in the order in which the
RDBMS executes their operations – that is, “forward in time”.
– However, recovery needs to read the Log file not only forward but also backwards
in time – also from the most recently written log record at the end towards the
older log records at the beginning.
– Hence the Log file is a linked list of log records, where each record contains
also a backwards pointer to the previous log record.
• The RDBMS allocates a specific Page which represents the last block of the Log file
(step 1 in Figure 35).
– All the previous blocks of the Log file have already been written onto the disk.
– This last block may or may not have been written onto the disk yet.
• The Log manager offers operations to
– append a new log record at the end of the Log file (step 2 in Figure 35) and give it
an LSN
– flush a given LSN (step 3 in Figure 35), that is, make sure that it is really written
onto the disk, and not just on the last log Page in RAM
which write the last log Page onto the disk if necessary.
• Since only the last log Page is still in RAM, flushing an LSN implies flushing all
the log records before it as well.
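• The algorithm in Figure 35 is optimal in the sense that it writes the last log Page
onto the disk only when it must: when an LSN in that Page is flushed, or when the
Page is too full for the next log record.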
• However, the algorithm in Figure 35 may write the same last log Page many times:
• The algorithm in Figure 35 can be further improved to write each last log Page
(almost) just once with some concurrent programming:
– When a thread flushes an LSN in the last log Page, then it goes to sleep
waiting for some other thread to write the Page. After all, it is enough that
the LSN is on disk by the time this flushing thread continues.
– When another thread tries to append a new log record into the Log file but
finds its last log Page full, it writes the Page onto the disk and wakes up all
the other threads which have gone to sleep waiting for it to be written.
– However, there is a problem:
∗ What if all threads go to sleep waiting for some other thread to write the
last log Page onto the disk?
∗ A general solution to such problems is to have a separate thread which
the RDBMS executes only if it has nothing else to do. This thread then
performs such “housekeeping” tasks as saving the log Page if no other
thread has done it.
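• The sleeping and waking up can be sketched with Java monitors. This is only an illustration under stated assumptions, not SimpleDB code: all names are hypothetical, the disk write is merely counted, and instead of a separate housekeeping thread the sketch lets a waiting thread write the Page itself once a time limit has passed, which avoids the "everybody sleeps" problem in a simpler way.

```java
/** Sketch: threads flushing an LSN sleep until some thread has written the log page. */
final class GroupFlushSketch {
    private int flushedLsn = -1; // highest LSN known to be on disk
    private int diskWrites = 0;  // counts simulated writes of the last log page

    /** Called by whichever thread writes the last log page, e.g. an appender finding it full. */
    synchronized void pageWritten(int upToLsn) {
        diskWrites++;
        flushedLsn = Math.max(flushedLsn, upToLsn);
        notifyAll(); // wake up every thread sleeping in awaitFlushed
    }

    /** Sleeps until lsn is on disk; after maxWaitMs the caller writes the page itself. */
    synchronized void awaitFlushed(int lsn, long maxWaitMs) {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        try {
            while (flushedLsn < lsn) {
                long left = deadline - System.currentTimeMillis();
                if (left <= 0) {
                    pageWritten(lsn); // nobody else wrote it in time
                } else {
                    wait(left);       // releases the monitor while sleeping
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for flush", e);
        }
    }

    synchronized int diskWrites() { return diskWrites; }
    synchronized int flushedLsn() { return flushedLsn; }

    public static void main(String[] args) throws InterruptedException {
        GroupFlushSketch log = new GroupFlushSketch();
        Thread waiter = new Thread(() -> log.awaitFlushed(3, 5_000));
        waiter.start();
        log.pageWritten(3); // some other thread writes the page
        waiter.join();
        System.out.println("disk writes: " + log.diskWrites());
    }
}
```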
– Each SimpleDB Log record is small enough to fit into a Page, so that a record
does not have to be split over a Page boundary.
– The 4 bytes after a log record give the end of the previous record on this Page
(and hence where the 4 bytes after it are), except that. . .
– the first 4 bytes on a Page give instead where the last 4 bytes on this Page
are, because appending a new log record needs to know this.
– Moving backwards across a Page boundary is in turn reading the previous disk
block of the Log file.
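• The layout of one log Page can be imitated with a plain ByteBuffer. The sketch below is a simplified stand-in with hypothetical names: records contain only integers, there is just one Page, and an overfull Page raises an exception instead of starting a new block. It shows how the 4 bytes after each record and the first 4 bytes of the Page together allow reading the records from the newest to the oldest.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Sketch: one log page whose records are chained backwards by 4-byte pointers. */
final class LogPageSketch {
    static final int INT_SIZE = 4;
    static final int LAST_POS = 0; // the first 4 bytes point at the last record's pointer
    private final ByteBuffer page = ByteBuffer.allocate(400);
    private int currentpos = INT_SIZE; // first free byte after the header

    /** Appends a record of integer fields, followed by its backward pointer. */
    void append(int... fields) {
        if (currentpos + (fields.length + 1) * INT_SIZE >= page.capacity()) {
            throw new IllegalStateException("page full; a real log would start a new block");
        }
        for (int f : fields) {
            page.putInt(currentpos, f);
            currentpos += INT_SIZE;
        }
        page.putInt(currentpos, page.getInt(LAST_POS)); // where the previous record ended
        page.putInt(LAST_POS, currentpos);              // the header now points at this record's end
        currentpos += INT_SIZE;
    }

    /** Returns the first field of every record, newest record first. */
    List<Integer> firstFieldsNewestFirst() {
        List<Integer> result = new ArrayList<>();
        int rec = page.getInt(LAST_POS);
        while (rec > 0) {
            rec = page.getInt(rec);                  // follow the backward pointer
            result.add(page.getInt(rec + INT_SIZE)); // the record starts right after that pointer
        }
        return result;
    }

    public static void main(String[] args) {
        LogPageSketch log = new LogPageSketch();
        log.append(10, 11);
        log.append(20);
        log.append(30, 31, 32);
        System.out.println(log.firstFieldsNewestFirst()); // prints [30, 20, 10]
    }
}
```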
• The running RDBMS has always exactly one active Log file which grows with
appending new log records. This Manager handles its growth.
• SimpleDB implements the LSN of a log record simply as its disk block number in
the active Log file.
• SimpleDB hides all the details of reading the Log file backwards behind a log record
iterator:
Figure 36: SimpleDB last log file page and records. (Sciore, 2008)
BasicLogRecord defines (only) the core functionality of a log record.
– This core functionality consists of methods for reading the next field of a
given type of the current log record.
– It does not know what kinds of log records the RDBMS has.
– Instead, the recovery part of the Transaction manager will define the
various log records it will need. It will use this core functionality for
implementing them.
package simpledb.log;

import simpledb.server.SimpleDB;
import simpledb.file.*;
import static simpledb.file.Page.*;
import java.util.*;

/**
 * The low-level log manager.
 * This log manager is responsible for writing log records
 * into a log file.
 * A log record can be any sequence of integer and string values.
 * The log manager does not understand the meaning of these
 * values, which are written and read by the
 * {@link simpledb.tx.recovery.RecoveryMgr recovery manager}.
 * @author Edward Sciore
 */
public class LogMgr implements Iterable<BasicLogRecord> {
   /**
    * The location where the pointer to the last integer in the page is.
    * A value of 0 means that the pointer is the first value in the page.
    */
   public static final int LAST_POS = 0;

   private String logfile;
   private Page mypage = new Page();
   private Block currentblk;
   private int currentpos;

   /**
    * Creates the manager for the specified log file.
    * If the log file does not yet exist, it is created
    * with an empty first block.
    * This constructor depends on a {@link FileMgr} object
    * that it gets from the method
    * {@link simpledb.server.SimpleDB#fileMgr()}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileMgr(String)}
    * is called first.
    * @param logfile the name of the log file
    */
   public LogMgr(String logfile) {
      this.logfile = logfile;
      int logsize = SimpleDB.fileMgr().size(logfile);
      if (logsize == 0)
         appendNewBlock();
      else {
         currentblk = new Block(logfile, logsize-1);
         mypage.read(currentblk);
         currentpos = getLastRecordPosition() + INT_SIZE;
      }
   }

   /**
    * Ensures that the log records corresponding to the
    * specified LSN has been written to disk.
    * All earlier log records will also be written to disk.
    * @param lsn the LSN of a log record
    */
   public void flush(int lsn) {
      if (lsn >= currentLSN())
         flush();
   }

   /**
    * Returns an iterator for the log records,
    * which will be returned in reverse order starting with the most recent.
    * @see java.lang.Iterable#iterator()
    */
   public synchronized Iterator<BasicLogRecord> iterator() {
      flush();
      return new LogIterator(currentblk);
   }

   /**
    * Appends a log record to the file.
    * The record contains an arbitrary array of strings and integers.
    * The method also writes an integer to the end of each log record whose value
    * is the offset of the corresponding integer for the previous log record.
    * These integers allow log records to be read in reverse order.
    * @param rec the list of values
    * @return the LSN of the final value
    */
   public synchronized int append(Object[] rec) {
      int recsize = INT_SIZE; // 4 bytes for the integer that points to the previous log record
      for (Object obj : rec)
         recsize += size(obj);
      if (currentpos + recsize >= BLOCK_SIZE) { // the log record doesn't fit,
         flush();                                // so move to the next block.
         appendNewBlock();
      }
      for (Object obj : rec)
         appendVal(obj);
      finalizeRecord();
      return currentLSN();
   }

   /**
    * Adds the specified value to the page at the position denoted by
    * currentpos. Then increments currentpos by the size of the value.
    * @param val the integer or string to be added to the page
    */
   private void appendVal(Object val) {
      if (val instanceof String)
         mypage.setString(currentpos, (String) val);
      else
         mypage.setInt(currentpos, (Integer) val);
      currentpos += size(val);
   }

   /**
    * Calculates the size of the specified integer or string.
    * @param val the value
    * @return the size of the value, in bytes
    */
   private int size(Object val) {
      if (val instanceof String) {
         String sval = (String) val;
         return STR_SIZE(sval.length());
      }
      else
         return INT_SIZE;
   }

   /**
    * Returns the LSN of the most recent log record.
    * As implemented, the LSN is the block number where the record is stored.
    * Thus every log record in a block has the same LSN.
    * @return the LSN of the most recent log record
    */
   private int currentLSN() {
      return currentblk.number();
   }

   /**
    * Writes the current page to the log file.
    */
   private void flush() {
      mypage.write(currentblk);
   }

   /**
    * Clear the current page, and append it to the log file.
    */
   private void appendNewBlock() {
      setLastRecordPosition(0);
      currentpos = INT_SIZE;
      currentblk = mypage.append(logfile);
   }

   /**
    * Sets up a circular chain of pointers to the records in the page.
    * There is an integer added to the end of each log record
    * whose value is the offset of the previous log record.
    * The first four bytes of the page contain an integer whose value
    * is the offset of the integer for the last log record in the page.
    */
   private void finalizeRecord() {
      mypage.setInt(currentpos, getLastRecordPosition());
      setLastRecordPosition(currentpos);
      currentpos += INT_SIZE;
   }

   private int getLastRecordPosition() {
      return mypage.getInt(LAST_POS);
   }

   private void setLastRecordPosition(int pos) {
      mypage.setInt(LAST_POS, pos);
   }
}
/**
 * A class that provides the ability to move through the
 * records of the log file in reverse order.
 *
 * @author Edward Sciore
 */
class LogIterator implements Iterator<BasicLogRecord> {
   private Block blk;
   private Page pg = new Page();
   private int currentrec;
   /**
    * Creates an iterator for the records in the log file,
    * positioned after the last log record.
    * This constructor is called exclusively by
    * {@link LogMgr#iterator()}.
    */
   LogIterator(Block blk) {
      this.blk = blk;
      pg.read(blk);
      currentrec = pg.getInt(LogMgr.LAST_POS);
   }
   /**
    * Determines if the current log record
    * is the earliest record in the log file.
    * @return true if there is an earlier record
    */
   public boolean hasNext() {
      return currentrec>0 || blk.number()>0;
   }
   /**
    * Moves to the next log record in reverse order.
    * If the current log record is the earliest in its block,
    * then the method moves to the next oldest block,
    * and returns the log record from there.
    * @return the next earliest log record
    */
   public BasicLogRecord next() {
      if (currentrec == 0)
         moveToNextBlock();
      currentrec = pg.getInt(currentrec);
      return new BasicLogRecord(pg, currentrec+INT_SIZE);
   }
   /**
    * Moves to the next log block in reverse order,
    * and positions it after the last record in that block.
    */
   private void moveToNextBlock() {
      blk = new Block(blk.fileName(), blk.number()-1);
      pg.read(blk);
      currentrec = pg.getInt(LogMgr.LAST_POS);
   }
}
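The backward chain that finalizeRecord builds and LogIterator follows can be illustrated with a small stand-alone sketch. It is not SimpleDB code: it assumes a single page held in a java.nio.ByteBuffer instead of a SimpleDB Page, stores only integers, never checks for page overflow, and the class name LogPageSketch and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-page model of the log layout (an assumption, not SimpleDB's classes).
class LogPageSketch {
    static final int INT_SIZE = 4;
    static final int LAST_POS = 0;            // offset 0 holds the position of the last back pointer
    private final ByteBuffer page = ByteBuffer.allocate(400);
    private int currentpos = INT_SIZE;        // first free byte after the LAST_POS slot

    LogPageSketch() {
        page.putInt(LAST_POS, 0);             // 0 means "no record in this page yet"
    }

    /** Appends one record made of ints, then links it to the previous record. */
    void append(int... values) {
        for (int v : values) {
            page.putInt(currentpos, v);
            currentpos += INT_SIZE;
        }
        // Same three steps as finalizeRecord: store the old last position,
        // make this slot the new last position, and move past it.
        page.putInt(currentpos, page.getInt(LAST_POS));
        page.putInt(LAST_POS, currentpos);
        currentpos += INT_SIZE;
    }

    /** Walks the chain as LogIterator.next does; returns the first value of each record, newest first. */
    List<Integer> firstValuesNewestFirst() {
        List<Integer> result = new ArrayList<>();
        int currentrec = page.getInt(LAST_POS);
        while (currentrec > 0) {
            currentrec = page.getInt(currentrec);            // jump to the previous back pointer
            result.add(page.getInt(currentrec + INT_SIZE));  // the record's values start right after it
        }
        return result;
    }
}
```

Appending the records (1, 10), (2, 20, 21) and (3) and then walking the chain visits them in the order 3, 2, 1, which is the reverse order that recovery needs.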
import static simpledb.file.Page.*;
import simpledb.file.Page;
/**
 * A class that provides the ability to read the values of
 * a log record.
 * The class has no idea what values are there.
 * Instead, the methods {@link #nextInt() nextInt}
 * and {@link #nextString() nextString} read the values
 * sequentially.
 * Thus the client is responsible for knowing how many values
 * are in the log record, and what their types are.
 * @author Edward Sciore
 */
public class BasicLogRecord {
   private Page pg;
   private int pos;
   /**
    * A log record located at the specified position of the specified page.
    * This constructor is called exclusively by
    * {@link LogIterator#next()}.
    * @param pg the page containing the log record
    * @param pos the position of the log record
    */
   public BasicLogRecord(Page pg, int pos) {
      this.pg = pg;
      this.pos = pos;
   }
   /**
    * Returns the next value of the current log record,
    * assuming it is an integer.
    * @return the next value of the current log record
    */
   public int nextInt() {
      int result = pg.getInt(pos);
      pos += INT_SIZE;
      return result;
   }
   /**
    * Returns the next value of the current log record,
    * assuming it is a string.
    * @return the next value of the current log record
    */
   public String nextString() {
      String result = pg.getString(pos);
      pos += STR_SIZE(result.length());
      return result;
   }
}
• The Buffer Manager is the component responsible for the Pages that hold user data,
that is, for the disk Blocks holding user data which have been read into RAM
to be processed.
• A Buffer is a combination of a Page and a Block
such that this Page holds the current contents of this Block.
• This Manager allocates and manages a large fixed pool of these Buffers. Initially
they have only their Pages but not yet any Blocks.
• This pool reserves much of the RAM of the server computer running the RDBMS
process. This RAM is well spent, because it is the central tool for improving disk
I/O in the RDBMS.
• This Buffer Manager allows the same Buffer to be pinned and accessed by many
clients at the same time.
– It just counts how many pins each Buffer has now, that is, how many clients
are accessing it now.
– If none, then this Buffer is said to be unpinned. This Manager recycles
unpinned Buffers.
• The Concurrency part of the Transaction Manager will be responsible for
coordinating their concurrent accesses.
① If the pool already contains a Buffer b for the requested disk Block d, then the
requesting client t can just add another pin into b. This ensures that a disk
Block has at most one Buffer.
② If the RDBMS server process has been started only recently, then some Buffers
may still have no disk Blocks yet. This case is almost as easy:
❶ Take some such Buffer b and
❷ read the contents of Block d from the disk into b.Page and let client t pin
this Buffer b.
③ If some Buffers in the pool are currently unpinned, then this Manager may
have to write before it can read:
❶ Select some such unpinned Buffer b.
❷ If the contents of b.Page are now different from b.Block (that is, b is dirty), then
write these current contents from b.Page back into b.Block before this
Manager recycles b for d.
❸ Continue as in step ❷ of the preceding case ②.
④ If all the Buffers in the pool are currently pinned, then this thread t must
sleep waiting for a Buffer to become unpinned before it can continue as in the
previous case ③.
Figure 37: Pinning and unpinning. (Sciore, 2008)
• Selecting an unpinned Buffer from the pool in step ❶ of case ③ is similar to what
the OS does with physical vs. virtual memory.
Naïve:
– Since no thread is using an unpinned Buffer, it does not really matter
which one we select... does it?
– It does, because we prefer the fast case ① without disk I/O to the slow case ③
with disk I/O.
– But then the RDBMS must guess which Buffers in its pool are pinned to
disk Blocks which client threads might need in the future.
– This is not a good selection strategy. However, SimpleDB uses it, because
it is very simple to implement.
FIFO or First In First Out:
– One such guess is that the disk Blocks which were read into the pool long
ago are no longer needed.
– That is, the Buffers in the pool form a queue.
– The first unpinned Buffer from the front of this queue is selected...
– and added to the back of this queue when it is pinned to another disk
Block.
– This is a reasonable idea.
– However, it does not take into account that some disk Blocks (like metadata)
may be needed very often, and if such a Buffer happens to be unpinned
even for a brief moment, then it will get selected.
LRU or Least Recently Used:
– Another guess which solves this problem with FIFO is to remember in each
Buffer the time when it became unpinned, and. . .
– select the Buffer with the earliest time.
– Here the reasoning is that if a Buffer has not been used for a long time,
then its disk Block will not be used soon in the future either.
Clock:
– Another idea is to use the unpinned Buffers of the pool as evenly as possible.
– Suppose that the pool is an array bufferpool[0 ... PoolSize − 1] of Buffers.
– This strategy remembers the latest index from where it found the previous
unpinned Buffer.
– When another unpinned Buffer is needed, this index moves forward in the
array with
latest = (latest + 1) mod PoolSize        (10)
until one is found.
– The name comes from considering an analog clock whose
face is the bufferpool array
hours are 0, 1, 2, . . . , PoolSize − 1 instead of 1, 2, 3, . . . , 12
hand is the latest index.
– This strategy has some flavour of
FIFO since Equation (10) uses the bufferpool array as if it implemented
a queue
LRU since it skips over pinned Buffer s and reconsiders them only when
the latest index has gone a full circle around the whole bufferpool.
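The four selection strategies above can be compared side by side in a short sketch. It is only an illustration under stated assumptions: Buf is a hypothetical stand-in for a Buffer that carries two made-up timestamps (loadedAt for FIFO, unpinnedAt for LRU), and none of these names exist in SimpleDB, whose chooseUnpinnedBuffer implements only the naïve strategy.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical sketch of the four strategies; not part of SimpleDB.
class ReplacementSketch {
    /** Stand-in for a Buffer: only the fields that the strategies inspect. */
    static class Buf {
        final int id;
        final boolean pinned;
        final long loadedAt;     // when its current Block was read in (used by FIFO)
        final long unpinnedAt;   // when its pin count last dropped to zero (used by LRU)
        Buf(int id, boolean pinned, long loadedAt, long unpinnedAt) {
            this.id = id;
            this.pinned = pinned;
            this.loadedAt = loadedAt;
            this.unpinnedAt = unpinnedAt;
        }
    }

    /** Naive: the first unpinned Buffer in array order. */
    static Optional<Buf> naive(Buf[] pool) {
        return Arrays.stream(pool).filter(b -> !b.pinned).findFirst();
    }

    /** FIFO: the unpinned Buffer whose Block was read in longest ago. */
    static Optional<Buf> fifo(Buf[] pool) {
        return Arrays.stream(pool).filter(b -> !b.pinned)
                .min(Comparator.comparingLong((Buf b) -> b.loadedAt));
    }

    /** LRU: the unpinned Buffer that became unpinned longest ago. */
    static Optional<Buf> lru(Buf[] pool) {
        return Arrays.stream(pool).filter(b -> !b.pinned)
                .min(Comparator.comparingLong((Buf b) -> b.unpinnedAt));
    }

    private int latest = -1;     // index of the previously chosen Buffer

    /** Clock: advance latest = (latest + 1) mod PoolSize until an unpinned Buffer is found. */
    Optional<Buf> clock(Buf[] pool) {
        for (int i = 0; i < pool.length; i++) {      // at most one full circle
            latest = (latest + 1) % pool.length;
            if (!pool[latest].pinned)
                return Optional.of(pool[latest]);
        }
        return Optional.empty();                     // every Buffer is pinned
    }
}
```

In a real pool the chosen Buffer would be pinned immediately, so repeated calls would not see the same state; the sketch leaves the pool unchanged so that the four choices can be compared on identical input.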
Figure 38: Buffer pool example. (Sciore, 2008)
• A client t can modify the Page of a Buffer object by calling its set method. This
method requires the following 2 additional parameters, which the Buffer remembers:
The Transaction which t is now running. This is needed because the Buffer Manager
offers a method to flush all the Buffers modified by a given Transaction t to the
disk.
The LSN of the last modification to this Buffer by Transaction t. To get this
LSN, Transaction t must have Logged its intention to modify this Buffer before
actually modifying it.
The Recovery part of the Transaction Manager will use these remembered parameters.
❶ The RDBMS does step ② (writing the modified Page into its disk Block) first.
This overwrites the original contents of the disk Block.
❷ Then it tries to do step ① (flushing the corresponding Log record) but fails. Now
the Log file does not have the original contents of the disk Block either, and so
recovery becomes impossible!
Hence the order in requirement 11 is the correct choice, because it overwrites the
disk Block only after its original contents have been successfully flushed into the
Log.
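The ordering argument can be made concrete with a toy model. The sketch below is an illustration only: it assumes one page, one disk block and an in-memory list standing in for the Log file, and every name in it (WalOrderSketch, logInMemory, flushPage, and so on) is hypothetical rather than taken from SimpleDB.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of "flush the Log record before overwriting the Block".
class WalOrderSketch {
    final List<String> logInMemory = new ArrayList<>();  // Log records not necessarily on disk yet
    final List<String> logOnDisk = new ArrayList<>();    // the flushed prefix of the Log
    String blockOnDisk = "old";                           // current contents of the disk Block
    String page = "old";                                  // contents of the Buffer's Page in RAM
    int pageLsn = -1;                                     // negative means "Page is not dirty"

    /** Appends a record to the in-memory Log and returns its LSN (here: its index). */
    int log(String rec) {
        logInMemory.add(rec);
        return logInMemory.size() - 1;
    }

    /** Makes sure that every record up to and including lsn is on disk. */
    void flushLog(int lsn) {
        for (int i = logOnDisk.size(); i <= lsn; i++)
            logOnDisk.add(logInMemory.get(i));
    }

    /** Logs the intention first, then modifies the Page and remembers the LSN. */
    void modify(String newValue) {
        pageLsn = log("SET old=" + page + " new=" + newValue);
        page = newValue;
    }

    /** Step 1: flush the Log record. Step 2: only then overwrite the disk Block. */
    void flushPage() {
        if (pageLsn >= 0) {
            flushLog(pageLsn);
            blockOnDisk = page;
            pageLsn = -1;
        }
    }
}
```

If the two statements inside flushPage were swapped and the process died between them, blockOnDisk would already hold the new value while logOnDisk would contain no record of the old one.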
package simpledb.buffer;
import simpledb.server.SimpleDB;
import simpledb.file.*;
/**
 * An individual buffer.
 * A buffer wraps a page and stores information about its status,
 * such as the disk block associated with the page,
 * the number of times the block has been pinned,
 * whether the contents of the page have been modified,
 * and if so, the id of the modifying transaction and
 * the LSN of the corresponding log record.
 * @author Edward Sciore
 */
public class Buffer {
   private Page contents = new Page();
   private Block blk = null;
   private int pins = 0;
   private int modifiedBy = -1;  // negative means not modified
   private int logSequenceNumber = -1; // negative means no corresponding log record
   /**
    * Creates a new buffer, wrapping a new
    * {@link simpledb.file.Page page}.
    * This constructor is called exclusively by the
    * class {@link BasicBufferMgr}.
    * It depends on the
    * {@link simpledb.log.LogMgr LogMgr} object
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * That object is created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    */
   public Buffer() {}
   /**
    * Returns the integer value at the specified offset of the
    * buffer's page.
    * If an integer was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset of the page
    * @return the integer value at that offset
    */
   public int getInt(int offset) {
      return contents.getInt(offset);
   }
   /**
    * Returns the string value at the specified offset of the
    * buffer's page.
    * If a string was not stored at that location,
    * the behavior of the method is unpredictable.
    * @param offset the byte offset of the page
    * @return the string value at that offset
    */
   public String getString(int offset) {
      return contents.getString(offset);
   }
   /**
    * Writes an integer to the specified offset of the
    * buffer's page.
    * This method assumes that the transaction has already
    * written an appropriate log record.
    * The buffer saves the id of the transaction
    * and the LSN of the log record.
    * A negative lsn value indicates that a log record
    * was not necessary.
    * @param offset the byte offset within the page
    * @param val the new integer value to be written
    * @param txnum the id of the transaction performing the modification
    * @param lsn the LSN of the corresponding log record
    */
   public void setInt(int offset, int val, int txnum, int lsn) {
      modifiedBy = txnum;
      if (lsn >= 0)
         logSequenceNumber = lsn;
      contents.setInt(offset, val);
   }
   /**
    * Writes a string to the specified offset of the
    * buffer's page.
    * This method assumes that the transaction has already
    * written an appropriate log record.
    * A negative lsn value indicates that a log record
    * was not necessary.
    * The buffer saves the id of the transaction
    * and the LSN of the log record.
    * @param offset the byte offset within the page
    * @param val the new string value to be written
    * @param txnum the id of the transaction performing the modification
    * @param lsn the LSN of the corresponding log record
    */
   public void setString(int offset, String val, int txnum, int lsn) {
      modifiedBy = txnum;
      if (lsn >= 0)
         logSequenceNumber = lsn;
      contents.setString(offset, val);
   }
   /**
    * Returns a reference to the disk block
    * that the buffer is pinned to.
    * @return a reference to a disk block
    */
   public Block block() {
      return blk;
   }
   /**
    * Writes the page to its disk block if the
    * page is dirty.
    * The method ensures that the corresponding log
    * record has been written to disk prior to writing
    * the page to disk.
    */
   void flush() {
      if (modifiedBy >= 0) {
         SimpleDB.logMgr().flush(logSequenceNumber);
         contents.write(blk);
         modifiedBy = -1;
      }
   }
   /**
    * Increases the buffer's pin count.
    */
   void pin() {
      pins++;
   }
   /**
    * Decreases the buffer's pin count.
    */
   void unpin() {
      pins--;
   }
   /**
    * Returns true if the buffer is currently pinned
    * (that is, if it has a nonzero pin count).
    * @return true if the buffer is pinned
    */
   boolean isPinned() {
      return pins > 0;
   }
   /**
    * Returns true if the buffer is dirty
    * due to a modification by the specified transaction.
    * @param txnum the id of the transaction
    * @return true if the transaction modified the buffer
    */
   boolean isModifiedBy(int txnum) {
      return txnum == modifiedBy;
   }
   /**
    * Reads the contents of the specified block into
    * the buffer's page.
    * If the buffer was dirty, then the contents
    * of the previous page are first written to disk.
    * @param b a reference to the data block
    */
   void assignToBlock(Block b) {
      flush();
      blk = b;
      contents.read(blk);
      pins = 0;
   }
   /**
    * Initializes the buffer's page according to the specified formatter,
    * and appends the page to the specified file.
    * If the buffer was dirty, then the contents
    * of the previous page are first written to disk.
    * @param filename the name of the file
    * @param fmtr a page formatter, used to initialize the page
    */
   void assignToNew(String filename, PageFormatter fmtr) {
      flush();
      fmtr.format(contents);
      blk = contents.append(filename);
      pins = 0;
   }
}
• Such a formatter is a function which initializes the Page in RAM appropriately.
• Each kind of a disk block will define its own kind of formatter.
• For instance, the Record Manager will define a formatter which initializes the Page
to consist of empty unused Records.
• Client threads will then access this formatted Page, and eventually the Buffer Manager
will write it to the disk, creating the new Block.
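As an illustration of what such a formatter might do, here is a minimal sketch that marks every fixed-size record slot of a page as empty. It makes several assumptions: the page is modelled by a java.nio.ByteBuffer rather than a SimpleDB Page, each slot starts with an integer flag where 0 means "empty", and the names SketchPageFormatter and EmptySlotFormatter are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the formatter interface, using ByteBuffer instead of Page.
interface SketchPageFormatter {
    void format(ByteBuffer page);
}

// Hypothetical formatter: fills the page with empty, zeroed record slots.
class EmptySlotFormatter implements SketchPageFormatter {
    static final int EMPTY = 0;      // assumed flag value for an unused slot
    static final int INT_SIZE = 4;
    private final int slotSize;      // flag plus record bytes; must be at least INT_SIZE

    EmptySlotFormatter(int slotSize) {
        if (slotSize < INT_SIZE)
            throw new IllegalArgumentException("slot too small for its flag");
        this.slotSize = slotSize;
    }

    @Override
    public void format(ByteBuffer page) {
        // Visit every slot that fits completely into the page.
        for (int pos = 0; pos + slotSize <= page.capacity(); pos += slotSize) {
            page.putInt(pos, EMPTY);                          // mark the slot as unused
            for (int i = pos + INT_SIZE; i < pos + slotSize; i++)
                page.put(i, (byte) 0);                        // clear the record bytes
        }
    }
}
```

Bytes after the last complete slot are left untouched, since no record can ever be stored there.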
package simpledb.buffer;
import simpledb.file.Page;
/**
 * An interface used to initialize a new block on disk.
 * There will be an implementing class for each "type" of
 * disk block.
 * @author Edward Sciore
 */
public interface PageFormatter {
   /**
    * Initializes a page, whose contents will be
    * written to a new disk block.
    * This method is called only during the method
    * {@link Buffer#assignToNew}.
    * @param p a buffer page
    */
   public void format(Page p);
}
• That is, it handles all the cases where the requesting client t can get a Buffer without
having to sleep first.
package simpledb.buffer;
import simpledb.file.*;
/**
 * Manages the pinning and unpinning of buffers to blocks.
 * @author Edward Sciore
 *
 */
class BasicBufferMgr {
   private Buffer[] bufferpool;
   private int numAvailable;
   /**
    * Creates a buffer manager having the specified number
    * of buffer slots.
    * This constructor depends on both the {@link FileMgr} and
    * {@link simpledb.log.LogMgr LogMgr} objects
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    * @param numbuffs the number of buffer slots to allocate
    */
   BasicBufferMgr(int numbuffs) {
      bufferpool = new Buffer[numbuffs];
      numAvailable = numbuffs;
      for (int i=0; i<numbuffs; i++)
         bufferpool[i] = new Buffer();
   }
   /**
    * Flushes the dirty buffers modified by the specified transaction.
    * @param txnum the transaction's id number
    */
   synchronized void flushAll(int txnum) {
      for (Buffer buff : bufferpool)
         if (buff.isModifiedBy(txnum))
            buff.flush();
   }
   /**
    * Pins a buffer to the specified block.
    * If there is already a buffer assigned to that block
    * then that buffer is used;
    * otherwise, an unpinned buffer from the pool is chosen.
    * Returns a null value if there are no available buffers.
    * @param blk a reference to a disk block
    * @return the pinned buffer
    */
   synchronized Buffer pin(Block blk) {
      Buffer buff = findExistingBuffer(blk);
      if (buff == null) {
         buff = chooseUnpinnedBuffer();
         if (buff == null)
            return null;
         buff.assignToBlock(blk);
      }
      if (!buff.isPinned())
         numAvailable--;
      buff.pin();
      return buff;
   }
   /**
    * Allocates a new block in the specified file, and
    * pins a buffer to it.
    * Returns null (without allocating the block) if
    * there are no available buffers.
    * @param filename the name of the file
    * @param fmtr a page formatter object, used to format the new block
    * @return the pinned buffer
    */
   synchronized Buffer pinNew(String filename, PageFormatter fmtr) {
      Buffer buff = chooseUnpinnedBuffer();
      if (buff == null)
         return null;
      buff.assignToNew(filename, fmtr);
      numAvailable--;
      buff.pin();
      return buff;
   }
   /**
    * Unpins the specified buffer.
    * @param buff the buffer to be unpinned
    */
   synchronized void unpin(Buffer buff) {
      buff.unpin();
      if (!buff.isPinned())
         numAvailable++;
   }
   /**
    * Returns the number of available (i.e. unpinned) buffers.
    * @return the number of available buffers
    */
   int available() {
      return numAvailable;
   }
   private Buffer findExistingBuffer(Block blk) {
      for (Buffer buff : bufferpool) {
         Block b = buff.block();
         if (b != null && b.equals(blk))
            return buff;
      }
      return null;
   }
   private Buffer chooseUnpinnedBuffer() {
      for (Buffer buff : bufferpool)
         if (!buff.isPinned())
            return buff;
      return null;
   }
}
• This full Buffer Manager adds the remaining case ④ of the Buffer granting algorithm
(all the Buffers in the pool are pinned) into the basic Buffer Manager.
• That is, it handles the remaining case where the requesting client t must first go to
sleep waiting for a Buffer to become unpinned.
• SimpleDB implements this sleeping with the Java lock of the unique bufferMgr
object (Sestoft, 2005, Chapter 16.4) as follows:
❸ When another thread takes the last pin from a Buffer, it calls bufferMgr.notifyAll,
which wakes up every thread which is waiting in bufferMgr.wait for this to happen,
and...
❹ all these threads compete for this one unpinned Buffer. One of them wins, and
the others must bufferMgr.wait again.
– A waiting thread can experience livelock where it cannot get on with its work,
because it always loses in the competitions of step ❹.
– A fair implementation would grant the buffer requests in FIFO order instead.
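One possible way to obtain FIFO granting in Java is a fair java.util.concurrent.Semaphore, whose timed tryAcquire honours the fairness setting. The sketch below is not how SimpleDB works: it only counts free Buffers (it does not model several clients sharing one Buffer), and the class FairBufferGate with its methods is hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: waiting clients are served in arrival order.
class FairBufferGate {
    private final Semaphore free;        // one permit per unpinned Buffer
    private final long maxWaitMillis;    // plays the role of MAX_TIME

    FairBufferGate(int poolSize, long maxWaitMillis) {
        this.free = new Semaphore(poolSize, true);   // true = fair, i.e. FIFO granting
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Waits in FIFO order for a free Buffer; gives up after maxWaitMillis. */
    void acquire() {
        try {
            if (!free.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS))
                throw new IllegalStateException("no Buffer became free in time");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();      // keep the interrupt visible to the caller
            throw new IllegalStateException("interrupted while waiting for a Buffer");
        }
    }

    /** Called when the last pin is taken from a Buffer. */
    void release() {
        free.release();
    }

    int available() {
        return free.availablePermits();
    }
}
```

With this design the longest-waiting thread gets the next free Buffer, so no thread can keep losing the competition; the timeout still serves as the crude deadlock detector.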
• However, this Buffer Manager can also cause a deadlock – and that is a problem!
– If a client thread has been waiting for a Buffer for 10 seconds, then it is
assumed to be in a deadlock.
– Then SimpleDB raises a BufferAbortException in this client thread in the
RDBMS server process, which. . .
– aborts the client thread’s current transaction, and this in turn unpins all its
Buffers, and...
– gets passed to the client process too.
– This is an example of where the RDBMS reports an “error” to the client process
because it is running low on resources, as in failure reason ❹ of section 3.3.
❶ Since client A has already got one of the Buffers, client B must not get the
other.
❷ Instead, client A can get both Buffers, and execute.
❸ At the end client A unpins both its Buffers, and then client B can execute.
package simpledb.buffer;
import simpledb.file.*;
/**
 * The publicly-accessible buffer manager.
 * A buffer manager wraps a basic buffer manager, and
 * provides the same methods. The difference is that
 * the methods {@link #pin(Block) pin} and
 * {@link #pinNew(String, PageFormatter) pinNew}
 * will never return null.
 * If no buffers are currently available, then the
 * calling thread will be placed on a waiting list.
 * The waiting threads are removed from the list when
 * a buffer becomes available.
 * If a thread has been waiting for a buffer for an
 * excessive amount of time (currently, 10 seconds)
 * then a {@link BufferAbortException} is thrown.
 * @author Edward Sciore
 */
public class BufferMgr {
   private static final long MAX_TIME = 10000; // 10 seconds
   private BasicBufferMgr bufferMgr;
   /**
    * Creates a new buffer manager having the specified
    * number of buffers.
    * This constructor depends on both the {@link FileMgr} and
    * {@link simpledb.log.LogMgr LogMgr} objects
    * that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until
    * {@link simpledb.server.SimpleDB#initFileAndLogMgr(String)} or
    * is called first.
    * @param numbuffers the number of buffer slots to allocate
    */
   public BufferMgr(int numbuffers) {
      bufferMgr = new BasicBufferMgr(numbuffers);
   }
   /**
    * Pins a buffer to the specified block, potentially
    * waiting until a buffer becomes available.
    * If no buffer becomes available within a fixed
    * time period, then a {@link BufferAbortException} is thrown.
    * @param blk a reference to a disk block
    * @return the buffer pinned to that block
    */
   public synchronized Buffer pin(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         Buffer buff = bufferMgr.pin(blk);
         while (buff == null && !waitingTooLong(timestamp)) {
            wait(MAX_TIME);
            buff = bufferMgr.pin(blk);
         }
         if (buff == null)
            throw new BufferAbortException();
         return buff;
      }
      catch(InterruptedException e) {
         throw new BufferAbortException();
      }
   }
   /**
    * Pins a buffer to a new block in the specified file,
    * potentially waiting until a buffer becomes available.
    * If no buffer becomes available within a fixed
    * time period, then a {@link BufferAbortException} is thrown.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the page
    * @return the buffer pinned to that block
    */
   public synchronized Buffer pinNew(String filename, PageFormatter fmtr) {
      try {
         long timestamp = System.currentTimeMillis();
         Buffer buff = bufferMgr.pinNew(filename, fmtr);
         while (buff == null && !waitingTooLong(timestamp)) {
            wait(MAX_TIME);
            buff = bufferMgr.pinNew(filename, fmtr);
         }
         if (buff == null)
            throw new BufferAbortException();
         return buff;
      }
      catch(InterruptedException e) {
         throw new BufferAbortException();
      }
   }
   /**
    * Unpins the specified buffer.
    * If the buffer's pin count becomes 0,
    * then the threads on the wait list are notified.
    * @param buff the buffer to be unpinned
    */
   public synchronized void unpin(Buffer buff) {
      bufferMgr.unpin(buff);
      if (!buff.isPinned())
         notifyAll();
   }
   /**
    * Flushes the dirty buffers modified by the specified transaction.
    * @param txnum the transaction's id number
    */
   public void flushAll(int txnum) {
      bufferMgr.flushAll(txnum);
   }
   /**
    * Returns the number of available (ie unpinned) buffers.
    * @return the number of available buffers
    */
   public int available() {
      return bufferMgr.available();
   }
   private boolean waitingTooLong(long starttime) {
      return System.currentTimeMillis() - starttime > MAX_TIME;
   }
}
/**
 * A runtime exception indicating that the transaction
 * needs to abort because a buffer request could not be satisfied.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class BufferAbortException extends RuntimeException {}
• We have defined what Transactions are and their 4 ACID properties in section 2.5.
• Let us now consider how SimpleDB implements them, and some alternatives.
Recovery of the database after its server process is restarted after a shutdown.
Concurrency Management for Buffer s and other resources which several client threads
and Transactions want to use at the same time.
• Recovery takes place when the RDBMS server process is restarted after a shutdown
(for whatever reason).
• Recovery uses the information in the Log file which was written before the shutdown. It
undoes the modifications made by Transactions that got started but never committed
before the shutdown, and
redoes the modifications made by Transactions that committed before the shutdown,
but whose Buffers might not have been flushed yet.
         | no-undo                             | with-undo
 no-redo | The database contains exactly the   | The database contains all the
         | modifications by the committed      | modifications by the committed
         | Transactions.                       | Transactions (and maybe more).
         | − All the same problems as below    | − When a Transaction commits, all its
         |   and to the right...               |   Buffers must be flushed, as in
         |                                     |   Figure 40: a lot of disk I/O in
         |                                     |   one burst! In all, this could mean
         |                                     |   up to 10 × normal I/O!
Table 3: 4 kinds of recovery algorithms. (Weikum and Vossen, 2001, Chapter 12.5)
• A Transaction with a Start but neither a Commit nor a Rollback log record must
have been running while the shutdown happened.
• Committing a Transaction with redo can simplify step 1 of Figure 40 into just
unpinning these Buffers:
– The Buffer Manager will write them to their disk Blocks later, when it recycles
them.
– If the RDBMS shuts down before it has written them, then their modifications
can be redone from the Log during recovery.
• Figure 41 reconstructs the original disk Block contents into RAM Buffers.
Figure 39: The general recovery algorithm. (Sciore, 2008)
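The undo and redo passes described above can be sketched over an in-memory log. This is an illustration under several assumptions, not SimpleDB's recovery manager: each update record is assumed to carry both the old and the new value, the database is modelled as a map from a key to an integer, a rolled-back Transaction is assumed to have been fully undone on disk before its Rollback record was written, and all names (RecoverySketch, Rec, recover) are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a two-pass undo-redo recovery.
class RecoverySketch {
    enum Op { START, COMMIT, ROLLBACK, SET }

    /** A log record; key, oldVal and newVal are only meaningful for SET. */
    static class Rec {
        final Op op;
        final int tx;
        final String key;
        final int oldVal, newVal;
        Rec(Op op, int tx, String key, int oldVal, int newVal) {
            this.op = op; this.tx = tx; this.key = key;
            this.oldVal = oldVal; this.newVal = newVal;
        }
        static Rec mark(Op op, int tx) { return new Rec(op, tx, null, 0, 0); }
        static Rec set(int tx, String key, int oldVal, int newVal) {
            return new Rec(Op.SET, tx, key, oldVal, newVal);
        }
    }

    /** Brings db (the on-disk state found after the shutdown) to a consistent state. */
    static void recover(List<Rec> log, Map<String, Integer> db) {
        Set<Integer> committed = new HashSet<>();
        Set<Integer> finished = new HashSet<>();     // committed or rolled back

        // Undo pass: read the log backwards; a SET whose Transaction has no
        // Commit or Rollback record later in the log is restored to its old value.
        for (int i = log.size() - 1; i >= 0; i--) {
            Rec r = log.get(i);
            if (r.op == Op.COMMIT) { committed.add(r.tx); finished.add(r.tx); }
            else if (r.op == Op.ROLLBACK) finished.add(r.tx);
            else if (r.op == Op.SET && !finished.contains(r.tx)) db.put(r.key, r.oldVal);
        }

        // Redo pass: read the log forwards and reapply the SETs of committed Transactions.
        for (Rec r : log)
            if (r.op == Op.SET && committed.contains(r.tx)) db.put(r.key, r.newVal);
    }
}
```

For example, if Transaction 1 set a from 0 to 5 and committed, while Transaction 2 set b from 0 to 7 and a from 5 to 9 without finishing, recovery ends with a = 5 and b = 0 regardless of which Pages happened to reach the disk before the shutdown.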
Figure 42: Aborting without undo. (Sciore, 2008)
– One way to ensure that these original Buffers are written to disk is to flush
them before adding the abort Log record.
This is similar to committing in the “with-undo-no-redo” approach in Table 3.
– Another way is to add Log records also in its step 2a.
This way is compatible with more design choices than the first.
• SimpleDB has chosen attribute values as its Logging and recovery granularity.
– Other, coarser choices could have been to Log changes to whole Blocks or even
files instead.
– Then the Log would contain fewer records, but each record would be larger.
SimpleDB source file simpledb/tx/recovery/LogRecord.java
• Here is the definition of the LogRecord interface.
• Each of the 5 files after it implements one particular kind of a log record mentioned
before.
package simpledb.tx.recovery;
import simpledb.log.LogMgr;
import simpledb.server.SimpleDB;
/**
 * The interface implemented by each type of log record.
 * @author Edward Sciore
 */
public interface LogRecord {
   /**
    * The six different types of log record
    */
   static final int CHECKPOINT = 0, START = 1,
      COMMIT = 2, ROLLBACK = 3,
      SETINT = 4, SETSTRING = 5;
   /**
    * Writes the record to the log and returns its LSN.
    * @return the LSN of the record in the log
    */
   int writeToLog();
   /**
    * Returns the log record's type.
    * @return the log record's type
    */
   int op();
   /**
    * Returns the transaction id stored with
    * the log record.
    * @return the log record's transaction id
    */
   int txNumber();
   /**
    * Undoes the operation encoded by this log record.
    * The only log record types for which this method
    * does anything interesting are SETINT and SETSTRING.
    * @param txnum the id of the transaction that is performing the undo.
    */
   void undo(int txnum);
}
import simpledb.log.BasicLogRecord;
class StartRecord implements LogRecord {
   private int txnum;
   /**
    * Creates a new start log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public StartRecord(int txnum) {
      this.txnum = txnum;
   }
   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public StartRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }
   /**
    * Writes a start record to the log.
    * This log record contains the START operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {START, txnum};
      return logMgr.append(rec);
   }
   public int op() {
      return START;
   }
   public int txNumber() {
      return txnum;
   }
   /**
    * Does nothing, because a start record
    * contains no undo information.
    */
   public void undo(int txnum) {}
   public String toString() {
      return "<START " + txnum + ">";
   }
}
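The encoding used by writeToLog() and by the BasicLogRecord constructors can be tried out without a running SimpleDB. The following is a minimal sketch, assuming a hypothetical in-memory class MiniLog in place of the real Log Manager; MiniLog, append and describe are illustrations only and are not part of SimpleDB.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Log Manager: keeps the records in RAM only.
final class MiniLog {
    static final int START = 1, SETINT = 4;   // same op codes as in LogRecord

    private final List<Object[]> records = new ArrayList<>();

    // Appends one record (op code first) and returns its position as a toy LSN.
    int append(Object... values) {
        records.add(values);
        return records.size() - 1;
    }

    // Reads the op code first, then decodes the remaining values accordingly.
    String describe(int lsn) {
        Object[] rec = records.get(lsn);
        int op = (Integer) rec[0];
        switch (op) {
            case START:
                return "<START " + rec[1] + ">";
            case SETINT:
                return "<SETINT " + rec[1] + " [file " + rec[2] + ", block " + rec[3]
                        + "] " + rec[4] + " " + rec[5] + ">";
            default:
                return "<UNKNOWN " + op + ">";
        }
    }

    public static void main(String[] args) {
        MiniLog log = new MiniLog();
        int first = log.append(START, 7);
        int second = log.append(SETINT, 7, "student.tbl", 2, 40, 1999);
        System.out.println(log.describe(first));
        System.out.println(log.describe(second));
    }
}
```

The op code at the front of each record plays the same role as in LogRecordIterator: it tells the reader how many further values to expect and of which types.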
package simpledb.tx.recovery;
import simpledb.buffer.Buffer;
import simpledb.buffer.BufferMgr;
import simpledb.file.Block;
import simpledb.log.BasicLogRecord;
import simpledb.server.SimpleDB;
class SetStringRecord implements LogRecord {
   private int txnum, offset;
   private String val;
   private Block blk;
   /**
    * Creates a new setstring log record.
    * @param txnum the ID of the specified transaction
    * @param blk the block containing the value
    * @param offset the offset of the value in the block
    * @param val the previous value at that offset, which undo() restores
    */
   public SetStringRecord(int txnum, Block blk, int offset, String val) {
      this.txnum = txnum;
      this.blk = blk;
      this.offset = offset;
      this.val = val;
   }
   /**
    * Creates a log record by reading five other values from the log.
    * @param rec the basic log record
    */
   public SetStringRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
      String filename = rec.nextString();
      int blknum = rec.nextInt();
      blk = new Block(filename, blknum);
      offset = rec.nextInt();
      val = rec.nextString();
   }
   /**
    * Writes a setString record to the log.
    * This log record contains the SETSTRING operator,
    * followed by the transaction id, the filename, number,
    * and offset of the modified block, and the previous
    * string value at that offset.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {SETSTRING, txnum, blk.fileName(),
         blk.number(), offset, val};
      return logMgr.append(rec);
   }
   public int op() {
      return SETSTRING;
   }
   public int txNumber() {
      return txnum;
   }
   public String toString() {
      return "<SETSTRING " + txnum + " " + blk + " " + offset + " " + val + ">";
   }
   /**
    * Replaces the specified data value with the value saved in the log record.
    * The method pins a buffer to the specified block,
    * calls setString to restore the saved value
    * (using a dummy LSN), and unpins the buffer.
    * @see simpledb.tx.recovery.LogRecord#undo(int)
    */
   public void undo(int txnum) {
      BufferMgr buffMgr = SimpleDB.bufferMgr();
      Buffer buff = buffMgr.pin(blk);
      buff.setString(offset, val, txnum, -1);
      buffMgr.unpin(buff);
   }
}
• The next file is the corresponding definition for the other type τ = Int which SimpleDB supports.
import simpledb.buffer.Buffer;
import simpledb.buffer.BufferMgr;
import simpledb.file.Block;
import simpledb.log.BasicLogRecord;
import simpledb.server.SimpleDB;
class SetIntRecord implements LogRecord {
   private int txnum, offset, val;
   private Block blk;
   /**
    * Creates a new setint log record.
    * @param txnum the ID of the specified transaction
    * @param blk the block containing the value
    * @param offset the offset of the value in the block
    * @param val the previous value at that offset, which undo() restores
    */
   public SetIntRecord(int txnum, Block blk, int offset, int val) {
      this.txnum = txnum;
      this.blk = blk;
      this.offset = offset;
      this.val = val;
   }
   /**
    * Creates a log record by reading five other values from the log.
    * @param rec the basic log record
    */
   public SetIntRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
      String filename = rec.nextString();
      int blknum = rec.nextInt();
      blk = new Block(filename, blknum);
      offset = rec.nextInt();
      val = rec.nextInt();
   }
   /**
    * Writes a setInt record to the log.
    * This log record contains the SETINT operator,
    * followed by the transaction id, the filename, number,
    * and offset of the modified block, and the previous
    * integer value at that offset.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {SETINT, txnum, blk.fileName(),
         blk.number(), offset, val};
      return logMgr.append(rec);
   }
   public int op() {
      return SETINT;
   }
   public int txNumber() {
      return txnum;
   }
   public String toString() {
      return "<SETINT " + txnum + " " + blk + " " + offset + " " + val + ">";
   }
   /**
    * Replaces the specified data value with the value saved in the log record.
    * The method pins a buffer to the specified block,
    * calls setInt to restore the saved value
    * (using a dummy LSN), and unpins the buffer.
    * @see simpledb.tx.recovery.LogRecord#undo(int)
    */
   public void undo(int txnum) {
      BufferMgr buffMgr = SimpleDB.bufferMgr();
      Buffer buff = buffMgr.pin(blk);
      buff.setInt(offset, val, txnum, -1);
      buffMgr.unpin(buff);
   }
}
import simpledb.log.BasicLogRecord;
/**
 * The COMMIT log record
 * @author Edward Sciore
 */
class CommitRecord implements LogRecord {
   private int txnum;
   /**
    * Creates a new commit log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public CommitRecord(int txnum) {
      this.txnum = txnum;
   }
   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public CommitRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }
   /**
    * Writes a commit record to the log.
    * This log record contains the COMMIT operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {COMMIT, txnum};
      return logMgr.append(rec);
   }
   public int op() {
      return COMMIT;
   }
   public int txNumber() {
      return txnum;
   }
   /**
    * Does nothing, because a commit record
    * contains no undo information.
    */
   public void undo(int txnum) {}
   public String toString() {
      return "<COMMIT " + txnum + ">";
   }
}
import simpledb.log.BasicLogRecord;
/**
 * The ROLLBACK log record.
 * @author Edward Sciore
 */
class RollbackRecord implements LogRecord {
   private int txnum;
   /**
    * Creates a new rollback log record for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public RollbackRecord(int txnum) {
      this.txnum = txnum;
   }
   /**
    * Creates a log record by reading one other value from the log.
    * @param rec the basic log record
    */
   public RollbackRecord(BasicLogRecord rec) {
      txnum = rec.nextInt();
   }
   /**
    * Writes a rollback record to the log.
    * This log record contains the ROLLBACK operator,
    * followed by the transaction id.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {ROLLBACK, txnum};
      return logMgr.append(rec);
   }
   public int op() {
      return ROLLBACK;
   }
   public int txNumber() {
      return txnum;
   }
   /**
    * Does nothing, because a rollback record
    * contains no undo information.
    */
   public void undo(int txnum) {}
   public String toString() {
      return "<ROLLBACK " + txnum + ">";
   }
}
Checkpoints (Sciore, 2008, Chapters 14.3.6–14.3.7) (Weikum and Vossen, 2001, Chapter 13.3.3)
• The RDBMS can mark a checkpoint into its Log file at a moment where all its
Buffers and Transactions are in some suitable known “quiet” state.
• A different meaning of the same word is a database state which the user
can save and later revert to. We do not consider such checkpoints here.
• The DBA can set how frequently the RDBMS takes these system checkpoints.
Typical values are between 1 and 5 minutes.
• Heavyweight checkpointing flushes all modified Buffers to get them into a quiet
state. It can be further divided into
Log Truncation
• These system checkpoints allow the undo stage 1 of Figure 39 to stop reading the
Log file backwards sooner:
When the Recovery Manager encounters the first (that is, the most recent)
• This is how the algorithm in Figure 39 behaves on the Log file in Figure 45:
Figure 43: Quiescent checkpointing. (Sciore, 2008)
② Going backwards in its undo phase 1, it does the same for the preceding record
SETINT, 2 ... too.
③ It passes over the START, 3 record.
④ The COMMIT, 0 record adds 0 to its committed list.
⑤ This causes the SETSTRING, 0 record to be ignored.
⑥ When it encounters the nonquiescent checkpoint NQCKPT, 0, 2, it
– ignores 0 because it is now in the committed list, and
– knows that it can stop its phase 1 as soon as it encounters the START, 2
record.
⑦ It undoes the SETSTRING, 2 record, because 2 is in neither the committed
nor the (initially empty) aborted list.
⑧ The COMMIT, 1 record adds 1 to the committed list.
⑨ Now it encounters the START, 2 record it has been looking for, and moves from
its undo phase 1 into its redo phase 2.
⑩ This redo phase moves forward in the log from this START, 2 record, and
redoes all SET..., 0 and SET..., 1 records, because 0 and 1 form its committed
list.
• In this way, the last checkpoint in the Log file determines its still relevant tail – all
earlier records can be ignored.
• Deleting these no longer relevant records is called truncating the old Log file.
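To make the notion of the relevant tail concrete, here is a minimal sketch for the simplest case. It assumes quiescent checkpoints only, so that no Transaction is active when the checkpoint record is written and everything before it can be ignored. With a nonquiescent checkpoint the tail would instead start at the earliest START record of the Transactions listed in it. The class LogTail and the string representation of records are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative helper, not part of SimpleDB.
final class LogTail {
    static final String CHECKPOINT = "<CHECKPOINT>";

    // Returns the records written after the most recent checkpoint.
    // If the log contains no checkpoint, the whole log is still relevant.
    static List<String> relevantTail(List<String> log) {
        int last = log.lastIndexOf(CHECKPOINT);   // -1 when there is none
        return log.subList(last + 1, log.size());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "<START 1>", "<COMMIT 1>", "<CHECKPOINT>", "<START 2>", "<COMMIT 2>");
        System.out.println(relevantTail(log));
    }
}
```

Truncating the Log then amounts to keeping only the list returned by relevantTail and discarding the rest.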
package simpledb.tx.recovery;
import simpledb.log.BasicLogRecord;
/**
 * The CHECKPOINT log record.
 * @author Edward Sciore
 */
class CheckpointRecord implements LogRecord {
   /**
    * Creates a quiescent checkpoint record.
    */
   public CheckpointRecord() {}
   /**
    * Creates a log record by reading no other values
    * from the basic log record.
    * @param rec the basic log record
    */
   public CheckpointRecord(BasicLogRecord rec) {}
   /**
    * Writes a checkpoint record to the log.
    * This log record contains the CHECKPOINT operator,
    * and nothing else.
    * @return the LSN of the last log value
    */
   public int writeToLog() {
      Object[] rec = new Object[] {CHECKPOINT};
      return logMgr.append(rec);
   }
   public int op() {
      return CHECKPOINT;
   }
   /**
    * Checkpoint records have no associated transaction,
    * and so the method returns a "dummy", negative txid.
    */
   public int txNumber() {
      return -1; // dummy value
   }
   /**
    * Does nothing, because a checkpoint record
    * contains no undo information.
    */
   public void undo(int txnum) {}
   public String toString() {
      return "<CHECKPOINT>";
   }
}
• The Log Manager defined the basic LogIterator, which only moves backwards in
the Log file.
package simpledb.tx.recovery;
import static simpledb.tx.recovery.LogRecord.*;
import java.util.Iterator;
import simpledb.log.BasicLogRecord;
import simpledb.server.SimpleDB;
/**
 * A class that provides the ability to read records
 * from the log in reverse order.
 * Unlike the similar class
 * {@link simpledb.log.LogIterator LogIterator},
 * this class understands the meaning of the log records.
 * @author Edward Sciore
 */
class LogRecordIterator implements Iterator<LogRecord> {
   private Iterator<BasicLogRecord> iter = SimpleDB.logMgr().iterator();
   // Required by the Iterator interface; delegates to the basic iterator
   public boolean hasNext() {
      return iter.hasNext();
   }
   /**
    * Constructs a log record from the values in the
    * current basic log record.
    * The method first reads an integer, which denotes
    * the type of the log record. Based on that type,
    * the method calls the appropriate LogRecord constructor
    * to read the remaining values.
    * @return the next log record, or null if no more records
    */
   public LogRecord next() {
      BasicLogRecord rec = iter.next();
      int op = rec.nextInt();
      switch (op) {
         case CHECKPOINT:
            return new CheckpointRecord(rec);
         case START:
            return new StartRecord(rec);
         case COMMIT:
            return new CommitRecord(rec);
         case ROLLBACK:
            return new RollbackRecord(rec);
         case SETINT:
            return new SetIntRecord(rec);
         case SETSTRING:
            return new SetStringRecord(rec);
         default:
            return null;
      }
   }
   // Log records cannot be removed through this iterator
   public void remove() {
      throw new UnsupportedOperationException();
   }
}
import static simpledb.tx.recovery.LogRecord.*;
import simpledb.file.Block;
import simpledb.buffer.Buffer;
import simpledb.server.SimpleDB;
import java.util.*;
/**
 * The recovery manager. Each transaction has its own recovery manager.
 * @author Edward Sciore
 */
public class RecoveryMgr {
   private int txnum;
   /**
    * Creates a recovery manager for the specified transaction.
    * @param txnum the ID of the specified transaction
    */
   public RecoveryMgr(int txnum) {
      this.txnum = txnum;
      new StartRecord(txnum).writeToLog();
   }
   /**
    * Writes a commit record to the log, and flushes it to disk.
    */
   public void commit() {
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new CommitRecord(txnum).writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }
   /**
    * Writes a rollback record to the log, and flushes it to disk.
    */
   public void rollback() {
      doRollback();
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new RollbackRecord(txnum).writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }
   /**
    * Recovers uncompleted transactions from the log,
    * then writes a quiescent checkpoint record to the log and flushes it.
    */
   public void recover() {
      doRecover();
      SimpleDB.bufferMgr().flushAll(txnum);
      int lsn = new CheckpointRecord().writeToLog();
      SimpleDB.logMgr().flush(lsn);
   }
   /**
    * Writes a setint record to the log, and returns its lsn.
    * Updates to temporary files are not logged; instead, a
    * "dummy" negative lsn is returned.
    * @param buff the buffer containing the page
    * @param offset the offset of the value in the page
    * @param newval the value to be written
    */
   public int setInt(Buffer buff, int offset, int newval) {
      int oldval = buff.getInt(offset);
      Block blk = buff.block();
      if (isTempBlock(blk))
         return -1;
      else
         return new SetIntRecord(txnum, blk, offset, oldval).writeToLog();
   }
   /**
    * Writes a setstring record to the log, and returns its lsn.
    * Updates to temporary files are not logged; instead, a
    * "dummy" negative lsn is returned.
    * @param buff the buffer containing the page
    * @param offset the offset of the value in the page
    * @param newval the value to be written
    */
   public int setString(Buffer buff, int offset, String newval) {
      String oldval = buff.getString(offset);
      Block blk = buff.block();
      if (isTempBlock(blk))
         return -1;
      else
         return new SetStringRecord(txnum, blk, offset, oldval).writeToLog();
   }
   /**
    * Rolls back the transaction.
    * The method iterates through the log records,
    * calling undo() for each log record it finds
    * for the transaction,
    * until it finds the transaction's START record.
    */
   private void doRollback() {
      Iterator<LogRecord> iter = new LogRecordIterator();
      while (iter.hasNext()) {
         LogRecord rec = iter.next();
         if (rec.txNumber() == txnum) {
            if (rec.op() == START)
               return;
            rec.undo(txnum);
         }
      }
   }
   /**
    * Does a complete database recovery.
    * The method iterates through the log records.
    * Whenever it finds a log record for an unfinished
    * transaction, it calls undo() on that record.
    * The method stops when it encounters a CHECKPOINT record
    * or the end of the log.
    */
   private void doRecover() {
      Collection<Integer> finishedTxs = new ArrayList<Integer>();
      Iterator<LogRecord> iter = new LogRecordIterator();
      while (iter.hasNext()) {
         LogRecord rec = iter.next();
         if (rec.op() == CHECKPOINT)
            return;
         if (rec.op() == COMMIT || rec.op() == ROLLBACK)
            finishedTxs.add(rec.txNumber());
         else if (!finishedTxs.contains(rec.txNumber()))
            rec.undo(txnum);
      }
   }
   /**
    * Determines whether a block comes from a temporary file or not.
    */
   private boolean isTempBlock(Block blk) {
      return blk.fileName().startsWith("temp");
   }
}
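The backward pass of doRecover() can be exercised without a disk or a Buffer pool. The following is a minimal sketch under simplifying assumptions: the "database" is a plain map, each update record stores the old value, and the class UndoRecovery with its nested Rec type is hypothetical rather than SimpleDB code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of undo-only recovery; Rec and the map-based database are hypothetical.
final class UndoRecovery {
    static final class Rec {
        final String op;    // "START", "COMMIT", "ROLLBACK", "CHECKPOINT" or "SET"
        final int tx;       // transaction id (unused for CHECKPOINT)
        final String key;   // modified location, for SET records only
        final int oldVal;   // value before the update, for SET records only

        Rec(String op, int tx, String key, int oldVal) {
            this.op = op;
            this.tx = tx;
            this.key = key;
            this.oldVal = oldVal;
        }

        Rec(String op, int tx) {
            this(op, tx, null, 0);
        }
    }

    // Reads the log backwards and restores the old value of every SET record
    // whose transaction has neither committed nor rolled back.
    static void recover(List<Rec> log, Map<String, Integer> db) {
        Set<Integer> finished = new HashSet<>();
        for (int i = log.size() - 1; i >= 0; i--) {
            Rec rec = log.get(i);
            if (rec.op.equals("CHECKPOINT"))
                return;                          // nothing older needs undoing
            if (rec.op.equals("COMMIT") || rec.op.equals("ROLLBACK"))
                finished.add(rec.tx);
            else if (rec.op.equals("SET") && !finished.contains(rec.tx))
                db.put(rec.key, rec.oldVal);     // undo the update
        }
    }
}
```

Because the pass runs from the newest record to the oldest, a COMMIT or ROLLBACK record is always seen before the updates of its Transaction, which is why a single set of finished ids suffices.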
• The task of this Manager is to coordinate all the concurrently running Transaction
threads.
– The Page get and set methods were synchronized to ensure that each call
is finished before the next starts.
– Here we coordinate which Transactions are permitted to make these calls, and
for which Pages.
The idea is that $H_t$ traces all the relevant I/O operations performed by Transaction t
in the order in which they happen.
• For instance
$H_1 = R_1(p)\,W_1(q)$
says that transaction 1 first reads p and then writes q.
We are not interested in what it reads and writes, just in the order in which these
operations happen.
• In this way, history $H_t$ simplifies what one Transaction t does, restricted only to
what is interesting for Concurrency Management.
– This S is a big string which consists of the characters in these smaller strings
shuffled in some way, but keeping the characters of each $H_i$ in their original
order.
– In other words, if we delete from S all characters for the other Transactions
$j \neq i$, then we get $H_i$.
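The deletion described above is a projection of the schedule onto one Transaction. A minimal sketch follows; the class Schedules, its nested type Op and the sample schedule are hypothetical names and data chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper, not part of SimpleDB.
final class Schedules {
    // One I/O operation such as R1(p): kind is 'R' or 'W', tx is the
    // Transaction number, and item names the page or Buffer used.
    static final class Op {
        final char kind;
        final int tx;
        final String item;

        Op(char kind, int tx, String item) {
            this.kind = kind;
            this.tx = tx;
            this.item = item;
        }

        @Override
        public String toString() {
            return "" + kind + tx + "(" + item + ")";
        }
    }

    // Deletes from the schedule every operation of the other Transactions,
    // keeping the remaining operations in their original order.
    static List<Op> project(List<Op> schedule, int tx) {
        List<Op> history = new ArrayList<>();
        for (Op op : schedule)
            if (op.tx == tx)
                history.add(op);
        return history;
    }

    static String show(List<Op> ops) {
        StringBuilder sb = new StringBuilder();
        for (Op op : ops)
            sb.append(op);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Op> s = Arrays.asList(new Op('R', 1, "p"), new Op('R', 2, "p"),
                                   new Op('W', 1, "q"), new Op('W', 2, "p"));
        System.out.println(show(project(s, 1)));   // prints R1(p)W1(q)
        System.out.println(show(project(s, 2)));   // prints R2(p)W2(p)
    }
}
```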
• A schedule S is serial if every history $H_i$ appears as a consecutive substring. For
instance the serial schedules for the $H_1$ and $H_2$ in Eq. (11) are
where
• That is, in a serial schedule S the RDBMS executes each transaction i entirely from
its beginning to its end before beginning another.
• In other words, a serial schedule S represents the case with no concurrency among
its Transactions.
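Whether a given schedule is serial depends only on which Transaction performs each operation, not on what the operation is. The following minimal sketch therefore represents a schedule just by the sequence of Transaction numbers of its operations; the class SerialCheck is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative helper, not part of SimpleDB.
final class SerialCheck {
    // txOfOp[k] is the Transaction which performs the k-th operation.
    // The schedule is serial if no Transaction reappears after another
    // Transaction has started in between.
    static boolean isSerial(int[] txOfOp) {
        Set<Integer> finished = new HashSet<>();
        boolean started = false;
        int current = 0;
        for (int tx : txOfOp) {
            if (started && tx == current)
                continue;                 // same Transaction keeps running
            if (started)
                finished.add(current);    // the previous Transaction is over
            if (finished.contains(tx))
                return false;             // tx was interrupted by another one
            current = tx;
            started = true;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSerial(new int[] {1, 1, 2, 2}));   // prints true
        System.out.println(isSerial(new int[] {1, 2, 1, 2}));   // prints false
    }
}
```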
• Hence serial schedules S are obviously correct. Which non-serial schedules $S'$ are
also correct? After all, these $S'$ are the concurrent executions of the RDBMS which
the Concurrency Manager can permit.
Conflict Serializability
• The most common concept of this “equivalence” between schedules is conflict
equivalence.
• Let us denote that schedules $\Gamma$ and $\Delta$ are conflict equivalent by $\Gamma \sim \Delta$, and define
this relation ‘$\sim$’ with suitable rules.
• One rule is
$\Gamma\,R_t(p)\,R_u(q)\,\Delta \sim \Gamma\,R_u(q)\,R_t(p)\,\Delta$ if $t \neq u$ (12)
or “the order in which two adjacent reads by two different Transactions happen
does not matter” because each Transaction t or u reads the same contents from the
past $\Gamma$ in both sides.
• However, rule (12) does not hold for just one Transaction t = u, because that would
change the history $H_t$ of this Transaction t.
• Another rule is
$\Gamma\,R_t(p)\,W_u(q)\,\Delta \sim \Gamma\,W_u(q)\,R_t(p)\,\Delta$ if $t \neq u$ and $p \neq q$ (13)
or “the order of adjacent reads and writes does not matter, if they use different
Buffers”.
• However, rule (13) certainly does not hold if they use the same Buffer p = q:
Left side says that Transaction t reads the previous contents of this Buffer p before
Transaction u overwrites them.
Right side says that Transaction u overwrites the contents of this Buffer p and
Transaction t reads these new contents.
• We say that this situation is a read-write conflict between these two Transactions t
and u since these two sides disagree on what contents of this Buffer p Transaction t
saw.
• A third rule is
$\Gamma\,W_t(p)\,W_u(q)\,\Delta \sim \Gamma\,W_u(q)\,W_t(p)\,\Delta$ if $t \neq u$ and $p \neq q$ (14)
or “the order of two adjacent writes does not matter, if they use different Buffers”.
• Again, rule (14) does not hold if they use the same Buffer p = q:
Left side says that Transaction u writes the contents for the future $\Delta$.
Right side says that Transaction t writes the contents for the future $\Delta$.
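Rules (12)–(14) together decide when two adjacent operations may change places. The following is a minimal sketch of that decision as one predicate, written from the verbal descriptions of the rules; the class Conflicts and the method canSwap are hypothetical names.

```java
// Illustrative helper, not part of SimpleDB.
final class Conflicts {
    // Operation 1 is kind1 ('R' or 'W') by Transaction tx1 on item1, and it is
    // immediately followed by operation 2. Returns true if swapping the two
    // yields a conflict equivalent schedule.
    static boolean canSwap(char kind1, int tx1, String item1,
                           char kind2, int tx2, String item2) {
        if (tx1 == tx2)
            return false;                  // would change that Transaction's history
        if (kind1 == 'R' && kind2 == 'R')
            return true;                   // rule (12): two reads never conflict
        return !item1.equals(item2);       // rules (13) and (14): different Buffers
    }

    public static void main(String[] args) {
        System.out.println(canSwap('R', 1, "p", 'R', 2, "p"));   // prints true
        System.out.println(canSwap('R', 1, "p", 'W', 2, "p"));   // prints false
        System.out.println(canSwap('W', 1, "p", 'W', 2, "q"));   // prints true
    }
}
```

A pair for which canSwap returns false although the Transactions differ is exactly a read-write or write-write conflict on a shared Buffer.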
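• We can extend this ‘$\sim$’ into an equivalence relation with the familiar (?) rules
$\Gamma \sim \Gamma$ (reflexivity)
$\Gamma \sim \Delta$ if and only if $\Delta \sim \Gamma$ (symmetry)
and
if $\Gamma \sim \Delta$ and $\Delta \sim \Theta$ then $\Gamma \sim \Theta$ (transitivity).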
• Here this notion is “these schedules Γ and ∆ perform their I/O operations in the
same order when that matters”.
• Proving correctness is especially important for algorithms whose testing is difficult
– and testing a Concurrency Manager is difficult!
• In rules (12)–(14)
because although they turn the adjacent pair around, this does not turn any edges
around, since this pair did not produce any edges.
$S' = H_1 H_2 H_3 \ldots H_n$
• On the other hand, if G is any acyclic graph, then we can build a serial schedule $S'$
which has it as its constraint graph.
for which we have algorithms from the Data Structures II course (“Tietorakenteet II”
in Finnish).
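A sketch of such a test follows. It assumes that the constraint graph has an edge from t to u whenever an operation of t precedes a conflicting operation of u, and it uses topological sorting in the form of Kahn's algorithm; that choice of algorithm is an assumption of this sketch, since the text only refers to the Data Structures II course. The class ConstraintGraph and its nested type Op are hypothetical, and Transactions are assumed to be numbered 0, ..., n-1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative helper, not part of SimpleDB.
final class ConstraintGraph {
    static final class Op {
        final char kind;     // 'R' or 'W'
        final int tx;        // Transaction number in 0..n-1
        final String item;   // the Buffer used

        Op(char kind, int tx, String item) {
            this.kind = kind;
            this.tx = tx;
            this.item = item;
        }
    }

    // Returns an order of the n Transactions which respects every edge of the
    // constraint graph of the schedule, or null if the graph has a cycle.
    static List<Integer> serialOrder(List<Op> schedule, int n) {
        boolean[][] edge = new boolean[n][n];
        for (int i = 0; i < schedule.size(); i++) {
            for (int j = i + 1; j < schedule.size(); j++) {
                Op a = schedule.get(i), b = schedule.get(j);
                boolean conflict = a.tx != b.tx && a.item.equals(b.item)
                        && (a.kind == 'W' || b.kind == 'W');
                if (conflict)
                    edge[a.tx][b.tx] = true;     // a.tx must come before b.tx
            }
        }
        int[] indegree = new int[n];
        for (int t = 0; t < n; t++)
            for (int u = 0; u < n; u++)
                if (edge[t][u])
                    indegree[u]++;
        Deque<Integer> ready = new ArrayDeque<>();
        for (int t = 0; t < n; t++)
            if (indegree[t] == 0)
                ready.add(t);
        List<Integer> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            int t = ready.remove();
            order.add(t);
            for (int u = 0; u < n; u++)
                if (edge[t][u] && --indegree[u] == 0)
                    ready.add(u);
        }
        return order.size() == n ? order : null;  // null: some cycle remained
    }
}
```

If serialOrder returns a list, executing the histories in that order gives a serial schedule with the same conflicting pairs in the same order.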
Two-Phase Locking (Weikum and Vossen, 2001, Chapters 4.3.1–4.3.4) (Sciore, 2008,
Chapters 8.2.2–8.2.3 and 14)
• One common way to implement this summary (16) is by attaching Locks on disk
Blocks.
• A Transaction t attaches a
shared lock (slock) if t only wants to read (but not write) Block b
exclusive lock (xlock) if t wants to (read and) write Block b.
These 2 basic kinds of locks are enough for correct Concurrency Management.
• An RDBMS can also have more kinds of locks to make its Concurrency Management
more flexible, but we concentrate only on these 2 basic kinds.
In other words,
many Transactions can read the same Block b at the same time, but if
one Transaction wants to write b, then it must be the only Transaction using b
at that time.
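This sharing rule is often written as a lock compatibility test. A minimal sketch, in which the class LockCompat and its enum Mode are hypothetical names:

```java
// Illustrative helper, not part of SimpleDB.
final class LockCompat {
    enum Mode { SLOCK, XLOCK }

    // Can another Transaction be granted the requested lock on a Block on
    // which some Transaction already holds the given lock?
    static boolean compatible(Mode held, Mode requested) {
        // Only readers can share a Block: a writer must be alone.
        return held == Mode.SLOCK && requested == Mode.SLOCK;
    }

    public static void main(String[] args) {
        System.out.println(compatible(Mode.SLOCK, Mode.SLOCK));   // prints true
        System.out.println(compatible(Mode.SLOCK, Mode.XLOCK));   // prints false
    }
}
```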
• The constraint graph becomes the waits-for graph telling which Transactions are
now waiting for which other Transactions to unlock the Locks for the Blocks they
need.
For instance, if a Transaction t holds an xlock on a Buffer b, another Transaction u
which needs b must wait, because its $R_u(b)$ or $W_u(b)$ operation conflicts with the $W_t(b)$
operation for which Transaction t attached its xlock on b.
• It turns out that the Concurrency Manager must also coordinate how each
Transaction uses its own Locks.
Requirement 12 (two-phase locking (2PL)). After a Transaction has performed its first
unlock operation, it cannot perform any more locking operations.
① It sets the right Lock for each Block it needs, and processes their contents.
② It starts unlocking them only when it is certain that it will not need any more
Blocks to process.
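Requirement 12 can be checked mechanically on the sequence of locking actions of one Transaction. The following minimal sketch encodes that sequence as a string of 'L' (lock) and 'U' (unlock) characters; this encoding and the class TwoPhase are assumptions made for illustration.

```java
// Illustrative helper, not part of SimpleDB.
final class TwoPhase {
    // Returns true if no lock operation follows the first unlock operation.
    static boolean obeys2PL(String actions) {
        boolean unlocking = false;        // becomes true at the first unlock
        for (char c : actions.toCharArray()) {
            if (c == 'U')
                unlocking = true;
            else if (c == 'L' && unlocking)
                return false;             // locking again after an unlock
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(obeys2PL("LLLUUU"));   // prints true
        System.out.println(obeys2PL("LULU"));     // prints false
    }
}
```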
– These are situations where aborting one Transaction causes aborting others
too:
Figure 46: Locking example. (Sciore, 2008)
Figure 47: Locking and unlocking rules. (Sciore, 2008)
Strict 2PL (S2PL) where a Transaction keeps all its xlocks until it terminates.
– In the scenario above, t would not unlock z, and so avoids this write-read
conflict with u.
– S2PL avoids also write-write conflicts between running Transactions.
Strong 2PL (SS2PL) where a Transaction keeps all its Locks until it terminates.
– SS2PL avoids all conflicts between running Transactions – including also
read-write.
– SS2PL is also commit order preserving (unlike S2PL):
The order in which Transactions commit is also their serial schedule.
– SimpleDB uses SS2PL.
– SS2PL is given as Figure 47.
Isolation Levels and Locking (Sciore, 2008, Chapters 8.2.2–8.2.3 and 14.4.7)
• The Lock Usage column of Figure 12 explains the connections between transaction
isolation levels and a Locking implementation.
• These levels relax the rules on how Transactions can use slocks for reading from
Blocks – when they compute results for queries.
• In contrast, the RDBMS must not relax the rules on how Transactions can use xlocks
when they write into Blocks – otherwise they might corrupt the database!
• Phantoms were new rows which appeared into the database during the current
Transaction t:
– They can appear, because t cannot slock a new Block n before another
Transaction u has appended it into the database – and by then u may have
added new phantom rows into n.
• The RDBMS implementation can avoid these phantoms by saying that the end-of-file
(eof) marker is another “Block” which must be Locked too. Then
• “Releasing slocks early” means unlocking old slocks before locking new slocks
– violating 2PL.
• But if a Transaction does not use slocks at all, then there are no guarantees on
what it sees – hence read uncommitted.
Deadlock Handling
• When we encountered the same problem in the Buffer Manager, the SimpleDB
solution was brutally straightforward:
– If a Transaction had been waiting for any Buffer to become unpinned for
10 seconds, it was assumed to be deadlocked and was aborted.
– This solution was appropriate there, because the DBA can make these aborts
less frequent simply by adding more RAM to the Buffer pool of the RDBMS
server process.
• In contrast, here Transactions are waiting for Locks on specific disk Blocks.
– This waiting depends on the queries and the data – it cannot be alleviated
by tuning the system parameters, because the bottleneck is the actual Blocks
themselves.
– Hence this Concurrency Manager should spend more effort in choosing which
Transactions it aborts than the Buffer Manager did.
– This effort can be based on the waits-for graph.
– However, maintaining this waits-for graph is somewhat costly in terms of both
RAM and time.
– Therefore simpler ways which do not need this graph are often preferred.
– For instance, we may base them on Transaction timestamps which indicate
when they started instead of the waits-for graph:
If we always prefer the older Transaction, then this graph would clearly be
acyclic.
Wait-Die: If this Transaction u requests a Lock which conflicts with another Lock
already held by another Transaction t, then...
if u started before t
    u waits for t to release its Lock
else abort u.
That is, u either waits or dies by suicide.
Wound-Wait: We can also do the opposite to Wait-Die with...
if u started before t
    abort t so that u may get its Lock
else u waits for t to release its Lock.
That is, u either murders t or waits.
Both avoid aborting the Transaction which has been running longer, because that
would mean losing all the work which it has done so far.
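The two policies differ only in what happens in each branch of the timestamp comparison. A minimal sketch follows, assuming that start times are distinct numbers and that a smaller number means an older Transaction; the class DeadlockPolicy and its enum Action are hypothetical names.

```java
// Illustrative helper, not part of SimpleDB.
final class DeadlockPolicy {
    enum Action { WAIT, ABORT_REQUESTER, ABORT_HOLDER }

    // Wait-Die: the requester u waits if it is older than the holder t,
    // otherwise u itself is aborted.
    static Action waitDie(long requesterStart, long holderStart) {
        return requesterStart < holderStart ? Action.WAIT : Action.ABORT_REQUESTER;
    }

    // Wound-Wait: the requester u aborts the holder t if u is older than t,
    // otherwise u waits.
    static Action woundWait(long requesterStart, long holderStart) {
        return requesterStart < holderStart ? Action.ABORT_HOLDER : Action.WAIT;
    }

    public static void main(String[] args) {
        System.out.println(waitDie(100, 200));     // older requester: WAIT
        System.out.println(waitDie(200, 100));     // younger requester: ABORT_REQUESTER
        System.out.println(woundWait(100, 200));   // older requester: ABORT_HOLDER
        System.out.println(woundWait(200, 100));   // younger requester: WAIT
    }
}
```

In both policies the Transaction which gets aborted is the younger of the two, and every wait goes in one direction of age only, so the waits cannot form a cycle.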
• On the other hand, if the waits-for graph is used instead, then the chosen Transactions
will have to be murdered.
• Lock waiting should also be fair, so that no Transaction will wait for a Block
indefinitely, because it is always given to other waiting Transactions instead.
• Despite its shortcomings, SimpleDB uses the same 10-second waiting approach with
a single Java lock to break deadlocks in both Buffers and Locking. It is shown in
Figure 48.
package s i m p l e d b . t x . c o n c u r r e n c y ;
import s i m p l e d b . f i l e . B l o c k ;
import j a v a . u t i l . ∗ ;
/∗ ∗
∗ The l o c k t a b l e , w h i c h p r o v i d e s m e t h o d s t o l o c k and u n l o c k b l o c k s .
∗ I f a t r a n s a c t i o n r e q u e s t s a l o c k t h a t c a u s e s a c o n f l i c t w i t h an
∗ e x i s t i n g l o c k , t h e n t h a t t r a n s a c t i o n i s p l a c e d on a w a i t l i s t .
117
Figure 48: The time-limit strategy. (Sciore, 2008)
∗ T h e r e i s o n l y one w a i t l i s t f o r a l l b l o c k s .
 * When the last lock on a block is unlocked, then all transactions
 * are removed from the wait list and rescheduled.
 * If one of those transactions discovers that the lock it is waiting for
 * is still locked, it will place itself back on the wait list.
 * @author Edward Sciore
 */
class LockTable {
   private static final long MAX_TIME = 10000; // 10 seconds
   private Map<Block,Integer> locks = new HashMap<Block,Integer>();

   /**
    * Grants an SLock on the specified block.
    * If an XLock exists when the method is called,
    * then the calling thread will be placed on a wait list
    * until the lock is released.
    * If the thread remains on the wait list for a certain
    * amount of time (currently 10 seconds),
    * then an exception is thrown.
    * @param blk a reference to the disk block
    */
   public synchronized void sLock(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         while (hasXlock(blk) && !waitingTooLong(timestamp))
            wait(MAX_TIME);
         if (hasXlock(blk))
            throw new LockAbortException();
         int val = getLockVal(blk); // will not be negative
         locks.put(blk, val+1);
      }
      catch (InterruptedException e) {
         throw new LockAbortException();
      }
   }

   /**
    * Grants an XLock on the specified block.
    * If a lock of any type exists when the method is called,
    * then the calling thread will be placed on a wait list
    * until the locks are released.
    * If the thread remains on the wait list for a certain
    * amount of time (currently 10 seconds),
    * then an exception is thrown.
    * @param blk a reference to the disk block
    */
   synchronized void xLock(Block blk) {
      try {
         long timestamp = System.currentTimeMillis();
         while (hasOtherSLocks(blk) && !waitingTooLong(timestamp))
            wait(MAX_TIME);
         if (hasOtherSLocks(blk))
            throw new LockAbortException();
         locks.put(blk, -1);
      }
      catch (InterruptedException e) {
         throw new LockAbortException();
      }
   }

   /**
    * Releases a lock on the specified block.
    * If this lock is the last lock on that block,
    * then the waiting transactions are notified.
    * @param blk a reference to the disk block
    */
   synchronized void unlock(Block blk) {
      int val = getLockVal(blk);
      if (val > 1)
         locks.put(blk, val-1);
      else {
         locks.remove(blk);
         notifyAll();
      }
   }

   private boolean hasXlock(Block blk) {
      return getLockVal(blk) < 0;
   }

   private boolean hasOtherSLocks(Block blk) {
      return getLockVal(blk) > 1;
   }

   private boolean waitingTooLong(long starttime) {
      return System.currentTimeMillis() - starttime > MAX_TIME;
   }

   // 0 = unlocked, n > 0 = n shared locks, negative = one exclusive lock
   private int getLockVal(Block blk) {
      Integer ival = locks.get(blk);
      return (ival == null) ? 0 : ival.intValue();
   }
}

/**
 * A runtime exception indicating that the transaction
 * needs to abort because a lock could not be obtained.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class LockAbortException extends RuntimeException {
   public LockAbortException() {
   }
}
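The integer encoding of lock states in the LockTable listing can be tried out in isolation. The following is only a minimal sketch under stated assumptions: the class LockValueSketch and its try... methods are hypothetical, blocks are identified by plain strings, and the methods return false at the points where the listing above would wait instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, non-blocking illustration of the lock-value encoding:
// absent (0) = unlocked, n > 0 = n shared locks, -1 = one exclusive lock.
class LockValueSketch {
    private final Map<String, Integer> locks = new HashMap<>();

    int lockVal(String blk) {
        return locks.getOrDefault(blk, 0);
    }

    // Returns false where the listing's sLock would wait.
    boolean trySLock(String blk) {
        int val = lockVal(blk);
        if (val < 0) return false;           // an XLock exists
        locks.put(blk, val + 1);
        return true;
    }

    // The caller is assumed to hold an SLock already,
    // which is what ConcurrencyMgr.xLock arranges before upgrading.
    boolean tryXLock(String blk) {
        if (lockVal(blk) > 1) return false;  // other SLocks exist
        locks.put(blk, -1);
        return true;
    }

    void unlock(String blk) {
        int val = lockVal(blk);
        if (val > 1) locks.put(blk, val - 1);
        else locks.remove(blk);
    }

    public static void main(String[] args) {
        LockValueSketch t = new LockValueSketch();
        t.trySLock("b1");                      // reader A
        t.trySLock("b1");                      // reader B
        System.out.println(t.tryXLock("b1"));  // false: A cannot upgrade yet
        t.unlock("b1");                        // B releases
        System.out.println(t.tryXLock("b1"));  // true: A upgrades
        System.out.println(t.lockVal("b1"));   // -1
    }
}
```

The upgrade succeeds exactly when the lock value is 1, that is, when the only shared lock is the caller's own.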
• The implementation for the Concurrency Manager maintains two kinds of information:
package simpledb.tx;

/**
 * Provides transaction management for clients,
 * ensuring that all transactions are serializable, recoverable,
 * and in general satisfy the ACID properties.
 * @author Edward Sciore
 */
public class Transaction {
   private static int nextTxNum = 0;
   private static final int END_OF_FILE = -1;
   private RecoveryMgr recoveryMgr;
   private ConcurrencyMgr concurMgr;
   private int txnum;
   private BufferList myBuffers = new BufferList();

   /**
    * Creates a new transaction and its associated
    * recovery and concurrency managers.
    * This constructor depends on the file, log, and buffer
    * managers that it gets from the class
    * {@link simpledb.server.SimpleDB}.
    * Those objects are created during system initialization.
    * Thus this constructor cannot be called until either
    * {@link simpledb.server.SimpleDB#init(String)} or
    * {@link simpledb.server.SimpleDB#initFileLogAndBufferMgr(String)}
    * is called first.
    */
   public Transaction() {
      txnum = nextTxNumber();
      recoveryMgr = new RecoveryMgr(txnum);
      concurMgr = new ConcurrencyMgr();
   }

   /**
    * Commits the current transaction.
    * Flushes all modified buffers (and their log records),
    * writes and flushes a commit record to the log,
    * releases all locks, and unpins any pinned buffers.
    */
   public void commit() {
      recoveryMgr.commit();
      concurMgr.release();
      myBuffers.unpinAll();
      System.out.println("transaction " + txnum + " committed");
   }

   /**
    * Rolls back the current transaction.
    * Undoes any modified values,
    * flushes those buffers,
    * writes and flushes a rollback record to the log,
    * releases all locks, and unpins any pinned buffers.
    */
   public void rollback() {
      recoveryMgr.rollback();
      concurMgr.release();
      myBuffers.unpinAll();
      System.out.println("transaction " + txnum + " rolled back");
   }

   /**
    * Flushes all modified buffers.
    * Then goes through the log, rolling back all
    * uncommitted transactions. Finally,
    * writes a quiescent checkpoint record to the log.
    * This method is called only during system startup,
    * before user transactions begin.
    */
   public void recover() {
      SimpleDB.bufferMgr().flushAll(txnum);
      recoveryMgr.recover();
   }

   /**
    * Pins the specified block.
    * The transaction manages the buffer for the client.
    * @param blk a reference to the disk block
    */
   public void pin(Block blk) {
      myBuffers.pin(blk);
   }

   /**
    * Unpins the specified block.
    * The transaction looks up the buffer pinned to this block,
    * and unpins it.
    * @param blk a reference to the disk block
    */
   public void unpin(Block blk) {
      myBuffers.unpin(blk);
   }

   /**
    * Returns the integer value stored at the
    * specified offset of the specified block.
    * The method first obtains an SLock on the block,
    * then it calls the buffer to retrieve the value.
    * @param blk a reference to a disk block
    * @param offset the byte offset within the block
    * @return the integer stored at that offset
    */
   public int getInt(Block blk, int offset) {
      concurMgr.sLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      return buff.getInt(offset);
   }

   /**
    * Returns the string value stored at the
    * specified offset of the specified block.
    * The method first obtains an SLock on the block,
    * then it calls the buffer to retrieve the value.
    * @param blk a reference to a disk block
    * @param offset the byte offset within the block
    * @return the string stored at that offset
    */
   public String getString(Block blk, int offset) {
      concurMgr.sLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      return buff.getString(offset);
   }

   /**
    * Stores an integer at the specified offset
    * of the specified block.
    * The method first obtains an XLock on the block.
    * It then reads the current value at that offset,
    * puts it into an update log record, and
    * writes that record to the log.
    * Finally, it calls the buffer to store the value,
    * passing in the LSN of the log record and the transaction's id.
    * @param blk a reference to the disk block
    * @param offset a byte offset within that block
    * @param val the value to be stored
    */
   public void setInt(Block blk, int offset, int val) {
      concurMgr.xLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      int lsn = recoveryMgr.setInt(buff, offset, val);
      buff.setInt(offset, val, txnum, lsn);
   }

   /**
    * Stores a string at the specified offset
    * of the specified block.
    * The method first obtains an XLock on the block.
    * It then reads the current value at that offset,
    * puts it into an update log record, and
    * writes that record to the log.
    * Finally, it calls the buffer to store the value,
    * passing in the LSN of the log record and the transaction's id.
    * @param blk a reference to the disk block
    * @param offset a byte offset within that block
    * @param val the value to be stored
    */
   public void setString(Block blk, int offset, String val) {
      concurMgr.xLock(blk);
      Buffer buff = myBuffers.getBuffer(blk);
      int lsn = recoveryMgr.setString(buff, offset, val);
      buff.setString(offset, val, txnum, lsn);
   }

   /**
    * Returns the number of blocks in the specified file.
    * This method first obtains an SLock on the
    * "end of the file", before asking the file manager
    * to return the file size.
    * @param filename the name of the file
    * @return the number of blocks in the file
    */
   public int size(String filename) {
      Block dummyblk = new Block(filename, END_OF_FILE);
      concurMgr.sLock(dummyblk);
      return SimpleDB.fileMgr().size(filename);
   }

   /**
    * Appends a new block to the end of the specified file
    * and returns a reference to it.
    * This method first obtains an XLock on the
    * "end of the file", before performing the append.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the new page
    * @return a reference to the newly-created disk block
    */
   public Block append(String filename, PageFormatter fmtr) {
      Block dummyblk = new Block(filename, END_OF_FILE);
      concurMgr.xLock(dummyblk);
      Block blk = myBuffers.pinNew(filename, fmtr);
      unpin(blk);
      return blk;
   }

   private static synchronized int nextTxNumber() {
      nextTxNum++;
      System.out.println("new transaction: " + nextTxNum);
      return nextTxNum;
   }
}
import simpledb.file.Block;
import java.util.*;

/**
 * The concurrency manager for the transaction.
 * Each transaction has its own concurrency manager.
 * The concurrency manager keeps track of which locks the
 * transaction currently has, and interacts with the
 * global lock table as needed.
 * @author Edward Sciore
 */
public class ConcurrencyMgr {
   /**
    * The global lock table. This variable is static because all transactions
    * share the same table.
    */
   private static LockTable locktbl = new LockTable();
   private Map<Block,String> locks = new HashMap<Block,String>();

   /**
    * Obtains an SLock on the block, if necessary.
    * The method will ask the lock table for an SLock
    * if the transaction currently has no locks on that block.
    * @param blk a reference to the disk block
    */
   public void sLock(Block blk) {
      if (locks.get(blk) == null) {
         locktbl.sLock(blk);
         locks.put(blk, "S");
      }
   }

   /**
    * Obtains an XLock on the block, if necessary.
    * If the transaction does not have an XLock on that block,
    * then the method first gets an SLock on that block
    * (if necessary), and then upgrades it to an XLock.
    * @param blk a reference to the disk block
    */
   public void xLock(Block blk) {
      if (!hasXLock(blk)) {
         sLock(blk);
         locktbl.xLock(blk);
         locks.put(blk, "X");
      }
   }

   /**
    * Releases all locks by asking the lock table to
    * unlock each one.
    */
   public void release() {
      for (Block blk : locks.keySet())
         locktbl.unlock(blk);
      locks.clear();
   }

   private boolean hasXLock(Block blk) {
      String locktype = locks.get(blk);
      return locktype != null && locktype.equals("X");
   }
}

/**
 * Manages the transaction's currently-pinned buffers.
 * @author Edward Sciore
 */
class BufferList {
   private Map<Block,Buffer> buffers = new HashMap<Block,Buffer>();
   private List<Block> pins = new ArrayList<Block>();
   private BufferMgr bufferMgr = SimpleDB.bufferMgr();

   /**
    * Returns the buffer pinned to the specified block.
    * The method returns null if the transaction has not
    * pinned the block.
    * @param blk a reference to the disk block
    * @return the buffer pinned to that block
    */
   Buffer getBuffer(Block blk) {
      return buffers.get(blk);
   }

   /**
    * Pins the block and keeps track of the buffer internally.
    * @param blk a reference to the disk block
    */
   void pin(Block blk) {
      Buffer buff = bufferMgr.pin(blk);
      buffers.put(blk, buff);
      pins.add(blk);
   }

   /**
    * Appends a new block to the specified file
    * and pins it.
    * @param filename the name of the file
    * @param fmtr the formatter used to initialize the new page
    * @return a reference to the newly-created block
    */
   Block pinNew(String filename, PageFormatter fmtr) {
      Buffer buff = bufferMgr.pinNew(filename, fmtr);
      Block blk = buff.block();
      buffers.put(blk, buff);
      pins.add(blk);
      return blk;
   }

   /**
    * Unpins the specified block.
    * @param blk a reference to the disk block
    */
   void unpin(Block blk) {
      Buffer buff = buffers.get(blk);
      bufferMgr.unpin(buff);
      pins.remove(blk);
      if (!pins.contains(blk))
         buffers.remove(blk);
   }

   /**
    * Unpins any buffers still pinned by this transaction.
    */
   void unpinAll() {
      for (Block blk : pins) {
         Buffer buff = buffers.get(blk);
         bufferMgr.unpin(buff);
      }
      buffers.clear();
      pins.clear();
   }
}
Multiversion Locking (Sciore, 2008, Chapter 14.4.6) (Weikum and Vossen, 2001,
Chapter 5)
• Many Transactions are read-only.
– This is not the same thing as relaxing the Transaction isolation level and
accepting the risk of wrong answers.
– Instead, this alternative protocol is still correct but faster than general Locking,
because it knows that the Transaction is read-only.
• One such specialized protocol for read-only Transactions is Multiversion Locking.
① Each read-write Transaction follows the normal Locking protocol, to ensure
that each version of a Block has only one writer and timestamp.
② When a read-write Transaction t writes a Block b and commits, this creates
a new version of b with the current timestamp.
③ The RDBMS maintains many versions of the same Block b with different
timestamps:
❶ one version timestamped with the commit time of t,
❷ another with the commit time of t′,
❸ a third with the commit time of t″, . . .
for its already committed writers t, t′, t″, . . .
④ When a read-only Transaction u requests a Block b, the RDBMS gives it the
version with the largest timestamp < the starting time of u, that is, the version of b
which was newest when u started.
⑤ This read-only Transaction u can then read its own version of Block b without
Locking:
That version was made by the most recently committed writer of b before u
started, so it no longer has any writer still running.
Figure 49: Multiversioning example. (Sciore, 2008)
Figure 49 gives an example.
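The rule for choosing which version a read-only Transaction sees can be sketched with a sorted map. This is only an illustration under assumptions: the class VersionStoreSketch and its method names are hypothetical, timestamps are plain long values, and Block contents are represented by strings.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: all committed versions of one Block b, keyed by commit time.
class VersionStoreSketch {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Called when a read-write Transaction that wrote b commits.
    void commitVersion(long commitTime, String contents) {
        versions.put(commitTime, contents);
    }

    // The version with the largest timestamp strictly less than the
    // starting time of the read-only Transaction, or null if there is none.
    String versionFor(long startTime) {
        Map.Entry<Long, String> e = versions.lowerEntry(startTime);
        return (e == null) ? null : e.getValue();
    }

    public static void main(String[] args) {
        VersionStoreSketch b = new VersionStoreSketch();
        b.commitVersion(10, "written by t");
        b.commitVersion(20, "written by t'");
        b.commitVersion(30, "written by t''");
        System.out.println(b.versionFor(25));  // written by t'
        System.out.println(b.versionFor(5));   // null
    }
}
```

A reader that started at time 25 keeps seeing the version committed at time 20, even after the writer t″ commits at time 30.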
• Physically the RDBMS does not have to maintain each version in ③ explicitly as
separate disk Blocks.
It can do so, but then it must also garbage collect the Blocks for those versions
which are no longer needed.
• Instead, the RDBMS can reconstruct the correct version of the Block b requested
by the read-only Transaction u in ④.
– Recall that recovery reconstructs all the Buffers written by those Transactions
which committed before the shutdown.
– Here we reconstruct the Buffer f for this particular Block b written by those
Transactions which committed before u started.
❶ Allocate a new RAM Buffer f which u pins and initialize it to have the current
contents of the requested Block b.
This f does not have to be pinned to any Block because it will not be saved
to disk.
❷ Execute this variant of Figure 39:
– Construct the list of Transactions which either have committed after the
beginning of u or are still executing, and
– undo into f what they have done to b.
❸ Now f is the version of Block b which u requested, so u can read f without
locking in ⑤, and unpin f afterwards.
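The reconstruction steps above can be sketched as follows. This is not the algorithm of Figure 39 itself but a simplified illustration under assumptions: the Block is modelled as an int array, the log holds only update records for this one Block in chronological order, and the names UndoReconstructionSketch, Update and versionAt are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UndoReconstructionSketch {
    // One logged update to Block b: who overwrote which offset, and the old value.
    static final class Update {
        final int txnum, offset, oldValue;
        Update(int txnum, int offset, int oldValue) {
            this.txnum = txnum; this.offset = offset; this.oldValue = oldValue;
        }
    }

    // commitTime maps a transaction number to its commit timestamp;
    // a missing entry means that the Transaction is still executing.
    static int[] versionAt(int[] current, List<Update> log,
                           Map<Integer, Long> commitTime, long startOfU) {
        int[] f = current.clone();                 // private copy, never saved to disk
        for (int i = log.size() - 1; i >= 0; i--) { // newest update first
            Update up = log.get(i);
            Long c = commitTime.get(up.txnum);
            boolean tooNew = (c == null) || c >= startOfU;
            if (tooNew) f[up.offset] = up.oldValue; // undo into f
        }
        return f;
    }

    public static void main(String[] args) {
        List<Update> log = new ArrayList<>();
        log.add(new Update(1, 0, 10));  // T1 changed offset 0 from 10 to 20
        log.add(new Update(2, 0, 20));  // T2 changed offset 0 from 20 to 30
        log.add(new Update(3, 1, 5));   // T3 changed offset 1 from 5 to 7
        Map<Integer, Long> commitTime = new HashMap<>();
        commitTime.put(1, 5L);          // T1 committed before u started
        commitTime.put(2, 15L);         // T2 committed after u started
                                        // T3 is still executing
        int[] f = versionAt(new int[] {30, 7}, log, commitTime, 10L);
        System.out.println(f[0] + " " + f[1]);  // 20 5
    }
}
```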
– This needs no Locks at all, because the RDBMS can compare timestamps
instead.
– Locking overhead disappears – but version management overhead appears.
– This approach is called multiversion timestamp ordering (MVTO). We do not
discuss it further here.
• DBMSs are the most common but not the only example of such programming.
at the same time.
• In this STM approach, when an OS thread wants to access this shared RAM, it
• The STM design philosophy encourages programming the threads to use many brief
Transactions which focus on just modifying the shared RAM.
• These Transactions satisfy the Atomicity and Isolation properties, but not
Durability – they coordinate using shared RAM, not a disk database.
– However, there are things which the thread should not do inside a Transaction.
– Especially it should not perform I/O actions – how would you “undo” them if
the Transaction aborts later?
– An external library cannot enforce this – its documentation can only ask
programmers to follow these rules. . .
• STM offers an elegant solution to composing concurrent programs:
– If 2 individually correct Lock-based programs P and Q are composed
sequentially into P ; Q the result may no longer be correct – because P leaves
its Locks in a state which Q cannot handle.
– But if P and Q are STM code, then P ; Q is executed as one Transaction and
this problem does not arise.
• The GHC STM offers the following additional Transaction programming primitives
(Peyton Jones, 2007):
– A third way to end a Transaction:
∗ The retry function aborts the current Transaction and begins it again
later – because it might commit then.
∗ In most cases, the programmer wants his/her Transaction to commit
eventually, and this is easily expressed with retrying.
– A choice to control this retrying: P orElse Q
① first tries executing the STM code P, but if it would end in retrying
② then executes the STM code Q instead.
The programmer can then define more elaborate Transaction control strategies
on top of orElse.
• The GHC STM implementation uses the following optimistic concurrency control
strategy:
① When a Transaction begins, it creates its own private initially empty Log.
② When the Transaction wants to read a variable x, it
  ① first checks if it already has x in its own private Log,
  ② otherwise reads x from the shared RAM into its own private Log for later
  use.
③ When the Transaction wants to write a variable y, it writes it into its own
private Log.
④ When the Transaction wants to commit, it compares its own private Log
against the shared RAM.
– If the shared RAM still has the same old Logged values, this Transaction
writes its new Logged values into the shared RAM and commits.
– Otherwise some other Transaction has committed and modified the original
values of the Logged variables in shared RAM while this Transaction
was running, so it must retry instead, because it has been using their
outdated values from its own private Log.
– This retrying just discards this private Log and goes back to ①.
• This optimistic strategy is lightweight if retrying is. In
  STM it is, because it uses RAM, since it does not have to satisfy the Transaction
    Durability property.
  DBMS it is not, because it must use the disk to satisfy it. However, it can still be
    efficient if Log comparisons usually succeed. Oracle uses a variant of optimism
    by default.
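The optimistic strategy described above can be sketched in a few lines. This is not the GHC implementation, only a minimal illustration under assumptions: the class OptimisticTxSketch is hypothetical, shared RAM is a map from variable names to integers (missing variables read as 0), and the private Log is split into the values seen at first read and the values to be written.

```java
import java.util.HashMap;
import java.util.Map;

class OptimisticTxSketch {
    private final Map<String, Integer> shared;                     // the shared RAM
    private final Map<String, Integer> seen = new HashMap<>();     // Log: values first read
    private final Map<String, Integer> written = new HashMap<>();  // Log: pending writes

    OptimisticTxSketch(Map<String, Integer> shared) {
        this.shared = shared;
    }

    int read(String x) {
        if (written.containsKey(x)) return written.get(x);  // own earlier write
        if (!seen.containsKey(x)) {
            synchronized (shared) {
                seen.put(x, shared.getOrDefault(x, 0));     // copy into the private Log
            }
        }
        return seen.get(x);
    }

    void write(String y, int value) {
        written.put(y, value);                              // private Log only
    }

    // Returns true if committed; false means the caller must retry
    // with a fresh OptimisticTxSketch (that is, a fresh empty Log).
    boolean commit() {
        synchronized (shared) {
            for (Map.Entry<String, Integer> e : seen.entrySet())
                if (!e.getValue().equals(shared.getOrDefault(e.getKey(), 0)))
                    return false;                           // a Logged value is outdated
            shared.putAll(written);
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> ram = new HashMap<>();
        ram.put("x", 1);
        OptimisticTxSketch t1 = new OptimisticTxSketch(ram);
        OptimisticTxSketch t2 = new OptimisticTxSketch(ram);
        t1.write("x", t1.read("x") + 1);
        t2.write("x", t2.read("x") + 10);
        System.out.println(t2.commit());  // true
        System.out.println(t1.commit());  // false: t1 must retry
        System.out.println(ram.get("x")); // 11
    }
}
```

Retrying costs only the discarded maps, which is why the strategy is lightweight when everything stays in RAM.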
• Just for fun, here is the GHC STM solution to the Santa Claus Problem from the
concurrent programming literature:
It is not part of this course! If you get interested, Peyton Jones (2007) derives this
solution step by step.
import Control.Concurrent.STM
import Control.Concurrent
import System.Random
          ; rein_gp <- newGroup 9
          ; sequence_ [ reindeer rein_gp n | n <- [1..9] ]
          ; forever (santa elf_gp rein_gp) }
  where
    elf      gp id = forkIO (forever (do { elf1 gp id; randomDelay }))
    reindeer gp id = forkIO (forever (do { reindeer1 gp id; randomDelay }))
---------------
data Group = MkGroup Int (TVar (Int, Gate, Gate))
---------------
data Gate = MkGate Int (TVar Int)
passGate :: Gate -> IO ()
passGate (MkGate n tv)
The relational data model has. . .            Its RDBMS implementation is. . .
a stored Table, which is a collection of. . . a File of disk Blocks, which is a sequence of. . .
Rows                                          Records
                     which have one or more. . .
Attributes. . .                               Fields. . .
          each of which has some Value of a known Type.
operateGate :: Gate -> IO ()
operateGate (MkGate n tv)
  = do { atomically (writeTVar tv n)
       ; atomically (do { n_left <- readTVar tv
                        ; check (n_left == 0) }) }
----------------
forever :: IO () -> IO ()
-- Repeatedly perform the action
forever act = do { act; forever act }
randomDelay :: IO ()
-- Delay for a random time between 1 and 1,000,000 microseconds
randomDelay = do { waitTime <- getStdRandom (randomR (1, 1000000))
                 ; threadDelay waitTime }
• Now we consider the RDBMS components which interpret this raw data as an
implementation of the relational data model.
• The Record Manager is the first such component. It builds a stored Table on top of
disk Blocks as in Table 4.
• That is, RAM Pages (in Buffers) have methods for getting and setting arbitrary
Values at arbitrary offsets within them.
– get and set the Values for its Fields
without having to know and calculate their actual offsets – this Record Manager
takes care of that for them.
• This Record Manager also determines the Record IDentifier (RID) for each Record.
• The first tradeoff in the design of a Record Manager is the File structure. A File
is. . .
  homogeneous if all its Records belong to the same Table – and so they all have
    the same Fields too.
    + Simpler Record Manager design – each Block of the File can be treated as
      an array of structurally identical Records.
    − The database must be divided into many OS Files (as in Oracle,
      SimpleDB,. . . ).
  nonhomogeneous if its Records can belong to different Tables and can therefore
    have different Fields too.
    + The database can be in one OS File (like a MS Access .mdb file).
    − The Record Manager must keep structurally different kinds of Records
      together in the same File.
• This organization is
fast when the data is accessed in the same way as it was clustered but
slower when the data is accessed in other ways.
• Figure 50 shows the DEPT and STUDENT Tables clustered together in one file
so that the student Records with the same major are clustered together after their
common department Record.
• Then it is fast to retrieve and list students according to their major, but slower if
they are accessed in some other way.
Figure 50: Nonhomogeneous blocks. (Sciore, 2008)
• Another tradeoff is whether a Record can span a Page boundary or not.
• If it can, then. . .
+ the Pages can be filled to the maximum without having to waste the last part of a
Page which is too small for another Record.
+ Record length has no upper limit.
− processing a Record which spans a Page boundary is more difficult, because it
must consider both Pages.
• The SimpleDB Log uses unspanned Records, because it flushes the last Page when
the next Record would not fit into it any more, and starts another Page.
• Figure 51 shows 2 1000-byte Blocks with 4 300-byte Records. The wasted 100-byte
part of the unspanned choice (b) is shaded.
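The arithmetic behind Figure 51 can be checked with a short sketch. The class and method names are hypothetical, and the spanned calculation ignores any per-Block bookkeeping bytes, which is a simplifying assumption.

```java
class SpanningSketch {
    // Unspanned: only whole Records fit into a Block.
    static int recordsPerBlock(int blockSize, int recordSize) {
        return blockSize / recordSize;
    }

    // Unspanned: bytes left unused at the end of each full Block.
    static int wastedPerFullBlock(int blockSize, int recordSize) {
        return blockSize % recordSize;
    }

    static int blocksUnspanned(int blockSize, int recordSize, int records) {
        int perBlock = recordsPerBlock(blockSize, recordSize);
        return (records + perBlock - 1) / perBlock;   // round up
    }

    // Spanned: Records are packed back to back across Block boundaries.
    static int blocksSpanned(int blockSize, int recordSize, int records) {
        long bytes = (long) recordSize * records;
        return (int) ((bytes + blockSize - 1) / blockSize);
    }

    public static void main(String[] args) {
        System.out.println(recordsPerBlock(1000, 300));     // 3
        System.out.println(wastedPerFullBlock(1000, 300));  // 100
        System.out.println(blocksUnspanned(1000, 300, 4));  // 2
        System.out.println(blocksSpanned(1000, 300, 4));    // 2
    }
}
```

With only 4 Records both choices need 2 Blocks; the 100 wasted bytes per full Block start to cost extra Blocks as the File grows.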
• Figure 52 shows 2 ways to represent spanned Records with an integer in the
beginning of each Block telling how many bytes. . .
Figure 52: 2 ways to span a block boundary. (Sciore, 2008)
• Problem ❷ can be solved with an ID table as in part (c) of Figure 54. Then RID
⟨p, q⟩ means “the record whose starting point within Block p is in its ID-TABLE[q]”.
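The indirection through the ID table can be sketched as follows. This is only an illustration: the class IdTableSketch, its fixed sizes and its method names are hypothetical and are not taken from Figure 54.

```java
import java.util.Arrays;

// Hypothetical sketch of one Block p with an ID table.
class IdTableSketch {
    private final byte[] block = new byte[64];
    private final int[] idTable = new int[4];  // idTable[q] = start of record q

    IdTableSketch() {
        Arrays.fill(idTable, -1);              // -1 marks an unused entry
    }

    // Stores (or relocates) record q at the given starting point.
    void store(int q, int start, byte[] rec) {
        System.arraycopy(rec, 0, block, start, rec.length);
        idTable[q] = start;
    }

    int startOf(int q) {
        return idTable[q];
    }

    byte firstByteOf(int q) {
        return block[idTable[q]];
    }

    public static void main(String[] args) {
        IdTableSketch p = new IdTableSketch();
        byte[] rec = {42, 43, 44};
        p.store(1, 10, rec);
        System.out.println(p.startOf(1));      // 10
        p.store(1, 30, rec);                   // the record moves within the Block
        System.out.println(p.startOf(1));      // 30
        System.out.println(p.firstByteOf(1));  // 42
    }
}
```

The second component q of the RID stays the same when the record moves inside the Block; only the table entry changes.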
• Character and binary large object blocks can be stored separately from their records,
as in part (b) of Figure 55.
Figure 53: Growing a variable-length field into an overflow block. (Sciore, 2008)
Figure 55: Different ways to store variable-length strings. (Sciore, 2008)
• SimpleDB uses
• It stores each Java int as 4 bytes – including the length n of a varchar field.
1 if this Record is already used for storing a row of the Table stored in this File,
and
0 if it is still unused.
(In fact, SimpleDB uses a 4-byte int as the Flag, but let us assume just 1 byte for
these examples.)
• Then RID ⟨p, q⟩ means the Record stored in Slot q of Record Page p.
• Figure 57 shows the corresponding table information which describes the Field
structure of the Record inside a Slot.
Figure 56: A block of student records. (Sciore, 2008)
10 characters for the Attribute in the Schema by the Table definition in Figure 5,
but
14 bytes for the Field in the table information in Figure 57 – because the Field
begins with the 4-byte int giving the actual length of the current Value.
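The relation between the Schema lengths and the physical Field lengths can be sketched as below. Only the 4-byte int, the 14 bytes for a 10-character string and the MajorId field come from the text; the other field names, their order and the class RecordLayoutSketch are assumptions made for illustration, as is the use of one byte per character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RecordLayoutSketch {
    static final int INT_SIZE = 4;

    static int intField() {
        return INT_SIZE;
    }

    // A varchar(n) Field: a 4-byte length followed by room for n characters.
    // Assumes one byte per character.
    static int varcharField(int n) {
        return INT_SIZE + n;
    }

    // Maps each field name to its offset within the Record, in insertion order.
    static Map<String, Integer> offsets(Map<String, Integer> sizes) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int pos = 0;
        for (Map.Entry<String, Integer> e : sizes.entrySet()) {
            result.put(e.getKey(), pos);
            pos += e.getValue();
        }
        return result;
    }

    static int recordLength(Map<String, Integer> sizes) {
        int len = 0;
        for (int size : sizes.values()) len += size;
        return len;
    }

    public static void main(String[] args) {
        Map<String, Integer> student = new LinkedHashMap<>();
        student.put("SId", intField());          // assumed field
        student.put("SName", varcharField(10));  // assumed field name
        student.put("GradYear", intField());     // assumed field
        student.put("MajorId", intField());
        System.out.println(offsets(student));       // {SId=0, SName=4, GradYear=18, MajorId=22}
        System.out.println(recordLength(student));  // 26
    }
}
```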
• For instance, accessing the MajorId field of the Record with RID ⟨p, q⟩ means
① retrieving Block p of the Student File into the RAM Page of a Buffer
② moving to the beginning of Slot q within this Buffer – that is, into position
$$q \cdot \overbrace{(\text{Flag byte} + \text{Record length})}^{\text{Slot length}} = q \cdot (1 + 26) = q \cdot 27$$
• The Record Manager handles this translation of an RID and an Attribute name into
a position within a Block.
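The Slot position calculation can be written out as a small sketch. The class SlotPositionSketch is hypothetical; it uses the 1-byte Flag of these examples, and the assumption that the Fields of a Record follow directly after the Flag.

```java
class SlotPositionSketch {
    // The handout's simplification; SimpleDB itself uses a 4-byte int Flag.
    static final int FLAG_BYTES = 1;

    // Position of the beginning of Slot q within the Block.
    static int slotStart(int q, int recordLength) {
        return q * (FLAG_BYTES + recordLength);
    }

    // Position of a Field, assuming the Record starts right after the Flag.
    static int fieldPosition(int q, int recordLength, int fieldOffset) {
        return slotStart(q, recordLength) + FLAG_BYTES + fieldOffset;
    }

    public static void main(String[] args) {
        System.out.println(slotStart(0, 26));  // 0
        System.out.println(slotStart(3, 26));  // 81
        // A Field at an assumed offset 22 within the Record in Slot 3:
        System.out.println(fieldPosition(3, 26, 22));  // 104
    }
}
```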
• The API for the Schema and TableInfo objects used in this translation is in
Figure 58.
Figure 58: The two kinds of objects in the Record Manager. (Sciore, 2008)
import static java.sql.Types.*;
import java.util.*;

/**
 * The record schema of a table.
 * A schema contains the name and type of
 * each field of the table, as well as the length
 * of each varchar field.
 * @author Edward Sciore
 *
 */
public class Schema {
   private Map<String,FieldInfo> info = new HashMap<String,FieldInfo>();

   /**
    * Creates an empty schema.
    * Field information can be added to a schema
    * via the five addXXX methods.
    */
   public Schema() {}

   /**
    * Adds a field to the schema having a specified
    * name, type, and length.
    * If the field type is "integer", then the length
    * value is irrelevant.
    * @param fldname the name of the field
    * @param type the type of the field, according to the constants in java.sql.Types
    * @param length the conceptual length of a string field.
    */
   public void addField(String fldname, int type, int length) {
      info.put(fldname, new FieldInfo(type, length));
   }

   /**
    * Adds an integer field to the schema.
    * @param fldname the name of the field
    */
   public void addIntField(String fldname) {
      addField(fldname, INTEGER, 0);
   }

   /**
    * Adds a string field to the schema.
    * The length is the conceptual length of the field.
    * For example, if the field is defined as varchar(8),
    * then its length is 8.
    * @param fldname the name of the field
    * @param length the number of chars in the varchar definition
    */
   public void addStringField(String fldname, int length) {
      addField(fldname, VARCHAR, length);
   }

   /**
    * Adds a field to the schema having the same
    * type and length as the corresponding field
    * in another schema.
    * @param fldname the name of the field
    * @param sch the other schema
    */
   public void add(String fldname, Schema sch) {
      int type   = sch.type(fldname);
      int length = sch.length(fldname);
      addField(fldname, type, length);
   }

   /**
    * Adds all of the fields in the specified schema
    * to the current schema.
    * @param sch the other schema
    */
   public void addAll(Schema sch) {
      info.putAll(sch.info);
   }

   /**
    * Returns a collection containing the name of
    * each field in the schema.
    * @return the collection of the schema's field names
    */
   public Collection<String> fields() {
      return info.keySet();
   }

   /**
    * Returns true if the specified field
    * is in the schema
    * @param fldname the name of the field
    * @return true if the field is in the schema
    */
   public boolean hasField(String fldname) {
      return fields().contains(fldname);
   }

   /**
    * Returns the type of the specified field, using the
    * constants in {@link java.sql.Types}.
    * @param fldname the name of the field
    * @return the integer type of the field
    */
   public int type(String fldname) {
      return info.get(fldname).type;
   }

   /**
    * Returns the conceptual length of the specified field.
    * If the field is not a string field, then
    * the return value is undefined.
    * @param fldname the name of the field
    * @return the conceptual length of the field
    */
   public int length(String fldname) {
      return info.get(fldname).length;
   }

   class FieldInfo {
      int type, length;
      public FieldInfo(int type, int length) {
         this.type = type;
         this.length = length;
      }
   }
}
package simpledb.record;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.*;
import java.util.*;

/**
 * The metadata about a table and its records.
 * @author Edward Sciore
 */
public class TableInfo {
   private Schema schema;
   private Map<String,Integer> offsets;
   private int recordlen;
   private String tblname;

   /**
    * Creates a TableInfo object, given a table name
    * and schema. The constructor calculates the
    * physical offset of each field.
    * This constructor is used when a table is created.
    * @param tblname the name of the table
    * @param schema the schema of the table's records
    */
   public TableInfo(String tblname, Schema schema) {
      this.schema = schema;
      this.tblname = tblname;
      offsets = new HashMap<String,Integer>();
      int pos = 0;
      for (String fldname : schema.fields()) {
         offsets.put(fldname, pos);
         pos += lengthInBytes(fldname);
      }
      recordlen = pos;
   }

   /**
    * Creates a TableInfo object from the
    * specified metadata.
    * This constructor is used when the metadata
    * is retrieved from the catalog.
    * @param tblname the name of the table
    * @param schema the schema of the table's records
    * @param offsets the already-calculated offsets of the fields within a record
    * @param recordlen the already-calculated length of each record
    */
   public TableInfo(String tblname, Schema schema, Map<String,Integer> offsets, int recordlen) {
      this.tblname = tblname;
      this.schema = schema;
      this.offsets = offsets;
      this.recordlen = recordlen;
   }

   /**
    * Returns the filename assigned to this table.
    * Currently, the filename is the table name
    * followed by ".tbl".
    * @return the name of the file assigned to the table
    */
   public String fileName() {
      return tblname + ".tbl";
   }

   /**
    * Returns the schema of the table's records.
    * @return the table's record schema
    */
   public Schema schema() {
      return schema;
   }

   /**
    * Returns the offset of a specified field within a record.
    * @param fldname the name of the field
    * @return the offset of that field within a record
    */
   public int offset(String fldname) {
      return offsets.get(fldname);
   }

   /**
    * Returns the length of a record, in bytes.
    * @return the length in bytes of a record
    */
   public int recordLength() {
      return recordlen;
   }

   // Assumed helper (called by the first constructor): an int field
   // occupies INT_SIZE bytes and a string field of n characters
   // STR_SIZE(n) bytes, as defined in simpledb.file.Page.
   private int lengthInBytes(String fldname) {
      if (schema.type(fldname) == INTEGER)
         return INT_SIZE;
      else
         return STR_SIZE(schema.length(fldname));
   }
}
/**
 * An identifier for a record within a file.
 * A RID consists of the block number in the file,
 * and the ID of the record in that block.
 * @author Edward Sciore
 */
public class RID {
   private int blknum;
   private int id;

   /**
    * Creates a RID for the record having the
    * specified ID in the specified block.
    * @param blknum the block number where the record lives
    * @param id the record's ID
    */
   public RID(int blknum, int id) {
      this.blknum = blknum;
      this.id = id;
   }

   /**
    * Returns the block number associated with this RID.
    * @return the block number
    */
   public int blockNumber() {
      return blknum;
   }

   /**
    * Returns the ID associated with this RID.
    * @return the ID
    */
   public int id() {
      return id;
   }

   public boolean equals(Object obj) {
      // Guard against null and foreign types instead of failing in the cast.
      if (!(obj instanceof RID))
         return false;
      RID r = (RID) obj;
      return blknum == r.blknum && id == r.id;
   }

   // Kept consistent with equals, so that RIDs also work as hash keys.
   public int hashCode() {
      return 31 * blknum + id;
   }

   public String toString() {
      return "[" + blknum + ", " + id + "]";
   }
}
• Its get and set methods take an Attribute name as an argument, and translate it
into the correct position within this current Slot.
• It also provides the next method, which moves this current Slot into the next Slot
in use within this Block, if any.
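The translation from an Attribute name into a byte position can be sketched as follows. This is only a minimal, self-contained illustration and not SimpleDB code: the class name SlotLayout, the 4-byte size of the Flag, and the assumption that a string of n characters occupies 4 + n bytes are hypothetical choices made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a slotted layout: each Slot holds a
// 4-byte in-use Flag followed by the Record's fields at fixed offsets.
class SlotLayout {
    static final int INT_SIZE = 4;   // assumed size of an int and of the Flag

    private final Map<String, Integer> offsets = new LinkedHashMap<>();
    private int recordLen = 0;

    // Appends an int field after the fields added so far.
    SlotLayout addInt(String name) {
        offsets.put(name, recordLen);
        recordLen += INT_SIZE;
        return this;
    }

    // Appends a string field; assumed to need a length prefix plus maxChars bytes.
    SlotLayout addString(String name, int maxChars) {
        offsets.put(name, recordLen);
        recordLen += INT_SIZE + maxChars;
        return this;
    }

    // A Slot is the Flag followed by the Record itself.
    int slotSize() {
        return INT_SIZE + recordLen;
    }

    // Byte position of the Flag of the given Slot within its Block.
    int flagPos(int slot) {
        return slot * slotSize();
    }

    // Byte position of a field of the given Slot within its Block.
    int fieldPos(int slot, String name) {
        return flagPos(slot) + INT_SIZE + offsets.get(name);
    }

    // Number of whole Slots that fit into a Block of the given size.
    int slotsPerBlock(int blockSize) {
        return blockSize / slotSize();
    }
}
```

For example, with the fields sid:int, sname:varchar(10) and gradyear:int, a Slot takes 26 bytes under these assumptions, so the field sname of Slot 2 starts at byte 2 · 26 + 4 + 4 = 60.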
package simpledb.record;

import static simpledb.file.Page.*;
import simpledb.file.Block;
import simpledb.tx.Transaction;

/**
 * Manages the placement and access of records in a block.
 * @author Edward Sciore
 */
public class RecordPage {
   public static final int EMPTY = 0, INUSE = 1;

   private Block blk;
   private TableInfo ti;
   private Transaction tx;
   private int slotsize;
   private int currentslot = -1;   // positioned before the first slot

   /**
    * Creates the record manager for the specified block.
    * The current record is set to be prior to the first one.
    * @param blk a reference to the disk block
    * @param ti the table's metadata
    * @param tx the transaction performing the operations
    */
   public RecordPage(Block blk, TableInfo ti, Transaction tx) {
      this.blk = blk;
      this.ti = ti;
      this.tx = tx;
      slotsize = ti.recordLength() + INT_SIZE;   // each slot starts with an int flag
      tx.pin(blk);
   }

   /**
    * Closes the manager, by unpinning the block.
    */
   public void close() {
      if (blk != null) {
         tx.unpin(blk);
         blk = null;
      }
   }

   /**
    * Moves to the next record in the block.
    * @return false if there is no next record.
    */
   public boolean next() {
      return searchFor(INUSE);
   }

   /**
    * Returns the integer value stored for the
    * specified field of the current record.
    * @param fldname the name of the field.
    * @return the integer stored in that field
    */
   public int getInt(String fldname) {
      int position = fieldpos(fldname);
      return tx.getInt(blk, position);
   }

   /**
    * Returns the string value stored for the
    * specified field of the current record.
    * @param fldname the name of the field.
    * @return the string stored in that field
    */
   public String getString(String fldname) {
      int position = fieldpos(fldname);
      return tx.getString(blk, position);
   }

   /**
    * Stores an integer at the specified field
    * of the current record.
    * @param fldname the name of the field
    * @param val the integer value stored in that field
    */
   public void setInt(String fldname, int val) {
      int position = fieldpos(fldname);
      tx.setInt(blk, position, val);
   }

   /**
    * Stores a string at the specified field
    * of the current record.
    * @param fldname the name of the field
    * @param val the string value stored in that field
    */
   public void setString(String fldname, String val) {
      int position = fieldpos(fldname);
      tx.setString(blk, position, val);
   }

   /**
    * Deletes the current record.
    * Deletion is performed by just marking the record
    * as "deleted"; the current record does not change.
    * To get to the next record, call next().
    */
   public void delete() {
      int position = currentpos();
      tx.setInt(blk, position, EMPTY);
   }

   /**
    * Inserts a new, blank record somewhere in the page.
    * Return false if there were no available slots.
    * @return false if the insertion was not possible
    */
   public boolean insert() {
      currentslot = -1;
      boolean found = searchFor(EMPTY);
      if (found) {
         int position = currentpos();
         tx.setInt(blk, position, INUSE);
      }
      return found;
   }

   /**
    * Sets the current record to be the record having the
    * specified ID.
    * @param id the ID of the record within the page.
    */
   public void moveToId(int id) {
      currentslot = id;
   }

   /**
    * Returns the ID of the current record.
    * @return the ID of the current record
    */
   public int currentId() {
      return currentslot;
   }

   private int currentpos() {
      return currentslot * slotsize;
   }

   // Assumed helper (used by the get and set methods above): a field
   // lies after the slot's int flag, at its offset within the record.
   private int fieldpos(String fldname) {
      int offset = INT_SIZE + ti.offset(fldname);
      return currentpos() + offset;
   }

   private boolean isValidSlot() {
      return currentpos() + slotsize <= BLOCK_SIZE;
   }

   private boolean searchFor(int flag) {
      currentslot++;
      while (isValidSlot()) {
         int position = currentpos();
         if (tx.getInt(blk, position) == flag)
            return true;
         currentslot++;
      }
      return false;
   }
}
Figure 59: The record file operations. (Sciore, 2008)
package simpledb.record;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.*;
import static simpledb.record.RecordPage.EMPTY;
import simpledb.file.Page;
import simpledb.buffer.PageFormatter;

/**
 * An object that can format a page to look like a block of
 * empty records.
 * @author Edward Sciore
 */
class RecordFormatter implements PageFormatter {
   private TableInfo ti;

   /**
    * Creates a formatter for a new page of a table.
    * @param ti the table's metadata
    */
   public RecordFormatter(TableInfo ti) {
      this.ti = ti;
   }

   /**
    * Formats the page by allocating as many record slots
    * as possible, given the record length.
    * Each record slot is assigned a flag of EMPTY.
    * Each integer field is given a value of 0, and
    * each string field is given a value of "".
    * @see simpledb.buffer.PageFormatter#format(simpledb.file.Page)
    */
   public void format(Page page) {
      int recsize = ti.recordLength() + INT_SIZE;
      for (int pos = 0; pos + recsize <= BLOCK_SIZE; pos += recsize) {
         page.setInt(pos, EMPTY);
         makeDefaultRecord(page, pos);
      }
   }

   // Assumed helper, following the description of format() above:
   // writes 0 into each integer field and "" into each string field
   // of the slot starting at pos.
   private void makeDefaultRecord(Page page, int pos) {
      for (String fldname : ti.schema().fields()) {
         int offset = ti.offset(fldname);
         if (ti.schema().type(fldname) == INTEGER)
            page.setInt(pos + INT_SIZE + offset, 0);
         else
            page.setString(pos + INT_SIZE + offset, "");
      }
   }
}
SimpleDB source file simpledb/record/RecordFile.java
• Here is the implementation of a whole File of RecordPages – that is, of a stored
Table.
– It maintains the notion of the current Record – which it builds on top of the
current Slot within the current RecordPage.
– This current Record can be positioned “just before the first” actual Record in
the File.
– It can be moved to the next Record (if any) – which it does by moving
(1) to the next Slot in use within the current RecordPage, and
(2) to the next RecordPage if there are no more Slots in use within the current
RecordPage.
– It permits getting and setting the Attribute Values for this current Record.
• It also provides random access to these Records using RIDs as their addresses.
• It can also delete the current Record – by setting the Flag of its Slot to 0.
• It can also insert a new Record somewhere in the File. Its contents can then be
set.
– SimpleDB finds an unused Slot for the new Record with a linear scan, which is not
very efficient.
– Instead, the RecordFile of an RDBMS can maintain, for instance, a list of still
unused Slots linked by RIDs.
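The idea of a list of unused Slots linked by RIDs can be sketched as follows. This is a hypothetical in-memory model and not part of SimpleDB: the names FreeSlotList, Rid, release and allocate are invented for the illustration, and the map nextFree merely stands in for the link that a real RDBMS would store inside each unused Slot (with the head of the list kept, for instance, in a header of the File).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a free list of unused Slots.
class FreeSlotList {
    // A record identifier: Block number plus Slot number within that Block.
    static final class Rid {
        final int blknum, id;

        Rid(int blknum, int id) {
            this.blknum = blknum;
            this.id = id;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Rid && ((Rid) o).blknum == blknum && ((Rid) o).id == id;
        }

        @Override public int hashCode() {
            return 31 * blknum + id;
        }

        @Override public String toString() {
            return "[" + blknum + ", " + id + "]";
        }
    }

    private Rid head = null;                                  // first unused Slot, or null
    private final Map<Rid, Rid> nextFree = new HashMap<>();   // link stored in each unused Slot

    // Called when a Record is deleted: its Slot becomes the new head of the list.
    void release(Rid rid) {
        nextFree.put(rid, head);
        head = rid;
    }

    // Called when a Record is inserted: returns an unused Slot in constant
    // time, or null if the list is empty and a new Block must be appended.
    Rid allocate() {
        if (head == null)
            return null;
        Rid rid = head;
        head = nextFree.remove(rid);
        return rid;
    }
}
```

With such a list, an insertion no longer has to scan the Flags of the Slots one by one. The price is that every deletion and insertion must also update the list, which in a real RDBMS means additional writes.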
package simpledb.record;

import simpledb.file.Block;
import simpledb.tx.Transaction;

/**
 * Manages a file of records.
 * There are methods for iterating through the records
 * and accessing their contents.
 * @author Edward Sciore
 */
public class RecordFile {
   private TableInfo ti;
   private Transaction tx;
   private String filename;
   private RecordPage rp;
   private int currentblknum;

   /**
    * Constructs an object to manage a file of records.
    * If the file does not exist, it is created.
    * @param ti the table metadata
    * @param tx the transaction
    */
   public RecordFile(TableInfo ti, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      filename = ti.fileName();
      if (tx.size(filename) == 0)
         appendBlock();
      moveTo(0);
   }

   /**
    * Closes the record file.
    */
   public void close() {
      rp.close();
   }

   /**
    * Positions the current record so that a call to method next
    * will wind up at the first record.
    */
   public void beforeFirst() {
      moveTo(0);
   }

   /**
    * Moves to the next record. Returns false if there
    * is no next record.
    * @return false if there is no next record.
    */
   public boolean next() {
      while (true) {
         if (rp.next())
            return true;
         if (atLastBlock())
            return false;
         moveTo(currentblknum + 1);
      }
   }

   /**
    * Returns the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @return the integer value at that field
    */
   public int getInt(String fldname) {
      return rp.getInt(fldname);
   }

   /**
    * Returns the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @return the string value at that field
    */
   public String getString(String fldname) {
      return rp.getString(fldname);
   }

   /**
    * Sets the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @param val the new value for the field
    */
   public void setInt(String fldname, int val) {
      rp.setInt(fldname, val);
   }

   /**
    * Sets the value of the specified field
    * in the current record.
    * @param fldname the name of the field
    * @param val the new value for the field
    */
   public void setString(String fldname, String val) {
      rp.setString(fldname, val);
   }

   /**
    * Deletes the current record.
    * The client must call next() to move to
    * the next record.
    * Calls to methods on a deleted record
    * have unspecified behavior.
    */
   public void delete() {
      rp.delete();
   }

   /**
    * Inserts a new, blank record somewhere in the file
    * beginning at the current record.
    * If the new record does not fit into an existing block,
    * then a new block is appended to the file.
    */
   public void insert() {
      while (!rp.insert()) {
         if (atLastBlock())
            appendBlock();
         moveTo(currentblknum + 1);
      }
   }

   /**
    * Positions the current record as indicated by the
    * specified RID.
    * @param rid a record identifier
    */
   public void moveToRid(RID rid) {
      moveTo(rid.blockNumber());
      rp.moveToId(rid.id());
   }

   /**
    * Returns the RID of the current record.
    * @return a record identifier
    */
   public RID currentRid() {
      int id = rp.currentId();
      return new RID(currentblknum, id);
   }

   private void moveTo(int b) {
      if (rp != null)
         rp.close();
      currentblknum = b;
      Block blk = new Block(filename, currentblknum);
      rp = new RecordPage(blk, ti, tx);
   }

   private boolean atLastBlock() {
      return currentblknum == tx.size(filename) - 1;
   }

   // Assumed helper (called by the constructor and by insert()):
   // appends a new block formatted as empty record slots, assuming
   // that Transaction offers append(filename, formatter).
   private void appendBlock() {
      RecordFormatter fmtr = new RecordFormatter(ti);
      tx.append(filename, fmtr);
   }
}
• The Schema and table information of the Record Manager in section 4.5 is one
example of metadata:
data telling how to interpret the other data stored in the database.
• The SQL standard specifies more than 50 different views which an RDBMS must offer
to its metadata.
In this way it avoids specifying how an RDBMS actually stores its metadata.
• SimpleDB stores its metadata in the following Tables:
tblcat(TblName:varchar(16),RecLength:int)
fldcat(TblName:varchar(16),FldName:varchar(16)
,Type:int,Length:int,Offset:int)
viewcat(ViewName:varchar(16),ViewDef:varchar(100))
idxcat(tablename:varchar(16),fieldname:varchar(16)
,indexname:varchar(16))
• They can be queried with SELECT ... FROM ... WHERE ... just like other
Tables.
• These metadata tables are often called the catalog of the RDBMS.
• The Table catalog tblcat has the name of each CREATEd Table as its key and
the length of its Records as its other attribute.
• The Field catalog fldcat tells which Fields such a Table has, as well as the
– Type of its Values, where (following the constants in java.sql.Types)
4 denotes an int, and
12 denotes varchar
– Length of these Values – for varchars
– Offset inside the Record
• Together they form the SimpleDB metadata for each CREATEd Table.
• Hence this implementation also provides a method for getting the table information
for a given Table.
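How the rows of fldcat describe one Table can be sketched with a small in-memory model. It is only an illustration under stated assumptions: the names CatalogSketch, FldRow, addField and describe are hypothetical, and a plain list stands in for the stored fldcat Table. The type codes are the real constants java.sql.Types.INTEGER (4) and java.sql.Types.VARCHAR (12).

```java
import java.sql.Types;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Field catalog fldcat.
class CatalogSketch {
    // One row of fldcat(TblName, FldName, Type, Length, Offset).
    static final class FldRow {
        final String tblName, fldName;
        final int type, length, offset;

        FldRow(String tblName, String fldName, int type, int length, int offset) {
            this.tblName = tblName;
            this.fldName = fldName;
            this.type = type;
            this.length = length;
            this.offset = offset;
        }
    }

    private final List<FldRow> fldcat = new ArrayList<>();

    void addField(String tblName, String fldName, int type, int length, int offset) {
        fldcat.add(new FldRow(tblName, fldName, type, length, offset));
    }

    // Scans fldcat sequentially, like SELECT ... FROM fldcat WHERE TblName = ...,
    // and returns a printable description of each Field keyed by its name.
    Map<String, String> describe(String tblName) {
        Map<String, String> result = new LinkedHashMap<>();
        for (FldRow r : fldcat) {
            if (!r.tblName.equals(tblName))
                continue;
            String type;
            if (r.type == Types.INTEGER)
                type = "int";
            else if (r.type == Types.VARCHAR)
                type = "varchar(" + r.length + ")";
            else
                type = "type " + r.type;
            result.put(r.fldName, type + " @ offset " + r.offset);
        }
        return result;
    }
}
```

The sequential scan mirrors the fact that the catalog is itself stored as ordinary Tables, so looking up the metadata of one Table costs a scan over the catalog rows of all Tables.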
Figure 60: Metadata for the University Database. (Sciore, 2008)

package simpledb.metadata;

import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The table manager.
 * There are methods to create a table, save the metadata
 * in the catalog, and obtain the metadata of a
 * previously-created table.
 * @author Edward Sciore
 */
public class TableMgr {
   /**
    * The maximum number of characters in any
    * tablename or fieldname.
    * Currently, this value is 16.
    */
   public static final int MAX_NAME = 16;

   private TableInfo tcatInfo, fcatInfo;

   /**
    * Creates a new catalog manager for the database system.
    * If the database is new, then the two catalog tables
    * are created.
    * @param isNew has the value true if the database is new
    * @param tx the startup transaction
    */
   public TableMgr(boolean isNew, Transaction tx) {
      Schema tcatSchema = new Schema();
      tcatSchema.addStringField("tblname", MAX_NAME);
      tcatSchema.addIntField("reclength");
      tcatInfo = new TableInfo("tblcat", tcatSchema);

      // Schema of fldcat(TblName, FldName, Type, Length, Offset)
      Schema fcatSchema = new Schema();
      fcatSchema.addStringField("tblname", MAX_NAME);
      fcatSchema.addStringField("fldname", MAX_NAME);
      fcatSchema.addIntField("type");
      fcatSchema.addIntField("length");
      fcatSchema.addIntField("offset");
      fcatInfo = new TableInfo("fldcat", fcatSchema);

      if (isNew) {
         createTable("tblcat", tcatSchema, tx);
         createTable("fldcat", fcatSchema, tx);
      }
   }

   /**
    * Creates a new table having the specified name and schema.
    * @param tblname the name of the new table
    * @param sch the table's schema
    * @param tx the transaction creating the table
    */
   public void createTable(String tblname, Schema sch, Transaction tx) {
      TableInfo ti = new TableInfo(tblname, sch);

      // insert one record into tblcat
      RecordFile tcatfile = new RecordFile(tcatInfo, tx);
      tcatfile.insert();
      tcatfile.setString("tblname", tblname);
      tcatfile.setInt("reclength", ti.recordLength());
      tcatfile.close();

      // insert a record into fldcat for each field
      RecordFile fcatfile = new RecordFile(fcatInfo, tx);
      for (String fldname : sch.fields()) {
         fcatfile.insert();
         fcatfile.setString("tblname", tblname);
         fcatfile.setString("fldname", fldname);
         fcatfile.setInt("type", sch.type(fldname));
         fcatfile.setInt("length", sch.length(fldname));
         fcatfile.setInt("offset", ti.offset(fldname));
      }
      fcatfile.close();
   }

   /**
    * Retrieves the metadata for the specified table
    * out of the catalog.
    * @param tblname the name of the table
    * @param tx the transaction
    * @return the table's stored metadata
    */
   public TableInfo getTableInfo(String tblname, Transaction tx) {
      RecordFile tcatfile = new RecordFile(tcatInfo, tx);
      int reclen = -1;
      while (tcatfile.next())
         if (tcatfile.getString("tblname").equals(tblname)) {
            reclen = tcatfile.getInt("reclength");
            break;
         }
      tcatfile.close();

      RecordFile fcatfile = new RecordFile(fcatInfo, tx);
      Schema sch = new Schema();
      Map<String,Integer> offsets = new HashMap<String,Integer>();
      while (fcatfile.next())
         if (fcatfile.getString("tblname").equals(tblname)) {
            String fldname = fcatfile.getString("fldname");
            int fldtype = fcatfile.getInt("type");
            int fldlen = fcatfile.getInt("length");
            int offset = fcatfile.getInt("offset");
            offsets.put(fldname, offset);
            sch.addField(fldname, fldtype, fldlen);
         }
      fcatfile.close();
      return new TableInfo(tblname, sch, offsets, reclen);
   }
}
• The view catalog viewcat tells the definition of each named view.
• Its constructor internally CREATEs this viewcat table and its Fields into the
Table metadata, if it is constructing a new database from scratch.
package simpledb.metadata;

import simpledb.record.*;
import simpledb.tx.Transaction;

class ViewMgr {
   private static final int MAX_VIEWDEF = 100;
   TableMgr tblMgr;

   // Assumed constructor header and schema, following
   // viewcat(ViewName:varchar(16),ViewDef:varchar(100)) and the
   // constructor of IndexMgr; only its last lines survive in this listing.
   public ViewMgr(boolean isNew, TableMgr tblMgr, Transaction tx) {
      this.tblMgr = tblMgr;
      if (isNew) {
         Schema sch = new Schema();
         sch.addStringField("viewname", TableMgr.MAX_NAME);
         sch.addStringField("viewdef", MAX_VIEWDEF);
         tblMgr.createTable("viewcat", sch, tx);
      }
   }

   public String getViewDef(String vname, Transaction tx) {
      String result = null;
      TableInfo ti = tblMgr.getTableInfo("viewcat", tx);
      RecordFile rf = new RecordFile(ti, tx);
      while (rf.next())
         if (rf.getString("viewname").equals(vname)) {
            result = rf.getString("viewdef");
            break;
         }
      rf.close();
      return result;
   }
}
• The index catalog idxcat tells the information on each index.
– It tells the names of the indexes which have been CREATEd for a given
named Table.
– Each SimpleDB index can be built on just one Field of a Table, and that
restriction simplifies this index metadata.
– In general, an RDBMS index can be built on many Fields of the same Table.
• Its constructor internally CREATEs this idxcat table and its Fields into the Table
metadata, if it is constructing a new database from scratch.
package simpledb.metadata;

import static simpledb.metadata.TableMgr.MAX_NAME;
import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The index manager.
 * The index manager has similar functionality to the table manager.
 * @author Edward Sciore
 */
public class IndexMgr {
   private TableInfo ti;

   /**
    * Creates the index manager.
    * This constructor is called during system startup.
    * If the database is new, then the <i>idxcat</i> table is created.
    * @param isnew indicates whether this is a new database
    * @param tx the system startup transaction
    */
   public IndexMgr(boolean isnew, TableMgr tblmgr, Transaction tx) {
      if (isnew) {
         Schema sch = new Schema();
         sch.addStringField("indexname", MAX_NAME);
         sch.addStringField("tablename", MAX_NAME);
         sch.addStringField("fieldname", MAX_NAME);
         tblmgr.createTable("idxcat", sch, tx);
      }
      ti = tblmgr.getTableInfo("idxcat", tx);
   }

   /**
    * Creates an index of the specified type for the specified field.
    * A unique ID is assigned to this index, and its information
    * is stored in the idxcat table.
    * @param idxname the name of the index
    * @param tblname the name of the indexed table
    * @param fldname the name of the indexed field
    * @param tx the calling transaction
    */
   public void createIndex(String idxname, String tblname, String fldname, Transaction tx) {
      RecordFile rf = new RecordFile(ti, tx);
      rf.insert();
      rf.setString("indexname", idxname);
      rf.setString("tablename", tblname);
      rf.setString("fieldname", fldname);
      rf.close();
   }

   /**
    * Returns a map containing the index info for all indexes
    * on the specified table.
    * @param tblname the name of the table
    * @param tx the calling transaction
    * @return a map of IndexInfo objects, keyed by their field names
    */
   public Map<String,IndexInfo> getIndexInfo(String tblname, Transaction tx) {
      Map<String,IndexInfo> result = new HashMap<String,IndexInfo>();
      RecordFile rf = new RecordFile(ti, tx);
      while (rf.next())
         if (rf.getString("tablename").equals(tblname)) {
            String idxname = rf.getString("indexname");
            String fldname = rf.getString("fieldname");
            IndexInfo ii = new IndexInfo(idxname, tblname, fldname, tx);
            result.put(fldname, ii);
         }
      rf.close();
      return result;
   }
}

Figure 61: The information on each index. (Sciore, 2008)
– An index must be opened before it can be used to search for the RIDs of the
Records having the given Value in the indexed Field.
– The blocksAccessed method estimates how many Blocks would be accessed during
one such search, so that the RDBMS can decide which is faster in a given
situation:
∗ Reading the Records sequentially from the File vs.
∗ searching the File using this index – which may read the same Block many
times.
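The decision between the two alternatives can be sketched as a comparison of two estimates. The cost model below is an assumption made for this illustration, not the planner of SimpleDB: the index alternative is charged the search cost within the index plus, pessimistically, one Block access for every matching Record. The class and method names are hypothetical.

```java
// Hypothetical cost comparison between a sequential scan and an index lookup.
class AccessPathChooser {
    // Estimated Block accesses for reading the whole File sequentially.
    static int sequentialCost(int numBlocks) {
        return numBlocks;
    }

    // Estimated Block accesses for an index lookup: the search within the
    // index, plus one data Block access per matching Record. The number of
    // matching Records is estimated as numRecords / distinctValues.
    static int indexCost(int indexSearchCost, int numRecords, int distinctValues) {
        int matching = numRecords / Math.max(1, distinctValues);
        return indexSearchCost + matching;
    }

    // True if the index lookup is estimated to be cheaper than the scan.
    static boolean preferIndex(int numBlocks, int indexSearchCost,
                               int numRecords, int distinctValues) {
        return indexCost(indexSearchCost, numRecords, distinctValues)
             < sequentialCost(numBlocks);
    }
}
```

With made-up numbers, a File of 4,500 Blocks holding 45,000 Records favours the index when the indexed Field is a key (about 1 matching Record), but not when the Field has only 2 distinct Values: then about 22,500 Records match, and fetching them one by one reads the same Blocks many times.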
package simpledb.metadata;

import static java.sql.Types.INTEGER;
import static simpledb.file.Page.BLOCK_SIZE;
import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.record.*;
import simpledb.index.Index;
import simpledb.index.hash.HashIndex;

/**
 * The information about an index.
 * This information is used by the query planner in order to
 * estimate the costs of using the index,
 * and to obtain the schema of the index records.
 * Its methods are essentially the same as those of Plan.
 * @author Edward Sciore
 */
public class IndexInfo {
   private String idxname, fldname;
   private Transaction tx;
   private TableInfo ti;
   private StatInfo si;

   /**
    * Creates an IndexInfo object for the specified index.
    * @param idxname the name of the index
    * @param tblname the name of the table
    * @param fldname the name of the indexed field
    * @param tx the calling transaction
    */
   public IndexInfo(String idxname, String tblname, String fldname,
                    Transaction tx) {
      this.idxname = idxname;
      this.fldname = fldname;
      this.tx = tx;
      ti = SimpleDB.mdMgr().getTableInfo(tblname, tx);
      si = SimpleDB.mdMgr().getStatInfo(tblname, ti, tx);
   }

   /**
    * Opens the index described by this object.
    * @return the Index object associated with this information
    */
   public Index open() {
      Schema sch = schema();
      // Create new HashIndex for hash indexing
      return new HashIndex(idxname, sch, tx);
   }

   /**
    * Estimates the number of block accesses required to
    * find all index records having a particular search key.
    * The method uses the table's metadata to estimate the
    * size of the index file and the number of index records
    * per block.
    * It then passes this information to the traversalCost
    * method of the appropriate index type,
    * which provides the estimate.
    * @return the number of block accesses required to traverse the index
    */
   public int blocksAccessed() {
      TableInfo idxti = new TableInfo("", schema());
      int rpb = BLOCK_SIZE / idxti.recordLength();
      int numblocks = si.recordsOutput() / rpb;
      // Call HashIndex.searchCost for hash indexing
      return HashIndex.searchCost(numblocks, rpb);
   }

   /**
    * Returns the estimated number of records having a
    * search key. This value is the same as doing a select
    * query; that is, it is the number of records in the table
    * divided by the number of distinct values of the indexed field.
    * @return the estimated number of records having a search key
    */
   public int recordsOutput() {
      return si.recordsOutput() / si.distinctValues(fldname);
   }

   /**
    * Returns the distinct values for a specified field
    * in the underlying table, or 1 for the indexed field.
    * @param fname the specified field
    */
   public int distinctValues(String fname) {
      if (fldname.equals(fname))
         return 1;
      else
         return Math.min(si.distinctValues(fldname), recordsOutput());
   }

   /**
    * Returns the schema of the index records.
    * The schema consists of the dataRID (which is
    * represented as two integers, the block number and the
    * record ID) and the dataval (which is the indexed field).
    * Schema information about the indexed field is obtained
    * via the table's metadata.
    * @return the schema of the index records
    */
   private Schema schema() {
      Schema sch = new Schema();
      sch.addIntField("block");
      sch.addIntField("id");
      if (ti.schema().type(fldname) == INTEGER)
         sch.addIntField("dataval");
      else {
         int fldlen = ti.schema().length(fldname);
         sch.addStringField("dataval", fldlen);
      }
      return sch;
   }
}

Figure 62: Example Statistics for the University Database. (Sciore, 2008)
Table Statistics
• The blocksAccessed method in Figure 61 is an example of statistics which the
RDBMS uses to decide an efficient way to execute the given SQL query.
• Consider the following simple statistics:
B(T): the number of Blocks in the File storing this Table T – estimating the I/O
needed to list its contents
R(T): the number of Records in this Table T – estimating the size of this listing
V(T,F): the number of distinct Values in this Field F of this Table T – estimating
the size of select(T, F = ...), or how selective F is.
• A commercial RDBMS may use much more elaborate statistics than these.
• Figure 62 shows them for a university with about 900 students and 500 sections per
year, for the last 50 years.
• An RDBMS can keep these statistics either in
– catalog Tables on disk, and
write them when the database contents change – with xlocks, which reduces
concurrency
read them when planning how to execute a given SQL query – without slocks,
because the results do not have to be exact.
– RAM because these Tables are small, but then they must be
recalculated whenever the RDBMS process is started, and
maintained while it is running.
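The use of B(T), R(T) and V(T,F) can be sketched as follows. The class TableStatsSketch and its methods are hypothetical names for this illustration. R(STUDENT) = 900 · 50 = 45,000 follows from the numbers in the text, whereas B(STUDENT) = 4,500 (10 Records per Block) and V(STUDENT, gradyear) = 50 are assumptions made here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for the statistics B(T), R(T) and V(T,F) of one Table T.
class TableStatsSketch {
    final int numBlocks;     // B(T)
    final int numRecords;    // R(T)
    private final Map<String, Integer> distinct = new HashMap<>();   // V(T,F) per Field F

    TableStatsSketch(int numBlocks, int numRecords) {
        this.numBlocks = numBlocks;
        this.numRecords = numRecords;
    }

    // Records V(T,F) for the given Field and returns this object for chaining.
    TableStatsSketch withDistinct(String field, int values) {
        distinct.put(field, values);
        return this;
    }

    int distinctValues(String field) {
        Integer v = distinct.get(field);
        if (v == null)
            throw new IllegalArgumentException("no statistics for field " + field);
        return v;
    }

    // Estimated number of Records in select(T, F = c), assuming that the
    // Values of F are evenly distributed: R(T) / V(T,F).
    int selectSize(String field) {
        return numRecords / distinctValues(field);
    }
}
```

Under these assumptions, selecting the students of one graduation year is estimated to return 45,000 / 50 = 900 Records, while selecting on a key Field with 45,000 distinct Values is estimated to return 1 Record.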
import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.*;

/**
 * The statistics manager, which is responsible for
 * keeping statistical information about each table.
 * The manager does not store this information in the database.
 * Instead, it calculates this information on system startup,
 * and periodically refreshes it.
 * @author Edward Sciore
 */
class StatMgr {
   private TableMgr tblMgr;
   private Map<String,StatInfo> tablestats;
   private int numcalls;

   /**
    * Creates the statistics manager.
    * The initial statistics are calculated by
    * traversing the entire database.
    * @param tx the startup transaction
    */
   public StatMgr(TableMgr tblMgr, Transaction tx) {
      this.tblMgr = tblMgr;
      refreshStatistics(tx);
   }

   /**
    * Returns the statistical information about the specified table.
    * @param tblname the name of the table
    * @param ti the table's metadata
    * @param tx the calling transaction
    * @return the statistical information about the table
    */
   public synchronized StatInfo getStatInfo(String tblname, TableInfo ti, Transaction tx) {
      numcalls++;
      if (numcalls > 100)
         refreshStatistics(tx);
      StatInfo si = tablestats.get(tblname);
      if (si == null) {
         si = calcTableStats(ti, tx);
         tablestats.put(tblname, si);
      }
      return si;
   }

   private synchronized void refreshStatistics(Transaction tx) {
      tablestats = new HashMap<String,StatInfo>();
      numcalls = 0;
      TableInfo tcatmd = tblMgr.getTableInfo("tblcat", tx);
      RecordFile tcatfile = new RecordFile(tcatmd, tx);
      while (tcatfile.next()) {
         String tblname = tcatfile.getString("tblname");
         TableInfo md = tblMgr.getTableInfo(tblname, tx);
         StatInfo si = calcTableStats(md, tx);
         tablestats.put(tblname, si);
      }
      tcatfile.close();
   }

   private synchronized StatInfo calcTableStats(TableInfo ti, Transaction tx) {
      int numRecs = 0;
      RecordFile rf = new RecordFile(ti, tx);
      int numblocks = 0;
      while (rf.next()) {
         numRecs++;
         numblocks = rf.currentRid().blockNumber() + 1;
      }
      rf.close();
      return new StatInfo(numblocks, numRecs);
   }
}
• SimpleDB does not actually compute the true V(T ,F ) values – it just makes a wild
guess. . .
package simpledb.metadata;

/**
 * Holds three pieces of statistical information about a table:
 * the number of blocks, the number of records,
 * and the number of distinct values for each field.
 * @author Edward Sciore
 */
public class StatInfo {
   private int numBlocks;
   private int numRecs;

   /**
    * Creates a StatInfo object.
    * Note that the number of distinct values is not
    * passed into the constructor.
    * The object fakes this value.
    * @param numblocks the number of blocks in the table
    * @param numrecs the number of records in the table
    */
   public StatInfo(int numblocks, int numrecs) {
      this.numBlocks = numblocks;
      this.numRecs = numrecs;
   }

   /**
    * Returns the estimated number of blocks in the table.
    * @return the estimated number of blocks in the table
    */
   public int blocksAccessed() {
      return numBlocks;
   }

   /**
    * Returns the estimated number of records in the table.
    * @return the estimated number of records in the table
    */
   public int recordsOutput() {
      return numRecs;
   }

   /**
    * Returns the estimated number of distinct values
    * for the specified field.
    * In actuality, this estimate is a complete guess.
    * @param fldname the name of the field
    * @return a guess as to the number of distinct field values
    */
   public int distinctValues(String fldname) {
      return 1 + (numRecs / 3);
   }
}
– Table,
– View ,
– Index and
– Statistics.
package simpledb.metadata;

import simpledb.tx.Transaction;
import simpledb.record.*;
import java.util.Map;

public class MetadataMgr {
   private static TableMgr tblmgr;
   private static ViewMgr viewmgr;
   private static StatMgr statmgr;
   private static IndexMgr idxmgr;

   public String getViewDef(String viewname, Transaction tx) {
      return viewmgr.getViewDef(viewname, tx);
   }
}
Figure 63: Scan nodes. (Sciore, 2008)
• Now we build on them the next level of query processing by combining these Tables
with the Relational Algebra operations in section 2.6 which compute answers by
traversing these Files.
leaf node is a File of Records implementing one relational Table – which can be
processed as a result set
internal node is an implementation of a Relational Algebra operation which takes
result set(s) as input and produces another result set as output – which can be
an input into another internal node.
That is, a Scan is a tree of Table Scans and operation Scans, as in Figure 63.
• Figure 64 shows the interface for these Scan nodes. It is similar to Record Files,
except that it. . .
Figure 64: Scans. (Sciore, 2008)
• Both examples
1. first construct the Scan (b) according to the Relational Algebra expression (a)
2. then call its next method while it still has another current row to print.
– When next tells that it has another current row (by returning true), the
printing loop can get its Attribute Values. . .
– . . . but this actually gets the corresponding Field Values from the current
Record (s) of their File(s).
requests go down in the Scan until its leaf Record Files, and their
return values come back up in the Scan.
• Note also that the same Table can have many current Record s at the same time.
• Let us next consider how each kind of query Scan implements its beforeFirst
and next methods.
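The way requests travel down a Scan tree and values travel back up can be sketched without SimpleDB. In this sketch, MiniScan is a hypothetical cut-down stand-in for the Scan interface, ListTableScan replaces a Record File with an in-memory list, and FilterScan plays the role of a selection node. None of these names come from SimpleDB.

```java
import java.util.*;
import java.util.function.Predicate;

/** Hypothetical cut-down version of a scan interface (only what this sketch needs). */
interface MiniScan {
    void beforeFirst();
    boolean next();
    int getInt(String fldname);
}

/** Leaf node: a "table" held in memory instead of in a record file. */
final class ListTableScan implements MiniScan {
    private final List<Map<String, Integer>> rows;
    private int pos = -1;                        // -1 means "before the first row"

    ListTableScan(List<Map<String, Integer>> rows) { this.rows = rows; }

    public void beforeFirst() { pos = -1; }
    public boolean next() { return ++pos < rows.size(); }
    public int getInt(String fldname) { return rows.get(pos).get(fldname); }
}

/** Internal node: a selection that forwards every request to its subscan s. */
final class FilterScan implements MiniScan {
    private final MiniScan s;
    private final Predicate<MiniScan> pred;

    FilterScan(MiniScan s, Predicate<MiniScan> pred) { this.s = s; this.pred = pred; }

    public void beforeFirst() { s.beforeFirst(); }

    public boolean next() {
        while (s.next())                         // the request goes down ...
            if (pred.test(s)) return true;       // ... and the answer comes back up
        return false;
    }

    public int getInt(String fldname) { return s.getInt(fldname); }
}

final class ScanLoopDemo {
    /** The client loop: collects one field of every row that the scan produces. */
    static List<Integer> collect(MiniScan scan, String fld) {
        List<Integer> out = new ArrayList<>();
        scan.beforeFirst();
        while (scan.next())
            out.add(scan.getInt(fld));
        return out;
    }
}
```

The client loop never touches the list directly: it only sees the topmost node, which is the point of the uniform interface.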
Figure 65: One-table scan. (Sciore, 2008)
Figure 66: Two-table scan. (Sciore, 2008)
package simpledb.query;

/**
 * The interface will be implemented by each query scan.
 * There is a Scan class for each relational
 * algebra operator.
 * @author Edward Sciore
 */
public interface Scan {

   /**
    * Positions the scan before its first record.
    */
   public void beforeFirst();

   /**
    * Moves the scan to the next record.
    * @return false if there is no next record
    */
   public boolean next();

   /**
    * Closes the scan and its subscans, if any.
    */
   public void close();

   /**
    * Returns the value of the specified field in the current record.
    * The value is expressed as a Constant.
    * @param fldname the name of the field
    * @return the value of that field, expressed as a Constant.
    */
   public Constant getVal(String fldname);

   /**
    * Returns the value of the specified integer field
    * in the current record.
    * @param fldname the name of the field
    * @return the field's integer value in the current record
    */
   public int getInt(String fldname);

   /**
    * Returns the value of the specified string field
    * in the current record.
    * @param fldname the name of the field
    * @return the field's string value in the current record
    */
   public String getString(String fldname);

   /**
    * Returns true if the scan has the specified field.
    * @param fldname the name of the field
    * @return true if the scan has that field
    */
   public boolean hasField(String fldname);
}
• A Table Scan just redirects its beforeFirst and next methods to the same methods
of its underlying Record File rf.
package simpledb.query;

import static java.sql.Types.INTEGER;
import simpledb.tx.Transaction;
import simpledb.record.*;

/**
 * The Scan class corresponding to a table.
 * A table scan is just a wrapper for a RecordFile object;
 * most methods just delegate to the corresponding
 * RecordFile methods.
 * @author Edward Sciore
 */
public class TableScan implements UpdateScan {
   private RecordFile rf;
   private Schema sch;

   /**
    * Creates a new table scan,
    * and opens its corresponding record file.
    * @param ti the table's metadata
    * @param tx the calling transaction
    */
   public TableScan(TableInfo ti, Transaction tx) {
      rf = new RecordFile(ti, tx);
      sch = ti.schema();
   }

   // Scan methods

   public void beforeFirst() {
      rf.beforeFirst();
   }

   public boolean next() {
      return rf.next();
   }

   public void close() {
      rf.close();
   }

   /**
    * Returns the value of the specified field, as a Constant.
    * The schema is examined to determine the field's type.
    * If INTEGER, then the record file's getInt method is called;
    * otherwise, the getString method is called.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (sch.type(fldname) == INTEGER)
         return new IntConstant(rf.getInt(fldname));
      else
         return new StringConstant(rf.getString(fldname));
   }

   public int getInt(String fldname) {
      return rf.getInt(fldname);
   }

   public String getString(String fldname) {
      return rf.getString(fldname);
   }

   public boolean hasField(String fldname) {
      return sch.hasField(fldname);
   }

   // UpdateScan methods

   /**
    * Sets the value of the specified field, as a Constant.
    * The schema is examined to determine the field's type.
    * If INTEGER, then the record file's setInt method is called;
    * otherwise, the setString method is called.
    * @see simpledb.query.UpdateScan#setVal(java.lang.String, simpledb.query.Constant)
    */
   public void setVal(String fldname, Constant val) {
      if (sch.type(fldname) == INTEGER)
         rf.setInt(fldname, (Integer)val.asJavaVal());
      else
         rf.setString(fldname, (String)val.asJavaVal());
   }

   public void delete() {
      rf.delete();
   }

   public void insert() {
      rf.insert();
   }

   public RID getRid() {
      return rf.currentRid();
   }

   // setInt, setString and moveToRid are not shown in this listing;
   // they delegate to rf in the same way.
}
package simpledb.query;

import simpledb.record.*;

/**
 * The scan class corresponding to the <i>select</i> relational
 * algebra operator.
 * All methods except next delegate their work to the
 * underlying scan.
 * @author Edward Sciore
 */
public class SelectScan implements UpdateScan {
   private Scan s;
   private Predicate pred;

   /**
    * Creates a select scan having the specified underlying
    * scan and predicate.
    * @param s the scan of the underlying query
    * @param pred the selection predicate
    */
   public SelectScan(Scan s, Predicate pred) {
      this.s = s;
      this.pred = pred;
   }

   // Scan methods

   public void beforeFirst() {
      s.beforeFirst();
   }

   /**
    * Move to the next record satisfying the predicate.
    * The method repeatedly calls next on the underlying scan
    * until a suitable record is found, or the underlying scan
    * contains no more records.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      while (s.next())
         if (pred.isSatisfied(s))
            return true;
      return false;
   }

   public void close() {
      s.close();
   }

   public Constant getVal(String fldname) {
      return s.getVal(fldname);
   }

   public int getInt(String fldname) {
      return s.getInt(fldname);
   }

   public String getString(String fldname) {
      return s.getString(fldname);
   }

   // UpdateScan methods

   public void delete() {
      UpdateScan us = (UpdateScan) s;
      us.delete();
   }

   public void insert() {
      UpdateScan us = (UpdateScan) s;
      us.insert();
   }

   public RID getRid() {
      UpdateScan us = (UpdateScan) s;
      return us.getRid();
   }

   // hasField, setVal, setInt, setString and moveToRid are not shown
   // in this listing; they delegate to s in the same way.
}
SimpleDB source file simpledb/query/ProjectScan.java
• It does not actually compute anything about the rows of its subScan s, but just
package simpledb.query;

import java.util.*;

/**
 * The scan class corresponding to the <i>project</i> relational
 * algebra operator.
 * All methods except hasField delegate their work to the
 * underlying scan.
 * @author Edward Sciore
 */
public class ProjectScan implements Scan {
   private Scan s;
   private Collection<String> fieldlist;

   /**
    * Creates a project scan having the specified
    * underlying scan and field list.
    * @param s the underlying scan
    * @param fieldlist the list of field names
    */
   public ProjectScan(Scan s, Collection<String> fieldlist) {
      this.s = s;
      this.fieldlist = fieldlist;
   }

   public void beforeFirst() {
      s.beforeFirst();
   }

   public boolean next() {
      return s.next();
   }

   public void close() {
      s.close();
   }

   public Constant getVal(String fldname) {
      if (hasField(fldname))
         return s.getVal(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   public int getInt(String fldname) {
      if (hasField(fldname))
         return s.getInt(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   public String getString(String fldname) {
      if (hasField(fldname))
         return s.getString(fldname);
      else
         throw new RuntimeException("field " + fldname + " not found.");
   }

   /**
    * Returns true if the specified field
    * is in the projection list.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return fieldlist.contains(fldname);
   }
}
• If the RDBMS constructed its result eagerly then it would use the following 2 nested
for loops:
for each row r1 of the subScan s1
    for each row r2 of the subScan s2
        output the row r with the same Fields and Values as r1 and r2.
• However, since the RDBMS constructs its result lazily in a pipelined fashion one r
at a time, it unrolls these loops into next steps.
The competent programmer is fully aware of the strictly limited size of his
own skull; therefore he approaches the programming task in full humility,
and among other things he avoids clever tricks like the plague. —
E.W. Dijkstra
beforeFirst:
    s1.beforeFirst();
    v1 = s1.next();
    s2.beforeFirst().
next:
    v2 = s2.next();
    if not v2
        v1 = s1.next();
        s2.beforeFirst();
        v2 = s2.next();
    return v1 and v2.
– What if the very first call of s1 .next( ) already returns false in the beforeFirst
method?
– That is, what if s1 is empty?
– Adding this v1 variable handles that.
• The get and set methods redirect their calls to the correct subScan s1 or s2
depending on which of them contains this fieldname.
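The unrolled loops with the v1 flag can be tried out on in-memory lists. This is a minimal sketch, not the SimpleDB ProductScan: ListCursor and ProductCursor are hypothetical names, and rows are plain strings that are concatenated to form the product row.

```java
import java.util.*;

/** Hypothetical cursor over an in-memory list, standing in for a subscan. */
final class ListCursor {
    private final List<String> rows;
    private int pos = -1;                       // -1 means "before the first row"

    ListCursor(List<String> rows) { this.rows = rows; }

    void beforeFirst() { pos = -1; }
    boolean next() { return ++pos < rows.size(); }
    String current() { return rows.get(pos); }
}

/** Product of two cursors with the v1 flag that remembers whether s1 has a valid row. */
final class ProductCursor {
    private final ListCursor s1, s2;
    private boolean v1;                         // is the current row of s1 valid?

    ProductCursor(ListCursor s1, ListCursor s2) {
        this.s1 = s1;
        this.s2 = s2;
        beforeFirst();
    }

    void beforeFirst() {
        s1.beforeFirst();
        v1 = s1.next();                         // false at once if s1 is empty
        s2.beforeFirst();
    }

    boolean next() {
        boolean v2 = s2.next();
        if (!v2) {                              // s2 is exhausted: advance s1 and restart s2
            v1 = s1.next();
            s2.beforeFirst();
            v2 = s2.next();
        }
        return v1 && v2;
    }

    String current() { return s1.current() + s2.current(); }

    /** Helper for experiments: lists all product rows in order. */
    static List<String> drain(ProductCursor p) {
        List<String> out = new ArrayList<>();
        p.beforeFirst();
        while (p.next())
            out.add(p.current());
        return out;
    }
}
```

With s1 = [a, b] and s2 = [x, y] the rows come out in the same order as in the nested loops, and an empty s1 yields no rows because v1 is already false after beforeFirst.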
package simpledb.query;

/**
 * The scan class corresponding to the <i>product</i> relational
 * algebra operator.
 * @author Edward Sciore
 */
public class ProductScan implements Scan {
   private Scan s1, s2;

   /**
    * Creates a product scan having the two underlying scans.
    * @param s1 the LHS scan
    * @param s2 the RHS scan
    */
   public ProductScan(Scan s1, Scan s2) {
      this.s1 = s1;
      this.s2 = s2;
      s1.next();
   }

   /**
    * Positions the scan before its first record.
    * In other words, the LHS scan is positioned at
    * its first record, and the RHS scan
    * is positioned before its first record.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      s1.beforeFirst();
      s1.next();
      s2.beforeFirst();
   }

   /**
    * Moves the scan to the next record.
    * The method moves to the next RHS record, if possible.
    * Otherwise, it moves to the next LHS record and the
    * first RHS record.
    * If there are no more LHS records, the method returns false.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      if (s2.next())
         return true;
      else {
         s2.beforeFirst();
         return s2.next() && s1.next();
      }
   }

   /**
    * Closes both underlying scans.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      s1.close();
      s2.close();
   }

   /**
    * Returns the value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (s1.hasField(fldname))
         return s1.getVal(fldname);
      else
         return s2.getVal(fldname);
   }

   /**
    * Returns the integer value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      if (s1.hasField(fldname))
         return s1.getInt(fldname);
      else
         return s2.getInt(fldname);
   }

   /**
    * Returns the string value of the specified field.
    * The value is obtained from whichever scan
    * contains the field.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      if (s1.hasField(fldname))
         return s1.getString(fldname);
      else
         return s2.getString(fldname);
   }

   /**
    * Returns true if the specified field is in
    * either of the underlying scans.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return s1.hasField(fldname) || s2.hasField(fldname);
   }
}
Extending Scans
beforeFirst method could simply call the s.beforeFirst method of its subScan s.
next method could simply call the s.next method of its subScan s.
hasField method could
1 if fldname = AttrName
2     return the current value of this Expression
3 else return s.getVal(fldname)
• The current value of this Expression on line 2 means its value on the current row of
the subScan s:
Whenever this Expression mentions some fieldname, its value is retrieved with
s.get(fieldname), like in Selection Scans.
Sorting Scans
• SimpleDB does not contain a Scan for the sort(s,AttrList) operation of Relational
Algebra.
• This pipelined query execution would not be very good for sorting its output:
– The first next call must produce the smallest row in the output of its subScan s
(wrt. the lexicographic order on AttrList). . .
– . . . but how can it know which is its smallest row without examining all its
rows?
– There are special cases where it can be known, but in general it cannot.
• One solution would be to rescan the whole output of s whenever the next row is
requested, to find out the next larger row than the most recently found row.
• A better solution is to trade space for time and materialize the whole output
of s once and for all:
s.materialize():
    temp = CREATE a new initially empty database Table with the same Schema as s;
    s.beforeFirst();
    while s.next()
        insert a copy of the current row of s into temp;
    s.close();
    return temp.
– does not need Locking, because other Transactions do not know that it exists.
– needs only its currently last Block in one RAM Buffer, because its earlier
Blocks can be stored in the corresponding temporary disk File.
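Materialize-then-sort can be sketched as follows. The sketch makes strong simplifying assumptions: the subScan is modelled as a java.util.Iterator, and the temporary Table is an in-memory list rather than a disk File. A real RDBMS would have to sort data that does not fit in RAM, for example with an external merge sort. MaterializeSketch and its method names are hypothetical.

```java
import java.util.*;

/** Hypothetical sketch: the temporary table is an in-memory list, not a disk file. */
final class MaterializeSketch {

    /** Copies every row produced by s into a new temporary "table". */
    static <R> List<R> materialize(Iterator<R> s) {
        List<R> temp = new ArrayList<>();
        while (s.hasNext())
            temp.add(s.next());              // insert a copy of the current row into temp
        return temp;
    }

    /** sort(s, order): materializes once, then sorts the temporary copy. */
    static <R> List<R> sortScan(Iterator<R> s, Comparator<? super R> order) {
        List<R> temp = materialize(s);
        temp.sort(order);
        return temp;
    }
}
```

After the single pass over s, the smallest row is known without asking s again, which is what pipelining alone could not provide.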
beforeFirst:
Antijoin Scans
1 s1.beforeFirst().

 1 repeat
 2     v1 = s1.next();
 3     match = false;
 4     s2.beforeFirst();
 5     repeat
 6         v2 = v1 and s2.next();
 7         match = v2 and pred
 8     until (not v2) or match
 9 until (not v1) or (not match);
10 return v1.
• Algorithm and program design principle 2: When you design a loop, describe its
invariant: the defining property of the loop, which holds whenever the loop test is
checked
bound: how the execution of the loop body progresses towards its termination
so that after the loop its invariant and the current status of its loop test together give
what we wanted to achieve with it.
invariant is
– v2 = “Are the current rows of s1 (as told by v1 by line 6) and s2 valid?”
– match = “Do these valid rows satisfy this predicate?” (where this matching
predicate is evaluated on line 7 by getting the appropriate Values from the
current rows of s1 or s2).
– none of the previous rows of s2 have matched.
bound is that the current row of s2 advances towards its end, where it is no longer
valid.
before the current row now
are valid (as told by v1 ) and have a matching row in s2 – for this we use the
result (17) of the inner loop as a lemma.
bound that the current row of s1 advances towards its end
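The two nested repeat-until loops translate directly into Java do-while loops. In this sketch the subScans are in-memory lists accessed by index, and AntijoinCursor is a hypothetical name. The comments refer to the numbered lines of the next pseudocode.

```java
import java.util.*;
import java.util.function.BiPredicate;

/** Hypothetical antijoin over two in-memory lists, mirroring the pseudocode for next. */
final class AntijoinCursor<A, B> {
    private final List<A> s1;
    private final List<B> s2;
    private final BiPredicate<A, B> pred;
    private int i = -1;                          // current row of s1; -1 = before first

    AntijoinCursor(List<A> s1, List<B> s2, BiPredicate<A, B> pred) {
        this.s1 = s1;
        this.s2 = s2;
        this.pred = pred;
    }

    void beforeFirst() { i = -1; }

    boolean next() {
        boolean v1, match;
        do {
            v1 = ++i < s1.size();                // line 2: v1 = s1.next()
            match = false;                       // line 3
            int j = -1;                          // line 4: s2.beforeFirst()
            boolean v2;
            do {
                v2 = v1 && ++j < s2.size();      // line 6: advance s2 only if s1 is valid
                match = v2 && pred.test(s1.get(i), s2.get(j));   // line 7
            } while (v2 && !match);              // line 8: until (not v2) or match
        } while (v1 && match);                   // line 9: until (not v1) or (not match)
        return v1;                               // line 10
    }

    A current() { return s1.get(i); }

    /** Helper for experiments: lists all rows of s1 without a match in s2. */
    List<A> drain() {
        List<A> out = new ArrayList<>();
        beforeFirst();
        while (next())
            out.add(current());
        return out;
    }
}
```

The short-circuit && on line 7 matters: the predicate is evaluated only when both current rows are valid, exactly as the invariant for v2 requires.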
• The previously discussed Query Scans provided methods for getting the named
Attribute Values of the current row.
• A Query is updatable only if this concept of “the RID of my current row” makes
sense – if the RDBMS knows the exact Records to modify.
• In SimpleDB, a
Table Scan is always updatable (because it always has a current RID) and
Selection Scan select(s,p) is updatable, if its subScan s is too (because then its
next RID is the next RID of s satisfying p, if any)
Figure 67: SQL update command and scan. (Sciore, 2008)
– Each row r in the output of product(s1,s2) combines two rows: r1 from its
subScan s1 and r2 from s2.
– Even if these r1 and r2 had RIDs, what would be the RID of r?
– If we wanted to update some Attribute Value r1.a to have a value which depends
on another Attribute Value r2.b (which is why we would like to UPDATE
this product at all) what would this mean?
Which Value(s) of b would we use for this r1.a?
import simpledb.record.RID;

/**
 * The interface implemented by all updateable scans.
 * @author Edward Sciore
 */
public interface UpdateScan extends Scan {

   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new value, expressed as a Constant
    */
   public void setVal(String fldname, Constant val);

   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new integer value
    */
   public void setInt(String fldname, int val);

   /**
    * Modifies the field value of the current record.
    * @param fldname the name of the field
    * @param val the new string value
    */
   public void setString(String fldname, String val);

   /**
    * Inserts a new record somewhere in the scan.
    */
   public void insert();

   /**
    * Deletes the current record from the scan.
    */
   public void delete();

   /**
    * Returns the RID of the current record.
    * @return the RID of the current record
    */
   public RID getRid();

   /**
    * Positions the scan so that the current record has
    * the specified RID.
    * @param rid the RID of the desired record
    */
   public void moveToRid(RID rid);
}
4.7.3 Plans
• Each Scan tells one way how a particular Query can be executed.
• A Plan is otherwise similar to a Scan, but it tells instead roughly how much it would
cost to execute this Scan.
• Keeping Plans and Scans separate in this way makes it easier for the RDBMS to
offer many alternative implementations for the same Relational Algebra operation.
• The Planner component of the RDBMS builds many different Plans for the user’s
Query Q and compares their costs.
• Once this component finds a cheap Plan P for Q, the RDBMS opens this P into
the corresponding Scan S and executes S.
Figure 68: Some cost formulas. (Sciore, 2008)
• The “currency” of these cost estimations is essentially the amount of disk I/O in
the Scan – because that is the central measure of RDBMS performance.
– If s Scans a stored database Table T , then its B(T ), R(T ) and V(T ,F ) val-
ues are the statistical metadata which SimpleDB has collected about T in
section 4.6.
– If s is another kind of Scan like product(s1 ,s2 ) then we can compute its
B(s), R(s) and V(s,F ) values from the B(s1 ), R(s1 ), V(s1 ,F ), B(s2 ), R(s2 )
and V(s2 ,F ) obtained by recursion from its two subScans s1 and s2 with the
cost equations for the product Relational Algebra operation.
• Figure 68 gives these cost equations for the 3 main Relational Algebra operations.
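The recursion for product can be made concrete. Figure 68 is not reproduced in this text, so the sketch assumes the usual nested-loop formulas B(product(s1,s2)) = B(s1) + R(s1)·B(s2) and R(product(s1,s2)) = R(s1)·R(s2). ScanCost is a hypothetical class, and the numbers in the usage note are illustrative only.

```java
/** Hypothetical holder for the two estimates B(s) and R(s); not a SimpleDB class. */
final class ScanCost {
    final long blocks;    // B(s): estimated number of block accesses
    final long records;   // R(s): estimated number of output records

    ScanCost(long blocks, long records) {
        this.blocks = blocks;
        this.records = records;
    }

    /**
     * Estimates for product(s1, s2), assuming s1 is read once and
     * s2 is rescanned from the start for every row of s1.
     */
    static ScanCost product(ScanCost s1, ScanCost s2) {
        long b = s1.blocks + s1.records * s2.blocks;
        long r = s1.records * s2.records;
        return new ScanCost(b, r);
    }
}
```

For a large table with B = 4500, R = 45000 and a small one with B = 2, R = 40, putting the large table first costs 4500 + 45000·2 = 94500 block accesses, whereas the opposite order costs 2 + 40·4500 = 180002. Both orders produce the same 1800000 rows, so the argument order is something a Planner can exploit.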
V is the method distinctValues.
package simpledb.query;

import simpledb.record.Schema;

/**
 * The interface implemented by each query plan.
 * There is a Plan class for each relational algebra operator.
 * @author Edward Sciore
 */
public interface Plan {

   /**
    * Opens a scan corresponding to this plan.
    * The scan will be positioned before its first record.
    * @return a scan
    */
   public Scan open();

   /**
    * Returns an estimate of the number of block accesses
    * that will occur when the scan is read to completion.
    * @return the estimated number of block accesses
    */
   public int blocksAccessed();

   /**
    * Returns an estimate of the number of records
    * in the query's output table.
    * @return the estimated number of output records
    */
   public int recordsOutput();

   /**
    * Returns an estimate of the number of distinct values
    * for the specified field in the query's output table.
    * @param fldname the name of a field
    * @return the estimated number of distinct field values in the output
    */
   public int distinctValues(String fldname);

   /**
    * Returns the schema of the query.
    * @return the query's schema
    */
   public Schema schema();
}
package simpledb.query;

import simpledb.server.SimpleDB;
import simpledb.tx.Transaction;
import simpledb.metadata.*;
import simpledb.record.*;

/** The Plan class corresponding to a table.
 * @author Edward Sciore
 */
public class TablePlan implements Plan {
   private Transaction tx;
   private TableInfo ti;
   private StatInfo si;

   /**
    * Creates a leaf node in the query tree corresponding
    * to the specified table.
    * @param tblname the name of the table
    * @param tx the calling transaction
    */
   public TablePlan(String tblname, Transaction tx) {
      this.tx = tx;
      ti = SimpleDB.mdMgr().getTableInfo(tblname, tx);
      si = SimpleDB.mdMgr().getStatInfo(tblname, ti, tx);
   }

   /**
    * Creates a table scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      return new TableScan(ti, tx);
   }

   /**
    * Estimates the number of block accesses for the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return si.blocksAccessed();
   }

   /**
    * Estimates the number of records in the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return si.recordsOutput();
   }

   /**
    * Estimates the number of distinct field values in the table,
    * which is obtainable from the statistics manager.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return si.distinctValues(fldname);
   }

   /**
    * Determines the schema of the table,
    * which is obtainable from the catalog manager.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return ti.schema();
   }
}
• A Selection Plan implements the Plan interface as follows, where the corresponding
Scan is s0 = select(s1,pred).
– Computing the answer for s0 requires executing the subScan s1 and selecting
those rows which satisfy this predicate.
– Hence it involves accessing the same Blocks as s1, and therefore
B(s) = B(s0) = B(s1)
as in Figure 68.
• Consider then the equation for R(s) when the predicate compares the Value of an
Attribute A with a constant c.
– Assume for simplicity that each of the V(s1,A) distinct Values for A occurs
roughly equally often.
– We are calculating estimates because exact values would be about as hard to
compute as the actual query itself.
– This gives the equation in Figure 68.
• Consider then the equation for R(s) when the predicate compares the Values of 2
Attributes A and B.
– This and the simplicity assumption above lead to the equation in Figure 68.
– If F = A then this selection reduces its Values into just 1, namely this c.
– If F 6= A then we can use the inductive count V(s1 ,F ) directly. . .
– . . . but if the output of s0 has fewer rows than this, then it has only as many
distinct Values left.
– This leads to the equation in Figure 68.
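The estimates described above can be sketched in a few lines. This is only an illustration under the uniformity assumption stated above. The class `SelectionEstimate`, its method names and the numbers are hypothetical and not part of SimpleDB, and Figure 68 itself is not reproduced here. The formulas follow what the SelectPlan listing below computes for a predicate of the form A = c.

```java
/** Hypothetical helper (not part of SimpleDB): selection estimates for "A = c". */
final class SelectionEstimate {
    /** R(s0) = R(s1) / V(s1, A), assuming every Value of A occurs equally often. */
    static int recordsOutput(int recordsIn, int distinctA) {
        return recordsIn / distinctA;
    }

    /** V(s0, F): 1 if F is A itself, otherwise min(V(s1, F), R(s0)). */
    static int distinctValues(boolean fIsA, int distinctF, int recordsOut) {
        return fIsA ? 1 : Math.min(distinctF, recordsOut);
    }

    public static void main(String[] args) {
        // Made-up statistics: 45000 input rows, 40 distinct Values of A.
        int r = recordsOutput(45000, 40);
        System.out.println("R(s0) = " + r);
        System.out.println("V(s0, A) = " + distinctValues(true, 40, r));
        // A field with more distinct Values than output rows is capped by R(s0).
        System.out.println("V(s0, F) = " + distinctValues(false, 44960, r));
    }
}
```

The cap in `distinctValues` reflects the last two sub-bullets: a selection cannot output more distinct Values of F than it outputs rows.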
package simpledb.query;
import simpledb.record.Schema;
public class SelectPlan implements Plan {
   private Plan p;
   private Predicate pred;
   /**
    * Creates a new select node in the query tree,
    * having the specified subquery and predicate.
    * @param p the subquery
    * @param pred the predicate
    */
   public SelectPlan(Plan p, Predicate pred) {
      this.p = p;
      this.pred = pred;
   }
   /**
    * Creates a select scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p.open();
      return new SelectScan(s, pred);
   }
   /**
    * Estimates the number of block accesses in the selection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p.blocksAccessed();
   }
   /**
    * Estimates the number of output records in the selection,
    * which is determined by the
    * reduction factor of the predicate.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p.recordsOutput() / pred.reductionFactor(p);
   }
   /**
    * Estimates the number of distinct field values
    * in the selection.
    * If the predicate contains a term equating the specified
    * field to a constant, then this value will be 1.
    * Otherwise, it will be the number of the distinct values
    * in the underlying query
    * (but not more than the size of the output table).
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (pred.equatesWithConstant(fldname) != null)
         return 1;
      else {
         String fldname2 = pred.equatesWithField(fldname);
         if (fldname2 != null)
            return Math.min(p.distinctValues(fldname),
                            p.distinctValues(fldname2));
         else
            return Math.min(p.distinctValues(fldname),
                            recordsOutput());
      }
   }
   /**
    * Returns the schema of the selection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return p.schema();
   }
}
• A Projection Plan implements the Plan interface by delegating the cost methods to the subPlan.
• This is because the projection Relational Algebra operation only modifies the Schema, while the actual rows stay the same.
package simpledb.query;
import simpledb.record.Schema;
import java.util.Collection;
public class ProjectPlan implements Plan {
   private Plan p;
   private Schema schema = new Schema();
   /**
    * Creates a new project node in the query tree,
    * having the specified subquery and field list.
    * @param p the subquery
    * @param fieldlist the list of fields
    */
   public ProjectPlan(Plan p, Collection<String> fieldlist) {
      this.p = p;
      for (String fldname : fieldlist)
         schema.add(fldname, p.schema());
   }
   /**
    * Creates a project scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p.open();
      return new ProjectScan(s, schema.fields());
   }
   /**
    * Estimates the number of block accesses in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p.blocksAccessed();
   }
   /**
    * Estimates the number of output records in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p.recordsOutput();
   }
   /**
    * Estimates the number of distinct field values
    * in the projection,
    * which is the same as in the underlying query.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return p.distinctValues(fldname);
   }
   /**
    * Returns the schema of the projection,
    * which is taken from the field list.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return schema;
   }
}
• For a Product Plan s = product(s1, s2), the value V(s, F) is V(si, F) for the subPlan si whose Schema has this F.
• The equation
\[ B(s) = \underbrace{B(s_1)}_{\text{outer loop}} + \underbrace{R(s_1) \cdot B(s_2)}_{\text{inner loops}} \tag{18} \]
• Rewriting
where
• If s1 and s2 are Tables, then Eq. (19) says that their product is cheaper if the Table with larger Records comes first.
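The asymmetry of Eq. (18) can be checked numerically. The sketch below is a hypothetical helper, not SimpleDB code, and the table statistics are made up for illustration. It evaluates B(s) = B(s1) + R(s1) · B(s2) for both orders of two Tables, using `long` arithmetic to avoid overflowing `int` on large inputs.

```java
/** Hypothetical helper (not part of SimpleDB): compares the two product orders. */
final class ProductCost {
    /** Eq. (18): B(s) = B(s1) + R(s1) * B(s2). */
    static long blocksAccessed(long b1, long r1, long b2) {
        return b1 + r1 * b2;
    }

    public static void main(String[] args) {
        // Made-up Tables: "big" has 10 Records per Block (larger Records),
        // "small" has 20 Records per Block (smaller Records).
        long bigB = 4500, bigR = 45000;
        long smallB = 2, smallR = 40;

        long bigFirst = blocksAccessed(bigB, bigR, smallB);
        long smallFirst = blocksAccessed(smallB, smallR, bigB);

        System.out.println("big first:   " + bigFirst);
        System.out.println("small first: " + smallFirst);
    }
}
```

With these numbers the order that puts the Table with the larger Records first needs fewer Block accesses, which agrees with the claim above.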
package simpledb.query;
import simpledb.record.Schema;
public class ProductPlan implements Plan {
   private Plan p1, p2;
   private Schema schema = new Schema();
   /**
    * Creates a new product node in the query tree,
    * having the two specified subqueries.
    * @param p1 the left-hand subquery
    * @param p2 the right-hand subquery
    */
   public ProductPlan(Plan p1, Plan p2) {
      this.p1 = p1;
      this.p2 = p2;
      schema.addAll(p1.schema());
      schema.addAll(p2.schema());
   }
   /**
    * Creates a product scan for this query.
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s1 = p1.open();
      Scan s2 = p2.open();
      return new ProductScan(s1, s2);
   }
   /**
    * Estimates the number of block accesses in the product.
    * The formula is:
    * <pre> B(product(p1,p2)) = B(p1) + R(p1)*B(p2) </pre>
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p1.blocksAccessed() + (p1.recordsOutput() * p2.blocksAccessed());
   }
   /**
    * Estimates the number of output records in the product.
    * The formula is:
    * <pre> R(product(p1,p2)) = R(p1)*R(p2) </pre>
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p1.recordsOutput() * p2.recordsOutput();
   }
   /**
    * Estimates the distinct number of field values in the product.
    * Since the product does not increase or decrease field values,
    * the estimate is the same as in the appropriate underlying query.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (p1.schema().hasField(fldname))
         return p1.distinctValues(fldname);
      else
         return p2.distinctValues(fldname);
   }
   /**
    * Returns the schema of the product,
    * which is the union of the schemas of the underlying queries.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return schema;
   }
}
• Figures 69 and 70 give an example of calculating the cost of retrieving the names of the math majors.
(a) gives the Query tree which determines the Scan and Plan to consider.
(b) gives the SimpleDB client method calls which it would execute.
(c) gives its cost in our University example.
• Eq. (19) says that it would have been better to swap s1 and s3 in s4.
4.7.4 Predicates
• Until now we have skipped the SimpleDB implementation of selection predicates.
• SimpleDB supports only conjunctions (that is, ANDs) of Terms, where each Term is one of
– AttrName = AttrName or
– AttrName = constant.
• Full SQL offers much more detailed predicates in its WHERE parts.
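A conjunction of equality Terms can be evaluated against a row as sketched below. This is a simplified stand-in and not the SimpleDB classes listed later: the names `ConjunctionSketch` and `EqTerm` are hypothetical, a row is modelled as a `Map` from field names to `String` values, and a flag tells whether the right-hand side is another field name or a constant.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not SimpleDB code): a predicate as an AND of equality terms. */
final class ConjunctionSketch {
    /** One term "lhsField = rhs", where rhs is a field name or a constant. */
    static final class EqTerm {
        final String lhsField;
        final String rhs;
        final boolean rhsIsField;

        EqTerm(String lhsField, String rhs, boolean rhsIsField) {
            this.lhsField = lhsField;
            this.rhs = rhs;
            this.rhsIsField = rhsIsField;
        }

        boolean isSatisfied(Map<String, String> row) {
            String left = row.get(lhsField);
            String right = rhsIsField ? row.get(rhs) : rhs;
            return left != null && left.equals(right);
        }
    }

    /** The conjunction holds if every term holds; an empty list means "true". */
    static boolean isSatisfied(List<EqTerm> terms, Map<String, String> row) {
        for (EqTerm t : terms)
            if (!t.isSatisfied(row))
                return false;
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of("majorid", "10", "did", "10", "dname", "math");
        List<EqTerm> pred = List.of(
            new EqTerm("majorid", "did", true),     // AttrName = AttrName
            new EqTerm("dname", "math", false));    // AttrName = constant
        System.out.println(isSatisfied(pred, row));
    }
}
```

Restricting predicates to conjunctions keeps evaluation a single loop that stops at the first unsatisfied Term.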
Figure 69: Cost estimation example. (Sciore, 2008)
Figure 70: Figure 69 continued. (Sciore, 2008)
• This predicate handling involves a lot of code, which
– the Parser component of the RDBMS invokes when it parses the WHERE part of an SQL statement into the corresponding predicate, and
– the Query and Planner components invoke when they process this predicate constructed by the Parser.
/**
 * The interface that denotes values stored in the database.
 * @author Edward Sciore
 */
public interface Constant extends Comparable<Constant> {
   /**
    * Returns the Java object corresponding to this constant.
    * @return the Java value of the constant
    */
   public Object asJavaVal();
}
/**
 * The class that wraps Java strings as database constants.
 * @author Edward Sciore
 */
public class StringConstant implements Constant {
   private String val;
   /**
    * Create a constant by wrapping the specified string.
    * @param s the string value
    */
   public StringConstant(String s) {
      val = s;
   }
   /**
    * Unwraps the string and returns it.
    * @see simpledb.query.Constant#asJavaVal()
    */
   public String asJavaVal() {
      return val;
   }
   public boolean equals(Object obj) {
      StringConstant sc = (StringConstant) obj;
      return sc != null && val.equals(sc.val);
   }
   public int compareTo(Constant c) {
      StringConstant sc = (StringConstant) c;
      return val.compareTo(sc.val);
   }
   public int hashCode() {
      return val.hashCode();
   }
   public String toString() {
      return val;
   }
}
/**
 * The class that wraps Java ints as database constants.
 * @author Edward Sciore
 */
public class IntConstant implements Constant {
   private Integer val;
   /**
    * Create a constant by wrapping the specified int.
    * @param n the int value
    */
   public IntConstant(int n) {
      val = Integer.valueOf(n);
   }
   /**
    * Unwraps the Integer and returns it.
    * @see simpledb.query.Constant#asJavaVal()
    */
   public Object asJavaVal() {
      return val;
   }
   public boolean equals(Object obj) {
      IntConstant ic = (IntConstant) obj;
      return ic != null && val.equals(ic.val);
   }
   public int compareTo(Constant c) {
      IntConstant ic = (IntConstant) c;
      return val.compareTo(ic.val);
   }
   public int hashCode() {
      return val.hashCode();
   }
   public String toString() {
      return val.toString();
   }
}
import simpledb.record.Schema;
/**
 * An expression consisting entirely of a single constant.
 * @author Edward Sciore
 */
public class ConstantExpression implements Expression {
   private Constant val;
   /**
    * Creates a new expression by wrapping a constant.
    * @param c the constant
    */
   public ConstantExpression(Constant c) {
      val = c;
   }
   /**
    * Returns true.
    * @see simpledb.query.Expression#isConstant()
    */
   public boolean isConstant() {
      return true;
   }
   /**
    * Returns false.
    * @see simpledb.query.Expression#isFieldName()
    */
   public boolean isFieldName() {
      return false;
   }
   /**
    * Unwraps the constant and returns it.
    * @see simpledb.query.Expression#asConstant()
    */
   public Constant asConstant() {
      return val;
   }
   /**
    * This method should never be called.
    * Throws a ClassCastException.
    * @see simpledb.query.Expression#asFieldName()
    */
   public String asFieldName() {
      throw new ClassCastException();
   }
   /**
    * Returns the constant, regardless of the scan.
    * @see simpledb.query.Expression#evaluate(simpledb.query.Scan)
    */
   public Constant evaluate(Scan s) {
      return val;
   }
   /**
    * Returns true, because a constant applies to any schema.
    * @see simpledb.query.Expression#appliesTo(simpledb.record.Schema)
    */
   public boolean appliesTo(Schema sch) {
      return true;
   }
   public String toString() {
      return val.toString();
   }
}
import simpledb.record.Schema;
/**
 * The interface corresponding to SQL expressions.
 * @author Edward Sciore
 */
public interface Expression {
   /**
    * Returns true if the expression is a constant.
    * @return true if the expression is a constant
    */
   public boolean isConstant();
   /**
    * Returns true if the expression is a field reference.
    * @return true if the expression denotes a field
    */
   public boolean isFieldName();
   /**
    * Returns the constant corresponding to a constant expression.
    * Throws an exception if the expression does not
    * denote a constant.
    * @return the expression as a constant
    */
   public Constant asConstant();
   /**
    * Returns the field name corresponding to a field name expression.
    * Throws an exception if the expression does not
    * denote a field.
    * @return the expression as a field name
    */
   public String asFieldName();
   /**
    * Evaluates the expression with respect to the
    * current record of the specified scan.
    * @param s the scan
    * @return the value of the expression, as a Constant
    */
   public Constant evaluate(Scan s);
   /**
    * Determines if all of the fields mentioned in this expression
    * are contained in the specified schema.
    * @param sch the schema
    * @return true if all fields in the expression are in the schema
    */
   public boolean appliesTo(Schema sch);
}
import simpledb.record.Schema;
/**
 * An expression consisting entirely of a single field.
 * @author Edward Sciore
 */
public class FieldNameExpression implements Expression {
   private String fldname;
   /**
    * Creates a new expression by wrapping a field.
    * @param fldname the name of the wrapped field
    */
   public FieldNameExpression(String fldname) {
      this.fldname = fldname;
   }
   /**
    * Returns false.
    * @see simpledb.query.Expression#isConstant()
    */
   public boolean isConstant() {
      return false;
   }
   /**
    * Returns true.
    * @see simpledb.query.Expression#isFieldName()
    */
   public boolean isFieldName() {
      return true;
   }
   /**
    * This method should never be called.
    * Throws a ClassCastException.
    * @see simpledb.query.Expression#asConstant()
    */
   public Constant asConstant() {
      throw new ClassCastException();
   }
   /**
    * Unwraps the field name and returns it.
    * @see simpledb.query.Expression#asFieldName()
    */
   public String asFieldName() {
      return fldname;
   }
   /**
    * Evaluates the field by getting its value in the scan.
    * @see simpledb.query.Expression#evaluate(simpledb.query.Scan)
    */
   public Constant evaluate(Scan s) {
      return s.getVal(fldname);
   }
   /**
    * Returns true if the field is in the specified schema.
    * @see simpledb.query.Expression#appliesTo(simpledb.record.Schema)
    */
   public boolean appliesTo(Schema sch) {
      return sch.hasField(fldname);
   }
   public String toString() {
      return fldname;
   }
}
import simpledb.record.Schema;
/**
 * A term is a comparison between two expressions.
 * @author Edward Sciore
 */
public class Term {
   private Expression lhs, rhs;
   /**
    * Creates a new term that compares two expressions
    * for equality.
    * @param lhs the LHS expression
    * @param rhs the RHS expression
    */
   public Term(Expression lhs, Expression rhs) {
      this.lhs = lhs;
      this.rhs = rhs;
   }
   /**
    * Calculates the extent to which selecting on the term reduces
    * the number of records output by a query.
    * For example if the reduction factor is 2, then the
    * term cuts the size of the output in half.
    * @param p the query's plan
    * @return the integer reduction factor.
    */
   public int reductionFactor(Plan p) {
      String lhsName, rhsName;
      if (lhs.isFieldName() && rhs.isFieldName()) {
         lhsName = lhs.asFieldName();
         rhsName = rhs.asFieldName();
         return Math.max(p.distinctValues(lhsName),
                         p.distinctValues(rhsName));
      }
      if (lhs.isFieldName()) {
         lhsName = lhs.asFieldName();
         return p.distinctValues(lhsName);
      }
      if (rhs.isFieldName()) {
         rhsName = rhs.asFieldName();
         return p.distinctValues(rhsName);
      }
      // otherwise, the term equates constants
      if (lhs.asConstant().equals(rhs.asConstant()))
         return 1;
      else
         return Integer.MAX_VALUE;
   }
   /**
    * Determines if this term is of the form "F=c"
    * where F is the specified field and c is some constant.
    * If so, the method returns that constant.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the constant or null
    */
   public Constant equatesWithConstant(String fldname) {
      if (lhs.isFieldName() &&
          lhs.asFieldName().equals(fldname) &&
          rhs.isConstant())
         return rhs.asConstant();
      else if (rhs.isFieldName() &&
               rhs.asFieldName().equals(fldname) &&
               lhs.isConstant())
         return lhs.asConstant();
      else
         return null;
   }
   /**
    * Determines if this term is of the form "F1=F2"
    * where F1 is the specified field and F2 is another field.
    * If so, the method returns the name of that field.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the name of the other field, or null
    */
   public String equatesWithField(String fldname) {
      if (lhs.isFieldName() &&
          lhs.asFieldName().equals(fldname) &&
          rhs.isFieldName())
         return rhs.asFieldName();
      else if (rhs.isFieldName() &&
               rhs.asFieldName().equals(fldname) &&
               lhs.isFieldName())
         return lhs.asFieldName();
      else
         return null;
   }
   /**
    * Returns true if both of the term's expressions
    * apply to the specified schema.
    * @param sch the schema
    * @return true if both expressions apply to the schema
    */
   public boolean appliesTo(Schema sch) {
      return lhs.appliesTo(sch) && rhs.appliesTo(sch);
   }
   /**
    * Returns true if both of the term's expressions
    * evaluate to the same constant,
    * with respect to the specified scan.
    * @param s the scan
    * @return true if both expressions have the same value in the scan
    */
   public boolean isSatisfied(Scan s) {
      Constant lhsval = lhs.evaluate(s);
      Constant rhsval = rhs.evaluate(s);
      return rhsval.equals(lhsval);
   }
   public String toString() {
      return lhs.toString() + "=" + rhs.toString();
   }
}
import simpledb.record.Schema;
import java.util.*;
/**
 * A predicate is a Boolean combination of terms.
 * @author Edward Sciore
 */
public class Predicate {
   private List<Term> terms = new ArrayList<Term>();
   /**
    * Creates an empty predicate, corresponding to "true".
    */
   public Predicate() {}
   /**
    * Creates a predicate containing a single term.
    * @param t the term
    */
   public Predicate(Term t) {
      terms.add(t);
   }
   /**
    * Modifies the predicate to be the conjunction of
    * itself and the specified predicate.
    * @param pred the other predicate
    */
   public void conjoinWith(Predicate pred) {
      terms.addAll(pred.terms);
   }
   /**
    * Returns true if the predicate evaluates to true
    * with respect to the specified scan.
    * @param s the scan
    * @return true if the predicate is true in the scan
    */
   public boolean isSatisfied(Scan s) {
      for (Term t : terms)
         if (!t.isSatisfied(s))
            return false;
      return true;
   }
   /**
    * Calculates the extent to which selecting on the predicate
    * reduces the number of records output by a query.
    * For example if the reduction factor is 2, then the
    * predicate cuts the size of the output in half.
    * @param p the query's plan
    * @return the integer reduction factor.
    */
   public int reductionFactor(Plan p) {
      int factor = 1;
      for (Term t : terms)
         factor *= t.reductionFactor(p);
      return factor;
   }
   /**
    * Returns the subpredicate that applies to the specified schema.
    * @param sch the schema
    * @return the subpredicate applying to the schema
    */
   public Predicate selectPred(Schema sch) {
      Predicate result = new Predicate();
      for (Term t : terms)
         if (t.appliesTo(sch))
            result.terms.add(t);
      if (result.terms.size() == 0)
         return null;
      else
         return result;
   }
   /**
    * Returns the subpredicate consisting of terms that apply
    * to the union of the two specified schemas,
    * but not to either schema separately.
    * @param sch1 the first schema
    * @param sch2 the second schema
    * @return the subpredicate whose terms apply to the union of
    *         the two schemas but not either schema separately.
    */
   public Predicate joinPred(Schema sch1, Schema sch2) {
      Predicate result = new Predicate();
      Schema newsch = new Schema();
      newsch.addAll(sch1);
      newsch.addAll(sch2);
      for (Term t : terms)
         if (!t.appliesTo(sch1) &&
             !t.appliesTo(sch2) &&
             t.appliesTo(newsch))
            result.terms.add(t);
      if (result.terms.size() == 0)
         return null;
      else
         return result;
   }
   /**
    * Determines if there is a term of the form "F=c"
    * where F is the specified field and c is some constant.
    * If so, the method returns that constant.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return either the constant or null
    */
   public Constant equatesWithConstant(String fldname) {
      for (Term t : terms) {
         Constant c = t.equatesWithConstant(fldname);
         if (c != null)
            return c;
      }
      return null;
   }
   /**
    * Determines if there is a term of the form "F1=F2"
    * where F1 is the specified field and F2 is another field.
    * If so, the method returns the name of that field.
    * If not, the method returns null.
    * @param fldname the name of the field
    * @return the name of the other field, or null
    */
   public String equatesWithField(String fldname) {
      for (Term t : terms) {
         String s = t.equatesWithField(fldname);
         if (s != null)
            return s;
      }
      return null;
   }
   public String toString() {
      Iterator<Term> iter = terms.iterator();
      if (!iter.hasNext())
         return "";
      String result = iter.next().toString();
      while (iter.hasNext())
         result += " and " + iter.next().toString();
      return result;
   }
}
4.8 Parsing SQL Statements
Figure 71: An example of a parse tree. (Sciore, 2008)
• In particular, SQL has been designed so that hand-written recursive descent LL(1)
parsing is enough.
• However, full SQL is so large that using a dedicated parser generator tool like yacc
instead would be a good idea. (Levine et al., 1992, Appendix J)
• The SimpleDB subset of SQL was given in Figure 25. Its recursive descent parser is listed here.
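The idea of recursive descent can be shown on the predicate part of the grammar alone: each nonterminal becomes one method which looks at the current token and calls the methods of the nonterminals on its right-hand side. The sketch below is a hypothetical miniature, not the SimpleDB parser. The class `MiniPredicateParser` and its output format (a list of "lhs=rhs" strings) are assumptions made for illustration, and unlike the SimpleDB Lexer it does not reject keywords used as identifiers. It relies only on the standard `java.io.StreamTokenizer`.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature recursive descent parser for "a = b and c = 'x'". */
final class MiniPredicateParser {
    private final StreamTokenizer tok;

    MiniPredicateParser(String s) {
        tok = new StreamTokenizer(new StringReader(s));
        tok.lowerCaseMode(true); // identifiers and keywords are lower-cased
        advance();               // load the first (lookahead) token
    }

    /** Predicate := Term [ "and" Predicate ] */
    List<String> predicate() {
        List<String> terms = new ArrayList<>();
        terms.add(term());
        if (tok.ttype == StreamTokenizer.TT_WORD && tok.sval.equals("and")) {
            advance();
            terms.addAll(predicate());
        }
        return terms;
    }

    /** Term := Expression "=" Expression */
    private String term() {
        String lhs = expression();
        if (tok.ttype != '=')
            throw new IllegalArgumentException("expected '='");
        advance();
        return lhs + "=" + expression();
    }

    /** Expression := identifier | integer | 'string' */
    private String expression() {
        String text;
        if (tok.ttype == StreamTokenizer.TT_WORD)
            text = tok.sval;
        else if (tok.ttype == StreamTokenizer.TT_NUMBER)
            text = Integer.toString((int) tok.nval);
        else if (tok.ttype == '\'')
            text = "'" + tok.sval + "'"; // quoted strings keep their case
        else
            throw new IllegalArgumentException("expected an expression");
        advance();
        return text;
    }

    private void advance() {
        try {
            tok.nextToken();
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    static List<String> parse(String s) {
        return new MiniPredicateParser(s).predicate();
    }
}
```

One token of lookahead (`tok.ttype`) suffices to choose the branch in every method, which is what makes the grammar LL(1).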
SimpleDB source file simpledb/parse/Lexer.java
package s i m p l e d b . p a r s e ;
import j a v a . u t i l . ∗ ;
import j a v a . i o . ∗ ;
/∗ ∗
∗ The l e x i c a l a n a l y z e r .
∗ @ a u t h o r Edward S c i o r e
∗/
public c l a s s L e x e r {
p r i v a t e C o l l e c t i o n <S t r i n g > k ey w o rd s ;
private StreamTokenizer tok ;
/∗ ∗
∗ C r e a t e s a new l e x i c a l a n a l y z e r f o r SQL s t a t e m e n t s .
∗ @param s t h e SQL s t a t e m e n t
∗/
public L e x e r ( S t r i n g s ) {
initKeywords () ;
t o k = new S t r e a m T o k e n i z e r (new S t r i n g R e a d e r ( s ) ) ;
tok . ordinaryChar ( ’ . ’ ) ;
t o k . lowerCaseMode ( true ) ; // i d s and k e y w o r d s a r e c o n v e r t e d
nextToken ( ) ;
}
/∗ ∗
∗ Returns t r ue i f the current token i s
∗ the s p e c i f i e d delimiter character .
∗ @param d a c h a r a c t e r d e n o t i n g t h e d e l i m i t e r
∗ @return t r u e i f t h e d e l i m i t e r i s t h e c u r r e n t token
∗/
public boolean matchDelim ( char d ) {
return d == ( char ) t o k . t t y p e ;
}
/∗ ∗
∗ R e t u r n s t r u e i f t h e c u r r e n t t o k e n i s an i n t e g e r .
∗ @ r e t u r n t r u e i f t h e c u r r e n t t o k e n i s an i n t e g e r
∗/
public boolean m a t c h I n t C o n s t a n t ( ) {
return t o k . t t y p e == S t r e a m T o k e n i z e r .TT NUMBER;
}
/∗ ∗
∗ Returns t r ue i f the current token i s a string .
∗ @return t r u e i f t h e c u r r e n t token i s a string
∗/
public boolean m a t c h S t r i n g C o n s t a n t ( ) {
return ’ \ ’ ’ == ( char ) t o k . t t y p e ;
}
/∗ ∗
∗ Returns t r u e i f t h e c u r r e n t token i s t h e s p e c i f i e d keyword .
∗ @param w t h e k e y w o r d s t r i n g
∗ @return t r u e i f t h a t keyword i s t h e c u r r e n t token
∗/
public boolean matchKeyword ( S t r i n g w) {
return t o k . t t y p e == S t r e a m T o k e n i z e r .TT WORD && t o k . s v a l . e q u a l s (w) ;
}
/∗ ∗
∗ Returns t r ue i f the current token i s a l e g a l i d e n t i f i e r .
∗ @ r e t u r n t r u e i f t h e c u r r e n t t o k e n i s an i d e n t i f i e r
∗/
public boolean matchId ( ) {
return t o k . t t y p e==S t r e a m T o k e n i z e r .TT WORD && ! k e yw o rd s . c o n t a i n s ( t o k . s v a l ) ;
}
// M e t h o d s t o ” e a t ” t h e current token
/∗ ∗
∗ Throws an e x c e p t i o n i f t h e c u r r e n t t o k e n i s not the
∗ specified delimiter .
∗ O t h e r w i s e , moves t o t h e n e x t t o k e n .
∗ @param d a c h a r a c t e r d e n o t i n g t h e d e l i m i t e r
∗/
public void e a t D e l i m ( char d ) {
i f ( ! matchDelim ( d ) )
throw new B a d S y n t a x E x c e p t i o n ( ) ;
nextToken ( ) ;
}
/∗ ∗
∗ Throws an e x c e p t i o n i f t h e c u r r e n t t o k e n i s n o t
∗ an i n t e g e r .
∗ O t h e r w i s e , r e t u r n s t h a t i n t e g e r and moves t o t h e next token .
∗ @return t h e i n t e g e r v a l u e o f t h e c u r r e n t token
∗/
public i n t e a t I n t C o n s t a n t ( ) {
i f ( ! matchIntConstant ( ) )
throw new B a d S y n t a x E x c e p t i o n ( ) ;
int i = ( int ) tok . nval ;
nextToken ( ) ;
return i ;
}
/∗ ∗
∗ Throws an e x c e p t i o n if the current token is not
∗ a string .
188
∗ O t h e r w i s e , r e t u r n s t h a t s t r i n g and moves t o t h e n e x t t o k e n .
∗ @return t h e s t r i n g v a l u e o f t h e c u r r e n t token
∗/
public S t r i n g e a t S t r i n g C o n s t a n t ( ) {
i f ( ! matchStringConstant ( ) )
throw new B a d S y n t a x E x c e p t i o n ( ) ;
S t r i n g s = t o k . s v a l ; // c o n s t a n t s a r e n o t c o n v e r t e d t o l o w e r case
nextToken ( ) ;
return s ;
}

    /**
     * Throws an exception if the current token is not the
     * specified keyword.
     * Otherwise, moves to the next token.
     * @param w the keyword string
     */
    public void eatKeyword(String w) {
        if (!matchKeyword(w))
            throw new BadSyntaxException();
        nextToken();
    }

    /**
     * Throws an exception if the current token is not
     * an identifier.
     * Otherwise, returns the identifier string
     * and moves to the next token.
     * @return the string value of the current token
     */
    public String eatId() {
        if (!matchId())
            throw new BadSyntaxException();
        String s = tok.sval;
        nextToken();
        return s;
    }

    private void nextToken() {
        try {
            tok.nextToken();
        }
        catch (IOException e) {
            throw new BadSyntaxException();
        }
    }

    private void initKeywords() {
        keywords = Arrays.asList("select", "from", "where", "and",
                                 "insert", "into", "values", "delete", "update", "set",
                                 "create", "table", "int", "varchar", "view", "as", "index", "on");
    }
}
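The fields tok.nval and tok.sval used above suggest that tok is a java.io.StreamTokenizer. Under that assumption, the following minimal sketch shows how such a tokenizer splits an SQL string into the token kinds that the match and eat methods distinguish. The class name LexerSketch, the method classify and the shortened keyword list are hypothetical and not part of SimpleDB.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LexerSketch {
    // Hypothetical subset of the keyword list, for illustration only.
    private static final List<String> KEYWORDS =
        Arrays.asList("select", "from", "where", "and");

    /** Labels every token of s with its kind. */
    static List<String> classify(String s) {
        List<String> out = new ArrayList<>();
        StreamTokenizer tok = new StreamTokenizer(new StringReader(s));
        tok.lowerCaseMode(true); // word tokens are lower-cased, quoted strings are not
        try {
            while (tok.nextToken() != StreamTokenizer.TT_EOF) {
                if (tok.ttype == StreamTokenizer.TT_NUMBER)
                    out.add("int:" + (int) tok.nval);
                else if (tok.ttype == '\'')
                    out.add("string:" + tok.sval);
                else if (tok.ttype == StreamTokenizer.TT_WORD)
                    out.add((KEYWORDS.contains(tok.sval) ? "keyword:" : "id:") + tok.sval);
                else
                    out.add("delim:" + (char) tok.ttype);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

Each label corresponds to one family of match/eat methods: keywords and identifiers are both word tokens and differ only by membership in the keyword list.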
import java.util.*;
import simpledb.query.*;
import simpledb.record.Schema;

/**
 * The SimpleDB parser.
 * @author Edward Sciore
 */
public class Parser {
    private Lexer lex;

    public Parser(String s) {
        lex = new Lexer(s);
    }

    public String field() {
        return lex.eatId();
    }

    public Constant constant() {
        if (lex.matchStringConstant())
            return new StringConstant(lex.eatStringConstant());
        else
            return new IntConstant(lex.eatIntConstant());
    }

    public Expression expression() {
        if (lex.matchId())
            return new FieldNameExpression(field());
        else
            return new ConstantExpression(constant());
    }

    public Predicate predicate() {
        Predicate pred = new Predicate(term());
        if (lex.matchKeyword("and")) {
            lex.eatKeyword("and");
            pred.conjoinWith(predicate());
        }
        return pred;
    }

    public QueryData query() {
        lex.eatKeyword("select");
        Collection<String> fields = selectList();
        lex.eatKeyword("from");
        Collection<String> tables = tableList();
        Predicate pred = new Predicate();
        if (lex.matchKeyword("where")) {
            lex.eatKeyword("where");
            pred = predicate();
        }
        return new QueryData(fields, tables, pred);
    }

    private Collection<String> selectList() {
        Collection<String> L = new ArrayList<String>();
        L.add(field());
        if (lex.matchDelim(',')) {
            lex.eatDelim(',');
            L.addAll(selectList());
        }
        return L;
    }

    private Collection<String> tableList() {
        Collection<String> L = new ArrayList<String>();
        L.add(lex.eatId());
        if (lex.matchDelim(',')) {
            lex.eatDelim(',');
            L.addAll(tableList());
        }
        return L;
    }

    public Object updateCmd() {
        if (lex.matchKeyword("insert"))
            return insert();
        else if (lex.matchKeyword("delete"))
            return delete();
        else if (lex.matchKeyword("update"))
            return modify();
        else
            return create();
    }

    private Object create() {
        lex.eatKeyword("create");
        if (lex.matchKeyword("table"))
            return createTable();
        else if (lex.matchKeyword("view"))
            return createView();
        else
            return createIndex();
    }

    public DeleteData delete() {
        lex.eatKeyword("delete");
        lex.eatKeyword("from");
        String tblname = lex.eatId();
        Predicate pred = new Predicate();
        if (lex.matchKeyword("where")) {
            lex.eatKeyword("where");
            pred = predicate();
        }
        return new DeleteData(tblname, pred);
    }

    public InsertData insert() {
        lex.eatKeyword("insert");
        lex.eatKeyword("into");
        String tblname = lex.eatId();
        lex.eatDelim('(');
        List<String> flds = fieldList();
        lex.eatDelim(')');
        lex.eatKeyword("values");
        lex.eatDelim('(');
        List<Constant> vals = constList();
        lex.eatDelim(')');
        return new InsertData(tblname, flds, vals);
    }

    private List<String> fieldList() {
        List<String> L = new ArrayList<String>();
        L.add(field());
        if (lex.matchDelim(',')) {
            lex.eatDelim(',');
            L.addAll(fieldList());
        }
        return L;
    }

    private List<Constant> constList() {
        List<Constant> L = new ArrayList<Constant>();
        L.add(constant());
        if (lex.matchDelim(',')) {
            lex.eatDelim(',');
            L.addAll(constList());
        }
        return L;
    }

    public ModifyData modify() {
        lex.eatKeyword("update");
        String tblname = lex.eatId();
        lex.eatKeyword("set");
        String fldname = field();
        lex.eatDelim('=');
        Expression newval = expression();
        Predicate pred = new Predicate();
        if (lex.matchKeyword("where")) {
            lex.eatKeyword("where");
            pred = predicate();
        }
        return new ModifyData(tblname, fldname, newval, pred);
    }

    public CreateTableData createTable() {
        lex.eatKeyword("table");
        String tblname = lex.eatId();
        lex.eatDelim('(');
        Schema sch = fieldDefs();
        lex.eatDelim(')');
        return new CreateTableData(tblname, sch);
    }

    private Schema fieldDefs() {
        Schema schema = fieldDef();
        if (lex.matchDelim(',')) {
            lex.eatDelim(',');
            Schema schema2 = fieldDefs();
            schema.addAll(schema2);
        }
        return schema;
    }

    private Schema fieldDef() {
        String fldname = field();
        return fieldType(fldname);
    }

    private Schema fieldType(String fldname) {
        Schema schema = new Schema();
        if (lex.matchKeyword("int")) {
            lex.eatKeyword("int");
            schema.addIntField(fldname);
        }
        else {
            lex.eatKeyword("varchar");
            lex.eatDelim('(');
            int strLen = lex.eatIntConstant();
            lex.eatDelim(')');
            schema.addStringField(fldname, strLen);
        }
        return schema;
    }

    public CreateViewData createView() {
        lex.eatKeyword("view");
        String viewname = lex.eatId();
        lex.eatKeyword("as");
        QueryData qd = query();
        return new CreateViewData(viewname, qd);
    }

    public CreateIndexData createIndex() {
        lex.eatKeyword("index");
        String idxname = lex.eatId();
        lex.eatKeyword("on");
        String tblname = lex.eatId();
        lex.eatDelim('(');
        String fldname = field();
        lex.eatDelim(')');
        return new CreateIndexData(idxname, tblname, fldname);
    }
}
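The list-parsing methods above (selectList, tableList, fieldList and constList) all follow the same recursive-descent pattern: parse one element and, if a comma follows, eat it and recurse. The following self-contained sketch isolates that pattern over a pre-split token array. The class ListParserSketch and its methods are hypothetical illustrations, not SimpleDB code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListParserSketch {
    private final List<String> toks;
    private int pos = 0;

    ListParserSketch(String... toks) {
        this.toks = Arrays.asList(toks);
    }

    /** True if the current token equals t (the analogue of matchDelim). */
    private boolean match(String t) {
        return pos < toks.size() && toks.get(pos).equals(t);
    }

    /** Returns the current token and advances (the analogue of eatId). */
    private String eat() {
        if (pos >= toks.size())
            throw new IllegalStateException("unexpected end of input");
        return toks.get(pos++);
    }

    /** Advances over t or fails (the analogue of eatDelim). */
    private void eat(String t) {
        if (!match(t))
            throw new IllegalStateException("expected " + t);
        pos++;
    }

    /** Parses: IdList := Id [ , IdList ] */
    List<String> idList() {
        List<String> l = new ArrayList<>();
        l.add(eat());
        if (match(",")) {
            eat(",");
            l.addAll(idList());
        }
        return l;
    }
}
```

The recursion stops at the first token that is not a comma, so the caller (here it would be the code that eats the keyword from) finds the input positioned right after the list.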
package simpledb.parse;

/**
 * A runtime exception indicating that the submitted query
 * has incorrect syntax.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class BadSyntaxException extends RuntimeException {
    public BadSyntaxException() {
    }
}
import simpledb.query.*;
import java.util.*;

/**
 * Data for the SQL <i>select</i> statement.
 * @author Edward Sciore
 */
public class QueryData {
    private Collection<String> fields;
    private Collection<String> tables;
    private Predicate pred;

    /**
     * Saves the field and table list and predicate.
     */
    public QueryData(Collection<String> fields, Collection<String> tables, Predicate pred) {
        this.fields = fields;
        this.tables = tables;
        this.pred = pred;
    }

    /**
     * Returns the fields mentioned in the select clause.
     * @return a collection of field names
     */
    public Collection<String> fields() {
        return fields;
    }

    /**
     * Returns the tables mentioned in the from clause.
     * @return a collection of table names
     */
    public Collection<String> tables() {
        return tables;
    }

    /**
     * Returns the predicate that describes which
     * records should be in the output table.
     * @return the query predicate
     */
    public Predicate pred() {
        return pred;
    }

    public String toString() {
        String result = "select ";
        for (String fldname : fields)
            result += fldname + ", ";
        result = result.substring(0, result.length()-2); // remove final comma
        result += " from ";
        for (String tblname : tables)
            result += tblname + ", ";
        result = result.substring(0, result.length()-2); // remove final comma
        String predstring = pred.toString();
        if (!predstring.equals(""))
            result += " where " + predstring;
        return result;
    }
}
import simpledb.query.Constant;
import java.util.*;

/**
 * Data for the SQL <i>insert</i> statement.
 * @author Edward Sciore
 */
public class InsertData {
    private String tblname;
    private List<String> flds;
    private List<Constant> vals;

    /**
     * Saves the table name and the field and value lists.
     */
    public InsertData(String tblname, List<String> flds, List<Constant> vals) {
        this.tblname = tblname;
        this.flds = flds;
        this.vals = vals;
    }

    /**
     * Returns the name of the affected table.
     * @return the name of the affected table
     */
    public String tableName() {
        return tblname;
    }

    /**
     * Returns a list of fields for which
     * values will be specified in the new record.
     * @return a list of field names
     */
    public List<String> fields() {
        return flds;
    }

    /**
     * Returns a list of values for the specified fields.
     * There is a one-one correspondence between this
     * list of values and the list of fields.
     * @return a list of Constant values.
     */
    public List<Constant> vals() {
        return vals;
    }
}
import simpledb.query.*;

/**
 * Data for the SQL <i>delete</i> statement.
 * @author Edward Sciore
 */
public class DeleteData {
    private String tblname;
    private Predicate pred;

    /**
     * Saves the table name and predicate.
     */
    public DeleteData(String tblname, Predicate pred) {
        this.tblname = tblname;
        this.pred = pred;
    }

    /**
     * Returns the name of the affected table.
     * @return the name of the affected table
     */
    public String tableName() {
        return tblname;
    }

    /**
     * Returns the predicate that describes which
     * records should be deleted.
     * @return the deletion predicate
     */
    public Predicate pred() {
        return pred;
    }
}
import simpledb.query.*;

/**
 * Data for the SQL <i>update</i> statement.
 * @author Edward Sciore
 */
public class ModifyData {
    private String tblname;
    private String fldname;
    private Expression newval;
    private Predicate pred;

    /**
     * Saves the table name, the modified field and its new value, and the predicate.
     */
    public ModifyData(String tblname, String fldname, Expression newval, Predicate pred) {
        this.tblname = tblname;
        this.fldname = fldname;
        this.newval = newval;
        this.pred = pred;
    }

    /**
     * Returns the name of the affected table.
     * @return the name of the affected table
     */
    public String tableName() {
        return tblname;
    }

    /**
     * Returns the field whose values will be modified
     * @return the name of the target field
     */
    public String targetField() {
        return fldname;
    }

    /**
     * Returns an expression.
     * Evaluating this expression for a record produces
     * the value that will be stored in the record's target field.
     * @return the target expression
     */
    public Expression newValue() {
        return newval;
    }

    /**
     * Returns the predicate that describes which
     * records should be modified.
     * @return the modification predicate
     */
    public Predicate pred() {
        return pred;
    }
}
import simpledb.record.Schema;

/**
 * Data for the SQL <i>create table</i> statement.
 * @author Edward Sciore
 */
public class CreateTableData {
    private String tblname;
    private Schema sch;

    /**
     * Saves the table name and schema.
     */
    public CreateTableData(String tblname, Schema sch) {
        this.tblname = tblname;
        this.sch = sch;
    }

    /**
     * Returns the name of the new table.
     * @return the name of the new table
     */
    public String tableName() {
        return tblname;
    }

    /**
     * Returns the schema of the new table.
     * @return the schema of the new table
     */
    public Schema newSchema() {
        return sch;
    }
}
/**
 * Data for the SQL <i>create view</i> statement.
 * @author Edward Sciore
 */
public class CreateViewData {
    private String viewname;
    private QueryData qrydata;

    /**
     * Saves the view name and its definition.
     */
    public CreateViewData(String viewname, QueryData qrydata) {
        this.viewname = viewname;
        this.qrydata = qrydata;
    }

    /**
     * Returns the name of the new view.
     * @return the name of the new view
     */
    public String viewName() {
        return viewname;
    }

    /**
     * Returns the definition of the new view.
     * @return the definition of the new view
     */
    public String viewDef() {
        return qrydata.toString();
    }
}
/**
 * Data for the SQL <i>create index</i> statement.
 * @author Edward Sciore
 */
public class CreateIndexData {
    private String idxname, tblname, fldname;

    /**
     * Saves the index name and the table and field names of the specified index.
     */
    public CreateIndexData(String idxname, String tblname, String fldname) {
        this.idxname = idxname;
        this.tblname = tblname;
        this.fldname = fldname;
    }

    /**
     * Returns the name of the index.
     * @return the name of the index
     */
    public String indexName() {
        return idxname;
    }

    /**
     * Returns the name of the indexed table.
     * @return the name of the indexed table
     */
    public String tableName() {
        return tblname;
    }

    /**
     * Returns the name of the indexed field.
     * @return the name of the indexed field
     */
    public String fieldName() {
        return fldname;
    }
}
– The Plans to INSERT, UPDATE or DELETE Records involve these Query
Plans too.
• The SQL standard specifies the following structure for this initial Plan, from the
leaf nodes towards the root in its Relational Algebra expression trees:
① stored Tables
② products and joins
③ outerjoins
④ selections, semijoins and antijoins
⑤ extend operations
⑥ projections
⑦ unions
⑧ sort operation.
Simple Queries
SELECT A1, A2, A3, ..., Ap
FROM T1, T2, T3, ..., Tq
WHERE P1
  AND P2
  AND P3
  ...
  AND Pr
Ai is an attribute name
Tj is a Table name
Pk is a Term.
• Figure 73 shows an example whose WHERE part has not been split into its Terms
yet.
[Figure: Relational Algebra tree of a simple Query, from the root down: answer; project {A1, A2, A3, ..., Ap} (SELECT part, step 6); select P1, select P2, select P3, ..., select Pr (WHERE part, step 4); products combining T1, T2, T3, ..., Tq (FROM part).]
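As a minimal sketch of this translation, the following code assembles the Relational Algebra expression of a simple Query as a string: products of the Tables first, then one selection per Term, and the projection on top. The class SimpleQueryTranslation, its method translate and the textual operator syntax are assumptions made for illustration only.

```java
import java.util.List;

final class SimpleQueryTranslation {
    /**
     * Builds project(select(...select(product(...), Pr)..., P1), {A1, ...}).
     * Assumes that tables is non-empty; terms may be empty.
     */
    static String translate(List<String> attrs, List<String> tables, List<String> terms) {
        // FROM part: a left-deep chain of products over all Tables.
        String expr = tables.get(0);
        for (String t : tables.subList(1, tables.size()))
            expr = "product(" + expr + ", " + t + ")";
        // WHERE part: one selection per Term, with P1 ending up topmost.
        for (int i = terms.size() - 1; i >= 0; i--)
            expr = "select(" + expr + ", " + terms.get(i) + ")";
        // SELECT part: the projection at the root.
        return "project(" + expr + ", {" + String.join(", ", attrs) + "})";
    }
}
```

For attributes A1, A2, Tables T1, T2, T3 and Terms P1, P2 this yields project(select(select(product(product(T1, T2), T3), P2), P1), {A1, A2}).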
Views and Nested Queries in the FROM Part
• Views can be added into this translation:
– Suppose that some Tj is the name of a View (and not a Table).
– The definition of this Tj is another Query Qj .
– This Qj has its own translation Rj into Relational Algebra.
– This Rj can replace its name Tj within the translation of the whole Query.
• Figure 74 shows an example, where
(a) is the whole Query and the definition of a View named EINSTEIN used in it
(b) is the translation for this EINSTEIN
(c) is the translation of the whole Query with (b) as its subexpression.
• The same idea can be used also for queries nested into the FROM part – as if they
were unnamed queries whose definitions are nested within the Query itself.
rename(...(rename(rename(rename(Tj, C1, vj.C1), C2, vj.C2), C3, vj.C3), ...), Cs, vj.Cs)
instead of Tj in ①.
• This rename Relational Algebra operation can be implemented at essentially no cost
by just changing the name of the attribute in the Schema used for Tj .
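A minimal sketch of why this rename is essentially free: only the list of attribute names in the Schema changes, and no record is touched. It assumes that vj is a range variable for Tj and that C1, ..., Cs are the attributes of Tj. The class RenameSketch and the representation of a Schema as a list of names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RenameSketch {
    /** Returns the attribute names with every attribute C renamed to rangeVar.C. */
    static List<String> qualify(List<String> schemaFields, String rangeVar) {
        List<String> renamed = new ArrayList<>();
        for (String c : schemaFields)
            renamed.add(rangeVar + "." + c); // metadata only: no record is read or written
        return renamed;
    }
}
```

The cost depends on the number of attributes s, not on the number of records in Tj.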
Figure 74: View translation example. (Sciore, 2008)
[Figure 76: in step 4, "select x IN φ" above a subtree becomes "semijoin x = y", whose left input is that same subtree and whose right input is the translation of φ.]
• The translation in Figure 76 was straightforward, because it could assume its nested
subQuery φ to be closed:
– In other words, that φ did not mention any attribute or range variable names
defined outside it – in the FROM part of a Query containing φ.
– Hence this φ
gives values of y to the Query containing it, but
takes nothing from it.
– In other words, that all communication between φ and its enclosing Query was
by φ giving its values for y and the enclosing Query comparing them with its
values for x.
• But if the WHERE part of the enclosing Query contains an EXISTS ψ subQuery,
then we can no longer assume its subQuery ψ to be closed, and this makes its
translation somewhat more intricate.
• In our University example, call two SECTIONs of the same COURSE adjacent if
no third SECTION of the same COURSE has been offered between them.
SELECT *
FROM SECTION s1,
     SECTION s2
WHERE s1.CourseID = s2.CourseID
  AND s1.YearOffered < s2.YearOffered
  AND NOT EXISTS (SELECT *
                  FROM SECTION s3
                  WHERE s3.CourseID = s1.CourseID
                    AND s1.YearOffered < s3.YearOffered
                    AND s3.YearOffered < s2.YearOffered)
• It seems difficult to express this SQL Query in a way which would not use
SECTIONs s1 and s2 in the WHERE part of the subQuery for s3.
[Figure 77: "select ψ" becomes "semijoin public".]
SELECT *
FROM ...
WHERE private
  AND public
private part mentions only attributes and range variables defined inside (the FROM
part of) this ψ
public part also mentions attributes and range variables defined outside this ψ in
the enclosing Query, so that ψ would be closed without it.
• This partition of the WHERE part of the subQuery ψ into a private and a public
part permits its translation as in Figure 77.
• The translation in Figure 77 works also for NOT EXISTS. . . with antijoin
instead of semijoin.
• In the “adjacent courses” example, the whole WHERE part of the subQuery is
public.
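To make the two operations concrete, here is a minimal nested-loop sketch of semijoin and antijoin over in-memory lists. It is an illustration under assumed representations (rows as plain Java objects, the public part as a BiPredicate), and the class SemiJoinSketch is hypothetical rather than a SimpleDB Plan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

final class SemiJoinSketch {
    /** Keeps the left rows that have at least one matching right row (EXISTS). */
    static <L, R> List<L> semijoin(List<L> left, List<R> right, BiPredicate<L, R> pred) {
        List<L> out = new ArrayList<>();
        for (L l : left)
            if (right.stream().anyMatch(r -> pred.test(l, r)))
                out.add(l);
        return out;
    }

    /** Keeps the left rows that have no matching right row (NOT EXISTS). */
    static <L, R> List<L> antijoin(List<L> left, List<R> right, BiPredicate<L, R> pred) {
        List<L> out = new ArrayList<>();
        for (L l : left)
            if (right.stream().noneMatch(r -> pred.test(l, r)))
                out.add(l);
        return out;
    }
}
```

In the adjacent SECTIONs example, the left input would hold the (s1, s2) pairs, the right input the s3 rows, and the predicate would be the public part. If one COURSE was offered in 2001, 2003 and 2006, the antijoin drops the pair (2001, 2006) because 2003 lies between them.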
[Figure 78: "semijoin (... AND χ)" above a left and a right subtree becomes "semijoin (...)" above the same left subtree and a "semijoin (public of χ)" added to the right subtree.]
SELECT *
FROM SECTION s3
WHERE TRUE
– In other words, it reduces to just the SECTION Table with range variable s3.
• What if the public part in Figure 77 has the form “. . . AND χ” for another nested
subQuery χ?
• This χ restricts the output from the private part, so its semi- or antijoin must be
added into the right subtree below, as in Figure 78.
SELECT ...
FROM T x
WHERE EXISTS(SELECT *
             FROM α
             WHERE EXISTS(SELECT *
                          FROM β
                          WHERE γ))
where γ mentions x:
– The left subtree of this whole Query has the translation of T.
– The right subtree of this whole Query is the translation of its outer EXISTS. . .
subQuery.
– Its inner EXISTS. . . subQuery mentions x but is within that right subtree
of this whole Query.
SELECT ...
FROM T x
WHERE EXISTS(SELECT *
             FROM α,
                  T y
             WHERE y = x
               AND EXISTS(SELECT *
                          FROM β
                          WHERE γ′))
where γ′
mentions y instead of x
gets this y from the translation of the outer EXISTS... subQuery, which is the
right subtree below in Figure 78.
• In general, we can
call a subQuery almost closed if its public part mentions only those attributes
which are defined in the nearest enclosing FROM part, and
assume that every subQuery is at least almost closed, before we start translating
the whole Query
because a subQuery can always be made almost closed with suitable copying before
translating it – and the RDBMS can perform this copying internally.
• Moreover, such copied definitions are good candidates for materialization, because
they are used in many places of the whole Query.
Disjunctions
• Our translations have assumed only ANDs but not ORs in their WHERE parts.
• ORs can be readily added into the translation of a simple Query shown in Figure 72:
– Because a selection operation permits ANDs and ORs in its predicate,
we could in fact have used just one big selection operation for the whole
WHERE part.
– However, the Query will turn out to be easier to optimize, if we still split its
WHERE part into several selection operations, but now each Pi is an OR
of Terms.
– This splitting is (or should be!) familiar from propositional logic:
It is the Conjunctive Normal Form (CNF) of the logical formula which is the
WHERE part of this simple Query.
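• Recall that a formula in propositional logic is in CNF if it has the form (using our
notation and omitting NOTs):
\[ (T_{1,1} \text{ OR } T_{1,2} \text{ OR } \cdots) \text{ AND } (T_{2,1} \text{ OR } T_{2,2} \text{ OR } \cdots) \text{ AND } \cdots \]
and that ANDs can be lifted above ORs in this way by using the equivalence that
\[ Q \text{ OR } (R \text{ AND } S) \text{ means the same as } (Q \text{ OR } R) \text{ AND } (Q \text{ OR } S) \tag{22} \]
appropriately.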
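Lifting ANDs above ORs amounts to distributing OR over AND: Q OR (R AND S) means the same as (Q OR R) AND (Q OR S). The sketch below keeps a formula directly in CNF as a list of clauses, where each clause is a list of Terms joined by OR, and applies this distribution whenever two CNF formulas are ORed. The class CnfSketch and this list representation are assumptions for illustration, and NOTs are omitted as in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CnfSketch {
    /** A single Term is a CNF formula with one one-Term clause. */
    static List<List<String>> term(String t) {
        List<List<String>> f = new ArrayList<>();
        f.add(Collections.singletonList(t));
        return f;
    }

    /** a AND b: the clauses of both formulas side by side. */
    static List<List<String>> and(List<List<String>> a, List<List<String>> b) {
        List<List<String>> f = new ArrayList<>(a);
        f.addAll(b);
        return f;
    }

    /** a OR b: distribute, so every clause of a is merged with every clause of b. */
    static List<List<String>> or(List<List<String>> a, List<List<String>> b) {
        List<List<String>> f = new ArrayList<>();
        for (List<String> ca : a)
            for (List<String> cb : b) {
                List<String> clause = new ArrayList<>(ca);
                clause.addAll(cb);
                f.add(clause);
            }
        return f;
    }
}
```

Each resulting clause would become one selection operation Pi. Note that or multiplies the clause counts of its arguments, so repeated distribution can make the formula much larger.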
• Our translation for a closed subQuery of the form “x [NOT] IN. . . ” in Figure 76
assumes that it is alone in its selection operation – and hence that ORs have been
eliminated first from the WHERE part containing this subQuery.
– A formula is in Disjunctive Normal Form (DNF) if its ORs are above its ANDs
– the “other way around” than in CNF.
– It too can be reached similarly to Eq. (22):
SELECT α FROM β
WHERE γ
SELECT α FROM β
WHERE γ1
OR γ2
• Repeating this conversion of ORs into UNIONs will eventually lead into a Query
in which the WHERE parts containing this subQuery no longer have ORs.
• However, the whole Query can get much larger, because its FROM β part gets
repeated.
into
(EXISTS (SELECT α FROM β
WHERE γ1 ))
OR
(EXISTS (SELECT α FROM β
WHERE γ2 )).
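by (the logical versions of) de Morgan's laws:
\[ \text{NOT}(Q \text{ OR } R) \text{ means the same as } (\text{NOT } Q) \text{ AND } (\text{NOT } R) \]
\[ \text{NOT}(Q \text{ AND } R) \text{ means the same as } (\text{NOT } Q) \text{ OR } (\text{NOT } R) \tag{24} \]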
x NOT IN (Q1
UNION
Q2 )
(x NOT IN (Q1 ))
AND
(x NOT IN (Q2 ))
– is faster to execute
– permits more further optimizations
• These transformations can get rid of the unwanted ORs before the whole Query is
translated into Relational Algebra.
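The NOT IN case of this transformation can be checked on small sets. The sketch below evaluates both forms, x NOT IN (Q1 UNION Q2) and (x NOT IN (Q1)) AND (x NOT IN (Q2)), with java.util.Set standing in for the results of Q1 and Q2. The class NotInUnionSketch is hypothetical, and NULL values are ignored here.

```java
import java.util.HashSet;
import java.util.Set;

final class NotInUnionSketch {
    /** x NOT IN (Q1 UNION Q2): builds the union first, then tests membership. */
    static boolean notInUnion(int x, Set<Integer> q1, Set<Integer> q2) {
        Set<Integer> union = new HashSet<>(q1);
        union.addAll(q2);
        return !union.contains(x);
    }

    /** (x NOT IN Q1) AND (x NOT IN Q2): two separate tests, no union is built. */
    static boolean notInBoth(int x, Set<Integer> q1, Set<Integer> q2) {
        return !q1.contains(x) && !q2.contains(x);
    }
}
```

Both methods agree for every x by de Morgan's laws, and the second form leaves two independent conditions that can each be translated with its own antijoin.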
Postprocessing
• If an SQL (sub)Query ends with
...
GROUP BY grouping HAVING pred
then its translation is
select(groupby(translation of ..., grouping, computing), pred)
Figure 79: A big SQL query to translate. (Sciore, 2008)
• If it omits its optional HAVING part, then its predicate can be taken to be true,
and its selection omitted.
• Similarly, the whole Query (but not a subQuery) can end with
...
ORDER BY attributes
which adds
sort(translation of ..., attributes)
on top of everything.
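These two postprocessing steps can be sketched as wrappers around the translation of the rest of the Query. The code below builds the expressions as strings. The class PostprocessSketch, its method names and the use of null for a missing HAVING part are assumptions made for illustration.

```java
final class PostprocessSketch {
    /**
     * GROUP BY grouping [HAVING having]: wraps the translation in groupby and,
     * only if a HAVING predicate is present, in a selection on top of it.
     */
    static String groupBy(String translation, String grouping, String computing, String having) {
        String g = "groupby(" + translation + ", " + grouping + ", " + computing + ")";
        return having == null ? g : "select(" + g + ", " + having + ")";
    }

    /** ORDER BY attributes: the sort goes on top of everything else. */
    static String orderBy(String translation, String attributes) {
        return "sort(" + translation + ", " + attributes + ")";
    }
}
```

A whole Query with both parts would be composed as orderBy(groupBy(...), attributes), which matches the rule that the sort operation is the last step.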
Figure 80: Translation of Figure 79. (Sciore, 2008)
SimpleDB source file simpledb/planner/Planner.java
• It consists of 2 subPlanners:
QueryPlanner which translates each SQL Query into a Plan as outlined earlier.
UpdatePlanner which executes the SQL INSERT, UPDATE and DELETE Statements.
It also handles the CREATE (and DROP and ALTER, if SimpleDB supported
them) Statements, because they are similar updates of the catalog metadata.
package simpledb.planner;

import simpledb.tx.Transaction;
import simpledb.parse.*;
import simpledb.query.*;

/**
 * The object that executes SQL statements.
 * @author sciore
 */
public class Planner {
    private QueryPlanner qplanner;
    private UpdatePlanner uplanner;

    /**
     * Creates a plan for an SQL select statement, using the supplied planner.
     * @param qry the SQL query string
     * @param tx the transaction
     * @return the scan corresponding to the query plan
     */
    public Plan createQueryPlan(String qry, Transaction tx) {
        Parser parser = new Parser(qry);
        QueryData data = parser.query();
        return qplanner.createPlan(data, tx);
    }

    /**
     * Executes an SQL insert, delete, modify, or
     * create statement.
     * The method dispatches to the appropriate method of the
     * supplied update planner,
     * depending on what the parser returns.
     * @param cmd the SQL update string
     * @param tx the transaction
     * @return an integer denoting the number of affected records
     */
    public int executeUpdate(String cmd, Transaction tx) {
        Parser parser = new Parser(cmd);
        Object obj = parser.updateCmd();
        if (obj instanceof InsertData)
            return uplanner.executeInsert((InsertData) obj, tx);
        else if (obj instanceof DeleteData)
            return uplanner.executeDelete((DeleteData) obj, tx);
        else if (obj instanceof ModifyData)
            return uplanner.executeModify((ModifyData) obj, tx);
        else if (obj instanceof CreateTableData)
            return uplanner.executeCreateTable((CreateTableData) obj, tx);
        else if (obj instanceof CreateViewData)
            return uplanner.executeCreateView((CreateViewData) obj, tx);
        else if (obj instanceof CreateIndexData)
            return uplanner.executeCreateIndex((CreateIndexData) obj, tx);
        else
            return 0;
    }
}
import simpledb.tx.Transaction;
import simpledb.query.Plan;
import simpledb.parse.QueryData;

/**
 * The interface implemented by planners for
 * the SQL select statement.
 * @author Edward Sciore
 */
public interface QueryPlanner {
    /**
     * Creates a plan for the parsed query.
     * @param data the parsed representation of the query
     * @param tx the calling transaction
     * @return a plan for that query
     */
    public Plan createPlan(QueryData data, Transaction tx);
}
import simpledb.tx.Transaction;
import simpledb.parse.*;

/**
 * The interface implemented by the planners
 * for SQL insert, delete, and modify statements.
 * @author Edward Sciore
 */
public interface UpdatePlanner {
    /**
     * Executes the specified insert statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the insert statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeInsert(InsertData data, Transaction tx);

    /**
     * Executes the specified delete statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the delete statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeDelete(DeleteData data, Transaction tx);

    /**
     * Executes the specified modify statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the modify statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeModify(ModifyData data, Transaction tx);

    /**
     * Executes the specified create table statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the create table statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeCreateTable(CreateTableData data, Transaction tx);

    /**
     * Executes the specified create view statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the create view statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeCreateView(CreateViewData data, Transaction tx);

    /**
     * Executes the specified create index statement, and
     * returns the number of affected records.
     * @param data the parsed representation of the create index statement
     * @param tx the calling transaction
     * @return the number of affected records
     */
    public int executeCreateIndex(CreateIndexData data, Transaction tx);
}
package simpledb.planner;

/**
 * The simplest, most naive query planner possible.
 * @author Edward Sciore
 */
public class BasicQueryPlanner implements QueryPlanner {
    /**
     * Creates a query plan as follows. It first takes
     * the product of all tables and views; it then selects on the predicate;
     * and finally it projects on the field list.
     */
    public Plan createPlan(QueryData data, Transaction tx) {
        // Step 1: Create a plan for each mentioned table or view
        List<Plan> plans = new ArrayList<Plan>();
        for (String tblname : data.tables()) {
            String viewdef = SimpleDB.mdMgr().getViewDef(tblname, tx);
            if (viewdef != null)
                plans.add(SimpleDB.planner().createQueryPlan(viewdef, tx));
            else
                plans.add(new TablePlan(tblname, tx));
        }

        // Step 2: Create the product of all table plans
        Plan p = plans.remove(0);
        for (Plan nextplan : plans)
            p = new ProductPlan(p, nextplan);

        // Step 3: Add a selection plan for the predicate
        p = new SelectPlan(p, data.pred());

        // Step 4: Project on the field names
        p = new ProjectPlan(p, data.fields());
        return p;
    }
}
package simpledb.planner;

/**
 * The basic planner for SQL update statements.
 * @author sciore
 */
public class BasicUpdatePlanner implements UpdatePlanner {
    public int executeDelete(DeleteData data, Transaction tx) {
        Plan p = new TablePlan(data.tableName(), tx);
        p = new SelectPlan(p, data.pred());
        UpdateScan us = (UpdateScan) p.open();
        int count = 0;
        while (us.next()) {
            us.delete();
            count++;
        }
        us.close();
        return count;
    }

    public int executeInsert(InsertData data, Transaction tx) {
        Plan p = new TablePlan(data.tableName(), tx);
        UpdateScan us = (UpdateScan) p.open();
        us.insert();
        Iterator<Constant> iter = data.vals().iterator();
        for (String fldname : data.fields()) {
            Constant val = iter.next();
            us.setVal(fldname, val);
        }
        us.close();
        return 1;
    }

    public int executeCreateTable(CreateTableData data, Transaction tx) {
        SimpleDB.mdMgr().createTable(data.tableName(), data.newSchema(), tx);
        return 0;
    }
– on the server machine side, the initialization of the SimpleDB server process
– communication between client processes and this server process.
Each client Connection runs as its own separate OS thread within this server
process.
– on the client side, a subset of the JDBC standard for this communication.
• Its job is to provide remote Connections.
• The programmer just writes these Implementation classes – Java supplies their
stubs.
package simpledb.remote;
/**
 * The RMI server-side implementation of RemoteDriver.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class RemoteDriverImpl extends UnicastRemoteObject implements RemoteDriver {
   public RemoteDriverImpl() throws RemoteException {
   }
   /**
    * Creates a new RemoteConnectionImpl object and
    * returns it.
    * @see simpledb.remote.RemoteDriver#connect()
    */
   public RemoteConnection connect() throws RemoteException {
      return new RemoteConnectionImpl();
   }
}
import java.rmi.*;
/**
 * The RMI remote interface corresponding to Driver.
 * The method is similar to that of Driver,
 * except that it takes no arguments and
 * throws RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteDriver extends Remote {
   public RemoteConnection connect() throws RemoteException;
}
import java.sql.*;
import java.rmi.*;
import java.util.Properties;
/**
 * The SimpleDB database driver.
 * @author Edward Sciore
 */
public class SimpleDriver extends DriverAdapter {
   /**
    * Connects to the SimpleDB server on the specified host.
    * The method retrieves the RemoteDriver stub from
    * the RMI registry on the specified host.
    * It then calls the connect method on that stub,
    * which in turn creates a new connection and
    * returns the RemoteConnection stub for it.
    * This stub is wrapped in a SimpleConnection object
    * and is returned.
    * <P>
    * The current implementation of this method ignores the
    * properties argument.
    * @see java.sql.Driver#connect(java.lang.String, Properties)
    */
   public Connection connect(String url, Properties prop) throws SQLException {
      try {
         String newurl = url.replace("jdbc:simpledb", "rmi") + "/simpledb";
         RemoteDriver rdvr = (RemoteDriver) Naming.lookup(newurl);
         RemoteConnection rconn = rdvr.connect();
         return new SimpleConnection(rconn);
      }
      catch (Exception e) {
         throw new SQLException(e);
      }
   }
}
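The only string manipulation in SimpleDriver.connect is the rewriting of the JDBC URL into an RMI registry name. The sketch below isolates that step so it can be tried without a running server. The class UrlRewriteSketch, the method toRmiUrl and the sample URL jdbc:simpledb://localhost are assumptions made for illustration.

```java
// Hypothetical illustration of the URL rewriting done in SimpleDriver.connect.
class UrlRewriteSketch {
    // Swaps the JDBC prefix for the RMI scheme and appends the name
    // under which the server binds its RemoteDriver in the registry.
    static String toRmiUrl(String jdbcUrl) {
        return jdbcUrl.replace("jdbc:simpledb", "rmi") + "/simpledb";
    }

    public static void main(String[] args) {
        // Assumed example URL; prints rmi://localhost/simpledb
        System.out.println(toRmiUrl("jdbc:simpledb://localhost"));
    }
}
```

The resulting name is what Naming.lookup receives, so it has to agree with the name that the server passes to Naming.rebind at startup.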
SimpleDB source file simpledb/remote/DriverAdapter.java
package simpledb.remote;
import java.sql.*;
import java.util.*;
/**
 * This class implements all of the methods of the Driver interface,
 * by throwing an exception for each one.
 * Subclasses (such as SimpleDriver) can override those methods that
 * they want to implement.
 * @author Edward Sciore
 */
public abstract class DriverAdapter implements Driver {
   public boolean acceptsURL(String url) throws SQLException {
      throw new SQLException("operation not implemented");
   }
   public int getMajorVersion() {
      return 0;
   }
   public int getMinorVersion() {
      return 0;
   }
   public boolean jdbcCompliant() {
      return false;
   }
}
Driver for RMI so that the clients and the server can establish Connections
between them.
Connection for this client-server communication.
Statement for passing SQL Statements from a client to the server via these Connections.
ResultSet for passing the result rows of an SQL Query from the server back to its
client.
MetaData for passing the metadata for these result rows.
• SimpleDB Remote Connections ensure that each Query gets executed as its own
Transaction.
• In this way, SimpleDB supports only the SQL AUTOCOMMIT mode.
package simpledb.remote;
import simpledb.tx.Transaction;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
/**
 * The RMI server-side implementation of RemoteConnection.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteConnectionImpl extends UnicastRemoteObject implements RemoteConnection {
   private Transaction tx;
   /**
    * Creates a remote connection
    * and begins a new transaction for it.
    * @throws RemoteException
    */
   RemoteConnectionImpl() throws RemoteException {
      tx = new Transaction();
   }
   /**
    * Creates a new RemoteStatement for this connection.
    * @see simpledb.remote.RemoteConnection#createStatement()
    */
   public RemoteStatement createStatement() throws RemoteException {
      return new RemoteStatementImpl(this);
   }
   /**
    * Closes the connection.
    * The current transaction is committed.
    * @see simpledb.remote.RemoteConnection#close()
    */
   public void close() throws RemoteException {
      tx.commit();
   }
   /**
    * Returns the transaction currently associated with
    * this connection.
    * @return the transaction associated with this connection
    */
   Transaction getTransaction() {
      return tx;
   }
   /**
    * Commits the current transaction,
    * and begins a new one.
    */
   void commit() {
      tx.commit();
      tx = new Transaction();
   }
   /**
    * Rolls back the current transaction,
    * and begins a new one.
    */
   void rollback() {
      tx.rollback();
      tx = new Transaction();
   }
}
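The autocommit behaviour follows from the commit and rollback methods above: each of them finishes the current transaction and immediately starts a fresh one. A minimal sketch of that pattern, using hypothetical stand-ins (ToyTx and ToyConnection are not SimpleDB classes, and the update is modelled as a Runnable), could look as follows.

```java
import java.util.*;

// Hypothetical illustration of the autocommit pattern, not SimpleDB code.
class AutocommitSketch {
    // Stand-in for a transaction: it only remembers how it ended.
    static class ToyTx {
        String outcome = "active";
        void commit()   { outcome = "committed"; }
        void rollback() { outcome = "rolled back"; }
    }

    // Stand-in for a connection that always has exactly one active transaction.
    static class ToyConnection {
        final List<ToyTx> history = new ArrayList<>();
        private ToyTx tx = begin();

        private ToyTx begin() {
            ToyTx t = new ToyTx();
            history.add(t);
            return t;
        }

        void commit()   { tx.commit();   tx = begin(); }
        void rollback() { tx.rollback(); tx = begin(); }

        // Each update runs in the current transaction and is committed at once;
        // a failure rolls the transaction back and rethrows the exception.
        void executeUpdate(Runnable work) {
            try {
                work.run();
                commit();
            }
            catch (RuntimeException e) {
                rollback();
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        ToyConnection conn = new ToyConnection();
        conn.executeUpdate(() -> { });
        for (ToyTx t : conn.history)
            System.out.println(t.outcome);
    }
}
```

After one successful update the history holds a committed transaction followed by the new active one, so the client never sees a transaction that spans two statements.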
import java.rmi.*;
/**
 * The RMI remote interface corresponding to Connection.
 * The methods are identical to those of Connection,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteConnection extends Remote {
   public RemoteStatement createStatement() throws RemoteException;
   public void close() throws RemoteException;
}
import java.sql.*;
/**
 * An adapter class that wraps RemoteConnection.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleConnection extends ConnectionAdapter {
   private RemoteConnection rconn;
   public SimpleConnection(RemoteConnection c) {
      rconn = c;
   }
import java.sql.*;
import java.util.*;
/**
 * This class implements all of the methods of the Connection interface,
 * by throwing an exception for each one.
 * Subclasses (such as SimpleConnection) can override those methods that
 * they want to implement.
 * @author Edward Sciore
 */
public abstract class ConnectionAdapter implements Connection {
   public void clearWarnings() throws SQLException {
      throw new SQLException("operation not implemented");
   }
}
      throw new SQLException("operation not implemented");
   }
/**
 * The RMI server-side implementation of RemoteStatement.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteStatementImpl extends UnicastRemoteObject implements RemoteStatement {
   private RemoteConnectionImpl rconn;
   /**
    * Executes the specified SQL query string.
    * The method calls the query planner to create a plan
    * for the query. It then sends the plan to the
    * RemoteResultSetImpl constructor for processing.
    * @see simpledb.remote.RemoteStatement#executeQuery(java.lang.String)
    */
   public RemoteResultSet executeQuery(String qry) throws RemoteException {
      try {
         Transaction tx = rconn.getTransaction();
         Plan pln = SimpleDB.planner().createQueryPlan(qry, tx);
         return new RemoteResultSetImpl(pln, rconn);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
   /**
    * Executes the specified SQL update command.
    * The method sends the command to the update planner,
    * which executes it.
    * @see simpledb.remote.RemoteStatement#executeUpdate(java.lang.String)
    */
   public int executeUpdate(String cmd) throws RemoteException {
      try {
         Transaction tx = rconn.getTransaction();
         int result = SimpleDB.planner().executeUpdate(cmd, tx);
         rconn.commit();
         return result;
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
}
import java.rmi.*;
/**
 * The RMI remote interface corresponding to Statement.
 * The methods are identical to those of Statement,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteStatement extends Remote {
   public RemoteResultSet executeQuery(String qry) throws RemoteException;
   public int executeUpdate(String cmd) throws RemoteException;
}
import java.sql.*;
/**
 * An adapter class that wraps RemoteStatement.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleStatement extends StatementAdapter {
   private RemoteStatement rstmt;
   public SimpleStatement(RemoteStatement s) {
      rstmt = s;
   }
import simpledb.record.Schema;
import simpledb.query.*;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
/**
 * The RMI server-side implementation of RemoteResultSet.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
class RemoteResultSetImpl extends UnicastRemoteObject implements RemoteResultSet {
   private Scan s;
   private Schema sch;
   private RemoteConnectionImpl rconn;
   /**
    * Creates a RemoteResultSet object.
    * The specified plan is opened, and the scan is saved.
    * @param plan the query plan
    * @param rconn the connection whose transaction is committed or
    *              rolled back on behalf of this result set
    * @throws RemoteException
    */
   public RemoteResultSetImpl(Plan plan, RemoteConnectionImpl rconn) throws RemoteException {
      s = plan.open();
      sch = plan.schema();
      this.rconn = rconn;
   }
   /**
    * Moves to the next record in the result set,
    * by moving to the next record in the saved scan.
    * @see simpledb.remote.RemoteResultSet#next()
    */
   public boolean next() throws RemoteException {
      try {
         return s.next();
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
   /**
    * Returns the integer value of the specified field,
    * by returning the corresponding value on the saved scan.
    * @see simpledb.remote.RemoteResultSet#getInt(java.lang.String)
    */
   public int getInt(String fldname) throws RemoteException {
      try {
         fldname = fldname.toLowerCase(); // to ensure case-insensitivity
         return s.getInt(fldname);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
   /**
    * Returns the string value of the specified field,
    * by returning the corresponding value on the saved scan.
    * @see simpledb.remote.RemoteResultSet#getString(java.lang.String)
    */
   public String getString(String fldname) throws RemoteException {
      try {
         fldname = fldname.toLowerCase(); // to ensure case-insensitivity
         return s.getString(fldname);
      }
      catch (RuntimeException e) {
         rconn.rollback();
         throw e;
      }
   }
   /**
    * Returns the result set's metadata,
    * by passing its schema into the RemoteMetaData constructor.
    * @see simpledb.remote.RemoteResultSet#getMetaData()
    */
   public RemoteMetaData getMetaData() throws RemoteException {
      return new RemoteMetaDataImpl(sch);
   }
   /**
    * Closes the result set by closing its scan.
    * @see simpledb.remote.RemoteResultSet#close()
    */
   public void close() throws RemoteException {
      s.close();
      rconn.commit();
   }
}
import java.rmi.*;
/**
 * The RMI remote interface corresponding to ResultSet.
 * The methods are identical to those of ResultSet,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteResultSet extends Remote {
   public boolean next() throws RemoteException;
   public int getInt(String fldname) throws RemoteException;
   public String getString(String fldname) throws RemoteException;
   public RemoteMetaData getMetaData() throws RemoteException;
   public void close() throws RemoteException;
}
SimpleDB source file simpledb/remote/SimpleResultSet.java
package simpledb.remote;
import java.sql.*;
/**
 * An adapter class that wraps RemoteResultSet.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleResultSet extends ResultSetAdapter {
   private RemoteResultSet rrs;
   public SimpleResultSet(RemoteResultSet s) {
      rrs = s;
   }
• This Remote MetaData assigns a column number for each Attribute of the Result
Set.
• This helps when this set is printed out as rows of fixed-width columns, as in the
SQL interpreter example client.
package simpledb.remote;
import simpledb.record.Schema;
import static java.sql.Types.INTEGER;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;
import java.util.*;
/**
 * The RMI server-side implementation of RemoteMetaData.
 * @author Edward Sciore
 */
@SuppressWarnings("serial")
public class RemoteMetaDataImpl extends UnicastRemoteObject implements RemoteMetaData {
   private Schema sch;
   private List<String> fields = new ArrayList<String>();
   /**
    * Creates a metadata object that wraps the specified schema.
    * The method also creates a list to hold the schema's
    * collection of field names,
    * so that the fields can be accessed by position.
    * @param sch the schema
    * @throws RemoteException
    */
   public RemoteMetaDataImpl(Schema sch) throws RemoteException {
      this.sch = sch;
      fields.addAll(sch.fields());
   }
   /**
    * Returns the size of the field list.
    * @see simpledb.remote.RemoteMetaData#getColumnCount()
    */
   public int getColumnCount() throws RemoteException {
      return fields.size();
   }
   /**
    * Returns the field name for the specified column number.
    * In JDBC, column numbers start with 1, so the field
    * is taken from position (column-1) in the list.
    * @see simpledb.remote.RemoteMetaData#getColumnName(int)
    */
   public String getColumnName(int column) throws RemoteException {
      return fields.get(column-1);
   }
   /**
    * Returns the type of the specified column.
    * The method first finds the name of the field in that column,
    * and then looks up its type in the schema.
    * @see simpledb.remote.RemoteMetaData#getColumnType(int)
    */
   public int getColumnType(int column) throws RemoteException {
      String fldname = getColumnName(column);
      return sch.type(fldname);
   }
   /**
    * Returns the number of characters required to display the
    * specified column.
    * For a string-type field, the method simply looks up the
    * field's length in the schema and returns that.
    * For an int-type field, the method needs to decide how
    * large integers can be.
    * Here, the method arbitrarily chooses 6 characters,
    * which means that integers over 999,999 will
    * probably get displayed improperly.
    * @see simpledb.remote.RemoteMetaData#getColumnDisplaySize(int)
    */
   public int getColumnDisplaySize(int column) throws RemoteException {
      String fldname = getColumnName(column);
      int fldtype = sch.type(fldname);
      int fldlength = sch.length(fldname);
      if (fldtype == INTEGER)
         return 6; // accommodate 6-digit integers
      else
         return fldlength;
   }
}
import java.rmi.*;
/**
 * The RMI remote interface corresponding to ResultSetMetaData.
 * The methods are identical to those of ResultSetMetaData,
 * except that they throw RemoteExceptions instead of SQLExceptions.
 * @author Edward Sciore
 */
public interface RemoteMetaData extends Remote {
   public int getColumnCount() throws RemoteException;
   public String getColumnName(int column) throws RemoteException;
   public int getColumnType(int column) throws RemoteException;
   public int getColumnDisplaySize(int column) throws RemoteException;
}
import java.sql.*;
/**
 * An adapter class that wraps RemoteMetaData.
 * Its methods do nothing except transform RemoteExceptions
 * into SQLExceptions.
 * @author Edward Sciore
 */
public class SimpleMetaData extends ResultSetMetaDataAdapter {
   private RemoteMetaData rmd;
   public SimpleMetaData(RemoteMetaData md) {
      rmd = md;
   }
public class SQLInterpreter {
   private static Connection conn = null;
         while (true) {
            // process one line of input
            System.out.print("\nSQL> ");
            String cmd = br.readLine().trim();
            System.out.println();
            if (cmd.startsWith("exit"))
               break;
            else if (cmd.startsWith("select"))
               doQuery(cmd);
            else
               doUpdate(cmd);
         }
      }
      catch (Exception e) {
         e.printStackTrace();
      }
      finally {
         try {
            if (conn != null)
               conn.close();
         }
         catch (Exception e) {
            e.printStackTrace();
         }
      }
   }
         // print header
         for (int i=1; i<=numcols; i++) {
            int width = md.getColumnDisplaySize(i);
            totalwidth += width;
            String fmt = "%" + width + "s";
            System.out.format(fmt, md.getColumnName(i));
         }
         System.out.println();
         for (int i=0; i<totalwidth; i++)
            System.out.print("-");
         System.out.println();
         // print records
         while (rs.next()) {
            for (int i=1; i<=numcols; i++) {
               String fldname = md.getColumnName(i);
               int fldtype = md.getColumnType(i);
               String fmt = "%" + md.getColumnDisplaySize(i);
               if (fldtype == Types.INTEGER)
                  System.out.format(fmt + "d", rs.getInt(fldname));
               else
                  System.out.format(fmt + "s", rs.getString(fldname));
            }
            System.out.println();
         }
         rs.close();
      }
      catch (SQLException e) {
         System.out.println("SQL Exception: " + e.getMessage());
         e.printStackTrace();
      }
   }
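The printing code above builds its format strings at run time from the display size of each column. The sketch below isolates that idea; the class FixedWidthSketch and its methods header, intCell and stringCell are hypothetical names, and the column names and widths are passed in as arrays instead of being read from metadata.

```java
// Hypothetical illustration of fixed-width output built from run-time widths.
class FixedWidthSketch {
    // Right-aligns each column name in its width and underlines the header.
    static String header(String[] names, int[] widths) {
        StringBuilder sb = new StringBuilder();
        int totalwidth = 0;
        for (int i = 0; i < names.length; i++) {
            sb.append(String.format("%" + widths[i] + "s", names[i]));
            totalwidth += widths[i];
        }
        sb.append('\n');
        for (int i = 0; i < totalwidth; i++)
            sb.append('-');
        return sb.toString();
    }

    // Formats an integer cell; the width is a minimum, not a maximum.
    static String intCell(int width, int value) {
        return String.format("%" + width + "d", value);
    }

    // Formats a string cell in the same right-aligned style.
    static String stringCell(int width, String value) {
        return String.format("%" + width + "s", value);
    }

    public static void main(String[] args) {
        System.out.println(header(new String[] {"sid", "sname"}, new int[] {6, 10}));
        System.out.println(intCell(6, 42) + stringCell(10, "amy"));
    }
}
```

Because the width in a format specifier is only a minimum, an integer with more than 6 digits is printed in full and pushes the rest of its row to the right. This is the improper display mentioned in the getColumnDisplaySize comment.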
package simpledb.server;
import simpledb.remote.*;
import java.rmi.*;
public class Startup {
   public static void main(String args[]) throws Exception {
      // configure and initialize the database
      SimpleDB.init(args[0]);
      // post the server entry in the rmi registry
      RemoteDriver d = new RemoteDriverImpl();
      Naming.rebind("simpledb", d);
/**
 * The class that provides system-wide static global values.
 * These values must be initialized by the method
 * {@link #init(String) init} before use.
 * The methods {@link #initFileMgr(String) initFileMgr},
 * {@link #initFileAndLogMgr(String) initFileAndLogMgr},
 * {@link #initFileLogAndBufferMgr(String) initFileLogAndBufferMgr},
 * and {@link #initMetadataMgr(boolean, Transaction) initMetadataMgr}
 * provide limited initialization, and are useful for
 * debugging purposes.
 *
 * @author Edward Sciore
 */
public class SimpleDB {
   public static int BUFFER_SIZE = 8;
   public static String LOG_FILE = "simpledb.log";
   private static FileMgr fm;
   private static BufferMgr bm;
   private static LogMgr logm;
   private static MetadataMgr mdm;
   /**
    * Initializes the system.
    * This method is called during system startup.
    * @param dirname the name of the database directory
    */
   public static void init(String dirname) {
      initFileLogAndBufferMgr(dirname);
      Transaction tx = new Transaction();
      boolean isnew = fm.isNew();
      if (isnew)
         System.out.println("creating new database");
      else {
         System.out.println("recovering existing database");
         tx.recover();
      }
      initMetadataMgr(isnew, tx);
      tx.commit();
   }
   // The following initialization methods are useful for
   // testing the lower-level components of the system
   // without having to initialize everything.
   /**
    * Initializes only the file manager.
    * @param dirname the name of the database directory
    */
   public static void initFileMgr(String dirname) {
      fm = new FileMgr(dirname);
   }
   /**
    * Initializes the file and log managers.
    * @param dirname the name of the database directory
    */
   public static void initFileAndLogMgr(String dirname) {
      initFileMgr(dirname);
      logm = new LogMgr(LOG_FILE);
   }
   /**
    * Initializes the file, log, and buffer managers.
    * @param dirname the name of the database directory
    */
   public static void initFileLogAndBufferMgr(String dirname) {
      initFileAndLogMgr(dirname);
      bm = new BufferMgr(BUFFER_SIZE);
   }
   /**
    * Initializes metadata manager.
    * @param isnew an indication of whether a new
    * database needs to be created.
    * @param tx the transaction performing the initialization
    */
   public static void initMetadataMgr(boolean isnew, Transaction tx) {
      mdm = new MetadataMgr(isnew, tx);
   }
   /**
    * Creates a planner for SQL commands.
    * To change how the planner works, modify this method.
    * @return the system's planner for SQL commands
    */
   public static Planner planner() {
      QueryPlanner qplanner = new BasicQueryPlanner();
      UpdatePlanner uplanner = new BasicUpdatePlanner();
      return new Planner(qplanner, uplanner);
   }
}
5 Indexing
(Sciore, 2008, Chapters 6.3 and 21)
• The basic SimpleDB design described in section 4 did not include indexes although
they are a central part of any realistic RDBMS – without them, processing larger
databases would soon become infeasible.
hashing and/or
search trees.
• Their basic ideas are already familiar from the courses “Data Structures I&II”
(“Tietorakenteet I&II” (TRAI&II) in Finnish).
• However, here we must take into account that the storage medium is not RAM but
a disk – which is much slower and available as big blocks.
• This index is
$\langle k_1, k_2, k_3, \ldots, k_m \rangle \in A_1 \times A_2 \times A_3 \times \cdots \times A_m$.
• An RDBMS builds a unique index for the chosen primary key Attributes of each
NF1 stored Table.
             Hashing                                  Balanced search trees
Applies to   Those index record types whose key ...   All kinds of keys.
...          ... each index records stored on disk.   ... bytes of extra storage – for the
                                                      pointers which hold the tree together.
Table 5: Hashing vs. search trees.
• Table 5 summarizes the differences between typical hashing and search trees.
• If an RDBMS has only one kind of index, then it is usually search-tree-based.
• SimpleDB does the opposite: it provides hash indexes by default, but search trees
must be turned on separately.
• This SimpleDB interface for Indexes follows the same beforeFirst ... next ... access
pattern as its Scans.
• Similarly to selection Scans, here the next method moves to the next index record
with the key value k given to beforeFirst, or returns false if there are no (more)
such index records.
• There is a method for getting the RID of the current index record...
• ...but no method for getting its key, because that would always be k.
package simpledb.index;
import simpledb.record.RID;
import simpledb.query.Constant;
/**
 * This interface contains methods to traverse an index.
 * @author Edward Sciore
 *
 */
public interface Index {
   /**
    * Positions the index before the first record
    * having the specified search key.
    * @param searchkey the search key value.
    */
   public void beforeFirst(Constant searchkey);
   /**
    * Moves the index to the next record having the
    * search key specified in the beforeFirst method.
    * Returns false if there are no more such index records.
    * @return false if no other index records have the search key.
    */
   public boolean next();
   /**
    * Returns the dataRID value stored in the current index record.
    * @return the dataRID stored in the current index record.
    */
   public RID getDataRid();
   /**
    * Inserts an index record having the specified
    * dataval and dataRID values.
    * @param dataval the dataval in the new index record.
    * @param datarid the dataRID in the new index record.
    */
   public void insert(Constant dataval, RID datarid);
   /**
    * Deletes the index record having the specified
    * dataval and dataRID values.
    * @param dataval the dataval of the deleted index record
    * @param datarid the dataRID of the deleted index record
    */
   public void delete(Constant dataval, RID datarid);
   /**
    * Closes the index.
    */
   public void close();
}
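The beforeFirst ... next access pattern of this interface can be tried out with a toy index kept in memory. This is a sketch under stated assumptions only: ToyIndex is a hypothetical cut-down version of Index that uses plain ints for both keys and RIDs, and ListIndex searches an unsorted list, which no real index would do.

```java
import java.util.*;

// Hypothetical illustration of the Index access pattern, not SimpleDB code.
class IndexUsageSketch {
    // Cut-down interface: int keys and int RIDs instead of Constant and RID.
    interface ToyIndex {
        void beforeFirst(int searchkey);
        boolean next();
        int getDataRid();
        void insert(int dataval, int datarid);
    }

    // Toy implementation: an unsorted list of {dataval, datarid} pairs.
    static class ListIndex implements ToyIndex {
        private final List<int[]> entries = new ArrayList<>();
        private int searchkey;
        private int pos = -1;

        public void beforeFirst(int searchkey) {
            this.searchkey = searchkey;
            pos = -1;
        }

        // Advances to the next entry with the search key, if there is one.
        public boolean next() {
            for (pos++; pos < entries.size(); pos++)
                if (entries.get(pos)[0] == searchkey)
                    return true;
            return false;
        }

        public int getDataRid() {
            return entries.get(pos)[1];
        }

        public void insert(int dataval, int datarid) {
            entries.add(new int[] {dataval, datarid});
        }
    }

    // The typical client loop: collect the RIDs of all records with this key.
    static List<Integer> lookup(ToyIndex idx, int key) {
        List<Integer> rids = new ArrayList<>();
        idx.beforeFirst(key);
        while (idx.next())
            rids.add(idx.getDataRid());
        return rids;
    }
}
```

The client loop in lookup never asks the index for a key, only for RIDs, which matches the absence of a key getter in the interface.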
5.1 Extendable Hashing
(Elmasri and Navathe, 2011, Chapter 16.8.3), (Sciore, 2008, Chapters 21.2–21.3)
• Now we study extendable hashing as an index implementation technique.
• We assume that we have a function hash(k) which maps each key value k into a
“small” (32-bit, say) unsigned integer.
(If the keys k are already such integers, then this function is not needed.)
of disk Block pointers into the other file. They point to the actual buckets.
This hashing is extendable, because it can grow as needed.
Bucket file whose bucket b has
– a directory.bucket[b].localDepth ≤ directory.globalDepth.
– an array directory.bucket[b].slot[0 . . .] of index entries. It is long enough
to fill this disk Block, so its length depends on their size.
– a disk Block pointer directory.bucket[b].overflow to its overflow chain.
Each Block in this chain is also in this bucket file, and is otherwise similar
but does not have the localDepth field.
• Note that many directory entries share the same bucket to save disk space:
– Each bucket b stores those index records r whose hash(r.key) has the same
directory.bucket[b].localDepth lowest bits as b.
– For instance, bucket 0 stores those index records whose 2 lowest bits are
. . . 00, bucket 1 those whose lowest bit is . . . 1, and bucket 2 those whose
2 lowest bits are . . . 10.
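The "lowest bits" test used above can be sketched with a bit mask. This is a minimal illustration, assuming 32-bit int hash values; the names DirectorySlotSketch, lowestBits and belongsTo are hypothetical and do not come from the course material.

```java
final class DirectorySlotSketch {
    // Returns the `depth` lowest bits of a hash value (0 <= depth <= 31).
    static int lowestBits(int hash, int depth) {
        return hash & ((1 << depth) - 1);
    }

    // True if a record with this hash belongs to bucket number b,
    // i.e. the hash and b agree on their localDepth lowest bits.
    static boolean belongsTo(int hash, int b, int localDepth) {
        return lowestBits(hash, localDepth) == lowestBits(b, localDepth);
    }

    public static void main(String[] args) {
        int hash = 0b1011_0110;
        System.out.println(lowestBits(hash, 2));   // prints 2, the bits ...10
        System.out.println(lowestBits(hash, 3));   // prints 6, the bits ...110
        System.out.println(belongsTo(hash, 2, 2)); // prints true
        System.out.println(belongsTo(hash, 1, 1)); // prints false
    }
}
```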
Figure 81: An example of an extendable hash table. (Sciore, 2008)
• Finding the first index entry with the given key value k is:
NoBlock is the number of a block which cannot exist – a “NULL pointer” on disk.
NoRID is the RID of a Record which cannot exist – it marks an unused slot.
 1 repeat
 2     b = the directory.globalDepth lowest bits of hash(x.key);
 3     c = bucket file block number directory.bucket[b];
 4     if c or its overflow chain has an unused slot
 5         store x there
 6     elseif c.localDepth < directory.globalDepth and
              it is OK to split this bucket c
 7         d = the c.localDepth lowest bits of b;
 8         c.localDepth = c.localDepth + 1;
 9         c′ = a new bucket with the same localDepth as c;
10         for every other bucket number b′ such that directory.bucket[b′] = c
11             directory.bucket[b′] = c′;
12         rehash all the index records in c
13     elseif c.localDepth = directory.globalDepth and
              it is OK to double the directory
14         directory.globalDepth = directory.globalDepth + 1
15         double the length of the bucket array;
16         fill its new half with a copy of its old half
17     else add a Block into the overflow chain of c;
18         store x
19 until x has been stored.
• Line 10. . .
– considers those directory.buckets which point to this old bucket c
– redirects the 2nd, 4th, 6th,. . . of them into pointing to this new bucket c′
instead
– can be optimized with suitable bit arithmetic:
b′ = d′1d for d′ = 0, 1, 2, . . ., where d′1d means the bits of d′ followed by
a 1 bit followed by the bits of d.
• Line 12 splits the index records r in c so that if hash(r.key) ends in the bit pattern. . .
. . . 1d then r goes into the new bucket (or into its overflow chain, if necessary)
. . . 0d then r stays in c
which may permit shortening the overflow chain of c.
• In theory, it is always OK to split on line 6 and to double on line 13.
• In practice, they can be used to fine-tune this basic algorithm.
• Figure 82 shows how this data structure could have evolved into Figure 81:
① In the beginning (not shown), directory.globalDepth is set so that the array
fills the only block of the directory file as well as possible.
This example does not involve doubling the directory.
② In (a), the only block of the bucket file is now full.
The ‘L’ marks its localDepth, which starts out as 0.
③ In (b), this only bucket 0 has been split into 2 buckets 0 and 1.
Note how every other 0 turns into 1 in the directory.
④ In (c), bucket 0 splits again into 0 and 2.
Note how every other remaining 0 turns into 2 in the directory.
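The insertion pseudocode above can be sketched in RAM as follows. This is a hedged illustration rather than a disk-based implementation: it assumes distinct int keys which serve as their own hash values, it always splits or doubles (so it has no overflow chains), and the class name ExtendableHashSketch, the CAPACITY of 4 and the depth limit of 30 are assumptions made here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ExtendableHashSketch {
    static final int CAPACITY = 4;   // index records per bucket Block (assumed)

    static final class Bucket {
        int localDepth;
        final List<Integer> keys = new ArrayList<>();
        Bucket(int localDepth) { this.localDepth = localDepth; }
    }

    int globalDepth = 0;
    final List<Bucket> directory = new ArrayList<>();

    ExtendableHashSketch() { directory.add(new Bucket(0)); }

    // The globalDepth lowest bits of the key, which is its own hash here.
    private int slotOf(int key) { return key & ((1 << globalDepth) - 1); }

    boolean contains(int key) { return directory.get(slotOf(key)).keys.contains(key); }

    void insert(int key) {
        while (true) {                                    // line 1
            int b = slotOf(key);                          // line 2
            Bucket c = directory.get(b);                  // line 3
            if (c.keys.size() < CAPACITY) {               // lines 4-5
                c.keys.add(key);
                return;
            }
            if (c.localDepth < globalDepth) split(c, b);  // lines 6-12
            else doubleDirectory();                       // lines 13-16
        }
    }

    private void split(Bucket c, int b) {
        int oldDepth = c.localDepth;
        int d = b & ((1 << oldDepth) - 1);                // line 7
        c.localDepth = oldDepth + 1;                      // line 8
        Bucket c2 = new Bucket(c.localDepth);             // line 9
        // Lines 10-11: redirect the slots b' = d'1d, i.e. every other slot of c.
        for (int b2 = d | (1 << oldDepth); b2 < directory.size(); b2 += 1 << (oldDepth + 1))
            directory.set(b2, c2);
        // Line 12: records whose hash ends in ...1d move into the new bucket.
        for (Iterator<Integer> it = c.keys.iterator(); it.hasNext(); ) {
            int k = it.next();
            if (((k >> oldDepth) & 1) == 1) {
                c2.keys.add(k);
                it.remove();
            }
        }
    }

    private void doubleDirectory() {
        if (globalDepth >= 30) throw new IllegalStateException("directory too large");
        globalDepth++;                                    // line 14
        directory.addAll(new ArrayList<>(directory));     // lines 15-16
    }

    public static void main(String[] args) {
        ExtendableHashSketch h = new ExtendableHashSketch();
        for (int k = 0; k < 20; k++) h.insert(k);
        System.out.println(h.contains(13) + " " + h.contains(99)); // prints true false
        System.out.println("globalDepth = " + h.globalDepth);
    }
}
```

Copying the old half of the directory on doubling works because slot i and slot i + (old length) agree on their lowest bits, so both must initially point to the same bucket.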
Figure 82: How Figure 81 could have been built. (Sciore, 2008)
However, this does not try to shorten the overflow chain of c if now possible – but
the insertion algorithm extended it only in rare situations, so this might be enough.
• It is not extendable:
Instead, it allocates a fixed number of buckets as its directory, which it does not
double.
• Hence its performance does not scale well when the number of index records to store
grows.
• Index operations take place within a transaction tx so that they can be ABORTed
or recovered if needed.
• Indexes also have a function searchCost which estimates the number of disk Blocks
read when this index is used for looking up the RID corresponding to a given key.
• This function is used by the blocksAccessed function of the Index Metadata Manager
to calculate the I/O costs of using this index.
package simpledb.index.hash;

/**
 * A static hash implementation of the Index interface.
 * A fixed number of buckets is allocated (currently, 100),
 * and each bucket is implemented as a file of index records.
 * @author Edward Sciore
 */
public class HashIndex implements Index {
   public static int NUM_BUCKETS = 100;
   private String idxname;
   private Schema sch;
   private Transaction tx;
   private Constant searchkey = null;
   private TableScan ts = null;

   /**
    * Opens a hash index for the specified index.
    * @param idxname the name of the index
    * @param sch the schema of the index records
    * @param tx the calling transaction
    */
   public HashIndex(String idxname, Schema sch, Transaction tx) {
      this.idxname = idxname;
      this.sch = sch;
      this.tx = tx;
   }

   /**
    * Positions the index before the first index record
    * having the specified search key.
    * The method hashes the search key to determine the bucket,
    * and then opens a table scan on the file
    * corresponding to the bucket.
    * The table scan for the previous bucket (if any) is closed.
    * @see simpledb.index.Index#beforeFirst(simpledb.query.Constant)
    */
   public void beforeFirst(Constant searchkey) {
      close();
      this.searchkey = searchkey;
      int bucket = searchkey.hashCode() % NUM_BUCKETS;
      String tblname = idxname + bucket;
      TableInfo ti = new TableInfo(tblname, sch);
      ts = new TableScan(ti, tx);
   }

   /**
    * Moves to the next record having the search key.
    * The method loops through the table scan for the bucket,
    * looking for a matching record, and returning false
    * if there are no more such records.
    * @see simpledb.index.Index#next()
    */
   public boolean next() {
      while (ts.next())
         if (ts.getVal("dataval").equals(searchkey))
            return true;
      return false;
   }

   /**
    * Retrieves the dataRID from the current record
    * in the table scan for the bucket.
    * @see simpledb.index.Index#getDataRid()
    */
   public RID getDataRid() {
      int blknum = ts.getInt("block");
      int id = ts.getInt("id");
      return new RID(blknum, id);
   }

   /**
    * Inserts a new record into the table scan for the bucket.
    * @see simpledb.index.Index#insert(simpledb.query.Constant, simpledb.record.RID)
    */
   public void insert(Constant val, RID rid) {
      beforeFirst(val);
      ts.insert();
      ts.setInt("block", rid.blockNumber());
      ts.setInt("id", rid.id());
      ts.setVal("dataval", val);
   }

   /**
    * Deletes the specified record from the table scan for
    * the bucket. The method starts at the beginning of the
    * scan, and loops through the records until the
    * specified record is found.
    * @see simpledb.index.Index#delete(simpledb.query.Constant, simpledb.record.RID)
    */
   public void delete(Constant val, RID rid) {
      beforeFirst(val);
      while (next())
         if (getDataRid().equals(rid)) {
            ts.delete();
            return;
         }
   }

   /**
    * Closes the index by closing the current table scan.
    * @see simpledb.index.Index#close()
    */
   public void close() {
      if (ts != null)
         ts.close();
   }

   /**
    * Returns the cost of searching an index file having the
    * specified number of blocks.
    * The method assumes that all buckets are about the
    * same size, and so the cost is simply the size of
    * the bucket.
    * @param numblocks the number of blocks of index records
    * @param rpb the number of records per block (not used here)
    * @return the cost of traversing the index
    */
   public static int searchCost(int numblocks, int rpb) {
      return numblocks / HashIndex.NUM_BUCKETS;
   }
}
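One detail of the bucket computation in beforeFirst is worth a small experiment. In Java the % operator takes the sign of its left operand, so hashCode() % NUM_BUCKETS is negative whenever hashCode() is negative. Whether the hash codes of SimpleDB's Constant classes can be negative is not shown here, so the following is only a sketch of the general Java behaviour; BucketNumberSketch and bucketOf are hypothetical names.

```java
final class BucketNumberSketch {
    static final int NUM_BUCKETS = 100;

    // Always returns a bucket number in the range 0 .. NUM_BUCKETS - 1.
    static int bucketOf(Object key) {
        return Math.floorMod(key.hashCode(), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        System.out.println(-7 % NUM_BUCKETS);               // prints -7
        System.out.println(Math.floorMod(-7, NUM_BUCKETS)); // prints 93
        System.out.println(bucketOf(Integer.valueOf(-7)));  // prints 93
    }
}
```

With the plain % operator a negative hash code would simply name a bucket outside the range 0 to NUM_BUCKETS - 1, so the number of bucket files could exceed NUM_BUCKETS.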
5.2 B+ -trees
For B-trees, see Cormen et al. (2009, Chapter 18) or Elmasri and Navathe (2011,
Chapter 17.3.1). For B+-trees, see Elmasri and Navathe (2011, Chapter 17.3.2) or Sciore (2008,
Chapter 21.4).
• Table 6 summarizes the differences between RAM-based and disk-based search trees.
• RAM-based width balanced trees have also been developed, such as 2-3 and 2-3-4
trees, where these numbers tell how many subtrees they permit.
However, height-balanced trees are preferred to them in RAM.
• B+-trees are the most popular tree-based index implementation data structure in
RDBMSs.
• They are often called just B-trees for simplicity, but this is slightly inaccurate:
          in RAM                                    on disk
Node      Contains 1 key value and 2 pointers to    Contains many more than 1 key and
          (possibly empty) subtrees. Designed to    1 more subtree pointer than keys.
          be small to save RAM.                     Designed to fill a Block to save I/O.
Keys      The node.key redirects each operation     Similarly, the node.keys redirect each
          to the node.left or node.right subtree,   operation to the appropriate subtree.
          as appropriate, based on the input key.
Balance   By subtree height: For instance, AVL      By node width: Each subtree has
          trees require that the heights of the     exactly the same height. This is
          2 subtrees of a node differ from each     achieved by allowing different nodes to
          other by at most 1.                       have very different numbers of actual
                                                    subtrees.
Speed     Balance ensures that operations take      Same logarithmic time, but here the
          logarithmic time wrt. the number of       branching factor of the tree > 2, and so
          index records stored in the tree.         is the base of the logarithm.
Table 6: RAM vs. disk search trees.
Leaf:
– An array node.slot[1. . . ] of index records.
∗ Its length is chosen so that the disk Block is as fully used as possible
– to maximize I/O utilization.
∗ This length depends on how much space must be reserved for each
key in this Block.
– The counter node.last indicating that only the prefix node.slot[1. . . node.last]
is currently used, while the suffix node.slot[node.last+1. . . ] is still unused.
– A disk Block pointer node.next to the next leaf (if any). In. . .
theory these are not needed
practice they are extremely useful in many situations, and are therefore
included.
Internal:
– An array node.key[1. . . ] of keys.
Figure 83: An example B+ -tree. (Sciore, 2008)
④ Move directly to the correct record in the STUDENT Table using this
s.RID found in (a).
(c) shows the abstraction of this directory Block (b) into a B+ -tree internal node
on top of the leaf nodes from (a):
– The RIDs of (b) are now shown as arrows/pointers.
– The 1st key (here ’Amy’) can be omitted, because we know that if n <
the 2nd key (here ’Bob’) then n must be in the 1st subtree.
– This leads to the idea that there is 1 more subtree pointer than keys.
– In addition its leaf nodes from (a) would be linked together into an ordered
chain of next pointers (not shown).
• Eq. (25) takes the following form in the internal nodes of B+ -trees
• By Eq. (26)
node.key[1 . . . node.last]    (27)
is in strictly ascending order.
Balance Condition and Insertion
• Recall that the height of
– All the subtrees of an internal node have exactly the same height.
– Every non-root node is at least half full:
node.last ≥ (how many keys would fit into this node) / 2.    (29)
– If the root is internal, then
root.last ≥ 1.
Hence a B+-tree is balanced by keeping (the disk Blocks storing) its non-root nodes
between half and totally full.
• In particular, the initially empty B+ -tree is just a leaf node as its root, and it has
root .last = 0
and
• This balance condition is maintained by the algorithm which inserts a given index
record into a given B+ -tree.
• Describing this algorithm is simpler if we assume that each node can in fact become
overfull while it is in RAM:
– Then it has 1 more key and RID/subtree than would fit into its disk Block.
– When this node gets written back into its disk Block, it will no longer be
overfull – because the algorithm will have rebalanced the B+-tree first.
– We leave the details to the programmer. . .
either OK if it could insert this new index record r into the B+-(sub)tree T without
its height growing – that is, it will modify its parameter T
(which some consider to be a bad programming habit, but here we are trying
to save disk Blocks)
or a pair ⟨m, U⟩ if this could not be done.
– Instead, consider a new tree V whose root had just 1 key m with the
modified T as its left and this new U as its right subtree.
– This new V would be a correct B+-tree for r and the index records originally
in T.
– However, the height of this new V would also be 1 + the height of the
original T . . .
This fairly elaborate explanation of its return value is needed for arguing that this
recursive algorithm is correct.
Figure 84: A B+ -tree with height 2. (Sciore, 2008)
insert(r, node):
 1 if this node is internal
 2     determine (by binary search) the only
       node.subtree[child] where r.key could be;
 3     if insert(r, child) returned ⟨m, U⟩
 4         splice m between node.key[1 . . . child] and node.key[child + 1 . . .];
 5         splice U into the corresponding position within node.subtree;
 6         if this splicing overflowed this node
 7             U′ = a new initially empty internal node;
 8             move the top half of the node.subtree array
               and the node.keys between them into U′;
 9             m′ = detach the last node.key which was not moved
               (and which can no longer stay in node)
10             return ⟨m′, U′⟩
11         else return OK
12     else return OK
13 else determine (by binary search) whether r.key appears in some node.slot;
14     if it does
15         change the RID of that slot into r.RID
16         return OK
17     else splice r into its correct place within node.slot;
18         if this splicing overflowed this node
19             U′ = a new initially empty leaf node;
20             U′.next = node.next;
21             node.next = U′;
22             move the top half of the node.slot array into U′;
23             m′ = U′.slot[1].key;
24             return ⟨m′, U′⟩
25         else return OK.
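The leaf case of this algorithm (lines 13–25) can be sketched in RAM as follows. This is only an illustration under simplifying assumptions: the keys are ints, RIDs are omitted, a leaf is assumed to hold at most 4 keys, null stands for OK, and the names LeafSplitSketch, Leaf and Split are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class LeafSplitSketch {
    static final int CAPACITY = 4;   // keys that fit into one disk Block (assumed)

    static final class Leaf {
        final List<Integer> keys = new ArrayList<>(); // kept in ascending order
        Leaf next;                                    // pointer to the next leaf
    }

    // The pair <m', U'> returned when the leaf had to split.
    static final class Split {
        final int m;
        final Leaf u;
        Split(int m, Leaf u) { this.m = m; this.u = u; }
    }

    // Returns null for OK, or a Split if the leaf overflowed.
    static Split insert(int key, Leaf node) {
        int pos = Collections.binarySearch(node.keys, key);     // line 13
        if (pos >= 0) return null;                // lines 14-16 (RID update omitted)
        node.keys.add(-pos - 1, key);             // line 17: splice into sorted place
        if (node.keys.size() <= CAPACITY) return null;          // line 25
        Leaf u = new Leaf();                      // line 19
        u.next = node.next;                       // line 20
        node.next = u;                            // line 21
        int half = node.keys.size() / 2;
        List<Integer> top = node.keys.subList(half, node.keys.size());
        u.keys.addAll(top);                       // line 22: move the top half
        top.clear();                              // removes it from the old leaf
        return new Split(u.keys.get(0), u);       // lines 23-24
    }

    public static void main(String[] args) {
        Leaf leaf = new Leaf();
        for (int k : new int[] {10, 20, 30, 40}) insert(k, leaf);
        Split s = insert(25, leaf);               // the 5th key overflows the leaf
        System.out.println(leaf.keys + " " + s.m + " " + s.u.keys);
        // prints [10, 20] 25 [25, 30, 40]
    }
}
```

Note that the separator key m′ is copied from the first slot of the new leaf and stays there too, unlike in the internal case where the key is detached.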
• Figure 84 shows an example B+ -tree. Assume that all its leaf nodes are already full.
• Figure 85 then shows what happens when a new key “hal” is inserted:
① It goes into the leaf starting with “eli”. . .
② This leaf splits into 2 leaf nodes. The new leaf node starts with “jim”. . .
③ This starting key gets copied into its parent.
④ This parent still has space for it, but becomes full too.
Figure 85: Splitting a leaf node. (Sciore, 2008)
• Figure 86 shows what happens when another new key “zoe” is inserted:
• These B+ -trees scale very well when the amount of data grows:
index records.
Figure 87: Splitting the root node. (Sciore, 2008)
– On today’s disks with their big Blocks, these internal nodes can have over 100
subtrees.
– Then even the largest indexes have height of only about 6.
– For instance the lookup and insertion algorithms access only height + 1 disk
Blocks.
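The height claim above can be checked with a little arithmetic: with branching factor 100, six levels already reach 100^6 = 10^12 index records. The following sketch computes the smallest number of levels whose reach covers a given number of records. It is a rough estimate under the assumption that every node is completely full; HeightEstimateSketch and levelsNeeded are hypothetical names.

```java
final class HeightEstimateSketch {
    // Smallest number of levels h such that fanout^h >= records.
    static int levelsNeeded(long records, int fanout) {
        if (fanout < 2) throw new IllegalArgumentException("fanout must be at least 2");
        int levels = 0;
        long reach = 1;   // records reachable through `levels` levels
        while (reach < records) {
            levels++;
            if (reach > Long.MAX_VALUE / fanout) break; // next product would overflow
            reach *= fanout;
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(levelsNeeded(1_000_000L, 100));         // prints 3
        System.out.println(levelsNeeded(1_000_000_000_000L, 100)); // prints 6
    }
}
```

Half-full nodes would lower the effective branching factor towards 50, which adds only about one level for the same number of records.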
On Deletion
• This restricts Eq. (29) only to non-root internal nodes – leaf nodes are now permitted
to be even less than half full.
• This is because. . .
– It is possible to develop a deletion algorithm which merges less than half full
nodes together, but its details turn out to be intricate.
– Moreover, this deletion algorithm would not perform very well when Transactions
are executing it concurrently with other operations.
– The disk space saving would be unlikely to be worth these troubles, because
databases which have already grown big rarely get much smaller in the future
either. . .
Figure 88: Splitting in a nonunique B+ -tree index. (Sciore, 2008)
• When we extend our B+ -trees to nonunique indexes, where the same key can appear
in many index records, Eq. (26) forces us to keep all the index records with the same
key in the same node .subtree too.
• Hence we must keep all of them in the same leaf node too.
• Figure 88 shows what this means when splitting nodes:
We may have to split them unevenly.
• Because there can be more index records with the same key than fit into one leaf
node, we must also allow a chain of overflow Blocks in our leaf nodes, as in Figure 89.
• Hence a nonunique B+ -tree has 2 kinds of leaf nodes:
One-Block leaf nodes. They are as before, except that. . .
– the same key is allowed to repeat
– they can be less than half full.
Many-Block leaf nodes with an overflow chain, where
– all index records must share the same key.
Figure 89: Overflow chain in a nonunique B+ -tree index leaf. (Sciore, 2008)
Range Queries
• A range query asks for all the records where some Attribute falls into a specified
interval with a lower and upper limit.
• For instance, we can ask for all the STUDENTs of our university example whose
names begin with ’b’:
SELECT *
FROM STUDENT
WHERE 'b' <= SName
AND SName < 'c'
(Full SQL would offer a special “SName LIKE ’b%’” operation for such queries.)
Lower limit is that the student’s name must be alphabetically at least ’b’.
Upper limit is that it must be less than ’c’.
• Because a search tree like B+ -tree retains the order of its keys, a search tree based
index on this Attribute can answer such queries efficiently:
1 c = the first index record which satisfies the lower limit test;
2 while this current index record c exists
        and its c.key satisfies the upper limit test
3     fetch the corresponding data record via c.RID;
4     report it as the next row of the query result;
5     move c to the next index record.
• If we have built an extra (nonunique) B+-tree index on the (non-key) SName
Attribute of the STUDENT Table, then the RDBMS can. . .
① first find quickly the alphabetically first student whose name is ’b’ or greater
using our extra index, as line 1
② then move directly to the next index record on each line 5
③ finally stop this while loop when the name of this next student becomes ’c’ or
greater.
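The range query loop can be tried out in RAM with an ordered map standing in for the B+-tree index. This is a sketch under stated assumptions: java.util.TreeMap (a red-black tree, not a B+-tree) plays the role of the nonunique index, Integer values play the role of RIDs, and RangeQuerySketch and range are hypothetical names. The comments refer to the lines of the pseudocode above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class RangeQuerySketch {
    // Returns the RIDs of all index records with lower <= key < upper.
    static List<Integer> range(NavigableMap<String, List<Integer>> index,
                               String lower, String upper) {
        List<Integer> rids = new ArrayList<>();
        Map.Entry<String, List<Integer>> c = index.ceilingEntry(lower); // line 1
        while (c != null && c.getKey().compareTo(upper) < 0) {          // line 2
            rids.addAll(c.getValue());                                  // lines 3-4
            c = index.higherEntry(c.getKey());                          // line 5
        }
        return rids;
    }

    public static void main(String[] args) {
        NavigableMap<String, List<Integer>> index = new TreeMap<>();
        index.put("amy", List.of(1));
        index.put("bob", List.of(2));
        index.put("ben", List.of(3));
        index.put("carl", List.of(4));
        System.out.println(range(index, "b", "c")); // prints [3, 2]
    }
}
```

In a B+-tree, line 5 would follow the node.next pointers between the leaves instead of searching the tree again, which is one reason why those pointers are so useful in practice.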
• One correct solution (taken by SimpleDB) would be to use the same 2PL developed
for Table file Blocks also for these index file Blocks:
– Every tree operation starts at its root, so every Transaction needs the corre-
sponding lock(root).
– Then for instance a Transaction t which modifies an index takes xlock(root)
– which forces all later Transactions to wait until t ends before they can use
this index.
• This is why B+ -trees use their own specific non-2PL locking mechanisms, which
allow multiple Transactions to access different parts of the same tree at the same
time.
– Because this operation does not modify the tree, t needs only slocks on the
nodes of the tree.
– Transaction t needs an slock(node) where it currently is – so that no other
Transaction u can modify this node.
– To guarantee this, Transaction t must also take another slock(child ) of this
current node before t can descend into it on line 4.
– However, t can release(node) as soon as t has descended to its child .
This increases concurrency for the later Transactions u who want to use the
same tree.
– When this u walks the same path as t down the tree, u cannot go faster than t
because of these locks.
Hence this lookup operation by t will happen before the tree operation by u,
and these 2 operations remain serializable.
• This locking mechanism is called lock coupling because the locks of the node and
its child are considered together.
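Lock coupling for a lookup can be sketched as follows. The sketch makes several assumptions which are not part of the course material: it uses a plain binary search tree in RAM instead of a B+-tree on disk, java.util.concurrent read locks stand in for slocks, and the names LockCouplingSketch, Node and contains are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockCouplingSketch {
    static final class Node {
        final int key;
        Node left, right;
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        Node(int key) { this.key = key; }
    }

    // Lookup which holds at most 2 read locks at a time:
    // slock(child) is taken before release(node).
    static boolean contains(Node root, int key) {
        if (root == null) return false;
        Node node = root;
        node.lock.readLock().lock();              // slock(root)
        while (true) {
            if (key == node.key) {
                node.lock.readLock().unlock();
                return true;
            }
            Node child = key < node.key ? node.left : node.right;
            if (child == null) {
                node.lock.readLock().unlock();
                return false;
            }
            child.lock.readLock().lock();         // slock(child) while holding slock(node)
            node.lock.readLock().unlock();        // release(node) after descending
            node = child;
        }
    }

    public static void main(String[] args) {
        Node root = new Node(50);
        root.left = new Node(30);
        root.right = new Node(70);
        System.out.println(contains(root, 30));   // prints true
        System.out.println(contains(root, 60));   // prints false
    }
}
```

Because the lock on the child is acquired before the lock on the parent is released, a later operation walking the same path cannot overtake this lookup.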
• This lock coupling becomes slightly more complicated for a single insertion opera-
tion.
– Now t needs xlocks on the nodes of the path it takes, because it will modify
some of those nodes.
– Moreover, since these modifications happen when the recursion returns back
along this path, it seems that Transaction t must hold these xlocks during the
whole operation. . .
– However, closer reading of the recursive insertion subroutine reveals that the
operation is done as soon as the first OK is returned:
Its caller will just return OK to its own caller, and so on.
– Hence when the insertion subroutine is descending from its current node into
its child, it can use the following mechanism:
xlock(child);
if child.last < how many keys fit into an internal node
    release all the other xlocks taken during this operation
    (including xlock(node) in particular);
This is because its child will always return OK, since it is not yet completely
full.
– In this way Transaction t releases its xlock on a node n already while it is
descending into the children of n, as soon as it is certain that it will not modify n.
– Again, this forces every later Transaction u which reaches the part of
the tree that t might modify to execute only after t.
• However, lock coupling is not yet enough, because it considers only a single opera-
tion, but Transactions consist of many.
– Let our B+ -tree index contain 2 index records x and y located on the same
disk Block b.
– Consider the following 2 concurrent Transactions:
Transaction t: Transaction u:
1 look up x; 1 look up y;
2 modify y based on x. 2 modify x based on y.
• The solution is to add another level of locking on top of the Page-level locks we
have had so far.
– first takes a high-level lock on the key range associated with this operation.
∗ Intuitively, t takes a lock for a range of index records in the leaf nodes of
the tree.
∗ These high-level locks do obey 2PL, so t holds them until it ends.
– then performs the operation, using lock coupling on the low-level Page locks
for the disk file Blocks where the affected index records are stored.
Then another Transaction u can perform another concurrent operation, if its key
range does not overlap the key range(s) locked by t – that is, if u operates on
different index records than t.
1 step 1 of t takes
a high-level slock(x) which it keeps, and
a low-level slock(b) which it releases soon;
2 step 1 of u takes
a high-level slock(y) which it keeps, and
a low-level slock(b) which it releases soon;
3 step 2 of t tries to take a high-level xlock(y),
but must wait for u;
4 step 2 of u tries to take a high-level xlock(x),
but must wait for t;
5 the scheduler detects this deadlock
begins at k itself
extends until the next larger key l than k in the index, but does not include this l.
– Suppose that k is the key for the current row of a range query result set.
– Then this slock(k) range extends (almost) until the next l in this result set.
• Such a high-level lock(k) tells the other concurrently running Transactions that
“I have noticed that the part of this B+ -tree corresponding to this range
has no other keys than k, so if you are going to change that (by taking an
xlock within this range) then you must wait until I have done everything
that I am going to do first.”
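Which keys fall into the range of such a high-level lock(k) can be illustrated with an ordered set of the keys currently in the index. The sketch only shows the half-open range from k up to (but not including) the next larger key l. It does not show how a lock manager compares locks, and it assumes that the range is unbounded above when the index has no larger key than k. KeyRangeSketch and inLockedRange are hypothetical names.

```java
import java.util.List;
import java.util.TreeSet;

final class KeyRangeSketch {
    // True if key x falls into the range [k, l) of lock(k), where l is
    // the next larger key than k in the index (unbounded if there is none).
    static boolean inLockedRange(TreeSet<Integer> indexKeys, int k, int x) {
        Integer l = indexKeys.higher(k);
        return x >= k && (l == null || x < l);
    }

    public static void main(String[] args) {
        TreeSet<Integer> keys = new TreeSet<>(List.of(10, 20, 30));
        System.out.println(inLockedRange(keys, 10, 15)); // prints true
        System.out.println(inLockedRange(keys, 10, 20)); // prints false
    }
}
```

For instance, inserting a new key 15 would change the part of the tree covered by lock(10), whereas modifying the record with key 20 would not.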
• But we do not know this l when we are taking this high-level lock(k) – so how can we
compare it against the other high-level locks already taken?
• Now we can state the high-level key range locking rules, which ensure serializability
for Transactions performing many B+ -tree operations:
• RDBMSs often use this kind of 2-level locking not only for their B+-tree indexes
but also for their stored Tables.
– There the items with 2PL high-level locks are (not keys k but) the RIDs of
their stored Records.
– Early-release low-level locks in turn synchronize access to the Pages where
these Records are stored.
– This allows a Transaction to lock only some of the Records in a Page, and the
other concurrently running Transactions can still access its other Records.
– This increased concurrency is not possible if we only have the Page locks (like
SimpleDB does).
– It splits a node already when it becomes full – it does not wait for it to become
overfull.
+ This avoids having to program the support of the overflowing part of a node
in RAM, but. . .
− it also means that the nodes on disk can never fill a Block completely, because
they always have at least one unused node.slot.
• It keeps the internal and leaf nodes of the same B+ -tree in 2 separate files.
• This allows it to treat each file as consisting of just one kind of Page.
package simpledb.index.btree;

/**
 * A B-tree implementation of the Index interface.
 * @author Edward Sciore
 */
public class BTreeIndex implements Index {
   private Transaction tx;
   private TableInfo dirTi, leafTi;
   private BTreeLeaf leaf = null;
   private Block rootblk;

   /**
    * Opens a B-tree index for the specified index.
    * The method determines the appropriate files
    * for the leaf and directory records,
    * creating them if they did not exist.
    * @param idxname the name of the index
    * @param leafsch the schema of the leaf index records
    * @param tx the calling transaction
    */
   public BTreeIndex(String idxname, Schema leafsch, Transaction tx) {
      this.tx = tx;
      // deal with the leaves
      String leaftbl = idxname + "leaf";
      leafTi = new TableInfo(leaftbl, leafsch);
      if (tx.size(leafTi.fileName()) == 0)
         tx.append(leafTi.fileName(), new BTPageFormatter(leafTi, -1));
      // deal with the directory
      Schema dirsch = new Schema();
      dirsch.add("block", leafsch);
      dirsch.add("dataval", leafsch);
      String dirtbl = idxname + "dir";
      dirTi = new TableInfo(dirtbl, dirsch);
      rootblk = new Block(dirTi.fileName(), 0);
      if (tx.size(dirTi.fileName()) == 0)
         // create new root block
         tx.append(dirTi.fileName(), new BTPageFormatter(dirTi, 0));
      BTreePage page = new BTreePage(rootblk, dirTi, tx);
      if (page.getNumRecs() == 0) {
         // insert initial directory entry
         int fldtype = dirsch.type("dataval");
         Constant minval = (fldtype == INTEGER) ?
            new IntConstant(Integer.MIN_VALUE) :
            new StringConstant("");
         page.insertDir(0, minval, 0);
      }
      page.close();
   }

   /**
    * Traverses the directory to find the leaf block corresponding
    * to the specified search key.
    * The method then opens a page for that leaf block, and
    * positions the page before the first record (if any)
    * having that search key.
    * The leaf page is kept open, for use by the methods next
    * and getDataRid.
    * @see simpledb.index.Index#beforeFirst(simpledb.query.Constant)
    */
   public void beforeFirst(Constant searchkey) {
      close();
      BTreeDir root = new BTreeDir(rootblk, dirTi, tx);
      int blknum = root.search(searchkey);
      root.close();
      Block leafblk = new Block(leafTi.fileName(), blknum);
      leaf = new BTreeLeaf(leafblk, leafTi, searchkey, tx);
   }

   /**
    * Moves to the next leaf record having the
    * previously-specified search key.
    * Returns false if there are no more such leaf records.
    * @see simpledb.index.Index#next()
    */
   public boolean next() {
      return leaf.next();
   }

   /**
    * Returns the dataRID value from the current leaf record.
    * @see simpledb.index.Index#getDataRid()
    */
   public RID getDataRid() {
      return leaf.getDataRid();
   }

   /**
    * Inserts the specified record into the index.
    * The method first traverses the directory to find
    * the appropriate leaf page; then it inserts
    * the record into the leaf.
    * If the insertion causes the leaf to split, then
    * the method calls insert on the root,
    * passing it the directory entry of the new leaf page.
    * If the root node splits, then makeNewRoot is called.
    * @see simpledb.index.Index#insert(simpledb.query.Constant, simpledb.record.RID)
    */
   public void insert(Constant dataval, RID datarid) {
      beforeFirst(dataval);
      DirEntry e = leaf.insert(datarid);
      leaf.close();
      if (e == null)
         return;
      BTreeDir root = new BTreeDir(rootblk, dirTi, tx);
      DirEntry e2 = root.insert(e);
      if (e2 != null)
         root.makeNewRoot(e2);
      root.close();
   }

   /**
    * Deletes the specified index record.
    * The method first traverses the directory to find
    * the leaf page containing that record; then it
    * deletes the record from the page.
    * @see simpledb.index.Index#delete(simpledb.query.Constant, simpledb.record.RID)
    */
   public void delete(Constant dataval, RID datarid) {
      beforeFirst(dataval);
      leaf.delete(datarid);
      leaf.close();
   }

   /**
    * Closes the index by closing its open leaf page,
    * if necessary.
    * @see simpledb.index.Index#close()
    */
   public void close() {
      if (leaf != null)
         leaf.close();
   }

   /**
    * Estimates the number of block accesses
    * required to find all index records having
    * a particular search key.
    * @param numblocks the number of blocks in the B-tree directory
    * @param rpb the number of index entries per block
    * @return the estimated traversal cost
    */
   public static int searchCost(int numblocks, int rpb) {
      return 1 + (int) (Math.log(numblocks) / Math.log(rpb));
   }
}
import static simpledb.file.Page.*;
import simpledb.file.Block;
import simpledb.record.*;
import simpledb.query.*;
import simpledb.tx.Transaction;

/**
 * B-tree directory and leaf pages have many commonalities:
 * in particular, their records are stored in sorted order,
 * and pages split when full.
 * A BTreePage object contains this common functionality.
 * @author Edward Sciore
 */
public class BTreePage {
   private Block currentblk;
   private TableInfo ti;
   private Transaction tx;
   private int slotsize;

   /**
    * Opens a page for the specified B-tree block.
    * @param currentblk a reference to the B-tree block
    * @param ti the metadata for the particular B-tree file
    * @param tx the calling transaction
    */
   public BTreePage(Block currentblk, TableInfo ti, Transaction tx) {
      this.currentblk = currentblk;
      this.ti = ti;
      this.tx = tx;
      slotsize = ti.recordLength();
      tx.pin(currentblk);
   }

   /**
    * Calculates the position where the first record having
    * the specified search key should be, then returns
    * the position before it.
    * @param searchkey the search key
    * @return the position before where the search key goes
    */
   public int findSlotBefore(Constant searchkey) {
      int slot = 0;
      while (slot < getNumRecs() && getDataVal(slot).compareTo(searchkey) < 0)
         slot++;
      return slot-1;
   }

   /**
    * Closes the page by unpinning its buffer.
    */
   public void close() {
      if (currentblk != null)
         tx.unpin(currentblk);
      currentblk = null;
   }

   /**
    * Returns true if the block is full.
    * @return true if the block is full
    */
   public boolean isFull() {
      return slotpos(getNumRecs()+1) >= BLOCK_SIZE;
   }

   /**
    * Splits the page at the specified position.
    * A new page is created, and the records of the page
    * starting at the split position are transferred to the new page.
    * @param splitpos the split position
    * @param flag the initial value of the flag field
    * @return the reference to the new block
    */
   public Block split(int splitpos, int flag) {
      Block newblk = appendNew(flag);
      BTreePage newpage = new BTreePage(newblk, ti, tx);
      transferRecs(splitpos, newpage);
      newpage.setFlag(flag);
      newpage.close();
      return newblk;
   }

   /**
    * Returns the dataval of the record at the specified slot.
    * @param slot the integer slot of an index record
    * @return the dataval of the record at that slot
    */
   public Constant getDataVal(int slot) {
      return getVal(slot, "dataval");
   }

   /**
    * Returns the value of the page's flag field.
    * @return the value of the page's flag field
    */
   public int getFlag() {
      return tx.getInt(currentblk, 0);
   }

   /**
    * Sets the page's flag field to the specified value.
    * @param val the new value of the page flag
    */
   public void setFlag(int val) {
      tx.setInt(currentblk, 0, val);
   }

   /**
    * Appends a new block to the end of the specified B-tree file,
    * having the specified flag value.
    * @param flag the initial value of the flag
    * @return a reference to the newly-created block
    */
   public Block appendNew(int flag) {
      return tx.append(ti.fileName(), new BTPageFormatter(ti, flag));
   }

   /**
    * Returns the block number stored in the index record
    * at the specified slot.
    * @param slot the slot of an index record
    * @return the block number stored in that record
    */
   public int getChildNum(int slot) {
      return getInt(slot, "block");
   }

   /**
    * Inserts a directory entry at the specified slot.
    * @param slot the slot of an index record
    * @param val the dataval to be stored
    * @param blknum the block number to be stored
    */
   public void insertDir(int slot, Constant val, int blknum) {
      insert(slot);
      setVal(slot, "dataval", val);
      setInt(slot, "block", blknum);
   }

   /**
    * Returns the dataRID value stored in the specified leaf index record.
    * @param slot the slot of the desired index record
    * @return the dataRID value stored at that slot
    */
   public RID getDataRid(int slot) {
      return new RID(getInt(slot, "block"), getInt(slot, "id"));
   }

   /**
    * Inserts a leaf index record at the specified slot.
    * @param slot the slot of the desired index record
    * @param val the new dataval
    * @param rid the new dataRID
    */
   public void insertLeaf(int slot, Constant val, RID rid) {
      insert(slot);
      setVal(slot, "dataval", val);
      setInt(slot, "block", rid.blockNumber());
      setInt(slot, "id", rid.id());
   }

   /**
    * Deletes the index record at the specified slot.
    * @param slot the slot of the deleted index record
    */
   public void delete(int slot) {
      for (int i=slot+1; i<getNumRecs(); i++)
         copyRecord(i, i-1);
      setNumRecs(getNumRecs()-1);
      return;
   }

   /**
    * Returns the number of index records in this page.
    * @return the number of index records in this page
    */
   public int getNumRecs() {
      return tx.getInt(currentblk, INT_SIZE);
   }

   // Private methods

   private void setVal(int slot, String fldname, Constant val) {
      int type = ti.schema().type(fldname);
      if (type == INTEGER)
         setInt(slot, fldname, (Integer)val.asJavaVal());
      else
         setString(slot, fldname, (String)val.asJavaVal());
   }

   private void setNumRecs(int n) {
      tx.setInt(currentblk, INT_SIZE, n);
   }

   private void insert(int slot) {
      for (int i=getNumRecs(); i>slot; i--)
         copyRecord(i-1, i);
      setNumRecs(getNumRecs()+1);
   }

   private void copyRecord(int from, int to) {
      Schema sch = ti.schema();
      for (String fldname : sch.fields())
         setVal(to, fldname, getVal(from, fldname));
   }

   private void transferRecs(int slot, BTreePage dest) {
      int destslot = 0;
      while (slot < getNumRecs()) {
         dest.insert(destslot);
         Schema sch = ti.schema();
         for (String fldname : sch.fields())
            dest.setVal(destslot, fldname, getVal(slot, fldname));
         delete(slot);
         destslot++;
      }
   }
}
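To see what findSlotBefore returns, the following minimal sketch replays its loop over a plain sorted int array. The array, the class name SlotSearchDemo and the int keys are assumptions made for illustration; the real method reads Constant values from the slots of a pinned page.

```java
class SlotSearchDemo {
    // Linear scan over sorted keys: stop at the first key >= searchkey
    // and return the slot just before it (-1 if that is slot 0).
    public static int findSlotBefore(int[] keys, int searchkey) {
        int slot = 0;
        while (slot < keys.length && keys[slot] < searchkey)
            slot++;
        return slot - 1;
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 20, 30};
        System.out.println(findSlotBefore(keys, 20)); // prints 0
        System.out.println(findSlotBefore(keys, 5));  // prints -1
        System.out.println(findSlotBefore(keys, 25)); // prints 2
        System.out.println(findSlotBefore(keys, 40)); // prints 3
    }
}
```

Returning the slot before the first match is what lets a caller advance with next() and land on the first record with the search key, or insert a new record at the following slot.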
import static simpledb.file.Page.*;
import static java.sql.Types.INTEGER;
import simpledb.file.Page;
import simpledb.buffer.PageFormatter;
import simpledb.record.TableInfo;

/**
 * An object that can format a page to look like an
 * empty B-tree block.
 * @author Edward Sciore
 */
public class BTPageFormatter implements PageFormatter {
   private TableInfo ti;
   private int flag;

   /**
    * Creates a formatter for a new page of the
    * specified B-tree index.
    * @param ti the index's metadata
    * @param flag the page's initial flag value
    */
   public BTPageFormatter(TableInfo ti, int flag) {
      this.ti = ti;
      this.flag = flag;
   }

   /**
    * Formats the page by initializing as many index-record slots
    * as possible to have default values.
    * Each integer field is given a value of 0, and
    * each string field is given a value of "".
    * The location that indicates the number of records
    * in the page is also set to 0.
    * @see simpledb.buffer.PageFormatter#format(simpledb.file.Page)
    */
   public void format(Page page) {
      page.setInt(0, flag);
      page.setInt(INT_SIZE, 0);  // #records = 0
      int recsize = ti.recordLength();
      for (int pos=2*INT_SIZE; pos+recsize<=BLOCK_SIZE; pos += recsize)
         makeDefaultRecord(page, pos);
   }

   private void makeDefaultRecord(Page page, int pos) {
      for (String fldname : ti.schema().fields()) {
         int offset = ti.offset(fldname);
         if (ti.schema().type(fldname) == INTEGER)
            page.setInt(pos + offset, 0);
         else
            page.setString(pos + offset, "");
      }
   }
}

/**
 * An object that holds the contents of a B-tree leaf block.
 * @author Edward Sciore
 */
public class BTreeLeaf {
   private TableInfo ti;
   private Transaction tx;
   private Constant searchkey;
   private BTreePage contents;
   private int currentslot;

   /**
    * Opens a page to hold the specified leaf block.
    * The page is positioned immediately before the first record
    * having the specified search key (if any).
    * @param blk a reference to the disk block
    * @param ti the metadata of the B-tree leaf file
    * @param searchkey the search key value
    * @param tx the calling transaction
    */
   public BTreeLeaf(Block blk, TableInfo ti, Constant searchkey, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      this.searchkey = searchkey;
      contents = new BTreePage(blk, ti, tx);
      currentslot = contents.findSlotBefore(searchkey);
   }

   /**
    * Closes the leaf page.
    */
   public void close() {
      contents.close();
   }

   /**
    * Moves to the next leaf record having the
    * previously-specified search key.
    * Returns false if there are no more such records.
    * @return false if there are no more leaf records for the search key
    */
   public boolean next() {
      currentslot++;
      if (currentslot >= contents.getNumRecs())
         return tryOverflow();
      else if (contents.getDataVal(currentslot).equals(searchkey))
         return true;
      else
         return tryOverflow();
   }

   /**
    * Returns the dataRID value of the current leaf record.
    * @return the dataRID of the current record
    */
   public RID getDataRid() {
      return contents.getDataRid(currentslot);
   }

   /**
    * Deletes the leaf record having the specified dataRID.
    * @param datarid the dataRID whose record is to be deleted
    */
   public void delete(RID datarid) {
      while (next())
         if (getDataRid().equals(datarid)) {
            contents.delete(currentslot);
            return;
         }
   }

   /**
    * Inserts a new leaf record having the specified dataRID
    * and the previously-specified search key.
    * If the record does not fit in the page, then
    * the page splits and the method returns the
    * directory entry for the new page;
    * otherwise, the method returns null.
    * If all of the records in the page have the same dataval,
    * then the block does not split; instead, all but one of the
    * records are placed into an overflow block.
    * @param datarid the dataRID value of the new record
    * @return the directory entry of the newly-split page, if one exists.
    */
   public DirEntry insert(RID datarid) {
      currentslot++;
      contents.insertLeaf(currentslot, searchkey, datarid);
      if (!contents.isFull())
         return null;
      // else page is full, so split it
      Constant firstkey = contents.getDataVal(0);
      Constant lastkey = contents.getDataVal(contents.getNumRecs()-1);
      if (lastkey.equals(firstkey)) {
         // create an overflow block to hold all but the first record
         Block newblk = contents.split(1, contents.getFlag());
         contents.setFlag(newblk.number());
         return null;
      }
      else {
         int splitpos = contents.getNumRecs() / 2;
         Constant splitkey = contents.getDataVal(splitpos);
         if (splitkey.equals(firstkey)) {
            // move right, looking for the next key
            while (contents.getDataVal(splitpos).equals(splitkey))
               splitpos++;
            splitkey = contents.getDataVal(splitpos);
         }
         else {
            // move left, looking for first entry having that key
            while (contents.getDataVal(splitpos-1).equals(splitkey))
               splitpos--;
         }
         Block newblk = contents.split(splitpos, -1);
         return new DirEntry(splitkey, newblk.number());
      }
   }

   private boolean tryOverflow() {
      Constant firstkey = contents.getDataVal(0);
      int flag = contents.getFlag();
      if (!searchkey.equals(firstkey) || flag < 0)
         return false;
      contents.close();
      Block nextblk = new Block(ti.fileName(), flag);
      contents = new BTreePage(nextblk, ti, tx);
      currentslot = 0;
      return true;
   }
}
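The choice of split position in BTreeLeaf.insert keeps all records with the same dataval in one block. The following minimal sketch replays that choice on a sorted int array standing in for the datavals of a full page. The class name SplitPosDemo, the method chooseSplitPos and the int keys are assumptions made for illustration, not part of SimpleDB.

```java
class SplitPosDemo {
    // Returns the slot at which a full, sorted leaf page would be split.
    // If every key is equal, slot 1 is returned: all but the first
    // record go to an overflow block instead of a sibling page.
    public static int chooseSplitPos(int[] keys) {
        int first = keys[0];
        int last = keys[keys.length - 1];
        if (first == last)
            return 1;
        int splitpos = keys.length / 2;
        int splitkey = keys[splitpos];
        if (splitkey == first) {
            // the middle key equals the first key: move right to the next key
            while (keys[splitpos] == splitkey)
                splitpos++;
        } else {
            // move left to the first entry having the middle key
            while (keys[splitpos - 1] == splitkey)
                splitpos--;
        }
        return splitpos;
    }

    public static void main(String[] args) {
        System.out.println(chooseSplitPos(new int[] {1, 2, 3, 4}));       // prints 2
        System.out.println(chooseSplitPos(new int[] {1, 1, 1, 2}));       // prints 3
        System.out.println(chooseSplitPos(new int[] {1, 2, 2, 2, 3, 3})); // prints 1
        System.out.println(chooseSplitPos(new int[] {5, 5, 5, 5}));       // prints 1
    }
}
```

In each non-overflow case the key at the returned slot differs from the key just before it, so no run of equal datavals is divided between the two pages.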

/**
 * A B-tree directory block.
 * @author Edward Sciore
 */
public class BTreeDir {
   private TableInfo ti;
   private Transaction tx;
   private String filename;
   private BTreePage contents;

   /**
    * Creates an object to hold the contents of the specified
    * B-tree block.
    * @param blk a reference to the specified B-tree block
    * @param ti the metadata of the B-tree directory file
    * @param tx the calling transaction
    */
   BTreeDir(Block blk, TableInfo ti, Transaction tx) {
      this.ti = ti;
      this.tx = tx;
      filename = blk.fileName();
      contents = new BTreePage(blk, ti, tx);
   }

   /**
    * Closes the directory page.
    */
   public void close() {
      contents.close();
   }

   /**
    * Returns the block number of the B-tree leaf block
    * that contains the specified search key.
    * @param searchkey the search key value
    * @return the block number of the leaf block containing that search key
    */
   public int search(Constant searchkey) {
      Block childblk = findChildBlock(searchkey);
      while (contents.getFlag() > 0) {
         contents.close();
         contents = new BTreePage(childblk, ti, tx);
         childblk = findChildBlock(searchkey);
      }
      return childblk.number();
   }

   /**
    * Creates a new root block for the B-tree.
    * The new root will have two children:
    * the old root, and the specified block.
    * Since the root must always be in block 0 of the file,
    * the contents of the old root will get transferred to a new block.
    * @param e the directory entry to be added as a child of the new root
    */
   public void makeNewRoot(DirEntry e) {
      Constant firstval = contents.getDataVal(0);
      int level = contents.getFlag();
      Block newblk = contents.split(0, level); // i.e., transfer all the records
      DirEntry oldroot = new DirEntry(firstval, newblk.number());
      insertEntry(oldroot);
      insertEntry(e);
      contents.setFlag(level+1);
   }

   /**
    * Inserts a new directory entry into the B-tree block.
    * If the block is at level 0, then the entry is inserted there.
    * Otherwise, the entry is inserted into the appropriate
    * child node, and the return value is examined.
    * A non-null return value indicates that the child node
    * split, and so the returned entry is inserted into
    * this block.
    * If this block splits, then the method similarly returns
    * the entry information of the new block to its caller;
    * otherwise, the method returns null.
    * @param e the directory entry to be inserted
    * @return the directory entry of the newly-split block, if one exists; otherwise, null
    */
   public DirEntry insert(DirEntry e) {
      if (contents.getFlag() == 0)
         return insertEntry(e);
      Block childblk = findChildBlock(e.dataVal());
      BTreeDir child = new BTreeDir(childblk, ti, tx);
      DirEntry myentry = child.insert(e);
      child.close();
      return (myentry != null) ? insertEntry(myentry) : null;
   }
import simpledb.query.Constant;

/**
 * A directory entry has two components: the number of the child block,
 * and the dataval of the first record in that block.
 * @author Edward Sciore
 */
public class DirEntry {
   private Constant dataval;
   private int blocknum;

   /**
    * Creates a new entry for the specified dataval and block number.
    * @param dataval the dataval
    * @param blocknum the block number
    */
   public DirEntry(Constant dataval, int blocknum) {
      this.dataval = dataval;
      this.blocknum = blocknum;
   }

   /**
    * Returns the dataval component of the entry
    * @return the dataval component of the entry
    */
   public Constant dataVal() {
      return dataval;
   }

   /**
    * Returns the block number component of the entry
    * @return the block number component of the entry
    */
   public int blockNumber() {
      return blocknum;
   }
}
T is a stored Table
A is an Attribute of T such that there is an index on T.A
c is a constant.
• It is called an indexselect.
package simpledb.index.query;

   /**
    * Creates a new indexselect node in the query tree
    * for the specified index and selection constant.
    * @param p the input table
    * @param ii information about the index
    * @param val the selection constant
    * @param tx the calling transaction
    */
   public IndexSelectPlan(Plan p, IndexInfo ii, Constant val, Transaction tx) {
      this.p = p;
      this.ii = ii;
      this.val = val;
   }

   /**
    * Creates a new indexselect scan for this query
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      // throws an exception if p is not a tableplan.
      TableScan ts = (TableScan) p.open();
      Index idx = ii.open();
      return new IndexSelectScan(idx, val, ts);
   }

   /**
    * Estimates the number of block accesses to compute the
    * index selection, which is the same as the
    * index traversal cost plus the number of matching data records.
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return ii.blocksAccessed() + recordsOutput();
   }

   /**
    * Estimates the number of output records in the index selection,
    * which is the same as the number of search key values
    * for the index.
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return ii.recordsOutput();
   }

   /**
    * Returns the distinct values as defined by the index.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      return ii.distinctValues(fldname);
   }

   /**
    * Returns the schema of the data table.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return p.schema();
   }
}
import simpledb.record.RID;
import simpledb.query.*;
import simpledb.index.Index;

/**
 * The scan class corresponding to the select relational
 * algebra operator.
 * @author Edward Sciore
 */
public class IndexSelectScan implements Scan {
   private Index idx;
   private Constant val;
   private TableScan ts;

   /**
    * Creates an index select scan for the specified
    * index and selection constant.
    * @param idx the index
    * @param val the selection constant
    */
   public IndexSelectScan(Index idx, Constant val, TableScan ts) {
      this.idx = idx;
      this.val = val;
      this.ts = ts;
      beforeFirst();
   }

   /**
    * Positions the scan before the first record,
    * which in this case means positioning the index
    * before the first instance of the selection constant.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      idx.beforeFirst(val);
   }

   /**
    * Moves to the next record, which in this case means
    * moving the index to the next record satisfying the
    * selection constant, and returning false if there are
    * no more such index records.
    * If there is a next record, the method moves the
    * tablescan to the corresponding data record.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      boolean ok = idx.next();
      if (ok) {
         RID rid = idx.getDataRid();
         ts.moveToRid(rid);
      }
      return ok;
   }

   /**
    * Closes the scan by closing the index and the tablescan.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      idx.close();
      ts.close();
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      return ts.getVal(fldname);
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      return ts.getInt(fldname);
   }

   /**
    * Returns the value of the field of the current data record.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      return ts.getString(fldname);
   }

   /**
    * Returns whether the data record has the specified field.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return ts.hasField(fldname);
   }
}
Figure 90: An example of a join using an index. (Sciore, 2008)
SimpleDB source file simpledb/index/query/IndexJoinPlan.java
package simpledb.index.query;

   /**
    * Implements the join operator,
    * using the specified LHS and RHS plans.
    * @param p1 the left-hand plan
    * @param p2 the right-hand plan
    * @param ii information about the right-hand index
    * @param joinfield the left-hand field used for joining
    * @param tx the calling transaction
    */
   public IndexJoinPlan(Plan p1, Plan p2, IndexInfo ii, String joinfield, Transaction tx) {
      this.p1 = p1;
      this.p2 = p2;
      this.ii = ii;
      this.joinfield = joinfield;
      sch.addAll(p1.schema());
      sch.addAll(p2.schema());
   }

   /**
    * Opens an indexjoin scan for this query
    * @see simpledb.query.Plan#open()
    */
   public Scan open() {
      Scan s = p1.open();
      // throws an exception if p2 is not a tableplan
      TableScan ts = (TableScan) p2.open();
      Index idx = ii.open();
      return new IndexJoinScan(s, idx, joinfield, ts);
   }

   /**
    * Estimates the number of block accesses to compute the join.
    * The formula is:
    * <pre> B(indexjoin(p1,p2,idx)) = B(p1) + R(p1)*B(idx)
    *       + R(indexjoin(p1,p2,idx)) </pre>
    * @see simpledb.query.Plan#blocksAccessed()
    */
   public int blocksAccessed() {
      return p1.blocksAccessed()
         + (p1.recordsOutput() * ii.blocksAccessed())
         + recordsOutput();
   }

   /**
    * Estimates the number of output records in the join.
    * The formula is:
    * <pre> R(indexjoin(p1,p2,idx)) = R(p1)*R(idx) </pre>
    * @see simpledb.query.Plan#recordsOutput()
    */
   public int recordsOutput() {
      return p1.recordsOutput() * ii.recordsOutput();
   }

   /**
    * Estimates the number of distinct values for the
    * specified field.
    * @see simpledb.query.Plan#distinctValues(java.lang.String)
    */
   public int distinctValues(String fldname) {
      if (p1.schema().hasField(fldname))
         return p1.distinctValues(fldname);
      else
         return p2.distinctValues(fldname);
   }

   /**
    * Returns the schema of the index join.
    * @see simpledb.query.Plan#schema()
    */
   public Schema schema() {
      return sch;
   }
}
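The two cost formulas of IndexJoinPlan can be checked with a small calculation. The following minimal sketch evaluates them on plain integers. The class name IndexJoinCostDemo, the parameter names and the sample statistics are assumptions made for illustration; in SimpleDB the inputs come from the plan p1 and the IndexInfo object ii.

```java
class IndexJoinCostDemo {
    // R(indexjoin(p1,p2,idx)) = R(p1) * R(idx)
    public static int recordsOutput(int rP1, int rIdx) {
        return rP1 * rIdx;
    }

    // B(indexjoin(p1,p2,idx)) = B(p1) + R(p1)*B(idx) + R(indexjoin(p1,p2,idx))
    public static int blocksAccessed(int bP1, int rP1, int bIdx, int rIdx) {
        return bP1 + rP1 * bIdx + recordsOutput(rP1, rIdx);
    }

    public static void main(String[] args) {
        // Assumed statistics: the LHS plan reads 100 blocks and outputs
        // 1000 records; one index search costs 3 block accesses and
        // matches 2 index records on average.
        System.out.println(recordsOutput(1000, 2));          // prints 2000
        System.out.println(blocksAccessed(100, 1000, 3, 2)); // prints 5100
    }
}
```

With these numbers the index searches (3000 accesses) and the data record fetches (2000 accesses) dominate the cost of reading the left-hand input (100 accesses).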
• Here is the SimpleDB implementation of a pipelined indexjoin Scan.
package simpledb.index.query;

import simpledb.query.*;
import simpledb.index.Index;

/**
 * The scan class corresponding to the indexjoin relational
 * algebra operator.
 * The code is very similar to that of ProductScan,
 * which makes sense because an index join is essentially
 * the product of each LHS record with the matching RHS index records.
 * @author Edward Sciore
 */
public class IndexJoinScan implements Scan {
   private Scan s;
   private TableScan ts;  // the data table
   private Index idx;
   private String joinfield;

   /**
    * Creates an index join scan for the specified LHS scan and
    * RHS index.
    * @param s the LHS scan
    * @param idx the RHS index
    * @param joinfield the LHS field used for joining
    */
   public IndexJoinScan(Scan s, Index idx, String joinfield, TableScan ts) {
      this.s = s;
      this.idx = idx;
      this.joinfield = joinfield;
      this.ts = ts;
      beforeFirst();
   }

   /**
    * Positions the scan before the first record.
    * That is, the LHS scan will be positioned at its
    * first record, and the index will be positioned
    * before the first record for the join value.
    * @see simpledb.query.Scan#beforeFirst()
    */
   public void beforeFirst() {
      s.beforeFirst();
      s.next();
      resetIndex();
   }

   /**
    * Moves the scan to the next record.
    * The method moves to the next index record, if possible.
    * Otherwise, it moves to the next LHS record and the
    * first index record.
    * If there are no more LHS records, the method returns false.
    * @see simpledb.query.Scan#next()
    */
   public boolean next() {
      while (true) {
         if (idx.next()) {
            ts.moveToRid(idx.getDataRid());
            return true;
         }
         if (!s.next())
            return false;
         resetIndex();
      }
   }

   /**
    * Closes the scan by closing its LHS scan, its RHS index
    * and the data table scan.
    * @see simpledb.query.Scan#close()
    */
   public void close() {
      s.close();
      idx.close();
      ts.close();
   }

   /**
    * Returns the Constant value of the specified field.
    * @see simpledb.query.Scan#getVal(java.lang.String)
    */
   public Constant getVal(String fldname) {
      if (ts.hasField(fldname))
         return ts.getVal(fldname);
      else
         return s.getVal(fldname);
   }

   /**
    * Returns the integer value of the specified field.
    * @see simpledb.query.Scan#getInt(java.lang.String)
    */
   public int getInt(String fldname) {
      if (ts.hasField(fldname))
         return ts.getInt(fldname);
      else
         return s.getInt(fldname);
   }

   /**
    * Returns the string value of the specified field.
    * @see simpledb.query.Scan#getString(java.lang.String)
    */
   public String getString(String fldname) {
      if (ts.hasField(fldname))
         return ts.getString(fldname);
      else
         return s.getString(fldname);
   }

   /**
    * Returns true if the field is in the schema.
    * @see simpledb.query.Scan#hasField(java.lang.String)
    */
   public boolean hasField(String fldname) {
      return ts.hasField(fldname) || s.hasField(fldname);
   }

   private void resetIndex() {
      Constant searchkey = s.getVal(joinfield);
      idx.beforeFirst(searchkey);
   }
}
5.4 Updating Indexed Data
(Sciore, 2008, Chapter 21.6)
• The Planner Component of the RDBMS must also be aware of the existing indexes.
• In particular, when the contents of a stored Table T are updated, it must also
change the indexes defined on T to reflect the update.
import java.util.Iterator;
import java.util.Map;

/**
 * A modification of the basic update planner.
 * It dispatches each update statement to the corresponding
 * index planner.
 * @author Edward Sciore
 */
public class IndexUpdatePlanner implements UpdatePlanner {
      // first, insert the record
      UpdateScan s = (UpdateScan) p.open();
      s.insert();
      RID rid = s.getRid();
      // then modify each field, inserting an index record if appropriate
      Map<String,IndexInfo> indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);
      Iterator<Constant> valIter = data.vals().iterator();
      for (String fldname : data.fields()) {
         Constant val = valIter.next();
         s.setVal(fldname, val);
         IndexInfo ii = indexes.get(fldname);
         if (ii != null) {
            Index idx = ii.open();
            idx.insert(val, rid);
            idx.close();
         }
      }
      s.close();
      return 1;
   }

   public int executeDelete(DeleteData data, Transaction tx) {
      String tblname = data.tableName();
      Plan p = new TablePlan(tblname, tx);
      p = new SelectPlan(p, data.pred());
      Map<String,IndexInfo> indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);
      return count;
   }
         // then update the appropriate index, if it exists
         if (idx != null) {
            RID rid = s.getRid();
            idx.delete(oldval, rid);
            idx.insert(newval, rid);
         }
         count++;
      }
      if (idx != null) idx.close();
      s.close();
      return count;
   }

   public int executeCreateTable(CreateTableData data, Transaction tx) {
      SimpleDB.mdMgr().createTable(data.tableName(), data.newSchema(), tx);
      return 0;
   }

   public int executeCreateIndex(CreateIndexData data, Transaction tx) {
      SimpleDB.mdMgr().createIndex(data.indexName(), data.tableName(), data.fieldName(), tx);
      return 0;
   }
}
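The fragment above that calls idx.delete(oldval, rid) followed by idx.insert(newval, rid) can be tried out in isolation. The following is only a minimal in-memory sketch and not SimpleDB code: the class ToyIndexedTable, its map-based index, and the use of list positions as record identifiers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy single-column table with a value -> row-ids index (hypothetical, not SimpleDB). */
final class ToyIndexedTable {
    private final List<String> rows = new ArrayList<>();            // row id = list position
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Inserts a row first and then its index record; returns the new row id. */
    int insert(String val) {
        rows.add(val);
        int rid = rows.size() - 1;
        index.computeIfAbsent(val, k -> new TreeSet<>()).add(rid);
        return rid;
    }

    /** Changes every row holding oldVal to newVal, keeping the index in step. */
    int modify(String oldVal, String newVal) {
        int count = 0;
        for (int rid = 0; rid < rows.size(); rid++) {
            if (!rows.get(rid).equals(oldVal)) continue;
            rows.set(rid, newVal);
            // same two steps as in the listing: delete the old entry, insert the new one
            Set<Integer> old = index.get(oldVal);
            old.remove(rid);
            if (old.isEmpty()) index.remove(oldVal);
            index.computeIfAbsent(newVal, k -> new TreeSet<>()).add(rid);
            count++;
        }
        return count;
    }

    /** Looks up row ids through the index only, never by scanning the rows. */
    Set<Integer> lookup(String val) {
        return index.getOrDefault(val, new TreeSet<>());
    }
}
```

If the delete step were skipped, lookup(oldVal) would keep returning rows that no longer hold that value, which is exactly the inconsistency the planner must prevent.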
6 Query Optimization
(Sciore, 2008, Chapter 24)
① Start with the initial translation of the SQL query Q into a Relational Algebra
expression E.
– The purpose of E is to express in Relational Algebra what this query Q
means in SQL – intuitively, they both yield the same answer.
– Then we call them equivalent and denote this symbolically as E ≡ Q.
– However, this E would be much too slow to execute.
Therefore the Optimizer Component of the RDBMS first constructs another
Relational Algebra expression F which is much faster to execute than E but
still F ≡ E.
• More precisely:
Queries F ≡ E if
the output of F ≡ the output of E
whenever they are executed on the same database contents.
• Each part of optimization is designed to preserve this ‘≡’ – so that its final result F
preserves the meaning of the original query Q.
• Figure 91 shows why query optimization is crucial (and not just nice):
Figure 91: Why optimize? (Sciore, 2008)
1 P = the initial translation E of Q;
2 repeat
3    ρ = choose a rule which applies to P;
4    P = apply ρ to P
5 until no ρ applies to P;
6 F = P.
– This heuristic approach can compare different rules ρ′, ρ″, ρ‴, ... in its
step 3, maybe by using cost information.
– However, it does not remember different Plans P′, P″, P‴, ... – it just
improves the one current P.
– Hence it can end up with a bad final F by making
early choices which looked good then, but
later turn out to have been bad, because they force it to make much worse
choices to complete F.
– Here, splitting the optimization stage lets each phase ❶ or ❷ have its own
rules to consider.
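The six-step loop of the heuristic approach can be sketched generically as follows. This is an illustration under stated assumptions, not the handout's or SimpleDB's code: the class GreedyRewriter, the representation of a rule as a function returning an empty Optional when it does not apply, and the toy integer "plans" in main are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

final class GreedyRewriter {
    /** Repeatedly applies the applicable rule whose result is cheapest, until no rule applies. */
    static <P> P optimize(P initial,
                          List<Function<P, Optional<P>>> rules,
                          ToDoubleFunction<P> cost) {
        P current = initial;                                  // step 1
        while (true) {                                        // steps 2-5
            P best = null;
            for (Function<P, Optional<P>> rule : rules) {
                Optional<P> candidate = rule.apply(current);  // empty = rule does not apply
                if (candidate.isPresent()
                        && (best == null
                            || cost.applyAsDouble(candidate.get()) < cost.applyAsDouble(best))) {
                    best = candidate.get();                   // step 3: compare rules by cost
                }
            }
            if (best == null) return current;                 // step 6: F = P
            current = best;                                   // only the one current P is kept
        }
    }

    public static void main(String[] args) {
        // Toy rules on integers standing in for plans; the cost of a "plan" is its value.
        Function<Integer, Optional<Integer>> halve =
            n -> n % 2 == 0 ? Optional.of(n / 2) : Optional.empty();
        Function<Integer, Optional<Integer>> minusThree =
            n -> n > 10 ? Optional.of(n - 3) : Optional.empty();
        int f = optimize(40, List.of(halve, minusThree), Integer::doubleValue);
        System.out.println(f);   // prints 5
    }
}
```

Because the loop keeps only the variable current, it has no way to go back to an earlier plan, which is the weakness described in the bullets above.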
Cost-based approach uses cost estimates for finding F.
– B(s), R(s) and V(s, F ) are these estimates in SimpleDB.
– Conceptually, this approach remembers several Plans P′, P″, P‴, ... so
that it can choose the one with the lowest cost as the final F.
– This is more tedious than the heuristic approach.
– This can avoid getting stuck with good-looking early choices that lead into
a bad final F by always keeping in mind several choices at the same time.
– Here, splitting the optimization stage lets phase ❶ consider fewer Plans;
otherwise it would also have to consider all their implementations from
phase ❷ at the same time.
• A practical RDBMS planner might not split this stage into phases ❶ and ❷.
– The reason is that this split loses information which could be useful in finding
a good final F.
– For instance, phase ❶ can decide to put a join somewhere – but does not know
yet what algorithm it will use, because phase ❷ will decide that only later.
– But then phase ❶ cannot yet use the cost estimate of this still unknown join
algorithm.
– Instead, it must use some other estimate which applies to all joins – and this
is coarser.
– Moreover, the measure of this coarser estimate cannot be the number of Blocks,
because that would require knowing the particular algorithm – so phase ❶
cannot use the measure which we want the final F to optimize!
– We do not want to use B(select(product(T, U ), . . .)) because this is not how
we want to implement the join!
\[ \mathrm{cost}(\mathrm{join}(T, U)) = (\text{number of rows in } T) + (\text{number of rows in } U) \tag{30} \]
– The cost estimate of a whole Plan is the sum of these costs of all its joins.
– The intuition is that joins determine most of the performance of the whole
Plan, because they can. . .
∗ read their input Tables many times, and
∗ generate large output Tables from them.
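As a small illustration of Eq. (30) and of summing it over all joins of a Plan, consider the sketch below. The classes CostNode, CostTable and CostJoin are hypothetical, and the output-size estimate of each join is simply supplied by the caller, since this passage does not say how it is obtained.

```java
/** Toy plan tree for the join-based cost estimate (hypothetical classes, not SimpleDB). */
abstract class CostNode {
    abstract long rows();       // estimated number of output rows
    abstract long joinCost();   // sum of Eq. (30) over all joins in this subtree
}

final class CostTable extends CostNode {
    private final long rows;
    CostTable(long rows) { this.rows = rows; }
    long rows() { return rows; }
    long joinCost() { return 0; }               // a stored Table contains no joins
}

final class CostJoin extends CostNode {
    private final CostNode left, right;
    private final long rows;                    // output estimate, supplied by the caller
    CostJoin(CostNode left, CostNode right, long rows) {
        this.left = left;
        this.right = right;
        this.rows = rows;
    }
    long rows() { return rows; }
    long joinCost() {
        long own = left.rows() + right.rows();             // Eq. (30) for this join
        return own + left.joinCost() + right.joinCost();   // plus all joins below it
    }
}
```

For example, joining Tables of 100 and 20 rows into an assumed 50-row result, and then joining that with a 5-row Table, gives (100 + 20) + (50 + 5) = 175.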
• Phase ❶ rearranges the Relational Algebra expression tree by substituting one
subtree with another.
• Later we consider heuristic rules which suggest how they should be used.
Figure 92: A group of products can be reordered freely. (Sciore, 2008)
• We have already used Eq. (33) implicitly in our basic translation of SQL into
Relational Algebra.
• Figure 94 illustrates this Eq. (33).
Figure 93: Rearranging products freely. (Sciore, 2008)
Figure 94: Splitting one selection node into two. (Sciore, 2008)
• Since the SQL translation placed the WHERE part with its selections on top of
the FROM part with its products, we need a transformation to rearrange them:
if the selection Predicate p does not mention any of the Attributes of T2 –
otherwise they would no longer be defined on the right-hand side!
• Figure 96 shows how this transformation allows the optimizer to move a selection
as far down the tree as it will go.
• Figures 96 and 97 show the joint effect of Eqs. (94) and (34).
Figure 96: Pushing one selection down. (Sciore, 2008)
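Pushing a selection down a tree of products can be sketched as below. The sketch assumes the rule has the usual form select(product(T1, T2), p) ≡ product(select(T1, p), T2) whenever p mentions no Attribute of T2, and it also applies the mirror image for T2, which follows because products can be reordered freely. The classes Expr, Tbl, Product, Select and PushDown, and the table and field names in the usage note, are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

abstract class Expr {
    abstract Set<String> fields();   // Attributes defined in the output of this node
}

final class Tbl extends Expr {
    final String name;
    final Set<String> flds;
    Tbl(String name, Set<String> flds) { this.name = name; this.flds = flds; }
    Set<String> fields() { return flds; }
    @Override public String toString() { return name; }
}

final class Product extends Expr {
    final Expr left, right;
    Product(Expr left, Expr right) { this.left = left; this.right = right; }
    Set<String> fields() {
        Set<String> all = new HashSet<>(left.fields());
        all.addAll(right.fields());
        return all;
    }
    @Override public String toString() { return "product(" + left + ", " + right + ")"; }
}

final class Select extends Expr {
    final Expr child;
    final String pred;              // predicate text, kept only for printing
    final Set<String> predFields;   // Attributes mentioned by the predicate
    Select(Expr child, String pred, Set<String> predFields) {
        this.child = child;
        this.pred = pred;
        this.predFields = predFields;
    }
    Set<String> fields() { return child.fields(); }
    @Override public String toString() { return "select(" + child + ", " + pred + ")"; }
}

final class PushDown {
    /** Places a selection on pred as far down through products as its Attributes allow. */
    static Expr push(Expr child, String pred, Set<String> predFields) {
        if (child instanceof Product) {
            Product p = (Product) child;
            if (p.left.fields().containsAll(predFields))    // pred does not mention the right side
                return new Product(push(p.left, pred, predFields), p.right);
            if (p.right.fields().containsAll(predFields))   // mirror image of the rule
                return new Product(p.left, push(p.right, pred, predFields));
        }
        return new Select(child, pred, predFields);         // cannot go any lower
    }
}
```

With hypothetical Tables STUDENT(SId, GradYear), ENROLL(EId, StudentId) and SECTION(SectId), pushing "GradYear = 2004" into product(product(STUDENT, ENROLL), SECTION) lands the selection directly on STUDENT, whereas "SId = StudentId" stops on top of product(STUDENT, ENROLL), where it can later be combined with the product into a join.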
• Together, Eqs. (31)–(34) permit reorganizing the FROM and WHERE parts from
the SQL translation quite freely.
• This freedom lets the optimizer turn a select-product pair into a join with
• This is how the optimizer can find out what joins it should perform, even though
the user has written this information only implicitly into the FROM and WHERE
parts of the SQL query.
• However, we have not given transformations for the semi- and antijoins arising
from [NOT] IN. . . and EXISTS. . . subqueries in the WHERE part.
Figure 97: Pushing selections past products. (Sciore, 2008)
Figure 98: Figure 97 continued. (Sciore, 2008)
Figure 99: Adding joins into Figure 98. (Sciore, 2008)
• SQL translation generates just one projection node on top of the Relational Algebra
expression, whose task is to output only those Attributes which the user asked for.
• Analogous transformations are also available for other operations like
groupby, extend, union, ... but we omit them here.
• The intuition is that if we are going to drop a row from the result, then we should
do it as early as possible, before we have unnecessarily joined it with other rows.
• If all the Attributes in a selection Predicate φi come from the same stored Table T
then this select(. . . , φi ) lands just on top of T . . .
Figure 100: Adding projections into Figure 99. (Sciore, 2008)
• They can be recombined with Eq. (33) into
former generates all combinations of rows from its 2 input Tables, and selects
some of them as its output, but the
latter can avoid generating the other combinations.
• But it is not so easy to see what joins should be performed and when.
• However, this is perhaps the single most important question in query optimization!
• A single join node is left-deep, if its right subtree does not contain join nodes.
• Or if we also want to consider Views and other nested subqueries within the SQL
FROM part, then amend this into “. . . unless they came from the nested subquery”.
• Figures 101–102 show different shapes of join trees for the same query.
• However, Figure 104 shows that the best choice in Figures 101–102 would be (f)
which is not left-deep.
• Many optimizers consider only left-deep join trees, even though they can lead to
worse Plans, because. . .
Figure 101: Different join tree shapes for the same query. (Sciore, 2008)
Figure 102: Figure 101 continued. (Sciore, 2008)
Figure 103: Figure 102 continued. (Sciore, 2008)
Figure 104: Costs of join trees in Figures 101–102. (Sciore, 2008)
– the best such Plan is usually not much worse – compare (d) to (f).
– the more general and difficult problem
“What is a good join tree?”
turns into the simpler but still difficult problem
“What is a good join order?”
• A heuristic solution to this simpler problem consists of rules for deciding which
Table (or subquery) should. . .
1. start the left-deep join tree as its leftmost leaf T1 ?
2. be added to the current left-deep join tree as its next leaf Ti+1 to the right?
Heuristic 4 (start with the smallest Table). Start the join order with the Table having
the smallest output.
• The intuition of Heuristic 4 is to start with the smallest intermediate result, and
hope that this causes the intermediate results of later joins to stay small too.
• In Figure 101(a) this heuristic recommends starting the left-deep join tree with
COURSE, because ψCOURSE reduces the estimate of its output size to 12.5 Records.
Heuristic 5 (start with most restrictive). Start the join order with the Table T whose
selection predicate ψT is most restrictive.
• The intuition of Heuristic 5 is that ψT is most effective when it appears early in the
joins.
• The corresponding expression has usually the form
\[ \mathrm{select}(T,\ \underbrace{A_1 = c_1 \text{ AND } A_2 = c_2 \text{ AND } A_3 = c_3 \text{ AND } \ldots}_{\psi_T}) \]
• This leads by Figure 68 to the estimate
\[ \frac{1}{\underbrace{V(T, A_1) \cdot V(T, A_2) \cdot V(T, A_3) \cdot \ldots}_{\text{Maximize this denominator!}}} \cdot R(T) \tag{38} \]
• In Figure 101(a) this heuristic recommends starting the left-deep join tree with
STUDENT instead of COURSE, because its output size reduction factor is
\[ \frac{1}{50} < \frac{1}{40} \text{ for COURSE.} \]
• The designer of a heuristic optimizer decides which one (s)he will include into the
optimizer.
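The difference between Heuristics 4 and 5 can be checked with a little arithmetic on Eq. (38). In the sketch below, the class SelectEstimate is hypothetical. R(COURSE) = 500 is only what the 12.5 Records and the factor 1/40 quoted above imply together, and R(STUDENT) = 45000 is an assumed figure chosen for illustration.

```java
final class SelectEstimate {
    /** Reduction factor 1 / (V(T,A1) * V(T,A2) * ...) of an equality-only Predicate. */
    static double reductionFactor(int... distinctValues) {
        double denominator = 1.0;
        for (int v : distinctValues) denominator *= v;
        return 1.0 / denominator;
    }

    /** Estimated output size by Eq. (38): reduction factor times R(T). */
    static double outputSize(long records, int... distinctValues) {
        return reductionFactor(distinctValues) * records;
    }

    public static void main(String[] args) {
        double course = outputSize(500, 40);       // implied by 12.5 Records and factor 1/40
        double student = outputSize(45000, 50);    // R(STUDENT) is an assumed figure
        // Heuristic 4 compares output sizes, Heuristic 5 compares reduction factors.
        System.out.println("Heuristic 4 starts with " + (course < student ? "COURSE" : "STUDENT"));
        System.out.println("Heuristic 5 starts with "
            + (reductionFactor(40) < reductionFactor(50) ? "COURSE" : "STUDENT"));
    }
}
```

Under these figures the two heuristics disagree: COURSE has the smaller output, but STUDENT has the smaller reduction factor.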
Heuristic 6 (avoid products). Choose the next Table in the join order so that it can
be connected to the preceding join order with an actual join if possible.
• That is, try to choose the next Table N so that there is some selection Predicate φ
which compares Attributes of N to Attributes in the preceding join order.
• Then we get
φ = true.
• In Figure 101, this Heuristic 6 determines the rest of the join order, once its starting
Table has been chosen by Heuristic 4 or 5. Starting with. . .
• Let us then turn to heuristics for phase ❷, which selects implementations for the
nodes of the plan P produced by phase ❶.
• Phase ❷ starts at the leaves and works towards the root:
– This way it has already chosen algorithms for the children of its current node C.
– Then it can choose the algorithm for C based on their actual costs.
Heuristic 7 (use an index). Implement a select operation with the indexselect algo-
rithm whenever possible.
• The intuition is that if a stored Table T does have a suitable index, then use it.
and T has an index on A1 , then phase ❷ must first use Eq. (33) to get
• If T has many indexes Ai , then choose the one with the largest V(T, Ai ) by Eq. (38).
3. mergejoin otherwise.
join(T, U, T .A = U .B )
a hash table for Table U on its join Attribute U .B , then all the rows s of U to be
joined with a row r of T can be found in the bucket for key r .A.
– That is, hashing can be used to exclude the other rows s′ of U which are not
joined with r.
– This lets us split T and U into smaller bucket files, which can be joined
recursively.
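The bucket idea behind hashjoin can be sketched in memory as follows. This is a simplified illustration, not SimpleDB's implementation: the class ToyHashJoin and the representation of rows as String arrays are assumptions, and the whole hash table for U is held in memory instead of being split into bucket files that are joined recursively.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToyHashJoin {
    /** Joins rows of T and U where t[aCol] equals u[bCol]; rows are String arrays. */
    static List<String[]> join(List<String[]> t, int aCol, List<String[]> u, int bCol) {
        // Build phase: one bucket per distinct value of the join Attribute U.B.
        Map<String, List<String[]>> buckets = new HashMap<>();
        for (String[] s : u)
            buckets.computeIfAbsent(s[bCol], k -> new ArrayList<>()).add(s);

        // Probe phase: each row r of T looks only into the bucket for key r.A,
        // so all other rows of U are never touched for this r.
        List<String[]> out = new ArrayList<>();
        for (String[] r : t) {
            for (String[] s : buckets.getOrDefault(r[aCol], List.of())) {
                String[] joined = new String[r.length + s.length];
                System.arraycopy(r, 0, joined, 0, r.length);
                System.arraycopy(s, 0, joined, r.length, s.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```

Each input is read once here, whereas a product followed by a selection would pair every row of T with every row of U.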
• The mergejoin algorithm is in turn based on the insight that if we first sort each
input Table on its join Predicate (that is, sort Table T on T .A and Table U
on U .B ) then
• These hash- and mergejoins are examples of operations M which must materialize
their input (by hashing or sorting).
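The sentence about mergejoin above breaks off before stating what sorting buys. The sketch below assumes the usual consequence: once both inputs are sorted on their join Attributes, one synchronized pass over them finds all matching rows. The class ToyMergeJoin and the String-array rows are hypothetical, and the sorted copies stand in for the materialized inputs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ToyMergeJoin {
    /** Joins rows of T and U where t[aCol] equals u[bCol], by sorting and merging. */
    static List<String[]> join(List<String[]> t, int aCol, List<String[]> u, int bCol) {
        // Materialize sorted copies of both inputs.
        List<String[]> ts = new ArrayList<>(t);
        List<String[]> us = new ArrayList<>(u);
        ts.sort(Comparator.comparing((String[] r) -> r[aCol]));
        us.sort(Comparator.comparing((String[] s) -> s[bCol]));

        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < ts.size() && j < us.size()) {
            int cmp = ts.get(i)[aCol].compareTo(us.get(j)[bCol]);
            if (cmp < 0) {
                i++;                                 // T row is too small to match anything left
            } else if (cmp > 0) {
                j++;                                 // U row is too small to match anything left
            } else {
                // Find the run of U rows sharing this key, then pair it with every matching T row.
                String key = ts.get(i)[aCol];
                int jEnd = j;
                while (jEnd < us.size() && us.get(jEnd)[bCol].equals(key)) jEnd++;
                while (i < ts.size() && ts.get(i)[aCol].equals(key)) {
                    for (int k = j; k < jEnd; k++) out.add(concat(ts.get(i), us.get(k)));
                    i++;
                }
                j = jEnd;
            }
        }
        return out;
    }

    private static String[] concat(String[] r, String[] s) {
        String[] joined = new String[r.length + s.length];
        System.arraycopy(r, 0, joined, 0, r.length);
        System.arraycopy(s, 0, joined, r.length, s.length);
        return joined;
    }
}
```

The inner handling of runs matters when a key is repeated on both sides: every T row with that key must be paired with every U row with the same key.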
Figure 105: Adding projections to Figure 91(c). (Sciore, 2008)
• Its intuition is that then M no longer has to store those Attributes of N which are
no longer needed.
• Figure 105 shows these projections added to the materialized arguments of the
topmost join.
• Whenever many different next Tables could be added into the current left-deep join
tree, this Planner makes a greedy choice:
Choose the Table which produces the next tree T whose R(T ) is smallest.
• This heuristic Planner uses cost information in this way to choose among the pos-
sibilities permitted by its rules.
package simpledb.opt;

/**
 * A query planner that optimizes using a heuristic-based algorithm.
 * @author Edward Sciore
 */
public class HeuristicQueryPlanner implements QueryPlanner {
   private Collection<TablePlanner> tableplanners = new ArrayList<TablePlanner>();

   /**
    * Creates an optimized left-deep query plan using the following
    * heuristics.
    * H1. Choose the smallest table (considering selection predicates)
    * to be first in the join order.
    * H2. Add the table to the join order which
    * results in the smallest output.
    */
   public Plan createPlan(QueryData data, Transaction tx) {
      // Step 3: Repeatedly add a plan to the join order
      while (!tableplanners.isEmpty()) {
         Plan p = getLowestJoinPlan(currentplan);
         if (p != null)
            currentplan = p;
         else // no applicable join
            currentplan = getLowestProductPlan(currentplan);
      }
      // Step 4. Project on the field names and return
      return new ProjectPlan(currentplan, data.fields());
   }

   private Plan getLowestSelectPlan() {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeSelectPlan();
         if (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput()) {
            besttp = tp;
            bestplan = plan;
         }
      }
      tableplanners.remove(besttp);
      return bestplan;
   }

   private Plan getLowestJoinPlan(Plan current) {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeJoinPlan(current);
         if (plan != null && (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput())) {
            besttp = tp;
            bestplan = plan;
         }
      }
      if (bestplan != null)
         tableplanners.remove(besttp);
      return bestplan;
   }

   private Plan getLowestProductPlan(Plan current) {
      TablePlanner besttp = null;
      Plan bestplan = null;
      for (TablePlanner tp : tableplanners) {
         Plan plan = tp.makeProductPlan(current);
         if (bestplan == null || plan.recordsOutput() < bestplan.recordsOutput()) {
            besttp = tp;
            bestplan = plan;
         }
      }
      tableplanners.remove(besttp);
      return bestplan;
   }
}
package simpledb.opt;

/**
 * This class contains methods for planning a single table.
 * @author Edward Sciore
 */
class TablePlanner {
   private TablePlan myplan;
   private Predicate mypred;
   private Schema myschema;
   private Map<String,IndexInfo> indexes;
   private Transaction tx;

   /**
    * Creates a new table planner.
    * The specified predicate applies to the entire query.
    * The table planner is responsible for determining
    * which portion of the predicate is useful to the table,
    * and when indexes are useful.
    * @param tblname the name of the table
    * @param mypred the query predicate
    * @param tx the calling transaction
    */
   public TablePlanner(String tblname, Predicate mypred, Transaction tx) {
      this.mypred = mypred;
      this.tx = tx;
      myplan = new TablePlan(tblname, tx);
      myschema = myplan.schema();
      indexes = SimpleDB.mdMgr().getIndexInfo(tblname, tx);
   }

   /**
    * Constructs a select plan for the table.
    * The plan will use an indexselect, if possible.
    * @return a select plan for the table.
    */
   public Plan makeSelectPlan() {
      Plan p = makeIndexSelect();
      if (p == null)
         p = myplan;
      return addSelectPred(p);
   }

   /**
    * Constructs a join plan of the specified plan
    * and the table. The plan will use an indexjoin, if possible.
    * (Which means that if an indexselect is also possible,
    * the indexjoin operator takes precedence.)
    * The method returns null if no join is possible.
    * @param current the specified plan
    * @return a join plan of the plan and this table
    */
   public Plan makeJoinPlan(Plan current) {
      Schema currsch = current.schema();
      Predicate joinpred = mypred.joinPred(myschema, currsch);
      if (joinpred == null)
         return null;
      Plan p = makeIndexJoin(current, currsch);
      if (p == null)
         p = makeProductJoin(current, currsch);
      return p;
   }

   /**
    * Constructs a product plan of the specified plan and
    * this table.
    * @param current the specified plan
    * @return a product plan of the specified plan and this table
    */
   public Plan makeProductPlan(Plan current) {
      Plan p = addSelectPred(myplan);
      return new MultiBufferProductPlan(current, p, tx);
   }

   private Plan makeIndexSelect() {
      for (String fldname : indexes.keySet()) {
         Constant val = mypred.equatesWithConstant(fldname);
         if (val != null) {
            IndexInfo ii = indexes.get(fldname);
            return new IndexSelectPlan(myplan, ii, val, tx);
         }
      }
      return null;
   }
      }
      return null;
   }

   private Plan addSelectPred(Plan p) {
      Predicate selectpred = mypred.selectPred(myschema);
      if (selectpred != null)
         return new SelectPlan(p, selectpred);
      else
         return p;
   }
• The best solution finally appears in lowest[all Tables].
• Figures 106 and 107 show how this lowest array is calculated for Figures 101–103.
• Note how it remembers the best solutions to smaller subproblems in order to solve
larger subproblems.
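How such a lowest array can be filled in is sketched below. This is a simplified illustration and not necessarily the exact computation of the figures: it assumes left-deep join trees only, measures cost by Eq. (30) summed over the joins, represents a set of Tables as a bitmask, and takes the row estimates from a caller-supplied function. The class LowestArray is hypothetical.

```java
import java.util.function.IntToLongFunction;

final class LowestArray {
    /**
     * Returns lowest[], where lowest[S] is the cheapest left-deep cost of joining
     * the Tables in bitmask S. rows.applyAsLong(S) is the caller's estimate of the
     * number of rows in the join of the Tables in S (for a single bit: the Table itself).
     */
    static long[] compute(int n, IntToLongFunction rows) {
        long[] lowest = new long[1 << n];
        for (int s = 1; s < (1 << n); s++) {
            if (Integer.bitCount(s) == 1) {
                lowest[s] = 0;                      // a single Table needs no join
                continue;
            }
            long best = Long.MAX_VALUE;
            for (int t = 0; t < n; t++) {
                int bit = 1 << t;
                if ((s & bit) == 0) continue;       // Table t is not in S
                int rest = s & ~bit;                // smaller subproblem, already solved
                // Cost of the best plan for rest, plus Eq. (30) for joining Table t last.
                long cost = lowest[rest] + rows.applyAsLong(rest) + rows.applyAsLong(bit);
                best = Math.min(best, cost);
            }
            lowest[s] = best;
        }
        return lowest;
    }
}
```

Because rest is always numerically smaller than S, iterating the bitmasks in increasing order guarantees that every smaller subproblem has been solved before it is needed, and the final answer is in lowest[(1 << n) - 1].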
Figure 106: An example lowest array. (Sciore, 2008)
Figure 107: Figure 106 continued. (Sciore, 2008)