ADT Unit 1 To 5

A distributed database is a collection of interconnected databases located across various sites, managed by a Distributed Database Management System (DDBMS) that ensures data transparency and synchronization. It aims to improve reliability, availability, and performance through techniques like fragmentation and replication, with types categorized into homogeneous and heterogeneous systems. Fragmentation divides data into smaller subsets for efficiency, while replication creates copies across sites for fault tolerance and quicker access, each having its own advantages and disadvantages.


UNIT-I

What are distributed databases?

A distributed database is a collection of multiple interconnected databases, which are spread physically across various locations that communicate via a computer network.
● Distributed database is a system in which storage devices are not connected to a
common processing unit.
● Database is controlled by Distributed Database Management System and data may
be stored at the same location or spread over the interconnected network. It is a
loosely coupled system.
● Shared nothing architecture is used in distributed databases.

● In a typical distributed database system, a communication channel is used to communicate between the different locations, and every system has its own memory and database.
Features
● Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
● Data is physically stored across multiple sites. Data in each site can be managed by a
DBMS independent of the other sites.
● The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
● A distributed database is not a loosely connected file system.
● A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.
Distributed Database Management System
A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database in a manner as if it were all stored in a single location.
Features
● It is used to create, retrieve, update and delete distributed databases.
● It synchronizes the database periodically and provides access mechanisms by virtue of which the distribution becomes transparent to the users.
● It ensures that data modified at any site is universally updated.

● It is used in application areas where large volumes of data are processed and
accessed by numerous users simultaneously.
● It is designed for heterogeneous database platforms.
● It maintains confidentiality and data integrity of the databases.

Goals of Distributed Database system.

The concept of distributed database was built with a goal to improve:

Reliability: In a distributed database system, if one site fails or stops working for some time, another site can complete the task.
Availability: Data remains available even if a server fails, because another site is available to serve the client request.
Performance: Performance improves because the database is distributed over different locations, so data is available close to each location and is easier to maintain.

Types of distributed databases.

The two types of distributed systems are as follows:

1. Homogeneous distributed databases system:


● A homogeneous distributed database system is a network of two or more databases (with the same type of DBMS software) which can be stored on one or more machines.
● In this system, data can be accessed and modified simultaneously on several databases in the network. Homogeneous distributed systems are easy to handle.
Example: Consider three departments using Oracle 9i as the DBMS. If changes are made in one department's database, they are propagated to the other departments as well.

Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
● Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
● Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS coordinates data updates across the sites.

2. Heterogeneous distributed database system.


● A heterogeneous distributed database system is a network of two or more databases with different types of DBMS software, which can be stored on one or more machines.
● In this system, data in the different databases can be accessed across the network with the help of generic connectivity (ODBC and JDBC).
Example: Different DBMS software at different sites are made accessible to each other using ODBC and JDBC.

Types of Heterogeneous Distributed Databases


● Federated − The heterogeneous database systems are independent in nature and
integrated together so that they function as a single database system.
● Un-federated − The database systems employ a central coordinating module through
which the databases are accessed.
The basic types of distributed DBMS are as follows:
1. Client-server architecture of Distributed system.

● A client server architecture has a number of clients and a few servers connected in a
network.
● A client sends a query to one of the servers. The earliest available server solves it
and replies.
● A client-server architecture is simple to implement and execute due to its centralized server system.

2. Collaborating server architecture.

● Collaborating server architecture is designed to run a single query on multiple servers.
● The servers break a single query into multiple small queries, and the result is sent to the client.
● Collaborating server architecture has a collection of database servers. Each server is capable of executing transactions across the databases.

3. Middleware architecture.

● Middleware architecture is designed in such a way that a single query is executed on multiple servers.
● This system needs only one server which is capable of managing queries and transactions spanning multiple servers.
● Middleware architecture uses local servers to handle local queries and transactions.
● The software used to execute queries and transactions across one or more independent database servers is called middleware.

What is fragmentation?

Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the
table are called fragments. Fragmentation can be of three types: horizontal, vertical, and
hybrid (combination of horizontal and vertical). Horizontal fragmentation can further be
classified into two techniques: primary horizontal fragmentation and derived horizontal
fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called “reconstructiveness.”
● The process of dividing the database into multiple smaller parts is called fragmentation.
● These fragments may be stored at different locations.
● The data fragmentation process should be carried out in such a way that the reconstruction of the original database from the fragments is possible.

Advantages of Fragmentation
● Since data is stored close to the site of usage, efficiency of the database system is
increased.
● Local query optimization techniques are sufficient for most queries since data is
locally available.
● Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
● When data from different fragments are required, the access speeds may be very low.
● In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
● Lack of back-up copies of data in different sites may render the database ineffective
in case of failure of a site.

Types of data Fragmentation

The three fragmentation techniques are −


● Vertical fragmentation
● Horizontal fragmentation
● Hybrid fragmentation

There are three types of data fragmentation:

1. Horizontal data fragmentation

Horizontal fragmentation divides a relation (table) horizontally into groups of rows to create subsets of tables.

Example:
Account (Acc_No, Balance, Branch_Name, Type)
In this example, suppose the Branch_Name column contains the values Pune, Baroda, and Delhi.

The query for one fragment can be written as:

SELECT * FROM Account WHERE Branch_Name = 'Baroda';

Types of horizontal data fragmentation are as follows:

1) Primary horizontal fragmentation


Primary horizontal fragmentation is the process of fragmenting a single table, row wise using
a set of conditions.

Example:

Acc_No   Balance   Branch_Name
A_101    5,000     Pune
A_102    10,000    Baroda
A_103    25,000    Delhi

For the above table we can define simple conditions such as Branch_Name = 'Pune', Branch_Name = 'Delhi', or Balance < 50000.

Fragmentation1:
SELECT * FROM Account WHERE Branch_Name = 'Pune' AND Balance < 50000;

Fragmentation2:
SELECT * FROM Account WHERE Branch_Name = 'Delhi' AND Balance < 50000;

2) Derived horizontal fragmentation


Fragmentation derived from the fragmentation of a primary relation is called derived horizontal fragmentation.

Example: Refer to the example of primary fragmentation given above.

The following fragments are derived from the primary fragmentation.

Fragmentation1:
SELECT * FROM Account WHERE Branch_Name = 'Baroda' AND Balance < 50000;

Fragmentation2:
SELECT * FROM Account WHERE Branch_Name = 'Delhi' AND Balance < 50000;

3) Complete horizontal fragmentation


● Complete horizontal fragmentation generates a set of horizontal fragments that together include every tuple of the original relation.
● Completeness is required for reconstruction of the relation, so that every tuple belongs to at least one of the fragments.
4) Disjoint horizontal fragmentation
Disjoint horizontal fragmentation generates a set of horizontal fragments in which no two fragments have a tuple in common. That means every tuple of the relation belongs to exactly one fragment.
5) Reconstruction of horizontal fragmentation
Reconstruction of horizontal fragmentation can be performed using UNION operation on
fragments.
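For instance, assuming the two fragments above are stored as tables named Account_Pune and Account_Delhi (illustrative names, not taken from the text), the rows they cover can be rebuilt as:

-- Reconstruction of a horizontally fragmented relation via UNION
SELECT * FROM Account_Pune
UNION
SELECT * FROM Account_Delhi;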

2. Vertical Fragmentation

Vertical fragmentation divides a relation (table) vertically into groups of columns to create subsets of tables.

Example:

Acc_No   Balance   Branch_Name
A_101    5,000     Pune
A_102    10,000    Baroda
A_103    25,000    Delhi

Fragmentation1:
SELECT Acc_No, Balance FROM Account;

Fragmentation2:
SELECT Acc_No, Branch_Name FROM Account;

(The key column Acc_No is kept in both fragments so that the original relation can be reconstructed.)

Complete vertical fragmentation


● Complete vertical fragmentation generates a set of vertical fragments which together include all the attributes of the original relation.
● Reconstruction of vertical fragmentation is performed by using a FULL OUTER JOIN operation on the fragments.
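For instance, assuming the two fragments above are stored as tables Account_Bal (Acc_No, Balance) and Account_Branch (Acc_No, Branch_Name) (illustrative names), the original relation can be rebuilt by joining on the key:

-- Reconstruction of a vertically fragmented relation by joining on the key column
SELECT b.Acc_No, b.Balance, br.Branch_Name
FROM Account_Bal b
FULL OUTER JOIN Account_Branch br ON b.Acc_No = br.Acc_No;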

3) Hybrid Fragmentation

● Hybrid fragmentation is achieved by performing horizontal and vertical partitioning together.
● A mixed fragment is a group of rows and columns of a relation.

Example: Consider the following table which consists of employee information.

Emp_ID Emp_Name Emp_Address Emp_Age Emp_Salary


101 Surendra Baroda 25 15000
102 Jaya Pune 37 12000
103 Jayesh Pune 47 10000

Fragmentation1:
SELECT Emp_ID, Emp_Name, Emp_Age FROM Employee WHERE Emp_Age < 40;

Fragmentation2:
SELECT Emp_ID, Emp_Address, Emp_Salary FROM Employee WHERE Emp_Address = 'Pune' AND Emp_Salary < 14000;

Reconstruction of Hybrid Fragmentation


The original relation in hybrid fragmentation is reconstructed by performing UNION
and FULL OUTER JOIN.
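For instance, assuming the fragments above are stored as tables Emp_F1 (Emp_ID, Emp_Name, Emp_Age) and Emp_F2 (Emp_ID, Emp_Address, Emp_Salary) (illustrative names), the rows they cover can be recombined on the key:

-- With hybrid fragments: UNION the horizontal pieces of each vertical fragment first
-- (if there are several), then join the vertical pieces on the key column.
SELECT f1.Emp_ID, f1.Emp_Name, f1.Emp_Age, f2.Emp_Address, f2.Emp_Salary
FROM Emp_F1 f1
FULL OUTER JOIN Emp_F2 f2 ON f1.Emp_ID = f2.Emp_ID;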

What is data replication?

Data replication is the process in which the data is copied at multiple locations (Different
computers or servers) to improve the availability of data.

Data replication is the process of storing separate copies of the database at two or more sites.

It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
● Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
● Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime hours.
Data updating can be done at non-prime hours.
● Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
● Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Disadvantages of Data Replication
● Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
● Increased Cost and Complexity of Data Updating − Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the different
sites. This requires complex synchronization techniques and protocols.
● Undesirable Application – Database coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex co-ordination at application
level. This results in undesirable application – database coupling.

Goals of data replication


Data replication is done with an aim to:
● Increase the availability of data.
● Speed up the query evaluation.

Types of data replication


There are two types of data replication:

1. Synchronous Replication:
In synchronous replication, the replica is modified immediately after changes are made to the original relation, so there is no difference between the original data and the replica.

2. Asynchronous replication:
In asynchronous replication, the replica is modified only after the commit is executed on the database, so the replica may briefly lag behind the original.

Replication Schemes

The three replication schemes are as follows:

1. Full Replication

In this design alternative, a copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast, requiring negligible communication cost. On the other hand, the massive redundancy in data incurs a huge cost during update operations. Hence, this scheme is suitable for systems where a large number of queries must be handled while the number of database updates is low.
In full replication scheme, the database is available to almost every location or user in
communication network.

Advantages of full replication


● High availability of data, as database is available to almost every location.
● Faster execution of queries.
Disadvantages of full replication
● Concurrency control is difficult to achieve in full replication.
● Update operation is slower.

2. No Replication
In this design alternative, different tables are placed at different sites. Data is placed so that it
is at a close proximity to the site where it is used most. It is most suitable for database
systems where the percentage of queries needed to join information in tables placed at
different sites is low.

If an appropriate distribution strategy is adopted, then this design alternative helps to reduce
the communication cost during data processing.

No replication means, each fragment is stored exactly at one location.

Advantages of no replication
● Concurrency problems are minimized, since only a single copy of each data item exists.
● Easy recovery of data.
Disadvantages of no replication
● Poor availability of data.
● Query execution is slower, as multiple clients access the same server.

3. Partial replication

Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance to the frequency of access. This takes into consideration the fact
that the frequency of accessing the tables vary considerably from site to site. The number of
copies of the tables (or portions) depends on how frequently the access queries execute and
the site which generate the access queries.

Partial replication means only some fragments are replicated from the database.

Advantages of partial replication

The number of replicas created for each fragment can be chosen according to the importance of the data in that fragment.

Distributed databases - Query processing and Optimization

A DDBMS processes and optimizes a query in terms of the communication cost of processing a distributed query and other parameters.

Various factors which are considered while processing a query are as follows:

Costs of Data transfer

● This is a very important factor while processing queries. Intermediate data is transferred to other locations for processing, and the final result is sent back to the location where the query was originally issued.
● The cost of data transfer increases when the sites are connected only through low-performance communication channels.
● The DDBMS query optimization algorithms are used to minimize the cost of data transfer.

Semi-join based query optimization

● A semi-join is used to reduce the number of tuples in a relation before transferring it to another location.
● Only the joining columns are transferred first in this method.
● This method reduces the cost of data transfer.
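As an illustration (a sketch only, with hypothetical site and table names), a semi-join of an EMPLOYEE relation stored at site 1 with a DEPARTMENT relation stored at site 2 could proceed as follows:

-- Step 1, at site 2: project only the joining column and ship it to site 1.
SELECT DISTINCT Dno FROM DEPARTMENT;

-- Step 2, at site 1: reduce EMPLOYEE using the shipped values (the semi-join itself).
SELECT * FROM EMPLOYEE
WHERE Dno IN (/* Dno values shipped from site 2 */);

-- Step 3: only the reduced EMPLOYEE tuples are shipped back to site 2,
-- where the final join with DEPARTMENT is computed.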

Cost based query optimization

● Query optimization involves many operations such as selection, projection, and aggregation.
● The cost of communication is considered in query optimization.
● Information about relations stored at remote locations is obtained from the system catalogs.
● A query manipulated at the local location is treated as a sub-query with respect to the other global locations. This process estimates the total cost needed to compute the intermediate relations.
Transactions
A transaction is a program comprising a collection of database operations, executed as a logical unit of data processing. The operations performed in a transaction include one or more database operations such as insert, delete, update, or retrieve. It is an atomic process that is either performed to completion entirely or not performed at all. A transaction involving only data retrieval without any data update is called a read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
● read_item() − reads data item from storage to main memory.
● modify_item() − change value of item in the main memory.
● write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write forms the basic database operations.
Transaction Operations
The low level operations performed in a transaction are −
● begin_transaction − A marker that specifies start of transaction execution.
● read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
● end_transaction − A marker that specifies end of transaction.
● commit − A signal to specify that the transaction has been successfully completed in
its entirety and will not be undone.
● rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
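As a simple illustration (reusing the Account table from the fragmentation examples), these markers appear in SQL roughly as follows:

BEGIN;                                            -- begin_transaction
UPDATE Account SET Balance = Balance - 1000
WHERE Acc_No = 'A_101';                           -- read_item, modify_item, write_item
UPDATE Account SET Balance = Balance + 1000
WHERE Acc_No = 'A_102';
COMMIT;                                           -- commit (or ROLLBACK to undo the changes)
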
Transaction States
A transaction may go through a subset of five states: active, partially committed, committed, failed and aborted.
● Active − The initial state where the transaction enters is the active state. The
transaction remains in this state while it is executing read, write or other operations.
● Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
● Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
● Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
● Aborted − This is the state after the transaction has been rolled back after failure and
the database has been restored to its state that was before the transaction began.
Desirable Properties of Transactions
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.
● Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed at all.
No partial update should exist.
● Consistency − A transaction should take the database from one consistent state
to another consistent state. It should not adversely affect any data item in the
database.
● Isolation − A transaction should be executed as if it is the only one in the system.
There should not be any interference from the other concurrent transactions that
are simultaneously running.
● Durability − If a committed transaction brings about a change, that change
should be durable in the database and not lost in case of any failure.

Distributed Transactions

● A distributed database management system should be able to survive a system failure without losing any data in the database.
● This property is provided by transaction processing.
● A local transaction works only at its own (local) site, whereas a transaction that spans several sites is treated as a global transaction.

● Transactions are assigned to a transaction monitor, which works as a supervisor.

● A distributed transaction process is designed to distribute data over many locations, and transactions are either carried out successfully at all locations or terminated cleanly.
● Transaction processing is very useful for concurrent execution and recovery of data.

What is recovery in distributed databases?

Recovery is the most complicated process in distributed databases. Recovery of a failed system in the communication network is very difficult.

For example:
Consider that location A sends a message to location B and expects a response from B, but B is unable to receive it. There are several possible causes for this situation:

● The message failed due to a failure in the network.
● Location B sent a message, but it was not delivered to location A.
● Location B crashed.

● So it is actually very difficult to find the cause of a failure in a large communication network.
● A distributed commit across the network is also a serious problem which can affect recovery in distributed databases.

COMMIT PROTOCOL
In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system,
the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction state
and waits for all other transactions to reach their partially committed states. When it receives
the message that all the sites are ready to commit, it starts to commit. In a distributed system,
either all sites commit or none of them does.
The different distributed commit protocols are −
● One-phase commit
● Two-phase commit
● Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The
steps in distributed commit are −
● After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
● The slaves wait for “Commit” or “Abort” message from the controlling site. This
waiting time is called window of vulnerability.
● When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
message to all the slaves.
● On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

Two-phase commit protocol in Distributed databases

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols.

● The two-phase commit protocol is a type of atomic commitment protocol. It is a distributed algorithm that coordinates all the processes participating in the database transaction and decides whether to commit or abort (terminate) it. The protocol is based on the commit and abort actions.
● The two-phase commit protocol ensures that all participants accessing the database server receive and apply the same action (commit or abort), even in the case of a local network failure.
● The two-phase commit protocol provides an automatic recovery mechanism in case of a system failure.
● The site at which the original transaction takes place is called the coordinator, and a site where a sub-process takes place is called a cohort.

Commit request (prepare) phase:
In this phase the coordinator attempts to prepare all the cohorts and takes the necessary steps to commit or abort the transaction.

Commit phase:
In the commit phase, based on the votes of the cohorts, the coordinator decides whether to commit or abort the transaction.
The steps performed in the two phases are as follows −
Phase 1: Prepare Phase
● After each slave has locally completed its transaction, it sends a “DONE”
message to the controlling site. When the controlling site has received “DONE”
message from all slaves, it sends a “Prepare” message to the slaves.
● The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
● A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a
timeout.
Phase 2: Commit/Abort Phase
● After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves,
it considers the transaction as committed.
● After the controlling site has received the first “Not Ready” message from any
slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it
considers the transaction as aborted.
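To illustrate how the two phases can surface in SQL, the following sketch uses PostgreSQL-style two-phase commit commands on a single participant site (this assumes a PostgreSQL-like system configured to allow prepared transactions; the transaction identifier 'txn_42' is illustrative):

-- Phase 1, at a participant: do the work, then prepare (equivalent to voting "Ready").
BEGIN;
UPDATE Account SET Balance = Balance - 1000 WHERE Acc_No = 'A_101';
PREPARE TRANSACTION 'txn_42';      -- the transaction survives a crash in this prepared state

-- Phase 2, once the coordinator has collected all votes:
COMMIT PREPARED 'txn_42';          -- global commit
-- or, if any participant voted "Not Ready":
-- ROLLBACK PREPARED 'txn_42';     -- global abort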

Distributed Three-phase Commit


The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase
● The controlling site issues an “Enter Prepared State” broadcast message.
● The slave sites vote “OK” in response.

Phase 3: Commit / Abort Phase
The steps are same as two-phase commit except that “Commit ACK”/”Abort ACK” message
is not required.

Concurrency problems in distributed databases.

Some problems which occur while accessing the database are as follows:

1. Failure at local locations


When the system recovers from a failure, its database is outdated compared to the other locations, so it is necessary to update the database.

2. Failure at communication location


The system should have the ability to manage temporary failures of the communication network in a distributed database. In this case, a partition occurs, which can limit the communication between two locations.

3. Dealing with multiple copies of data


It is very important to maintain multiple copies of distributed data at different locations.

4. Distributed commit
While committing a transaction that accesses databases stored at multiple locations, a failure may occur at some location during the commit process; this situation is known as the distributed commit problem.

5. Distributed deadlock
Deadlock can occur at several locations due to recovery problem and concurrency problem
(multiple locations are accessing same system in the communication network).

Concurrency Controls in distributed databases

Concurrency control between transactions can be achieved by applying locking techniques:
1) Lock based protocol
A lock is applied to avoid concurrency problems between two transactions: the lock is held by one transaction, and the other transaction can access the data item only after the lock is released. Locks are applied on write or read operations. Locking is an important method for preventing conflicting concurrent access.
2) Shared lock (read lock)
A transaction can acquire a shared lock on a data item to read its contents. The lock is shared in the sense that any other transaction can also acquire a shared lock on the same data item for reading purposes.
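As an illustration (a sketch using PostgreSQL-style locking clauses on the Account table from the earlier examples), a transaction can request the two kinds of row locks like this:

-- Shared (read) lock: other readers may also lock the row, but writers must wait.
SELECT Balance FROM Account WHERE Acc_No = 'A_101' FOR SHARE;

-- Exclusive (write) lock: no other transaction may lock the row until this one ends.
SELECT Balance FROM Account WHERE Acc_No = 'A_101' FOR UPDATE;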

UNIT-II

Active Database

An active database is a database augmented with a set of triggers. Such databases are difficult to maintain because of the complexity that arises in understanding the combined effect of these triggers. In such a database, the DBMS first verifies whether a particular trigger specified for the statement that modifies the database is activated, prior to executing the statement.

If the trigger is active, the DBMS evaluates the condition part and executes the action part only if the specified condition evaluates to true. A single statement can activate more than one trigger.

In such a situation, the DBMS processes the triggers in some arbitrary order. The execution of the action part of a trigger may activate other triggers, or even the same trigger that initiated the action. A trigger that activates itself is called a ‘recursive trigger’. The DBMS executes such chains of triggers in some pre-defined manner, but this makes the overall behaviour harder to understand.

Features of Active Database:

1. It possesses all the concepts of a conventional database, i.e., data modelling facilities, query language, etc.

2. It supports all the functions of a traditional database like data definition, data
manipulation, storage management etc.
3. It supports definition and management of ECA rules.

4. It detects event occurrence.

5. It must be able to evaluate conditions and to execute actions.

6. It means that it has to implement rule execution.

Advantages :

1. Enhances traditional database functionalities with powerful rule processing capabilities.

2. Enables a uniform and centralized description of the business rules relevant to the information system.

3. Avoids redundancy of checking and repair operations.

4. Provides a suitable platform for building large and efficient knowledge bases and expert systems.

1. Generalized Model for Active Databases and Oracle Triggers

The model that has been used to specify active database rules is referred to as the Event-
Condition-Action (ECA) model. A rule in the ECA model has three components:
1. The event(s) that triggers the rule: These events are usually database update operations that are explicitly applied to the database. However, in the general model, they could also be temporal events or other kinds of external events.
2. The condition that determines whether the rule action should be executed: Once the triggering
event has occurred, an optional condition may be evaluated. If no condition is specified, the action
will be executed once the event occurs. If a condition is specified, it is first evaluated, and only if
it evaluates to true will the rule action be executed.
3. The action to be taken: The action is usually a sequence of SQL statements, but it could also
be a database transaction or an external program that will be automatically executed.
Let us consider some examples to illustrate these concepts. The examples are based on a much
simplified variation of the COMPANY database application from Figure 3.5 and is shown in
Figure 26.1, with each employee having a name (Name), Social

Security number (Ssn), salary (Salary), department to which they are currently assigned (Dno, a
foreign key to DEPARTMENT), and a direct supervisor (Supervisor_ssn, a (recursive) foreign key
to EMPLOYEE). For this example, we assume that NULL is allowed for Dno, indicating that an
employee may be temporarily unassigned to any department. Each department has a name
(Dname), number (Dno), the total salary of all employees assigned to the department (Total_sal),
and a manager (Manager_ssn, which is a foreign key to EMPLOYEE).

Notice that the Total_sal attribute is really a derived attribute, whose value should be the sum of
the salaries of all employees who are assigned to the particular department. Maintaining the correct value of such a derived attribute can be done via an active rule. First we have to determine
the events that may cause a change in the value of Total_sal, which are as follows:
1. Inserting (one or more) new employee tuples

2. Changing the salary of (one or more) existing employees

3. Changing the assignment of existing employees from one department to another


4. Deleting (one or more) employee tuples

In the case of event 1, we only need to recompute Total_sal if the new employee is immediately
assigned to a department—that is, if the value of the Dno attribute for the new employee tuple is
not NULL (assuming NULL is allowed for Dno). Hence, this would be the condition to be
checked. A similar condition could be checked for event 2 (and 4) to determine whether the
employee whose salary is changed (or who is being deleted) is currently assigned to a department.
For event 3, we will always execute an action to maintain the value of Total_sal correctly, so no

condition is needed (the action is always executed).

The action for events 1, 2, and 4 is to automatically update the value of Total_sal for the
employee’s department to reflect the newly inserted, updated, or deleted employee’s salary. In the
case of event 3, a twofold action is needed: one to update the Total_sal of the employee’s old
department and the other to update the Total_sal of the employee’s new department.
The four active rules (or triggers) R1, R2, R3, and R4—corresponding to the above situation—can
be specified in the notation of the Oracle DBMS as shown in Figure 26.2(a). Let us consider rule
R1 to illustrate the syntax of creating triggers in Oracle.

The CREATE TRIGGER statement specifies a trigger (or active rule) name Total_sal1 for R1.
The AFTER clause specifies that the rule will be triggered after the events that trigger the rule
occur. The triggering events—an insert of a new employee in this example—are specified
following the AFTER keyword.
The ON clause specifies the relation on which the rule is specified—EMPLOYEE for R1.
The optional keywords FOR EACH ROW specify that the rule will be triggered once for each
row that is affected by the triggering event.

The optional WHEN clause is used to specify any conditions that need to be checked after the rule
is triggered, but before the action is executed. Finally, the action(s) to be taken is (are) specified
as a PL/SQL block, which typically contains one or more SQL statements or calls to execute
external procedures.

The four triggers (active rules) R1, R2, R3, and R4 illustrate a number of features of active rules.
First, the basic events that can be specified for triggering the rules are the standard SQL update
commands: INSERT, DELETE, and UPDATE. They are specified by the keywords INSERT, DELETE, and UPDATE in Oracle notation. In the case of UPDATE, one may specify the attributes to be updated, for example by writing UPDATE OF Salary, Dno. Second, the rule designer needs to have a way to refer to the tuples that have been inserted, deleted, or modified by the triggering event. The keywords NEW and OLD are used in Oracle notation; NEW is used to refer to a newly inserted or newly updated tuple, whereas OLD is used to refer to a deleted tuple or to a tuple before it was updated.

Thus, rule R1 is triggered after an INSERT operation is applied to the EMPLOYEE relation.
In R1, the condition (NEW.Dno IS NOT NULL) is checked, and if it evaluates to true, meaning
that the newly inserted employee tuple is related to a department, then the action is executed. The
action updates the DEPARTMENT tuple(s) related to the newly inserted employee by adding their
salary (NEW.Salary) to the Total_sal attribute of their related department.
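Since Figure 26.2 is not reproduced here, the following is a sketch of what R1 could look like in the simplified Oracle-style notation described above (the trigger, table, and attribute names follow the text; treat it as illustrative rather than exact PL/SQL):

CREATE TRIGGER Total_sal1
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
WHEN ( NEW.Dno IS NOT NULL )
    UPDATE DEPARTMENT
    SET Total_sal = Total_sal + NEW.Salary
    WHERE Dno = NEW.Dno;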

Rule R2 is similar to R1, but it is triggered by an UPDATE operation that updates


the SALARY of an employee rather than by an INSERT. Rule R3 is triggered by an update to
the Dno attribute of EMPLOYEE, which signifies changing an employee’s assignment from one
department to another. There is no condition to check in R3, so the action is executed whenever
the triggering event occurs. The action updates both the old department and new department of the
reassigned employees by adding their salary to Total_sal of their new department and subtracting
their salary from Total_sal of their old department. Note that this should work even if the value
of Dno is NULL, because in this case no department will be selected for the rule action.

It is important to note the effect of the optional FOR EACH ROW clause, which signifies that the rule is triggered separately for each tuple. This is known as a row-level trigger. If this clause was left out, the trigger would be known as a statement-level trigger and would be triggered once for each triggering statement. To see the difference, consider the following update operation, which gives a 10 percent raise to all employees assigned to department 5. This operation would be an event that triggers rule R2:
UPDATE EMPLOYEE
SET Salary = 1.1 * Salary
WHERE Dno = 5;

Because the above statement could update multiple records, a rule using row-level semantics, such
as R2 in Figure 26.2, would be triggered once for each row, whereas a rule using statement-level
semantics is triggered only once. The Oracle system allows the user to choose which of the above
options is to be used for each rule. Including the optional FOR EACH ROW clause creates a row-
level trigger, and leaving it out creates a statement-level trigger. Note that the
keywords NEW and OLD can only be used with row-level triggers.
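As an illustration (a sketch only, not taken from the figures), a statement-level counterpart of R2 would omit FOR EACH ROW and, since NEW and OLD are unavailable, could instead recompute the derived Total_sal values after the triggering statement:

CREATE TRIGGER Total_sal_stmt
AFTER UPDATE OF Salary ON EMPLOYEE
    -- no FOR EACH ROW: the trigger fires once per triggering statement
    UPDATE DEPARTMENT D
    SET Total_sal = ( SELECT SUM(E.Salary)
                      FROM EMPLOYEE E
                      WHERE E.Dno = D.Dno );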

As a second example, suppose we want to check whenever an employee’s salary is greater than
the salary of his or her direct supervisor. Several events can trigger this rule: inserting a new
employee, changing an employee’s salary, or changing an employee’s supervisor. Suppose that
the action to take would be to call an external procedure inform_supervisor, which will notify the
supervisor. The rule could then be written as in R5 (see Figure 26.2(b)).
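Since Figure 26.2(b) is not reproduced here, a sketch of R5 in the same simplified notation might look as follows (illustrative only; a real Oracle WHEN clause cannot contain a subquery, so this is schematic):

CREATE TRIGGER Inform_supervisor1
BEFORE INSERT OR UPDATE OF Salary, Supervisor_ssn ON EMPLOYEE
FOR EACH ROW
WHEN ( NEW.Salary > ( SELECT Salary FROM EMPLOYEE
                      WHERE Ssn = NEW.Supervisor_ssn ) )
    inform_supervisor( NEW.Supervisor_ssn, NEW.Ssn );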

Figure 26.3 shows the syntax for specifying some of the main options available in Oracle triggers.
We will describe the syntax for triggers in the SQL-99 standard in Section 26.1.5.

Figure 26.3

A syntax summary for specifying triggers in the Oracle system (main options only).

<trigger> ::= CREATE TRIGGER <trigger name>
    ( AFTER | BEFORE ) <triggering events> ON <table name>
    [ FOR EACH ROW ]
    [ WHEN <condition> ]
    <trigger actions> ;

<triggering events> ::= <trigger event> { OR <trigger event> }

<trigger event> ::= INSERT | DELETE | UPDATE [ OF <column name> { , <column name> } ]

<trigger action> ::= <PL/SQL block>


2. Design and Implementation Issues for Active Databases

The previous section gave an overview of some of the main concepts for specifying active rules.
In this section, we discuss some additional issues concerning how rules are designed and
implemented. The first issue concerns activation, deactivation, and grouping of rules. In addition
to creating rules, an active database system should allow users to activate,
deactivate, and drop rules by referring to their rule names. A deactivated rule will not be
triggered by the triggering event. This feature allows users to selectively deactivate rules for certain
periods of time when they are not needed. The activate command will make the rule active again.
The drop command deletes the rule from the system. Another option is to group rules into
named rule sets, so the whole set of rules can be activated, deactivated, or dropped. It is also useful
to have a command that can trigger a rule or rule set via an explicit PROCESS RULES command
issued by the user.

The second issue concerns whether the triggered action should be executed before, after, instead
of, or concurrently with the triggering event. A before trigger executes the trigger before
executing the event that caused the trigger. It can be used in applications such as checking for
constraint violations. An after trigger executes the trigger after executing the event, and it can be
used in applications such as maintaining derived data and monitoring for specific events and
conditions. An instead of trigger executes the trigger instead of executing the event, and it can be
used in applications such as executing corresponding updates on base relations in response to an
event that is an update of a view.

A related issue is whether the action being executed should be considered as


a separate transaction or whether it should be part of the same transaction that triggered the rule.
We will try to categorize the various options. It is important to note that not all options may be
available for a particular active database system. In fact, most commercial systems are limited to
one or two of the options that we will now discuss.

Let us assume that the triggering event occurs as part of a transaction execution. We should first
consider the various options for how the triggering event is related to the evaluation of the rule’s
condition. The rule condition evaluation is also known as rule consideration, since the action is
to be executed only after considering whether the condition evaluates to true or false. There are
three main possibilities for rule consideration:

Immediate consideration. The condition is evaluated as part of the same transaction as the
triggering event, and is evaluated immediately. This case can be further categorized into three
options:

Evaluate the condition before executing the triggering event.

Evaluate the condition after executing the triggering event.

Evaluate the condition instead of executing the triggering event.

Deferred consideration. The condition is evaluated at the end of the trans-action that included the
triggering event. In this case, there could be many triggered rules waiting to have their conditions
evaluated.

Detached consideration. The condition is evaluated as a separate transaction, spawned from the
triggering transaction.

The next set of options concerns the relationship between evaluating the rule condition
and executing the rule action. Here, again, three options are possible: immediate, deferred,
or detached execution. Most active systems use the first option. That is, as soon as the condition
is evaluated, if it returns true, the action is immediately executed.

The Oracle system (see Section 26.1.1) uses the immediate consideration model, but it allows the
user to specify for each rule whether the before or after option is to be used with immediate
condition evaluation. It also uses the immediate execution model. The STARBURST system (see
Section 26.1.3) uses the deferred consideration option, meaning that all rules triggered by a
transaction wait until the triggering transaction reaches its end and issues its COMMIT
WORK command before the rule conditions are evaluated.

Another issue concerning active database rules is the distinction between row-
level rules and statement-level rules. Because SQL update statements (which act as triggering
events) can specify a set of tuples, one has to distinguish between whether the rule should be
considered once for the whole statement or whether it should be considered separately for each
row (that is, tuple) affected by the statement. The SQL-99 standard (see Section 26.1.5) and the
Oracle system (see Section 26.1.1) allow the user to choose which of the options is to be used for
each rule, whereas STARBURST uses statement-level semantics only. We will give examples of
how statement-level triggers can be specified in Section 26.1.3.

One of the difficulties that may have limited the widespread use of active rules, in spite of their
potential to simplify database and software development, is that there are no easy-to-use
techniques for designing, writing, and verifying rules. For example, it is quite difficult to verify
that a set of rules is consistent, meaning that two or more rules in the set do not contradict one
another. It is also difficult to guarantee termination of a set of rules under all circumstances. To
illustrate the termination problem briefly, consider the rules in Figure 26.4. Here, rule R1 is triggered by an INSERT event
on TABLE1 and its action includes an update event on Attribute1 of TABLE2. However,
rule R2’s triggering event is an UPDATE event on Attribute1 of TABLE2, and its action includes
an INSERT event on TABLE1. In this example, it is easy to see that these two rules can trigger
one another indefinitely, leading to non-termination. However, if dozens of rules are written, it is
very difficult to determine whether termination is guaranteed or not.
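Since Figure 26.4 is not reproduced here, the following sketch (using the placeholder table and attribute names from the text, in the same simplified notation) shows how two such mutually triggering rules could look:

-- R1: an INSERT on TABLE1 updates Attribute1 of TABLE2 ...
CREATE TRIGGER R1
AFTER INSERT ON TABLE1
FOR EACH ROW
    UPDATE TABLE2 SET Attribute1 = Attribute1 + 1;

-- ... while R2, fired by that UPDATE, inserts into TABLE1 again,
-- so the two rules can keep triggering each other indefinitely.
CREATE TRIGGER R2
AFTER UPDATE OF Attribute1 ON TABLE2
FOR EACH ROW
    INSERT INTO TABLE1 DEFAULT VALUES;   -- insert a new (default) row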

If active rules are to reach their potential, it is necessary to develop tools for the design, debugging,
and monitoring of active rules that can help users design and debug their rules.

Spatial Databases

Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.
Characteristics of Spatial Database
A spatial database system has the following characteristics
 It is a database system
 It offers spatial data types (SDTs) in its data model and query language.
 It supports spatial data types in its implementation, providing at least spatial indexing and
efficient algorithms for spatial join.
Example

A road map is a visualization of geographic information. A road map is a 2-dimensional object


which contains points, lines, and polygons that can represent cities, roads, and political boundaries
such as states or provinces.

In general, spatial data can be of two types −
 Vector data: This data is represented as discrete points, lines and polygons.
 Raster data: This data is represented as a matrix of square cells.

Spatial data in the form of points, lines, polygons, etc. is used by many different kinds of databases.
Spatial operators :

Spatial operators are applied to the geometric properties of objects. They are used to capture objects in physical space and the relations among them, and also to perform spatial analysis.
Spatial operators are grouped into three categories :

1. Topological operators :
Topological properties are those that do not vary under transformations such as translation or rotation.

Topological operators are hierarchically structured in many levels. The base level offers operators the ability to check for detailed topological relations between regions with a broad boundary. The higher levels offer more abstract operators that allow users to query uncertain spatial data independent of the geometric data model.

Examples –
open (region), close (region), and inside (point, loop).

2. Projective operators :
Projective operators, like the convex hull, are used to establish predicates regarding the concavity or convexity of objects.

Example –
Being inside the concavity of an object.

3. Metric operators :
The task of metric operators is to provide a more accurate description of the geometry of the object. They are often used to measure the global properties of single objects, and to measure the relative position of different objects in terms of distance and direction.

Example –
length (of an arc) and distance (of a point to point).

2. Dynamic Spatial Operators :


Dynamic operators change the objects upon which they are applied. Create, destroy, and update are the fundamental dynamic operations.

Example –
Updating of a spatial object via translate, rotate, scale up or scale down, reflect, and shear.

3. Spatial Queries :
The requests for the spatial data which requires the use of spatial operations are called Spatial
Queries.

It can be divided into –

1. Range queries :
It finds all objects of a particular type that are within a given spatial area.

Example –
Finds all hospitals within the Siliguri area. A variation of this query is for a given location, find
all objects within a particular distance, for example, find all banks within 5 km range.

2. Nearest neighbor queries :


It finds the object of a particular type that is nearest to a given location.

Example –
Finds the nearest police station from the location of accident.

3. Spatial joins or overlays :


It joins objects of two types based on a spatial condition, such as objects that intersect or overlap spatially.

Example –
Finds all Dhabas on a National Highway between two cities. It spatially joins township objects
and highway object.

Finds all hotels that are within 5 kilometers of a railway station. It spatially joins railway station
objects and hotels objects.
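As an illustration (a hypothetical sketch in PostGIS-style spatial SQL; the table and column names are made up, and ST_DWithin is assumed to operate on geography columns measured in metres), the hotel/railway-station query above could be written as:

SELECT h.name AS hotel, s.name AS station
FROM hotels h
JOIN railway_stations s
  ON ST_DWithin(h.geom, s.geom, 5000);   -- hotels within 5 km of a station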

How do Spatial Databases differ from each other?

When it comes to comparing spatial databases, we can look at three primary features:

 Spatial data types

 Spatial queries

 Spatial indexes

Together, these three components comprise the basis of a spatial database. These three
components will help you decide which spatial database is most suitable for your enterprise or
business.

Spatial Data Type

Spatial data comes in all shapes and sizes. All databases typically support points, lines, and
polygons, but some support many more spatial data types. Some databases abide by the standards
set by the Open Geospatial Consortium. Yet, that doesn’t mean it is easy to move the data between
databases.
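As an illustration (a sketch in PostGIS-style syntax; the table name, column names, and SRID are illustrative), a spatial data type typically appears as a geometry column in an ordinary table definition:

CREATE TABLE city (
    id   SERIAL PRIMARY KEY,
    name TEXT,
    geom GEOMETRY(Point, 4326)   -- a point geometry in WGS 84 coordinates
);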

This is where the FME platform reveals some of its strengths. Database barriers no longer matter,
as you can move your data wherever you want. With support for over 450 different systems and
applications, it can handle all your data tasks, spatial and otherwise.

FME platform supports over 450 different systems and applications


Spatial Queries

Spatial queries perform an action on spatial data stored in the database. Some spatial queries can
be used to perform simple operations. However, some queries can become much more complex,
invoking spatial functions that span multiple tables. A spatial query using SQL allows you to
retrieve a specific subset of spatial data. This helps you retrieve only what you need from your
database.

This is how data is retrieved in spatial databases. The spatial query capabilities can vary from
database to database, both in terms of performance and functionality. This is important to consider
when you select your database.

Spatial queries drive a whole new class of business decisions by retrieving the requested data efficiently for your business systems.

Spatial Indexes

What does the added size and complexity of spatial data mean for your data? Will your database
run slower? Will large spatial databases be too bulky for your database to store?

This is why spatial indexes are important. Spatial indexes are created with SQL commands. These are generated from the database management interface or from an external program (e.g., FME) with access to your spatial database. Spatial indexes vary from database to database and are responsible for the database performance necessary for adding spatial data to your decision making.
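As an illustration (again PostGIS-style syntax, reusing the hypothetical city table above), a spatial index is typically created on the geometry column like this:

CREATE INDEX city_geom_idx ON city USING GIST (geom);   -- a GiST index that speeds up spatial queries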

SPATIAL DATA MINING

Spatial data mining describes the process of discovering hidden patterns in large spatial data sets.
As a key driver of GIS application development, spatial data mining allows users to extract
valuable data on contiguous regions and investigate spatial patterns. In this scenario, spatial
variables like distance and direction are taken into account.

Data visualization software, such as Tableau, allows data scientists and marketers to connect
different spatial data files like Esri File Geodatabases, GeoJSON files, Keyhole Markup Language
(KML) files, MapInfo tables, Shapefiles and TopoJSON files. Once connected, users can create
points, lines and polygon maps using the information in spatial data files, lidar data files and
geospatial data files.

Spatial data is important for the internet of things (IoT). It helps IoT protocols use remote sensing
to collect data for spatial analysis. Spatial data is also used in transportation and logistics to help
companies understand which machine would work best at a specific location, make accurate time
estimations for deliveries and track deliveries in real time.

Environmental technologies also use spatial data to monitor temperature patterns, tidal patterns and more. The ability to track at-risk areas, in combination with historical data, weather data and geospatial data, gives scientists better information to predict natural disasters.

APPLICATIONS OF SPATIAL DATABASE

Building Urban Resilient Cities with Geospatial AI

One of the challenging aspects today for policymakers worldwide is to reduce the risks related to
climate change. GIS technologies help in understanding complex situations better and offering
concrete Geospatial solutions. Planning bodies can leverage technology to assess and implement
sustainable programs for the future.

GIS frameworks offer a scientific understanding of earth systems that lead to better decision-
making. Some GIS analysis examples are as follows.

 Deforestation analysis to enable reforestation programs

 Sea level analysis to measure rising levels and the threats they pose

 Assessment of emissions and preparing for alternative energy sources

With Evergreen Canada, one of our clients from the ESG sector, we built AI for Resilient City.
It’s an AI-driven 3-D data visualization tool that aims to help municipalities across Canada plan
for and mitigate the impacts of climate change.

Predicting Quality of Life with Satellite Imagery

The spatial structure of a geographical area plays a vital role in the lives of its inhabitants. To
ensure the quality of life of inhabitants, planners need better insights for effective decision-making.
So, improved knowledge of spatial structures and related socio-economic levels is vital.

Spatial pattern metrics from local climate zone classification helps in this aspect. It becomes
available by combining open GIS data, remote sensing, and machine learning. The data helps in
identifying a relationship between socioeconomic variables and spatial pattern metrics.

Some examples of variables include healthcare, education, and transportation. These variables also
help in grouping areas of any geography based on the quality of life they offer.

For example, we built an AI-driven application to predict the quality of life by evaluating socioeconomic data. This application leverages deep learning to process satellite images, enriched with census data, and offers insights about rises in urbanization and poverty, and anomalies in census-measured factors like literacy, employment, and healthcare.

Traffic Analysis

One of the better ways to identify problems in transportation systems is by modeling public transit
accessibility and traffic congestion. Traffic modeling also helps in identifying road stretches that
often exceed their capacity levels. People, usually in low-income groups, often lack vehicles, which makes transit more difficult for them.

So inadequate public transportation can impact their ability to access employment and other
amenities freely. In many cases, public transportation does not cover every neighborhood.
Similarly, traffic congestion makes these services unreliable.

Disease Mapping with Location

Satellite data is finding increasing use in predicting disease risks across geographies. It can predict
the spatial distribution of infections and help plan the medication distribution for control and
preventive measures.

Geostatistical models are helpful, along with other factors like surface temperature and rainfall –
these help in understanding the prevalence of disease in society. Annual temperatures and distance
to water bodies are some other critical factors involved in the process.

One of our clients, the World Mosquito Program (WMP), alters mosquitos with natural bacteria to
limit their capacity to transmit dangerous diseases. These mosquitoes’ progeny lose their capacity
to transmit illness as well.

WMP collaborated with Gramener as part of a Microsoft AI for Good grant. Our AI-driven
approach evaluates population density using satellite images and Geospatial AI. It suggests a
neighborhood-level action to hasten social effect and save lives.

Gramener used computer vision models on high-resolution satellite pictures to estimate population
density and sensitivity to mosquito-borne illnesses at the sub-neighborhood level. The AI solution
creates a fine-grained release and monitoring strategy based on the city, population, and projected
coverage. This allows the WMP team to move quickly and maximize the effectiveness of their
solution.

Population Density Mapping for Vaccination

When we talk of the current COVID-19 pandemic, it is a global challenge. Vaccinating as many
people as possible to achieve herd immunity status is critical to ending the pandemic. GIS
technologies can help in optimizing vaccine distribution to reduce the period to vaccinate everyone
eligible.

It is possible to map the population by identifying and segregating people based on their age groups
and the type of vaccines available. Analysis can show the number of people living close to a
vaccination site and the time taken to vaccinate them. It helps in the quick and efficient distribution
of vaccines to enable uniform coverage across cities, states, and countries.

Health Facility Mapping

Healthcare facilities are a vital component of any health system. Healthcare access is critical for
each individual living in a geographical area. Ensuring enough coverage of healthcare facilities is
something that defines a successful healthcare program. Spatial analysis helps identify the
locations of health facilities and their proximity to people.

GIS systems further help gain access to advanced metrics of healthcare and identify inequalities
for better planning. The correct data helps in planning appropriate access of healthcare facilities to
even the marginalized sections of society.

Crop Yield Prediction

Reduced crop production has become a common phenomenon in recent years due to the
unpredictability of climate. Weather forecasts today play a critical role in improving crop
management. Crop yield prediction, through spatial analysis, helps in planning and executing
smooth logistical operations.

It is possible to study crop yield prediction through satellite imagery, soil conditions, and climate
data. Another crucial factor is the possibility of pest attacks in farms, which can also be predicted.
These resources combine to help identify a suitable time for crop production.

Livestock Monitoring

Livestock is a vital element of the economy, making their management an essential task. At places
where cattle roam around freely, spatial monitoring assumes much more importance. Studies have
also shown that livestock can release methane, which has a direct impact on global warming.
Larger herds can lead to higher methane emissions.

Nitrogen released in the soil is also hazardous as it can pollute water bodies as well. More
importantly, the effect of different species like cattle, swine, and goats can differ. GIS tools enable
online monitoring to check the damage caused to vegetation and landscape.

Another essential aspect is the manure production of animals, which is beneficial for biofuels.
Quantitative modeling tools help in calculating biogas production through the number of livestock
and the quality of manure. It assumes importance for countries where dependency is heavy on
livestock and the natural gas resources are less.

Farm-Level Nutrient Analysis

Soil properties mapping is essential to adopt sustainable farming practices. GPS tools can help in
identifying and collecting coordinates of sample areas. Researchers can then study the soil
properties like pH level, nitrogen content, nutrient levels, and much more. A GIS environment can
showcase the soil properties and their spatial variability. It is possible to do that with the help of
geostatistical analysis and interpolation techniques.

The spatial dependency level and spatial distribution of soil properties can vary significantly. This
data can help decision-makers to plan for better nutrient management. The prototype data also
remains helpful for future use of fertilizers.

Crop Detection & Monitoring

The growth and productivity of crops depend on several factors like soil condition, weather, and
other management techniques. These can differ significantly across regions. To enable smart
farming, remote sensing data is ideal for mapping crops and understanding their performance.

A crop simulation model in the GIS framework can monitor crop performance through remote
sensing techniques. The remote sensing data can also help identify information related to crop
distribution, environment, and phenology.

Purpose of Spatial Analysis Solution

The spatial analysis examples and applications are far and wide. We take a look at four such
applications in detail:

Simplify Geographic Search

As the first spatial analysis example, it involves researching geography to identify the buildings
and other structures. For example, researchers might need to understand “how many hospitals are
available in a particular town.” This nonspatial query does not require knowledge of the physical
location of a hospital. However, if the question is “how many hospitals are within a distance of
five kilometers of each other,” it becomes spatial.
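
As a minimal sketch of the difference, the following Python snippet (with hypothetical hospital coordinates) answers the spatial version of the question using the haversine distance; a real GIS would run such a query against its spatial data and indexes.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # Earth radius ~6371 km

# Hypothetical hospital locations (name, latitude, longitude)
hospitals = [
    ("City General", 13.0827, 80.2707),
    ("Lakeside Clinic", 13.0700, 80.2400),
    ("North Care", 13.1500, 80.3000),
]

# Nonspatial query: how many hospitals are in the town? -> just a count
print("Hospitals in town:", len(hospitals))

# Spatial query: which hospital pairs lie within 5 km of each other?
for i in range(len(hospitals)):
    for j in range(i + 1, len(hospitals)):
        n1, la1, lo1 = hospitals[i]
        n2, la2, lo2 = hospitals[j]
        d = haversine_km(la1, lo1, la2, lo2)
        if d <= 5:
            print(f"{n1} and {n2} are {d:.1f} km apart")
```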

GIS can help in measuring the distance between the hospitals. It becomes possible as a GIS can
link spatial data with facts regarding geographical features on a map. The information remains
stored as attributes and characteristics of the graphical representation. Without GIS, street
networks are represented only by simple street centerlines, which is not very helpful from a visual
representation point of view.

GIS gives you the chance to use different symbols and showcase the database on the map. You
can show building types, such as hospitals. The visual representation makes it easier for users to
study information seamlessly.

Enable Population Clustering

The distribution of population has spatial features. Population analysis through traditional methods
does not allow combining quantity, quality, data, and graphic methods. GIS helps in exhibiting the
spatial characteristics of population data on a macro level. The technology leverages display and
analysis functions to enable comprehensive representation.

The micro-level representation of population data involves adding public institutions, retail units,
and other structures that make up a geographic area. Such models also showcase the effect of such
structures on a population. GIS helps the decision-making authorities by integrating population
and spatial data. Population clustering remains a prominent spatial analysis example.

Offer Ease of Data annotation for Insights

The third spatial analysis example is data annotation or exploratory insights, which involves using
tools and methods that uncover finer details of data. It also includes spatial and nonspatial patterns
and distributions. The raw data usually comes in tabular form, and making sense out of that set
can be difficult.

Exploratory analysis works with numeric data to identify the mean value. Some other statistics
involved in the process are median, standard deviation, and visualizations. Scatter plots and bar
charts are part of visualizations. Insights help in exploring spatial patterns and performing spatial
analysis.
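
A minimal sketch of such exploratory statistics, computed over a hypothetical numeric attribute extracted from a tabular spatial data set:

```python
import statistics

# Hypothetical numeric attribute from a spatial data table,
# e.g. population counts per census block.
values = [1200, 950, 1870, 2300, 640, 1510, 990]

print("mean   :", statistics.mean(values))
print("median :", statistics.median(values))
print("stdev  :", statistics.stdev(values))
```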

Enable Visual Mapping with Layers

This spatial analysis example is critical in visualizing information. Geospatial analysis tools help
in performing the visual mapping. Users can analyze data sets by adding them to maps. The layers
remain on background maps and can have charts, heatmap, line layers, and geodata. You can use
internal and external sources to gather data for layers and background maps.

Visual mapping involves gathering data from sources like smartphones, satellites, vehicles, and
wearable devices. These can power your analytics and dashboard reporting to improve the
decision-making process. You can also identify patterns and get insights that do not appear in raw
data available in spreadsheets.

Bottomline

Spatial analysis has assumed an essential role across industries today. Researchers and planners
across governmental and non-governmental agencies use spatial algorithms to study patterns
across geographies and plan their interventions. The backing of data also gives assurance regarding
the successful implementation of programs.

In the case of welfare programs of non-governmental organizations, it becomes much more critical.
It helps them spend their finances wisely to ensure maximum people get benefit from their
programs.

At Gramener, spatial analysis solutions are one of our key offerings. We help you leverage satellite
imagery and related information to solve your business challenges.

Mobile Database

A Mobile database is a database that can be connected to a mobile computing device over a
mobile network (or wireless network). Here the client and the server have wireless connections.
In today’s world, mobile computing is growing very rapidly and has huge potential in the field of
databases. Mobile databases exist for different platforms, such as Android-based and iOS-based
mobile databases. Common examples are Couchbase Lite and ObjectBox.

Features of Mobile database:

Here, we will discuss the features of the mobile database as follows.


 A cache is maintained to hold frequently accessed data and transactions so that they are not lost due to
connection failure.
 As the use of laptops, mobile phones and PDAs increases, databases increasingly reside on these mobile systems.

 Mobile databases are physically separate from the central database server.

 Mobile databases reside on mobile devices.

 Mobile databases are capable of communicating with a central database server or other
mobile clients from remote sites.

 With the help of a mobile database, mobile users can keep working even when the wireless
connection is poor or non-existent (disconnected operation).
 A mobile database is used to analyze and manipulate data on mobile devices.

Mobile Database typically involves three parties:

1. Fixed Hosts –
They perform the transaction and data management functions with the help of database servers.

2. Mobile Units –
These are portable computers that move around a geographical region covered by the
cellular network, which these units use to communicate with base stations.

3. Base Stations –
These are two-way radio installations at fixed locations that relay communication between
the mobile units and the fixed hosts.
Limitations
Here, we will discuss the limitation of mobile databases as follows.

 It has limited wireless bandwidth.

 Wireless communication speed is lower than in wired networks.

 Battery power on mobile devices is limited.

 It is less secure.
 It is hard to make theft-proof.

Mobility Management

With the convergence of the Internet and wireless mobile communications and with the rapid
growth in the number of mobile subscribers, mobility management emerges as one of the most
important and challenging problems for wireless mobile communication over the Internet.
Mobility management enables the serving networks to locate a mobile subscriber’s point of
attachment for delivering data packets (i.e. location management), and maintain a mobile
subscriber’s connection as it continues to change its point of attachment (i.e. handoff
management). The issues and functionalities of these activities are discussed in this section.

 Location management

Location management enables the networks to track the locations of mobile nodes.
Location management has two major sub-tasks:

(i) location registration, and (ii) call delivery or paging. In the location registration procedure,
the mobile node periodically sends specific signals to inform the network of its current location so
that the location database is kept updated. The call delivery procedure is invoked after the
completion of the location registration. Based on the information that has been registered in the
network during the location registration, the call delivery procedure queries the network about the
exact location of the mobile device so that a call may be delivered successfully. The design of a
location management scheme must address the following issues:

(i) minimization of signaling overhead and latency in the service delivery,

(ii) meeting the guaranteed quality of service (QoS) of applications, and

(iii) in a fully overlapping area where several wireless networks co-exist, an efficient and
robust algorithm must be designed so as to select the network through which a mobile device
should perform registration, deciding on where and how frequently the location information should
be stored, and how to determine the exact location of a mobile device within a specific time frame.
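
As an illustration only (the names and structures below are ours, not part of any standard), the two sub-tasks can be sketched as a location database that mobile nodes keep updated and that the network consults before delivering a call:

```python
# Minimal sketch of location management: a location database that mobile
# nodes update (location registration) and the network queries (call delivery).
location_db = {}          # mobile_id -> current point of attachment (cell id)

def register_location(mobile_id, cell_id):
    """Location registration: the mobile informs the network of its current cell."""
    location_db[mobile_id] = cell_id

def deliver_call(mobile_id):
    """Call delivery: query the registered location, then page that cell."""
    cell = location_db.get(mobile_id)
    if cell is None:
        return "page all cells (no registration found)"
    return f"page cell {cell} for {mobile_id}"

register_location("MN-42", "cell-7")
print(deliver_call("MN-42"))            # -> page cell cell-7 for MN-42
register_location("MN-42", "cell-9")    # the node moved and re-registers
print(deliver_call("MN-42"))            # -> page cell cell-9 for MN-42
```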

 Handoff management

Handoff management is the process by which a mobile node keeps its connection active
when it moves from one access point to another. There are three stages in a handoff process. First,
the initiation of handoff is triggered by either the mobile device, or a network agent, or the
changing network conditions. The second stage is for a new connection generation, where the
network must find new resources for the handoff connection and perform any additional routing
operations. Finally, data-flow control needs to maintain the delivery of the data from the old
connection path to the new connection path according to the agreed-upon QoS guarantees.
Depending on the movement of the mobile device, it may undergo various types of handoff. In a
broad sense, handoffs may be of two types:

(i) intra-system handoff (horizontal handoff) and


(ii) inter-system handoff (vertical handoff). Handoffs in homogeneous networks
are referred to as intra-system handoffs. This type of handoff occurs when the
signal strength of the serving BS goes below a certain threshold value.
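
A threshold-based trigger for such an intra-system handoff can be sketched as follows; the threshold and hysteresis values are illustrative assumptions, not standardized figures.

```python
HANDOFF_THRESHOLD_DBM = -95        # assumed threshold; real values are operator-specific

def should_initiate_handoff(serving_bs_dbm, candidate_bs_dbm, hysteresis_db=3):
    """Trigger a horizontal handoff when the serving BS drops below the
    threshold and a neighbouring BS is sufficiently stronger (the hysteresis
    margin avoids ping-pong handoffs)."""
    return (serving_bs_dbm < HANDOFF_THRESHOLD_DBM and
            candidate_bs_dbm > serving_bs_dbm + hysteresis_db)

print(should_initiate_handoff(-98, -90))   # True: serving BS too weak
print(should_initiate_handoff(-80, -75))   # False: serving BS still acceptable
```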

An inter-system handoff between heterogeneous networks may arise in the following scenarios

(i) when a user moves out of the serving network and enters an overlying network,
(ii) when a user connected to a network chooses to handoff to an underlying or overlaid
network for his/her service requirements,
(iii) when the overall load on the network is required to be distributed among different systems.

The design of handoff management techniques in all-IP based next-generation wireless networks
must address the following issues:

(i) signaling overhead and power requirement for processing handoff messages should be
minimized,
(ii) QoS guarantees must be made,
(iii) network resources should be efficiently used, and
(iv) the handoff mechanism should be scalable, reliable and robust.

Mobility management at different layers

A number of mobility management mechanisms in homogeneous networks have been
presented and discussed. Mobility management in heterogeneous networks is a much more
complex issue and usually involves different layers of the TCP/IP protocol stack. Several mobility
management protocols have been proposed in the literature for next-generation all-IP wireless
networks. Depending on the layers of communication protocol they primarily use, these
mechanisms can be classified into three categories – protocols at the networks layer, protocols at
the link layer and the cross-layer protocols. Network layer mobility protocols use messages at the
IP layer, and are agnostic of the underlying wireless access technologies. Link layer mobility
mechanisms provide mobility-related features in the underlying radio systems. Additional
gateways are usually required to be deployed to handle the inter-operating issues when roaming
across heterogeneous access networks. In link layer protocols, handoff signals are transmitted
through wireless links, and therefore, these protocols are tightly-coupled with specific wireless
technologies. Mobility supported at the link layer is also called access mobility or link layer
mobility. The cross-layer protocols are more common for handoff management. These protocols
aim to achieve network layer handoff with the help of communication and signaling from the link
layer. By receiving and analyzing, in advance, the signal strength reports and the information
regarding the direction of movement of the mobile node from the link layer, the system gets ready
for a network layer handoff so that packet loss is minimized and latency is reduced.

MOBILE TRANSACTION MODELS

The disconnection of mobile stations for possibly long periods of time and bandwidth limitations
require a serious reevaluation of transaction model and transaction processing techniques. There
have been many proposals to model mobile transactions with different notions of a mobile
transaction. Most of these approaches view a mobile transaction as consisting of subtransactions
which have some flexibility in consistency and commit processing. The management of these
transactions may be static at the mobile unit or the database server, or may move from base station
to base station as the mobile unit moves.

Network disconnection may not be treated as failure, and if the data and methods needed to
complete a task are already present on the mobile device, processing may continue even though
disconnection has occurred. Because the traditional techniques for providing serializability (e.g.,
transaction monitors, scheduler, locks) do not function properly in a disconnected environment,
new mechanisms are to be developed for the management of mobile transaction processing

Applications of mobile computing may involve many different tasks, which can include long-lived
transactions as well as data-processing tasks such as remote order entry. Since users need to be
able to work effectively in disconnected state, mobile devices will require some degree of
transaction management. So, concurrency control schemes for mobile distributed databases should
support the autonomous operation of mobile devices during disconnections. These schemes should
also keep message traffic low in view of bandwidth limitations. Another issue in
these schemes would be to consider the new locality or place after the movement of the mobile
device. These challenging issues have been studied by many researchers but only some of the work
is included below.

Many of these models examine relaxing some of the ACID properties, non-blocking execution on the
disconnected mobile unit, caching of data before it is requested, adaptation of commit protocols,
and recovery issues. Each model starts from its own basic requirements for the transaction model.
The first of the following transaction models is a new model defined especially for the
mobile environment, based on the traditional transaction models.

Kangaroo Transaction Model

A mobile transaction model has been defined addressing the movement behavior of transactions.
Mobile transactions are named as Kangaroo Transactions which incorporate the property that the
transactions in a mobile environment hop from one base station to another as the mobile unit
moves. The model captures this movement behavior and the data behavior reflecting the access to
data located in databases throughout the static network.

The assumed reference model has a Data Access Agent (DAA) which is used for accessing data
in the database (of a fixed host, base station or mobile unit), and each base station hosts a DAA.
When it receives a transaction request from a mobile user, the DAA forwards it to the specific base
stations or fixed hosts that contain the required data. The DAA acts as a Mobile Transaction Manager
and data access coordinator for the site. It is built on top of an existing Global Database
System (GDBS). A GDBS assumes that the local DBMS systems perform the required transaction
processing functions, including recovery and concurrency control. A DAA’s view of the GDBS is similar
to that seen by a user at a fixed terminal; the GDBS is not aware of the mobile nature of some
nodes in the network. The DAA is also not aware of the implementation details of each requested
transaction.

When a mobile transaction moves to a new cell, the control of the transaction may move or may
remain at the originating site. If it remains at the originating site, messages would have to be sent
from the originating site to the current base station any time the mobile unit requests information.
If the transaction management function moves with the mobile unit, the overhead of these
messages can be avoided. For the logging side of this movement, each DAA will have the log
information for its corresponding portion of the executed transaction.

The model is based on the traditional transaction concept: a sequence of operations including
read, write, begin transaction, end transaction, commit and abort. The basic
structure is mainly a Local transaction (LT) to a particular DBMS. On the other hand, Global
Transactions (GT) can consist of either subtransactions viewed as LTs to some DBMS (Global
SubTransaction -GST) or subtransactions viewed as sequence of operations which can be global
themselves (GTs). This kind of nested viewing gives a recursive definition based on the limiting
bottom view of local transactions. A hopping property is added to model the mobility of the
transactions and Figure 2 shows this basic Kangaroo Transaction (KT) structure.

Each subtransaction represents the unit of execution at one base station and is called a Joey
Transaction (JT). The sequence of global and local transactions executed under a given
KT is defined as a Pouch. The base station of origin initially creates a JT for its execution. A GT
and a JT differ only in that a JT is part of a KT and must be coordinated by a DAA
at some base station site. A KT has a unique identification number consisting of the base station
number and unique sequence number within the base station. When the mobile unit moves from
one cell to another, the control of the KT changes to a new DAA at another base station. The DAA
at the new base station site creates a new JT as the result of the handoff process. JTs also have
identification numbers in sequence, where a JT ID consists of the KT ID and the sequence number.

The mobility of the transaction model is captured by the use of split transactions. The old JT is
thus committed independently of the new JT. In Figure 2, JT1 is committed independently from
JT2 and JT3. If any JT fails, the entire KT may be undone by compensating any previously
completed JTs, since the autonomy of the local DBMSs must be assured. Therefore, a Kangaroo
Transaction can run in Split Mode or in Compensating Mode. A split transaction divides an
ongoing transaction into serializable subtransactions. The earlier-created subtransaction may be
committed while the second one continues its execution. However, the decision to abort or commit
the currently executing ones is left up to the component DBMSs. Previously committed JTs may not
be compensatable, so neither Split Mode nor Compensating Mode guarantees serializability of
Kangaroo transactions. Although Compensating Mode assures
atomicity, isolation may be violated because locks are obtained and released at the local transaction
level. With the Compensating Mode, Joey subtransactions are serializable. The Mobile transaction
Manager (MTM) keeps a Transaction Status Table on the base station DAA to maintain the status
of those transactions. It also keeps a local log into which the MTM writes the records needed for
recovery purposes, but the log does not contain any records related to recovering database
operations. Most records in the log are related to KT transaction status and some compensating
information.
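
A minimal, illustrative sketch of the hopping structure described above (the class and method names are ours, not part of the published model): each handoff commits the current Joey Transaction and splits off a new one, and in Compensating Mode a failure compensates previously committed JTs.

```python
class JoeyTransaction:
    """Unit of execution of a Kangaroo Transaction at one base station."""
    def __init__(self, kt_id, seq):
        self.id = f"{kt_id}.{seq}"          # JT ID = KT ID + sequence number
        self.committed = False

    def commit(self):
        self.committed = True


class KangarooTransaction:
    def __init__(self, base_station, seq, compensating_mode=False):
        self.id = f"{base_station}-{seq}"   # KT ID = base station + sequence number
        self.compensating_mode = compensating_mode
        self.joeys = [JoeyTransaction(self.id, 1)]

    def hop(self):
        """Handoff to a new base station: the old JT commits independently
        of the new one (split transaction), and a new JT is created."""
        self.joeys[-1].commit()
        self.joeys.append(JoeyTransaction(self.id, len(self.joeys) + 1))

    def fail_current_joey(self):
        """In Compensating Mode, a JT failure undoes the whole KT by
        compensating the previously committed JTs."""
        if self.compensating_mode:
            for jt in self.joeys[:-1]:
                if jt.committed:
                    print(f"compensate {jt.id}")
        print(f"abort {self.joeys[-1].id}")


kt = KangarooTransaction("BS7", 1, compensating_mode=True)
kt.hop()                  # mobile unit moves to a new cell: JT1 commits, JT2 starts
kt.fail_current_joey()    # failure of JT2 -> compensate JT1, abort JT2
```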

Kangaroo Transaction model captures both the data and moving behavior of mobile transactions
and it is defined as a general model where it can provide mobile transaction processing in a
heterogeneous, multidatabase environment. The model can deal with both short-lived and long-
lived transactions. The mobile agents concept for multi-node processing of a KT can be used when
the user requests new subtransactions based on the results of earlier ones. This idea is discussed in
[6], which points out that there will then be no need to keep the status table and log files in the
DAAs of the base stations. In this case, the agent infrastructure must provide the movement of the
state information with the moving agent.

Clustering Model

A flexible, two-level consistency model has been introduced to deal with frequent,
predictable and varying disconnections. It is also pointed out that maintaining data consistency
over all distributed sites imposes unbearable overheads on mobile computing, so a more flexible
open-nested model is proposed. The model is based on grouping semantically related or closely
located data together to form a cluster. Data are stored or cached at a mobile host (MH) to support
its autonomous operations during disconnections. A fully distributed environment is assumed
where users submit transactions from both mobile and fixed terminals. Transactions may involve
both remote data and data stored locally at the user’s device.

The items of a database are partitioned into clusters and they are the units of consistency in that all
data items inside a cluster are required to be fully consistent, while data items residing at different
clusters may exhibit bounded inconsistencies. Clustering may be constructed depending on the
physical location of data. By using this locality definition, data located at the same, neighbor, or
strongly connected hosts may be considered to belong to the same cluster, while data residing at
disconnected or remote hosts may be regarded as belonging to separate clusters. In this way, a
dynamic cluster configuration will be created.

It is also stated that the nature of voluntary disconnection can be used in defining clusters.
Therefore, clusters of data may be explicitly created or merged on a probable disconnection or
connection of the associated mobile host. Also, the movement of the mobile host changes its place
in the clustering; when it enters a new cell, it can change its cluster as well.

On the other hand, clusters of data may be defined by using the semantics of data such as the
location data or by defining a user profile. Location data, which represent the address of a mobile
host, are fast changing data replicated over many sites. These data are often imprecise, since
updating all their copies creates overhead and there may be no need to provide consistency for
these kinds of data. On the other hand, by defining user profiles for the cluster creation, it may be
possible to differentiate users based on the requirements of their data and applications. For
example, data that are most often accessed by some user or data that are somewhat private to a
user can be considered to belong to the same cluster independent of their location or semantics.
The model requires full consistency for all data inside a cluster, but only degrees of consistency
for replicated data at different clusters. The degree of consistency may vary depending on the
network bandwidth available among clusters, allowing bounded deviations. This
will provide applications with the capability to adapt to the currently available bandwidth,
providing the user with data of variable level of detail or quality. For example, in the instance of a
cooperative editing application, the application can display only one chapter or older versions of
chapters of the book under weak network connections and up-to-date copies of all chapters under
strong network connections.

The mobile database is seen as a set of data items which is partitioned to a set of clusters. Data
items are related by a number of restrictions called integrity constraints that express relationships
of data items that a database state must satisfy. Integrity constraints among data-items inside the
same cluster are called intra-cluster constraints and constraints among data items at different
clusters are called inter-cluster constraints. During disconnection or when connection is weak or
costly, the only data that the user can access may not satisfy inter-cluster constraints strictly. To
maximize local processing and reduce network access, the user is allowed to interact with locally
available (in-cluster) m-degree consistent data by using weak-read and weak-write operations.
These operations allow users to operate with the lack of strict consistency which can be tolerated
by the semantics of their applications. On the other hand, the standard read and write operations
are called strict read and strict write operations to differentiate them from weak operations.

Based on the ideas stated, two basic types of transaction are defined: weak and strict
transactions. As the names imply, weak transactions consist of only weak read and weak write
operations, and they access only data copies that belong to the same cluster and can be considered
local at that cluster. A weak read operation on a data item reads a locally available copy, which is the
value written by the last weak or strict write operation at that cluster. A weak write operation writes
a local copy and is not permanent unless it is committed in the merged network. Likewise, strict
transactions consist of only strict read and strict write operations. A strict read operation reads the
value of the data item written by the last strict write operation, where a strict write operation
writes one or more copies of the data item.
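
The behaviour of weak versus strict operations can be sketched as follows; cluster storage and the merge step are heavily simplified, and the names are illustrative.

```python
class Cluster:
    """Holds locally available copies of data items plus weak (locally committed) writes."""
    def __init__(self, committed):
        self.committed = dict(committed)   # globally committed values
        self.weak = {}                     # locally committed weak writes

    def weak_read(self, item):
        # Value written by the last weak or strict write at this cluster.
        return self.weak.get(item, self.committed.get(item))

    def weak_write(self, item, value):
        # Local copy only; becomes permanent on cluster merge (global commit).
        self.weak[item] = value

    def strict_read(self, item):
        # Value written by the last strict write.
        return self.committed.get(item)

    def strict_write(self, item, value):
        # Writes the globally consistent copy (requires connectivity).
        self.committed[item] = value

    def merge(self):
        # On reconnection, locally committed weak writes become global.
        self.committed.update(self.weak)
        self.weak.clear()


c = Cluster({"x": 10})
c.weak_write("x", 11)          # disconnected update, visible only locally
print(c.weak_read("x"))        # 11 (weak view)
print(c.strict_read("x"))      # 10 (strict view, unchanged before merge)
c.merge()
print(c.strict_read("x"))      # 11 after the clusters merge
```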

Weak transactions have two commit points, a local commit in the associated cluster and an implicit
global commit after cluster merging. Updates made by locally committed weak transactions are
only visible to other weak transactions in the same cluster, but not visible to strict transactions
before merging, when locally committed transactions become globally committed. It has been shown
how weak transactions can be made part of the concurrency controller, and criteria and graph-based
tests for the correctness of the created schedules have been developed.

The addition of weak operations to the database interface allows users to access locally (in-cluster)
consistent data by issuing weak transactions and globally consistent data by issuing strict
transactions. Weak operations support disconnected operation since a mobile device can operate
disconnected as long as applications are satisfied with local copies. Users can use weak
transactions to update mostly private data and strict transactions to update highly used common
data. Furthermore, by allowing applications to specify their consistency requirements, better
bandwidth utilization can be achieved.

MultiDatabase Transactions

The mobile host can play many roles in a distributed database environment. It may simply submit
operations to be executed on a server or an agent at the fixed network. How multidatabase
transactions could be submitted from mobile workstations has been examined. A framework for mobile
computing in a cooperative multidatabase processing environment and a global transactions
manager facility are also introduced.

Each mobile client is assumed to submit a transaction to a coordinating agent. Once the transaction
has been submitted, the coordinating agent schedules and coordinates its execution on behalf of
the mobile client. Mobile units may voluntarily disconnect from the network prior to having any
associated transactions completed. They aimed an architecture that satisfies the following :

 providing full-fledged transaction management framework so that the users and application
programs will be able to access data across multiple sites transparently,
 enhancing database concurrency and data availability through the adoption of a distributed
concurrency control and recovery mechanism that preserves local autonomy,
 implementing the concept extensibility to support various database systems in the
framework so that the components can cooperate with a relational or an object- oriented
database system,
 providing an environment where the proposed transaction processing component operates
independently and transparently of the local DBMS.
 incorporating the concept of mobile computing through the use of mobile workstations into
the model.

MDSTPM System Architecture

A multidatabase system (MDS) is defined as an integrated distributed database system consisting
of a number of autonomous component database management systems. Each of the underlying
component database systems is responsible for the management of transactions locally. To
facilitate the execution of global transactions, an additional layer of software must be implemented
which permits the scheduling and coordination of transactions across these heterogeneous database
management systems. The proposed Multidatabase Transaction Processing Manager (MDSTPM)
architecture combining mobile computing is shown in Figure 3.

The MDSTPM consists of the following components:

The Global Communication Manager (GCM) is responsible for the generation and management
of message queues within the local site. Additionally, it also communicates, delivers and
exchanges these messages with its peer sites and mobile hosts in the network.

The Global Transaction Manager (GTM) coordinates the submission of global subtransactions to
its relevant sites. The Global Transaction Manager Coordinator (GTMC) is the site where the
global transaction is initiated. All participating GTMs for that global transaction are known as
GTMPs. The GTM can be a Global Scheduling Submanager (GSS) or a Global Concurrency
Submanager (GCS). The GSS is responsible for the scheduling of global transactions and
subtransactions. The GCS is responsible for acquisition of necessary concurrency control
requirements needed for the successful execution of global transactions and subtransactions. The
GTM is responsible for the scheduling and commitment of global transactions while the Local
Transaction Manager (LTM) is responsible for the execution and recovery of transactions executed
locally.

The Global Recovery Manager (GRM) coordinates the commitment and recovery of global
transactions and subtransactions after a failure. It ensures that the effects of committed global
subtransactions are written to the underlying local database or none of the effects of aborted global
subtransactions are written at all. It also uses the write-ahead logging protocol so that the effects
on the database are written immediately without having to wait for the global subtransaction to
complete or commit.

Global Interface Manager (GIM) coordinates the submission of request/reply between the
MDSTPM and the local database manager which can be executing in a relational database system
or an object-oriented database system. This component provides extensibility function including
the translation of an SQL request to an object-oriented query language request.

The approach used for the management of mobile workstations and the global transactions they
submit is to have these mobile workstations become part of the MDS during their connections with
their respective coordinator nodes. Once a global transaction has been submitted, the coordinating
site can then schedule and coordinate the execution of the global transaction on behalf of the
mobile host. In this way, a mobile workstation may disconnect from the network without waiting
for the global transaction to complete. Also, the coordinating sites are assumed to be connected with
reliable communication networks which are less subject to failures.

An alternative mechanism to Remote Procedure Call (RPC) is proposed as Message and Queuing
Facility (MQF) for the implementation of the proposed approach. Request messages sent from a
mobile host to its coordinating site are handled asynchronously, allowing the mobile host to
disconnect itself. The coordinating node executes the messages on behalf of the mobile unit, and it
is possible to query the status of the global transactions from mobile hosts.

In the proposed MQF, for each mobile workstation there exists a message queue and a transaction
queue. Request, acknowledgment and information-type messages can be used, such as a request for
connection/reconnection, an acknowledgment of connection/reconnection to a mobile workstation,
or a query of message queue status. To manage the transactions submitted, a simple global
transaction queuing mechanism is proposed. This approach is based on the finite state machine
concept: a set of possible states and transitions can be clearly defined between the beginning and
ending state of the global transaction. For the implementation of this mechanism five transaction
sub-queues are used (input queue, allocate queue, active queue, suspend queue, output queue) to
manage global transactions/subtransactions submitted to local site by the mobile workstation.
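
The five-sub-queue mechanism can be sketched as a simple finite state machine over queues; the transition table below is our illustration of the idea, not the published specification.

```python
from collections import deque

# Sub-queues managing global transactions submitted by mobile workstations.
queues = {name: deque() for name in
          ("input", "allocate", "active", "suspend", "output")}

# Allowed state transitions (illustrative, not the published specification).
transitions = {
    "input":    ["allocate"],
    "allocate": ["active"],
    "active":   ["suspend", "output"],
    "suspend":  ["active"],
    "output":   [],
}

def submit(txn):
    queues["input"].append(txn)

def move(txn, src, dst):
    if dst not in transitions[src]:
        raise ValueError(f"illegal transition {src} -> {dst}")
    queues[src].remove(txn)
    queues[dst].append(txn)

submit("GT-1")
move("GT-1", "input", "allocate")
move("GT-1", "allocate", "active")
move("GT-1", "active", "suspend")    # e.g. the mobile host disconnects
move("GT-1", "suspend", "active")    # reconnection
move("GT-1", "active", "output")     # results queued for the mobile host
print({name: list(q) for name, q in queues.items()})
```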

It is also noted that, for a multidatabase to function correctly within this architecture, an MDSTPM
software component must be established at each site in order to provide the integration. On the
other hand, it has been pointed out that this model ignores important issues, including interactive
transactions that need input from the user and produce output, transactions that involve data stored
at mobile workstations, and mobile host migration, even though the model offers a practical
approach.

PRO-MOTION

A mobile transaction processing system, PRO-MOTION, has been developed with the aim
of migrating existing database applications and supporting the development of new database
applications involving mobile and wireless data access. PRO-MOTION is said to be a mobile
transaction processing system which supports disconnected transaction processing in a mobile
client-server environment.

The underlying transaction processing model of PRO-MOTION is the concept of nested- split
transactions. Nested split transactions are an example of open nesting, which relaxes the top-level
atomicity restriction of closed nested transactions where an open nested transaction allows its
partial results to be observed outside the transaction. Consequently, one of the main issues in
describing the local transaction processing on the mobile host is visibility: allowing new
transactions to see uncommitted changes (dirty data) may result in undesired dependencies and
cascading aborts. But since no updates on a disconnected MH can be incorporated in the server
database, subsequent transactions using the same data items normally could not proceed until
connection occurs and the mobile transaction commits. PRO-MOTION considers the entire mobile
sub-system as one extremely large, long-lived transaction which executes at the server with a
subtransaction executing at each MH. Each of these MH subtransactions, in turn, is the root of
another nested-split transaction. It is stated that, by making the results of a transaction visible as
soon as transaction begins to commit at the MH, it can provide additional transactions to progress
even though the data items involved have been modified by an active (i.e. non-committed)
transaction. In this way, local visibility and local commitment can reduce the blocking of
transactions during disconnection and minimize the probability of cascading aborts.

The PRO-MOTION infrastructure is shown in Figure 4. It is built on a generalized, multi-tier
client-server architecture with a mobile agent called compact agent, a stationary server front-end
called compact manager, and an intermediate array of mobility managers to help manage the flow
of updates and data between the other components of the system. Its fundamental building block
is the compact which functions as the basic unit of replication for caching, prefetching, and
hoarding.

A compact is defined as a satisfied request to cache data, with its obligations, restrictions and state
information. It represents an agreement between the database server and the mobile host where the
database server delegates control of some data to the MH to be used for local transaction
processing. The database server need not be aware of the operations executed by individual
transactions on the MH, but, rather, sees periodic updates to a compact for each of the data items
manipulated by the mobile transactions. Compacts are defined as objects encapsulating the cached
data, methods for the access of the cached data, current state information, consistency rules,
obligations and the interface methods. The main structure is shown in Figure 5.
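
Although PRO-MOTION's compacts are written in Java, their structure as described above can be sketched (purely for illustration, in Python) as an object carrying cached data, state, rules and a common interface:

```python
class Compact:
    """Illustrative sketch of a compact: cached data plus the methods,
    state, consistency rules and obligations that travel with it."""
    def __init__(self, data, consistency_rules=None, obligations=None):
        self.data = dict(data)                 # the cached data items
        self.state = "valid"                   # current state information
        self.consistency_rules = consistency_rules or []
        self.obligations = obligations or []   # e.g. an expiration deadline

    # --- common interface used by the compact agent ---
    def read(self, item):
        return self.data.get(item)

    def update(self, item, value):
        # Local update performed on behalf of a transaction on the MH;
        # the server later sees a periodic update to this compact.
        self.data[item] = value
        self.state = "dirty"

    def status(self):
        return self.state


# The compact agent records received compacts in its compact registry.
compact_registry = {}
compact_registry["orders"] = Compact({"order-17": "pending"})
compact_registry["orders"].update("order-17", "shipped")
print(compact_registry["orders"].status())   # dirty -> to be reconciled at the server
```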

The management of compacts is performed by the compact manager on the database server, and
the compact agent on each mobile host cooperatively. Compacts are obtained from the database
server on request when a data demand is created by the MH. If data is available to satisfy the request,
the database server creates a compact with the help of compact manager. The compact is then
recorded to the compact store and transmitted to the MH to provide the data and methods to satisfy
the needs of transactions executing on the MH. It is possible to transmit the missing or outdated
components of a compact which avoids the expensive transmission of already available compact
methods on the MH. Once the compact is received by the MH, it is recorded in the compact registry
which is used by the compact agent to track the location and status of all local compacts.

Each compact has a common interface which is used by the compact agent to manage the compacts
in the compact registry list and to perform updates submitted by transactions run by applications
executing on the MH. The implementation of a common interface simplifies the design of the
compact agent and guarantees minimum acceptable functionality of a specific compact instance.
Additionally, each compact can have specialized methods which support the particular type of data
or concurrency control methods specific to itself.

Compacts are managed by the compact agent which, like the cache management daemon in the
Coda file system, handles disconnections and manages storage on an MH. The compact agent monitors
activity and interacts with the user and applications to determine the candidates for caching. Unlike
the Coda daemon, the compact agent acts as a transaction manager for transactions executing on
the MH, which in turn makes it responsible for concurrency control, logging and recovery.

After a disconnection, while reconnecting to the database, the MH identifies a group of compacts
whose states reflect the updates of the locally committed transactions. The transactions in this
subset are split from uncommitted transactions and communicated to the compact manager, which
creates a split transaction for this group of updates. The compact manager then commits this split
transaction into the database making the updates visible to all transaction -fixed or mobile- waiting
for server commitment. All of these happen without releasing the locks held by the compact
manager root transaction.

Limiting all database access to the compact manager can provide a nested-split transaction
processing capability to the database server. If the compact manager is the only means to access
the database, every item in the database can be considered implicitly locked by the root transaction.
When an item is needed by a MH, the compact manager can read the data value and immediately
release any actual (i.e. server imposed) locks on the data item, knowing that it will not be accessed
by any transaction unknown to the compact manager. During the reconnection, the compact
manager locks the items necessary for the “split transaction”, writes the updates to the data items,
commits the “split transaction”, and re-reads and releases the altered items, maintaining the
implicit lock.

Compact agents perform hoarding when the mobile host is connected to the network and the
compact manager is storing compacts in preparation for an eventual disconnection. Hoarding
utilizes a list of resources required for processing transactions on the mobile host. The resource
list is built and maintained in the MH and compact agent adds items to the list by monitoring usage
of items by running applications. An expiration mechanism is used for matching the server-side
compacts, resynchronization and garbage collection. The compact agent also performs disconnected
processing when the mobile host is disconnected from the network and the compact manager is
processing transactions locally. The compact manager maintains an event log, which is used for
managing transaction processing, recovery, and resynchronization on the MH.

Local commitment is permitted to make the results visible to other transactions on the MH,
accepting the possibility of an eventual failure to commit at the server. Transactions which do not
have a local option will not commit locally until the updates have committed at the server. Because
more than one compact may be used in a single transaction, the commitment of a transaction is
performed using a two-phase commit protocol where all participants reside on the MH. On the
other hand, resynchronization occurs when the MH has reconnected to the network and the
compact agent is reconciling the updates committed during the disconnection with the fixed
database.

PRO-MOTION uses a ten-level scale to characterize the correctness of a transaction execution;
currently it is based on the degrees of isolation defined in the ANSI SQL standard. Compacts are
written in Java, and much of the code is maintained in the Java Virtual Machine and need not be
replicated in each compact. Simple compacts have been implemented, and studies are continuing on
designing a database server supporting compacts. It is claimed that PRO-MOTION offers many
advantages over other proposed systems, which rely on the application to enforce
consistency, whereas PRO-MOTION uses a data-centric approach.

Toggle Transactions

A similar approach to MDSTPM, using a layer of interface management software, is considered in
[4], where a transaction management technique called the Toggle Transaction Management
(TTM) technique is introduced. A Mobile Multidatabase System (MMDBS) is defined to be a
collection of autonomous databases connected to a fixed network together with a Mobile
Multidatabase Management System (MMDBMS). The MMDBMS is a set of software modules that
resides on the fixed network system. The respective Database Management System (DBMS) of each
independent database has complete control over its database, so the databases can differ in the data
models and transaction management mechanisms they use. Each local database
provides a service interface that specifies the operations accepted and the services provided to the
MMDBMS. Local transactions executed by the local users are transparent to the MMDBMS.
Global users, either static or mobile, are capable of accessing multiple databases by
submitting global transactions to the MMDBMS.

A global transaction is defined as consisting of a set of operations, each of which is a legal
operation accepted by some service interface. Any subset of operations of a global transaction that
access the same site may be executed as a single transaction with respect to that site and will form
a logical unit called a site-transaction. Site-transactions are executed under the authority of the
respective DBMS. As mobile users migrate or move their location to a new coverage area of
another Mobile Support Station (MSS), operations of a global transaction may be submitted from
different MSSs. Such transactions are referred to as migrating transactions.

It is assumed that there is no need to define integrity constraints on data items residing at different
sites. As each local DBMS ensures that the site-transactions executed by it do not violate any local
integrity constraints, global transactions satisfy the consistency property. Similarly, the Global
Transaction Manager (GTM) which manages the execution of global transactions, can rely on the
durability property of the local DBMS to ensure durability of committed global transactions. So,
it is noted that the GTM need only enforce the atomicity and isolation properties. In addition to
these the GTM of the MMDBMS should address disconnection and migrating transactions. The
interactive nature of global transactions, as well as disconnection and migration, prolongs the
execution time of global transactions, which can then be referred to as Long-Lived Transactions (LLTs).
The GTM is required to minimize the ill effects upon LLTs that can be caused by conflicts of these
transactions with others.

A transaction management technique that addresses the above issues has been proposed. In the Toggle
Transaction Management (TTM) technique, the global transaction manager is designed to consist of
two layers: the Global Coordinator layer and the Site Manager layer. The Global Coordinator layer consists
of Global Transaction Coordinators (GTCs) in each MSS and manages the overall execution and
migration of global transactions. The Site Manager layer consists of Site Transaction Managers
(STMs) in participating database sites and supervises the execution of vital or non-vital site-
transactions. Each global transaction is defined to have a data structure that contains the current
execution status of that transaction, and follow the user in migration from MSS to MSS. The main
communication framework is shown in Figure 6.

Global transactions are based on the Multi-Level transaction model in which the global transaction
consists of a set of compensatable transactions. Also, the vital site-transactions must succeed in
order for the global transaction to succeed. The abort of non-vital site-transactions does not force
the global transaction to be aborted. In this way, restrictions can be placed to enforce the atomicity
and isolation levels. Global transactions are initiated at some GTC component of the GTM. The
GTC submits the site-transactions to the STMs, handles disconnections and migration of the user,
logs responses that cannot be delivered to the disconnected user, and enforces the atomicity and
isolation properties.

Two new states are defined in TTM to support disconnected operations: Disconnected and
Suspended. In disconnection, the transactions are put into Disconnected state and execution is
allowed to continue. If the disconnection stems from a catastrophic failure, the transactions are put
in the Suspended state and execution is suspended. This way, needless aborts will be minimized.

In order to minimize the ill-effects of the extended execution time of mobile transactions, a global
transaction can state its intent to commit by executing a toggle operation. If the operation is
successful, the GTM guarantees that the transaction would not be aborted due to atomicity or
isolation violations unless the transaction is suspended. Whenever a transaction needs to commit
or to be toggled, the TTM technique executes the Partial Global Serialization Graph (PGSG)
commit algorithm to verify the atomicity and isolation properties. If the first verification, of
atomicity, fails, the transaction is aborted; otherwise, the isolation property is checked.

If the violation cannot be resolved, the transaction is aborted; otherwise, the commit or toggle
operation succeeds.
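
The commit/toggle decision described above can be sketched as follows; the three predicate functions are placeholders standing in for the PGSG-based tests.

```python
def handle_commit_or_toggle(txn, atomicity_holds, isolation_holds, resolve_violation):
    """Sketch of the commit/toggle decision: verify atomicity first, then
    isolation; try to resolve an isolation violation before aborting.
    The three callables stand in for the PGSG-based tests."""
    if not atomicity_holds(txn):
        return "abort"
    if isolation_holds(txn):
        return "commit/toggle succeeds"
    if resolve_violation(txn):
        return "commit/toggle succeeds"
    return "abort"

# Hypothetical outcome: atomicity holds, isolation is violated but resolvable.
print(handle_commit_or_toggle(
    "GT-9",
    atomicity_holds=lambda t: True,
    isolation_holds=lambda t: False,
    resolve_violation=lambda t: True,
))   # -> commit/toggle succeeds
```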

In the TTM technique, it is stated that concurrency is limited because all site-transactions that execute at
each site are forced to conflict with each other. The artificial conflicts generated by the algorithm
can be eliminated by exploiting semantic information about site-transactions. Each service interface
will need to provide conflict information on all operations accepted by that site. This information
will be used to generate conflicts between site-transactions that actually conflict with each other.

TEMPORAL DATABASE

A Temporal Database is a database with built-in support for handling time sensitive data.
Usually, databases store information only about current state, and not about past states. For
example, in an employee database, if the address or salary of a particular person changes, the database
gets updated and the old value is no longer there. However, for many applications, it is important to
maintain the past or historical values and the time at which the data was updated. That is, the
knowledge of evolution is required. That is where temporal databases are useful. They store
information about the past, present and future. Any data that is time dependent is called
temporal data, and such data is stored in temporal databases.

Temporal Databases store information about states of the real world across time. Temporal
Database is a database with built-in support for handling data involving time. It stores information
relating to past, present and future time of all events.

Examples Of Temporal Databases

 Healthcare Systems: Doctors need the patients’ health history for proper diagnosis.
Information like the time a vaccination was given or the exact time when fever goes high
etc.

 Insurance Systems: Information about claims, accident history, time when policies are in
effect needs to be maintained.

 Reservation Systems: Date and time of all reservations is important.

Temporal Aspects

There are two different aspects of time in temporal databases.

 Valid Time: Time period during which a fact is true in real world, provided to the system.

 Transaction Time: Time period during which a fact is stored in the database, based on
transaction serialization order and is the timestamp generated automatically by the system.

Temporal Relation

A Temporal Relation is one where each tuple has an associated time: valid time, transaction
time, or both.

 Uni-Temporal Relations: Has one axis of time, either Valid Time or Transaction Time.

 Bi-Temporal Relations: Has both axes of time – Valid time and Transaction time. It
includes Valid Start Time, Valid End Time, Transaction Start Time, Transaction End
Time.

Valid Time Example

Now let’s see an example of a person, John:

 John was born on April 3, 1992 in Chennai.

 His father registered his birth after three days on April 6, 1992.

 John did his entire schooling and college in Chennai.

 He got a job in Mumbai and shifted to Mumbai on June 21, 2015.

 He registered his change of address only on Jan 10, 2016.

John’s Data In Non-Temporal Database

In a non-temporal database, John’s address is entered as Chennai from 1992. When he registers
his new address in 2016, the database gets updated and the address field now shows his Mumbai
address. The previous Chennai address details will not be available. So, it will be difficult to find
out exactly when he was living in Chennai and when he moved to Mumbai.

Date Real world event Address

April 3, 1992 John is born

April 6, 1992 John’s father registered his birth Chennai

June 21, 2015 John gets a job Chennai

Jan 10, 2016 John registers his new address Mumbai

Non-temporal Database

Uni-Temporal Relation (Adding Valid Time To John’s Data)

To make the above example a temporal database, we’ll add the time aspect to the
database. First let’s add the valid time, which is the time during which a fact is true in the
real world. A valid time period may be in the past, span the current time, or occur in the future.

The valid time temporal database contents look like this:
Name, City, Valid From, Valid Till

In our example, John was born on 3rd April 1992. Even though his father registered his birth three
days later, the valid time entry would be 3rd April 1992. There are two entries for the valid time:
the Valid Start Time and the Valid End Time. So in this case 3rd April 1992 is the valid start
time. Since we do not know the valid end time, we enter it as infinity.

John’s father registers his birth on 6th April 1992, and a new database entry is made:
Person(John, Chennai, 3-Apr-1992, ∞).

Similarly, John registers his change of address to Mumbai on 10th Jan 2016. However, he has been living in
Mumbai since 21st June of the previous year, so his valid start time would be 21 June 2015.

On January 10, 2016 John reports his new address in Mumbai:


Person(John, Mumbai, 21-June-2015, ∞).
The original entry is updated.
Person(John, Chennai, 3-Apr-1992, 20-June-2015).
The table will look something like this with two additional entries:

Name City Valid From Valid Till

John Chennai April 3, 1992 June 20, 2015

John Mumbai June 21, 2015 ∞

Uni-temporal Database

Bi-Temporal Relation (John’s Data Using Both Valid And Transaction Time)

Next we’ll see a bi-temporal database, which includes both the valid time and the transaction time.
Transaction time records the time period during which an entry is stored in the database. So now each
row will have four time attributes: the valid from, valid till, transaction
entered and transaction superseded.

The database contents look like this:


Name, City, Valid From, Valid Till, Entered, Superseded
First, when John’s father records his birth, the valid start time would be 3rd April 1992, his actual
birth date, whereas the transaction entered time would be 6th April 1992.

John’s father registers his birth on 6th April 1992:


Person(John, Chennai, 3-Apr-1992, ∞, 6-Apr-1992, ∞).

Similarly, when John registers his change of address to Mumbai, a new entry is made. The valid
from time for this entry is 21st June 2015, the actual date from which he started living in Mumbai,
whereas the transaction entered time would be 10th January 2016. We do not know how long he will
be living in Mumbai, so the transaction end time and the valid end time would be infinity. At the
same time, the original entry is updated with the valid till time and the transaction superseded time.

On January 10, 2016 John reports his new address in Mumbai:


Person(John, Mumbai, 21-June-2015, ∞, 10-Jan-2016, ∞).
The original entry is updated.
Person(John, Chennai, 3-Apr-1992, 20-June-2015, 6-Apr-1992, 10-Jan-2016).
Now the database looks something like this:

Name City Valid From Valid Till Entered Superseded

John Chennai April 3, 1992 June 20, 2015 April 6, 1992 Jan 10, 2016

John Mumbai June 21, 2015 ∞ Jan 10, 2016 ∞

Bi-temporal Database

Advantages

The main advantage of bi-temporal relations is that they provide both historical and rollback
information. For example, you can answer a query on John’s history such as: Where did
John live in the year 2001? The result for this query is obtained from the valid time entries. The
transaction time entries are what make rollback information possible (a small sketch follows the list below).

 Historical Information – Valid Time.

 Rollback Information – Transaction Time.
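As a minimal sketch (not tied to any particular DBMS) of how the two bi-temporal rows above answer both kinds of query, the valid-time columns drive historical queries and the transaction-time columns drive rollback queries; the helper name address_on below is made up for illustration:

from datetime import date

INF = date.max  # stands in for the "∞" end time in the table above

# Bi-temporal rows for John, copied from the table above:
# (city, valid_from, valid_till, entered, superseded)
rows = [
    ("Chennai", date(1992, 4, 3),  date(2015, 6, 20), date(1992, 4, 6),  date(2016, 1, 10)),
    ("Mumbai",  date(2015, 6, 21), INF,               date(2016, 1, 10), INF),
]

def address_on(valid_day, as_known_on=None):
    """Historical query via valid time; optional rollback via transaction time."""
    for city, valid_from, valid_till, entered, superseded in rows:
        if not (valid_from <= valid_day <= valid_till):
            continue                      # fact was not true on that day
        if as_known_on is not None and not (entered <= as_known_on < superseded):
            continue                      # fact was not recorded in the database on that day
        return city
    return None

print(address_on(date(2001, 5, 1)))                     # Chennai  (historical query)
print(address_on(date(1992, 4, 4), date(1992, 4, 4)))   # None     (birth not yet registered)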

Products Using Temporal Databases

The popular products that use temporal databases include:

 Oracle.

 Microsoft SQL Server.

 IBM DB2.

DEDUCTIVE DATABASE

A deductive database is a database system that can make deductions (i.e. conclude additional
facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically
used to specify facts, rules and queries in deductive databases. Deductive databases have grown
out of the desire to combine logic programming with relational databases to construct systems that
support a powerful formalism and are still fast and able to deal with very large datasets. Deductive
databases are more expressive than relational databases but less expressive than logic
programming systems. In recent years, deductive databases such as Datalog have found new
application in data integration, information extraction, networking, program analysis, security,
and cloud computing.

Deductive databases reuse many concepts from logic programming; rules and facts specified in
the deductive database language Datalog look very similar to those in Prolog. However, there are
important differences between deductive databases and logic programming:

 Order sensitivity and procedurality: In Prolog, program execution depends on the order of
rules in the program and on the order of parts of rules; these properties are used by
programmers to build efficient programs. In database languages (like SQL or Datalog),
however, program execution is independent of the order of rules and facts.

 Special predicates: In Prolog, programmers can directly influence the procedural
evaluation of the program with special predicates such as the cut; these have no
correspondence in deductive databases.

 Function symbols: Logic programming languages allow function symbols to build up
complex symbols. This is not allowed in deductive databases.

 Set-oriented versus tuple-oriented processing: Deductive databases use set-oriented
processing, while logic programming languages concentrate on one tuple at a time.

1. Overview of Deductive Databases

In a deductive database system we typically specify rules through a declarative language—a
language in which we specify what to achieve rather than how to achieve it. An inference
engine (or deduction mechanism) within the system can deduce new facts from the database by
interpreting these rules. The model used for deductive databases is closely related to the relational
data model, and particularly to the domain relational calculus formalism (see Section 6.6). It is
also related to the field of logic programming and the Prolog language. The deductive database
work based on logic has used Prolog as a starting point. A variation of Prolog called Datalog is
used to define rules declaratively in conjunction with an existing set of relations, which are
themselves treated as literals in the language. Although the language structure of Datalog
resembles that of Prolog, its operational semantics—that is, how a Datalog program is executed—
is still different.
A deductive database uses two main types of specifications: facts and rules. Facts are specified in
a manner similar to the way relations are specified, except that it is not necessary to include the
attribute names. Recall that a tuple in a relation describes some real-world fact whose meaning is
partly determined by the attribute names. In a deductive database, the meaning of an attribute value
in a tuple is determined solely by its position within the tuple. Rules are somewhat similar to
relational views. They specify virtual relations that are not actually stored but that can be formed
from the facts by applying inference mechanisms based on the rule specifications. The main
difference between rules and views is that rules may involve recursion and hence may yield virtual
relations that cannot be defined in terms of basic relational views.
The evaluation of Prolog programs is based on a technique called backward chaining, which
involves a top-down evaluation of goals. In the deductive databases that use Datalog, attention has
been devoted to handling large volumes of data stored in a relational database. Hence, evaluation
techniques have been devised that resemble those for a bottom-up evaluation. Prolog suffers from
the limitation that the order of specification of facts and rules is significant in evaluation; moreover,
the order of literals (defined in Section 26.5.3) within a rule is significant. The execution
techniques for Datalog programs attempt to circumvent these problems.

2. Prolog/Datalog Notation

The notation used in Prolog/Datalog is based on providing predicates with unique names.
A predicate has an implicit meaning, which is suggested by the predicate name, and a fixed
number of arguments. If the arguments are all constant values, the predicate simply states that a
certain fact is true. If, on the other hand, the predicate has variables as arguments, it is either
considered as a query or as part of a rule or constraint. In our discussion, we adopt the Prolog
convention that all constant

values in a predicate are either numeric or character strings; they are represented as identifiers (or
names) that start with a lowercase letter, whereas variable names always start with an uppercase
letter.

Consider the example shown in Figure 26.11, which is based on the relational database in Figure
3.6, but in a much simplified form. There are three predicate names: supervise,
superior, and subordinate. The SUPERVISE predicate is defined via a set of facts, each of which
has two arguments: a supervisor name, followed by the name of a direct supervisee (subordinate)
of that supervisor. These facts correspond to the actual data that is stored in the database, and they
can be considered as constituting a set of tuples in a relation SUPERVISE with two attributes
whose schema is

SUPERVISE(Supervisor, Supervisee)

Thus, SUPERVISE(X, Y ) states the fact that X supervises Y. Notice the omission of the attribute
names in the Prolog notation. Attribute names are only represented by virtue of the position of
each argument in a predicate: the first argument represents the supervisor, and the second argument
represents a direct subordinate.

The other two predicate names are defined by rules. The main contributions of deductive databases
are the ability to specify recursive rules and to provide a framework for inferring new information
based on the specified rules. A rule is of the form head :– body, where :– is read as if and only if.
A rule usually has a single predicate to the left of the :– symbol—called the head or left-hand
side (LHS) or conclusion of the rule—and one or more predicates to the right of the :–
symbol—called the body or right-hand side (RHS) or premise(s) of the rule. A predicate with
constants as arguments is said to be ground; we also refer to it as an instantiated predicate. The
arguments of the predicates that appear in a rule typically include a number of variable symbols,
although predicates can also contain constants as arguments. A rule specifies that, if a particular
assignment or binding of constant values to the variables in the body (RHS predicates)
makes all the RHS predicates true, it also makes the head (LHS predicate) true by using the same
assignment of constant values to variables. Hence, a rule provides us with a way of generating new
facts that are instantiations of the head of the rule. These new facts are based on facts that already
exist, corresponding to the instantiations (or bindings) of predicates in the body of the rule. Notice
that by listing multiple predicates in the body of a rule we implicitly apply the logical
AND operator to these predicates. Hence, the commas between the RHS predicates may be read
as meaning and.

Consider the definition of the predicate SUPERIOR in Figure 26.11, whose first argument is an
employee name and whose second argument is an employee who is either a direct or
an indirect subordinate of the first employee. By indirect subordinate, we mean the subordinate of
some subordinate down to any number of levels. Thus SUPERIOR(X, Y) stands for the fact that X
is a superior of Y through direct or indirect supervision. We can write two rules that together specify
the meaning of the new predicate. The first rule under Rules in the figure states that for every value
of X and Y, if SUPERVISE(X, Y)—the rule body—is true, then SUPERIOR(X, Y)—the rule
head—is also true, since Y would be a direct subordinate of X (at one level down). This rule can
be used to generate all direct superior/subordinate relationships from the facts that define
the SUPERVISE predicate. The second recursive rule states that if SUPERVISE(X,
Z) and SUPERIOR(Z, Y) are both true, then SUPERIOR(X, Y) is also true. This is an example of
a recursive rule, where one of the rule body predicates in the RHS is the same as the rule head
predicate in the LHS. In general, the rule body defines a number of premises such that if they are
all true, we can deduce that the conclusion in the rule head is also true. Notice that if we have two
(or more) rules with the same head (LHS predicate), it is equivalent to saying that the predicate is
true (that is, that it can be instantiated) if either one of the bodies is true; hence, it is equivalent to
a logical OR operation. For example, if we have two rules X :– Y and X :– Z, they are equivalent to
a rule X :– Y OR Z. The latter form is not used in deductive systems, however, because it is not in
the standard form of rule, called a Horn clause, as we discuss in Section 26.5.4.

A Prolog system contains a number of built-in predicates that the system can interpret directly.
These typically include the equality comparison operator =(X, Y), which returns true if X and Y are
identical and can also be written as X=Y by using the standard infix notation. Other comparison
operators for numbers, such as <, <=, >, and >=, can be treated as binary predicates. Arithmetic
functions such as +, –, *, and / can be used as arguments in predicates in Prolog. In contrast,
Datalog (in its basic form) does not allow functions such as arithmetic operations as arguments;
indeed, this is one of the main differences between Prolog and Datalog. However, extensions to
Datalog have been proposed that do include functions.

A query typically involves a predicate symbol with some variable arguments, and its meaning
(or answer) is to deduce all the different constant combinations that, when bound (assigned) to
the variables, can make the predicate true. For example, the first query in Figure 26.11 requests
the names of all subordinates of james at any level. A different type of query, which has only
constant symbols as arguments, returns either a true or a false result, depending on whether the
arguments provided can be deduced from the facts and rules. For example, the second query in
Figure 26.11 returns true, since SUPERIOR(james, joyce) can be deduced.

3. Datalog Notation

In Datalog, as in other logic-based languages, a program is built from basic objects called atomic
formulas. It is customary to define the syntax of logic-based languages by describing the syntax
of atomic formulas and identifying how they can be combined to form a program. In Datalog,
atomic formulas are literals of the form p(a1, a2, ..., an), where p is the predicate name and n is
the number of arguments for predicate p. Different predicate symbols can have different numbers
of arguments, and the number of arguments n of predicate p is sometimes called
the arity or degree of p. The arguments can be either constant values or variable names.
As mentioned earlier, we use the convention that constant values either are numeric or start with
a lowercase character, whereas variable names always start with an uppercase character.

A number of built-in predicates are included in Datalog, which can also be used to construct
atomic formulas. The built-in predicates are of two main types: the binary comparison predicates
< (less), <= (less_or_equal), > (greater), and >= (greater_or_equal) over ordered domains; and the
comparison predicates = (equal) and /= (not_equal) over ordered or unordered domains. These can
be used as binary predicates with the same functional syntax as other predicates—for example, by
writing less(X, 3)—or they can be specified by using the customary infix notation X<3. Note that
because the domains of these predicates are potentially infinite, they should be used with care in
rule definitions. For example, the predicate greater(X, 3), if used alone, generates an infinite set of
values for X that satisfy the predicate (all integer numbers greater than 3).

A literal is either an atomic formula as defined earlier—called a positive literal—or an atomic
formula preceded by not. The latter is a negated atomic formula, called a negative literal. Datalog
programs can be considered to be a subset of the predicate calculus formulas, which are somewhat
similar to the formulas of the domain relational calculus (see Section 6.7). In Datalog, however,
these formulas are first converted into what is known as clausal form before they are expressed
in Datalog, and only formulas given in a restricted clausal form, called Horn clauses, can be used
in Datalog.

4. Clausal Form and Horn Clauses

Recall from Section 6.6 that a formula in the relational calculus is a condition that includes
predicates called atoms (based on relation names). Additionally, a formula can have quantifiers—
namely, the universal quantifier (for all) and the existential quantifier (there exists). In clausal
form, a formula must be transformed into another formula with the following characteristics:

· All variables in the formula are universally quantified. Hence, it is not necessary to include
the universal quantifiers (for all) explicitly; the quantifiers are removed, and all variables in the
formula are implicitly quantified by the universal quantifier.

· In clausal form, the formula is made up of a number of clauses, where each clause is
composed of a number of literals connected by OR logical connectives only. Hence, each clause
is a disjunction of literals.

· The clauses themselves are connected by AND logical connectives only, to form a formula.
Hence, the clausal form of a formula is a conjunction of clauses.

It can be shown that any formula can be converted into clausal form. For our purposes, we are
mainly interested in the form of the individual clauses, each of which is a disjunction of literals.
Recall that literals can be positive literals or negative literals. Consider a clause of the form:

NOT(P1) OR NOT(P2) OR ... OR NOT(Pn) OR Q1 OR Q2 OR ... OR Qm (1)

This clause has n negative literals and m positive literals. Such a clause can be transformed into
the following equivalent logical formula:

P1 AND P2 AND ... AND Pn ⇒ Q1 OR Q2 OR ... OR Qm (2)

where ⇒ is the implies symbol. The formulas (1) and (2) are equivalent, meaning that their truth
values are always the same. This is the case because if all the Pi literals (i = 1, 2, ..., n) are true,
the formula (2) is true only if at least one of the Qi’s is true, which is the meaning of
the ⇒ (implies) symbol. For formula (1), if all the Pi literals (i = 1, 2, ..., n) are true, their negations
are all false; so in this case formula (1) is true only if at least one of the Qi’s is true.

In Datalog, rules are expressed as a restricted form of clauses called Horn clauses, in which a
clause can contain at most one positive literal. Hence, a Horn clause is either of the form

NOT(P1) OR NOT(P2) OR ... OR NOT(Pn) OR Q (3)

or of the form

NOT(P1) OR NOT(P2) OR ... OR NOT(Pn) (4)

The Horn clause in (3) can be transformed into the formula

P1 AND P2 AND ... AND Pn ⇒ Q (5)

which is written in Datalog as the following rule:

Q :– P1, P2, ..., Pn. (6)

The Horn clause in (4) can be transformed into

P1 AND P2 AND ... AND Pn ⇒ (an empty conclusion) (7)

which is written in Datalog as follows:

:– P1, P2, ..., Pn. (8)

A Datalog rule, as in (6), is hence a Horn clause, and its meaning, based on formula (5), is that if
the predicates P1 AND P2 AND ... AND Pn are all true for a particular binding to their variable
arguments, then Q is also true and can hence be inferred. The Datalog expression (8) can be
considered as an integrity constraint, where all the predicates must be true to satisfy the query.

In general, a query in Datalog consists of two components:

A Datalog program, which is a finite set of rules

A literal P(X1, X2, ..., Xn), where each Xi is a variable or a constant

A Prolog or Datalog system has an internal inference engine that can be used to process and
compute the results of such queries. Prolog inference engines typically return one result to the
query (that is, one set of values for the variables in the query) at a time and must be prompted to
return additional results. On the contrary, Datalog returns results set-at-a-time.

5. Interpretations of Rules

There are two main alternatives for interpreting the theoretical meaning of rules: proof-
theoretic and model-theoretic. In practical systems, the inference mechanism within a system
defines the exact interpretation, which may not coincide with either of the two theoretical
interpretations. The inference mechanism is a computational procedure and hence provides a
computational interpretation of the meaning of rules. In this section, first we discuss the two
theoretical interpretations. Then we briefly discuss inference mechanisms as a way of defining the
meaning of rules.

In the proof-theoretic interpretation of rules, we consider the facts and rules to be true statements,
or axioms. Ground axioms contain no variables. The facts are ground axioms that are given to be
true. Rules are called deductive axioms, since they can be used to deduce new facts. The deductive
axioms can be used to construct proofs that derive new facts from existing facts. For example,
Figure 26.12 shows how to prove the fact SUPERIOR(james, ahmad) from the rules and facts
given in Figure 26.11. The proof-theoretic interpretation gives us a procedural or computational
approach for computing an answer to the Datalog query. The process of proving whether a certain
fact (theorem) holds is known as theorem proving.
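Although Figure 26.12 is not reproduced in these notes, the proof can be reconstructed from the SUPERVISE facts and the two SUPERIOR rules listed in the interpretation further below, roughly as follows:

1. SUPERVISE(jennifer, ahmad) is a ground axiom (a stored fact).
2. SUPERIOR(jennifer, ahmad) follows from 1 by the rule SUPERIOR(X, Y) :– SUPERVISE(X, Y).
3. SUPERVISE(james, jennifer) is a ground axiom.
4. SUPERIOR(james, ahmad) follows from 3 and 2 by the recursive rule SUPERIOR(X, Y) :– SUPERVISE(X, Z), SUPERIOR(Z, Y).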

The second type of interpretation is called the model-theoretic interpretation. Here, given a finite
or an infinite domain of constant values, we assign to a predicate every possible combination of
values as arguments. We must then determine whether the predicate is true or false. In general, it
is sufficient to specify the combinations of arguments that make the predicate true, and to state
that all other combinations make the predicate false. If this is done for every predicate, it is called
an interpretation of the set of predicates. For example, consider the interpretation shown in
Figure 26.13 for the predicates SUPERVISE and SUPERIOR. This interpretation assigns a truth
value (true or false) to every possible combination of argument values (from a finite domain) for
the two predicates.

An interpretation is called a model for a specific set of rules if those rules are always true under
that interpretation; that is, for any values assigned to the variables in the rules, the head of the rules
is true when we substitute the truth values assigned to the predicates in the body of the rule by that
interpretation. Hence, whenever a particular substitution (binding) to the variables in the rules is
applied, if all the predicates in the body of a rule are true under the interpretation, the predicate in
the head of the rule must also be true. The interpretation shown in Figure 26.13 is a model for the
two rules shown, since it can never cause the rules to be violated. Notice that a rule is violated if a
particular binding of constants to the variables makes all the predicates in the rule body true but
makes the predicate in the rule head false. For example, if SUPERVISE(a, b) and SUPERIOR(b,
c) are both true under some interpretation, but SUPERIOR(a, c) is not true, the interpretation can-
not be a model for the recursive rule:

SUPERIOR(X, Y) :– SUPERVISE(X, Z), SUPERIOR(Z, Y)

In the model-theoretic approach, the meaning of the rules is established by providing a model for
these rules. A model is called a minimal model for a set of rules if we cannot change any fact from
true to false and still get a model for these rules. For example, consider the interpretation in Figure
26.13, and assume that the SUPERVISE predicate is defined by a set of known facts, whereas
the SUPERIOR predicate is defined as an interpretation (model) for the rules. Suppose that we add
the predicate SUPERIOR(james, bob) to the true predicates. This remains a model for the rules
shown, but it is not a minimal model, since changing the truth value of SUPERIOR(james, bob)
from true to false still provides us with a model for the rules. The model shown in Figure 26.13 is
the minimal model for the set of facts that are defined by the SUPERVISE predicate.

Rules
SUPERIOR(X, Y ) :– SUPERVISE(X, Y ).
SUPERIOR(X, Y ) :– SUPERVISE(X, Z ), SUPERIOR(Z, Y ).
Interpretation
Known Facts:
SUPERVISE(franklin, john) is true.
SUPERVISE(franklin, ramesh) is true.
SUPERVISE(franklin, joyce) is true.
SUPERVISE(jennifer, alicia) is true.
SUPERVISE(jennifer, ahmad) is true.
SUPERVISE(james, franklin) is true.
SUPERVISE(james, jennifer) is true.
SUPERVISE(X, Y ) is false for all other possible (X, Y ) combinations
Derived Facts:
SUPERIOR(franklin, john) is true.
SUPERIOR(franklin, ramesh) is true.
SUPERIOR(franklin, joyce) is true.
SUPERIOR(jennifer, alicia) is true.
SUPERIOR(jennifer, ahmad) is true.
SUPERIOR(james, franklin) is true.
SUPERIOR(james, jennifer) is true.
SUPERIOR(james, john) is true.
SUPERIOR(james, ramesh) is true.
SUPERIOR(james, joyce) is true.
SUPERIOR(james, alicia) is true.
SUPERIOR(james, ahmad) is true.
SUPERIOR(X, Y ) is false for all other possible (X, Y ) combinations
Figure 26.13 An interpretation that is a minimal model.

In general, the minimal model that corresponds to a given set of facts in the model-theoretic
interpretation should be the same as the facts generated by the proof-theoretic interpretation for
the same original set of ground and deductive axioms. However, this is generally true only for
rules with a simple structure. Once we allow negation in the specification of rules, the
correspondence between interpretations does not hold. In fact, with negation, numerous minimal
models are possible for a given set of facts.

A third approach to interpreting the meaning of rules involves defining an inference mechanism
that is used by the system to deduce facts from the rules. This inference mechanism would define
a computational interpretation to the meaning of the rules. The Prolog logic programming
language uses its inference mechanism to define the meaning of the rules and facts in a Prolog
program. Not all Prolog programs correspond to the proof-theoretic or model-theoretic
interpretations; it depends on the type of rules in the program. However, for many simple Prolog
programs, the Prolog inference mechanism infers the facts that correspond either to the proof-
theoretic interpretation or to a minimal model under the model-theoretic interpretation.
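The derived SUPERIOR facts in the interpretation above can also be computed mechanically by applying the two rules repeatedly until no new facts appear (a naive bottom-up, fixpoint computation). A minimal sketch in Python, using the SUPERVISE facts listed in Figure 26.13 (the code itself is an illustration, not part of the textbook material):

# Naive bottom-up (fixpoint) evaluation of the two SUPERIOR rules.
supervise = {
    ("franklin", "john"), ("franklin", "ramesh"), ("franklin", "joyce"),
    ("jennifer", "alicia"), ("jennifer", "ahmad"),
    ("james", "franklin"), ("james", "jennifer"),
}

superior = set(supervise)      # rule 1: SUPERIOR(X, Y) :- SUPERVISE(X, Y)
while True:
    # rule 2: SUPERIOR(X, Y) :- SUPERVISE(X, Z), SUPERIOR(Z, Y)
    derived = {(x, y) for (x, z) in supervise for (z2, y) in superior if z == z2}
    if derived <= superior:    # no new facts: the minimal model has been reached
        break
    superior |= derived

print(sorted(superior))        # matches the derived SUPERIOR facts listed above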

6. Datalog Programs and Their Safety

There are two main methods of defining the truth values of predicates in actual Datalog
programs. Fact-defined predicates (or relations) are defined by listing all the combinations of
values (the tuples) that make the predicate true. These correspond to base relations whose contents
are stored in a database system. Figure 26.14 shows the fact-defined
predicates EMPLOYEE, MALE, FEMALE, DEPARTMENT, SUPERVISE, PROJECT,
and WORKS_ON, which correspond to part of the relational database shown in Figure 3.6. Rule-
defined predicates (or views) are defined by being the head (LHS) of one or more Datalog rules;
they correspond to virtual relations whose contents can be inferred by the inference engine. Figure
26.15 shows a number of rule-defined predicates.
A program or a rule is said to be safe if it generates a finite set of facts. The general theoretical
problem of determining whether a set of rules is safe is undecidable. However, one can determine
the safety of restricted forms of rules. For example, the rules shown in Figure 26.16 are safe. One
situation where we get unsafe rules that can generate an infinite number of facts arises when one
of the variables in the rule can range over an infinite domain of values, and that variable is not
limited to ranging over a finite relation. For example, consider the following rule:

BIG_SALARY(Y ) :– Y>60000

Here, we can get an infinite result if Y ranges over all possible integers. But suppose that we change
the rule as follows:

BIG_SALARY(Y ) :– EMPLOYEE(X), Salary(X, Y ), Y>60000


In the second rule, the result is not infinite, since the values that Y can be bound to are now
restricted to values that are the salary of some employee in the database— presumably, a finite set
of values. We can also rewrite the rule as follows:

BIG_SALARY(Y ) :– Y>60000, EMPLOYEE(X ), Salary(X, Y )

In this case, the rule is still theoretically safe. However, in Prolog or any other system that uses a
top-down, depth-first inference mechanism, the rule creates an infinite loop, since we first search
for a value for Y and then check whether it is a salary of an employee. The result is generation of
an infinite number of Y values, even though these, after a certain point, cannot lead to a set of true
RHS predicates. One definition of Datalog considers both rules to be safe, since it does not depend
on a particular inference mechanism. Nonetheless, it is generally advisable to write such a rule in
the safest form, with the predicates that restrict possible bindings of variables placed first. As
another example of an unsafe rule, consider the following rule:

HAS_SOMETHING(X, Y ) :– EMPLOYEE(X )

REL_ONE(A, B, C).
REL_TWO(D, E, F).
REL_THREE(G, H, I, J).

SELECT_ONE_A_EQ_C(X, Y, Z) :– REL_ONE(X, Y, Z), X=c.
SELECT_ONE_B_LESS_5(X, Y, Z) :– REL_ONE(X, Y, Z), Y<5.
SELECT_ONE_A_EQ_C_AND_B_LESS_5(X, Y, Z) :– REL_ONE(X, Y, Z), X=c, Y<5.
SELECT_ONE_A_EQ_C_OR_B_LESS_5(X, Y, Z) :– REL_ONE(X, Y, Z), X=c.
SELECT_ONE_A_EQ_C_OR_B_LESS_5(X, Y, Z) :– REL_ONE(X, Y, Z), Y<5.
PROJECT_THREE_ON_G_H(W, X) :– REL_THREE(W, X, Y, Z).
UNION_ONE_TWO(X, Y, Z) :– REL_ONE(X, Y, Z).
UNION_ONE_TWO(X, Y, Z) :– REL_TWO(X, Y, Z).
INTERSECT_ONE_TWO(X, Y, Z) :– REL_ONE(X, Y, Z), REL_TWO(X, Y, Z).
DIFFERENCE_TWO_ONE(X, Y, Z) :– REL_TWO(X, Y, Z), NOT(REL_ONE(X, Y, Z)).
CART_PROD_ONE_THREE(T, U, V, W, X, Y, Z) :– REL_ONE(T, U, V), REL_THREE(W, X, Y, Z).
NATURAL_JOIN_ONE_THREE_C_EQ_G(U, V, W, X, Y, Z) :– REL_ONE(U, V, W), REL_THREE(W, X, Y, Z).

Figure 26.16 Predicates for illustrating relational operations.

Here, an infinite number of Y values can again be generated, since the variable Y appears only in
the head of the rule and hence is not limited to a finite set of values. To define safe rules more
formally, we use the concept of a limited variable. A variable X is limited in a rule if (1) it appears
in a regular (not built-in) predicate in the body of the rule; (2) it appears in a predicate of the
form X=c or c=X or (c1<=X and X<=c2) in the rule body, where c, c1, and c2 are constant values;
or (3) it appears in a predicate of the form X=Y or Y=X in the rule body, where Y is a limited
variable. A rule is said to be safe if all its variables are limited.

7. Use of Relational Operations

It is straightforward to specify many operations of the relational algebra in the form of Datalog
rules that define the result of applying these operations on the database relations (fact predicates).
This means that relational queries and views can easily be specified in Datalog. The additional
power that Datalog provides is in the specification of recursive queries, and views based on
recursive queries. In this section, we show how some of the standard relational operations can be
specified as Datalog rules. Our examples will use the base relations (fact-defined
predicates) REL_ONE, REL_TWO, and REL_THREE, whose schemas are shown in Figure
26.16. In Datalog, we do not need to specify the attribute names as in Figure 26.16; rather, the
arity (degree) of each predicate is the important aspect. In a practical system, the domain (data
type) of each attribute is also important for operations such as UNION, INTERSECTION,
and JOIN, and we assume that the attribute types are compatible for the various operations, as
discussed in Chapter 3.

Figure 26.16 illustrates a number of basic relational operations. Notice that if the Datalog model
is based on the relational model and hence assumes that predicates (fact relations and query results)
specify sets of tuples, duplicate tuples in the same predicate are automatically eliminated. This
may or may not be true, depending on the Datalog inference engine. However, it is
definitely not the case in Prolog, so any of the rules in Figure 26.16 that involve duplicate
elimination are not correct for Prolog. For example, if we want to specify Prolog rules for
the UNION operation with duplicate elimination, we must rewrite them as follows:

UNION_ONE_TWO(X, Y, Z) :– REL_ONE(X, Y, Z).

UNION_ONE_TWO(X, Y, Z) :– REL_TWO(X, Y, Z), NOT(REL_ONE(X, Y, Z)).


However, the rules shown in Figure 26.16 should work for Datalog, if duplicates are automatically
eliminated. Similarly, the rules for the PROJECT operation shown in Figure 26.16 should work
for Datalog in this case, but they are not correct for Prolog, since duplicates would appear in the
latter case.
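As a small illustration (with made-up tuples, not from the textbook) of how the UNION_ONE_TWO, INTERSECT_ONE_TWO and DIFFERENCE_TWO_ONE rules of Figure 26.16 behave like set operations over the stored fact relations:

# Hypothetical fact sets for REL_ONE and REL_TWO (both of arity 3)
rel_one = {(1, 2, 3), (4, 5, 6)}
rel_two = {(4, 5, 6), (7, 8, 9)}

union_one_two      = rel_one | rel_two   # the two UNION_ONE_TWO rules (duplicates eliminated)
intersect_one_two  = rel_one & rel_two   # INTERSECT_ONE_TWO rule
difference_two_one = rel_two - rel_one   # DIFFERENCE_TWO_ONE rule

print(sorted(union_one_two))        # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(sorted(intersect_one_two))    # [(4, 5, 6)]
print(sorted(difference_two_one))   # [(7, 8, 9)]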

8. Evaluation of Nonrecursive Datalog Queries

In order to use Datalog as a deductive database system, it is appropriate to define an inference
mechanism based on relational database query processing concepts. The inherent strategy involves
a bottom-up evaluation, starting with base relations; the order of operations is kept flexible and
subject to query optimization. In this section we discuss an inference mechanism based on
relational operations that can be applied to nonrecursive Datalog queries. We use the fact and rule
base shown in Figures 26.14 and 26.15 to illustrate our discussion.

If a query involves only fact-defined predicates, the inference becomes one of searching among
the facts for the query result. For example, a query such as

DEPARTMENT(X, Research)?

is a selection of all employee names X who work for the Research department. In relational
algebra, it is the query:

π$1 (σ$2 = “Research” (DEPARTMENT))

which can be answered by searching through the fact-defined predicate department(X, Y). The
query involves relational SELECT and PROJECT operations on a base relation, and it can be
handled by the database query processing and optimization techniques discussed in Chapter 19.
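A small sketch of this SELECT-then-PROJECT evaluation over a hypothetical DEPARTMENT fact set (the tuples below are made up purely for illustration):

# Hypothetical DEPARTMENT facts: (employee_name, department_name)
department = [("john", "Research"), ("ramesh", "Research"), ("alicia", "Administration")]

# DEPARTMENT(X, Research)?  corresponds to  pi_$1( sigma_$2 = "Research" (DEPARTMENT) )
answer = [name for (name, dept) in department if dept == "Research"]
print(answer)   # ['john', 'ramesh']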

When a query involves rule-defined predicates, the inference mechanism must compute the result
based on the rule definitions. If a query is nonrecursive and involves a predicate p that appears as
the head of a rule p :– p1, p2, ..., pn, the strategy is first to compute the relations corresponding
to p1, p2, ..., pn and then to compute the relation corresponding to p. It is useful to keep track of
the dependency among the predicates of a deductive database in a predicate dependency graph.
Figure 26.17 shows the graph for the fact and rule predicates shown in Figures 26.14 and 26.15.
The dependency graph contains a node for each predicate. Whenever a predicate A is specified in
the body (RHS) of a rule, and the head (LHS) of that rule is the predicate B, we say that B depends
on A, and we draw a directed edge from A to B. This indicates that in order to compute the facts
for the predicate B (the rule head), we must first compute the facts for all the predicates A in the
rule body. If the dependency graph has no cycles, we call the rule set nonrecursive. If there is at
least one cycle, we call the rule set recursive. In Figure 26.17, there is one recursively defined
predicate—namely, SUPERIOR—which has a recursive edge pointing back to itself. Additionally,
because the predicate SUBORDINATE depends on SUPERIOR, it also requires recursion in computing
its result.
A query that includes only nonrecursive predicates is called a nonrecursive query. In this section
we discuss only inference mechanisms for nonrecursive queries. In Figure 26.17, any query that
does not involve the predicates SUBORDINATE or SUPERIOR is nonrecursive. In the predicate
dependency graph, the nodes corresponding to fact-defined predicates do not have any incoming
edges, since all fact-defined predicates have their facts stored in a database relation. The contents
of a fact-defined predicate can be computed by directly retrieving the tuples in the corresponding
database relation.

The main function of an inference mechanism is to compute the facts that correspond to query
predicates. This can be accomplished by generating a relational expression involving relational
operators such as SELECT, PROJECT, JOIN, UNION, and SET DIFFERENCE (with appropriate
provision for dealing with safety issues) that, when executed, provides the query result. The query
can then be executed by utilizing the internal query processing and optimization operations of a
relational database management system. Whenever the inference mechanism needs to compute
the fact set corresponding to a nonrecursive rule-defined predicate p, it first locates all the rules
that have p as their head. The idea is to compute the fact set for each such rule and then to apply
the UNION operation to the results, since UNION corresponds to a logical OR operation. The
dependency graph indicates all predicates q on which each p depends, and since we assume that
the predicate is nonrecursive, we can always determine a partial order among such predicates q.
Before computing the fact set for p, first we compute the fact sets for all predicates q on
which p depends, based on their partial order. For example, if a query involves the
predicate UNDER_40K_SUPERVISOR, we must first compute both SUPERVISOR
and OVER_40K_EMP. Since the latter two depend only on the fact-defined
predicates EMPLOYEE, SALARY, and SUPERVISE, they can be computed directly from
the stored database relations.
MULTIMEDIA DATABASE

A multimedia database is a collection of interrelated multimedia data that includes text, graphics
(sketches, drawings), images, animations, video, audio, etc., and typically involves vast amounts
of multisource multimedia data. The framework that manages different types of multimedia data
which can be stored, delivered and utilized in different ways is known as a multimedia database
management system. There are three classes of multimedia database: static media, dynamic
media and dimensional media.

Multimedia databases are used to store multimedia data such as images, animation, audio and
video along with text. This data is stored in the form of multiple file types like .txt (text),
.jpg (images), .swf (videos), .mp3 (audio), etc.

Content of Multimedia Database management system:


1. Media data – The actual data representing an object.
2. Media format data – Information such as sampling rate, resolution, encoding scheme etc.
about the format of the media data after it goes through the acquisition, processing and
encoding phase.
3. Media keyword data – Keywords description relating to the generation of data. It is also
known as content descriptive data. Example: date, time and place of recording.
4. Media feature data – Content dependent data such as the distribution of colors, kinds of
texture and different shapes present in data.
Types of multimedia applications based on data management characteristic are:
1. Repository applications – A Large amount of multimedia data as well as meta-data
(Media format date, Media keyword data, Media feature data) that is stored for retrieval
purpose, e.g., Repository of satellite images, engineering drawings, radiology scanned
pictures.
2. Presentation applications – They involve delivery of multimedia data subject to temporal
constraint. Optimal viewing or listening requires DBMS to deliver data at certain rate
offering the quality of service above a certain threshold. Here data is processed as it is
delivered. Example: Annotating of video and audio data, real-time editing analysis.
3. Collaborative work using multimedia information – It involves executing a complex
task by merging drawings, changing notifications. Example: Intelligent healthcare
network.
There are still many challenges to multimedia databases, some of which are:
1. Modelling – Working in this area can improve database versus information retrieval
techniques thus, documents constitute a specialized area and deserve special consideration.
2. Design – The conceptual, logical and physical design of multimedia databases has not yet
been addressed fully as performance and tuning issues at each level are far more complex
as they consist of a variety of formats like JPEG, GIF, PNG, MPEG which is not easy to
convert from one form to another.
3. Storage – Storage of multimedia database on any standard disk presents the problem of
representation, compression, mapping to device hierarchies, archiving and buffering
during input-output operations. In a DBMS, a "BLOB" (Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance – For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing may
alleviate some problems but such techniques are not yet fully developed. Apart from this
multimedia database consume a lot of processing time as well as bandwidth.
5. Queries and retrieval – For multimedia data like images, video and audio, accessing data
through query opens up many issues like efficient query formulation, query execution and
optimization which need to be worked upon.
Areas where multimedia database is applied are:
 Documents and record management: Industries and businesses that keep detailed
records and variety of documents. Example: Insurance claim record.
 Knowledge dissemination: Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example: Electronic books.
 Education and training: Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
 Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of
cities.
 Real-time control and monitoring: Coupled with active database technology,
multimedia presentation of information can be very effective means for monitoring and
controlling complex tasks Example: Manufacturing operation control.
UNIT III NOSQL DATABASES

NoSQL – CAP Theorem – Sharding - Document based – MongoDB Operation: Insert,


Update, Delete, Query, Indexing, Application, Replication, Sharding–Cassandra: Data Model,
Key Space, Table Operations, CRUD Operations, CQL Types – HIVE: Data types, Database
Operations, Partitioning – HiveQL – OrientDB Graph database – OrientDB Features.

CAP theorem
It is very important to understand the limitations of NoSQL databases. A NoSQL database cannot provide
consistency and high availability together. This was first expressed by Eric Brewer in the CAP
Theorem.
The CAP theorem, or Eric Brewer's theorem, states that we can only achieve at most two out of
three guarantees for a database: Consistency, Availability and Partition Tolerance.
Here, Consistency means that all nodes in the network see the same data at the same time.
Availability is a guarantee that every request receives a response about whether it was
successful or failed. However, it does not guarantee that a read request returns the most recent
write. The more requests a system can serve successfully, the better its availability.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network outage
in the data center and some of the computers are unreachable, the system still continues to
perform.

What Is Database Sharding?
Sharding is a method for distributing a single dataset across multiple databases, which can then
be stored on multiple machines. This allows larger datasets to be split into smaller chunks and
stored in multiple data nodes, increasing the total storage capacity of the system.
What is the difference between sharding and partitioning?
Sharding and partitioning are both about breaking up a large data set into smaller subsets. The
difference is that sharding implies the data is spread across multiple computers while
partitioning does not. Partitioning is about grouping subsets of data within a single database
instance.
What are the types of sharding?
Sharding Architectures
 Key Based Sharding. This technique is also known as hash-based sharding; the shard for a record
is chosen by hashing its shard key (see the sketch after this list).
 Horizontal or Range Based Sharding. In this method, we split the data based on the ranges of
a given value inherent in each entity.
 Vertical Sharding.
 Directory-Based Sharding.
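A minimal sketch of key based (hash based) sharding; the shard names and the example keys below are assumptions for illustration only:

import hashlib

SHARDS = ["shard0", "shard1", "shard2"]   # hypothetical data nodes

def shard_for(key: str) -> str:
    """Hash the shard key, then map the hash onto one of the shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user_1001"))   # the same key always routes to the same shard
print(shard_for("user_1002"))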
NoSQL

NoSQL Database is used to refer to a non-SQL or non-relational database.

It provides a mechanism for storage and retrieval of data other than the tabular relations model
used in relational databases. A NoSQL database doesn't use tables for storing data. It is
generally used to store big data and real-time web application data.

Databases can be divided into 3 types:

1. RDBMS (Relational Database Management System)


2. OLAP (Online Analytical Processing)
3. NoSQL (recently developed database)

Advantages of NoSQL

o It supports query language.


o It provides fast performance.
o It provides horizontal scalability.

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high


availability, and automatic scaling.
MongoDB is a document-oriented database. It is an open source product, developed and
supported by a company named 10gen.

MongoDB is a scalable, open source, high performance, document-oriented database." -


10gen
MongoDB was designed to work with commodity servers. Now it is used by companies of
all sizes, across all industries.

MongoDB Advantages

o MongoDB is schema-less. It is a document database in which one collection holds
different documents.
o Documents in a collection may differ from one another in the number of fields, content and size.
o The structure of a single object is clear in MongoDB.
o There are no complex joins in MongoDB.
o MongoDB provides the facility of deep queries because it supports powerful
dynamic queries on documents.
o It is very easy to scale.
o It uses internal memory for storing working sets, which is the reason for its fast
access.
Distinctive features of MongoDB

o Easy to use
o Light Weight
o Extremely fast compared to an RDBMS

Where MongoDB should be used


o Big and complex data
o Mobile and social infrastructure
o Content management and delivery
o User data management
o Data hub

MongoDB Create Database

There is no create database command in MongoDB. Actually, MongoDB does not provide any
explicit command to create a database.

How and when to create database

If there is no existing database, the following command is used to create a new database.

Syntax:

use DATABASE_NAME
We are going to create a database named "javatpointdb":

>use javatpointdb

To check the currently selected database, use the command db

>db
To check the database list, use the command show dbs:
>show dbs

Insert at least one document into it so that the database appears in the list:

MongoDB insert documents

In MongoDB, the db.collection.insert() method is used to add or insert new documents into a
collection in your database.

>db.movie.insert({"name":"javatpoint"})

MongoDB Drop Database

The dropDatabase command is used to drop a database. It also deletes the associated data
files. It operates on the current database.

Syntax:

db.dropDatabase()

This syntax will delete the selected database. In case you have not selected any database,
it will delete the default "test" database.
If you want to delete the database "javatpointdb", use the dropDatabase() command as
follows:
>db.dropDatabase()
MongoDB Create Collection
In MongoDB, db.createCollection(name, options) is used to create a collection. But usually you
don't need to create a collection; MongoDB creates collections automatically when you insert
some documents. It will be explained later. First see how to create a collection:

Syntax:

db.createCollection(name, options)
Name: a string; specifies the name of the collection to be created.

Options: a document; specifies the memory size and indexing options of the collection. It is
an optional parameter.
To check the created collection, use the command "show collections".

>show collections
How does MongoDB create collection automatically

MongoDB creates collections automatically when you insert some documents. For example:
Insert a document named seomount into a collection named SSSIT. The operation will create
the collection if the collection does not currently exist.

>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT
MongoDB update documents
In MongoDB, update() method is used to update or modify the existing documents of a
collection.

Syntax:

db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example

Consider an example which has a collection name javatpoint. Insert the following documents
in collection:

db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)

Update the existing course "java" into "android":


>db.javatpoint.update({'course':'java'},{$set:{'course':'android'}})
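The unit also covers querying. The notes above use the mongo shell; the same read can be done with find() in the shell (db.javatpoint.find({'course':'android'})), or, as a minimal sketch, with the Python driver pymongo, assuming a local mongod on the default port and the database and collection used above:

# A sketch using the pymongo driver (an assumption; not shown in the original notes).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["javatpointdb"]

# Equivalent of db.javatpoint.find({'course': 'android'}) in the mongo shell
for doc in db.javatpoint.find({"course": "android"}):
    print(doc)

# find_one() returns a single matching document, or None if nothing matches
print(db.javatpoint.find_one({"category": "Programming language"}))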
MongoDB insert multiple documents
If you want to insert multiple documents in a collection, you have to pass an array of
documents to the db.collection.insert() method.
Create an array of documents

Define a variable named Allcourses that hold an array of documents to insert.

var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 }, ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];

Inserts the documents

Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.

> db.javatpoint.insert( Allcourses );


MongoDB Delete documents

In MongoDB, the db.collection.remove() method is used to delete documents from a


collection. The remove() method works on two parameters.

1. Deletion criteria: With the use of its syntax you can remove the documents from the
collection.

2. JustOne: It removes only one document when set to true or 1.


Syntax:
db.collection_name.remove(DELETION_CRITERIA)

Remove all documents

If you want to remove all documents from a collection, pass an empty query document {} to
the remove() method. The remove() method does not remove the indexes.

db.javatpoint.remove({})
Indexing in MongoDB :

MongoDB uses indexing in order to make query processing more efficient. If there is no
indexing, then MongoDB must scan every document in the collection and retrieve only
those documents that match the query. Indexes are special data structures that store some
information related to the documents such that it becomes easy for MongoDB to find the
right data file. The indexes are ordered by the value of the field specified in the index.

Creating an Index :
MongoDB provides a method called createIndex() that allows a user to create an index.
Syntax:
db.COLLECTION_NAME.createIndex({KEY:1})

Example

db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax

db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() methods can only delete one index at a time. In order to delete (or drop)
multiple indexes from the collection, MongoDB provides the dropIndexes() method that
takes multiple indexes as its parameters.
Syntax:

db.NAME_OF_COLLECTION.dropIndexes({KEY1:1, KEY2:1})


Applications of MongoDB

These are some important features of MongoDB:

1. Support ad hoc queries:

In MongoDB, you can search by field or range query, and it also supports regular expression
searches.

2. Indexing:

You can index any field in a document.

3. Replication:

MongoDB supports Master Slave replication.


A master can perform reads and writes, and a slave copies data from the master and
can only be used for reads or backup (not writes).
4. Duplication of data:

MongoDB can run over multiple servers. The data is duplicated to keep the system up and
running in case of hardware failure.

5. Load balancing:

It has an automatic load balancing configuration because of data placed in shards.

6. Supports map reduce and aggregation tools.

7. Uses JavaScript instead of Procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

o JSON data model with dynamic schemas


o Auto-sharding for horizontal scalability
o Built in replication for high availability

Nowadays many companies use MongoDB to create new types of applications and to improve
performance and availability.

MongoDB Replication Methods

The MongoDB Replication methods are used to replicate the member to the replica sets.

rs.add(host, arbiterOnly)

The add method adds a member to the specified replica set. We must be connected to the
primary of the replica set to use this method. The connection to the shell will be terminated if
the method triggers an election for primary, for example if we try to add a new member
with a higher priority than the primary. In that case an error may be reflected by the mongo
shell even if the operation succeeds.
Example:

In the following example we will add a new secondary member with default vote.

rs.add( { host: "mongodbd4.example.net:27017" } )


MongoDB Sharding Commands

Sharding is a method to distribute the data across different machines. Sharding can be used by
MongoDB to support deployment on very huge scale data sets and high throughput
operations.

MongoDB sh.addShard(<url>) command

This command adds a shard replica set to a sharded cluster. Adding a shard affects the balance of
chunks among the shards of the cluster; the balancer starts transferring chunks to balance the
cluster. The <url> parameter has the form:

<replica_set>/<hostname><:port>,<hostname><:port>, ...

Syntax:

sh.addShard("<replica_set>/<hostname><:port>")

Example:

sh.addShard("repl0/mongodb3.example.net:27327")

Output:
It adds a shard, specifying the name of the replica set and the hostname of at least one
member of the replica set.
Cassandra

What is Cassandra?

Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database.
Cassandra is designed to handle huge amounts of data across many commodity servers,
providing high availability without a single point of failure.

Cassandra is a NoSQL database

A NoSQL database is a non-relational database. It is also called Not Only SQL. It is a database
that provides a mechanism to store and retrieve data other than the tabular relations used in
relational databases. These databases are schema-free, support easy replication, have simple
APIs, are eventually consistent, and can handle huge amounts of data.

Important Points of Cassandra

o Cassandra is a column-oriented database.


o Cassandra is scalable, consistent, and fault-tolerant.
o Cassandra was created at Facebook. It is totally different from relational database
management systems.
o Cassandra is being used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Cassandra Data Model

The data model in Cassandra is totally different from what we normally see in an RDBMS. Let's see how
Cassandra stores its data.
Cluster

Cassandra database is distributed over several machines that are operated together. The
outermost container is known as the Cluster which contains different nodes. Every node
contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the
nodes in a cluster, in a ring format, and assigns data to them.

Keyspace

Keyspace is the outermost container for data in Cassandra. Following are the basic attributes
of Keyspace in Cassandra:

o Replication factor: It specifies the number of machines in the cluster that will receive
copies of the same data.
o Replica placement strategy: It is a strategy which specifies how to place replicas in
the ring. There are three types of strategies, such as:

1) Simple strategy (rack-aware strategy)

2) old network topology strategy (rack-aware strategy)

3) network topology strategy (datacenter-shared strategy)

Cassandra Create Keyspace

Cassandra Query Language (CQL) facilitates developers to communicate with Cassandra.


The syntax of Cassandra query language is very similar to SQL.

What is Keyspace?

A keyspace is an object that is used to hold column families and user defined types. A keyspace
is like an RDBMS database, which contains column families, indexes, user defined types, data
center awareness, the strategy used in the keyspace, the replication factor, etc.

In Cassandra, "Create Keyspace" command is used to create keyspace.

Syntax:

CREATE KEYSPACE <identifier> WITH <properties>

Different components of Cassandra Keyspace

Strategy: There are two types of strategy declaration in Cassandra syntax:

o Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining replicas are
placed in a clockwise direction in the ring without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data
center. In this strategy, you have to provide the replication factor for each data center
separately.

Replication Factor: The replication factor is the number of replicas of data placed on different
nodes. A replication factor of more than two is good to attain no single point of failure, so 3 is
a good replication factor.

Example:

Let's take an example to create a keyspace named "javatpoint".

CREATE KEYSPACE javatpoint


WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};

Keyspace is created now.
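
As a hedged sketch for a multi-data-center setup (the keyspace and data center names below are illustrative and must match your cluster configuration), a keyspace can also be created with the Network Topology Strategy:

CREATE KEYSPACE javatpoint_multi
WITH replication = {'class':'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};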

Using a Keyspace

To use the created keyspace, you have to use the USE command.

Syntax:

USE <identifier>
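For example, to use the keyspace created above:

USE javatpoint;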
Cassandra Alter Keyspace

The "ALTER keyspace" command is used to alter the replication factor, strategy name and
durable writes properties in created keyspace in Cassandra.

Syntax:

ALTER KEYSPACE <identifier> WITH <properties>
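
For instance, assuming we want to raise the replication factor of the "javatpoint" keyspace (the new value below is only illustrative):

ALTER KEYSPACE javatpoint
WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 4};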


Cassandra Drop Keyspace

In Cassandra, "DROP Keyspace" command is used to drop keyspaces with all the data,
column families, user defined types and indexes from Cassandra.

Syntax:

DROP keyspace KeyspaceName ;
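
For example, to drop the keyspace created earlier:

DROP KEYSPACE javatpoint;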


Cassandra Create Table

In Cassandra, CREATE TABLE command is used to create a table. Here, column family is
used to store data just like table in RDBMS.

So, you can say that CREATE TABLE command is used to create a column family in
Cassandra.

Syntax:
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype
);

There are two types of primary keys:


Single primary key: Use the following syntax for single primary key.

Primary key (ColumnName)


Compound primary key: Use the following syntax for a compound primary key.

Primary key(ColumnName1,ColumnName2 . . .)

Example:

Let's take an example to demonstrate the CREATE TABLE command.

Here, we are using already created Keyspace "javatpoint".

CREATE TABLE student(


student_id int PRIMARY KEY,
student_name text,
student_city text,
student_fees varint,
student_phone varint
);

SELECT * FROM student;


Cassandra Alter Table

ALTER TABLE command is used to alter the table after creating it. You can use the ALTER
command to perform two types of operations:

o Add a column
o Drop a column

Syntax:

ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Adding a Column

You can add a column to a table by using the ALTER command. While adding a column, you
have to make sure that the column name does not conflict with existing column names and
that the table is not defined with the compact storage option.
Syntax:

ALTER TABLE table_name


ADD new_column_name datatype;
After using the following command:

ALTER TABLE student


ADD student_email text;

A new column is added. You can check it by using the SELECT command.

Dropping a Column

You can also drop an existing column from a table by using ALTER command. You should
check that the table is not defined with compact storage option before dropping a column
from a table.

Syntax:

ALTER TABLE table_name


DROP column_name;
Example:

After using the following command:

ALTER TABLE student


DROP student_email;

Now you can see that a column named "student_email" is dropped now.

If you want to drop the multiple columns, separate the columns name by ",".

Cassandra DROP table

DROP TABLE command is used to drop a table.

Syntax:
DROP TABLE <tablename>
Example:
After using the following command:
DROP TABLE student;

The table named "student" is dropped now. You can use DESCRIBE command to verify if
the table is deleted or not. Here the student table has been deleted; you will not find it in the
column families list.
Cassandra Truncate Table

TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the
table are deleted permanently.

Syntax:
TRUNCATE <tablename>

Example:
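
For instance, to remove all rows from the student table created earlier while keeping its schema:

TRUNCATE student;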

Cassandra Batch

In Cassandra, BATCH is used to execute multiple modification statements (insert, update,


delete) simultaneously. It is very useful when you have to update some columns as well as
delete some existing data.

Syntax:

BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
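
A hedged illustration using the student table from earlier (the values below are hypothetical):

BEGIN BATCH
INSERT INTO student (student_id, student_name, student_city, student_fees, student_phone) VALUES (6, 'Rahul', 'Delhi', 5000, 9876543210);
UPDATE student SET student_fees = 4000 WHERE student_id = 2;
DELETE student_phone FROM student WHERE student_id = 1;
APPLY BATCH;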

Use of WHERE Clause

The WHERE clause is used with the SELECT command to specify the rows from which we
have to fetch data.

Syntax:

SELECT * FROM <table name> WHERE <condition>;


SELECT * FROM student WHERE student_id=2;
Cassandra Update Data

UPDATE command is used to update data in a Cassandra table. If you see no result after
updating the data, it means the data was successfully updated; otherwise an error will be returned.
While updating data in a Cassandra table, keep the following points in mind:

o Where: The WHERE clause is used to select the row that you want to update.
o Set: The SET clause is used to set the new value.
o The WHERE clause must include all the columns composing the primary key.

Syntax:

UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
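For example, a hedged sketch against the student table (the new values are hypothetical):

UPDATE student
SET student_city = 'Mumbai',
student_fees = 6000
WHERE student_id = 2;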
Cassandra DELETE Data

DELETE command is used to delete data from a Cassandra table. You can delete an entire
row or selected columns by using this command.

Syntax:
DELETE FROM <identifier> WHERE <condition>;
Delete an entire row
To delete the entire row of the student_id "3", use the following command:
DELETE FROM student WHERE student_id=3;
Delete a specific column name
Example:
Delete the student_fees where student_id is 4.

DELETE student_fees FROM student WHERE student_id=4;


HAVING Clause in SQL

The HAVING clause places the condition in the groups defined by the GROUP BY clause in
the SELECT statement.

This SQL clause is implemented after the 'GROUP BY' clause in the 'SELECT' statement.

This clause is used in SQL because we cannot use the WHERE clause with the SQL
aggregate functions. Both WHERE and HAVING clauses are used for filtering the records in
SQL queries.

Syntax of HAVING clause in SQL


SELECT column_Name1, column_Name2, ......, aggregate_function_name(column_Name)
FROM table_Name
GROUP BY column_Name
HAVING condition;

For example, the following query groups employees by city and returns the total salary for each city:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City;

To show only those cities where the total salary exceeds 12000, we use the following query with the HAVING clause in SQL:

SELECT SUM(Emp_Salary), Emp_City FROM Employee GROUP BY Emp_City

HAVING SUM(Emp_Salary)>12000;
MIN Function with HAVING Clause:

If you want to show each department and the minimum salary in each department, you have
to write the following query:
SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
MAX Function with HAVING Clause:
SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
AVG Function with HAVING Clause:

SELECT AVG(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
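
These grouping queries can be filtered further with HAVING; for example (the threshold is only illustrative):

SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept
HAVING MIN(Emp_Salary) > 10000;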


SQL ORDER BY Clause
o Whenever we want to sort the records based on the columns stored in the tables of an
SQL database, we consider using the ORDER BY clause in SQL.
o The ORDER BY clause in SQL helps us sort the records based on a specific
column of a table. All the values stored in the column on which we apply the
ORDER BY clause are sorted, and the corresponding rows are displayed in that
sorted order.

Syntax to sort the records in ascending order:


SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName ASC;
Syntax to sort the records in descending order:
SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName DESC;
Syntax to sort the records in ascending order without using ASC keyword:
SELECT ColumnName1,...,ColumnNameN FROM TableName ORDER BY
ColumnName;
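For example, using the Employee table from the HAVING examples above:

SELECT Emp_Salary, Emp_City FROM Employee ORDER BY Emp_Salary DESC;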
Cassandra vs MongoDB

1) Cassandra is a high performance distributed database system, whereas MongoDB is a cross-platform document-oriented database system.

2) Cassandra is written in Java, whereas MongoDB is written in C++.

3) Cassandra stores data in tabular form, like the SQL format, whereas MongoDB stores data in JSON format.

4) Cassandra is licensed by Apache, whereas MongoDB is licensed under AGPL and its drivers under Apache.

5) Cassandra is mainly designed to handle large amounts of data across many commodity servers, whereas MongoDB is designed to deal with JSON-like documents and to make access by applications easier and faster.

6) Cassandra provides high availability with no single point of failure, whereas MongoDB is easy to administer in the case of failure.
Hive
What is HIVE?

Hive is a data warehouse system which is used to analyze structured data. It is built on the top
of Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries called HQL (Hive Query Language) which get
internally converted to MapReduce jobs.

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).

Features of Hive

o Hive is fast and scalable.


o It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce
or Spark jobs.
o It is capable of analyzing large datasets stored in HDFS.
o It uses indexing to accelerate queries.
o It can operate on compressed data stored in the Hadoop ecosystem.
o It supports user-defined functions (UDFs) where user can provide its functionality.

HIVE Data Types

Hive data types are categorized in numeric types, string types, misc types, and complex types.
A list of Hive data types is given below.

Integer Types

Type        Size                     Range

TINYINT     1-byte signed integer    -128 to 127

SMALLINT    2-byte signed integer    -32,768 to 32,767

INT         4-byte signed integer    -2,147,483,648 to 2,147,483,647

BIGINT      8-byte signed integer    -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Decimal Type

Type      Size      Description

FLOAT     4-byte    Single precision floating point number

DOUBLE    8-byte    Double precision floating point number

Date/Time Types

TIMESTAMP

o It supports traditional UNIX timestamp with optional nanosecond precision.


o As Integer numeric type, it is interpreted as UNIX timestamp in seconds.
o As Floating point numeric type, it is interpreted as UNIX timestamp in seconds with
decimal precision.
o As string, it follows java.sql.Timestamp format "YYYY-MM-DD
HH:MM:SS.fffffffff" (9 decimal place precision)

DATES

The DATE value is used to specify a particular year, month and day, in the form YYYY-MM-DD.
However, it does not provide the time of the day. The range of the DATE type lies between
0000-01-01 and 9999-12-31.

String Types

STRING

A string is a sequence of characters. Its values can be enclosed within single quotes (') or
double quotes (").

Varchar

The varchar is a variable-length type whose length lies between 1 and 65535; the length specifies
the maximum number of characters allowed in the character string.

CHAR

The char is a fixed-length type whose maximum length is fixed at 255.


Complex Type

Struct: It is similar to a C struct or an object where fields are accessed using the "dot" notation. Example: struct('James','Roy')

Map: It contains key-value tuples where the fields are accessed using array notation. Example: map('first','James','last','Roy')

Array: It is a collection of values of a similar type that are indexable using zero-based integers. Example: array('James','Roy')
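
As a hedged sketch (the table name and delimiters below are illustrative), these complex types can be declared in a Hive table as follows:

hive> create table employee_profile (
name string,
skills array<string>,
address struct<city:string, pin:int>,
phone map<string,bigint>
)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';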

Hive - Create Database

In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.

o Initially, we check the default database provided by Hive. To check the list of
existing databases, use the following command: -

hive> show databases;


o Let's create a new database by using the following command: -

hive> create database demo;

Hive - Drop Database

In this section, we will see various ways to drop the existing database.

drop the database by using the following command.

hive> drop database demo;


Hive - Create Table

In Hive, we can create a table by using the conventions similar to the SQL. It supports a wide
range of flexibility where the data files for tables are stored. It provides two types of table: -

o Internal table
o External table
Internal Table
The internal tables are also called managed tables as the lifecycle of their data is controlled by
Hive. By default, these tables are stored in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible
enough to share with other tools like Pig. If we try to drop an internal table, Hive deletes
both the table schema and the data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee

External Table

The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas the location keyword is
used to determine the location of the loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.

Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

we can use the following command to retrieve the data: -

select * from emplist;


Hive - Load Data

Once the internal table has been created, the next step is to load the data into it. So, in Hive,
we can easily load data from any file to the database.

o Let's load the data of the file into the database by using the following command: -

load data local inpath '/home/codegyani/hive/emp_details' into table demo.employee;


Hive - Drop Table

Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the
below steps to drop the table from the database.

o Let's check the list of existing databases by using the following command: -

hive> show databases;

hive> use demo;

hive> show tables;


hive> drop table new_employee;
Hive - Alter Table

In Hive, we can perform modifications in the existing table like changing the table name,
column name, comments, and table properties. It provides SQL like commands to alter the
table.

Rename a Table

If we want to change the name of an existing table, we can rename that table by using the
following signature: -

Alter table old_table_name rename to new_table_name;


o Now, change the name of the table by using the following command: -

Alter table emp rename to employee_data;


Adding column

In Hive, we can add one or more columns in an existing table by using the following
signature:

Alter table table_name add columns(column_name datatype);


o Now, add a new column to the table by using the following command: -

Alter table employee_data add columns (age int);

Change Column
In Hive, we can rename a column, change its type and position. Here, we are changing the
name of the column by using the following signature: -

Alter table table_name change old_column_name new_column_name datatype;


o Now, change the name of the column by using the following command: -

Alter table employee_data change name first_name string;


Delete or Replace Column

Hive allows us to delete one or more columns by replacing them with the new columns. Thus,
we cannot drop the column directly.

o Let's see the existing schema of the table.

o Now, drop a column from the table.

alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive

The partitioning in Hive means dividing the table into some parts based on the values of a
particular column like date, course, city or country. The advantage of partitioning is that since
the data is stored in slices, the query response time becomes faster.

The partitioning in Hive can be executed in two ways –


o Static partitioning
o Dynamic partitioning

Static Partitioning

In static or manual partitioning, it is required to pass the values of partitioned columns


manually while loading the data into the table. Hence, the data file doesn't contain the
partitioned columns.

Example of Static Partitioning

o First, select the database in which we want to create a table.

hive> use test;

o Create the table and provide the partitioned columns by using the following
command: -
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

hive> describe student;


o Load the data into the table and pass the values of partition columns with it by using
the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details1' into table student


partition(course= "java");

Here, we are partitioning the students of an institute based on courses.

o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -

hive> load data local inpath '/home/codegyani/hive/student_details2' into table student


partition(course= "hadoop");

hive> select * from student;


o Now, try to retrieve the data based on partitioned columns by using the following
command: -

hive> select * from student where course="java";


Dynamic Partitioning

In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not
required to pass the values of partitioned columns manually.

o First, select the database in which we want to create a table.

hive> use show;


o Enable the dynamic partition by using the following commands: -

hive> set hive.exec.dynamic.partition=true;


hive> set hive.exec.dynamic.partition.mode=nonstrict;

o Create a dummy table to store the data.


hive> create table stud_demo(id int, name string, age int, institute string, course string)
row format delimited
fields terminated by ',';

o Now, load the data into the table.


hive> load data local inpath '/home/codegyani/hive/student_details' into table stud_demo;

o Create a partition table by using the following command: -

hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';

o Now, insert the data of dummy table into the partition table.
hive> insert into student_part

partition(course)
select id, name, age, institute, course
from stud_demo;
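
As with static partitioning, the partitioned column can then be used to retrieve data; for example:

hive> select * from student_part where course= "java";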

OrientDB Graph database

What is Graph?

A graph is a pictorial representation of objects which are connected by some pair of links. A
graph contains two elements: Nodes (vertices) and relationships (edges).

What is Graph database

A graph database is a database which is used to model the data in the form of a graph. It stores
any kind of data using:

o Nodes
o Relationships
o Properties

Nodes: Nodes are the records/data in graph databases. Data is stored as properties and
properties are simple name/value pairs.

Relationships: It is used to connect nodes. It specifies how the nodes are related.

o Relationships always have direction.


o Relationships always have a type.

o Relationships form patterns of data.

Properties: Properties are named data values.

Popular Graph Databases


Neo4j is the most popular Graph Database. Other Graph Databases are

o Oracle NoSQL Database


o OrientDB
o HyperGraphDB
o GraphBase
o InfiniteGraph
o AllegroGraph etc.

Graph Database vs. RDBMS

Differences between Graph database and RDBMS:

1. In a graph database, data is stored in graphs. In an RDBMS, data is stored in tables.

2. In a graph database, there are nodes. In an RDBMS, there are rows.

3. In a graph database, there are properties and their values. In an RDBMS, there are columns and data.

4. In a graph database, the connected nodes are defined by relationships. In an RDBMS, constraints are used instead.

5. In a graph database, traversal is used instead of join. In an RDBMS, join is used instead of traversal.

MongoDB vs OrientDB

MongoDB and OrientDB contains many common features but the engines are fundamentally
different. MongoDB is pure Document database and OrientDB is a hybrid Document with
graph engine.

Relationships: MongoDB uses RDBMS-style JOINs to create relationships between entities; this has a high runtime cost and does not scale as the database grows. OrientDB embeds and connects documents like a relational database, using direct, super-fast links taken from the graph database world.

Fetch Plan: MongoDB requires costly JOIN operations. OrientDB easily returns a complete graph with interconnected documents.

Transactions: MongoDB doesn't support ACID transactions, but it supports atomic operations. OrientDB supports ACID transactions as well as atomic operations.

Query language: MongoDB has its own query language based on JSON. OrientDB's query language is built on SQL.

Indexes: MongoDB uses the B-Tree algorithm for all indexes. OrientDB supports three different indexing algorithms so that the user can achieve the best performance.

Storage engine: MongoDB uses a memory mapping technique. OrientDB uses storage engines named LOCAL and PLOCAL.

The following table illustrates the comparison between relational model, document model,
and OrientDB document model −

Relational Model Document Model OrientDB Document Model

Table Collection Class or Cluster

Row Document Document

Column Key/value pair Document field

Relationship Not available Link

The SQL Reference of the OrientDB database provides several commands to create, alter, and
drop databases.
Create database
The following statement is a basic syntax of Create Database command.

CREATE DATABASE <database-url> [<user> <password> <storage-type> [<db-type>]]

Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and
MEMORY.

Example

You can use the following command to create a local database named demo.

Orientdb> CREATE DATABASE PLOCAL:/opt/orientdb/databases/demo

If the database is successfully created, you will get the following output.
Database created successfully.

Current database is: plocal: /opt/orientdb/databases/demo

orientdb {db = demo}>


The following statement is the basic syntax of the Alter Database command.
ALTER DATABASE <attribute-name> <attribute-value>
Where <attribute-name> defines the attribute that you want to modify and <attribute-
value> defines the value you want to set for that attribute.

orientdb> ALTER DATABASE custom strictSQL = false

If the command is executed successfully, you will get the following output.
Database updated successfully

The following statement is the basic syntax of the Connect command.


CONNECT <database-url> <user> <password>
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
Example

We have already created a database named ‘demo’ in the previous chapters. In this example,
we will connect to that using the user admin.
You can use the following command to connect to demo database.

orientdb> CONNECT PLOCAL:/opt/orientdb/databases/demo admin admin

If it is successfully connected, you will get the following output −


Connecting to database [plocal:/opt/orientdb/databases/demo] with user 'admin'…OK
Orientdb {db = demo}>

The following statement is the basic syntax of the command used to list the databases available on the server.


LIST DATABASES

The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database who has the privilege to drop a database.
<server-user-password> − Password of the particular user.

In this example, we will use the same database named ‘demo’ that we created in an earlier
chapter. You can use the following command to drop a database demo.

orientdb {db = demo}> DROP DATABASE

If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully

INSERT RECORD

The following statement is the basic syntax of the Insert Record command.
INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]
Following are the details about the options in the above syntax.
SET − Defines each field along with the value.
CONTENT − Defines JSON data to set field values. This is optional.
RETURN − Defines the expression to return instead of number of records inserted. The most
common use cases are −
 @rid − Returns the Record ID of the new record.
 @this − Returns the entire new record.

FROM − Where you want to insert the record or a result set.


The following command is to insert the first record into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (01,'satish', 25)

The following command is to insert the second record into the Customer table.

INSERT INTO Customer SET id = 02, name = 'krishna', age = 26

The following command is to insert the next two records into the Customer table.

INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)
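
As a hedged sketch (the values are hypothetical), a record can also be inserted by supplying a JSON document with the CONTENT keyword from the syntax above:

INSERT INTO Customer CONTENT {"id": 6, "name": "mohan", "age": 30}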

SELECT COMMAND
The following statement is the basic syntax of the SELECT command.
SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]

[ NOCACHE ]
Following are the details about the options in the above syntax.
<Projections> − Indicates the data you want to extract from the query as a result records set.
FROM − Indicates the object to query. This can be a class, cluster, single Record ID, set of
Record IDs. You can specify all these objects as target.
WHERE − Specifies the condition to filter the result-set.
LET − Indicates the context variable which are used in projections, conditions or sub queries.
GROUP BY − Indicates the field to group the records.
ORDER BY − Indicates the filed to arrange a record in order.
UNWIND − Designates the field on which to unwind the collection of records.
SKIP − Defines the number of records you want to skip from the start of the result-set.
LIMIT − Indicates the maximum number of records in the result-set.
FETCHPLAN − Specifies the strategy defining how you want to fetch results.
TIMEOUT − Defines the maximum time in milliseconds for the query.
LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock
strategies.
PARALLEL − Executes the query against ‘x’ concurrent threads.
NOCACHE − Defines whether you want to use cache or not.

Example

Method 1 − You can use the following query to select all records from the Customer table.

orientdb {db = demo}> SELECT FROM Customer


orientdb {db = demo}> SELECT FROM Customer WHERE name LIKE 'k%'
orientdb {db = demo}> SELECT FROM Customer WHERE name.left(1) = 'k'
orientdb {db = demo}> SELECT id, name.toUpperCase() FROM Customer
orientdb {db = demo}> SELECT FROM Customer WHERE age in [25,29]
orientdb {db = demo}> SELECT FROM Customer WHERE ANY() LIKE '%sh%'
orientdb {db = demo}> SELECT FROM Customer ORDER BY age DESC
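
As a further hedged illustration, the SKIP and LIMIT options from the syntax above can be combined to page through the result set:

orientdb {db = demo}> SELECT FROM Customer ORDER BY age SKIP 1 LIMIT 2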

UPDATE QUERY

Update Record command is used to modify the value of a particular record. SET is the basic
command to update a particular field value.
The following statement is the basic syntax of the Update command.
UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|
MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]

Following are the details about the options in the above syntax.
SET − Defines the field to update.
INCREMENT − Increments the specified field value by the given value.
ADD − Adds the new item in the collection fields.
REMOVE − Removes an item from the collection field.
PUT − Puts an entry into map field.
CONTENT − Replaces the record content with JSON document content.
MERGE − Merges the record content with a JSON document.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify Default and Record.
UPSERT − Updates a record if it exists or inserts a new record if it doesn’t. It helps in
executing a single query in the place of executing two queries.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Try the following query to update the age of a customer ‘Raja’.

Orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'Raja'
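
A hedged sketch of the UPSERT option (the values are hypothetical, and in practice UPSERT typically relies on an index on the field used in the WHERE clause): the following statement updates the record with id 6 if it exists, or inserts it otherwise.

orientdb {db = demo}> UPDATE Customer SET id = 6, name = 'mohan', age = 31 UPSERT WHERE id = 6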
Truncate
Truncate Record command is used to delete the values of a particular record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple Rids separated by
comma to truncate multiple records. It returns the number of records truncated.
Try the following query to truncate the record having Record ID #11:4.

Orientdb {db = demo}> TRUNCATE RECORD #11:4

DELETE
Delete Record command is used to delete one or more records completely from the database.
The following statement is the basic syntax of the Delete command.
DELETE FROM <Class>|cluster:<cluster>|index:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify Default and Record.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Note − Don’t use DELETE to remove Vertices or Edges because it affects the integrity of the
graph.
Try the following query to delete the record having id = 4.
orientdb {db = demo}> DELETE FROM Customer WHERE id = 4

OrientDB Features

OrientDB provides more functionality and flexibility, while being powerful enough to replace your
operational DBMS.

SPEED

OrientDB was engineered from the ground up with performance as a key specification. It’s
fast on both read and write operations and stores up to 120,000 records per second.
 No more Joins: relationships are physical links to the records.
 Better RAM use.
 Traverses parts of or entire trees and graphs of records in milliseconds.
 Traversing speed is not affected by the database size.

ENTERPRISE

 Incremental backups
 Unmatched security
 24x7 Support
 Query Profiler
 Distributed Clustering configuration
 Metrics Recording
 Live Monitor with configurable alerts

With a master-slave architecture, the master often becomes the bottleneck. With OrientDB,
throughput is not limited by a single server. Global throughput is the sum of the throughput
of all the servers.

 Multi-Master + Sharded architecture


 Elastic Linear Scalability
 Restore the database content using WAL (Write Ahead Logging)

 OrientDB Community is free for commercial use.


 Comes with an Apache 2 Open Source License.
 Eliminates the need for multiple products and multiple licenses.
UNIT IV XML DATABASES

Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model –
XML Documents – Document Type Definition – XML Schema – XML Documents and Databases
– XML Querying – XPath – XQuery

Difference between Structured, Semi-structured and Unstructured data

Big Data includes huge volume, high velocity, and extensible variety of data. These are 3
types: Structured data, Semi-structured data, and Unstructured data.

Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository that is typically a database. It concerns all data which
can be stored in an SQL database in a table with rows and columns. Such data has relational keys
and can easily be mapped into pre-designed fields. Today, structured data is the most processed
form of data and the simplest to manage. Example: Relational data.

Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing, you can
store it in a relational database (this can be very hard for some kinds of semi-structured
data), but semi-structured formats exist to ease that effort. Example: XML data.

Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a
predefined data model; thus it is not a good fit for a mainstream relational database. So for
unstructured data, there are alternative platforms for storing and managing it. It is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence and
analytics applications. Example: Word, PDF, Text, Media logs.
Differences between Structured, Semi-structured and Unstructured data:

Technology: Structured data is based on a relational database table; semi-structured data is based on XML/RDF (Resource Description Framework); unstructured data is based on character and binary data.

Transaction management: Structured data has matured transaction management and various concurrency techniques; for semi-structured data, transactions are adapted from the DBMS and not matured; unstructured data has no transaction management and no concurrency.

Version management: Structured data is versioned over tuples, rows and tables; semi-structured data can be versioned over tuples or graphs; unstructured data is versioned as a whole.

Flexibility: Structured data is schema dependent and less flexible; semi-structured data is more flexible than structured data but less flexible than unstructured data; unstructured data is more flexible, as there is an absence of schema.

Scalability: It is very difficult to scale a structured database schema; scaling of semi-structured data is simpler than structured data; unstructured data is the most scalable.

Robustness: Structured data technology is very robust; semi-structured data technology is new and not very widespread; robustness is not defined for unstructured data.

Query performance: Structured queries allow complex joining; for semi-structured data, queries over anonymous nodes are possible; for unstructured data, only textual queries are possible.

XML Hierarchical Data Model

XML Tree Structure

An XML document has a self-descriptive structure. It forms a tree structure which is referred to as an
XML tree. The tree structure makes it easy to describe an XML document.

A tree structure contains root element (as parent), child element and so on. It is very easy to
traverse all succeeding branches and sub-branches and leaf nodes starting from the root.
<?xml version="1.0"?>
<college>
<student>
<firstname>Tamanna</firstname>
<lastname>Bhatia</lastname>
<contact>09990449935</contact>
<email>[email protected]</email>
<address>
<city>Ghaziabad</city>
<state>Uttar Pradesh</state>
<pin>201007</pin>
</address>
</student>
</college>

The tree-structure representation of the above example can be described as follows.

In the above example, the first line is the XML declaration. It defines the XML version 1.0. The next line
shows the root element (college) of the document. Inside that there is one more element (student).
The student element contains five branches named <firstname>, <lastname>, <contact>, <email> and
<address>. The <address> branch contains 3 sub-branches named <city>, <state> and <pin>.
XML Tree Rules

These rules are used to figure out the relationship of the elements. It shows if an element is a child
or a parent of the other element.

Descendants: If element A is contained by element B, then A is known as descendant of B. In the


above example "College" is the root element and all the other elements are the descendants of
"College".

Ancestors: The containing element which contains other elements is called "Ancestor" of other
element. In the above example Root element (College) is ancestor of all other elements.

What is xml?

XML (eXtensible Markup Language) is a markup language.


XML is designed to store and transport data.
XML was released in the late 90's. It was created to provide an easy way to use and store self-describing
data.

XML became a W3C Recommendation on February 10, 1998.


XML is not a replacement for HTML.
XML is designed to be self-descriptive.

XML is designed to carry data, not to display data.

XML tags are not predefined. You must define your own tags.
XML is platform independent and language independent.
XML Example

<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>

The root element in the example is <bookstore>. All elements in the document are contained
within <bookstore>.The <book> element has 4 children: <title>,< author>, <year> and <price>.

<?xml version="1.0" encoding="UTF-8"?>


<emails>
<email>
<to>Vimal</to>
<from>Sonoo</from>
<heading>Hello</heading>
<body>Hello brother, how are you!</body>
</email>
<email>
<to>Peter</to>
<from>Jack</from>
<heading>Birth day wish</heading>
<body>Happy birth day Tom!</body>
</email>
<email>
<to>James</to>
<from>Jaclin</from>
<heading>Morning walk</heading>
<body>Please start morning walk to stay fit!</body>
</email>
<email>
<to>Kartik</to>
<from>Kumar</from>
<heading>Health Tips</heading>
<body>Smoking is injurious to health!</body>
</email>
</emails>

XML Attributes
XML elements can have attributes. By the use of attributes we can add the information about the
element.

<book publisher='Tata McGraw Hill'></book>

Metadata should be stored as attributes and data should be stored as elements.

<book category="computer">
<author>A &amp; B</author>
</book>
XML Comments

XML comments are just like HTML comments. We know that comments are used to make
code more understandable to other developers.

An XML comment should be written as:

<!-- Write your comment-->


XML Validation

A well formed XML document can be validated against DTD or Schema.

A well-formed XML document is an XML document with correct syntax. It is very necessary to
know about valid XML document before knowing XML validation.
Valid XML document

It must be well formed (satisfy all the basic syntax conditions).

It must behave according to a predefined DTD or XML schema.


Rules for well formed XML
It must begin with the XML declaration.

It must have one unique root element.

All start tags of XML documents must match end tags.


XML tags are case sensitive.
All elements must be closed.

All elements must be properly nested.


All attributes values must be quoted.
XML entities must be used for special characters.
XML Validation
XML DTD:

DTD stands for Document Type Definition. It defines the legal building blocks of an XML
document. It is used to define document structure with a list of legal elements and attributes.
Purpose of DTD:

Its main purpose is to define the structure of an XML document. It contains a list of legal
elements and defines the structure with the help of them.
Example:
<?xml version="1.0"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
Description of DTD:
<!DOCTYPE employee : It defines that the root element of the document is employee.
<!ELEMENT employee: It defines that the employee element contains 3 elements "firstname,
lastname and email".
<!ELEMENT firstname: It defines that the firstname element is #PCDATA typed. (parse-able data
type).
<!ELEMENT lastname: It defines that the lastname element is #PCDATA typed. (parse-able data
type).
<!ELEMENT email: It defines that the email element is #PCDATA typed. (parse-able data type).
XML DTD
A DTD defines the legal elements of an XML document
In simple words we can say that a DTD defines the document structure with a list of legal elements
and attributes.
XML schema is a XML based alternative to DTD.
Actually DTD and XML schema both are used to form a well formed XML document.
We should avoid errors in XML documents because they will stop the XML programs.
XML schema
It is defined as an XML language
Uses namespaces to allow for reuses of existing definitions
It supports a large number of built in data types and definition of derived data types
Valid and well-formed XML document with External DTD

Let's take an example of well-formed and valid XML document. It follows all the rules of DTD.

employee.xml

<?xml version="1.0"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>

In the above example, the DOCTYPE declaration refers to an external DTD file. The content of the
file is shown in below paragraph.

employee.dtd

<!ELEMENT employee (firstname,lastname,email)>

<!ELEMENT firstname (#PCDATA)>


<!ELEMENT lastname (#PCDATA)>
<!ELEMENT email (#PCDATA)>
Valid and well-formed XML document with Internal DTD

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>

<!DOCTYPE address [

<!ELEMENT address (name,company,phone)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT company (#PCDATA)>

<!ELEMENT phone (#PCDATA)>

]>

<address>

<name>Tanmay Patil</name>

<company>TutorialsPoint</company>

<phone>(011) 123-4567</phone>
</address>

Description of DTD

<!DOCTYPE address : It defines that the root element of the document is address.

<!ELEMENT address : It defines that the address element contains 3 elements "name, company
and phone".

<!ELEMENT name : It defines that the name element is #PCDATA typed. (parse-able
data type).

<!ELEMENT company : It defines that the company element is #PCDATA typed. (parse-able
data type).

<!ELEMENT phone : It defines that the phone element is #PCDATA typed. (parse-able data type).

XML CSS
Purpose of CSS in XML

CSS (Cascading Style Sheets) can be used to add style and display information to an XML
document. It can format the whole XML document.

How to link XML file with CSS

To link XML files with CSS, you should use the following syntax:

<?xml-stylesheet type="text/css" href="cssemployee.css"?>

XML CSS Example

cssemployee.css

employee
{
background-color: pink;
}
firstname,lastname,email
{
font-size:25px;
display:block;
color: blue;
margin-left: 50px;
}
employee.dtd

<!ELEMENT employee (firstname,lastname,email)>


<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT email (#PCDATA)>

employee.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="cssemployee.css"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
CDATA vs PCDATA

CDATA

CDATA: (Unparsed Character data): CDATA contains the text which is not parsed further in an
XML document. Tags inside the CDATA text are not treated as markup and entities will not be
expanded.

Let's take an example for CDATA:

<?xml version="1.0"?>

<!DOCTYPE employee SYSTEM "employee.dtd">

<employee>

<![CDATA[

<firstname>vimal</firstname>

<lastname>jaiswal</lastname>

<email>[email protected]</email>

]]>

</employee>

In the above CDATA example, CDATA is used just after the element employee to make the
data/text unparsed, so it will give the value of employee:

<firstname>vimal</firstname><lastname>jaiswal</lastname><email>[email protected]</e
mail>

PCDATA

PCDATA: (Parsed Character Data): XML parsers are used to parse all the text in an XML
document. PCDATA stands for Parsed Character data. PCDATA is the text that will be parsed by
a parser. Tags inside the PCDATA will be treated as markup and entities will be expanded.
In other words, you can say that parsed character data means the XML parser examines the data
and ensures that it doesn't contain entities; if it does, they will be replaced.

Let's take an example:

<?xml version="1.0"?>

<!DOCTYPE employee SYSTEM "employee.dtd">

<employee>

<firstname>vimal</firstname>

<lastname>jaiswal</lastname>

<email>[email protected]</email>

</employee>

In the above example, the employee element contains 3 more elements 'firstname', 'lastname', and
'email', so it parses further to get the data/text of firstname, lastname and email to give the value of
employee as:

vimaljaiswal [email protected]

XML Schema:

XML schema is a language which is used for expressing constraints about XML documents.
There are many schema languages which are used nowadays, for example Relax-NG and XSD
(XML Schema Definition).

An XML schema is used to define the structure of an XML document. It is like DTD but
provides more control on XML structure.

Example:

<?xml version="1.0"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"

targetNamespace="http://www.javatpoint.com"
xmlns="http://www.javatpoint.com"

elementFormDefault="qualified">

<xs:element name="employee">
<xs:complexType>

<xs:sequence>

<xs:element name="firstname" type="xs:string"/>

<xs:element name="lastname" type="xs:string"/>

<xs:element name="email" type="xs:string"/>

</xs:sequence>

</xs:complexType>

</xs:element>

</xs:schema>

Description of XML Schema:

<xs:element name="employee"> : It defines the element name employee.

<xs:complexType> : It defines that the element 'employee' is complex type.

<xs:sequence> : It defines that the complex type is a sequence of elements.

<xs:element name="firstname" type="xs:string"/> : It defines that the element 'firstname' is of


string/text type.

<xs:element name="lastname" type="xs:string"/> : It defines that the element 'lastname' is of


string/text type.

<xs:element name="email" type="xs:string"/> : It defines that the element 'email' is of string/text


type.

XML Schema Data types:

There are two types of data types in XML schema.

1.SimpleType 2.ComplexType
SimpleType
The simpleType allows you to have text-only elements. It cannot contain attributes or child
elements, and it cannot be left empty.

ComplexType
The complexType allows you to hold multiple attributes and elements. It can contain additional
sub elements and can be left empty.
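
A minimal sketch of both kinds of type (the element and attribute names below are illustrative):

<xs:element name="age" type="xs:integer"/>  <!-- simple type: text content only -->
<xs:element name="book">                    <!-- complex type: child elements and an attribute -->
  <xs:complexType>
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="category" type="xs:string"/>
  </xs:complexType>
</xs:element>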

XML Database:

XML database is a data persistence software system used for storing the huge amount of
information in XML format. It provides a secure place to store XML documents.

You can query your stored data by using XQuery, export and serialize into desired format. XML
databases are usually associated with document-oriented databases.

DTD vs XSD

There are many differences between DTD (Document Type Definition) and XSD (XML Schema
Definition). In short, DTD provides less control on XML structure whereas XSD (XML schema)
provides more control.

No. DTD XSD

1) DTD stands for Document Type XSD stands for XML Schema Definition.
Definition.

2) DTDs are derived from SGML syntax. XSDs are written in XML.

3) DTD doesn't support datatypes. XSD supports datatypes for elements and
attributes.

4) DTD doesn't support namespace. XSD supports namespace.

5) DTD doesn't define order for child XSD defines order for child elements.
elements.

6) DTD is not extensible. XSD is extensible.

7) DTD is not simple to learn. XSD is simple to learn because you don't need
to learn a new language.
8) DTD provides less control on XML XSD provides more control on XML structure.
structure.

XML Database:

XML database is a data persistence software system used for storing the huge amount of
information in XML format.

It provides a secure place to store XML documents.

Types of XML databases:

There are two types of XML databases.

1. XML-enabled database

2. Native XML database (NXD)

XML-enable Database:

XML-enable database works just like a relational database. It is like an extension provided
for the conversion of XML documents. In this database, data is stored in table, in the form of rows
and columns.

Native XML Database:

Native XML database is used to store large amount of data. Instead of table format, Native
XML database is based on container format. You can query data by XPath expressions.
Native XML database is preferred over XML-enable database because it is highly capable to store,
maintain and query XML documents.

Example:

<?xml version="1.0"?>

<contact-info>

<contact1>

<name>Vimal Jaiswal</name>

<company>SSSIT.org</company>

<phone>(0120) 4256464</phone>

</contact1>
<contact2>

<name>Mahesh Sharma </name>

<company>SSSIT.org</company>

<phone>09990449935</phone>

</contact2>

</contact-info>

XPath:

XPath is an important and core component of XSLT standard. It is used to traverse the elements
and attributes in an XML document.

XPath is a W3C recommendation. XPath provides different types of expressions to retrieve


relevant information from the XML document. It is syntax for defining parts of an XML
document.

Important features of XPath:

XPath defines structure: XPath is used to define the parts of an XML document i.e. element,
attributes, text, namespace, processing-instruction, comment, and document nodes.

XPath provides path expression: XPath provides powerful path expressions, select nodes, or list of
nodes in XML documents.

XPath is a core component of XSLT: XPath is a major element in XSLT standard and must be
followed to work with XSLT documents.

XPath is a standard function: XPath provides a rich library of standard functions to manipulate
string values, numeric values, date and time comparison, node and QName manipulation,
sequence manipulation, Boolean values etc.

XPath is a W3C recommendation.

XPath Expression

XPath defines a pattern or path expression to select nodes or node sets in an XML document.
These patterns are used by XSLT to perform transformations. The path expressions look very
similar to the expressions used in a traditional file system.

XPath specifies seven types of nodes that can be output of the execution of the XPath expression.

o Root
o Element
o Text
o Attribute
o Comment
o Processing Instruction
o Namespace

We know that XPath uses a path expression to select node or a list of nodes from an XML
document.

A list of useful paths and expression to select any node/ list of nodes from an XML document:

XPath Expression Example

Let's take an example to see the usage of an XPath expression. Here, we use an XML file
"employee.xml" and a stylesheet for that XML file named "employee.xsl". The XSL file uses
XPath expressions under the select attribute of various XSL tags to fetch the values of id, firstname,
lastname, nickname and salary of each employee node.

Employee.xml

<?xml version = "1.0"?>


<?xml-stylesheet type = "text/xsl" href = "employee.xsl"?>
<class>
<employee id = "001">
<firstname>Aryan</firstname>
<lastname>Gupta</lastname>
<nickname>Raju</nickname>
<salary>30000</salary>
</employee>
<employee id = "024">
<firstname>Sara</firstname>
<lastname>Khan</lastname>
<nickname>Zoya</nickname>
<salary>25000</salary>
</employee>
<employee id = "056">
<firstname>Peter</firstname>
<lastname>Symon</lastname>
<nickname>John</nickname>
<salary>10000</salary>
</employee>
</class>

Employee.xsl

<?xml version = "1.0" encoding = "UTF-8"?>


<xsl:stylesheet version = "1.0"
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">
<xsl:template match = "/">
<html>
<body>
<h2> Employees</h2>
<table border = "1">
<tr bgcolor = "pink">
<th> ID</th>
<th> First Name</th>
<th> Last Name</th>
<th> Nick Name</th>
<th> Salary</th>
</tr>
<xsl:for-each select = "class/employee">
<tr>
<td> <xsl:value-of select = "@id"/> </td>
<td> <xsl:value-of select = "firstname"/> </td>
<td> <xsl:value-of select = "lastname"/> </td>
<td> <xsl:value-of select = "nickname"/> </td>
<td> <xsl:value-of select = "salary"/> </td>
</tr>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
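
For instance, against the Employee.xml document above, the stand-alone XPath expression /class/employee[salary > 20000]/firstname selects the firstname elements of the employees whose salary is greater than 20000 (the threshold here is only illustrative).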

XQuery:

XQuery is a functional query language used to retrieve information stored in XML format. It is
for XML what SQL is for databases. It was designed to query XML data.

XQuery is built on XPath expressions. It is a W3C recommendation which is supported by all


major databases.
What does it do
XQuery is a functional language which is responsible for finding and extracting elements and
attributes from XML documents.
It can be used for following things:

To extract information to use in a web service.


To generate summary reports.
To transform XML data to XHTML.
To search Web documents for relevant information.

XQuery Features:
There are many features of XQuery query language. A list of top features are given below:
XQuery is a functional language. It is used to retrieve and query XML-based data.
XQuery is an expression-oriented programming language with a simple type system.
XQuery is analogous to SQL: as SQL is the query language for databases, XQuery is the query
language for XML.

XQuery is XPath based and uses XPath expressions to navigate through XML documents.

XQuery is a W3C standard and universally supported by all major databases.

Advantages of XQuery:
XQuery can be used to retrieve both hierarchical and tabular data.

XQuery can also be used to query tree and graphical structures.


XQuery can be used to build web pages.
XQuery can be used to query web pages.
XQuery is best for XML-based databases and object-based databases. Object databases are much
more flexible and powerful than purely tabular databases.

XQuery can be used to transform XML documents into XHTML documents.

XQuery Environment Setup


Let's see how to create a local development environment. Here we are using the jar file of Saxon
XQuery processor. The Java-based Saxon XQuery processor is used to test the ".xqy" file, a file
containing XQuery expression against our sample XML document.
You need to load Saxon XQuery processor jar files to run the java application.
For eclipse project, add build-path to these jar files. Or, if you are running java using command
prompt, you need to set classpath to these jar files or put these jar files inside JRE/lib/ext directory.
How to Set CLASSPATH in Windows Using Command Prompt
Type the following command in your Command Prompt and press enter.
1. set CLASSPATH=%CLASSPATH%;C:\Program Files\Java\jre1.8\rt.jar;

XQuery First Example


Here, the XML document is named as courses.xml and xqy file is named as courses.xqy

courses.xml
<?xml version="1.0" encoding="UTF-8"?>
<courses>
<course category="JAVA">
<title lang="en">Learn Java in 3 Months.</title>
<trainer>Sonoo Jaiswal</trainer>
<year>2008</year>

<fees>10000.00</fees>
</course>
<course category="Dot Net">
<title lang="en">Learn Dot Net in 3 Months.</title>

<trainer>Vicky Kaushal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>

<course category="C">
<title lang="en">Learn C in 2 Months.</title>
<trainer>Ramesh Kumar</trainer>

<year>2014</year>
<fees>3000.00</fees>
</course>
<course category="XML">
<title lang="en">Learn XML in 2 Months.</title>
<trainer>Ajeet Kumar</trainer>

<year>2015</year>
<fees>4000.00</fees>
</course>
</courses>

courses.xqy
for $x in doc("courses.xml")/courses/course

where $x/fees>5000
return $x/title
This example will display the title elements of the courses whose fees are greater than 5000.
Create a Java-based XQuery executor program that reads courses.xqy, passes it to the XQuery
expression processor, and executes the expression. After that, the result is displayed.

XQueryTester.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQException;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;
import com.saxonica.xqj.SaxonXQDataSource;

public class XQueryTester {

    public static void main(String[] args) {
        try {
            execute();
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        catch (XQException e) {
            e.printStackTrace();
        }
    }

    private static void execute() throws FileNotFoundException, XQException {
        InputStream inputStream = new FileInputStream(new File("courses.xqy"));
        XQDataSource ds = new SaxonXQDataSource();
        XQConnection conn = ds.getConnection();
        XQPreparedExpression exp = conn.prepareExpression(inputStream);
        XQResultSequence result = exp.executeQuery();
        while (result.next()) {
            System.out.println(result.getItemAsString(null));
        }
    }
}
Execute XQuery against XML
Put the above three files in the same location; here they are placed in a folder named XQuery2 on the desktop. Compile XQueryTester.java from the console. You must have JDK 1.5 or later installed on your computer and the classpath configured.

Compile:
javac XQueryTester.java
Execute:
java XQueryTester
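If the Saxon jar files have not been copied into JRE/lib/ext, the classpath can instead be supplied on the command line. The jar file names below are only illustrative; use the names of the jar files that ship with your Saxon download:
javac -cp .;saxon9he.jar;saxon9-xqj.jar XQueryTester.java
java -cp .;saxon9he.jar;saxon9-xqj.jar XQueryTester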

XQuery FLWOR
FLWOR is an acronym which stands for "For, Let, Where, Order by, Return".
• For - It is used to select a sequence of nodes.
• Let - It is used to bind a sequence to a variable.
• Where - It is used to filter the nodes.

• Order by - It is used to sort the nodes.


• Return - It is used to specify what to return (gets evaluated once for every node).

XQuery FLWOR Example

Example

Following is a sample XML document that contains information on a collection of books. We will
use a FLWOR expression to retrieve the titles of those books with a price greater than 30.
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book category="JAVA">
<title lang="en">Learn Java in 24 Hours</title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="DOTNET">
<title lang="en">Learn .Net in 24 hours</title>
<author>Peter</author>
<year>2011</year>
<price>70.50</price>
</book>
<book category="XML">
<title lang="en">Learn XQuery in 24 hours</title>
<author>Robert</author>
<author>Peter</author>
<year>2013</year>
<price>50.00</price>
</book>
<book category="XML">
<title lang="en">Learn XPath in 24 hours</title>
<author>Jay Ban</author>
<year>2010</year>
<price>16.50</price>

</book>

</books>

The following Xquery document contains the query expression to be executed on the above XML
document.
books.xqy
let $books := (doc("books.xml")/books/book)
return <results>
{
for $x in $books
where $x/price>30
order by $x/price
return $x/title
}</results>

Result
<title lang="en">Learn XQuery in 24 hours</title>
<title lang="en">Learn .Net in 24 hours</title>
2. Let's take an XML document containing information on a collection of courses. We will use a FLWOR expression to retrieve the titles of those courses whose fees are greater than 2000.

courses.xml
<?xml version="1.0" encoding="UTF-8"?>
<courses>
<course category="JAVA">

<title lang="en">Learn Java in 3 Months.</title>


<trainer>Sonoo Jaiswal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="Dot Net">

<title lang="en">Learn Dot Net in 3 Months.</title>


<trainer>Vicky Kaushal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="C">
<title lang="en">Learn C in 2 Months.</title>

<trainer>Ramesh Kumar</trainer>
<year>2014</year>
<fees>3000.00</fees>
</course>
<course category="XML">
<title lang="en">Learn XML in 2 Months.</title>

<trainer>Ajeet Kumar</trainer>
<year>2015</year>
<fees>4000.00</fees>

</course>
</courses>
Let's take the XQuery document named "courses.xqy" that contains the query expression to be executed on the above XML document.

courses.xqy
let $courses := (doc("courses.xml")/courses/course)
return <results>
{
for $x in $courses

where $x/fees>2000
order by $x/fees
return $x/title
}

</results>
Create a Java-based XQuery executor program that reads courses.xqy, passes it to the XQuery expression processor, and executes the expression. The result is then displayed.

XQueryTester.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQException;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;
import com.saxonica.xqj.SaxonXQDataSource;

public class XQueryTester {

    public static void main(String[] args) {
        try {
            execute();
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        catch (XQException e) {
            e.printStackTrace();
        }
    }

    private static void execute() throws FileNotFoundException, XQException {
        InputStream inputStream = new FileInputStream(new File("courses.xqy"));
        XQDataSource ds = new SaxonXQDataSource();
        XQConnection conn = ds.getConnection();
        XQPreparedExpression exp = conn.prepareExpression(inputStream);
        XQResultSequence result = exp.executeQuery();
        while (result.next()) {
            System.out.println(result.getItemAsString(null));
        }
    }
}

XQuery XPath Example


Let's take an XML document containing information on a collection of courses. We will use an XPath-style XQuery expression to retrieve the titles of those courses whose fees are greater than 2000.

courses.xml

<?xml version="1.0" encoding="UTF-8"?>


<courses>
<course category="JAVA">
<title lang="en">Learn Java in 3 Months.</title>
<trainer>Sonoo Jaiswal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="Dot Net">
<title lang="en">Learn Dot Net in 3 Months.</title>
<trainer>Vicky Kaushal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="C">
<title lang="en">Learn C in 2 Months.</title>
<trainer>Ramesh Kumar</trainer>
<year>2014</year>
<fees>3000.00</fees>
</course>
<course category="XML">
<title lang="en">Learn XML in 2 Months.</title>
<trainer>Ajeet Kumar</trainer>
<year>2015</year>
<fees>4000.00</fees>
</course>
</courses>
courses.xqy
(: read the entire xml document :)
let $courses := doc("courses.xml")
for $x in $courses/courses/course
where $x/fees > 2000
return $x/title

XQuery statements can be written in several different ways that display the same result: here, the titles of those courses whose fees are greater than 2000.
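For comparison, the same titles can also be retrieved with a single XPath-style path expression that uses a predicate instead of a where clause (a sketch equivalent to the query above):
doc("courses.xml")/courses/course[fees > 2000]/title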

Execute XQuery against XML

Put the above three files in the same location; here they are placed in a folder named XQuery3 on the desktop. Compile XQueryTester.java from the console. You must have JDK 1.5 or later installed on your computer and the classpath configured.

XQuery vs XPath:
1) XQuery is a functional programming and query language that is used to query a group of XML data. XPath is an XML path language that is used to select nodes from an XML document using queries.
2) XQuery is used to extract and manipulate data from XML documents, relational databases, and MS Office documents that support an XML data source. XPath is used to compute values such as strings, numbers, and boolean types from XML documents.
3) XQuery data is represented in the form of a tree model with seven kinds of nodes, namely processing instructions, elements, document nodes, attributes, namespaces, text nodes, and comments. XPath represents an XML document as a tree structure and navigates it by selecting different nodes.
4) XQuery supports XPath and extended relational models. XPath is itself a component of the XQuery language.
5) The XQuery language helps to create new XML documents. XPath was created to define a common syntax and behavior model for XPointer and XSLT.
UNIT V INFORMATION RETRIEVAL AND WEB SEARCH

What is Information Retrieval?


Information Retrieval (IR) can be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. Information retrieval is the activity of obtaining material, usually documents of an unstructured nature (i.e., text), that satisfies an information need from within large collections stored on computers. For example, information retrieval takes place when a user enters a query into the system.
Not only librarians and professional searchers engage in information retrieval; nowadays hundreds of millions of people engage in IR every day when they use web search engines. Information retrieval is believed to be the dominant form of information access. The IR system assists users in finding the information they require, but it does not explicitly return the answers to their questions. It notifies the user of the existence and location of documents that might contain the required information.
Information retrieval also supports users in browsing or filtering document collections or in processing a set of retrieved documents. The system searches over billions of documents stored on millions of computers. For example, an email program provides a spam filter and manual or automatic means of classifying mail so that it can be placed directly into particular folders.
An IR system has the ability to represent, store, organize, and access information items. A set of keywords is required to search. Keywords are what people search for in search engines; these keywords summarize the description of the information.

What is an IR Model?

An Information Retrieval (IR) model selects and ranks the documents that are required by the user, i.e., the documents the user has asked for in the form of a query. The documents and the queries are represented in a similar manner, so that document selection and ranking can be formalized by a matching function that returns a retrieval status value (RSV) for each document in the collection. Many information retrieval systems represent document contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR model determines the query-document matching function; one common approach is the estimation of the probability of the user's relevance rel for each document d and query q with respect to a set Rq of training documents: Prob(rel | d, q, Rq).

Types of IR Models
Components of Information Retrieval/ IR Model

● Acquisition: In this step, the selection of documents and other objects from various web resources that consist of text-based documents takes place. The required data is collected by web crawlers and stored in the database.
● Representation: This consists of indexing, which uses free-text terms, a controlled vocabulary, and both manual and automatic techniques. For example, abstracting involves summarizing, and a bibliographic description contains the author, title, sources, data, and metadata.
● File Organization: There are two basic file organization methods. Sequential: documents are stored document by document. Inverted: data is stored term by term, with a list of records under each term. A combination of both can also be used (a small sketch of an inverted index is given after this list).
● Query: An IR process starts when a user enters a query into the system. Queries are formal statements of information needs, for example, search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
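As a minimal sketch of the inverted file organization mentioned in the list above (the class and variable names are illustrative only, not part of any standard library), the following Java program builds a tiny inverted index that maps each term to the list of document ids containing it:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // two toy "documents"; a real system would obtain these from a crawler or a file store
        String[] docs = {
            "information retrieval deals with documents",
            "data retrieval deals with databases"
        };
        // term -> sorted set of document ids in which the term occurs (the posting list)
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        // e.g. prints: retrieval -> [0, 1], databases -> [1], ...
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}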
Difference Between Information Retrieval and Data Retrieval

1) Information retrieval is the software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. Data retrieval deals with obtaining data from a database management system such as an ODBMS; it is the process of identifying and retrieving data from the database, based on the query provided by the user or application.
2) Information retrieval retrieves information about a subject. Data retrieval determines the keywords in the user query and retrieves the data.
3) In information retrieval, small errors are likely to go unnoticed. In data retrieval, a single erroneous object means total failure.
4) The information handled is not always well structured and is semantically ambiguous. The data handled has a well-defined structure and semantics.
5) Information retrieval does not provide a solution to the user of the database system. Data retrieval provides solutions to the user of the database system.
6) In information retrieval, the results obtained are approximate matches. In data retrieval, the results obtained are exact matches.
7) In information retrieval, results are ordered by relevance. In data retrieval, results are not ordered by relevance.
8) Information retrieval is based on a probabilistic model. Data retrieval is based on a deterministic model.


User Interaction With Information Retrieval System

The User Task: The information need first has to be translated into a query by the user. In an information retrieval system, a set of words conveys the semantics of the information that is required, whereas in a data retrieval system a query expression is used to convey the constraints that must be satisfied by the objects. Example: a user wants to search for something but ends up searching for something else; this means that the user is browsing and not searching.
● Logical View of the Documents: A long time ago, documents were represented by a set of index terms or keywords. Nowadays, modern computers can represent documents by their full set of words, and the set of representative keywords is then reduced by eliminating stopwords, i.e., articles and connectives. These operations are called text operations; they reduce the complexity of the document representation from full text to a set of index terms.

Past, Present, and Future of Information Retrieval


1. Early Developments: As there was an increase in the need for a lot of information, it
became necessary to build data structures to get faster access. The index is the data structure
for faster retrieval of information. Over centuries manual categorization of hierarchies was
done for indexes.
2. Information Retrieval in Libraries: Libraries were the first to adopt IR systems for information retrieval. The first generation consisted of the automation of previous technologies, and the search was based on author name and title. The second generation added searching by subject heading, keywords, etc. The third generation consisted of graphical interfaces, electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: The web is cheaper than various other sources of information, it provides greater access due to digital communication networks, and it gives free access to publish on a larger medium.
Information Retrieval Models
2.1 Introduction

The purpose of this chapter is two-fold: First, we want to set the stage for the problems in
information retrieval that we try to address in this thesis. Second, we want to give the reader a
quick overview of the major textual retrieval methods, because the InfoCrystal can help to
visualize the output from any of them. We begin by providing a general model of the
information retrieval process. We then briefly describe the major retrieval methods and
characterize them in terms of their strengths and shortcomings.

2.2 General Model of Information Retrieval

The goal of information retrieval (IR) is to provide users with those documents that will satisfy their
information need. We use the word "document" as a general term that could also include
non-textual information, such as multimedia objects. Figure 2.1 provides a general overview
of the information retrieval process, which has been adapted from Lancaster and Warner
(1993). Users have to formulate their information need in a form that can be understood by
the retrieval mechanism. There are several steps involved in this translation process that we
will briefly discuss below. Likewise, the contents of large document collections need to be
described in a form that allows the retrieval mechanism to identify the potentially relevant
documents quickly. In both cases, information may be lost in the transformation process
leading to a computer-usable representation. Hence, the matching process is inherently
imperfect.

Information seeking is a form of problem solving [Marcus 1994, Marchionini 1992]. It


proceeds according to the interaction among eight subprocesses: problem recognition and
acceptance, problem definition, search system selection, query formulation, query execution,
examination of results (including relevance feedback), information extraction, and
reflection/iteration/termination. To be able to perform effective searches, users have to
develop the following expertise: knowledge about various sources of information, skills in
defining search problems and applying search strategies, and competence in using electronic
search tools.

Marchionini (1992) contends that some sort of spreadsheet is needed that supports users in
the problem definition as well as other information seeking tasks. The InfoCrystal is such a
spreadsheet because it assists users in the formulation of their information needs and the
exploration of the retrieved documents, using a visual interface that supports a "what-if"
functionality. He further predicts that advances in computing power and speed, together with
improved information retrieval procedures, will continue to blur the distinctions between
problem articulation and examination of results. The InfoCrystal is both a visual query
language and a tool for visualizing retrieval results.

The information need can be understood as forming a pyramid, where only its peak is made
visible by users in the form of a conceptual query (see Figure 2.1). The conceptual query
captures the key
concepts and the relationships among them. It is the result of a conceptual analysis that
operates on the information need, which may be well or vaguely defined in the user's mind.
This analysis can be challenging, because users are faced with the general "vocabulary
problem" as they are trying to translate their information need into a conceptual query. This
problem refers to the fact that a single word can have more than one meaning, and,
conversely, the same concept can be described by surprisingly many different words. Furnas,
Landauer, Gomez and Dumais (1983) have shown that two people use the same main word to
describe an object only 10 to 20% of the time. Further, the concepts used to represent the
documents can be different from the concepts used by the user. The conceptual query can
take the form of a natural language statement, a list of concepts that can have degrees of
importance assigned to them, or it can be a statement that coordinates the concepts using
Boolean operators. Finally, the conceptual query has to be translated into a query surrogate
that can be understood by the retrieval system.

Figure 2.1: represents a general model of the information retrieval process, where both the
user's information need and the document collection have to be translated into the form of
surrogates to enable the matching process to be performed. This figure has been adapted from
Lancaster and Warner (1993).
Similarly, the meanings of documents need to be represented in the form of text surrogates
that can be processed by computer. A typical surrogate can consist of a set of index terms or
descriptors. The text surrogate can consist of multiple fields, such as the title, abstract,
descriptor fields to capture the meaning of a document at different levels of resolution or
focusing on different characteristic aspects of a document. Once the specified query has been
executed by the IR system, the user is presented with the retrieved document surrogates. Either the
user is satisfied by the retrieved information or he will evaluate the retrieved documents and
modify the query to initiate a further search. The process of query modification based on user
evaluation of the retrieved documents is known as relevance feedback [Lancaster and Warner
1993]. Information retrieval is an inherently interactive process, and the users can change
direction by modifying the query surrogate, the conceptual query or their understanding of
their information need.

It is worth noting here the results, which have been obtained in studies investigating the
information-seeking process, that describe information retrieval in terms of the cognitive
and affective symptoms commonly experienced by a library user. The findings by Kuhlthau
et al. (1990) indicate that thoughts about the information need become clearer and more
focused as users move through the search process. Similarly, uncertainty, confusion, and
frustration are nearly universal experiences in the early stages of the search process, and
they decrease as the search process progresses and feelings of being confident, satisfied,
sure and relieved increase. The studies also indicate that cognitive attributes may affect the
search process. User's expectations of the information system and the search process may
influence the way they approach searching and therefore affect the intellectual access to
information.
Analytical search strategies require the formulation of specific, well-structured queries and a
systematic, iterative search for information, whereas browsing involves the generation of
broad query terms and a scanning of much larger sets of information in a relatively
unstructured fashion. Campagnoni et al. (1989) have found in information retrieval studies in
hypertext systems that the predominant search strategy is "browsing" rather than "analytical
search". Many users, especially novices, are unwilling or unable to precisely formulate their
search objectives, and browsing places less cognitive load on them. Furthermore, their
research showed that search strategy is only one dimension of effective information retrieval;
individual differences in visual skill appear to play an equally important role.

These two studies argue for information displays that provide a spatial overview of the data
elements and that simultaneously provide rich visual cues about the content of the individual
data elements.
Such a representation is less likely to increase the anxiety that is a natural part of the early
stages of the search process and it caters for a browsing interaction style, which is appropriate
especially in the beginning, when many users are unable to precisely formulate their search
objectives.

2.3 Major Information Retrieval Models

The following major models have been developed to retrieve information: the Boolean
model, the Statistical model, which includes the vector space and the probabilistic retrieval
model, and the Linguistic and Knowledge-based models. The first model is often referred to
as the "exact match" model; the latter ones as the "best match" models [Belkin and Croft
1992]. The material presented here is based on the textbooks by Lancaster and Warner (1992)
as well as Frakes and Baeza-Yates (1992), the review article by Belkin and Croft (1992), and
discussions with Richard Marcus, my thesis advisor and mentor in the field of information
retrieval.

Queries generally are less than perfect in two respects: First, they retrieve some irrelevant
documents. Second, they do not retrieve all the relevant documents. The following two
measures are usually used to evaluate the effectiveness of a retrieval method. The first one,
called the precision rate, is equal to the proportion of the retrieved documents that are
actually relevant. The second one, called the recall rate, is equal to the proportion of all
relevant documents that are actually retrieved. If searchers want to raise precision, then they
have to narrow their queries. If searchers want to raise recall, then they broaden their query.
In general, there is an inverse relationship between precision and recall. Users need help to
become knowledgeable in how to manage the precision and recall trade-off for their
particular information need [Marcus 1991].
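In symbols, if Relevant denotes the set of relevant documents for a query and Retrieved the set of documents actually returned, the two measures described above are:
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|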

2.3.1.1 Standard Boolean

In Table 2.1 we summarize the defining characteristics of the standard Boolean approach
and list its key advantages and disadvantages. It has the following strengths: 1) It is easy to
implement and it is computationally efficient [Frakes and Baeza-Yates 1992]. Hence, it is
the standard model for the
current large-scale, operational retrieval systems and many of the major on-line information
services use it. 2) It enables users to express structural and conceptual constraints to describe
important linguistic features [Marcus 1991]. Users find that synonym specifications
(reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the
formulation of queries [Cooper 1988, Marcus 1991]. 3) The Boolean approach possesses a
great expressive power and clarity.
Boolean retrieval is very effective if a query requires an exhaustive and unambiguous
selection. 4) The Boolean method offers a multitude of techniques to broaden or narrow a
query. 5) The Boolean approach can be especially effective in the later stages of the search
process, because of the clarity and exactness with which relationships between concepts can
be represented.

The standard Boolean approach has the following shortcomings: 1) Users find it difficult to
construct effective Boolean queries for several reasons [Cooper 1988, Fox and Koll 1988,
Belkin and Croft 1992]. Users are using the natural language terms AND, OR or NOT that
have a different meaning when used in a query. Thus, users will make errors when they form
a Boolean query, because they resort to their knowledge of English.

Table 2.1: summarizes the defining characteristics of the standard Boolean approach and lists its key advantages and disadvantages.
For example, in ordinary conversation a noun phrase of the form "A and B" usually refers to
more entities than would "A" alone, whereas when used in the context of information
retrieval it refers to fewer documents than would be retrieved by "A" alone. Hence, one of the
common mistakes made by users is to substitute the AND logical operator for the OR logical
operator when translating an English sentence to a Boolean query. Furthermore, to form
complex queries, users must be familiar with the rules of precedence and the use of
parentheses. Novice users have difficulty using parentheses, especially nested parentheses.
Finally, users are overwhelmed by the multitude of ways a query can be structured or
modified, because of the combinatorial explosion of feasible queries as the number of
concepts increases. In particular, users have difficulty identifying and applying the different
strategies that are available for narrowing or broadening a Boolean query [Marcus 1991,
Lancaster and Warner 1993]. 2) Only documents that satisfy a query exactly are retrieved. On
the one hand, the AND operator is too severe because it does not distinguish between the
case when none of the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or four criteria are
combined with the Boolean operator AND (referred to as the Null Output problem). On the
other hand, the OR operator does not reflect how many concepts have been satisfied. Hence,
often too many documents are retrieved (the Output Overload problem). 3) It is difficult to
control the number of retrieved documents. Users are often faced with the null-output or the
information overload problem and they are at a loss as to how to modify the query to retrieve a reasonable number of documents. 4) The traditional Boolean approach does not provide a
relevance ranking of the retrieved documents, although modern Boolean approaches can
make use of the degree of coordination, field level and degree of stemming present to rank
them [Marcus 1991]. 5) It does not represent the degree of uncertainty or error due to the
vocabulary problem [Belkin and Croft 1992].
2.3.1.2 Narrowing and Broadening Techniques

As mentioned earlier, a Boolean query can be described in terms of the following four
operations: degree and type of coordination, proximity constraints, field specifications and
degree of stemming as expressed in terms of word/string specifications. If users want to
(re)formulate a Boolean query then they need to make informed choices along these four
dimensions to create a query that is sufficiently broad or narrow depending on their
information needs. Most narrowing techniques lower recall as well as raise precision, and
most broadening techniques lower precision as well as raise recall. Any query can be
reformulated to achieve the desired precision or recall characteristics, but generally it is
difficult to achieve both. Each of the four kinds of operations in the query formulation has
particular operators, some of which tend to have a narrowing or broadening effect. For each
operator with a narrowing effect, there is one or more inverse operators with a broadening
effect [Marcus 1991]. Hence, users require help to gain an understanding of how changes
along these four dimensions will affect the broadness or narrowness of a query.

Figure 2.2: captures how coordination, proximity, field level and stemming affect the
broadness or narrowness of a Boolean query. By moving in the direction in which the
wedges are expanding the query is broadened.

Figure 2.2 shows how the four dimensions affect the broadness or narrowness of a query: 1)
Coordination: the different Boolean operators AND, OR and NOT have the following effects
when used to add a further concept to a query: a) the AND operator narrows a query; b) the
OR broadens it; c) the effect of the NOT depends on whether it is combined with an AND or
OR operator. Typically, in searching textual databases, the NOT is connected to the AND, in
which case it has a narrowing effect like the AND operator. 2) Proximity: The closer together
two terms have to appear in a document, the more narrow and precise the query. The most
stringent proximity constraint requires the two terms to be adjacent. 3) Field level: current
document records have fields associated with them, such as the "Title", "Index", "Abstract" or
"Full-text" field: a) the more fields that are searched, the broader the query; b) the individual
fields have varying degrees of precision associated with them, where the "title" field is the
most specific and the "full-text" field is the most general. 4) Stemming: The shorter the prefix
that is used in truncation-based searching, the broader the query. By reducing a term to its
morphological stem and using it as a prefix, users can retrieve many terms that are
conceptually related to the original term [Marcus 1991].

Using Figure 2.2, we can easily read off how to broaden a query. We just need to move in the
direction in which the wedges are expanding: we use the OR operator (rather than the AND),
impose no proximity constraints, search over all fields and apply a great deal of stemming.
Similarly, we can formulate a very narrow query by moving in the direction in which the
wedges are contracting: we use the AND operator (rather than the OR), impose proximity
constraints, restrict the search to the
title field and perform exact rather than truncated word matches. In Chapter 4 we will show
how Figure 2.2 indicates how the broadness or narrowness of a Boolean query could be
visualized.

2.3.1.3 Smart Boolean

There have been attempts to help users overcome some of the disadvantages of the traditional
Boolean discussed above. We will now describe such a method, called Smart Boolean,
developed by Marcus [1991, 1994] that tries to help users construct and modify a Boolean
query as well as make better choices along the four dimensions that characterize a Boolean
query. We are not attempting to provide an in-depth description of the Smart Boolean
method, but to use it as a good example that illustrates some of the possible ways to make
Boolean retrieval more user-friendly and effective. Table 2.2 provides a summary of the key
features of the Smart Boolean approach.

Users start by specifying a natural language statement that is automatically translated into a
Boolean Topic representation that consists of a list of factors or concepts, which are
automatically coordinated using the AND operator. If the user at the initial stage can or wants
to include synonyms, then they are coordinated using the OR operator. Hence, the Boolean
Topic representation connects the different factors using the AND operator, where the factors
can consist of single terms or several synonyms connected by the OR operator. One of the
goals of the Smart Boolean approach is to make use of the structural knowledge contained in
the text surrogates, where the different fields represent contexts of useful information.
Further, the Smart Boolean approach wants to use the fact that related concepts can share a
common stem. For example, the concepts "computers" and "computing" have the common
stem comput*.

Table 2.2: summarizes the defining characteristics of the Smart Boolean approach and lists its key advantages and disadvantages.

The initial strategy of the Smart Boolean approach is to start out with the broadest possible
query within the constraints of how the factors and their synonyms have been coordinated.
Hence, it modifies the Boolean Topic representation into the query surrogate by using only
the stems of the concepts and searches for them over all the fields. Once the query surrogate
has been performed, users are guided in the process of evaluating the retrieved document
surrogates. They choose from a list of reasons to indicate why they consider certain
documents as relevant. Similarly, they can indicate why other documents are not relevant by
interacting with a list of possible reasons. This user feedback is used by the Smart Boolean
system to automatically modify the Boolean Topic representation or the query surrogate,
whatever is more appropriate. The Smart Boolean approach offers a rich set of strategies for
modifying a query based on the received relevance feedback or the expressed need to narrow
or broaden the query. The Smart Boolean retrieval paradigm has been implemented in the
form of a system called CONIT, which is one of the earliest expert retrieval systems that was
able to demonstrate that ordinary users, assisted by such a system, could perform equally well
as experienced search intermediaries [Marcus 1983]. However, users have to navigate
through a series of menus listing different choices, where it might be hard for them to
appreciate the
implications of some of these choices. A key limitation of the previous versions of the
CONIT system has been that they lacked a visual interface. The most recent version has a
graphical interface and it uses the tiling metaphor suggested by Anick et al. (1991), and
discussed in section 10.4, to visualize Boolean coordination [Marcus 1994]. This
visualization approach suffers from the limitation that it enables users to visualize specific
queries, whereas we will propose a visual interface that represents a whole range of related
Boolean queries in a single display, making changes in Boolean coordination more user-
friendly. Further, the different strategies of modifying a query in CONIT require a better
visualization metaphor to enable users to make use of these search heuristics. In Chapter 4 we
show how some of these modification techniques can be visualized.

2.3.1.4 Extended Boolean Models

Several methods have been developed to extend the Boolean model to address the following issues:
1) The Boolean operators are too strict and ways need to be found to soften them. 2) The
standard Boolean approach has no provision for ranking. The Smart Boolean approach and
the methods described in this section provide users with relevance ranking [Fox and Koll
1988, Marcus 1991]. 3) The Boolean model does not support the assignment of weights to
the query or document terms. We will briefly discuss the P-norm and the Fuzzy Logic
approaches that extend the Boolean model to address the above issues.

Table 2.3: summarizes the defining characteristics of the Extended Boolean approach and lists its key advantages and disadvantages.

The P-norm method developed by Fox (1983) allows query and document terms to have
weights, which have been computed by using term frequency statistics with the proper
normalization procedures. These normalized weights can be used to rank the documents in the
order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of
increasing distance from the point (1, 1, ... , 1) for an AND query. Further, the Boolean
operators have a coefficient P associated with them to indicate the degree of strictness of the
operator (from 1 for least strict to infinity for most strict, i.e., the Boolean case). The P-norm
uses a distance-based measure and the coefficient P determines the degree of exponentiation
to be used. The exponentiation is an expensive computation, especially for P-values greater
than one.
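A sketch of the usual P-norm formulation (assuming an unweighted query and document term weights w1, ..., wn, each between 0 and 1) is:
sim(q_OR, d) = ( (w1^P + w2^P + ... + wn^P) / n )^(1/P)
sim(q_AND, d) = 1 - ( ((1 - w1)^P + (1 - w2)^P + ... + (1 - wn)^P) / n )^(1/P)
For P = 1 both expressions reduce to a simple average of the weights, and as P goes to infinity they approach the strict Boolean behavior (max for OR, min for AND).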

In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the
traditional binary membership choice. The weight of an index term for a given document
reflects the degree to which this term describes the content of a document. Hence, this weight
reflects the degree of membership of the document in the fuzzy set associated with the term
in question. The degree of membership for union and intersection of two fuzzy sets is equal
to the maximum and minimum, respectively, of the degrees of membership of the elements of
the two sets. In the "Mixed Min and Max" model developed by Fox and Sharat (1986) the
Boolean operators are softened by
considering the query-document similarity to be a linear combination of the min and max
weights of the documents.
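In symbols, with μ_A(d) denoting the degree of membership of document d in the fuzzy set associated with term A, the standard fuzzy operators are:
μ_(A OR B)(d) = max( μ_A(d), μ_B(d) )
μ_(A AND B)(d) = min( μ_A(d), μ_B(d) )
μ_(NOT A)(d) = 1 - μ_A(d)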

2.3.2 Statistical Model

The vector space and probabilistic models are the two major examples of the statistical
retrieval approach. Both models use statistical information in the form of term frequencies to
determine the relevance of documents with respect to a query. Although they differ in the
way they use the term frequencies, both produce as their output a list of documents ranked by
their estimated relevance. The statistical retrieval models address some of the problems of
Boolean retrieval methods, but they have disadvantages of their own. Table 2.4 provides
a summary of the key features of the vector space and probabilistic approaches. We will also
describe Latent Semantic Indexing and clustering approaches that are based on statistical
retrieval approaches, but their objective is to respond to what the user's query did not say,
could not say, but somehow made manifest [Furnas et al. 1983, Cutting et al. 1991].

2.3.2.1 Vector Space Model

The vector space model represents the documents and queries as vectors in a
multidimensional space, whose dimensions are the terms used to build an index to represent
the documents [Salton 1983]. The creation of an index involves lexical scanning to identify
the significant terms, where morphological analysis reduces different word forms to common
"stems", and the occurrence of those stems is computed. Query and document surrogates are
compared by comparing their vectors, using, for example, the cosine similarity measure. In
this model, the terms of a query surrogate can be weighted to take into account their
importance, and they are computed by using the statistical distributions of the terms in the
collection and in the documents [Salton 1983]. The vector space model can assign a high
ranking score to a document that contains only a few of the query terms if these terms occur
infrequently in the collection but frequently in the document. The vector space model makes
the following assumptions: 1) The more similar a document vector is to a query vector, the
more likely it is that the document is relevant to that query. 2) The words used to define the
dimensions of the space are orthogonal or independent. While it is a reasonable first
approximation, the assumption that words are pairwise independent is not realistic.
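As a concrete illustration of this matching, the cosine similarity between a query vector q and a document vector d, together with the tf-idf weighting commonly used to build these vectors, can be written as:
cos(q, d) = Σ_i (q_i × d_i) / ( sqrt(Σ_i q_i²) × sqrt(Σ_i d_i²) )
w_ij = freq(i, j) × idf_i, with idf_i = log( N / n_i )
where N is the total number of documents in the collection and n_i is the number of documents that contain term i.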

2.3.2.2 Probabilistic Model

The probabilistic retrieval model is based on the Probability Ranking Principle, which states
that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft
1992]. The principle takes into account that there is uncertainty in the representation of the
information need and the documents. There can be a variety of sources of evidence that are
used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.

We will now describe the state-of-the-art system developed by Turtle and Croft (1991) that uses
Bayesian inference networks to rank documents by using multiple sources of evidence to
compute the conditional probability
P(Info need|document) that an information need is satisfied by a given document. An
inference network consists of a directed acyclic dependency graph, where edges represent
conditional dependency or causal relations between propositions represented by the nodes.
The inference network consists of a document network, a concept representation network that
represents indexing vocabulary, and a query network representing the information need. The
concept representation network is the interface between documents and queries. To compute
the rank of a document, the inference network is instantiated and the resulting probabilities
are propagated through the network to derive a probability associated with the node
representing the information need. These probabilities are used to rank documents.

The statistical approaches have the following strengths: 1) They provide users with a
relevance ranking of the retrieved documents. Hence, they enable users to control the output
by setting a relevance threshold or by specifying a certain number of documents to display. 2)
Queries can be easier to formulate because users do not have to learn a query language and
can use natural language. 3) The uncertainty inherent in the choice of query concepts can be
represented. However, the statistical approaches have the following shortcomings: 1) They
have a limited expressive power. For example, the NOT operation can not be represented
because only positive weights are used. It can be proven that only 2^(N²) of the 2^(2^N) possible
Boolean queries can be generated by the statistical approaches that use weighted linear sums
to rank the documents. This result follows from the analysis of Linear Threshold Networks or
Boolean Perceptrons [Anthony and Biggs 1992]. For example, the very common and
important Boolean query ((A and B) or (C and D)) can not be represented by a vector space
query (see section 5.4 for a proof). Hence, the statistical approaches do not have the
expressive power of the Boolean approach. 3) The statistical approach lacks the structure to
express important linguistic features such as phrases. Proximity constraints are also difficult
to express, a feature that is of great use for experienced searchers. 4) The computation of the
relevance scores can be computationally expensive. 5) A ranked linear list provides users
with a limited view of the information space and it does not directly suggest how to modify a
query if the need arises [Spoerri 1993, Hearst 1994]. 6) The queries have to contain a large
number of words to improve the retrieval performance. As is the case for the Boolean
approach, users are faced with the problem of having to choose the appropriate words that are
also used in the relevant documents.

Table 2.4 summarizes the advantages and disadvantages that are specific to the vector space
and probabilistic model, respectively. This table also shows the formulas that are commonly
used to compute the term weights. The two central quantities used are the inverse document frequency of a term in the collection (idf), and the frequency of term i in document j (freq(i,j)). In the
probabilistic model, the weight computation also considers how often a term appears in the
relevant and irrelevant documents, but this presupposes that the relevant documents are
known or that these frequencies can be reliably estimated.
Table 2.4: summarizes the defining characteristics of the statistical retrieval approach, which includes the vector space and the probabilistic model, and lists their key advantages and disadvantages.

If users provide the retrieval system with relevance feedback, then this information is used by
the statistical approaches to recompute the weights as follows: the weights of the query terms
in the relevant documents are increased, whereas the weights of the query terms that do not
appear in the relevant documents are decreased [Salton and Buckley 1990]. There are
multiple ways of computing and updating the weights, where each has its advantages and
disadvantages. We do not discuss these formulas in more detail, because research on
relevance feedback has shown that significant effectiveness improvements can be gained by
using quite simple feedback techniques [Salton and Buckley 1990]. Furthermore, what is
important to this thesis is that the statistical retrieval approach generates a ranked list,
however how this ranking has been computed in detail is immaterial for the purpose of this
thesis.

2.3.2.3 Latent Semantic Indexing

Several statistical and AI techniques have been used in association with domain semantics to
extend the vector space model to help overcome some of the retrieval problems described
above, such as the "dependence problem" or the "vocabulary problem". One such method is
Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are
calculated and exploited in the retrieval process. The assumption is that there is some "latent"
structure in the pattern of word usage across documents and that statistical techniques can be
used to estimate this latent structure. An advantage of this approach is that queries can
retrieve documents even if they have no words in common. The LSI technique captures
deeper associative structure than simple term-to-term correlations and is completely
automatic. The only difference between LSI and vector space methods is that LSI represents
terms and documents in a reduced dimensional space of the derived indexing dimensions. As
with the vector space method, differential term weighting and relevance feedback can
improve LSI performance substantially.
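LSI is usually implemented with a truncated singular value decomposition (SVD) of the term-document matrix X:
X ≈ U_k Σ_k V_k^T
Only the k largest singular values (and the corresponding columns of U and V) are kept, and terms and documents are then compared in this reduced k-dimensional space.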

Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space
model. The four methods were the result of crossing two factors, the first factor being whether
the retrieval method used Latent Semantic Indexing or keyword matching, and the second
factor being whether the profile was based on words or phrases provided by the user (Word
profile), or documents that the user had previously rated as relevant (Document profile). The
LSI match-document profile method proved to be the most successful of the four methods.
This method combines the advantages of both LSI and the document profile. The document
profile provides a simple, but
effective, representation of the user's interests. Indicating just a few documents that are of
interest is as effective as generating a long list of words and phrases that describe one's
interest. Document profiles have an added advantage over word profiles: users can just
indicate documents they find relevant without having to generate a description of their
interests.

2.3.3 Linguistic and Knowledge-based Approaches

In the simplest form of automatic text retrieval, users enter a string of keywords that are used
to search the inverted indexes of the document keywords. This approach retrieves documents
based solely on the presence or absence of exact single word strings as specified by the
logical representation of the query. Clearly this approach will miss many relevant documents
because it does not capture the complete or deep meaning of the user's query. The Smart
Boolean approach and the statistical retrieval approaches, each in their specific way, try to
address this problem (see Table 2.5). Linguistic and knowledge-based approaches have also
been developed to address this problem by performing a morphological, syntactic and
semantic analysis to retrieve documents more effectively [Lancaster and Warner 1993]. In a
morphological analysis, roots and affixes are analyzed to determine the part of speech (noun,
verb, adjective etc.) of the words. Next complete phrases have to be parsed using some form
of syntactic analysis. Finally, the linguistic methods have to resolve word ambiguities and/or
generate relevant synonyms or quasi-synonyms based on the semantic relationships between
words. The development of a sophisticated linguistic retrieval system is difficult and it
requires complex knowledge bases of semantic information and retrieval heuristics. Hence
these systems often require techniques that are commonly referred to as artificial intelligence
or expert systems techniques.

2.3.3.1 DR-LINK Retrieval System

We will now describe in some detail the DR-LINK system developed by Liddy et al., because
it represents an exemplary linguistic retrieval system. DR-LINK is based on the principle that
retrieval should take place at the conceptual level and not at the word level. Liddy et al.
attempt to retrieve documents on the basis of what people mean in their query and not just
what they say in their query. The DR-LINK system employs sophisticated linguistic text
processing techniques to capture the conceptual information in documents. Liddy et al. have
developed a modular system that represents and matches text at the lexical, syntactic,
semantic, and the discourse levels of language. Some of the modules that have been
incorporated are: The Text Structurer is based on discourse linguistic theory that suggests that
texts of a particular type have a predictable structure which serves as an indication where
certain information can be found. The Subject Field Coder uses an established semantic
coding scheme from a machine-readable dictionary to tag each word with its disambiguated
subject code (e.g., computer science, economics) and to then produce a fixed-length, subject-
based vector representation of the document and the query. The Proper Noun Interpreter uses
a variety of processing heuristics and knowledge bases to produce: a canonical representation
of each proper noun; a classification of each proper noun into thirty-seven categories; and an
expansion of group nouns into their constituent proper noun members. The Complex
Nominal Phraser provides means for precise matching of complex semantic constructs when
expressed as either adjacent nouns or a
non-predicating adjective and noun pair. Finally, The Natural Language Query Constructor
takes as input a natural language query and produces a formal query that reflects the
appropriate logical combination of text structure, proper noun, and complex nominal
requirements of the user's information need. This module interprets a query into pattern-
action rules that translate each sentence into a first-order logic assertion, reflecting the
Boolean-like requirements of queries.

Table 2.5: characterizes the major retrieval methods in terms of how they deal with lexical, morphological, syntactic and semantic issues.

To summarize, the DR-LINK retrieval system represents content at the conceptual level rather
than at the word level to reflect the multiple levels of human language comprehension. The
text representation combines the lexical, syntactic, semantic, and discourse levels of
understanding to predict the relevance of a document. DR-LINK accepts natural language
statements, which it translates into a precise Boolean representation of the user's relevance
requirements. It also produces a summary-level, semantic vector representations of queries
and documents to provide a ranking of the documents.

2.4 Conclusion

There is a growing discrepancy between the retrieval approach used by existing commercial
retrieval systems and the approaches investigated and promoted by a large segment of the
information retrieval research community. The former is based on the Boolean or Exact
Matching retrieval model, whereas the latter ones subscribe to statistical and linguistic
approaches, also referred to as the Partial Matching approaches. First, the major criticism
leveled against the Boolean approach is that its queries are difficult to formulate. Second, the
Boolean approach makes it possible to represent structural and contextual information that
would be very difficult to represent using the statistical approaches. Third, the Partial
Matching approaches provide users with a ranked output, but these ranked lists obscure valuable information. Fourth, recent retrieval experiments have shown that the Exact and Partial Matching approaches are complementary and should therefore be combined [Belkin et al. 1993].

Table 2.6: lists some of the key problems in the field of information retrieval and possible solutions.

In Table 2.6 we summarize some of the key problems in the field of information retrieval and
possible solutions to them. We will attempt to show in this thesis: 1) how visualization can
offer ways to address these problems; 2) how to formulate and modify a query; 3) how to
deal with large sets of retrieved documents, commonly referred to as the information
overload problem. In particular, this
thesis overcomes one of the major "bottlenecks" of the Boolean approach by showing how
Boolean coordination and its diverse narrowing and broadening techniques can be visualized,
thereby making it more user-friendly without limiting its expressive power. Further, this
thesis shows how both the Exact and Partial Matching approaches can be visualized in the
same visual framework to enable users to make effective use of their respective strengths.
TEXT PREPROCESSING
Information retrieval is the task of obtaining relevant information from a large collection of documents. Preprocessing plays an important role in information retrieval for extracting the relevant information. A text preprocessing approach works in two steps: first, a spell-check utility is used to enhance stemming, and second, synonyms of similar tokens are combined. The commonly used text preprocessing techniques are:
1. Stopword Removal

Stopwords are very commonly used words in a language that play a major role in the
formation of a sentence but which seldom contribute to the meaning of that sentence. Words
that are expected to occur in 80 percent or more of the documents in a collection are typically
referred to as stopwords, and they are rendered potentially useless. Because of the
commonness and function of these words, they do not contribute much to the relevance of a
document for a query search.
Examples include words such as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at,
by, and it. Removal of stopwords from a document must be performed before indexing.
Articles, prepositions, conjunctions, and some pronouns are generally classified as
stopwords. Queries must also be preprocessed for stopword removal before the actual
retrieval process. Removal of stopwords results in elimination of possible spurious indexes,
thereby reducing the size of an index structure by about 40 percent or more. However, doing
so could impact the recall if the stopword is an integral part of a query (for example, a search
for the phrase ‘To be or not to be,’ where removal of stopwords makes the query
inappropriate, as all the words in the phrase are stopwords). Many search engines do not
employ query stopword removal for this reason.
2. Stemming

A stem of a word is defined as the word obtained after trimming the suffix and prefix
of an original word. For example, ‘comput’ is the stem word for computer, computing, and
computation. These suffixes and prefixes are very common in the English language for
supporting the notion of verbs, tenses, and plural forms. Stemming reduces the different
forms of the word formed by inflection (due to plurals or tenses) and derivation to a common
stem. A stemming algorithm can be applied to reduce any word to its stem. In English, the most famous stemming algorithm is Martin Porter's stemming algorithm. The Porter stemmer is a simplified version of Lovins' technique that uses a reduced set of about 60 rules (from 260 suffix patterns in Lovins' technique) and organizes them into sets; conflicts within one
subset of rules are resolved before going on to the next. Using stemming for preprocessing
data results in a decrease in the size of the indexing structure and an increase in recall,
possibly at the cost of precision.
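A minimal Java sketch of the two preprocessing steps just described is given below. The stopword list and the suffix-stripping rules are tiny illustrative stand-ins; a real system would use a full stopword list and a proper stemmer such as the Porter stemmer:

import java.util.*;

public class PreprocessSketch {
    // tiny illustrative stopword list; real systems use much larger lists
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "of", "to", "a", "and", "in", "is", "it"));

    // very naive suffix stripping, only to illustrate the idea of stemming
    private static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        String text = "Stemming reduces the different forms of the words in a collection";
        List<String> indexTerms = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!STOPWORDS.contains(word)) {      // stopword removal
                indexTerms.add(stem(word));       // stemming
            }
        }
        // prints: [stemm, reduce, different, form, word, collection]
        System.out.println(indexTerms);
    }
}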
3. Utilizing a Thesaurus

A thesaurus comprises a precompiled list of important concepts and the main word
that describes each concept for a particular domain of knowledge. For each concept in this
list, a set of synonyms and related words is also compiled. Thus, a synonym can be converted
to its matching concept during preprocessing. This preprocessing step assists in providing a
standard vocabulary for
indexing and searching. Usage of a thesaurus, also known as a collection of synonyms, has a
substantial impact on the recall of information systems. This process can be complicated
because many words have different meanings in different contexts. UMLS is a large
biomedical thesaurus of millions of concepts (called the Metathesaurus) and a semantic
network of meta concepts and relationships that organize the Metathesaurus. The concepts
are assigned labels from the semantic network. This thesaurus of concepts contains synonyms
of medical terms, hierarchies of broader and narrower terms, and other relationships among
words and concepts that make it a very extensive resource for information retrieval of
documents in the medical domain.
WordNet is a manually constructed thesaurus that groups words into strict synonym sets
called synsets. These synsets are divided into noun, verb, adjective, and adverb
categories. Within each category, these synsets are linked together by appropriate
relationships such as class/subclass or “is-a” relationships for nouns.
WordNet is based on the idea of using a controlled vocabulary for indexing, thereby eliminating
redundancies. It is also useful in providing assistance to users with locating terms for proper query
formulation.
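A minimal lookup sketch, assuming NLTK and its WordNet corpus have been installed (for example via nltk.download('wordnet')), shows how synonym sets and broader "is-a" terms can be retrieved during preprocessing:

```python
# Minimal WordNet lookup sketch (assumes nltk and its 'wordnet'
# corpus have been downloaded with nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# All noun synsets (synonym sets) for the word "car".
for synset in wn.synsets("car", pos=wn.NOUN):
    print(synset.name(), synset.lemma_names())

# Broader terms ("is-a" relationships) for the first sense of "car".
first_sense = wn.synsets("car", pos=wn.NOUN)[0]
print(first_sense.hypernyms())   # e.g. [Synset('motor_vehicle.n.01')]
```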
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases

Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of text may or may not be removed during preprocessing. Web search engines, however, index them in order to use this type of information in the document metadata to improve precision and recall.
Hyphens and punctuation marks may be handled in different ways. Either the entire phrase with the
hyphens/punctuation marks may be used, or they may be eliminated. In some systems, the character
representing the hyphen/punctuation mark may be removed, or may be replaced with a space.
Different information retrieval systems follow different rules of processing. Handling
hyphens automatically can be complex: it can either be done as a classification problem, or
more commonly by some heuristic rules.
Most information retrieval systems perform case-insensitive search, converting all the letters
of the text to uppercase or lowercase. It is also worth noting that many of these text
preprocessing steps are language specific, such as involving accents and diacritics and the
idiosyncrasies that are associated with a particular language.
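The following is a minimal sketch of such normalization (case folding, hyphen and punctuation handling, and optional removal of standalone digits); the exact rules are system-specific, so the choices below are only illustrative:

```python
import re

def normalize(text: str, keep_digits: bool = True) -> str:
    """Illustrative normalization: lowercase, replace hyphens with
    spaces, strip remaining punctuation, optionally drop bare digits."""
    text = text.lower()                      # case folding
    text = text.replace("-", " ")            # hyphens -> space
    text = re.sub(r"[^\w\s]", " ", text)     # drop punctuation marks
    if not keep_digits:
        text = re.sub(r"\b\d+\b", " ", text) # drop standalone digits
    return re.sub(r"\s+", " ", text).strip() # collapse whitespace

print(normalize("State-of-the-art IR systems, est. 1992!", keep_digits=False))
# 'state of the art ir systems est'
```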
5. Information Extraction

Information extraction (IE) is a generic term used for extracting structured content from text. Text analytic tasks such as identifying noun phrases, facts, events, people, places, and relationships are examples of IE tasks. These tasks are also called named entity recognition tasks and use rule-based approaches with either a thesaurus, regular expressions and grammars, or probabilistic approaches. For IR and search applications, IE technologies
are mostly used to identify contextually relevant features that involve text analysis, matching,
and categorization for improving the relevance of search systems. Language technologies
using part-of-speech tagging are applied to semantically annotate the documents with
extracted features to aid search relevance.
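As a toy illustration of the rule-based flavor of IE (not a production named-entity recognizer), the sketch below uses regular expressions to pull two simple "entity" types, e-mail addresses and dates, out of raw text:

```python
import re

# Toy rule-based information extraction: regular expressions for two
# simple "entity" types. Real IE systems combine many such rules with
# thesauri, grammars, or probabilistic models.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def extract_entities(text: str) -> dict:
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }

sample = "Contact jane.doe@example.com before 12/05/2024 about the merger."
print(extract_entities(sample))
# {'emails': ['jane.doe@example.com'], 'dates': ['12/05/2024']}
```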
Inverted Index

An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple terms, it is a hashmap-like data structure that directs you from a word to a document or a web page.
There are two types of inverted indexes:
A record-level inverted index contains a list of references to documents for each word.
A word-level inverted index additionally contains the positions of each word within a
document. The latter form offers more functionality, but needs more processing power and
space to be created.

Suppose we want to index the texts “hello everyone”, “this article is based on inverted index”, and “which is hashmap like data structure”. If we index by (document number, word position within the document), the index with locations is:

hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” appears in document 1 (“hello everyone”) at word position 1, so it has the entry (1, 1); the word “is” appears in documents 2 and 3 at the 3rd and 2nd word positions, respectively (here the position is counted in words).
The index may have weights, frequencies, or other indicators.
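A minimal Python sketch of such a word-level (positional) inverted index for the three example texts above; the whitespace-based tokenizer here is a simplification:

```python
from collections import defaultdict
import re

documents = {
    1: "hello everyone",
    2: "this article is based on inverted index",
    3: "which is hashmap like data structure",
}

# word -> list of (document id, word position within the document)
index: dict[str, list[tuple[int, int]]] = defaultdict(list)

for doc_id, text in documents.items():
    tokens = re.findall(r"\w+", text.lower())   # simplistic tokenizer
    for position, token in enumerate(tokens, start=1):
        index[token].append((doc_id, position))

print(index["hello"])   # [(1, 1)]
print(index["is"])      # [(2, 3), (3, 2)]
```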
Steps to build an inverted index:

Fetch the Document


Removal of Stop Words: Stop words are the most frequently occurring, low-information words in a document, such as “I”, “the”, “we”, “is”, and “an”.

Stemming of Root Word


Suppose we want to search for “cat” and retrieve documents with information about it, but the documents use the words “cats” or “catty” instead of “cat”. To relate these words, each word is trimmed so that its “root word” (stem) is obtained. There are standard tools for performing this, such as Porter’s Stemmer.

Record Document IDs


If the word is already present in the index, add a reference to the current document to its entry; otherwise create a new entry. Additional information such as the frequency and location of the word can also be stored.
Example:

Words Document
ant doc1
demo doc2
world doc1, doc2

Advantages of an inverted index:

● It allows fast full-text searches, at the cost of increased processing when a document is added to the database.
● It is easy to develop.
● It is the most popular data structure used in document retrieval systems, used on a large scale, for example, in search engines.
An inverted index also has disadvantages:

● Large storage overhead and high maintenance costs on update, delete and insert.
Evaluative Measures:

DEFINITION for web search:


An internet search, otherwise known as a search query, is an entry into a search engine that
yields both paid and organic results. The paid results are the ads that appear at the top and the
bottom of the page, and they are marked accordingly. The organic results are the unmarked
results that appear in between the ads.
At the core of an internet search is a keyword. In turn, keywords are at the heart of search engine marketing (SEM) and search engine optimization (SEO).

What is web analytics?


Web analytics is the process of analyzing the behavior of visitors to a website. This involves
tracking, reviewing and reporting data to measure web activity, including the use of a website
and its components, such as webpages, images and videos.

Data collected through web analytics may include traffic sources, referring sites, page views,
paths taken and conversion rates. The compiled data often forms a part of customer
relationship management analytics (CRM analytics) to facilitate and streamline better
business decisions.

Web analytics enables a business to retain customers, attract more visitors and increase the
dollar volume each customer spends.

Analytics can help in the following ways:

● Determine the likelihood that a given customer will repurchase a product after purchasing it in the past.
● Personalize the site to customers who visit it repeatedly.
● Monitor the amount of money individual customers or specific groups of customers spend.
● Observe the geographic regions from which the most and the least customers visit the site and purchase specific products.
● Predict which products customers are most and least likely to buy in the future.
The objective of web analytics is to serve as a business metric for promoting specific products
to the customers who are most likely to buy them and to determine which products a specific
customer is most likely to purchase. This can help improve the ratio of revenue to marketing
costs.
In addition to these features, web analytics may track the clickthrough and drilldown
behavior of customers within a website, determine the sites from which customers most often
arrive, and communicate with browsers to track and analyze online behavior. The results of
web analytics are provided in the form of tables, charts and graphs.

Web analytics process


The web analytics process involves the following steps:

Setting goals. The first step in the web analytics process is for businesses to determine goals
and the end results they are trying to achieve. These goals can include increased sales,
customer satisfaction and brand awareness. Business goals can be both quantitative and
qualitative.
Collecting data. The second step in web analytics is the collection and storage of data.
Businesses can collect data directly from a website or web analytics tool, such as Google
Analytics. The data mainly comes from Hypertext Transfer Protocol requests -- including
data at the network and application levels -- and can be combined with external data to
interpret web usage. For example, a user's Internet Protocol address is typically associated
with many factors, including geographic location and clickthrough rates.
Processing data. The next stage of the web analytics funnel involves businesses processing
the collected data into actionable information.
Identifying key performance indicators (KPIs). In web analytics, a KPI is a quantifiable measure used to monitor and analyze user behavior on a website. Examples include bounce rates, unique users, user sessions and on-site search queries (a minimal computation of two such KPIs is sketched after this list).
Developing a strategy. This stage involves implementing insights to formulate strategies that
align with an organization's goals. For example, search queries conducted on-site can help an
organization develop a content strategy based on what users are searching for on its website.
Experimenting and testing. Businesses need to experiment with different strategies in order
to find the one that yields the best results. For example, A/B testing is a simple strategy to help
learn how an audience responds to different content. The process involves creating two or
more versions of content and then displaying it to different audience segments to reveal
which version of the content performs better.
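A minimal sketch of the KPI step, using made-up session records rather than any real analytics API, showing how bounce rate and unique users could be computed from collected data:

```python
# Hypothetical session records; a real system would derive these from
# HTTP logs or a web analytics tool rather than hard-coded dictionaries.
sessions = [
    {"user": "u1", "pages_viewed": 1},   # single-page session -> a bounce
    {"user": "u2", "pages_viewed": 4},
    {"user": "u1", "pages_viewed": 2},
    {"user": "u3", "pages_viewed": 1},   # another bounce
]

unique_users = len({s["user"] for s in sessions})
bounces = sum(1 for s in sessions if s["pages_viewed"] == 1)
bounce_rate = bounces / len(sessions)

print(f"unique users: {unique_users}")    # 3
print(f"bounce rate:  {bounce_rate:.0%}") # 50%
```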

What are the two main categories of web analytics?


The two main categories of web analytics are off-site web analytics and on-site web analytics.

Off-site web analytics


The term off-site web analytics refers to the practice of monitoring visitor activity outside of an
organization's website to measure potential audience. Off-site web analytics provides an
industrywide analysis that gives insight into how a business is performing in comparison to
competitors. It refers to the type of analytics that focuses on data collected from across the
web, such as social media, search engines and forums.

On-site web analytics


On-site web analytics refers to a narrower focus that uses analytics to track the activity of
visitors to a specific site to see how the site is performing. The data gathered is usually more
relevant to a site's owner and can include details on site engagement, such as what content is
most popular. Two technological approaches to on-site web analytics include log file analysis
and page tagging.

Log file analysis, also known as log management, is the process of analyzing data gathered
from log files to monitor, troubleshoot and report on the performance of a website. Log files
hold records of virtually every action taken on a network server, such as a web server, email
server, database server or file server.
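As an illustrative sketch of log file analysis, the snippet below parses two hypothetical access-log lines in the common log format and counts successful requests per page; the log lines and field layout are assumptions, not output from any particular server:

```python
import re
from collections import Counter

# Two hypothetical access-log lines in common log format (assumed layout).
log_lines = [
    '10.0.0.1 - - [05/Mar/2024:10:12:01 +0000] "GET /index.html HTTP/1.1" 200 5123',
    '10.0.0.2 - - [05/Mar/2024:10:12:07 +0000] "GET /products HTTP/1.1" 200 2048',
]

LOG_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$')

page_views = Counter()
for line in log_lines:
    match = LOG_RE.match(line)
    if match and match.group("status") == "200":
        page_views[match.group("path")] += 1

print(page_views)   # Counter({'/index.html': 1, '/products': 1})
```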

Page tagging is the process of adding snippets of code into a website's HyperText Markup
Language code using a tag management system to track website visitors and their
interactions across the website. These snippets of code are called tags. When businesses add
these tags to a website, they can be used to track any number of metrics, such as the number
of pages viewed, the number of unique visitors and the number of specific products viewed.
Web analytics tools
Web analytics tools report important statistics on a website, such as where visitors came from,
how long they stayed, how they found the site and their online activity while on the site. In
addition to web analytics, these tools are commonly used for product analytics, social media
analytics and marketing analytics.

Some examples of web analytics tools include the following:

Google Analytics. Google Analytics is a web analytics platform that monitors website traffic,
behaviors and conversions. The platform tracks page views, unique visitors, bounce rates,
referral Uniform Resource Locators, average time on-site, page abandonment, new vs.
returning visitors and demographic data.
Optimizely. Optimizely is a customer experience and A/B testing platform that helps
businesses test and optimize their online experiences and marketing efforts, including
conversion rate optimization.
Kissmetrics. Kissmetrics is a customer analytics platform that gathers website data and
presents it in an easy-to-read format. The platform also serves as a customer intelligence tool,
as it enables businesses to dive deeper into customer behavior and use this information to
enhance their website and marketing campaigns.
Crazy Egg. Crazy Egg is a tool that tracks where customers click on a page. This information
can help organizations understand how visitors interact with content and why they leave the
site. The tool tracks visitors, heatmaps and user session recordings.

Current Trends in Information Retrieval & Web Search:


What is Information Retrieval?
□ Information Retrieval (IR) can be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information.
□ Information retrieval is the activity of obtaining material (usually documents) of an unstructured nature, i.e. usually text, that satisfies an information need from within large collections stored on computers.
□ For example, information retrieval takes place when a user enters a query into the system.
▪ In MongoDB, for instance, a user can work from the command line with shell commands: launching the MongoDB shell to get started, "show dbs" to list all the databases, and "use StudentDetails" (the name of our database) to switch to a database that is already in the list, or to start working with a new database.
□ Not only librarians and professional searchers engage in the activity of information retrieval; nowadays hundreds of millions of people engage in IR every day when they use web search engines.
□ Information retrieval is believed to be the dominant form of information access (much like accessing a database by issuing "use <database name>" in the MongoDB shell).
□ The IR system assists the users in finding the information they require but it
does not explicitly return the answers to the question.
□ It notifies the user of the existence and location of documents that might contain the required information.
□ Information retrieval also extends support to users in browsing or filtering
document collection or processing a set of retrieved documents.
□ The system searches over billions of documents stored on millions of computers.
□ Email programs provide a spam filter and manual or automatic means of classifying mail so that it can be placed directly into particular folders.
□ An IR system has the ability to represent, store, organize, and access information items.

□ A set of keywords is required to search.

□ Keywords are what people search for in search engines. (For example, if we want notes on information retrieval, we might type "Information Retrieval in Advanced Database Technology" into a search engine; from this long sentence the search engine picks out the keywords, checks its servers, and responds with relevant links to what we requested. Ranking well for such keywords is the job of SEO, Search Engine Optimization: an organization targets a set of keywords so that its content and web pages appear at the top of the results whenever users search for those keywords. For instance, typing "geek" typically shows geeksforgeeks.org as one of the top links because that organization has invested in SEO and effectively reserved that set of keywords; SEO is usually an ongoing, well-paid job, and it is a separate topic we will see later.)
□ These keywords summarize the description of the information.

Difference Between Information Retrieval and Data Retrieval:


S.No | Information Retrieval | Data Retrieval
1. | A software system that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information (e.g., web search/browsing). | Deals with obtaining data from a database management system such as an ODBMS; it is the process of identifying and retrieving data from a database based on a query provided by the user or application (e.g., accessing MongoDB via the command line with commands such as "use <db name>", "show dbs", etc.).
2. | Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
3. | Small errors are likely to go unnoticed. | A single erroneous object means total failure.
4. | Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
5. | Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
6. | The results obtained are approximate matches. | The results obtained are exact matches.
7. | Results are ordered by relevance. | Results are not ordered by relevance.
8. | It is a probabilistic model. | It is a deterministic model.

Past, Present, and Future of Information Retrieval:

1. Early Developments: ◻ As the need for information grew, it became necessary to build data structures that allow faster access.
□ The index is the data structure used for faster retrieval of information (for example, binary search trees, red-black trees, and similar structures).
□ For centuries, indexes were built by manually categorizing items into hierarchies (much as employees in an organization are ranked by position: owner/project manager → team leader → team members).

2. Information Retrieval in Libraries: ◻ Libraries were the first to adopt IR systems for information retrieval.
□ The first generation consisted of the automation of earlier technologies, and search was based on author name and title.
□ The second generation added searching by subject heading, keywords, etc.
□ The third generation introduced graphical interfaces, electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: ◻ The web is cheaper than many other sources of information, it provides greater access through networks thanks to digital communication, and it gives free access to publish on a larger medium.
Web Trends in The Coming Years:
□ When the Internet was introduced back in the 1980s, the sole purpose of it was
to communicate data locally on an inter-connected wired network for research
purposes.
□ Since then, it has expanded and evolved in bits and pieces.

□ The internet now holds a very strong place in our lives, and without it, our
lives seem impossible.
□ The internet of today runs in all the domains of our life from a simple search to sectors like

education, economy, business, healthcare and much more.


□ The internet and web technologies we see today are the result of hard work and the strong vision that engineers and technology enthusiasts laid out a decade ago.
□ With that being taken into account, some of the forecasts about the Web technologies are:

I. With the rapid development of technology, data volumes are expected to


increase at an exponential rate by 2030, and this will bring in more traffic from
across the globe.
II. The expenditure per capita will increase, and this will boost the volume of searches, in turn creating a huge market for information.
III. It will become possible to make searches in any form, not only from the keyboard or through voice search.
IV. As search volume increases, new technologies such as gesture search and automatic interpretation of user intent will develop, thus reducing the work on the user's side.
V. Hardware devices will become more portable.
VI. Not only computers and smartphones but also the objects around us will be smart devices.
VII. A common example here is the use of IoT (Internet of Things).
VIII. The web will be made much more easily accessible.
IX. A heavy amount is funded by organizations for research into this field of
smart devices.

❖ The Web trends that will catch on in the coming years are:

1. WebRTC.
2. Internet of Things (IoT)

3. Progressive Web Apps


4. Social Networking via Virtual Reality.
❖ WebRTC ◻ Known as Web Real-Time Communication, it is an open framework for the web that is widely supported on many browsers and platforms such as Google Chrome, Mozilla Firefox, Android, and iOS.
? Using this framework, users can hold video conferences, share files and desktops, and interact in real time without the use of external web plugins.
? WebRTC can be used in the sector of online education & E-meetings.
□ The use of MOOCs (Massive Open Online Courses) has made WebRTC a

very essential framework.


? The use of WebRTC would enable a better online learning experience and would
break the boundaries for education to be transmitted to everyone.
□ As of now, there are already many platforms that provide online education, which
helps thousands of students.
□ WebRTC is continuously being improved to achieve a better end-user experience, and many recent developments have made the WebRTC framework usable on older devices and in offline mode, which enables even more users to benefit.
? The concept of cloud-conferencing and E-meetings in the corporate sector is
only possible with the use of WebRTC.
□ Clients and employees save an ample amount of time by meeting online.

❖ Internet of Things (IoT) ◻ The Internet of Things is considered the backbone of the modern internet, as it is only through IoT that consumers, governments, and businesses will be able to interact with the physical world.
□ Thus, this will help the problems to be solved in a much better and engaging way.

□ The vision of an advanced and closely operated internet system cannot be visualized
without the use of these smart devices.
□ These smart devices need not necessarily be computerized devices but can also be
non-computerized devices such as fan, fridge, air conditioner, etc.
□ These devices will be given the potential to create user-specific data that can be
optimized for better user experience and to increase human productivity.
? The goal of IoT is to form a network of internet-connected devices, which
can interact internally for better usage.
? Many developed countries have already started using IoT, and a common
example is the use of light sensors in public places.
? Whenever a vehicle or object passes along the road, the first street light will turn on and trigger all the other lights on that road, which are internally connected, thus creating a smarter, energy-saving model.
? Around 35 billion devices are connected to the internet in 2020, and the
number of connections to the internet is expected to go up to 50 billion by
2030.
? Thus, IOT proves to be one of the emerging web technologies in the coming decades.

❖ Progressive Web Apps ◻ The smartphones we use today are loaded with apps, and users choose to download or remove any app depending on their liking.
? But what if, we do not have to download or remove any app to use its services?
? The idea behind progressive web apps is much similar to this.
? Such apps would cover the screen of the smartphones and would enable us to
use or try any app upon our liking without actually downloading it.
□ It is a combination of web and app technology to give the user a much smoother

experience.
? The advantage of using progressive web apps is that users will not face the hassle of downloading and updating the app from time to time, thus saving data.
□ Also, the app companies would not need to release the app for every updated version.

□ This would also eliminate the complexity of creating responsive apps, as progressive web apps can be used on any device and will give the same experience regardless of screen size.
? Further development into progressive web apps can also enable users to use it in
an offline mode, thus paving a way for those who are not connected with the
internet.
□ The ease of use and availability will increase thus benefiting the user and making
life much simpler.
□ A very common example is the ‘Try Now’ feature in the Google Play store for specific apps; it uses more or less the same idea as progressive web apps to run the app without actually downloading it.

❖ Social Networking via Virtual Reality ◻ The rise of virtual reality in the last few years is due to its ability to bridge the gap between the real and the virtual.
? The same idea of virtual reality is now being considered for use with social networking.
? Social networking, i.e. interacting with people over long distances, forms the base, and virtual reality is layered on top of it.
□ Social networking sites are devising ways so that users do not confine themselves to communicating online but are also given access to the world of virtual reality.
□ Video calling and conferencing will no longer remain a purely visual experience but will be changed into a complete 360-degree experience.
□ The user will be able to feel much more than plain communication and can interact in a much better way.
? Mixing social networking with virtual reality might be challenging, but the kind of user experience one could get will be amazing.
? The world’s largest social networking company, Facebook, started developing such a platform back in 2014 and successfully created a virtual environment where users were not just able to communicate but could also feel their surroundings; however, the platform has not been opened to the public yet.
□ These Web trends will arrive in the coming years, and the availability of these technologies will once again prove that the internet is not stagnant and is always improving to provide a better user experience.
□ Improving these technologies will make the internet take an even more essential place in our lives, just as it has now.
