ADT Unit 1 To 5
● It is used in application areas where large volumes of data are processed and
accessed by numerous users simultaneously.
● It is designed for heterogeneous database platforms.
● It maintains confidentiality and data integrity of the databases.
Reliability: In a distributed database system, if one site fails or stops working for some time, another site can complete the task.
Availability: In a distributed database system, availability can be achieved even if a server fails, since another system is available to serve the client requests.
Performance: Performance can be improved by distributing the database over different locations, so the data is available at every location, which makes it easier to maintain.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database −
● Autonomous − Each database is independent that functions on its own. They are
integrated by a controlling application and use message passing to share data updates.
● Non-autonomous − Data is distributed across the homogeneous nodes and a central
or master DBMS co-ordinates data updates across the sites.
● A client server architecture has a number of clients and a few servers connected in a
network.
● A client sends a query to one of the servers. The earliest available server solves it
and replies.
● Client-server architecture is simple to implement and execute due to its centralized server system.
2. Collaborating server architecture.
3. Middleware architecture.
● Middleware architectures are designed in such a way that single query is executed on
multiple servers.
● This system needs only one server which is capable of managing queries and
transactions from multiple servers.
● Middleware architecture uses local servers to handle local queries and transactions.
● The software used for the execution of queries and transactions across one or more independent database servers is called middleware.
What is fragmentation?
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the
table are called fragments. Fragmentation can be of three types: horizontal, vertical, and
hybrid (combination of horizontal and vertical). Horizontal fragmentation can further be
classified into two techniques: primary horizontal fragmentation and derived horizontal
fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called “reconstructiveness.”
● The process of dividing the database into multiple smaller parts is called fragmentation.
● These fragments may be stored at different locations.
● The data fragmentation process should be carried out in such a way that the reconstruction of the original database from the fragments is possible.
Advantages of Fragmentation
● Since data is stored close to the site of usage, efficiency of the database system is
increased.
● Local query optimization techniques are sufficient for most queries since data is
locally available.
● Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.
Disadvantages of Fragmentation
● When data from different fragments are required, the access speeds may be very low.
● In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
● Lack of back-up copies of data in different sites may render the database ineffective
in case of failure of a site.
1. Horizontal Fragmentation
Example: consider an Account table with the columns Acc_No, Balance and Branch_Name, for instance:
Acc_No Balance Branch_Name
A_102 10,000 Baroda
A_103 25,000 Delhi
For the above table we can define any simple condition like Branch_Name = 'Pune', Branch_Name = 'Delhi', or Balance < 50,000.
Fragmentation1:
SELECT * FROM Account WHERE Branch_Name= 'Pune' AND Balance < 50,000
Fragmentation2:
SELECT * FROM Account WHERE Branch_Name= 'Delhi' AND Balance < 50,000
Fragmentation1:
SELECT * FROM Account WHERE Branch_Name= 'Baroda' AND Balance < 50,000
Fragmentation2:
SELECT * FROM Account WHERE Branch_Name= 'Delhi' AND Balance < 50,000
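As a minimal sketch of the reconstructiveness requirement, suppose the horizontal fragments above are stored as tables Account_Pune, Account_Delhi and Account_Baroda (the fragment names are illustrative). The rows they contain can be put back together with a union:

SELECT * FROM Account_Pune
UNION ALL
SELECT * FROM Account_Delhi
UNION ALL
SELECT * FROM Account_Baroda;
-- UNION ALL is safe here because the fragmentation conditions select
-- different branches, so no account row appears in more than one fragment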
2. Vertical Fragmentation
Example:
Fragmentation1:
SELECT Acc_No FROM Account
Fragmentation2:
SELECT Acc_No, Balance FROM Account
(The key column Acc_No is kept in every vertical fragment so that the original table can be reconstructed.)
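As a hedged illustration of reconstructing a vertically fragmented table, suppose the Account table were split into Account_V1(Acc_No, Branch_Name) and Account_V2(Acc_No, Balance); the table names and the exact split are assumptions. Because both fragments carry the key Acc_No, the original table is recovered with a join:

SELECT V1.Acc_No, V2.Balance, V1.Branch_Name
FROM Account_V1 V1
JOIN Account_V2 V2 ON V1.Acc_No = V2.Acc_No;
-- every vertical fragment keeps the primary key, so joining on Acc_No
-- reassembles the complete rows of the original Account table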
3) Hybrid Fragmentation (on an Employee table)
Fragmentation1:
SELECT Emp_Name FROM Employee WHERE Emp_Age < 40
Fragmentation2:
SELECT Emp_Id FROM Employee WHERE Emp_Address = 'Pune' AND Salary < 14000
Data replication is the process in which the data is copied at multiple locations (Different
computers or servers) to improve the availability of data.
Data replication is the process of storing separate copies of the database at two or more sites.
It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
● Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
● Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime hours.
Data updating can be done at non-prime hours.
● Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
● Simpler Transactions − Transactions require fewer joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.
Disadvantages of Data Replication
● Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
● Increased Cost and Complexity of Data Updating − Each time a data item is
updated, the update needs to be reflected in all the copies of the data at the different
sites. This requires complex synchronization techniques and protocols.
● Undesirable Application – Database coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex co-ordination at application
level. This results in undesirable application – database coupling.
1. Synchronous Replication:
In synchronous replication, the replica is modified immediately after changes are made to the relation, so there is no difference between the original data and the replica.
2. Asynchronous Replication:
In asynchronous replication, the replica is modified after the commit is issued on the database, so the replica may lag behind the original for some time.
Replication Schemes
1. Full Replication
In this design alternative, a copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast, requiring negligible communication cost. On the contrary, the massive redundancy in data requires huge cost
during update operations. Hence, this is suitable for systems where a large number of queries
is required to be handled whereas the number of database updates is low.
In full replication scheme, the database is available to almost every location or user in
communication network.
2. No Replication
In this design alternative, different tables are placed at different sites. Data is placed so that it
is at a close proximity to the site where it is used most. It is most suitable for database
systems where the percentage of queries needed to join information in tables placed at
different sites is low.
If an appropriate distribution strategy is adopted, then this design alternative helps to reduce
the communication cost during data processing.
Advantages of no replication
● Concurrency control is simpler, since only a single copy of each data item exists.
● Easy recovery of data.
Disadvantages of no replication
● Poor availability of data.
● Slows down the query execution process, as multiple clients are accessing the same
server.
3. Partial replication
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in accordance with the frequency of access. This takes into consideration the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how frequently the access queries execute and the sites which generate the access queries.
Partial replication means only some fragments are replicated from the database.
Query Processing in Distributed Databases
Various factors which are considered while processing a query are as follows:
● Cost of data transfer – This is a very important factor while processing queries. The intermediate data is transferred to other locations for data processing, and the final result is sent to the location where the actual query is being processed.
● The cost of data transfer increases if the locations are connected via a high-performance communication channel.
● The DDBMS query optimization algorithms are used to minimize the cost of data transfer.
States of a Transaction
A transaction passes through several states during its execution: active, partially committed, committed, failed and aborted.
● Active − The initial state where the transaction enters is the active state. The
transaction remains in this state while it is executing read, write or other operations.
● Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
● Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
● Failed − The transaction goes from partially committed state or active state to failed
state when it is discovered that normal execution can no longer proceed or system
checks fail.
● Aborted − This is the state after the transaction has been rolled back after failure and the database has been restored to the state it was in before the transaction began.
Desirable Properties of Transactions
Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.
● Atomicity − This property states that a transaction is an atomic unit of
processing, that is, either it is performed in its entirety or not performed at all.
No partial update should exist.
● Consistency − A transaction should take the database from one consistent state
to another consistent state. It should not adversely affect any data item in the
database.
● Isolation − A transaction should be executed as if it is the only one in the system.
There should not be any interference from the other concurrent transactions that
are simultaneously running.
● Durability − If a committed transaction brings about a change, that change
should be durable in the database and not lost in case of any failure.
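As a minimal sketch of these properties, consider a transfer between two accounts of the Account table used earlier; the account numbers and amount are illustrative, and the exact transaction-control syntax varies slightly between DBMSs:

START TRANSACTION;
-- Atomicity: either both updates take effect or neither does
UPDATE Account SET Balance = Balance - 5000 WHERE Acc_No = 'A_103';
UPDATE Account SET Balance = Balance + 5000 WHERE Acc_No = 'A_102';
-- Consistency: the total balance over the two accounts is unchanged
-- Isolation: concurrent transactions do not observe the intermediate state
COMMIT;
-- Durability: once COMMIT succeeds the change survives failures;
-- a failure before COMMIT rolls both updates back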
Distributed Transactions
For example:
Consider that location A sends a message to location B and expects a response from B, but B is unable to receive it. Several problems can arise in this situation.
COMMIT PROTOCOL
In a local database system, for committing a transaction, the transaction manager has to only
convey the decision to commit to the recovery manager. However, in a distributed system,
the transaction manager should convey the decision to commit to all the servers in the
various sites where the transaction is being executed and uniformly enforce the decision.
When processing is complete at each site, it reaches the partially committed transaction state
and waits for all other transactions to reach their partially committed states. When it receives
the message that all the sites are ready to commit, it starts to commit. In a distributed system,
either all sites commit or none of them does.
The different distributed commit protocols are −
● One-phase commit
● Two-phase commit
● Three-phase commit
Distributed One-phase Commit
Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a
controlling site and a number of slave sites where the transaction is being executed. The
steps in distributed commit are −
● After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site.
● The slaves wait for “Commit” or “Abort” message from the controlling site. This
waiting time is called window of vulnerability.
● When the controlling site receives “DONE” message from each slave, it makes a
decision to commit or abort. This is called the commit point. Then, it sends this
message to all the slaves.
● On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.
Distributed Two-phase Commit
● Two-phase commit reduces the vulnerability of the one-phase commit protocol, for example in the case of a local network failure.
● Two-phase commit protocol provides automatic recovery mechanism in case of a
system failure.
● The location at which the original transaction takes place is called the coordinator, and the locations where the sub-processes take place are called cohorts.
Commit request (prepare) phase:
In the commit-request phase, the coordinator attempts to prepare all cohorts and takes the necessary steps to commit or terminate the transaction.
Commit phase:
In the commit phase, based on the voting of the cohorts, the coordinator decides whether to commit or terminate the transaction.
The steps performed in the two phases are as follows −
Phase 1: Prepare Phase
● After each slave has locally completed its transaction, it sends a “DONE”
message to the controlling site. When the controlling site has received “DONE”
message from all slaves, it sends a “Prepare” message to the slaves.
● The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.
● A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a
timeout.
Phase 2: Commit/Abort Phase
● After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.
o When the controlling site receives “Commit ACK” message from all the slaves,
it considers the transaction as committed.
● After the controlling site has received the first “Not Ready” message from any
slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it
considers the transaction as aborted.
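As a hedged sketch of how the two phases can look in SQL, PostgreSQL exposes two-phase commit through the PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED commands; a controlling site would drive something like the following at each slave site (the transaction identifier 'txn_42' and the update are illustrative):

-- Phase 1 (prepare), executed at every participating site:
BEGIN;
UPDATE Account SET Balance = Balance - 5000 WHERE Acc_No = 'A_102';
PREPARE TRANSACTION 'txn_42';   -- the site votes "Ready" by preparing successfully

-- Phase 2 (commit/abort), decided by the controlling site after all votes arrive:
COMMIT PREPARED 'txn_42';       -- sent when every site answered "Ready"
-- ROLLBACK PREPARED 'txn_42';  -- sent instead if any site answered "Not Ready"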
Distributed Three-phase Commit
Phase 3: Commit / Abort Phase
The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK” message is not required.
Some problems which occur while accessing the database are as follows:
4. Distributed commit
While committing a transaction which accesses databases stored at multiple locations, if a failure occurs at some location during the commit process, the problem is called the distributed commit problem.
5. Distributed deadlock
Deadlock can occur at several locations due to recovery and concurrency problems (multiple locations accessing the same system in the communication network).
There are three different ways of maintaining a distinguished copy of data, by applying:
1) Lock based protocol
A lock is applied to avoid concurrency problems between two transactions, in such a way that the lock is held by one transaction and the other transaction can access the data only when the lock is released. The lock can be applied on write or read operations. It is an important method for controlling concurrent access to data.
2) Shared lock system (Read lock)
A transaction can acquire a shared lock on a data item to read its content. The lock is shared in such a way that any other transaction can also acquire a shared lock on the same data item for reading purposes.
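A minimal sketch of shared and exclusive locking in SQL, reusing the Account table from earlier; the statements follow PostgreSQL-style row-locking syntax, which differs slightly in other DBMSs:

-- Transaction T1: shared (read) lock; other transactions may read the same row
START TRANSACTION;
SELECT Balance FROM Account WHERE Acc_No = 'A_102' FOR SHARE;
COMMIT;   -- releases the shared lock

-- Transaction T2: exclusive (write) lock taken before the update; any other
-- transaction requesting a lock on this row must wait until T2 commits
START TRANSACTION;
SELECT Balance FROM Account WHERE Acc_No = 'A_102' FOR UPDATE;
UPDATE Account SET Balance = Balance - 1000 WHERE Acc_No = 'A_102';
COMMIT;   -- releases the exclusive lock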
UNIT-II
Active Database
An active database is a database consisting of a set of triggers. Such databases are very difficult to maintain because of the complexity that arises in understanding the effect of these triggers. In such a database, the DBMS initially verifies whether the particular trigger (specified in the statement that modifies the database) is activated or not, prior to executing the statement.
If the trigger is active, the DBMS executes the condition part and then executes the action part only if the specified condition evaluates to true. It is possible for more than one trigger to be activated by a single statement.
In such a situation, the DBMS processes the triggers in an arbitrary order. The execution of the action part of a trigger may activate other triggers, or the same trigger that initiated the action. A trigger that activates itself is called a ‘recursive trigger’. The DBMS executes such chains of triggers in some pre-defined manner, but this makes the overall behaviour harder to understand.
1. It possesses all the concepts of a conventional database, i.e. data modelling facilities, query
language etc.
2. It supports all the functions of a traditional database like data definition, data
manipulation, storage management etc.
3. It supports definition and management of ECA rules.
Advantages:
2. Enable a uniform and centralized description of the business rules relevant to the
information system.
4. Suitable platform for building large and efficient knowledge base and expert systems.
The model that has been used to specify active database rules is referred to as the Event-
Condition-Action (ECA) model. A rule in the ECA model has three components:
1. The event(s) that triggers the rule: These events are usually database update operations that are
explicitly applied to the database. However, in the general model, they could also be temporal
events or other kinds of external events.
2. The condition that determines whether the rule action should be executed: Once the triggering
event has occurred, an optional condition may be evaluated. If no condition is specified, the action
will be executed once the event occurs. If a condition is specified, it is first evaluated, and only if
it evaluates to true will the rule action be executed.
3. The action to be taken: The action is usually a sequence of SQL statements, but it could also
be a database transaction or an external program that will be automatically executed.
Let us consider some examples to illustrate these concepts. The examples are based on a much simplified variation of the COMPANY database application from Figure 3.5, which is shown in Figure 26.1, with each employee having a name (Name), Social
Security number (Ssn), salary (Salary), department to which they are currently assigned (Dno, a
foreign key to DEPARTMENT), and a direct supervisor (Supervisor_ssn, a (recursive) foreign key
to EMPLOYEE). For this example, we assume that NULL is allowed for Dno, indicating that an employee may be temporarily unassigned to any department. Each department has a name
(Dname), number (Dno), the total salary of all employees assigned to the department (Total_sal),
and a manager (Manager_ssn, which is a foreign key to EMPLOYEE).
Notice that the Total_sal attribute is really a derived attribute, whose value should be the sum of
the salaries of all employees who are assigned to the particular department. Maintaining the correct
value of such a derived attribute can be done via an active rule. First we have to determine
the events that may cause a change in the value of Total_sal, which are as follows:
1. Inserting (one or more) new employee tuples
2. Changing the salary of (one or more) existing employees
3. Changing the assignment of existing employees from one department to another
4. Deleting (one or more) employee tuples
In the case of event 1, we only need to recompute Total_sal if the new employee is immediately
assigned to a department—that is, if the value of the Dno attribute for the new employee tuple is
not NULL (assuming NULL is allowed for Dno). Hence, this would be the condition to be
checked. A similar condition could be checked for event 2 (and 4) to determine whether the
employee whose salary is changed (or who is being deleted) is currently assigned to a department.
For event 3, we will always execute an action to maintain the value of Total_sal correctly, so no condition is needed in this case.
The action for events 1, 2, and 4 is to automatically update the value of Total_sal for the
employee’s department to reflect the newly inserted, updated, or deleted employee’s salary. In the
case of event 3, a twofold action is needed: one to update the Total_sal of the employee’s old
department and the other to update the Total_sal of the employee’s new department.
The four active rules (or triggers) R1, R2, R3, and R4—corresponding to the above situation—can
be specified in the notation of the Oracle DBMS as shown in Figure 26.2(a). Let us consider rule
R1 to illustrate the syntax of creating triggers in Oracle.
The CREATE TRIGGER statement specifies a trigger (or active rule) name Total_sal1 for R1.
The AFTER clause specifies that the rule will be triggered after the events that trigger the rule
occur. The triggering events—an insert of a new employee in this example—are specified
following the AFTER keyword.
The ON clause specifies the relation on which the rule is specified—EMPLOYEE for R1.
The optional keywords FOR EACH ROW specify that the rule will be triggered once for each
row that is affected by the triggering event.
The optional WHEN clause is used to specify any conditions that need to be checked after the rule
is triggered, but before the action is executed. Finally, the action(s) to be taken is (are) specified
as a PL/SQL block, which typically contains one or more SQL statements or calls to execute
external procedures.
The four triggers (active rules) R1, R2, R3, and R4 illustrate a number of features of active rules.
First, the basic events that can be specified for triggering the rules are the standard SQL update
commands: INSERT, DELETE, and UPDATE. They are specified by the
keywords INSERT, DELETE, and UPDATE in Oracle notation. In the case of UPDATE, one may
specify the attributes to be updated—for example, by writing UPDATE OF Salary, Dno. Second,
the rule designer needs to have a way to refer to the tuples that have been inserted, deleted, or
modified by the triggering event. The keywords NEW and OLD are used in Oracle
notation; NEW is used to refer to a newly inserted or newly updated tuple, whereas OLD is used
to refer to a deleted tuple or to a tuple before it was updated.
Thus, rule R1 is triggered after an INSERT operation is applied to the EMPLOYEE relation.
In R1, the condition (NEW.Dno IS NOT NULL) is checked, and if it evaluates to true, meaning
that the newly inserted employee tuple is related to a department, then the action is executed. The
action updates the DEPARTMENT tuple(s) related to the newly inserted employee by adding their
salary (NEW.Salary) to the Total_sal attribute of their related department.
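Since Figure 26.2(a) is not reproduced in these notes, the following is a sketch of what rule R1 could look like in Oracle notation, assembled from the clauses described above; treat it as an illustration rather than the textbook's exact figure:

CREATE OR REPLACE TRIGGER Total_sal1
AFTER INSERT ON EMPLOYEE
FOR EACH ROW
WHEN ( NEW.Dno IS NOT NULL )            -- condition: the new employee has a department
BEGIN
  -- action: add the new employee's salary to the department's total salary
  UPDATE DEPARTMENT
  SET Total_sal = Total_sal + :NEW.Salary
  WHERE Dno = :NEW.Dno;
END;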
It is important to note the effect of the optional FOR EACH ROW clause, which signifies that the rule is triggered separately for each tuple. This is known as a row-level trigger. If this clause was left out, the trigger would be known as a statement-level trigger and would be triggered once for each triggering statement. To see the difference, consider the following update operation, which gives a 10 percent raise to all employees assigned to department 5. This operation would be an event that triggers rule R2:
UPDATE EMPLOYEE
SET Salary = 1.1 * Salary
WHERE Dno = 5;
Because the above statement could update multiple records, a rule using row-level semantics, such
as R2 in Figure 26.2, would be triggered once for each row, whereas a rule using statement-level
semantics is triggered only once. The Oracle system allows the user to choose which of the above
options is to be used for each rule. Including the optional FOR EACH ROW clause creates a row-
level trigger, and leaving it out creates a statement-level trigger. Note that the
keywords NEW and OLD can only be used with row-level triggers.
As a second example, suppose we want to check whenever an employee’s salary is greater than
the salary of his or her direct supervisor. Several events can trigger this rule: inserting a new
employee, changing an employee’s salary, or changing an employee’s supervisor. Suppose that
the action to take would be to call an external procedure inform_supervisor, which will notify the
supervisor. The rule could then be written as in R5 (see Figure 26.2(b)).
Figure 26.3 shows the syntax for specifying some of the main options available in Oracle triggers.
We will describe the syntax for triggers in the SQL-99 standard in Section 26.1.5.
Figure 26.3 (not reproduced here) gives a syntax summary for specifying triggers in the Oracle system (main options only): CREATE TRIGGER <trigger name>, the BEFORE/AFTER triggering events (INSERT, DELETE, UPDATE [OF <columns>]), the ON <table name> clause, the optional FOR EACH ROW and [ WHEN <condition> ] clauses, and finally the <trigger actions> given as a PL/SQL block.
The previous section gave an overview of some of the main concepts for specifying active rules.
In this section, we discuss some additional issues concerning how rules are designed and
implemented. The first issue concerns activation, deactivation, and grouping of rules. In addition
to creating rules, an active database system should allow users to activate,
deactivate, and drop rules by referring to their rule names. A deactivated rule will not be
triggered by the triggering event. This feature allows users to selectively deactivate rules for certain
periods of time when they are not needed. The activate command will make the rule active again.
The drop command deletes the rule from the system. Another option is to group rules into
named rule sets, so the whole set of rules can be activated, deactivated, or dropped. It is also useful
to have a command that can trigger a rule or rule set via an explicit PROCESS RULES command
issued by the user.
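In Oracle, for example, an individual rule such as the Total_sal1 trigger can be deactivated, reactivated, grouped by table, and dropped with the statements below; the prototype-style PROCESS RULES command mentioned above has no direct equivalent in standard SQL:

ALTER TRIGGER Total_sal1 DISABLE;           -- deactivate the rule; it will no longer be triggered
ALTER TRIGGER Total_sal1 ENABLE;            -- activate the rule again
ALTER TABLE EMPLOYEE DISABLE ALL TRIGGERS;  -- deactivate the whole set of rules on one table
DROP TRIGGER Total_sal1;                    -- delete the rule from the system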
The second issue concerns whether the triggered action should be executed before, after, instead
of, or concurrently with the triggering event. A before trigger executes the trigger before
executing the event that caused the trigger. It can be used in applications such as checking for
constraint violations. An after trigger executes the trigger after executing the event, and it can be
used in applications such as maintaining derived data and monitoring for specific events and
conditions. An instead of trigger executes the trigger instead of executing the event, and it can be
used in applications such as executing corresponding updates on base relations in response to an
event that is an update of a view.
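As a hedged illustration of an instead of trigger, suppose there is a view EMP_DEPT_VIEW joining EMPLOYEE and DEPARTMENT; the view name and column list are assumptions. An INSTEAD OF trigger can translate an insert on the view into an insert on the base relation:

CREATE OR REPLACE TRIGGER Emp_dept_view_insert
INSTEAD OF INSERT ON EMP_DEPT_VIEW
FOR EACH ROW
BEGIN
  -- the INSERT on the view is not executed; this action runs in its place
  INSERT INTO EMPLOYEE (Name, Ssn, Salary, Dno)
  VALUES (:NEW.Name, :NEW.Ssn, :NEW.Salary, :NEW.Dno);
END;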
Let us assume that the triggering event occurs as part of a transaction execution. We should first
consider the various options for how the triggering event is related to the evaluation of the rule’s
condition. The rule condition evaluation is also known as rule consideration, since the action is
to be executed only after considering whether the condition evaluates to true or false. There are
three main possibilities for rule consideration:
Immediate consideration. The condition is evaluated as part of the same transaction as the
triggering event, and is evaluated immediately. This case can be further categorized into three
options:
Evaluate the condition before executing the triggering event.
Evaluate the condition after executing the triggering event.
Evaluate the condition instead of executing the triggering event.
Deferred consideration. The condition is evaluated at the end of the transaction that included the
triggering event. In this case, there could be many triggered rules waiting to have their conditions
evaluated.
Detached consideration. The condition is evaluated as a separate transaction, spawned from the
triggering transaction.
The next set of options concerns the relationship between evaluating the rule condition
and executing the rule action. Here, again, three options are possible: immediate, deferred,
or detached execution. Most active systems use the first option. That is, as soon as the condition
is evaluated, if it returns true, the action is immediately executed.
The Oracle system (see Section 26.1.1) uses the immediate consideration model, but it allows the
user to specify for each rule whether the before or after option is to be used with immediate
condition evaluation. It also uses the immediate execution model. The STARBURST system (see
Section 26.1.3) uses the deferred consideration option, meaning that all rules triggered by a
transaction wait until the triggering transaction reaches its end and issues its COMMIT
WORK command before the rule conditions are evaluated.
Another issue concerning active database rules is the distinction between row-
level rules and statement-level rules. Because SQL update statements (which act as triggering
events) can specify a set of tuples, one has to distinguish between whether the rule should be
considered once for the whole statement or whether it should be considered separately for each
row (that is, tuple) affected by the statement. The SQL-99 standard (see Section 26.1.5) and the
Oracle system (see Section 26.1.1) allow the user to choose which of the options is to be used for
each rule, whereas STARBURST uses statement-level semantics only. We will give examples of
how statement-level triggers can be specified in Section 26.1.3.
One of the difficulties that may have limited the widespread use of active rules, in spite of their
potential to simplify database and software development, is that there are no easy-to-use
techniques for designing, writing, and verifying rules. For example, it is quite difficult to verify
that a set of rules is consistent, meaning that two or more rules in the set do not contradict one
another. It is also difficult to guarantee termination of a set of rules under all circumstances. To
illustrate the termination problem briefly, consider the rules in Figure 26.4. Here, rule R1 is triggered by an INSERT event
on TABLE1 and its action includes an update event on Attribute1 of TABLE2. However,
rule R2’s triggering event is an UPDATE event on Attribute1 of TABLE2, and its action includes
an INSERT event on TABLE1. In this example, it is easy to see that these two rules can trigger
one another indefinitely, leading to non-termination. However, if dozens of rules are written, it is
very difficult to determine whether termination is guaranteed or not.
If active rules are to reach their potential, it is necessary to develop tools for the design, debugging,
and monitoring of active rules that can help users design and debug their rules.
Spatial Databases
Spatial data is associated with geographic locations such as cities, towns, etc. A spatial database is optimized to store and query data representing objects that are defined in a geometric space.
Characteristics of Spatial Database
A spatial database system has the following characteristics
It is a database system
It offers spatial data types (SDTs) in its data model and query language.
It supports spatial data types in its implementation, providing at least spatial indexing and
efficient algorithms for spatial join.
In general, spatial data can be of two types −
Vector data: This data is represented as discrete points, lines and polygons
Raster data: This data is represented as a matrix of square cells.
Spatial data in the form of points, lines, polygons, etc. is used by many different kinds of databases.
Spatial operators :
1. Topological operators :
Topological properties are those that do not change when transformations such as translation or rotation are applied.
Topological operators are hierarchically structured in many levels. The base level offers operators the ability to check for detailed topological relations between regions with a broad boundary. The
higher levels offer more abstract operators that allow users to query uncertain spatial data
independent of the geometric data model.
Examples –
open (region), close (region), and inside (point, loop).
2. Projective operators :
Projective operators, like the convex hull, are used to establish predicates regarding the concavity/convexity of objects.
Example –
Checking whether a point lies inside an object’s concavity.
3. Metric operators :
Metric operators’ task is to provide a more accurate description of the geometry of the
object. They are often used to measure the global properties of singular objects, and to
measure the relative position of different objects, in terms of distance and direction.
Example –
length (of an arc) and distance (of a point to point).
4. Dynamic operators :
Dynamic operators alter the objects to which they are applied.
Example –
Updating of a spatial object via translate, rotate, scale up or scale down, reflect, and shear.
3. Spatial Queries :
Requests for spatial data which require the use of spatial operations are called spatial queries.
1. Range queries :
It finds all objects of a particular type that are within a given spatial area.
Example –
Finds all hospitals within the Siliguri area. A variation of this query is: for a given location, find all objects within a particular distance, for example, find all banks within a 5 km range.
2. Nearest neighbor queries :
Example –
Finds the nearest police station from the location of an accident.
3. Spatial joins or overlays :
Example –
Finds all Dhabas on a National Highway between two cities. It spatially joins township objects and highway objects.
Example –
Finds all hotels that are within 5 kilometers of a railway station. It spatially joins railway station objects and hotel objects.
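A minimal sketch of such queries in SQL, assuming PostGIS-style spatial functions and illustrative tables HOTEL, RAILWAY_STATION and POLICE_STATION, each with a geometry column geom (the table names, columns, coordinates and SRID are assumptions, not part of the original notes):

-- Range query / spatial join: all hotels within 5 km of a named railway station
SELECT h.name
FROM HOTEL h, RAILWAY_STATION r
WHERE r.name = 'Siliguri Junction'
  AND ST_DWithin(h.geom::geography, r.geom::geography, 5000);  -- distance in metres

-- Nearest-neighbour query: the police station closest to an accident location
SELECT p.name
FROM POLICE_STATION p
ORDER BY p.geom <-> ST_SetSRID(ST_MakePoint(88.42, 26.71), 4326)
LIMIT 1;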
When it comes to comparing spatial databases, we can look at three primary features:
Spatial data types
Spatial queries
Spatial indexes
Together, these three components comprise the basis of a spatial database. These three
components will help you decide which spatial database is most suitable for your enterprise or
business.
Spatial data comes in all shapes and sizes. All databases typically support points, lines, and
polygons, but some support many more spatial data types. Some databases abide by the standards
set by the Open Geospatial Consortium. Yet, that doesn’t mean it is easy to move the data between
databases.
This is where the FME platform reveals some of its strengths. Database barriers no longer matter,
as you can move your data wherever you want. With support for over 450 different systems and
applications, it can handle all your data tasks, spatial and otherwise.
Spatial queries perform an action on spatial data stored in the database. Some spatial queries can
be used to perform simple operations. However, some queries can become much more complex,
invoking spatial functions that span multiple tables. A spatial query using SQL allows you to
retrieve a specific subset of spatial data. This helps you retrieve only what you need from your
database.
This is how data is retrieved in spatial databases. The spatial query capabilities can vary from
database to database, both in terms of performance and functionality. This is important to consider
when you select your database.
Spatial queries drive a whole new class of business decisions by retrieving the requested data efficiently for your business systems.
Spatial Indexes
What does the added size and complexity of spatial data mean for your data? Will your database
run slower? Will large spatial databases be too bulky for your database to store?
This is why spatial indexes are important. Spatial indexes are created with SQL commands. These
are generated from the database management interface or an external program (e.g., FME) with access
to your spatial database. Spatial indexes vary from database to database and are responsible for the
database performance necessary for adding spatial to your decision making.
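For instance, in PostGIS a spatial index is usually a GiST index on the geometry column, created with a single SQL command (continuing the illustrative HOTEL table used above):

CREATE INDEX hotel_geom_idx ON HOTEL USING GIST (geom);
-- the query planner can then use the index to answer ST_DWithin and
-- nearest-neighbour queries without scanning every row of the table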
Spatial data mining describes the process of discovering hidden patterns in large spatial data sets.
As a key driver of GIS application development, spatial data mining allows users to extract
valuable data on contiguous regions and investigate spatial patterns. In this scenario, spatial
variables like distance and direction are taken into account.
Data visualization software, such as Tableau, allows data scientists and marketers to connect
different spatial data files like Esri File Geodatabases, GeoJSON files, Keyhole Markup Language
(KML) files, MapInfo tables, Shapefiles and TopoJSON files. Once connected, users can create
points, lines and polygon maps using the information in spatial data files, lidar data files and
geospatial data files.
Spatial data is important for the internet of things (IoT). It helps IoT protocols use remote sensing
to collect data for spatial analysis. Spatial data is also used in transportation and logistics to help
companies understand which machine would work best at a specific location, make accurate time
estimations for deliveries and track deliveries in real time.
Environmental technologies also use spatial data to monitor temperature patterns, tidal patterns
and more. The ability to track at-risk areas in combination with historical data, weather data and
geospatial data gives scientists better information to predict natural disasters.
One of the challenging aspects today for policymakers worldwide is to reduce the risks related to
climate change. GIS technologies help in understanding complex situations better and offering
concrete Geospatial solutions. Planning bodies can leverage technology to assess and implement
sustainable programs for the future.
GIS frameworks offer a scientific understanding of earth systems that lead to better decision-
making. Some GIS analysis examples are as follows.
Sea level analysis to measure rising levels and the threats they pose
With Evergreen Canada, one of our clients from the ESG sector, we built AI for Resilient City.
It’s an AI-driven 3-D data visualization tool that aims to help municipalities across Canada plan
for and mitigate the impacts of climate change.
The spatial structure of a geographical area plays a vital role in the lives of its inhabitants. To
ensure the quality of life of inhabitants, planners need better insights for effective decision-making.
So, improved knowledge of spatial structures and related socio-economic levels is vital.
Spatial pattern metrics from local climate zone classification helps in this aspect. It becomes
available by combining open GIS data, remote sensing, and machine learning. The data helps in
identifying a relationship between socioeconomic variables and spatial pattern metrics.
Some examples of variables include healthcare, education, and transportation. These variables also
help in grouping areas of any geography based on the quality of life they offer.
We built an AI-driven application to predict the quality of life by evaluating socioeconomic data. This application leverages Deep Learning to process images captured from satellites, enriched with census data, and offers insights about rises in urbanization and poverty, and anomalies in census-measured factors like literacy, employment, and healthcare.
Traffic Analysis
One of the better ways to identify problems in transportation systems is by modeling public transit
accessibility and traffic congestion. Traffic modeling also helps in identifying road stretches that
often exceed their capacity levels. People in low-income groups often lack vehicles, which makes transit more difficult for them.
So inadequate public transportation can impact their ability to access employment and other
amenities freely. In many cases, public transportation does not cover every neighborhood.
Similarly, traffic congestion makes it unreliable.
Satellite data is finding increasing use in predicting disease risks across geographies. It can predict
the spatial distribution of infections and help plan the medication distribution for control and
preventive measures.
Geostatistical models are helpful, along with other factors like surface temperature and rainfall –
these help in understanding the prevalence of disease in society. Annual temperatures and distance
to water bodies are some other critical factors involved in the process.
One of our clients, the World Mosquito Program (WMP), alters mosquitos with natural bacteria to
limit their capacity to transmit dangerous diseases. These mosquitoes’ progeny lose their capacity
to transmit illness as well.
WMP collaborated with Gramener as part of a Microsoft AI for Good grant. Our AI-driven
approach evaluates population density using satellite images and Geospatial AI. It suggests a
neighborhood-level action to hasten social effect and save lives.
Gramener used computer vision models on high-resolution satellite pictures to estimate population
density and sensitivity to mosquito-borne illnesses at the sub-neighborhood level. The AI solution
creates a fine-grained release and monitoring strategy based on the city, population, and projected
coverage. This allows the WMP team to move quickly and maximize the effectiveness of their
solution.
When we talk of the current COVID-19 pandemic, it is a global challenge. Vaccinating as many
people as possible to achieve herd immunity status is critical to ending the pandemic. GIS
technologies can help in optimizing vaccine distribution to reduce the period to vaccinate everyone
eligible.
It is possible to map the population by identifying and segregating people based on their age groups
and the type of vaccines available. Analysis can show the number of people living close to a
vaccination site and the time taken to vaccinate them. It helps in the quick and efficient distribution
of vaccines to enable uniform coverage across cities, states, and countries.
Healthcare facilities are a vital component of any health system. Healthcare access is critical for
each individual living in a geographical area. Ensuring enough coverage of healthcare facilities is
something that defines a successful healthcare program. Spatial analysis helps identify the
locations of health facilities and their proximity to people.
GIS systems further help gain access to advanced metrics of healthcare and identify inequalities
for better planning. The correct data helps in planning appropriate access of healthcare facilities to
even the marginalized sections of society.
Reduced crop production has become a common phenomenon in recent years due to the
unpredictability of climate. Weather forecasts today play a critical role in improving crop
management. Crop yield prediction, through spatial analysis, helps in planning and executing
smooth logistical operations.
It is possible to study crop yield prediction through satellite imagery, soil conditions, and climate
data. Another crucial factor is the possibility of pest attacks in farms, which can also be predicted.
These resources combine to help identify a suitable time for crop production.
Livestock Monitoring
Livestock is a vital element of the economy, making their management an essential task. At places
where cattle roam around freely, spatial monitoring assumes much more importance. Studies have
also shown that livestock can release methane, which has a direct impact on global warming.
Larger herds can lead to higher methane emissions.
Nitrogen released in the soil is also hazardous as it can pollute water bodies as well. More
importantly, the effect of different species like cattle, swine, and goats can differ. GIS tools enable
online monitoring to check the damage caused to vegetation and landscape.
Another essential aspect is the manure production of animals, which is beneficial for biofuels.
Quantitative modeling tools help in calculating biogas production through the number of livestock
and the quality of manure. It assumes importance for countries where dependency is heavy on
livestock and the natural gas resources are less.
Soil properties mapping is essential to adopt sustainable farming practices. GPS tools can help in
identifying and collecting coordinates of sample areas. Researchers can then study the soil
properties like pH level, nitrogen content, nutrient levels, and much more. A GIS environment can
showcase the soil properties and their spatial variability. It is possible to do that with the help of
geostatistical analysis and interpolation techniques.
The spatial dependency level and spatial distribution of soil properties can vary significantly. This
data can help decision-makers to plan for better nutrient management. The prototype data also
remains helpful for future use of fertilizers.
The growth and productivity of crops depend on several factors like soil condition, weather, and
other management techniques. These can differ significantly across regions. To enable smart
farming, remote sensing data is ideal for mapping crops and understanding their performance.
A crop simulation model in the GIS framework can monitor crop performance through remote
sensing techniques. The remote sensing data can also help identify information related to crop
distribution, environment, and phenology.
The spatial analysis examples and applications are far and wide. We take a look at four such
applications in detail:
The first spatial analysis example involves researching a geography to identify the buildings and other structures. For example, researchers might need to understand “how many hospitals are available in a particular town.” This nonspatial query does not require knowledge of the physical location of a hospital. However, if the question is “how many hospitals are within a distance of five kilometers of each other,” it becomes spatial.
GIS can help in measuring the distance between the hospitals. It becomes possible as a GIS can
link spatial data with facts regarding geographical features on a map. The information remains
stored as attributes and characteristics of graphical representation. The lack of GIS means that
street networks will have simple street centerlines. It is not very beneficial from a visual
representation point of view.
GIS gives you the chance to use different symbols and showcase the database on the map. You
can show the building type like hospitals. The visual representation makes it easier for users to
study information seamlessly.
The distribution of population has spatial features. Population analysis through traditional methods
does not allow combining quantity, quality, data, and graphic methods. GIS helps in exhibiting the
spatial characteristics of population data on a macro level. The technology leverages display and
analysis functions to enable comprehensive representation.
The micro-level representation of population data involves adding public institutions, retail units,
and other structures that make up a geographic area. Such models also showcase the effect of such
structures on a population. GIS helps the decision-making authorities by integrating population
and spatial data. Population clustering remains a prominent spatial analysis example.
The third spatial analysis example is data annotation or exploratory insights, which involves using
tools and methods that uncover finer details of data. It also includes spatial and nonspatial patterns
and distributions. The raw data usually comes in tabular form, and making sense out of that set
can be difficult.
Exploratory analysis works with numeric data to identify the mean value. Some other statistics
involved in the process are median, standard deviation, and visualizations. Scatter plots and bar
charts are part of visualizations. Insights help in exploring spatial patterns and performing spatial
analysis.
This spatial analysis example is critical in visualizing information. Geospatial analysis tools help
in performing the visual mapping. Users can analyze data sets by adding them to maps. The layers
remain on background maps and can have charts, heatmap, line layers, and geodata. You can use
internal and external sources to gather data for layers and background maps.
Visual mapping involves gathering data from sources like smartphones, satellites, vehicles, and
wearable devices. These can power your analytics and dashboard reporting to improve the
decision-making process. You can also identify patterns and get insights that do not appear in raw
data available in spreadsheets.
Bottomline
Spatial analysis has assumed an essential role across industries today. Researchers and planners
across governmental and non-governmental agencies use spatial algorithms to study patterns
across geographies and plan their interventions. The backing of data also gives assurance regarding
the successful implementation of programs.
In the case of welfare programs of non-governmental organizations, it becomes much more critical.
It helps them spend their finances wisely to ensure that the maximum number of people benefit from their programs.
At Gramener, spatial analysis solutions are one of our key offerings. We help you leverage satellite
imagery and related information to solve your business challenges.
Mobile Database
A Mobile database is a database that can be connected to a mobile computing device over a
mobile network (or wireless network). Here the client and the server have wireless connections.
In today’s world, mobile computing is growing very rapidly and has huge potential in the field of databases. It is applicable to different kinds of devices, such as Android-based and iOS-based mobile databases. Common examples of mobile databases are Couchbase Lite, ObjectBox, etc.
Mobile databases are physically separate from the central database server.
Mobile databases are capable of communicating with a central database server or other
mobile clients from remote sites.
With the help of a mobile database, mobile users are able to keep working even when the wireless connection is poor or non-existent (disconnected operation).
A mobile database is used to analyze and manipulate data on mobile devices.
1. Fixed Hosts –
It performs the transactions and data management functions with the help of database servers.
2. Mobiles Units –
These are portable computers that move around a geographical region that includes the
cellular network that these units use to communicate to base stations.
3. Base Stations –
These are two-way radio installations at fixed locations that relay communication between the mobile units and the fixed hosts.
Limitations
Here, we will discuss the limitation of mobile databases as follows.
It is less secure.
It is hard to make theft-proof.
Mobility Management
With the convergence of the Internet and wireless mobile communications and with the rapid
growth in the number of mobile subscribers, mobility management emerges as one of the most
important and challenging problems for wireless mobile communication over the Internet.
Mobility management enables the serving networks to locate a mobile subscriber’s point of
attachment for delivering data packets (i.e. location management), and maintain a mobile
subscriber’s connection as it continues to change its point of attachment (i.e. handoff
management). The issues and functionalities of these activities are discussed in this section.
Location management
Location management enables the networks to track the locations of mobile nodes.
Location management has two major sub-tasks:
(i) location registration, and (ii) call delivery or paging. In location registration procedure,
the mobile node periodically sends specific signals to inform the network of its current location so
that the location database is kept updated. The call delivery procedure is invoked after the
completion of the location registration. Based on the information that has been registered in the
network during the location registration, the call delivery procedure queries the network about the
exact location of the mobile device so that a call may be delivered successfully. The design of a
location management scheme must address the following issues:
(iii) in a fully overlapping area where several wireless networks co-exist, an efficient and robust algorithm must be designed to select the network through which a mobile device should perform registration, to decide where and how frequently the location information should be stored, and to determine the exact location of a mobile device within a specific time frame.
Handoff management
Handoff management is the process by which a mobile node keeps its connection active
when it moves from one access point to another. There are three stages in a handoff process. First,
the initiation of handoff is triggered by either the mobile device, or a network agent, or the
changing network conditions. The second stage is for a new connection generation, where the
network must find new resources for the handoff connection and perform any additional routing
operations. Finally, data-flow control needs to maintain the delivery of the data from the old
connection path to the new connection path according to the agreed-upon QoS guarantees.
Depending on the movement of the mobile device, it may undergo various types of handoff. In a broad sense, handoffs may be of two types: intra-system handoff, within a single network, and inter-system handoff, between heterogeneous networks.
An inter-system handoff between heterogeneous networks may arise in the following scenarios
(i) when a user moves out of the serving network and enters an overlying network,
(ii) when a user connected to a network chooses to handoff to an underlying or overlaid
network for his/her service requirements,
(iii) when the overall load on the network is required to be distributed among different systems.
The design of handoff management techniques in all-IP based next-generation wireless networks
must address the following issues:
(i) signaling overhead and power requirement for processing handoff messages should be
minimized,
(ii) QoS guarantees must be made,
(iii) network resources should be efficiently used, and
(iv) the handoff mechanism should be scalable, reliable and robust.
By receiving and analyzing, in advance, the signal strength reports and the information
regarding the direction of movement of the mobile node from the link layer, the system gets ready
for a network layer handoff so that packet loss is minimized and latency is reduced.
The disconnection of mobile stations for possibly long periods of time and bandwidth limitations
require a serious reevaluation of transaction model and transaction processing techniques. There
have been many proposals to model mobile transactions with different notions of a mobile
transaction. Most of these approaches view a mobile transaction as consisting of subtransactions
which have some flexibility in consistency and commit processing. The management of these
transactions may be static at the mobile unit or the database server, or may move from base station
to base station as the mobile unit moves.
Network disconnection may not be treated as failure, and if the data and methods needed to
complete a task are already present on the mobile device, processing may continue even though
disconnection has occurred. Because the traditional techniques for providing serializability (e.g.,
transaction monitors, scheduler, locks) do not function properly in a disconnected environment,
new mechanisms have to be developed for the management of mobile transaction processing.
Applications of mobile computing may involve many different tasks, which can include long-lived
transactions as well as some data-processing tasks such as remote order entry. Since users need to be
able to work effectively in disconnected state, mobile devices will require some degree of
transaction management. So, concurrency control schemes for mobile distributed databases should
support the autonomous operation of mobile devices during disconnections. These schemes should
also consider the message traffic with the realization of bandwidth limitations. Another issue in
these schemes would be to consider the new locality or place after the movement of the mobile
device. These challenging issues have been studied by many researchers but only some of the work
is included below.
Many of these models examine relaxing some of the ACID properties, non-blocking execution at the disconnected mobile unit, caching of data before the request, adaptation of commit protocols, and recovery issues. Each model is based on its own basic requirements for the transaction model. However, the first of the following transaction models is a new model defined especially for the mobile environment, based on the traditional transaction models.
A mobile transaction model has been defined addressing the movement behavior of transactions.
Mobile transactions are named as Kangaroo Transactions which incorporate the property that the
transactions in a mobile environment hop from one base station to another as the mobile unit
moves. The model captures this movement behavior and the data behavior reflecting the access to
data located in databases throughout the static network.
The reference model assumed in, has a Data Access Agent (DAA) which is used for accessing data
in the database (of fixed host, base station or mobile unit) and each base station hosts a DAA.
When it receives a transaction request from a mobile user, the DAA forwards it to the specific base
stations or fixed hosts that contain the required data. The DAA acts as a Mobile Transaction Manager and data access coordinator for the site. It is built on top of an existing Global Database System (GDBS). A GDBS assumes that the local DBMS systems perform the required transaction processing functions, including recovery and concurrency control. A DAA’s view of the GDBS is similar to that seen by a user at a fixed terminal, and the GDBS is not aware of the mobile nature of some nodes in the network. The DAA is also not aware of the implementation details of each requested transaction.
When a mobile transaction moves to a new cell, the control of the transaction may move or may remain at the originating site. If it remains at the originating site, messages would have to be sent
from the originating site to the current base station any time the mobile unit requests information.
If the transaction management function moves with the mobile unit, the overhead of these
messages can be avoided. For the logging side of this movement, each DAA will have the log
information for its corresponding portion of the executed transaction.
The model is based on the traditional transaction concept, namely a sequence of operations including read, write, begin transaction, end transaction, commit and abort. The basic structure is mainly a Local Transaction (LT) to a particular DBMS. On the other hand, Global Transactions (GT) can consist of either subtransactions viewed as LTs to some DBMS (Global SubTransaction - GST) or subtransactions viewed as sequences of operations which can be global themselves (GTs). This kind of nested viewing gives a recursive definition whose base case is the local transaction. A hopping property is added to model the mobility of the transactions, and Figure 2 shows this basic Kangaroo Transaction (KT) structure.
Each subtransaction represents the unit of execution at one base station and is called a Joey Transaction (JT). The sequence of global and local transactions which are executed under a given KT is defined as a Pouch. The base station of origin initially creates a JT for its execution. A GT and a JT differ only in that a JT is part of a KT and must be coordinated by a DAA at some base station site. A KT has a unique identification number consisting of the base station number and a unique sequence number within that base station. When the mobile unit moves from one cell to another, the control of the KT changes to a new DAA at another base station. The DAA at the new base station site creates a new JT as the result of the handoff process. JTs also have identification numbers in sequence, where a JT ID consists of the KT ID and a sequence number.
The mobility of the transaction model is captured by the use of split transactions. The old JT is thus committed independently of the new JT. In Figure 2, JT1 is committed independently from JT2 and JT3. If a failure of any JT occurs, the entire KT may have to be undone by compensating any previously completed JTs, since the autonomy of the local DBMSs must be assured. Therefore, a Kangaroo Transaction can run either in Split Mode or in Compensating Mode. A split transaction divides an ongoing transaction into serializable subtransactions. The earlier created subtransaction may be committed while the second one continues its execution. However, the decision to abort or commit currently executing subtransactions is left up to the component DBMSs. In Split Mode, previously committed JTs are not compensated, so neither Split Mode nor Compensating Mode guarantees serializability of Kangaroo Transactions. Although Compensating Mode assures atomicity, isolation may be violated because locks are obtained and released at the local transaction level. With the Compensating Mode, Joey subtransactions are serializable. The Mobile Transaction Manager (MTM) keeps a Transaction Status Table at the base station DAA to maintain the status of those transactions. It also keeps a local log into which the MTM writes the records needed for recovery purposes, but the log does not contain any records related to recovering database operations. Most records in the log are related to KT transaction status and some compensating information.
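The handoff behaviour described above can be pictured with a small sketch. The following Python fragment is a minimal, illustrative model of a Kangaroo Transaction being split into Joey Transactions at each handoff; the class and method names (KangarooTransaction, JoeyTransaction, handoff, and so on) are assumptions made for illustration, not part of the original model's specification.

# Minimal sketch (not the original implementation) of Kangaroo/Joey transaction splitting.

class JoeyTransaction:
    def __init__(self, kt_id, seq_no, base_station):
        # A JT ID consists of the KT ID plus a sequence number.
        self.jt_id = (kt_id, seq_no)
        self.base_station = base_station
        self.operations = []          # pouch of global/local operations for this JT
        self.committed = False

    def commit(self):
        # The old JT is committed independently of any later JT (split transaction).
        self.committed = True


class KangarooTransaction:
    def __init__(self, base_station, seq_no, mode="split"):
        # KT ID = (base station number, sequence number within that base station).
        self.kt_id = (base_station, seq_no)
        self.mode = mode              # "split" or "compensating"
        self.joeys = [JoeyTransaction(self.kt_id, 1, base_station)]

    def handoff(self, new_base_station):
        # On a cell change, the DAA at the new base station creates a new JT;
        # the previous JT is committed independently (split transaction).
        self.joeys[-1].commit()
        next_seq = len(self.joeys) + 1
        self.joeys.append(JoeyTransaction(self.kt_id, next_seq, new_base_station))

    def fail_current_joey(self):
        # In Compensating Mode a JT failure undoes the whole KT by compensating
        # previously committed JTs; in Split Mode earlier JTs stay committed.
        if self.mode == "compensating":
            return [jt.jt_id for jt in self.joeys if jt.committed]  # JTs to compensate
        return []


kt = KangarooTransaction(base_station=7, seq_no=42, mode="compensating")
kt.handoff(new_base_station=8)   # JT(7,42,1) commits, JT(7,42,2) starts at station 8
print(kt.fail_current_joey())    # -> [((7, 42), 1)] : previously committed JT to compensate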
Kangaroo Transaction model captures both the data and moving behavior of mobile transactions
and it is defined as a general model where it can provide mobile transaction processing in a
heterogeneous, multidatabase environment. The model can deal with both short-lived and long-
lived transactions. The mobile agents concept can be used for multi-node processing of a KT when the user requests new subtransactions based on the results of earlier ones. This idea is discussed in [6], which points out that there would then be no need to keep the status table and log files at the base station DAAs. In that case, the agent infrastructure must provide for the movement of the state information with the moving agent.
Clustering Model
A flexible, two-level consistency model has been introduced to deal with frequent, predictable and varying disconnections. It has also been pointed out that maintaining data consistency over all distributed sites imposes unbearable overheads on mobile computing, so a more flexible open-nested model is proposed. The model is based on grouping semantically related or closely located data together to form a cluster. Data are stored or cached at a mobile host (MH) to support its autonomous operation during disconnections. A fully distributed environment is assumed
where users submit transactions from both mobile and fixed terminals. Transactions may involve
both remote data and data stored locally at the user’s device.
The items of a database are partitioned into clusters and they are the units of consistency in that all
data items inside a cluster are required to be fully consistent, while data items residing at different
clusters may exhibit bounded inconsistencies. Clustering may be constructed depending on the
physical location of data. By using this locality definition, data located at the same, neighbor, or
strongly connected hosts may be considered to belong to the same cluster, while data residing at
disconnected or remote hosts may be regarded as belonging to separate clusters. In this way, a
dynamic cluster configuration will be created.
It is also stated that the nature of voluntary disconnections can be used in defining clusters. Therefore, clusters of data may be explicitly created or merged on a probable disconnection or connection of the associated mobile host. The movement of the mobile host also affects its place in the clustering: when it enters a new cell, it may change its cluster too.
On the other hand, clusters of data may be defined by using the semantics of data such as the
location data or by defining a user profile. Location data, which represent the address of a mobile
host, are fast changing data replicated over many sites. These data are often imprecise, since
updating all their copies creates overhead and there may be no need to provide consistency for
these kinds of data. On the other hand, by defining user profiles for the cluster creation, it may be
possible to differentiate users based on the requirements of their data and applications. For
example, data that are most often accessed by some user or data that are somewhat private to a
user can be considered to belong to the same cluster independent of their location or semantics.
The model requires full consistency for all data inside a cluster, but only degrees of consistency for replicated data at different clusters. The degree of consistency may vary depending on the availability of network bandwidth among clusters, by allowing a bounded amount of deviation. This
will provide applications with the capability to adapt to the currently available bandwidth,
providing the user with data of variable level of detail or quality. For example, in the instance of a
cooperative editing application, the application can display only one chapter or older versions of
chapters of the book under weak network connections and up-to-date copies of all chapters under
strong network connections.
The mobile database is seen as a set of data items which is partitioned to a set of clusters. Data
items are related by a number of restrictions called integrity constraints that express relationships
of data items that a database state must satisfy. Integrity constraints among data-items inside the
same cluster are called intra-cluster constraints and constraints among data items at different
clusters are called inter-cluster constraints. During disconnection, or when the connection is weak or costly, the only data that the user can access may not strictly satisfy inter-cluster constraints. To maximize local processing and reduce network access, the user is allowed to interact with locally available (within a cluster) m-degree consistent data by using weak-read and weak-write operations. These operations allow users to operate with a lack of strict consistency, which can be tolerated
by the semantics of their applications. On the other hand, the standard read and write operations
are called strict read and strict write operations to differentiate them from weak operations.
Based on the ideas stated above, two basic types of transactions are defined: weak and strict transactions. As the names imply, weak transactions consist only of weak read and weak write operations; they access only data copies that belong to the same cluster and can be considered local at that cluster. A weak read operation on a data item reads a locally available copy, which is the value written by the last weak or strict write operation at that cluster. A weak write operation writes a local copy and is not permanent unless it is committed in the merged network. Likewise, strict transactions consist only of strict read and strict write operations. A strict read operation reads the value of the data item written by the last strict write operation, while a strict write operation writes one or more copies of the data item.
Weak transactions have two commit points: a local commit in the associated cluster and an implicit global commit after cluster merging. Updates made by locally committed weak transactions are visible only to other weak transactions in the same cluster; they are not visible to strict transactions until merging occurs and the locally committed transactions become globally committed. It has been shown how weak transactions can be made part of the concurrency controller, and criteria and graph-based tests for the correctness of the resulting schedules have been developed.
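As a rough illustration of weak versus strict operations, the sketch below models a cluster that serves weak reads and writes locally and reconciles locally committed weak updates when clusters merge. All names (Cluster, weak_write, reconcile, and so on) are illustrative assumptions, not part of the published model, and the reconciliation step is deliberately naive.

# Minimal sketch of weak/strict operations in a clustered mobile database.

class Cluster:
    def __init__(self, name, data):
        self.name = name
        self.strict = dict(data)     # globally committed (strict) values
        self.weak = {}               # locally committed weak updates, pending merge

    def weak_read(self, key):
        # Returns the value written by the last weak or strict write in this cluster.
        return self.weak.get(key, self.strict.get(key))

    def weak_write(self, key, value):
        # Locally committed only; becomes permanent after the clusters merge.
        self.weak[key] = value

    def strict_read(self, key):
        return self.strict[key]

    def strict_write(self, key, value):
        self.strict[key] = value

    def reconcile(self, other):
        # On cluster merging, locally committed weak updates are (naively) made
        # globally committed; a real system would check inter-cluster constraints.
        for key, value in self.weak.items():
            self.strict[key] = value
            other.strict[key] = value
        self.weak.clear()


home = Cluster("home", {"doc": "v1"})
office = Cluster("office", {"doc": "v1"})
home.weak_write("doc", "v2")          # visible to weak reads in 'home' only
print(home.weak_read("doc"))          # -> v2
print(office.strict_read("doc"))      # -> v1 (not yet merged)
home.reconcile(office)                # merge: weak update becomes strict everywhere
print(office.strict_read("doc"))      # -> v2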
The addition of weak operations to the database interface allows users to access locally (within a cluster) consistent data by issuing weak transactions, and globally consistent data by issuing strict transactions. Weak operations support disconnected operation, since a mobile device can operate while disconnected as long as applications are satisfied with local copies. Users can use weak transactions to update mostly private data and strict transactions to update heavily shared common data. Furthermore, by allowing applications to specify their consistency requirements, better bandwidth utilization can be achieved.
MultiDatabase Transactions
The mobile host can play many roles in a distributed database environment. It may simply submit operations to be executed on a server or an agent in the fixed network. How multidatabase transactions can be submitted from mobile workstations has also been examined. A framework for mobile computing in a cooperative multidatabase processing environment and a global transaction manager facility are also introduced.
Each mobile client is assumed to submit a transaction to a coordinating agent. Once the transaction
has been submitted, the coordinating agent schedules and coordinates its execution on behalf of
the mobile client. Mobile units may voluntarily disconnect from the network prior to having any
associated transactions completed. They aimed an architecture that satisfies the following :
providing full-fledged transaction management framework so that the users and application
programs will be able to access data across multiple sites transparently,
26
enhancing database concurrency and data availability through the adoption of a distributed
concurrency control and recovery mechanism that preserves local autonomy,
implementing the concept extensibility to support various database systems in the
framework so that the components can cooperate with a relational or an object- oriented
database system,
providing an environment where the proposed transaction processing component operates
independently and transparently of the local DBMS.
incorporating the concept of mobile computing through the use of mobile workstations into
the model.
The Global Communication Manager (GCM) is responsible for the generation and management of message queues within the local site. It also delivers and exchanges these messages with its peer sites and mobile hosts in the network.
The Global Transaction Manager (GTM) coordinates the submission of global subtransactions to
its relevant sites. The Global Transaction Manager Coordinator (GTMC) is the site where the
global transaction is initiated. All participating GTMs for that global transaction are known as
GTMPs. The GTM can be a Global Scheduling Submanager (GSS) or a Global Concurrency
Submanager (GCS). The GSS is responsible for the scheduling of global transactions and
subtransactions. The GCS is responsible for acquisition of necessary concurrency control
requirements needed for the successful execution of global transactions and subtransactions. The
GTM is responsible for the scheduling and commitment of global transactions while the Local
Transaction Manager (LTM) is responsible for the execution and recovery of transactions executed
locally.
The Global Recovery Manager (GRM) coordinates the commitment and recovery of global transactions and subtransactions after a failure. It ensures that either the effects of committed global subtransactions are written to the underlying local database, or none of the effects of aborted global subtransactions are written at all. It also uses the write-ahead logging protocol so that the effects on the database can be written immediately, without having to wait for the global subtransaction to complete or commit.
The Global Interface Manager (GIM) coordinates the submission of requests and replies between the MDSTPM and the local database manager, which can be a relational database system or an object-oriented database system. This component provides the extensibility function, including the translation of an SQL request into an object-oriented query language request.
The approach used for the management of mobile workstations and the global transactions they submit is to have these mobile workstations be part of the MDS during their connections with their respective coordinator nodes. Once a global transaction has been submitted, the coordinating site can schedule and coordinate the execution of the global transaction on behalf of the mobile host. In this way, a mobile workstation may disconnect from the network without waiting for the global transaction to complete. The coordinating sites are assumed to be connected by reliable communication networks which are less subject to failures.
An alternative mechanism to Remote Procedure Call (RPC), a Message and Queuing Facility (MQF), is proposed for the implementation of this approach. Request messages sent from a mobile host to its coordinating site are handled asynchronously, allowing the mobile host to disconnect itself. The coordinating node executes the messages on behalf of the mobile unit, and the status of global transactions can be queried from mobile hosts.
In the proposed MQF, for each mobile workstation there exists a message queue and a transaction queue. Request, acknowledgment and information-type messages can be used, such as a request for connection/reconnection, an acknowledgment for connection/reconnection to the mobile workstation, or a query of the message queue status. To manage the submitted transactions, a simple global transaction queuing mechanism is proposed. This approach is based on the finite state machine concept: the set of possible states and transitions between the beginning and ending states of a global transaction can be clearly defined. For the implementation of this mechanism, five transaction sub-queues are used (input queue, allocate queue, active queue, suspend queue, output queue) to manage global transactions/subtransactions submitted to the local site by the mobile workstation.
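The five-queue mechanism can be read as a small finite state machine. The sketch below is one possible interpretation in Python; the transition table and method names are assumptions made for illustration, since the text does not spell out the exact transitions.

# Minimal sketch of the five-queue global transaction state machine.

QUEUES = ("input", "allocate", "active", "suspend", "output")

# Assumed (illustrative) legal transitions between sub-queues.
TRANSITIONS = {
    "input":    {"allocate"},
    "allocate": {"active"},
    "active":   {"suspend", "output"},   # suspend on disconnection, output on completion
    "suspend":  {"active"},              # resume after reconnection
    "output":   set(),                   # terminal: results ready for the mobile host
}


class GlobalTransaction:
    def __init__(self, gt_id):
        self.gt_id = gt_id
        self.queue = "input"             # every submitted GT starts in the input queue

    def move_to(self, target):
        if target not in TRANSITIONS[self.queue]:
            raise ValueError(f"illegal transition {self.queue} -> {target}")
        self.queue = target


gt = GlobalTransaction("GT-17")
for step in ("allocate", "active", "suspend", "active", "output"):
    gt.move_to(step)
print(gt.queue)   # -> output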
It is also noted that, for a multidatabase to function correctly within this architecture, it is necessary to establish MDSTPM component software at each site in order to provide the integration. On the other hand, it has been pointed out that this model ignores important issues, including interactive transactions that need input from the user and produce output, transactions that involve data stored at mobile workstations, and mobile host migration, although the model does offer a practical approach.
PRO-MOTION
A mobile transaction processing system, PRO-MOTION, has been developed with the aim of migrating existing database applications and supporting the development of new database applications involving mobile and wireless data access. PRO-MOTION is a mobile transaction processing system which supports disconnected transaction processing in a mobile client-server environment.
The underlying transaction processing model of PRO-MOTION is the concept of nested-split transactions. Nested-split transactions are an example of open nesting, which relaxes the top-level atomicity restriction of closed nested transactions: an open nested transaction allows its partial results to be observed outside the transaction. Consequently, one of the main issues for local transaction processing on the mobile host is visibility, since allowing new transactions to see uncommitted changes (dirty data) may result in undesired dependencies and cascading aborts. But since no updates on a disconnected MH can be incorporated in the server database, subsequent transactions using the same data items normally could not proceed until reconnection occurs and the mobile transaction commits. PRO-MOTION considers the entire mobile sub-system as one extremely large, long-lived transaction which executes at the server, with a subtransaction executing at each MH. Each of these MH subtransactions, in turn, is the root of another nested-split transaction. By making the results of a transaction visible as soon as the transaction begins to commit at the MH, additional transactions can make progress even though the data items involved have been modified by an active (i.e., non-committed) transaction. In this way, local visibility and local commitment can reduce the blocking of transactions during disconnection and minimize the probability of cascading aborts.
A compact is defined as a satisfied request to cache data, together with its obligations, restrictions and state information. It represents an agreement between the database server and the mobile host in which the database server delegates control of some data to the MH to be used for local transaction processing. The database server need not be aware of the operations executed by individual transactions on the MH; rather, it sees periodic updates to a compact for each of the data items manipulated by the mobile transactions. Compacts are defined as objects encapsulating the cached data, methods for access to the cached data, current state information, consistency rules, obligations and the interface methods. The main structure is shown in Figure 5.
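Since a compact is described as an object bundling data, access methods, state, rules and obligations, a bare-bones object sketch may help. The following Python class is purely illustrative; the field and method names are assumptions, not PRO-MOTION's actual interface.

# Minimal sketch of a PRO-MOTION-style compact as an object.

from dataclasses import dataclass, field


@dataclass
class Compact:
    """Agreement by which the server delegates some data to a mobile host (MH)."""
    item_id: str
    cached_value: object                       # the delegated (cached) data
    state: str = "valid"                       # current state information
    consistency_rules: dict = field(default_factory=dict)   # e.g. {"max_staleness_s": 300}
    obligations: list = field(default_factory=list)         # e.g. ["report updates on reconnect"]

    # Common interface used by the compact agent on the MH.
    def read(self):
        return self.cached_value

    def write(self, value):
        # Local update; the server only sees a periodic update to the compact.
        self.cached_value = value
        self.state = "dirty"

    def status(self):
        return {"item": self.item_id, "state": self.state}


c = Compact("account-42", cached_value=100,
            consistency_rules={"max_staleness_s": 300},
            obligations=["report updates on reconnect"])
c.write(80)
print(c.status())   # -> {'item': 'account-42', 'state': 'dirty'}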
The management of compacts is performed cooperatively by the compact manager on the database server and the compact agent on each mobile host. Compacts are obtained from the database server on request, when a data demand is created by the MH. If data is available to satisfy the request, the database server creates a compact with the help of the compact manager. The compact is then recorded in the compact store and transmitted to the MH to provide the data and methods needed to satisfy the transactions executing on the MH. It is possible to transmit only the missing or outdated components of a compact, which avoids the expensive transmission of compact methods already available on the MH. Once the compact is received by the MH, it is recorded in the compact registry, which is used by the compact agent to track the location and status of all local compacts.
Each compact has a common interface which is used by the compact agent to manage the compacts
in the compact registry list and to perform updates submitted by transactions run by applications
executing on the MH. The implementation of a common interface simplifies the design of the
compact agent and guarantees minimum acceptable functionality of a specific compact instance.
Additionally, each compact can have specialized methods which support the particular type of data
or concurrency control methods specific to itself.
Compacts are managed by the compact agent which, similar to the cache management daemon in the Coda file system, handles disconnections and manages storage on the MH. The compact agent monitors activity and interacts with the user and applications to determine the candidates for caching. Unlike the Coda daemon, the compact agent also acts as a transaction manager for transactions executing on the MH, and is thus responsible for concurrency control, logging and recovery.
After a disconnection, while reconnecting to the database, the MH identifies the group of compacts whose states reflect the updates of the locally committed transactions. The transactions in this subset are split from the uncommitted transactions and communicated to the compact manager, which creates a split transaction for this group of updates. The compact manager then commits this split transaction into the database, making the updates visible to all transactions (fixed or mobile) waiting for server commitment. All of this happens without releasing the locks held by the compact manager root transaction.
Limiting all database access to the compact manager can provide a nested-split transaction
processing capability to the database server. If the compact manager is the only means to access
the database, every item in the database can be considered implicitly locked by the root transaction.
When an item is needed by a MH, the compact manager can read the data value and immediately
release any actual (i.e. server imposed) locks on the data item, knowing that it will not be accessed
by any transaction unknown to the compact manager. During the reconnection, the compact
manager locks the items necessary for the “split transaction”, writes the updates to the data items,
commits the “split transaction”, and re-reads and releases the altered items, maintaining the
implicit lock.
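One way to read the reconnection protocol above is as a fixed sequence of server-side steps. The sketch below captures that sequence under simplifying assumptions (a single-threaded server and a lock set standing in for the DBMS lock manager); none of these names come from PRO-MOTION itself.

# Minimal sketch of the server-side resynchronization sequence described above.

class ServerDatabase:
    def __init__(self, data):
        self.data = dict(data)
        self.locked = set()          # items explicitly locked on the server

    def commit_split_transaction(self, updates):
        """Apply locally committed MH updates as one split transaction."""
        items = list(updates)
        # 1. Lock the items needed for the split transaction.
        self.locked.update(items)
        # 2. Write the updates and commit them as a unit.
        self.data.update(updates)
        # 3. Re-read and release the altered items; the compact manager's
        #    root transaction keeps them implicitly locked.
        snapshot = {k: self.data[k] for k in items}
        self.locked.difference_update(items)
        return snapshot


server = ServerDatabase({"x": 1, "y": 2})
print(server.commit_split_transaction({"x": 10}))   # -> {'x': 10}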
Compact agents perform hoarding when the mobile host is connected to the network and the compact manager is storing compacts in preparation for an eventual disconnection. Hoarding utilizes a list of resources required for processing transactions on the mobile host. The resource list is built and maintained on the MH, and the compact agent adds items to the list by monitoring the usage of items by running applications. An expiration mechanism is used for matching the server-side compacts, resynchronization and garbage collection. Compact agents also perform disconnected processing when the mobile host is disconnected from the network and the compact manager is processing transactions locally. The compact manager maintains an event log, which is used for managing transaction processing, recovery, and resynchronization on the MH.
Local commitment is permitted to make the results visible to other transactions on the MH, accepting the possibility of an eventual failure to commit at the server. Transactions which do not have a local option will not commit locally until the updates have committed at the server. Because more than one compact may be used in a single transaction, the commitment of a transaction is performed using a two-phase commit protocol where all participants reside on the MH. On the other hand, resynchronization occurs when the MH has reconnected to the network and the compact agent is reconciling the updates committed during the disconnection with the fixed database.
PRO-MOTION uses a ten-level scale to characterize the correctness of a transaction execution; currently it is based on the degrees of isolation defined in the ANSI SQL standard. Compacts are written in Java, and much of the code is maintained in the Java Virtual Machine and need not be replicated in each compact. Simple compacts have been implemented, and studies are continuing on designing a database server supporting compacts. It is claimed that PRO-MOTION offers many advantages over other proposed systems, which rely on the application to enforce consistency, whereas PRO-MOTION uses a data-centric approach.
Toggle Transactions
A mobile multidatabase system is defined to be a collection of autonomous databases connected to a fixed network together with a Mobile Multidatabase Management System (MMDBMS). The MMDBMS is a set of software modules that resides on the fixed network system. The respective Database Management System (DBMS) of each independent database has complete control over its database, so the DBMSs can differ in the data models and transaction management mechanisms they use. Each local database provides a service interface that specifies the operations accepted and the services provided to the MMDBMS. Local transactions executed by local users are transparent to the MMDBMS. Global users, either static or mobile, are capable of accessing multiple databases by submitting global transactions to the MMDBMS.
It is assumed that there is no need to define integrity constraints on data items residing at different sites, since each local DBMS ensures that the site-transactions executed by it do not violate any local integrity constraints, and global transactions therefore satisfy the consistency property. Similarly, the Global Transaction Manager (GTM), which manages the execution of global transactions, can rely on the durability property of the local DBMSs to ensure the durability of committed global transactions. So the GTM need only enforce the atomicity and isolation properties. In addition, the GTM of the MMDBMS should address disconnections and migrating transactions. The interactive nature of global transactions, as well as disconnection and migration, prolong the execution time of global transactions, which can therefore be referred to as Long-Lived Transactions (LLTs). The GTM must minimize the ill effects upon LLTs that can be caused by conflicts with other transactions.
A transaction management technique that addresses the above issues has been proposed. In the Toggle Transaction Management (TTM) technique, the global transaction manager is designed to consist of two layers: the Global Coordinator layer and the Site Manager layer. The Global Coordinator layer consists of Global Transaction Coordinators (GTCs) in each MSS and manages the overall execution and migration of global transactions. The Site Manager layer consists of Site Transaction Managers (STMs) at the participating database sites and supervises the execution of vital or non-vital site-transactions. Each global transaction has a data structure that contains the current execution status of that transaction and follows the user as it migrates from MSS to MSS. The main communication framework is shown in Figure 6.
Global transactions are based on the Multi-Level transaction model, in which the global transaction consists of a set of compensatable transactions. The vital site-transactions must succeed in order for the global transaction to succeed; the abort of non-vital site-transactions does not force the global transaction to be aborted. In this way, restrictions can be placed to enforce the atomicity and isolation levels. Global transactions are initiated at some GTC component of the GTM. The GTC submits the site-transactions to the STMs, handles disconnections and migration of the user, logs responses that cannot be delivered to the disconnected user, and enforces the atomicity and isolation properties.
Two new states are defined in TTM to support disconnected operations: Disconnected and Suspended. On a disconnection, the transactions are put into the Disconnected state and execution is allowed to continue. If the disconnection stems from a catastrophic failure, the transactions are put into the Suspended state and execution is suspended. In this way, needless aborts are minimized.
In order to minimize the ill effects of the extended execution time of mobile transactions, a global transaction can state its intent to commit by executing a toggle operation. If the operation is successful, the GTM guarantees that the transaction will not be aborted due to atomicity or isolation violations unless the transaction is suspended. Whenever a transaction requests to commit or to be toggled, the TTM technique executes the Partial Global Serialization Graph (PGSG) commit algorithm to verify the atomicity and isolation properties. If the atomicity verification fails, the transaction is aborted; otherwise, the isolation property is checked. If an isolation violation cannot be resolved, the transaction is aborted; otherwise the commit or toggle operation succeeds.
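The commit/toggle decision just described is essentially a two-step check. The fragment below sketches that control flow; the predicate functions passed in merely stand for the PGSG atomicity and isolation tests, whose actual algorithms are not given here, so treat every name as an illustrative assumption.

# Minimal sketch of the commit/toggle decision flow in the TTM technique.

def commit_or_toggle(txn, atomicity_ok, isolation_ok, resolve_violation):
    """Return 'committed'/'toggled' on success, or 'aborted'.

    atomicity_ok, isolation_ok and resolve_violation are caller-supplied
    callables standing in for the PGSG checks described in the text.
    """
    if not atomicity_ok(txn):
        return "aborted"                      # first verification: atomicity
    if not isolation_ok(txn) and not resolve_violation(txn):
        return "aborted"                      # unresolved isolation violation
    # Success: a toggled transaction is now protected from aborts caused by
    # atomicity/isolation violations unless it becomes suspended.
    return "toggled" if txn.get("request") == "toggle" else "committed"


txn = {"id": "GT-3", "request": "toggle"}
print(commit_or_toggle(txn,
                       atomicity_ok=lambda t: True,
                       isolation_ok=lambda t: False,
                       resolve_violation=lambda t: True))   # -> toggled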
In the TTM technique, concurrency is limited because all site-transactions that execute at each site are forced to conflict with each other. The artificial conflicts generated by the algorithm can be eliminated by exploiting semantic information about site-transactions. Each service interface would then need to provide conflict information on all operations accepted by that site. This information would be used to generate conflicts only between site-transactions that actually conflict with each other.
TEMPORAL DATABASE
A Temporal Database is a database with built-in support for handling time-sensitive data. Usually, databases store information only about the current state, and not about past states. For example, in an employee database, if the address or salary of a particular person changes, the database gets updated and the old value is no longer there. However, for many applications it is important to maintain the past or historical values and the time at which the data was updated; that is, knowledge of the evolution of the data is required. That is where temporal databases are useful: they store information about the past, present and future. Any data that is time dependent is called temporal data, and such data are stored in temporal databases.
Temporal databases store information about states of the real world across time. A temporal database has built-in support for handling data involving time, storing information relating to the past, present and future time of events.
Applications of temporal databases include:
● Healthcare systems: Doctors need a patient's health history for proper diagnosis, including information such as the time a vaccination was given or the exact time when a fever went high.
● Insurance systems: Information about claims, accident history, and the times when policies are in effect needs to be maintained.
Temporal Aspects
Valid Time: Time period during which a fact is true in real world, provided to the system.
Transaction Time: Time period during which a fact is stored in the database, based on
transaction serialization order and is the timestamp generated automatically by the system.
Temporal Relation
Temporal Relation is one where each tuple has associated time; either valid time or transaction
time or both associated with it.
Uni-Temporal Relations: Has one axis of time, either Valid Time or Transaction Time.
Bi-Temporal Relations: Has both axis of time – Valid time and Transaction time. It
includes Valid Start Time, Valid End Time, Transaction Start Time, Transaction End
Time.
For example, suppose John was born in Chennai on April 3, 1992. His father registered his birth three days later, on April 6, 1992. In a non-temporal database, John's address is entered as Chennai from 1992. When he registers his new address in 2016, the database gets updated and the address field now shows his Mumbai address. The previous Chennai address details will not be available, so it will be difficult to find out exactly when he was living in Chennai and when he moved to Mumbai.
To make the above example a temporal database, we add the time aspect to the database. First, let us add the valid time, which is the time period for which a fact is true in the real world. A valid time period may be in the past, span the current time, or occur in the future.
The valid time temporal database contents look like this:
Name, City, Valid From, Valid Till
In our example, John was born on 3rd April 1992. Even though his father registered his birth three days later, the valid time entry would be 3rd April 1992. There are two entries for the valid time: the Valid Start Time and the Valid End Time. So in this case 3rd April 1992 is the valid start time; since we do not know the valid end time, we record it as infinity. When John's father registers his birth on 6th April 1992, a new database entry is made:
Person(John, Chennai, 3-Apr-1992, ∞)
Similarly, John changes his address to Mumbai on 10th Jan 2016. However, he has been living in Mumbai since 21st June of the previous year, so his valid start time for Mumbai would be 21 June 2015.
Uni-temporal Database
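To make the valid-time bookkeeping concrete, here is a small sketch of how a uni-temporal (valid time only) update could be applied. The tuple layout follows the Person(Name, City, Valid From, Valid Till) example above; the helper names are illustrative only.

# Minimal sketch of a uni-temporal (valid time) relation for the Person example.

from datetime import date, timedelta

INFINITY = date.max                      # stand-in for an open-ended valid time

# Person rows: (name, city, valid_from, valid_till)
person = [("John", "Chennai", date(1992, 4, 3), INFINITY)]


def change_address(rows, name, new_city, valid_from):
    """Close the current valid-time row and open a new one."""
    updated = []
    for (n, city, v_from, v_till) in rows:
        if n == name and v_till == INFINITY:
            # The old fact stops being true the day before the new fact starts.
            updated.append((n, city, v_from, valid_from - timedelta(days=1)))
        else:
            updated.append((n, city, v_from, v_till))
    updated.append((name, new_city, valid_from, INFINITY))
    return updated


person = change_address(person, "John", "Mumbai", date(2015, 6, 21))
for row in person:
    print(row)
# John lived in Chennai from 1992-04-03 until 2015-06-20,
# and in Mumbai from 2015-06-21 onwards (valid till = infinity).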
Bi-Temporal Relation (John’s Data Using Both Valid And Transaction Time)
Next we'll see a bi-temporal database, which includes both the valid time and the transaction time. Transaction time records the time period during which a database entry is stored in the database. So now each row carries four time attributes: valid from, valid till, transaction entered and transaction superseded.
Similarly, when John registers his change of address to Mumbai, a new entry is made. The valid from time for this entry is 21st June 2015, the actual date from which he started living in Mumbai, whereas the transaction entered time is 10th January 2016. We do not know how long he will be living in Mumbai, so the transaction superseded time and the valid end time are recorded as infinity. At the same time, the original entry is updated with its valid till time and its transaction superseded time.
John | Chennai | Valid From: April 3, 1992 | Valid Till: June 20, 2015 | Transaction Entered: April 6, 1992 | Transaction Superseded: Jan 10, 2016
Bi-temporal Database
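A bi-temporal update touches both time dimensions at once: the superseded row keeps its data but gets a transaction-superseded timestamp, while new rows record the corrected valid period. The sketch below is one possible encoding of John's example; the row layout and helper names are assumptions for illustration, not a standard API.

# Minimal sketch of a bi-temporal update for the John example.

from datetime import date, timedelta

INF = date.max     # open-ended valid time / still-current transaction time

# Rows: dicts with valid-time and transaction-time columns.
rows = [{
    "name": "John", "city": "Chennai",
    "valid_from": date(1992, 4, 3), "valid_till": INF,
    "tx_entered": date(1992, 4, 6), "tx_superseded": INF,
}]


def register_move(rows, name, new_city, moved_on, registered_on):
    """Supersede the current row and insert the corrected history."""
    current = next(r for r in rows
                   if r["name"] == name and r["tx_superseded"] == INF)
    # The old row is not deleted; it is marked as superseded at registration time.
    current["tx_superseded"] = registered_on
    # New row: the old city with its now-known valid end date (the day before the move).
    rows.append({**current, "valid_till": moved_on - timedelta(days=1),
                 "tx_entered": registered_on, "tx_superseded": INF})
    # New row: the new city, valid from the move date, open-ended.
    rows.append({"name": name, "city": new_city,
                 "valid_from": moved_on, "valid_till": INF,
                 "tx_entered": registered_on, "tx_superseded": INF})
    return rows


register_move(rows, "John", "Mumbai",
              moved_on=date(2015, 6, 21), registered_on=date(2016, 1, 10))
# rows now answer both kinds of question:
#  - valid time:       where did John live in 2001?            -> Chennai
#  - transaction time: what did the database say on 2015-12-31? -> Chennai, open-ended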
Advantages
The main advantage of bi-temporal relations is that they provide both historical and rollback information. For example, you can answer a query on John's history such as: Where did John live in the year 2001? The result for this query can be obtained from the valid time entries. The transaction time entries are important for obtaining rollback information.
Databases with built-in temporal support include Oracle and IBM DB2.
DEDUCTIVE DATABASE
A deductive database is a database system that can make deductions (i.e. conclude additional
facts) based on rules and facts stored in the (deductive) database. Datalog is the language typically
used to specify facts, rules and queries in deductive databases. Deductive databases have grown
out of the desire to combine logic programming with relational databases to construct systems that
support a powerful formalism and are still fast and able to deal with very large datasets. Deductive
databases are more expressive than relational databases but less expressive than logic
programming systems. In recent years, deductive databases such as Datalog have found new
application in data integration, information extraction, networking, program analysis, security,
and cloud computing.[1]
Deductive databases reuse many concepts from logic programming; rules and facts specified in the deductive database language Datalog look very similar to those in Prolog. However, there are important differences between deductive databases and logic programming:
Order sensitivity and procedurality: In Prolog, program execution depends on the order of
rules in the program and on the order of parts of rules; these properties are used by
programmers to build efficient programs. In database languages (like SQL or Datalog),
however, program execution is independent of the order of rules and facts.
Attribute naming: In a relational database, the meaning of an attribute value in a tuple is partly determined by the attribute names. In a deductive database, the meaning of an attribute value
in a tuple is determined solely by its position within the tuple. Rules are somewhat similar to
relational views. They specify virtual relations that are not actually stored but that can be formed
from the facts by applying inference mechanisms based on the rule specifications. The main
difference between rules and views is that rules may involve recursion and hence may yield virtual
relations that cannot be defined in terms of basic relational views.
The evaluation of Prolog programs is based on a technique called backward chaining, which
involves a top-down evaluation of goals. In the deductive databases that use Datalog, attention has
been devoted to handling large volumes of data stored in a relational database. Hence, evaluation
techniques have been devised that resemble those for a bottom-up evaluation. Prolog suffers from
the limitation that the order of specification of facts and rules is significant in evaluation; moreover,
the order of literals (defined in Section 26.5.3) within a rule is significant. The execution
techniques for Datalog programs attempt to circumvent these problems.
2. Prolog/Datalog Notation
The notation used in Prolog/Datalog is based on providing predicates with unique names.
A predicate has an implicit meaning, which is suggested by the predicate name, and a fixed
number of arguments. If the arguments are all constant values, the predicate simply states that a
certain fact is true. If, on the other hand, the predicate has variables as arguments, it is either
considered as a query or as part of a rule or constraint. In our discussion, we adopt the Prolog
convention that all constant
values in a predicate are either numeric or character strings; they are represented as identifiers (or
names) that start with a lowercase letter, whereas variable names always start with an uppercase
letter.
Consider the example shown in Figure 26.11, which is based on the relational database in Figure 3.6, but in a much simplified form. There are three predicate names: supervise,
superior, and subordinate. The SUPERVISE predicate is defined via a set of facts, each of which
has two arguments: a supervisor name, followed by the name of a direct supervisee (subordinate)
of that supervisor. These facts correspond to the actual data that is stored in the database, and they
can be considered as constituting a set of tuples in a relation SUPERVISE with two attributes whose schema is
SUPERVISE(Supervisor, Supervisee)
Thus, SUPERVISE(X, Y ) states the fact that X supervises Y. Notice the omission of the attribute
names in the Prolog notation. Attribute names are only represented by virtue of the position of
each argument in a predicate: the first argument represents the supervisor, and the second argument
represents a direct subordinate.
The other two predicate names are defined by rules. The main contributions of deductive databases are the ability to specify recursive rules and to provide a framework for inferring new information based on the specified rules. A rule is of the form head :– body, where :– is read as if and only if. A rule usually has a single predicate to the left of the :– symbol—called the head or left-hand side (LHS) or conclusion of the rule—and one or more predicates to the right of the :– symbol—called the body or right-hand side (RHS) or premise(s) of the rule. A predicate with
constants as arguments is said to be ground; we also refer to it as an instantiated predicate. The
arguments of the predicates that appear in a rule typically include a number of variable symbols,
although predicates can also contain constants as arguments. A rule specifies that, if a particular
assignment or binding of constant values to the variables in the body (RHS predicates)
makes all the RHS predicates true, it also makes the head (LHS predicate) true by using the same
assignment of constant values to variables. Hence, a rule provides us with a way of generating new
facts that are instantiations of the head of the rule. These new facts are based on facts that already
exist, corresponding to the instantiations (or bindings) of predicates in the body of the rule. Notice
that by listing multiple predicates in the body of a rule we implicitly apply the logical
AND operator to these predicates. Hence, the commas between the RHS predicates may be read
as meaning and.
Consider the definition of the predicate SUPERIOR in Figure 26.11, whose first argument is an
employee name and whose second argument is an employee who is either a direct or
an indirect subordinate of the first employee. By indirect subordinate, we mean the subordinate of
some subordinate down to any number of levels. Thus SUPERIOR(X, Y ) stands for the fact that X
is a superior of Y through direct or indirect supervision. We can write two rules that together specify
the meaning of the new predicate. The first rule under Rules in the figure states that for every value
of X and Y, if SUPERVISE(X, Y)—the rule body—is true, then SUPERIOR(X, Y )—the rule
head—is also true, since Y would be a direct subordinate of X (at one level down). This rule can
be used to generate all direct superior/subordinate relationships from the facts that define the SUPERVISE predicate. The second recursive rule states that if SUPERVISE(X,
Z) and SUPERIOR(Z, Y ) are both true, then SUPERIOR(X, Y) is also true. This is an example of
a recursive rule, where one of the rule body predicates in the RHS is the same as the rule head
predicate in the LHS. In general, the rule body defines a number of premises such that if they are
all true, we can deduce that the conclusion in the rule head is also true. Notice that if we have two
(or more) rules with the same head (LHS predicate), it is equivalent to saying that the predicate is
true (that is, that it can be instantiated) if either one of the bodies is true; hence, it is equivalent to
a logical OR operation. For example, if we have two rules X:– Y and X :– Z, they are equivalent to
a rule X :– Y OR Z. The latter form is not used in deductive systems, however, because it is not in the standard form of a rule, called a Horn clause, as we discuss in Section 26.5.4.
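To see how the two SUPERIOR rules actually generate new facts, the short sketch below runs a naive bottom-up (fixpoint) evaluation over the SUPERVISE facts used later in Figure 26.13. The evaluation strategy shown is generic Datalog bottom-up evaluation, not any specific engine, and the function names are illustrative.

# Naive bottom-up evaluation of:
#   SUPERIOR(X, Y) :- SUPERVISE(X, Y).
#   SUPERIOR(X, Y) :- SUPERVISE(X, Z), SUPERIOR(Z, Y).

SUPERVISE = {
    ("franklin", "john"), ("franklin", "ramesh"), ("franklin", "joyce"),
    ("jennifer", "alicia"), ("jennifer", "ahmad"),
    ("james", "franklin"), ("james", "jennifer"),
}


def superior_fixpoint(supervise):
    superior = set(supervise)                 # rule 1: every SUPERVISE fact
    while True:
        # rule 2: join SUPERVISE(X, Z) with SUPERIOR(Z, Y) to derive SUPERIOR(X, Y)
        derived = {(x, y) for (x, z) in supervise
                          for (z2, y) in superior if z == z2}
        if derived <= superior:               # fixpoint reached: nothing new derived
            return superior
        superior |= derived


SUPERIOR = superior_fixpoint(SUPERVISE)
print(("james", "ahmad") in SUPERIOR)         # -> True  (SUPERIOR(james, ahmad))
print(sorted(y for (x, y) in SUPERIOR if x == "james"))
# -> all direct and indirect subordinates of james, as in Figure 26.13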
A Prolog system contains a number of built-in predicates that the system can interpret directly.
These typically include the equality comparison operator =(X, Y), which returns true if X and Y are
identical and can also be written as X=Y by using the standard infix notation. Other comparison
operators for numbers, such as <, <=, >, and >=, can be treated as binary predicates. Arithmetic
functions such as +, –, *, and / can be used as arguments in predicates in Prolog. In contrast,
Datalog (in its basic form) does not allow functions such as arithmetic operations as arguments;
indeed, this is one of the main differences between Prolog and Datalog. However, extensions to
Datalog have been proposed that do include functions.
A query typically involves a predicate symbol with some variable arguments, and its meaning
(or answer) is to deduce all the different constant combinations that, when bound (assigned) to
the variables, can make the predicate true. For example, the first query in Figure 26.11 requests
the names of all subordinates of james at any level. A different type of query, which has only
constant symbols as arguments, returns either a true or a false result, depending on whether the
arguments provided can be deduced from the facts and rules. For example, the second query in
Figure 26.11 returns true, since SUPERIOR(james, joyce) can be deduced.
3. Datalog Notation
In Datalog, as in other logic-based languages, a program is built from basic objects called atomic
formulas. It is customary to define the syntax of logic-based languages by describing the syntax
of atomic formulas and identifying how they can be combined to form a program. In Datalog,
atomic formulas are literals of the form p(a1, a2, ..., an), where p is the predicate name and n is
the number of arguments for predicate p. Different predicate symbols can have different numbers
of arguments, and the number of arguments n of predicate p is sometimes called
the arity or degree of p. The arguments can be either constant values or variable names.
As mentioned earlier, we use the convention that constant values either are numeric or start with
a lowercase character, whereas variable names always start with an uppercase character.
A number of built-in predicates are included in Datalog, which can also be used to construct
atomic formulas. The built-in predicates are of two main types: the binary comparison predicates
< (less), <= (less_or_equal), > (greater), and >= (greater_or_equal) over ordered domains; and the
comparison predicates = (equal) and /= (not_equal) over ordered or unordered domains. These can
be used as binary predicates with the same functional syntax as other predicates—for example, by
writing less(X, 3)—or they can be specified by using the customary infix notation X<3. Note that
because the domains of these predicates are potentially infinite, they should be used with care in
rule definitions. For example, the predicate greater(X, 3), if used alone, generates an infinite set of
values for X that satisfy the predicate (all integer numbers greater than 3).
Recall from Section 6.6 that a formula in the relational calculus is a condition that includes
predicates called atoms (based on relation names). Additionally, a formula can have quantifiers—
namely, the universal quantifier (for all) and the existential quantifier (there exists). In clausal
form, a formula must be transformed into another formula with the following characteristics:
· All variables in the formula are universally quantified. Hence, it is not necessary to include
the universal quantifiers (for all) explicitly; the quantifiers are removed, and all variables in the
formula are implicitly quantified by the universal quantifier.
· In clausal form, the formula is made up of a number of clauses, where each clause is
composed of a number of literals connected by OR logical connectives only. Hence, each clause
is a disjunction of literals.
· The clauses themselves are connected by AND logical connectives only, to form a formula.
Hence, the clausal form of a formula is a conjunction of clauses.
It can be shown that any formula can be converted into clausal form. For our purposes, we are
mainly interested in the form of the individual clauses, each of which is a disjunction of literals.
Recall that literals can be positive literals or negative literals. Consider a clause of the form:
NOT(P1) OR NOT(P2) OR ... OR NOT(Pn) OR Q1 OR Q2 OR ... OR Qm        (1)
This clause has n negative literals and m positive literals. Such a clause can be transformed into the following equivalent logical formula:
P1 AND P2 AND ... AND Pn ⇒ Q1 OR Q2 OR ... OR Qm        (2)
where ⇒ is the implies symbol. The formulas (1) and (2) are equivalent, meaning that their truth
values are always the same. This is the case because if all the Pi literals (i = 1, 2, ..., n) are true,
the formula (2) is true only if at least one of the Qi’s is true, which is the meaning of
the ⇒ (implies) symbol. For formula (1), if all the Pi literals (i = 1, 2, ..., n) are true, their negations
are all false; so in this case formula
(1) is true only if at least one of the Qi's is true. In Datalog, rules are expressed as a restricted form of clauses called Horn clauses, in which a clause can contain at most one positive literal. Hence, a Horn clause is either of the form
NOT(P1) OR NOT(P2) OR ... OR NOT(Pn) OR Q        (3)
or of the form
NOT(P1) OR NOT(P2) OR ... OR NOT(Pn)        (4)
The Horn clause in (3) can be transformed into the clause
P1 AND P2 AND ... AND Pn ⇒ Q        (5)
which is written in Datalog as the following rule:
Q :– P1, P2, ..., Pn.        (6)
The Horn clause in (4) can be transformed into
P1 AND P2 AND ... AND Pn ⇒        (7)
which is written in Datalog as follows:
:– P1, P2, ..., Pn.        (8)
A Datalog rule, as in (6), is hence a Horn clause, and its meaning, based on formula (5), is that if the predicates P1 AND P2 AND ... AND Pn are all true for a particular binding to their variable arguments, then Q is also true and can hence be inferred. The Datalog expression (8) can be considered as an integrity constraint, where all the predicates must be true to satisfy the query.
A Prolog or Datalog system has an internal inference engine that can be used to process and compute the results of such queries. Prolog inference engines typically return one result to the query (that is, one set of values for the variables in the query) at a time and must be prompted to return additional results. In contrast, Datalog returns results set-at-a-time.
5. Interpretations of Rules
There are two main alternatives for interpreting the theoretical meaning of rules: proof-
theoretic and model-theoretic. In practical systems, the inference mechanism within a system
defines the exact interpretation, which may not coincide with either of the two theoretical
interpretations. The inference mechanism is a computational procedure and hence provides a
computational interpretation of the meaning of rules. In this section, first we discuss the two
theoretical interpretations. Then we briefly discuss inference mechanisms as a way of defining the
meaning of rules.
In the proof-theoretic interpretation of rules, we consider the facts and rules to be true statements, or axioms. Ground axioms contain no variables. The facts are ground axioms that are given to be true. Rules are called deductive axioms, since they can be used to deduce new facts. The deductive axioms can be used to construct proofs that derive new facts from existing facts. For example, Figure 26.12 shows how to prove the fact SUPERIOR(james, ahmad) from the rules and facts given in Figure 26.11.
The second type of interpretation is called the model-theoretic interpretation. Here, given a finite or an infinite domain of constant values, we assign to a predicate every possible combination of values as arguments. We must then determine whether the predicate is true or false. In general, it is sufficient to specify the combinations of arguments that make the predicate true, and to state that all other combinations make the predicate false. If this is done for every predicate, it is called an interpretation of the set of predicates. For example, consider the interpretation shown in Figure 26.13 for the predicates SUPERVISE and SUPERIOR. This interpretation assigns a truth value (true or false) to every possible combination of argument values (from a finite domain) for the two predicates.
An interpretation is called a model for a specific set of rules if those rules are always true under that interpretation; that is, for any values assigned to the variables in the rules, the head of the rules is true when we substitute the truth values assigned to the predicates in the body of the rule by that interpretation. Hence, whenever a particular substitution (binding) to the variables in the rules is applied, if all the predicates in the body of a rule are true under the interpretation, the predicate in the head of the rule must also be true. The interpretation shown in Figure 26.13 is a model for the two rules shown, since it can never cause the rules to be violated. Notice that a rule is violated if a particular binding of constants to the variables makes all the predicates in the rule body true but makes the predicate in the rule head false. For example, if SUPERVISE(a, b) and SUPERIOR(b, c) are both true under some interpretation, but SUPERIOR(a, c) is not true, the interpretation cannot be a model for the recursive rule:
SUPERIOR(X, Y ) :– SUPERVISE(X, Z ), SUPERIOR(Z, Y ).
In the model-theoretic approach, the meaning of the rules is established by providing a model for these rules. A model is called a minimal model for a set of rules if we cannot change any fact from true to false and still get a model for these rules. For example, consider the interpretation in Figure 26.13, and assume that the SUPERVISE predicate is defined by a set of known facts, whereas the SUPERIOR predicate is defined as an interpretation (model) for the rules. Suppose that we add the predicate SUPERIOR(james, bob) to the true predicates. This remains a model for the rules shown, but it is not a minimal model, since changing the truth value of SUPERIOR(james, bob) from true to false still provides us with a model for the rules. The model shown in Figure 26.13 is the minimal model for the set of facts that are defined by the SUPERVISE predicate.
In general, the minimal model that corresponds to a given set of facts in the model-theoretic interpretation should be the same as the facts generated by the proof-
Rules
SUPERIOR(X, Y ) :– SUPERVISE(X, Y ).
SUPERIOR(X, Y ) :– SUPERVISE(X, Z ), SUPERIOR(Z, Y ).
Interpretation
Known Facts:
SUPERVISE(franklin, john) is true.
SUPERVISE(franklin, ramesh) is true.
SUPERVISE(franklin, joyce) is true.
SUPERVISE(jennifer, alicia) is true.
SUPERVISE(jennifer, ahmad) is true.
SUPERVISE(james, franklin) is true.
SUPERVISE(james, jennifer) is true.
SUPERVISE(X, Y ) is false for all other possible (X, Y ) combinations
Derived Facts:
SUPERIOR(franklin, john) is true.
SUPERIOR(franklin, ramesh) is true.
SUPERIOR(franklin, joyce) is true.
SUPERIOR(jennifer, alicia) is true.
SUPERIOR(jennifer, ahmad) is true.
SUPERIOR(james, franklin) is true.
SUPERIOR(james, jennifer) is true.
SUPERIOR(james, john) is true.
SUPERIOR(james, ramesh) is true.
SUPERIOR(james, joyce) is true.
SUPERIOR(james, alicia) is true.
SUPERIOR(james, ahmad) is true.
SUPERIOR(X, Y ) is false for all other possible (X, Y ) combinations
Figure 26.13 An interpretation that is a minimal model.
theoretic interpretation for the same original set of ground and deductive axioms. However, this is
generally true only for rules with a simple structure. Once we allow negation in the specification
of rules, the correspondence between interpretations does not hold. In fact, with negation,
numerous minimal models are possible for a given set of facts.
A third approach to interpreting the meaning of rules involves defining an inference mechanism that is used by the system to deduce facts from the rules. This inference mechanism would define a computational interpretation to the meaning of the rules. The Prolog logic programming language uses its inference mechanism to define the meaning of the rules and facts in a Prolog program. Not all Prolog programs correspond to the proof-theoretic or model-theoretic interpretations; it depends on the type of rules in the program. However, for many simple Prolog programs, the Prolog inference mechanism infers the facts that correspond either to the proof-theoretic interpretation or to a minimal model under the model-theoretic interpretation.
There are two main methods of defining the truth values of predicates in actual Datalog programs. Fact-defined predicates (or relations) are defined by listing all the combinations of values (the tuples) that make the predicate true. These correspond to base relations whose contents are stored in a database system. Figure 26.14 shows the fact-defined predicates EMPLOYEE, MALE, FEMALE, DEPARTMENT, SUPERVISE, PROJECT, and WORKS_ON, which correspond to part of the relational database shown in Figure 3.6. Rule-defined predicates (or views) are defined by being the head (LHS) of one or more Datalog rules; they correspond to virtual relations whose contents can be inferred by the inference engine. Figure 26.15 shows a number of rule-defined predicates.
A program or a rule is said to be safe if it generates a finite set of facts. The general theoretical
problem of determining whether a set of rules is safe is undecidable. However, one can determine
the safety of restricted forms of rules. For example, the rules shown in Figure 26.16 are safe. One
situation where we get unsafe rules that can generate an infinite number of facts arises when one
of the variables in the rule can range over an infinite domain of values, and that variable is not
limited to ranging over a finite relation. For example, consider the following rule:
BIG_SALARY(Y ) :– Y>60000
Here, we can get an infinite result if Y ranges over all possible integers. But suppose that we change the rule as follows:
BIG_SALARY(Y ) :– Y>60000, EMPLOYEE(X ), SALARY(X, Y )
In this case, the rule is still theoretically safe. However, in Prolog or any other system that uses a
top-down, depth-first inference mechanism, the rule creates an infinite loop, since we first search
for a value for Y and then check whether it is a salary of an employee. The result is generation of
an infinite number of Y values, even though these, after a certain point, cannot lead to a set of true
RHS predicates. One definition of Datalog considers both rules to be safe, since it does not depend
on a particular inference mechanism. Nonetheless, it is generally advisable to write such a rule in
the safest form, with the predicates that restrict possible bindings of variables placed first. As
another example of an unsafe rule, consider the following rule:
HAS_SOMETHING(X, Y ) :– EMPLOYEE(X )
REL_ONE(A, B, C ).
REL_TWO(D, E, F ).
REL_THREE(G, H, I, J ).
SELECT_ONE_A_EQ_C(X, Y, Z ) :– REL_ONE(X, Y, Z ), X=c.
SELECT_ONE_B_LESS_5(X, Y, Z ) :– REL_ONE(X, Y, Z ), Y<5.
SELECT_ONE_A_EQ_C_AND_B_LESS_5(X, Y, Z ) :– REL_ONE(X, Y, Z ), X=c, Y<5.
SELECT_ONE_A_EQ_C_OR_B_LESS_5(X, Y, Z ) :– REL_ONE(X, Y, Z ), X=c.
SELECT_ONE_A_EQ_C_OR_B_LESS_5(X, Y, Z ) :– REL_ONE(X, Y, Z ), Y<5.
PROJECT_THREE_ON_G_H(W, X ) :– REL_THREE(W, X, Y, Z ).
UNION_ONE_TWO(X, Y, Z ) :– REL_ONE(X, Y, Z ).
UNION_ONE_TWO(X, Y, Z ) :– REL_TWO(X, Y, Z ).
INTERSECT_ONE_TWO(X, Y, Z ) :– REL_ONE(X, Y, Z ), REL_TWO(X, Y, Z ).
DIFFERENCE_TWO_ONE(X, Y, Z ) :– REL_TWO(X, Y, Z ), NOT(REL_ONE(X, Y, Z )).
CART_PROD_ONE_THREE(T, U, V, W, X, Y, Z ) :–
REL_ONE(T, U, V ), REL_THREE(W, X, Y, Z ).
NATURAL_JOIN_ONE_THREE_C_EQ_G(U, V, W, X, Y, Z ) :–
REL_ONE(U, V, W ), REL_THREE(W, X, Y, Z ).
Figure 26.16 Specifying some relational operations as Datalog rules.
Here, an infinite number of Y values can again be generated, since the variable Y appears only in
the head of the rule and hence is not limited to a finite set of values. To define safe rules more
formally, we use the concept of a limited variable. A variable X is limited in a rule if (1) it appears
in a regular (not built-in) predicate in the body of the rule; (2) it appears in a predicate of the
form X=c or c=X or (c1<=X and X<=c2) in the rule body, where c, c1, and c2 are constant values;
or (3) it appears in a predicate of the form X=Y or Y=X in the rule body, where Y is a limited
variable. A rule is said to be safe if all its variables are limited.
It is straightforward to specify many operations of the relational algebra in the form of Datalog
rules that define the result of applying these operations on the database relations (fact predicates).
This means that relational queries and views can easily be specified in Datalog. The additional
power that Datalog provides is in the specification of recursive queries, and views based on
recursive queries. In this section, we show how some of the standard relational operations can be
specified as Datalog rules. Our examples will use the base relations (fact-defined
predicates) REL_ONE, REL_TWO, and REL_THREE, whose schemas are shown in Figure
26.16. In Datalog, we do not need to specify the attribute names as in Figure 26.16; rather, the
arity (degree) of each predicate is the important aspect. In a practical system, the domain (data
type) of each attribute is also important for operations such as UNION, INTERSECTION,
and JOIN, and we assume that the attribute types are compatible for the various operations, as
discussed in Chapter 3.
Figure 26.16 illustrates a number of basic relational operations. Notice that if the Datalog model
is based on the relational model and hence assumes that predicates (fact relations and query results)
specify sets of tuples, duplicate tuples in the same predicate are automatically eliminated. This
may or may not be true, depending on the Datalog inference engine. However, it is
definitely not the case in Prolog, so any of the rules in Figure 26.16 that involve duplicate
elimination are not correct for Prolog. For example, if we want to specify Prolog rules for
the UNION operation with duplicate elimination, we must rewrite them as follows:
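A sketch of such a rewrite, assuming a negation (NOT) predicate so that tuples already produced by the first rule are excluded by the second:
UNION_ONE_TWO(X, Y, Z ) :– REL_ONE(X, Y, Z ).
UNION_ONE_TWO(X, Y, Z ) :– REL_TWO(X, Y, Z ), NOT(REL_ONE(X, Y, Z )).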
If a query involves only fact-defined predicates, the inference becomes one of searching among
the facts for the query result. For example, a query such as
DEPARTMENT(X, Research)?
is a selection of all employee names X who work for the Research department. In relational
algebra, it is the query:
πX (σY='Research' (DEPARTMENT))
which can be answered by searching through the fact-defined predicate DEPARTMENT(X, Y ). The
query involves relational SELECT and PROJECT operations on a base relation, and it can be
handled by the database query processing and optimization techniques discussed in Chapter 19.
When a query involves rule-defined predicates, the inference mechanism must compute the result
based on the rule definitions. If a query is nonrecursive and involves a predicate p that appears as
the head of a rule p :– p1, p2, ..., pn, the strategy is first to compute the relations corresponding
to p1, p2, ..., pn and then to compute the relation corresponding to p. It is useful to keep track of
the dependency among the predicates of a deductive database in a predicate dependency graph.
Figure 26.17 shows the graph for the fact and rule predicates shown in Figures 26.14 and 26.15.
The dependency graph contains a node for each predicate. Whenever a predicate A is specified in
the body (RHS) of a rule, and the head (LHS) of that rule is the predicate B, we say that B depends
on A, and we draw a directed edge from A to B. This indicates that in order to compute the facts
for the predicate B (the rule head), we must first compute the facts for all the predicates A in the
rule body. If the dependency graph has no cycles, we call the rule set nonrecursive. If there is at
least one cycle, we call the rule set recursive. In Figure 26.17, there is one recursively defined
predicate—namely, SUPERIOR—which has a recursive edge pointing back to itself. Additionally,
because the predicate SUBORDINATE depends on SUPERIOR, it also requires recursion in computing
its result.
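For reference, the recursive rules behind these two predicates typically look like the following sketch, built on the SUPERVISE fact predicate of Figure 26.14:
SUPERIOR(X, Y ) :– SUPERVISE(X, Y ).
SUPERIOR(X, Y ) :– SUPERVISE(X, Z ), SUPERIOR(Z, Y ).
SUBORDINATE(X, Y ) :– SUPERIOR(Y, X ).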
A query that includes only nonrecursive predicates is called a nonrecursive query. In this section
we discuss only inference mechanisms for nonrecursive queries. In Figure 26.17, any query that
does not involve the predicates SUBORDINATE or SUPERIOR is nonrecursive. In the predicate
dependency graph, the nodes corresponding to fact-defined predicates do not have any incoming
edges, since all fact-defined predicates have their facts stored in a database relation. The contents
of a fact-defined predicate can be computed by directly retrieving the tuples in the corresponding
database relation.
The main function of an inference mechanism is to compute the facts that correspond to query
predicates. This can be accomplished by generating a relational expression involving relational
operators such as SELECT, PROJECT, JOIN, UNION, and SET DIFFERENCE (with appropriate
provision for dealing with safety issues) that, when executed, provides the query result. The query
can then be executed by utilizing the internal query processing and optimization operations of a
relational database management system. Whenever the inference mechanism needs to compute
the fact set corresponding to a nonrecursive rule-defined predicate p, it first locates all the rules
that have p as their head. The idea is to compute the fact set for each such rule and then to apply
the UNION operation to the results, since UNION corresponds to a logical OR operation. The
dependency graph indicates all predicates q on which each p depends, and since we assume that
the predicate is nonrecursive, we can always determine a partial order among such predicates q.
Before computing the fact set for p, first we compute the fact sets for all predicates q on
which p depends, based on their partial order. For example, if a query involves the
predicate UNDER_40K_SUPERVISOR, we must first compute both SUPERVISOR
and OVER_40K_EMP. Since the latter two depend only on the fact-defined
predicates EMPLOYEE, SALARY, and SUPERVISE, they can be computed directly from
the stored database relations.
MULTIMEDIA DATABASE
Multimedia database is the collection of interrelated multimedia data that includes text, graphics
(sketches, drawings), images, animations, video, audio, etc., and involves vast amounts of multisource
multimedia data. The framework that manages different types of multimedia data which can be
stored, delivered and utilized in different ways is known as multimedia database management
system. There are three classes of the multimedia database which includes static media, dynamic
media and dimensional media.
The multimedia databases are used to store multimedia data such as images, animation, audio,
and video along with text. This data is stored in the form of multiple file types like .txt (text),
.jpg (images), .swf (videos), .mp3 (audio), etc.
CAP theorem
It is very important to understand the limitations of NoSQL databases. NoSQL cannot provide
consistency and high availability together. This was first expressed by Eric Brewer in the CAP
theorem.
The CAP theorem, or Eric Brewer's theorem, states that we can only achieve at most two out of
three guarantees for a database: Consistency, Availability and Partition Tolerance.
Here Consistency means that all nodes in the network see the same data at the same time.
Availability is a guarantee that every request receives a response about whether it was
successful or failed. However, it does not guarantee that a read request returns the most recent
write. The more users a system can cater to, the better its availability.
Partition Tolerance is a guarantee that the system continues to operate despite arbitrary
message loss or failure of part of the system. In other words, even if there is a network outage
in the data center and some of the computers are unreachable, still the system continues to
perform.
What Is Database Sharding? Sharding is a method for distributing a single dataset across
multiple databases, which can then be stored on multiple machines. This allows for larger
datasets to be split in smaller chunks and stored in multiple data nodes, increasing the total
storage capacity of the system.
What is difference between sharding and partitioning?
Sharding and partitioning are both about breaking up a large data set into smaller subsets. The
difference is that sharding implies the data is spread across multiple computers while
partitioning does not. Partitioning is about grouping subsets of data within a single database
instance.
What are the types of sharding?
Sharding Architectures:
● Key Based Sharding: also known as hash-based sharding (a minimal sketch of the idea follows this list).
● Horizontal or Range Based Sharding: in this method, we split the data based on the ranges of a given value inherent in each entity.
● Vertical Sharding.
● Directory-Based Sharding.
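A minimal JavaScript sketch of key-based (hash) sharding; the function and hash are illustrative only, and real systems typically use consistent hashing:
function shardFor(key, numShards) {
  // Hash the shard key so the same key always maps to the same shard.
  let hash = 0;
  for (const ch of String(key)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % numShards; // index of the data node that stores this key
}
// Example: shardFor("user_42", 4) always returns the same shard index.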
NoSQL
It provides a mechanism for storage and retrieval of data other than tabular relations model
used in relational databases. NoSQL database doesn't use tables for storing data. It is
generally used to store big data and real-time web applications.
Advantages of NoSQL
What is MongoDB?
MongoDB Advantages
o Easy to use
o Light Weight
o Extremely fast compared to RDBMS
There is no separate create database command in MongoDB; MongoDB does not provide any
explicit command to create a database.
If there is no existing database, the following command is used to create a new database (it is
actually created when you first store data in it).
Syntax:
use DATABASE_NAME
We are going to create a database "javatpointdb":
>use javatpointdb
>db
To check the database list, use the command show dbs:
>show dbs
In MongoDB, the db.collection.insert() method is used to add or insert new documents into a
collection in your database.
>db.movie.insert({"name":"javatpoint"})
The dropDatabase command is used to drop a database. It also deletes the associated data
files. It operates on the current database.
Syntax:
db.dropDatabase()
This syntax will delete the selected database. If you have not selected any database,
it will delete the default "test" database.
If you want to delete the database "javatpointdb", use the dropDatabase() command as
follows:
>db.dropDatabase()
MongoDB Create Collection
In MongoDB, db.createCollection(name, options) is used to create a collection. But usually you
don't need to create a collection explicitly, because MongoDB creates collections automatically
when you insert documents (this is explained later). First, see how to create a collection:
Syntax:
db.createCollection(name, options)
Name: is a string type, specifies the name of the collection to be created.
Options: is a document type, specifies the memory size and indexing of the collection. It is
an optional parameter.
To check the created collection, use the command "show collections".
>show collections
How does MongoDB create collection automatically
MongoDB creates collections automatically when you insert some documents. For example:
Insert a document named seomount into a collection named SSSIT. The operation will create
the collection if the collection does not currently exist.
>db.SSSIT.insert({"name" : "seomount"})
>show collections
SSSIT
MongoDB update documents
In MongoDB, update() method is used to update or modify the existing documents of a
collection.
Syntax:
db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
Example
Consider an example which has a collection name javatpoint. Insert the following documents
in collection:
db.javatpoint.insert(
{
course: "java",
details: {
duration: "6 months",
Trainer: "Sonoo jaiswal"
},
Batch: [ { size: "Small", qty: 15 }, { size: "Medium", qty: 25 } ],
category: "Programming language"
}
)
var Allcourses =
[
{
Course: "Java",
details: { Duration: "6 months", Trainer: "Sonoo Jaiswal" },
Batch: [ { size: "Medium", qty: 25 } ],
category: "Programming Language"
},
{
Course: ".Net",
details: { Duration: "6 months", Trainer: "Prashant Verma" },
Batch: [ { size: "Small", qty: 5 }, { size: "Medium", qty: 10 }, ],
category: "Programming Language"
},
{
Course: "Web Designing",
details: { Duration: "3 months", Trainer: "Rashmi Desai" },
Batch: [ { size: "Small", qty: 5 }, { size: "Large", qty: 10 } ],
category: "Programming Language"
}
];
Pass this Allcourses array to the db.collection.insert() method to perform a bulk insert.
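A minimal sketch of an update against this javatpoint collection (the field values shown are illustrative):
db.javatpoint.update(
   { course: "java" },
   { $set: { "details.duration": "8 months" } }
)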
1. Deletion criteria: the remove() method accepts a query document (the deletion criteria) that
specifies which documents to remove from the collection.
If you want to remove all documents from a collection, pass an empty query document {} to
the remove() method. The remove() method does not remove the indexes.
db.javatpoint.remove({})
Indexing in MongoDB :
MongoDB uses indexing in order to make query processing more efficient. If there is no
indexing, then MongoDB must scan every document in the collection and retrieve only
those documents that match the query. Indexes are special data structures that store some
information related to the documents such that it becomes easy for MongoDB to find the
right data file. The indexes are ordered by the value of the field specified in the index.
Creating an Index :
MongoDB provides a method called createIndex() that allows a user to create an index.
Syntax db.COLLECTION_NAME.createIndex({KEY:1})
Example
db.mycol.createIndex({"age":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
In order to drop an index, MongoDB provides the dropIndex() method.
Syntax
db.NAME_OF_COLLECTION.dropIndex({KEY:1})
The dropIndex() methods can only delete one index at a time. In order to delete (or drop)
multiple indexes from the collection, MongoDB provides the dropIndexes() method that
takes multiple indexes as its parameters.
Syntax –
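A sketch of the usual form (index names are placeholders; recent MongoDB versions also accept an array of index names):
db.NAME_OF_COLLECTION.dropIndexes(["<index_name_1>", "<index_name_2>"])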
Key features of MongoDB:
o Rich queries: you can search by field, range query, and it also supports regular expression searches.
o Indexing: any field in a MongoDB document can be indexed.
o Replication: MongoDB can run over multiple servers. The data is duplicated to keep the system up
and also keep it running in case of hardware failure.
o Load balancing: data is distributed across servers (sharding), which balances the read and write load.
o Stores files of any size easily without complicating your stack.
Nowadays many companies use MongoDB to create new types of applications and to improve
performance and availability.
The MongoDB Replication methods are used to replicate the member to the replica sets.
rs.add(host, arbiterOnly)
The add method adds a member to the specified replica set. We are required to connect to the
primary of the replica set to use this method. The connection to the shell will be terminated if
the method triggers an election for the primary, for example, if we try to add a new member
with a higher priority than the primary. In that case, the mongo shell may display an error even if
the operation succeeds.
Example:
In the following example we will add a new secondary member with default vote.
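A sketch (the hostname and port are placeholders):
rs.add("mongodb4.example.net:27017")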
Sharding is a method to distribute the data across different machines. Sharding can be used by
MongoDB to support deployment on very huge scale data sets and high throughput
operations.
MongoDB sh.addShard(<url>) command
This command adds a shard replica set to a sharded cluster. Adding a new shard affects the
balance of chunks in the cluster, and the cluster starts transferring chunks to the new shard to
rebalance itself.
<replica_set>/<hostname><:port>,<hostname><:port>, ...
Syntax:
sh.addShard("<replica_set>/<hostname><:port>")
Example:
sh.addShard("repl0/mongodb3.example.net:27327")
Output:
This adds a shard by specifying the name of the replica set and the hostname of at least one
member of the replica set.
Cassandra
What is Cassandra?
NoSQL database is Non-relational database. It is also called Not Only SQL. It is a database
that provides a mechanism to store and retrieve data other than the tabular relations used in
relational databases. These databases are schema-free, support easy replication, have simple
APIs, are eventually consistent, and can handle huge amounts of data.
The data model in Cassandra is totally different from what we normally see in an RDBMS. Let's see how
Cassandra stores its data.
Cluster
Cassandra database is distributed over several machines that are operated together. The
outermost container is known as the Cluster which contains different nodes. Every node
contains a replica, and in case of a failure, the replica takes charge. Cassandra arranges the
nodes in a cluster, in a ring format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. Following are the basic attributes
of Keyspace in Cassandra:
o Replication factor: It specifies the number of machines in the cluster that will receive
copies of the same data.
o Replica placement strategy: It is a strategy which specifies how to place replicas in
the ring. There are different types of strategies, such as:
What is Keyspace?
A keyspace is an object that is used to hold column families, user defined types. A keyspace
is like RDBMS database which contains column families, indexes, user defined types, data
center awareness, strategy used in keyspace, replication factor, etc.
Syntax:
o Simple Strategy: Simple strategy is used in the case of one data center. In this
strategy, the first replica is placed on the selected node and the remaining nodes are
placed in clockwise direction in the ring without considering rack or node location.
o Network Topology Strategy: This strategy is used in the case of more than one data
center. In this strategy, you have to provide a replication factor for each data center
separately.
Replication Factor: Replication factor is the number of replicas of data placed on different
nodes. A replication factor greater than two is good to attain no single point of failure, so 3 is
a good replication factor.
Example:
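A minimal CQL sketch (the keyspace name is illustrative):
CREATE KEYSPACE javatpoint
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};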
Using a Keyspace
To use the created keyspace, you have to use the USE command.
Syntax:
USE <identifier>
Cassandra Alter Keyspace
The "ALTER keyspace" command is used to alter the replication factor, strategy name and
durable writes properties in created keyspace in Cassandra.
Syntax:
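A minimal CQL sketch of the usual form (values are placeholders):
ALTER KEYSPACE <identifier>
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
AND durable_writes = false;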
In Cassandra, "DROP Keyspace" command is used to drop keyspaces with all the data,
column families, user defined types and indexes from Cassandra.
Syntax:
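DROP KEYSPACE <identifier>;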
In Cassandra, CREATE TABLE command is used to create a table. Here, column family is
used to store data just like table in RDBMS.
So, you can say that CREATE TABLE command is used to create a column family in
Cassandra.
Syntax:
CREATE TABLE tablename(
column1 datatype PRIMARY KEY,
column2 datatype,
column3 datatype
);
A compound primary key can instead be declared on a separate line inside the definition:
PRIMARY KEY (ColumnName1, ColumnName2, . . .)
Example:
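A sketch consistent with the student table used later in this section (column names are illustrative):
CREATE TABLE student(
student_id int PRIMARY KEY,
student_name text,
student_email text,
student_fees int
);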
ALTER TABLE command is used to alter the table after creating it. You can use the ALTER
command to perform two types of operations:
o Add a column
o Drop a column
Syntax:
Adding a Column
You can add a column in the table by using the ALTER command. While adding a column, you
have to make sure that the column name does not conflict with the existing column names and
that the table is not defined with the compact storage option.
Syntax:
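A sketch using the student table (the column name is illustrative):
ALTER TABLE student
ADD student_email text;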
A new column is added. You can check it by using the SELECT command.
Dropping a Column
You can also drop an existing column from a table by using ALTER command. You should
check that the table is not defined with compact storage option before dropping a column
from a table.
Syntax:
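A sketch using the student table:
ALTER TABLE student
DROP student_email;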
Now you can see that a column named "student_email" is dropped now.
If you want to drop multiple columns, separate the column names by ",".
Cassandra DROP Table
Syntax:
DROP TABLE <tablename>
Example:
After using the following command:
DROP TABLE student;
The table named "student" is dropped now. You can use DESCRIBE command to verify if
the table is deleted or not. Here the student table has been deleted; you will not find it in the
column families list.
Cassandra Truncate Table
TRUNCATE command is used to truncate a table. If you truncate a table, all the rows of the
table are deleted permanently.
Syntax:
TRUNCATE <tablename>
Example:
Cassandra Batch
Syntax:
BEGIN BATCH
<insert-stmt>/ <update-stmt>/ <delete-stmt>
APPLY BATCH
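Example (a sketch using the student table; values are illustrative):
BEGIN BATCH
INSERT INTO student (student_id, student_name, student_fees) VALUES (1, 'Ajay', 3000);
UPDATE student SET student_fees = 4000 WHERE student_id = 1;
APPLY BATCH;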
WHERE clause is used with SELECT command to specify the exact location from where we
have to fetch data.
Syntax:
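A sketch of the usual form:
SELECT * FROM <tablename> WHERE <condition>;
For example, using the student table: SELECT * FROM student WHERE student_id = 3;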
UPDATE command is used to update data in a Cassandra table. If you see no result after
updating the data, it means the data is successfully updated; otherwise an error will be returned.
While updating data in Cassandra table, the following keywords are commonly used:
o Where: The WHERE clause is used to select the row that you want to update.
o Set: The SET clause is used to set the value.
o Must: It is used to include all the columns composing the primary key.
Syntax:
UPDATE <tablename>
SET <column name> = <new value>,
<column name> = <new value>, ...
WHERE <condition>
Cassandra DELETE Data
DELETE command is used to delete data from Cassandra table. You can delete the complete
table or a selected row by using this command.
Syntax:
DELETE FROM <identifier> WHERE <condition>;
Delete an entire row
To delete the entire row of the student_id "3", use the following command:
DELETE FROM student WHERE student_id=3;
Delete a specific column name
Example:
Delete the student_fees where student_id is 4.
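A sketch of the corresponding command:
DELETE student_fees FROM student WHERE student_id = 4;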
The HAVING clause places the condition in the groups defined by the GROUP BY clause in
the SELECT statement.
This SQL clause is implemented after the 'GROUP BY' clause in the 'SELECT' statement.
This clause is used in SQL because we cannot use the WHERE clause with the SQL
aggregate functions. Both WHERE and HAVING clauses are used for filtering the records in
SQL queries.
SELECT SUM(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept
HAVING SUM(Emp_Salary)>12000;
MIN Function with HAVING Clause:
If you want to show each department and the minimum salary in each department, you have
to write the following query:
SELECT MIN(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
MAX Function with HAVING Clause:
SELECT MAX(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;
AVERAGE CLAUSE:
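Following the same pattern as above, a sketch with the AVG function:
SELECT AVG(Emp_Salary), Emp_Dept FROM Employee GROUP BY Emp_Dept;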
Cassandra vs. MongoDB
3) Cassandra stores data in tabular form, like SQL format, whereas MongoDB stores data in JSON format.
4) Cassandra is licensed by Apache, whereas MongoDB is licensed under AGPL and its drivers under Apache.
Hive is a data warehouse system which is used to analyze structured data. It is built on the top
of Hadoop. It was developed by Facebook.
Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL like queries called HQL (Hive query language) which gets
internally converted to MapReduce jobs.
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).
Features of Hive
Hive data types are categorized in numeric types, string types, misc types, and complex types.
A list of Hive data types is given below.
Integer Types
Date/Time Types
TIMESTAMP
DATES
The Date value is used to specify a particular year, month and day, in the form YYYY-MM-DD.
However, it does not provide the time of the day. The range of the Date type lies between
0000-01-01 and 9999-12-31.
String Types
STRING
The string is a sequence of characters. Its values can be enclosed within single quotes (') or
double quotes (").
Varchar
The varchar is a variable-length type whose range lies between 1 and 65535, which specifies
the maximum number of characters allowed in the character string.
CHAR
The char is a fixed-length character type; values shorter than the specified length are padded with spaces.
In Hive, the database is considered as a catalog or namespace of tables. So, we can maintain
multiple tables within a database where a unique name is assigned to each table. Hive also
provides a default database with a name default.
o Initially, we check the default database provided by Hive. So, to check the list of
existing databases, follow the below command: -
o hive> show databases;
o To create a new database, follow the below command: -
o hive> create database demo;
In this section, we will see various ways to drop the existing database.
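For example, the basic form is:
hive> drop database demo;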
In Hive, we can create a table by using the conventions similar to the SQL. It supports a wide
range of flexibility where the data files for tables are stored. It provides two types of table: -
o Internal table
o External table
Internal Table
The internal tables are also called managed tables as the lifecycle of their data is controlled by
the Hive. By default, these tables are stored in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). The internal tables are not flexible
enough to share with other tools like Pig. If we try to drop the internal table, Hive deletes
both table schema and data.
hive> create table demo.employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Let's see the metadata of the created table by using the following command:-
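hive> describe demo.employee;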
External Table
The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas the location keyword is
used to determine the location of loaded data.
As the table is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.
hive> create external table emplist (Id int, Name string , Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';
Once the internal table has been created, the next step is to load the data into it. So, in Hive,
we can easily load data from any file to the database.
o Let's load the data of the file into the database by using the following command: -
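o hive> load data local inpath '<path-to-file>' into table demo.employee;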
Hive facilitates us to drop a table by using the SQL drop table command. Let's follow the
below steps to drop the table from the database.
o Let's check the list of existing databases by using the following command: -
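o hive> show databases;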
In Hive, we can perform modifications in the existing table like changing the table name,
column name, comments, and table properties. It provides SQL like commands to alter the
table.
Rename a Table
If we want to change the name of an existing table, we can rename that table by using the
following signature: -
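alter table <old_table_name> rename to <new_table_name>;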
In Hive, we can add one or more columns in an existing table by using the following
signature:
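alter table <table_name> add columns(<column_name> <datatype>);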
Change Column
In Hive, we can rename a column, change its type and position. Here, we are changing the
name of the column by using the following signature: -
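alter table <table_name> change <old_column_name> <new_column_name> <datatype>;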
Hive allows us to delete one or more columns by replacing them with the new columns. Thus,
we cannot drop the column directly.
alter table employee_data replace columns( id string, first_name string, age int);
Partitioning in Hive
The partitioning in Hive means dividing the table into some parts based on the values of a
particular column like date, course, city or country. The advantage of partitioning is that since
the data is stored in slices, the query response time becomes faster.
Static Partitioning
o Create the table and provide the partitioned columns by using the following
command: -
hive> create table student (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Load the data of another file into the same table and pass the values of partition
columns with it by using the following command: -
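hive> load data local inpath '<path-to-file>' into table student partition(course = "java");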
In dynamic partitioning, the values of partitioned columns exist within the table. So, it is not
required to pass the values of partitioned columns manually.
hive> create table student_part (id int, name string, age int, institute string)
partitioned by (course string)
row format delimited
fields terminated by ',';
o Now, insert the data of dummy table into the partition table.
hive> insert into student_part
partition(course)
select id, name, age, institute, course
from stud_demo;
What is Graph?
A graph is a pictorial representation of objects which are connected by links. A
graph contains two elements: nodes (vertices) and relationships (edges).
A graph database is a database which is used to model the data in the form of a graph. It stores
any kind of data using:
o Nodes
o Relationships
o Properties
Nodes: Nodes are the records/data in graph databases. Data is stored as properties and
properties are simple name/value pairs.
Relationships: It is used to connect nodes. It specifies how the nodes are related.
3. In a graph database, there are properties and their values. In an RDBMS, there are columns and data.
4. In a graph database, the connected nodes are defined by relationships. In an RDBMS, constraints are used instead.
MongoDB vs OrientDB
MongoDB and OrientDB contain many common features but the engines are fundamentally
different. MongoDB is pure Document database and OrientDB is a hybrid Document with
graph engine.
Indexes: MongoDB uses the B-Tree algorithm for all indexes, whereas OrientDB supports three
different indexing algorithms so that the user can achieve the best performance.
The following table illustrates the comparison between relational model, document model,
and OrientDB document model −
The SQL Reference of the OrientDB database provides several commands to create, alter, and
drop databases.
Create database
The following statement is a basic syntax of Create Database command.
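A sketch of the form implied by the options described below:
CREATE DATABASE <database-url> [<user> <password> <storage-type>]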
Following are the details about the options in the above syntax.
<database-url> − Defines the URL of the database. URL contains two parts, one is <mode>
and the second one is <path>.
<mode> − Defines the mode, i.e. local mode or remote mode.
<path> − Defines the path to the database.
<user> − Defines the user you want to connect to the database.
<password> − Defines the password for connecting to the database.
<storage-type> − Defines the storage types. You can choose between PLOCAL and
MEMORY.
Example
You can use the following command to create a local database named demo.
If the database is successfully created, you will get the following output.
Database created successfully.
If the command is executed successfully, you will get the following output.
Database updated successfully
We have already created a database named ‘demo’ in the previous chapters. In this example,
we will connect to that using the user admin.
You can use the following command to connect to demo database.
The following statement is the basic syntax of the Drop database command.
DROP DATABASE [<database-name> <server-username> <server-user-password>]
Following are the details about the options in the above syntax.
<database-name> − Database name you want to drop.
<server-username> − Username of the database who has the privilege to drop a database.
<server-user-password> − Password of the particular user.
In this example, we will use the same database named ‘demo’ that we created in an earlier
chapter. You can use the following command to drop a database demo.
If this command is successfully executed, you will get the following output.
Database 'demo' deleted successfully
INSERT RECORD
The following statement is the basic syntax of the Insert Record command.
INSERT INTO [class:]<class>|cluster:<cluster>|index:<index>
[(<field>[,]*) VALUES (<expression>[,]*)[,]*]|
[SET <field> = <expression>|<sub-command>[,]*]|
[CONTENT {<JSON>}]
[RETURN <expression>]
[FROM <query>]
Following are the details about the options in the above syntax.
SET − Defines each field along with the value.
CONTENT − Defines JSON data to set field values. This is optional.
RETURN − Defines the expression to return instead of number of records inserted. The most
common use cases are −
@rid − Returns the Record ID of the new record.
@this − Returns the entire new record.
The following command inserts the next two records into the Customer table.
INSERT INTO Customer (id, name, age) VALUES (04,'javeed', 21), (05,'raja', 29)
SELECT COMMAND
The following statement is the basic syntax of the SELECT command.
SELECT [ <Projections> ] [ FROM <Target> [ LET <Assignment>* ] ]
[ WHERE <Condition>* ]
[ GROUP BY <Field>* ]
[ ORDER BY <Fields>* [ ASC|DESC ] * ]
[ UNWIND <Field>* ]
[ SKIP <SkipRecords> ]
[ LIMIT <MaxRecords> ]
[ FETCHPLAN <FetchPlan> ]
[ TIMEOUT <Timeout> [ <STRATEGY> ] ]
[ LOCK default|record ]
[ PARALLEL ]
[ NOCACHE ]
Following are the details about the options in the above syntax.
<Projections> − Indicates the data you want to extract from the query as a result records set.
FROM − Indicates the object to query. This can be a class, cluster, single Record ID, set of
Record IDs. You can specify all these objects as target.
WHERE − Specifies the condition to filter the result-set.
LET − Indicates the context variable which are used in projections, conditions or sub queries.
GROUP BY − Indicates the field to group the records.
ORDER BY − Indicates the field used to arrange the records in order.
UNWIND − Designates the field on which to unwind the collection of records.
SKIP − Defines the number of records you want to skip from the start of the result-set.
LIMIT − Indicates the maximum number of records in the result-set.
FETCHPLAN − Specifies the strategy defining how you want to fetch results.
TIMEOUT − Defines the maximum time in milliseconds for the query.
LOCK − Defines the locking strategy. DEFAULT and RECORD are the available lock
strategies.
PARALLEL − Executes the query against ‘x’ concurrent threads.
NOCACHE − Defines whether you want to use cache or not.
Example
Method 1 − You can use the following query to select all records from the Customer table.
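orientdb {db = demo}> SELECT FROM Customer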
UPDATE QUERY
Update Record command is used to modify the value of a particular record. SET is the basic
command to update a particular field value.
The following statement is the basic syntax of the Update command.
UPDATE <class>|cluster:<cluster>|<recordID>
[SET|INCREMENT|ADD|REMOVE|PUT <field-name> = <field-value>[,]*] |[CONTENT|
MERGE <JSON>]
[UPSERT]
[RETURN <returning> [<returning-expression>]]
[WHERE <conditions>]
[LOCK default|record]
[LIMIT <max-records>] [TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
SET − Defines the field to update.
INCREMENT − Increments the specified field value by the given value.
ADD − Adds the new item in the collection fields.
REMOVE − Removes an item from the collection field.
PUT − Puts an entry into map field.
CONTENT − Replaces the record content with JSON document content.
MERGE − Merges the record content with a JSON document.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify Default and Record.
UPSERT − Updates a record if it exists or inserts a new record if it doesn’t. It helps in
executing a single query in the place of executing two queries.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Try the following query to update the age of a customer ‘Raja’.
Orientdb {db = demo}> UPDATE Customer SET age = 28 WHERE name = 'Raja'
Truncate
Truncate Record command is used to delete the values of a particular record.
The following statement is the basic syntax of the Truncate command.
TRUNCATE RECORD <rid>*
Where <rid>* indicates the Record ID to truncate. You can use multiple Rids separated by
comma to truncate multiple records. It returns the number of records truncated.
Try the following query to truncate the record having Record ID #11:4.
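orientdb {db = demo}> TRUNCATE RECORD #11:4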
DELETE
Delete Record command is used to delete one or more records completely from the database.
The following statement is the basic syntax of the Delete command.
DELETE FROM <Class>|cluster:<cluster>|index:<index>
[LOCK <default|record>]
[RETURN <returning>]
[WHERE <Condition>*]
[LIMIT <MaxRecords>]
[TIMEOUT <timeout>]
Following are the details about the options in the above syntax.
LOCK − Specifies how to lock the records between load and update. We have two options to
specify Default and Record.
RETURN − Specifies an expression to return instead of the number of records.
LIMIT − Defines the maximum number of records to update.
TIMEOUT − Defines the time you want to allow the update run before it times out.
Note − Don't use DELETE to remove vertices or edges because it affects the integrity of the
graph.
Try the following query to delete the record having id = 4.
orientdb {db = demo}> DELETE FROM Customer WHERE id = 4
OrientDB Features
OrientDB combines the flexibility of document databases with the power of graph databases,
providing more functionality and flexibility, while being powerful enough to replace your
operational DBMS.
SPEED
OrientDB was engineered from the ground up with performance as a key specification. It is
fast on both read and write operations, storing up to 120,000 records per second.
No more Joins: relationships are physical links to the records.
Better RAM use.
Traverses parts of or entire trees and graphs of records in milliseconds.
Traversing speed is not affected by the database size.
ENTERPRISE
Incremental backups
Unmatched security
24x7 Support
Query Profiler
Distributed Clustering configuration
Metrics Recording
Live Monitor with configurable alerts
With a master-slave architecture, the master often becomes the bottleneck. With OrientDB,
throughput is not limited by a single server. Global throughput is the sum of the throughput
of all the servers.
Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model –
XML Documents – Document Type Definition – XML Schema – XML Documents and Databases
– XML Querying – XPath – XQuery
Big Data includes huge volume, high velocity, and an extensible variety of data. There are 3
types: structured data, semi-structured data, and unstructured data.
Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository that is typically a database. It concerns all data which
can be stored in database SQL in a table with rows and columns. They have relational keys
and can easily be mapped into pre-designed fields. Today, those data are most processed in
the development and simplest way to manage information. Example: Relational data.
Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has
some organizational properties that make it easier to analyze. With some processing, you can
store it in a relational database (though this can be very hard for some kinds of semi-structured
data), but the semi-structured form exists to ease that effort. Example: XML data.
Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a
predefined data model, thus it is not a good fit for a mainstream relational database. So for
Unstructured data, there are alternative platforms for storing and managing, it is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence and
analytics applications. Example: Word, PDF, Text, Media logs.
Differences between Structured, Semi-structured and Unstructured data:
Query performance: structured data allows complex joins; semi-structured data allows queries
over anonymous nodes; unstructured data allows only textual queries.
An XML document has a self-descriptive structure. It forms a tree structure which is referred to as an
XML tree. The tree structure makes it easy to describe an XML document.
A tree structure contains root element (as parent), child element and so on. It is very easy to
traverse all succeeding branches and sub-branches and leaf nodes starting from the root.
<?xml version="1.0"?>
<college>
<student>
<firstname>Tamanna</firstname>
<lastname>Bhatia</lastname>
<contact>09990449935</contact>
<email>[email protected]</email>
<address>
<city>Ghaziabad</city>
<state>Uttar Pradesh</state>
<pin>201007</pin>
</address>
</student>
</college>
In the above example, the first line is the XML declaration. It defines the XML version 1.0. The next line
shows the root element (college) of the document. Inside that there is one more element (student).
The student element contains five branches named <firstname>, <lastname>, <contact>, <email> and
<address>. The <address> branch contains 3 sub-branches named <city>, <state> and <pin>.
XML Tree Rules
These rules are used to figure out the relationship of the elements. They show whether an element is a
child or a parent of another element.
Ancestors: An element which contains other elements is called an "ancestor" of those elements. In the
above example, the root element (college) is the ancestor of all other elements.
What is XML?
XML stands for eXtensible Markup Language. It is a markup language designed to store and transport
data. XML tags are not predefined; you must define your own tags.
XML is platform independent and language independent.
XML Example
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
The root element in the example is <bookstore>. All elements in the document are contained
within <bookstore>. The <book> element has 4 children: <title>, <author>, <year> and <price>.
XML Attributes
XML elements can have attributes. By the use of attributes we can add the information about the
element.
<book category="computer">
<author> A &amp; B </author>
</book>
XML Comments
XML comments are just like HTML comments. Comments are used to make the code more
understandable to other developers.
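For example:
<!-- This is a comment -->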
A well-formed XML document is an XML document with correct syntax. It is necessary to know
about well-formed XML documents before learning about XML validation.
Valid XML document
A valid XML document is a well-formed XML document which also conforms to the rules of a DTD or
an XML schema.
DTD stands for Document Type Definition. It defines the legal building blocks of an XML
document. It is used to define document structure with a list of legal elements and attributes.
Purpose of DTD:
Its main purpose is to define the structure of an XML document. It contains a list of legal
elements and defines the structure with the help of them.
Example:
<?xml version="1.0"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
Description of DTD:
<!DOCTYPE employee : It defines that the root element of the document is employee.
<!ELEMENT employee: It defines that the employee element contains 3 elements "firstname,
lastname and email".
<!ELEMENT firstname: It defines that the firstname element is #PCDATA typed. (parse-able data
type).
<!ELEMENT lastname: It defines that the lastname element is #PCDATA typed. (parse-able data
type).
<!ELEMENT email: It defines that the email element is #PCDATA typed. (parse-able data type).
XML DTD
A DTD defines the legal elements of an XML document
In simple words we can say that a DTD defines the document structure with a list of legal elements
and attributes.
XML schema is a XML based alternative to DTD.
Actually DTD and XML schema both are used to form a well formed XML document.
We should avoid errors in XML documents because they will stop the XML programs.
XML schema
It is defined as an XML language
Uses namespaces to allow for reuses of existing definitions
It supports a large number of built in data types and definition of derived data types
Valid and well-formed XML document with External DTD
Let's take an example of well-formed and valid XML document. It follows all the rules of DTD.
employee.xml
<?xml version="1.0"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
In the above example, the DOCTYPE declaration refers to an external DTD file. The content of the
file is shown in below paragraph.
employee.dtd
<!ELEMENT employee (firstname,lastname,email)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT email (#PCDATA)>
Description of DTD
<!DOCTYPE employee : It defines that the root element of the document is employee.
<!ELEMENT employee: It defines that the employee element contains 3 elements "firstname,
lastname and email".
<!ELEMENT firstname: It defines that the firstname element is #PCDATA typed. (parse-able
data type).
<!ELEMENT lastname: It defines that the lastname element is #PCDATA typed. (parse-able
data type).
<!ELEMENT email: It defines that the email element is #PCDATA typed. (parse-able data type).
XML CSS
Purpose of CSS in XML
CSS (Cascading Style Sheets) can be used to add style and display information to an XML
document. It can format the whole XML document.
To link XML files with CSS, you should use the following syntax:
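<?xml-stylesheet type="text/css" href="cssemployee.css"?>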
cssemployee.css
employee
{
background-color: pink;
}
firstname,lastname,email
{
font-size:25px;
display:block;
color: blue;
margin-left: 50px;
}
employee.dtd
employee.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="cssemployee.css"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
CDATA vs PCDATA
CDATA
CDATA: (Unparsed Character data): CDATA contains the text which is not parsed further in an
XML document. Tags inside the CDATA text are not treated as markup and entities will not be
expanded.
<?xml version="1.0"?>
<employee>
<![CDATA[
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
]]>
</employee>
In the above CDATA example, CDATA is used just after the element employee to make the
data/text unparsed, so it will give the value of employee:
<firstname>vimal</firstname><lastname>jaiswal</lastname><email>[email protected]</e
mail>
PCDATA
PCDATA: (Parsed Character Data): XML parsers are used to parse all the text in an XML
document. PCDATA stands for Parsed Character data. PCDATA is the text that will be parsed by
a parser. Tags inside the PCDATA will be treated as markup and entities will be expanded.
In other words, you can say that parsed character data means the XML parser examines the data
and ensures that it does not contain entities; if it does, the entities will be replaced.
<?xml version="1.0"?>
<employee>
<firstname>vimal</firstname>
<lastname>jaiswal</lastname>
<email>[email protected]</email>
</employee>
In the above example, the employee element contains 3 more elements 'firstname', 'lastname', and
'email', so it parses further to get the data/text of firstname, lastname and email to give the value of
employee as:
vimaljaiswal [email protected]
XML Schema:
XML schema is a language which is used for expressing constraint about XML documents.
There are many schema languages which are used nowadays, for example RELAX NG and XSD
(XML schema definition).
An XML schema is used to define the structure of an XML document. It is like DTD but
provides more control on XML structure.
Example:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.javatpoint.com"
xmlns="http://www.javatpoint.com"
elementFormDefault="qualified">
<xs:element name="employee">
<xs:complexType>
<xs:sequence>
<xs:element name="firstname" type="xs:string"/>
<xs:element name="lastname" type="xs:string"/>
<xs:element name="email" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
XML schema provides two categories of types: 1. SimpleType 2. ComplexType
SimpleType
The simpleType allows you to have text-only elements. It cannot contain attributes or child elements.
ComplexType
The complexType allows you to hold multiple attributes and elements. It can contain additional
sub-elements and can be left empty.
XML Database:
XML database is a data persistence software system used for storing the huge amount of
information in XML format. It provides a secure place to store XML documents.
You can query your stored data by using XQuery, export and serialize into desired format. XML
databases are usually associated with document-oriented databases.
DTD vs XSD
There are many differences between DTD (Document Type Definition) and XSD (XML Schema
Definition). In short, DTD provides less control on XML structure whereas XSD (XML schema)
provides more control.
1) DTD stands for Document Type Definition, whereas XSD stands for XML Schema Definition.
2) DTDs are derived from SGML syntax, whereas XSDs are written in XML.
3) DTD doesn't support datatypes, whereas XSD supports datatypes for elements and attributes.
5) DTD doesn't define order for child elements, whereas XSD defines order for child elements.
7) DTD is not simple to learn, whereas XSD is simple to learn because you don't need to learn a new language.
8) DTD provides less control on XML structure, whereas XSD provides more control on XML structure.
XML Database:
XML database is a data persistence software system used for storing the huge amount of
information in XML format.
There are two major types of XML databases:
1. XML-enabled database
2. Native XML database
XML-enabled Database:
XML-enabled database works just like a relational database. It is like an extension provided
for the conversion of XML documents. In this database, data is stored in table, in the form of rows
and columns.
Native XML Database:
Native XML database is used to store large amount of data. Instead of table format, Native
XML database is based on container format. You can query data by XPath expressions.
Native XML database is preferred over XML-enable database because it is highly capable to store,
maintain and query XML documents.
Example:
<?xml version="1.0"?>
<contact-info>
<contact1>
<name>Vimal Jaiswal</name>
<company>SSSIT.org</company>
<phone>(0120) 4256464</phone>
</contact1>
<contact2>
<company>SSSIT.org</company>
<phone>09990449935</phone>
</contact2>
</contact-info>
XPath:
XPath is an important and core component of XSLT standard. It is used to traverse the elements
and attributes in an XML document.
XPath defines structure: XPath is used to define the parts of an XML document i.e. element,
attributes, text, namespace, processing-instruction, comment, and document nodes.
XPath provides path expressions: XPath provides powerful path expressions to select nodes or lists of
nodes in XML documents.
XPath is a core component of XSLT: XPath is a major element in XSLT standard and must be
followed to work with XSLT documents.
XPath provides standard functions: XPath provides a rich library of standard functions to manipulate
string values, numeric values, date and time comparisons, node and QName manipulation,
sequence manipulation, Boolean values, etc.
XPath Expression
XPath defines a pattern or path expression to select nodes or node sets in an XML document.
These patterns are used by XSLT to perform transformations. The path expressions look very
similar to the expressions we use in a traditional file system.
XPath specifies seven types of nodes that can be output of the execution of the XPath expression.
o Root
o Element
o Text
o Attribute
o Comment
o Processing Instruction
o Namespace
We know that XPath uses a path expression to select node or a list of nodes from an XML
document.
A list of useful paths and expression to select any node/ list of nodes from an XML document:
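nodename : selects all nodes with the given name
/ : selects from the root node
// : selects nodes in the document that match the selection, no matter where they are
. : selects the current node
.. : selects the parent of the current node
@ : selects attributes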
Let's take an example to see the usage of XPath expression. Here, we use an xml file
"employee.xml" and a stylesheet for that xml file named "employee.xsl". The XSL file uses the
XPath expressions under the select attribute of various XSL tags to fetch values of id, firstname,
lastname, nickname and salary of each employee node.
Employee.xml
Employee.xsl
XQuery:
XQuery is a functional query language used to retrieve information stored in XML format. It is
for XML what SQL is for databases. It was designed to query XML data.
XQuery Features:
There are many features of XQuery query language. A list of top features are given below:
XQuery is a functional language. It is used to retrieve and query XML based data.
XQuery is expression-oriented programming language with a simple type system.
XQuery is analogous to SQL. For example: As SQL is query language for databases, same as
XQuery is query language for XML.
XQuery is XPath based and uses XPath expressions to navigate through XML documents.
Advantages of XQuery:
XQuery can be used to retrieve both hierarchical and tabular data.
courses.xml
<?xml version="1.0" encoding="UTF-8"?>
<courses>
<course category="JAVA">
<title lang="en">Learn Java in 3 Months.</title>
<trainer>Sonoo Jaiswal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="Dot Net">
<title lang="en">Learn Dot Net in 3 Months.</title>
<trainer>Vicky Kaushal</trainer>
<year>2008</year>
<fees>10000.00</fees>
</course>
<course category="C">
<title lang="en">Learn C in 2 Months.</title>
<trainer>Ramesh Kumar</trainer>
<year>2014</year>
<fees>3000.00</fees>
</course>
<course category="XML">
<title lang="en">Learn XML in 2 Months.</title>
<trainer>Ajeet Kumar</trainer>
<year>2015</year>
<fees>4000.00</fees>
</course>
</courses>
courses.xqy
for $x in doc("courses.xml")/courses/course
where $x/fees>5000
return $x/title
This example will display the title elements of the courses whose fees are greater than 5000.
Create a Java-based XQuery executor program that reads courses.xqy, passes it to the XQuery
expression processor, and executes the expression. After that, the result is displayed.
XQueryTester.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQException;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;
import com.saxonica.xqj.SaxonXQDataSource;
public class XQueryTester {
public static void main(String[] args){
try {
execute();
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (XQException e) {
e.printStackTrace();
}
}
private static void execute() throws FileNotFoundException, XQException{
InputStream inputStream = new FileInputStream(new File("courses.xqy"));
XQDataSource ds = new SaxonXQDataSource();
XQConnection conn = ds.getConnection();
XQPreparedExpression exp = conn.prepareExpression(inputStream);
XQResultSequence result = exp.executeQuery();
while (result.next()) {
System.out.println(result.getItemAsString(null));
}
}}
Execute XQuery against XML
Put the above three files at the same location. We put them on the desktop in a folder named XQuery2.
Compile XQueryTester.java using the console. You must have JDK 1.5 or later installed on your
computer and the classpath configured.
Compile:
javac XQueryTester.java
Execute:
java XQueryTester
XQuery FLWOR
FLWOR is an acronym which stands for "For, Let, Where, Order by, Return".
• For - It is used to select a sequence of nodes.
• Let - It is used to bind a sequence to a variable.
• Where - It is used to filter the nodes.
• Order by - It is used to sort the nodes.
• Return - It specifies what to return.
Example
Following is a sample XML document that contains information on a collection of books. We will
use a FLWOR expression to retrieve the titles of those books with a price greater than 30.
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book category="JAVA">
<title lang="en">Learn Java in 24 Hours</title>
<author>Robert</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="DOTNET">
<title lang="en">Learn .Net in 24 hours</title>
<author>Peter</author>
<year>2011</year>
<price>70.50</price>
</book>
<book category="XML">
<title lang="en">Learn XQuery in 24 hours</title>
<author>Robert</author>
<author>Peter</author>
<year>2013</year>
<price>50.00</price>
</book>
<book category="XML">
<title lang="en">Learn XPath in 24 hours</title>
<author>Jay Ban</author>
<year>2010</year>
<price>16.50</price>
</book>
</books>
The following Xquery document contains the query expression to be executed on the above XML
document.
books.xqy
let $books := (doc("books.xml")/books/book)
return <results>
{
for $x in $books
where $x/price>30
order by $x/price
return $x/title
}</results>
Result
<title lang="en">Learn XQuery in 24 hours</title>
<title lang="en">Learn .Net in 24 hours</title>
2. Let's take an XML document having the information on a collection of courses. We will use a
FLWOR expression to retrieve the titles of those courses whose fees are greater than 2000.
courses.xml
<?xml version="1.0" encoding="UTF-8"?>
<courses>
<course category="JAVA">
<title lang="en">Learn Java in 3 Months.</title>
<trainer>Ramesh Kumar</trainer>
<year>2014</year>
<fees>3000.00</fees>
</course>
<course category="XML">
<title lang="en">Learn XML in 2 Months.</title>
<trainer>Ajeet Kumar</trainer>
<year>2015</year>
<fees>4000.00</fees>
</course>
</courses>
Let's take the XQuery document named "courses.xqy" that contains the query expression to be
executed on the above XML document.
courses.xqy
let $courses := (doc("courses.xml")/courses/course)
return <results>
{
for $x in $courses
where $x/fees>2000
order by $x/fees
return $x/title
}
</results>
Create a Java based XQuery executor program that reads courses.xqy, passes it to the XQuery
expression processor, and executes the expression. The result is then displayed.
XQueryTester.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQException;
import javax.xml.xquery.XQPreparedExpression;
import javax.xml.xquery.XQResultSequence;
import com.saxonica.xqj.SaxonXQDataSource;
public class XQueryTester {
public static void main(String[] args) {
try {
execute();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (XQException e) {
e.printStackTrace();
}
}
private static void execute() throws FileNotFoundException, XQException {
InputStream inputStream = new FileInputStream(new File("courses.xqy"));
XQDataSource ds = new SaxonXQDataSource();
XQConnection conn = ds.getConnection();
XQPreparedExpression exp = conn.prepareExpression(inputStream);
XQResultSequence result = exp.executeQuery();
while (result.next()) {
System.out.println(result.getItemAsString(null));
}
}
}
Here, we use three different types of XQuery statement over courses.xml that display the same
result: the courses whose fees are greater than 2000.
Put the above three files in the same location; here we put them on the desktop in a folder named XQuery3.
Compile XQueryTester.java using the console. You must have JDK 1.5 or later installed on your
computer and the classpath configured.
XQuery vs XPath:
1) XQuery is a functional programming and query language that is used to query collections of XML
data. XPath is an XML path language that is used to select nodes from an XML document using path
expressions.
2) XQuery is used to extract and manipulate data from XML documents, relational databases, and
MS Office documents that support an XML data source. XPath is used to compute values such as
strings, numbers and boolean types from XML documents.
3) XQuery is represented in the form of a tree model with seven kinds of nodes, namely processing
instructions, elements, document nodes, attributes, namespaces, text nodes, and comments. XPath
represents an XML document as a tree structure and navigates it by selecting different nodes.
4) XPath was created to define a common syntax and behavior model for XPointer and XSLT.
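To make the comparison concrete, the following minimal sketch shows how the selection used in the FLWOR example above (titles of books priced above 30) could be evaluated with a plain XPath expression from Java, using the standard javax.xml.xpath API; the class name XPathTester is an assumption made for this illustration, and books.xml is assumed to be in the working directory as in the XQuery examples.
XPathTester.java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class XPathTester {
public static void main(String[] args) throws Exception {
// Parse books.xml into a DOM document.
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("books.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
// XPath only selects nodes or computes values; it has no FLWOR-style ordering or element construction.
NodeList titles = (NodeList) xpath.evaluate("/books/book[price>30]/title", doc, XPathConstants.NODESET);
for (int i = 0; i < titles.getLength(); i++) {
System.out.println(titles.item(i).getTextContent());
}
}
}
Unlike the FLWOR version, the XPath result is returned in document order, since XPath has no order by clause.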
UNIT V INFORMATION RETRIEVAL AND WEB SEARCH
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user requires, i.e. has
asked for in the form of a query. The documents and the queries are represented in a similar manner,
so that document selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many Information Retrieval
systems represent document contents by a set of descriptors, called terms, belonging to a vocabulary
V. An IR model determines the query-document matching function according to four main
approaches; one of these is the estimation of the probability of the user's relevance rel for each
document d and query q with respect to a set Rq of training documents: Prob(rel | d, q, Rq).
Types of IR Models
Components of Information Retrieval/ IR Model
● Acquisition: In this step, the selection of documents and other objects from
various web resources that consist of text-based documents takes place. The
required data is collected by web crawlers and stored in the database.
● Representation: It consists of indexing, which may use free-text terms, a controlled
vocabulary, and manual as well as automatic techniques. Example: abstracting involves
summarizing, and bibliographic description covers the author, title, sources, data, and metadata.
● File Organization: There are two basic file organization methods, which can also be
combined. Sequential: the file stores the documents, document by document.
Inverted: the file is organized term by term, with a list of records under each term.
● Query: An IR process starts when a user enters a query into the system. Queries
are formal statements of information needs, for example, search strings in web
search engines. In information retrieval, a query does not uniquely identify a
single object in the collection. Instead, several objects may match the query,
perhaps with different degrees of relevancy.
Difference Between Information Retrieval and Data Retrieval
Information Retrieval: The software program deals with the organization, storage, retrieval,
and evaluation of information from document repositories, particularly textual information.
It does not provide a solution to the user of the database system.
Data Retrieval: Data retrieval deals with obtaining data from a database management system
such as an ODBMS. It is a process of identifying and retrieving data from the database, based
on the query provided by the user or application. It provides solutions to the user of the
database system.
The User Task: The information need first has to be translated into a query by the user. In
an information retrieval system, the query is a set of words that conveys the semantics of the
information that is required, whereas in a data retrieval system, a query expression is used to
convey the constraints that must be satisfied by the objects. Example: a user who sets out to
search for one thing but ends up looking through other, related material is browsing and not
searching. The above figure shows the interaction of the user through these different tasks.
● Logical View of the Documents: Traditionally, documents were represented by a set of
index terms or keywords. Modern computers can represent documents by their full set
of words, which is then reduced to a smaller set of representative keywords. This can
be done by eliminating stopwords, i.e. articles and connectives. These operations are
called text operations; they reduce the complexity of the document representation
from full text to a set of index terms.
The purpose of this chapter is two-fold: First, we want to set the stage for the problems in
information retrieval that we try to address in this thesis. Second, we want to give the reader a
quick overview of the major textual retrieval methods, because the InfoCrystal can help to
visualize the output from any of them. We begin by providing a general model of the
information retrieval process. We then briefly describe the major retrieval methods and
characterize them in terms of their strengths and shortcomings.
The goal of information retrieval (IR) is to provide users with those documents that will satisfy their
information need. We use the word "document" as a general term that could also include
non-textual information, such as multimedia objects. Figure 4.1 provides a general overview
of the information retrieval process, which has been adapted from Lancaster and Warner
(1993). Users have to formulate their information need in a form that can be understood by
the retrieval mechanism. There are several steps involved in this translation process that we
will briefly discuss below. Likewise, the contents of large document collections need to be
described in a form that allows the retrieval mechanism to identify the potentially relevant
documents quickly. In both cases, information may be lost in the transformation process
leading to a computer-usable representation. Hence, the matching process is inherently
imperfect.
Marchionini (1992) contends that some sort of spreadsheet is needed that supports users in
the problem definition as well as other information seeking tasks. The InfoCrystal is such a
spreadsheet because it assists users in the formulation of their information needs and the
exploration of the retrieved documents, using a visual interface that supports a "what-if"
functionality. He further predicts that advances in computing power and speed, together with
improved information retrieval procedures, will continue to blur the distinctions between
problem articulation and examination of results. The InfoCrystal is both a visual query
language and a tool for visualizing retrieval results.
The information need can be understood as forming a pyramid, where only its peak is made
visible by users in the form of a conceptual query (see Figure 2.1). The conceptual query
captures the key
concepts and the relationships among them. It is the result of a conceptual analysis that
operates on the information need, which may be well or vaguely defined in the user's mind.
This analysis can be challenging, because users are faced with the general "vocabulary
problem" as they are trying to translate their information need into a conceptual query. This
problem refers to the fact that a single word can have more than one meaning, and,
conversely, the same concept can be described by surprisingly many different words. Furnas,
Landauer, Gomez and Dumais (1983) have shown that two people use the same main word to
describe an object only 10 to 20% of the time. Further, the concepts used to represent the
documents can be different from the concepts used by the user. The conceptual query can
take the form of a natural language statement, a list of concepts that can have degrees of
importance assigned to them, or it can be a statement that coordinates the concepts using
Boolean operators. Finally, the conceptual query has to be translated into a query surrogate
that can be understood by the retrieval system.
Figure 2.1: represents a general model of the information retrieval process, where both the
user's information need and the document collection have to be translated into the form of
surrogates to enable the matching process to be performed. This figure has been adapted from
Lancaster and Warner (1993).
Similarly, the meanings of documents need to be represented in the form of text surrogates
that can be processed by computer. A typical surrogate can consist of a set of index terms or
descriptors. The text surrogate can consist of multiple fields, such as the title, abstract,
descriptor fields to capture the meaning of a document at different levels of resolution or
focusing on different characteristic aspects of a document. Once the specified query has been
executed by the IR system, a user is presented with the retrieved document surrogates. Either the
user is satisfied by the retrieved information or he will evaluate the retrieved documents and
modify the query to initiate a further search. The process of query modification based on user
evaluation of the retrieved documents is known as relevance feedback [Lancaster and Warner
1993]. Information retrieval is an inherently interactive process, and the users can change
direction by modifying the query surrogate, the conceptual query or their understanding of
their information need.
It is worth noting here the results, which have been obtained in studies investigating the
information-seeking process, that describe information retrieval in terms of the cognitive
and affective symptoms commonly experienced by a library user. The findings by Kuhlthau
et al. (1990) indicate that thoughts about the information need become clearer and more
focused as users move through the search process. Similarly, uncertainty, confusion, and
frustration are nearly universal experiences in the early stages of the search process, and
they decrease as the search process progresses and feelings of being confident, satisfied,
sure and relieved increase. The studies also indicate that cognitive attributes may affect the
search process. User's expectations of the information system and the search process may
influence the way they approach searching and therefore affect the intellectual access to
information.
Analytical search strategies require the formulation of specific, well-structured queries and a
systematic, iterative search for information, whereas browsing involves the generation of
broad query terms and a scanning of much larger sets of information in a relatively
unstructured fashion. Campagnoni et al. (1989) have found in information retrieval studies in
hypertext systems that the predominant search strategy is "browsing" rather than "analytical
search". Many users, especially novices, are unwilling or unable to precisely formulate their
search objectives, and browsing places less cognitive load on them. Furthermore, their
research showed that search strategy is only one dimension of effective information retrieval;
individual differences in visual skill appear to play an equally important role.
These two studies argue for information displays that provide a spatial overview of the data
elements and that simultaneously provide rich visual cues about the content of the individual
data elements.
Such a representation is less likely to increase the anxiety that is a natural part of the early
stages of the search process and it caters for a browsing interaction style, which is appropriate
especially in the beginning, when many users are unable to precisely formulate their search
objectives.
The following major models have been developed to retrieve information: the Boolean
model, the Statistical model, which includes the vector space and the probabilistic retrieval
model, and the Linguistic and Knowledge-based models. The first model is often referred to
as the "exact match" model; the latter ones as the "best match" models [Belkin and Croft
1992]. The material presented here is based on the textbooks by Lancaster and Warner (1992)
as well as Frakes and Baeza-Yates (1992), the review article by Belkin and Croft (1992), and
discussions with Richard Marcus, my thesis advisor and mentor in the field of information
retrieval.
Queries generally are less than perfect in two respects: First, they retrieve some irrelevant
documents. Second, they do not retrieve all the relevant documents. The following two
measures are usually used to evaluate the effectiveness of a retrieval method. The first one,
called the precision rate, is equal to the proportion of the retrieved documents that are
actually relevant. The second one, called the recall rate, is equal to the proportion of all
relevant documents that are actually retrieved. If searchers want to raise precision, then they
have to narrow their queries. If searchers want to raise recall, then they broaden their query.
In general, there is an inverse relationship between precision and recall. Users need help to
become knowledgeable in how to manage the precision and recall trade-off for their
particular information need [Marcus 1991].
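As a minimal illustrative sketch of these two measures, assuming documents are identified simply by id strings and that the retrieved and relevant sets are already known, precision and recall could be computed as follows:
RetrievalMetrics.java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
public class RetrievalMetrics {
// precision = |retrieved AND relevant| / |retrieved|
static double precision(Set<String> retrieved, Set<String> relevant) {
Set<String> hits = new HashSet<>(retrieved);
hits.retainAll(relevant);
return retrieved.isEmpty() ? 0.0 : (double) hits.size() / retrieved.size();
}
// recall = |retrieved AND relevant| / |relevant|
static double recall(Set<String> retrieved, Set<String> relevant) {
Set<String> hits = new HashSet<>(retrieved);
hits.retainAll(relevant);
return relevant.isEmpty() ? 0.0 : (double) hits.size() / relevant.size();
}
public static void main(String[] args) {
Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));
System.out.println("precision = " + precision(retrieved, relevant)); // 2 of 4 retrieved are relevant = 0.5
System.out.println("recall    = " + recall(retrieved, relevant));    // 2 of 3 relevant are retrieved ~ 0.67
}
}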
In Table 2.1 we summarize the defining characteristics of the standard Boolean approach
and list its key advantages and disadvantages. It has the following strengths: 1) It is easy to
implement and it is computationally efficient [Frakes and Baeza-Yates 1992]. Hence, it is
the standard model for the
current large-scale, operational retrieval systems and many of the major on-line information
services use it. 2) It enables users to express structural and conceptual constraints to describe
important linguistic features [Marcus 1991]. Users find that synonym specifications
(reflected by OR-clauses) and phrases (represented by proximity relations) are useful in the
formulation of queries [Cooper 1988, Marcus 1991]. 3) The Boolean approach possesses a
great expressive power and clarity.
Boolean retrieval is very effective if a query requires an exhaustive and unambiguous
selection. 4) The Boolean method offers a multitude of techniques to broaden or narrow a
query. 5) The Boolean approach can be especially effective in the later stages of the search
process, because of the clarity and exactness with which relationships between concepts can
be represented.
The standard Boolean approach has the following shortcomings: 1) Users find it difficult to
construct effective Boolean queries for several reasons [Cooper 1988, Fox and Koll 1988,
Belkin and Croft 1992]. Users are using the natural language terms AND, OR or NOT that
have a different meaning when used in a query. Thus, users will make errors when they form
a Boolean query, because they resort to their knowledge of English.
Table 2.1: summarizes the defining characteristics of the standard Boolean approach and lists
its key advantages and disadvantages.
For example, in ordinary conversation a noun phrase of the form "A and B" usually refers to
more entities than would "A" alone, whereas when used in the context of information
retrieval it refers to fewer documents than would be retrieved by "A" alone. Hence, one of the
common mistakes made by users is to substitute the AND logical operator for the OR logical
operator when translating an English sentence to a Boolean query. Furthermore, to form
complex queries, users must be familiar with the rules of precedence and the use of
parentheses. Novice users have difficulty using parentheses, especially nested parentheses.
Finally, users are overwhelmed by the multitude of ways a query can be structured or
modified, because of the combinatorial explosion of feasible queries as the number of
concepts increases. In particular, users have difficulty identifying and applying the different
strategies that are available for narrowing or broadening a Boolean query [Marcus 1991,
Lancaster and Warner 1993]. 2) Only documents that satisfy a query exactly are retrieved. On
the one hand, the AND operator is too severe because it does not distinguish between the
case when none of the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or four criteria are
combined with the Boolean operator AND (referred to as the Null Output problem). On the
other hand, the OR operator does not reflect how many concepts have been satisfied. Hence,
often too many documents are retrieved (the Output Overload problem). 3) It is difficult to
control the number of retrieved documents. Users are often faced with the null-output or the
information overload problem and they are at a loss as to how to modify the query to retrieve a
reasonable number of documents. 4) The traditional Boolean approach does not provide a
relevance ranking of the retrieved documents, although modern Boolean approaches can
make use of the degree of coordination, field level and degree of stemming present to rank
them [Marcus 1991]. 5) It does not represent the degree of uncertainty or error due to the
vocabulary problem [Belkin and Croft 1992].
2.3.1.2 Narrowing and Broadening Techniques
As mentioned earlier, a Boolean query can be described in terms of the following four
operations: degree and type of coordination, proximity constraints, field specifications and
degree of stemming as expressed in terms of word/string specifications. If users want to
(re)formulate a Boolean query then they need to make informed choices along these four
dimensions to create a query that is sufficiently broad or narrow depending on their
information needs. Most narrowing techniques lower recall as well as raise precision, and
most broadening techniques lower precision as well as raise recall. Any query can be
reformulated to achieve the desired precision or recall characteristics, but generally it is
difficult to achieve both. Each of the four kinds of operations in the query formulation has
particular operators, some of which tend to have a narrowing or broadening effect. For each
operator with a narrowing effect, there is one or more inverse operators with a broadening
effect [Marcus 1991]. Hence, users require help to gain an understanding of how changes
along these four dimensions will affect the broadness or narrowness of a query.
Figure 2.2: captures how coordination, proximity, field level and stemming affect the
broadness or narrowness of a Boolean query. By moving in the direction in which the
wedges are expanding the query is broadened.
Figure 2.2 shows how the four dimensions affect the broadness or narrowness of a query: 1)
Coordination: the different Boolean operators AND, OR and NOT have the following effects
when used to add a further concept to a query: a) the AND operator narrows a query; b) the
OR broadens it; c) the effect of the NOT depends on whether it is combined with an AND or
OR operator. Typically, in searching textual databases, the NOT is connected to the AND, in
which case it has a narrowing effect like the AND operator. 2) Proximity: The closer together
two terms have to appear in a document, the more narrow and precise the query. The most
stringent proximity constraint requires the two terms to be adjacent. 3) Field level: current
document records have fields associated with them, such as the "Title", "Index", "Abstract" or
"Full-text" field: a) the more fields that are searched, the broader the query; b) the individual
fields have varying degrees of precision associated with them, where the "title" field is the
most specific and the "full-text" field is the most general. 4) Stemming: The shorter the prefix
that is used in truncation-based searching, the broader the query. By reducing a term to its
morphological stem and using it as a prefix, users can retrieve many terms that are
conceptually related to the original term [Marcus 1991].
Using Figure 2.2, we can easily read off how to broaden a query. We just need to move in the
direction in which the wedges are expanding: we use the OR operator (rather than the AND),
impose no proximity constraints, search over all fields and apply a great deal of stemming.
Similarly, we can formulate a very narrow query by moving in the direction in which the
wedges are contracting: we use the AND operator (rather than the OR), impose proximity
constraints, restrict the search to the
title field and perform exact rather than truncated word matches. In Chapter 4 we will show
how Figure 2.2 indicates how the broadness or narrowness of a Boolean query could be
visualized.
There have been attempts to help users overcome some of the disadvantages of the traditional
Boolean approach discussed above. We will now describe such a method, called Smart Boolean,
developed by Marcus [1991, 1994] that tries to help users construct and modify a Boolean
query as well as make better choices along the four dimensions that characterize a Boolean
query. We are not attempting to provide an in-depth description of the Smart Boolean
method, but to use it as a good example that illustrates some of the possible ways to make
Boolean retrieval more user-friendly and effective. Table 2.2 provides a summary of the key
features of the Smart Boolean approach.
Users start by specifying a natural language statement that is automatically translated into a
Boolean Topic representation that consists of a list of factors or concepts, which are
automatically coordinated using the AND operator. If the user at the initial stage can or wants
to include synonyms, then they are coordinated using the OR operator. Hence, the Boolean
Topic representation connects the different factors using the AND operator, where the factors
can consist of single terms or several synonyms connected by the OR operator. One of the
goals of the Smart Boolean approach is to make use of the structural knowledge contained in
the text surrogates, where the different fields represent contexts of useful information.
Further, the Smart Boolean approach wants to use the fact that related concepts can share a
common stem. For example, the concepts "computers" and "computing" have the common
stem comput*.
Table 2.2: summarizes the defining characteristics of the Smart Boolean approach and lists its key
advantages and disadvantages.
The initial strategy of the Smart Boolean approach is to start out with the broadest possible
query within the constraints of how the factors and their synonyms have been coordinated.
Hence, it modifies the Boolean Topic representation into the query surrogate by using only
the stems of the concepts and searches for them over all the fields. Once the query surrogate
has been performed, users are guided in the process of evaluating the retrieved document
surrogates. They choose from a list of reasons to indicate why they consider certain
documents as relevant. Similarly, they can indicate why other documents are not relevant by
interacting with a list of possible reasons. This user feedback is used by the Smart Boolean
system to automatically modify the Boolean Topic representation or the query surrogate,
whatever is more appropriate. The Smart Boolean approach offers a rich set of strategies for
modifying a query based on the received relevance feedback or the expressed need to narrow
or broaden the query. The Smart Boolean retrieval paradigm has been implemented in the
form of a system called CONIT, which is one of the earliest expert retrieval systems that was
able to demonstrate that ordinary users, assisted by such a system, could perform equally well
as experienced search intermediaries [Marcus 1983]. However, users have to navigate
through a series of menus listing different choices, where it might be hard for them to
appreciate the
implications of some of these choices. A key limitation of the previous versions of the
CONIT system has been that it lacked a visual interface. The most recent version has a
graphical interface and it uses the tiling metaphor suggested by Anick et al. (1991), and
discussed in section 10.4, to visualize Boolean coordination [Marcus 1994]. This
visualization approach suffers from the limitation that it enables users to visualize specific
queries, whereas we will propose a visual interface that represents a whole range of related
Boolean queries in a single display, making changes in Boolean coordination more user-
friendly. Further, the different strategies of modifying a query in CONIT require a better
visualization metaphor to enable users to make use of these search heuristics. In Chapter 4 we
show how some of these modification techniques can be visualized.
Several methods have been developed to extend the Boolean model to address the following issues:
1) The Boolean operators are too strict and ways need to be found to soften them. 2) The
standard Boolean approach has no provision for ranking. The Smart Boolean approach and
the methods described in this section provide users with relevance ranking [Fox and Koll
1988, Marcus 1991]. 3) The Boolean model does not support the assignment of weights to
the query or document terms. We will briefly discuss the P-norm and the Fuzzy Logic
approaches that extend the Boolean model to address the above issues.
Table 2.3: summarizes the defining characteristics of the Extended Boolean approach and lists
its key advantages and disadvantages.
The P-norm method developed by Fox (1983) allows query and document terms to have
weights, which have been computed by using term frequency statistics with the proper
normalization procedures. These normalized weights can be used to rank the documents in the
order of decreasing distance from the point (0, 0, ... , 0) for an OR query, and in order of
increasing distance from the point (1, 1, ... , 1) for an AND query. Further, the Boolean
operators have a coefficient P associated with them to indicate the degree of strictness of the
operator (from 1 for least strict to infinity for most strict, i.e., the Boolean case). The P-norm
uses a distance-based measure and the coefficient P determines the degree of exponentiation
to be used. The exponentiation is an expensive computation, especially for P-values greater
than one.
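For orientation, the commonly cited unweighted form of the P-norm similarity functions can be written as follows; here d_1, ..., d_n denote the weights of the n query terms in document d, and p >= 1 is the strictness coefficient mentioned above (p = 1 behaves like a vector model, while p approaching infinity recovers strict Boolean behaviour). The weighted formulation in Fox (1983) is slightly more general than this sketch.
SIM(q_{or}, d) = \left( \frac{d_1^{p} + d_2^{p} + \cdots + d_n^{p}}{n} \right)^{1/p}
SIM(q_{and}, d) = 1 - \left( \frac{(1-d_1)^{p} + (1-d_2)^{p} + \cdots + (1-d_n)^{p}}{n} \right)^{1/p}
These correspond to the normalized distance from the point (0, ..., 0) for an OR query and from the point (1, ..., 1) for an AND query.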
In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the
traditional binary membership choice. The weight of an index term for a given document
reflects the degree to which this term describes the content of a document. Hence, this weight
reflects the degree of membership of the document in the fuzzy set associated with the term
in question. The degree of membership for union and intersection of two fuzzy sets is equal
to the maximum and minimum, respectively, of the degrees of membership of the elements of
the two sets. In the "Mixed Min and Max" model developed by Fox and Sharat (1986) the
Boolean operators are softened by
considering the query-document similarity to be a linear combination of the min and max
weights of the documents.
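Written out, and using \mu_A(d) for the weight (degree of membership) of document d in the fuzzy set associated with index term A, the fuzzy-set rules described above are the following; the last line only sketches the "Mixed Min and Max" idea of taking a linear combination of the two, with the mixing coefficient \lambda left as an assumption of this illustration.
\mu_{A \,\mathrm{AND}\, B}(d) = \min\big(\mu_A(d), \mu_B(d)\big)
\mu_{A \,\mathrm{OR}\, B}(d) = \max\big(\mu_A(d), \mu_B(d)\big)
SIM(q, d) = \lambda \cdot \min\big(\mu_A(d), \mu_B(d)\big) + (1 - \lambda) \cdot \max\big(\mu_A(d), \mu_B(d)\big), \quad 0 \le \lambda \le 1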
The vector space and probabilistic models are the two major examples of the statistical
retrieval approach. Both models use statistical information in the form of term frequencies to
determine the relevance of documents with respect to a query. Although they differ in the
way they use the term frequencies, both produce as their output a list of documents ranked by
their estimated relevance. The statistical retrieval models address some of the problems of
Boolean retrieval methods, but they have disadvantages of their own. Table 2.4 provides
a summary of the key features of the vector space and probabilistic approaches. We will also
describe Latent Semantic Indexing and clustering approaches that are based on statistical
retrieval approaches, but their objective is to respond to what the user's query did not say,
could not say, but somehow made manifest [Furnas et al. 1983, Cutting et al. 1991].
The vector space model represents the documents and queries as vectors in a
multidimensional space, whose dimensions are the terms used to build an index to represent
the documents [Salton 1983]. The creation of an index involves lexical scanning to identify
the significant terms, where morphological analysis reduces different word forms to common
"stems", and the occurrence of those stems is computed. Query and document surrogates are
compared by comparing their vectors, using, for example, the cosine similarity measure. In
this model, the terms of a query surrogate can be weighted to take into account their
importance, and they are computed by using the statistical distributions of the terms in the
collection and in the documents [Salton 1983]. The vector space model can assign a high
ranking score to a document that contains only a few of the query terms if these terms occur
infrequently in the collection but frequently in the document. The vector space model makes
the following assumptions: 1) The more similar a document vector is to a query vector, the
more likely it is that the document is relevant to that query. 2) The words used to define the
dimensions of the space are orthogonal or independent. While it is a reasonable first
approximation, the assumption that words are pairwise independent is not realistic.
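The comparison of query and document vectors mentioned above can be illustrated with the following sketch of the cosine similarity measure; the three-term vectors and their weights are made up purely for illustration.
CosineSimilarity.java
public class CosineSimilarity {
// Cosine of the angle between a query vector and a document vector,
// both expressed over the same ordered list of index terms.
static double cosine(double[] query, double[] doc) {
double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
for (int i = 0; i < query.length; i++) {
dot += query[i] * doc[i];
qNorm += query[i] * query[i];
dNorm += doc[i] * doc[i];
}
return (qNorm == 0 || dNorm == 0) ? 0.0 : dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
}
public static void main(String[] args) {
double[] query = {1.0, 0.0, 1.0}; // hypothetical weights for a two-term query
double[] doc1 = {0.5, 0.2, 0.8};
double[] doc2 = {0.0, 0.9, 0.1};
System.out.println("sim(q, d1) = " + cosine(query, doc1)); // higher: d1 shares the query terms
System.out.println("sim(q, d2) = " + cosine(query, doc2)); // lower: d2 is dominated by a non-query term
}
}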
The probabilistic retrieval model is based on the Probability Ranking Principle, which states
that an information retrieval system is supposed to rank the documents based on their
probability of relevance to the query, given all the evidence available [Belkin and Croft
1992]. The principle takes into account that there is uncertainty in the representation of the
information need and the documents. There can be a variety of sources of evidence that are
used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the
relevant and non-relevant documents.
We will now describe the state-of-the-art system developed by Turtle and Croft (1991) that uses
Bayesian inference networks to rank documents by using multiple sources of evidence to
compute the conditional probability
P(Info need|document) that an information need is satisfied by a given document. An
inference network consists of a directed acyclic dependency graph, where edges represent
conditional dependency or causal relations between propositions represented by the nodes.
The inference network consists of a document network, a concept representation network that
represents indexing vocabulary, and a query network representing the information need. The
concept representation network is the interface between documents and queries. To compute
the rank of a document, the inference network is instantiated and the resulting probabilities
are propagated through the network to derive a probability associated with the node
representing the information need. These probabilities are used to rank documents.
The statistical approaches have the following strengths: 1) They provide users with a
relevance ranking of the retrieved documents. Hence, they enable users to control the output
by setting a relevance threshold or by specifying a certain number of documents to display. 2)
Queries can be easier to formulate because users do not have to learn a query language and
can use natural language. 3) The uncertainty inherent in the choice of query concepts can be
represented. However, the statistical approaches have the following shortcomings: 1) They
have a limited expressive power. For example, the NOT operation can not be represented
because only positive weights are used. It can be proven that only 2^(N·N) of the 2^(2^N) possible
Boolean queries can be generated by the statistical approaches that use weighted linear sums
to rank the documents. This result follows from the analysis of Linear Threshold Networks or
Boolean Perceptrons [Anthony and Biggs 1992]. For example, the very common and
important Boolean query ((A and B) or (C and D)) can not be represented by a vector space
query (see section 5.4 for a proof). Hence, the statistical approaches do not have the
expressive power of the Boolean approach. 2) The statistical approach lacks the structure to
express important linguistic features such as phrases. Proximity constraints are also difficult
to express, a feature that is of great use for experienced searchers. 3) The computation of the
relevance scores can be computationally expensive. 4) A ranked linear list provides users
with a limited view of the information space and it does not directly suggest how to modify a
query if the need arises [Spoerri 1993, Hearst 1994]. 5) The queries have to contain a large
number of words to improve the retrieval performance. As is the case for the Boolean
approach, users are faced with the problem of having to choose the appropriate words that are
also used in the relevant documents.
Table 2.4 summarizes the advantages and disadvantages that are specific to the vector space
and probabilistic model, respectively. This table also shows the formulas that are commonly
used to compute the term weights. The two central quantities used are the inverse document
frequency of a term in the collection (idf) and the frequency of term i in document j (freq(i,j)). In the
probabilistic model, the weight computation also considers how often a term appears in the
relevant and irrelevant documents, but this presupposes that the relevant documents are
known or that these frequencies can be reliably estimated.
Table 2.4: summarizes the defining characteristics of the statistical retrieval approach, which
includes the vector space and the probabilistic model, and lists their key advantages and
disadvantages.
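A commonly used instance of such a term-weighting formula, given here only for orientation since the actual Table 2.4 is not reproduced in these notes, is the standard tf-idf weight:
w_{ij} = \mathrm{freq}(i, j) \times \mathrm{idf}_i = \mathrm{freq}(i, j) \times \log \frac{N}{n_i}
where N is the number of documents in the collection and n_i is the number of documents that contain term i.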
If users provide the retrieval system with relevance feedback, then this information is used by
the statistical approaches to recompute the weights as follows: the weights of the query terms
in the relevant documents are increased, whereas the weights of the query terms that do not
appear in the relevant documents are decreased [Salton and Buckley 1990]. There are
multiple ways of computing and updating the weights, where each has its advantages and
disadvantages. We do not discuss these formulas in more detail, because research on
relevance feedback has shown that significant effectiveness improvements can be gained by
using quite simple feedback techniques [Salton and Buckley 1990]. Furthermore, what is
important to this thesis is that the statistical retrieval approach generates a ranked list,
however how this ranking has been computed in detail is immaterial for the purpose of this
thesis.
Several statistical and AI techniques have been used in association with domain semantics to
extend the vector space model to help overcome some of the retrieval problems described
above, such as the "dependence problem" or the "vocabulary problem". One such method is
Latent Semantic Indexing (LSI). In LSI the associations among terms and documents are
calculated and exploited in the retrieval process. The assumption is that there is some "latent"
structure in the pattern of word usage across documents and that statistical techniques can be
used to estimate this latent structure. An advantage of this approach is that queries can
retrieve documents even if they have no words in common. The LSI technique captures
deeper associative structure than simple term-to-term correlations and is completely
automatic. The only difference between LSI and vector space methods is that LSI represents
terms and documents in a reduced dimensional space of the derived indexing dimensions. As
with the vector space method, differential term weighting and relevance feedback can
improve LSI performance substantially.
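The "reduced dimensional space" mentioned above is usually obtained with a truncated singular value decomposition of the term-document matrix; as a general formula (not a quotation from the text above):
A \approx A_k = U_k \, \Sigma_k \, V_k^{T}
where A is the term-document matrix, \Sigma_k keeps only the k largest singular values, and the columns of U_k and V_k give the representations of terms and documents in the k-dimensional latent space.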
Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space
model. The four methods were the result of crossing two factors, the first factor being whether
the retrieval method used Latent Semantic Indexing or keyword matching, and the second
factor being whether the profile was based on words or phrases provided by the user (Word
profile), or documents that the user had previously rated as relevant (Document profile). The
LSI match-document profile method proved to be the most successful of the four methods.
This method combines the advantages of both LSI and the document profile. The document
profile provides a simple, but
effective, representation of the user's interests. Indicating just a few documents that are of
interest is as effective as generating a long list of words and phrases that describe one's
interest. Document profiles have an added advantage over word profiles: users can just
indicate documents they find relevant without having to generate a description of their
interests.
In the simplest form of automatic text retrieval, users enter a string of keywords that are used
to search the inverted indexes of the document keywords. This approach retrieves documents
based solely on the presence or absence of exact single word strings as specified by the
logical representation of the query. Clearly this approach will miss many relevant documents
because it does not capture the complete or deep meaning of the user's query. The Smart
Boolean approach and the statistical retrieval approaches, each in their specific way, try to
address this problem (see Table 2.5). Linguistic and knowledge-based approaches have also
been developed to address this problem by performing a morphological, syntactic and
semantic analysis to retrieve documents more effectively [Lancaster and Warner 1993]. In a
morphological analysis, roots and affixes are analyzed to determine the part of speech (noun,
verb, adjective etc.) of the words. Next complete phrases have to be parsed using some form
of syntactic analysis. Finally, the linguistic methods have to resolve word ambiguities and/or
generate relevant synonyms or quasi-synonyms based on the semantic relationships between
words. The development of a sophisticated linguistic retrieval system is difficult and it
requires complex knowledge bases of semantic information and retrieval heuristics. Hence
these systems often require techniques that are commonly referred to as artificial intelligence
or expert systems techniques.
We will now describe in some detail the DR-LINK system developed by Liddy et al., because
it represents an exemplary linguistic retrieval system. DR-LINK is based on the principle that
retrieval should take place at the conceptual level and not at the word level. Liddy et al.
attempt to retrieve documents on the basis of what people mean in their query and not just
what they say in their query. DR-LINK system employs sophisticated, linguistic text
processing techniques to capture the conceptual information in documents. Liddy et al. have
developed a modular system that represents and matches text at the lexical, syntactic,
semantic, and the discourse levels of language. Some of the modules that have been
incorporated are: The Text Structurer is based on discourse linguistic theory that suggests that
texts of a particular type have a predictable structure which serves as an indication where
certain information can be found. The Subject Field Coder uses an established semantic
coding scheme from a machine-readable dictionary to tag each word with its disambiguated
subject code (e.g., computer science, economics) and to then produce a fixed-length, subject-
based vector representation of the document and the query. The Proper Noun Interpreter uses
a variety of processing heuristics and knowledge bases to produce: a canonical representation
of each proper noun; a classification of each proper noun into thirty-seven categories; and an
expansion of group nouns into their constituent proper noun members. The Complex
Nominal Phraser provides means for precise matching of complex semantic constructs when
expressed as either adjacent nouns or a
non-predicating adjective and noun pair. Finally, The Natural Language Query Constructor
takes as input a natural language query and produces a formal query that reflects the
appropriate logical combination of text structure, proper noun, and complex nominal
requirements of the user's information need. This module interprets a query into pattern-
action rules that translate each sentence into a first-order logic assertion, reflecting the
Boolean-like requirements of queries.
Table 2.5: characterizes the major retrieval methods in terms of how they deal with lexical,
morphological, syntactic and semantic issues.
To summarize, the DR-LINK retrieval system represents content at the conceptual level rather
than at the word level to reflect the multiple levels of human language comprehension. The
text representation combines the lexical, syntactic, semantic, and discourse levels of
understanding to predict the relevance of a document. DR-LINK accepts natural language
statements, which it translates into a precise Boolean representation of the user's relevance
requirements. It also produces summary-level, semantic vector representations of queries
and documents to provide a ranking of the documents.
2.4 Conclusion
There is a growing discrepancy between the retrieval approach used by existing commercial
retrieval systems and the approaches investigated and promoted by a large segment of the
information retrieval research community. The former is based on the Boolean or Exact
Matching retrieval model, whereas the latter ones subscribe to statistical and linguistic
approaches, also referred to as the Partial Matching approaches. First, the major criticism
leveled against the Boolean approach is that its queries are difficult to formulate. Second, the
Boolean approach makes it possible to represent structural and contextual information that
would be very difficult to represent using the statistical approaches. Third, the Partial
Matching approaches provide users with a ranked output, but these ranked lists obscure
valuable information. Fourth, recent retrieval experiments have shown that the Exact and
Partial matching approaches are complementary and should therefore be combined [Belkin et
al. 1993].
Table 2.6: lists some of the key problems in the field of information retrieval and possible solutions.
In Table 2.6 we summarize some of the key problems in the field of information retrieval and
possible solutions to them. We will attempt to show in this thesis: 1) how visualization can
offer ways to address these problems; 2) how to formulate and modify a query; 3) how to
deal with large sets of retrieved documents, commonly referred to as the information
overload problem. In particular, this
thesis overcomes one of the major "bottlenecks" of the Boolean approach by showing how
Boolean coordination and its diverse narrowing and broadening techniques can be visualized,
thereby making it more user-friendly without limiting its expressive power. Further, this
thesis shows how both the Exact and Partial Matching approaches can be visualized in the
same visual framework to enable users to make effective use of their respective strengths.
TEXT PREPROCESSING
Information retrieval is the task of obtaining relevant information from large collections of
documents. Preprocessing plays an important role in information retrieval, since it prepares the
text from which relevant information is extracted. A typical text preprocessing approach works in
two steps: first, a spell-check utility is used to enhance stemming, and second, synonyms of similar
tokens are combined. The commonly used text preprocessing techniques are:
1. Stopword Removal
Stopwords are very commonly used words in a language that play a major role in the
formation of a sentence but which seldom contribute to the meaning of that sentence. Words
that are expected to occur in 80 percent or more of the documents in a collection are typically
referred to as stopwords, and they are rendered potentially useless. Because of the
commonness and function of these words, they do not contribute much to the relevance of a
document for a query search.
Examples include words such as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at,
by, and it. Removal of stopwords from a document must be performed before indexing.
Articles, prepositions, conjunctions, and some pronouns are generally classified as
stopwords. Queries must also be preprocessed for stopword removal before the actual
retrieval process. Removal of stopwords results in elimination of possible spurious indexes,
thereby reducing the size of an index structure by about 40 percent or more. However, doing
so could impact the recall if the stopword is an integral part of a query (for example, a search
for the phrase ‘To be or not to be,’ where removal of stopwords makes the query
inappropriate, as all the words in the phrase are stopwords). Many search engines do not
employ query stopword removal for this reason.
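A minimal sketch of stopword removal during preprocessing is shown below; the stopword list is deliberately tiny and purely illustrative, since production systems use much larger, language-specific lists.
StopwordRemoval.java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
public class StopwordRemoval {
// A tiny illustrative stopword list; real systems use far larger ones.
static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
"the", "of", "to", "a", "and", "in", "said", "for", "that", "was", "on", "he", "is", "with", "at", "by", "it"));
static List<String> removeStopwords(String text) {
return Arrays.stream(text.toLowerCase().split("\\W+"))
.filter(token -> !token.isEmpty() && !STOPWORDS.contains(token))
.collect(Collectors.toList());
}
public static void main(String[] args) {
System.out.println(removeStopwords("Removal of stopwords is performed before indexing the documents"));
// prints: [removal, stopwords, performed, before, indexing, documents]
}
}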
2. Stemming
A stem of a word is defined as the word obtained after trimming the suffix and prefix
of an original word. For example, ‘comput’ is the stem word for computer, computing, and
computation. These suffixes and prefixes are very common in the English language for
supporting the notion of verbs, tenses, and plural forms. Stemming reduces the different
forms of the word formed by inflection (due to plurals or tenses) and derivation to a common
stem. A stemming algorithm can be applied to reduce any word to its stem. In English, the
most famous stemming algorithm is Martin Porter’s stemming algorithm. The Porter stemmer
is a simplified version of Lovin’s technique that uses a reduced set of about 60 rules (from
260 suffix patterns in Lovin’s technique) and organizes them into sets; conflicts within one
subset of rules are resolved before going on to the next. Using stemming for preprocessing
data results in a decrease in the size of the indexing structure and an increase in recall,
possibly at the cost of precision.
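The following sketch shows the idea of suffix stripping with a handful of made-up rules; it is not the Porter algorithm (which organizes roughly 60 rules into ordered sets), only an illustration of how inflected forms collapse onto a common stem.
SimpleStemmer.java
public class SimpleStemmer {
// A deliberately simplified suffix-stripping stemmer; illustrative only.
static String stem(String word) {
String w = word.toLowerCase();
String[] suffixes = {"ation", "ing", "ers", "er", "ies", "es", "ed", "s"};
for (String suffix : suffixes) {
if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
return w.substring(0, w.length() - suffix.length());
}
}
return w;
}
public static void main(String[] args) {
for (String word : new String[]{"computer", "computing", "computation"}) {
System.out.println(word + " -> " + stem(word)); // each reduces to the stem "comput"
}
}
}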
3. Utilizing a Thesaurus
A thesaurus comprises a precompiled list of important concepts and the main word
that describes each concept for a particular domain of knowledge. For each concept in this
list, a set of synonyms and related words is also compiled. Thus, a synonym can be converted
to its matching concept during preprocessing. This preprocessing step assists in providing a
standard vocabulary for
indexing and searching. Usage of a thesaurus, also known as a collection of synonyms, has a
substantial impact on the recall of information systems. This process can be complicated
because many words have different meanings in different contexts. UMLS is a large
biomedical thesaurus of millions of concepts (called the Metathesaurus) and a semantic
network of meta concepts and relationships that organize the Metathesaurus. The concepts
are assigned labels from the semantic network. This thesaurus of concepts contains synonyms
of medical terms, hierarchies of broader and narrower terms, and other relationships among
words and concepts that make it a very extensive resource for information retrieval of
documents in the medical domain. Figure 27.3 illustrates part of the UMLS Semantic
Network.
WordNet is a manually constructed thesaurus that groups words into strict synonym sets
called synsets. These synsets are divided into noun, verb, adjective, and adverb
categories. Within each category, these synsets are linked together by appropriate
relationships such as class/subclass or “is-a” relationships for nouns.
WordNet is based on the idea of using a controlled vocabulary for indexing, thereby eliminating
redundancies. It is also useful in providing assistance to users with locating terms for proper query
formulation.
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases
Digits, dates, phone numbers, e-mail addresses, URLs, and other standard
types of text may or may not be removed during preprocessing. Web search engines, however,
index them in order to use this type of information in the document metadata to improve
precision and recall (see Section 27.6 for detailed definitions of precision and recall).
Hyphens and punctuation marks may be handled in different ways. Either the entire phrase with the
hyphens/punctuation marks may be used, or they may be eliminated. In some systems, the character
representing the hyphen/punctuation mark may be removed, or may be replaced with a space.
Different information retrieval systems follow different rules of processing. Handling
hyphens automatically can be complex: it can either be done as a classification problem, or
more commonly by some heuristic rules.
Most information retrieval systems perform case-insensitive search, converting all the letters
of the text to uppercase or lowercase. It is also worth noting that many of these text
preprocessing steps are language specific, such as involving accents and diacritics and the
idiosyncrasies that are associated with a particular language.
5. Information Extraction
Information extraction (IE) is a generic term used for extracting structured content
from text. Text analytic tasks such as identifying noun phrases, facts, events, people, places,
and relationships are examples of IE tasks. These tasks are also called named entity
recognition tasks and use rule-based approaches with either a thesaurus, regular expressions
and grammars, or probabilistic approaches. For IR and search applications, IE technologies
are mostly used to identify contextually relevant features that involve text analysis, matching,
and categorization for improving the relevance of search systems. Language technologies
using part-of-speech tagging are applied to semantically annotate the documents with
extracted features to aid search relevance.
Inverted Index
An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap
like data structure that directs you from a word to a document or a web page.
There are two types of inverted indexes:
A record-level inverted index contains a list of references to documents for each word.
A word-level inverted index additionally contains the positions of each word within a
document. The latter form offers more functionality, but needs more processing power and
space to be created.
Suppose we want to search the texts “hello everyone, ” “this article is based on inverted
index, ” “which is hashmap like data structure”. If we index by (text, word within the
text), the index with location in text is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has the entry (1,
1), and the word “is” is in documents 2 and 3 at the 3rd and 2nd positions respectively (here the
position is based on words).
The index may have weights, frequencies, or other indicators.
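A small sketch of how such a word-level inverted index could be built for the three example texts above is given below; document numbers and word positions are counted from 1 so that the output matches the (text, word) pairs listed.
InvertedIndexDemo.java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
public class InvertedIndexDemo {
public static void main(String[] args) {
String[] docs = {
"hello everyone",
"this article is based on inverted index",
"which is hashmap like data structure"
};
// word -> list of (documentNumber, wordPosition) postings
Map<String, List<int[]>> index = new TreeMap<>();
for (int d = 0; d < docs.length; d++) {
String[] words = docs[d].toLowerCase().split("\\W+");
for (int p = 0; p < words.length; p++) {
index.computeIfAbsent(words[p], k -> new ArrayList<>()).add(new int[]{d + 1, p + 1});
}
}
for (Map.Entry<String, List<int[]>> entry : index.entrySet()) {
StringBuilder postings = new StringBuilder();
for (int[] posting : entry.getValue()) {
postings.append("(").append(posting[0]).append(", ").append(posting[1]).append("); ");
}
System.out.println(entry.getKey() + " " + postings);
}
}
}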
To build a record-level inverted index, each word is simply mapped to the documents in which it occurs, for example:
Word - Documents
ant - doc1
demo - doc2
world - doc1, doc2
Advantages of an inverted index:
● The purpose of an inverted index is to allow fast full-text searches, at the cost of increased
processing when a document is added to the database.
● It is easy to develop.
● It is the most popular data structure used in document retrieval systems, used on a
large scale for example in search engines.
The inverted index also has disadvantages:
● Large storage overhead and high maintenance costs on update, delete and insert.
Evaluative Measures:
Data collected through web analytics may include traffic sources, referring sites, page views,
paths taken and conversion rates. The compiled data often forms a part of customer
relationship management analytics (CRM analytics) to facilitate and streamline better
business decisions.
Web analytics enables a business to retain customers, attract more visitors and increase the
dollar volume each customer spends. It can be used to:
● Determine the likelihood that a given customer will repurchase a product after purchasing it in
the past.
● Personalize the site for customers who visit it repeatedly.
● Monitor the amount of money individual customers or specific groups of customers spend.
● Observe the geographic regions from which the most and the fewest customers visit the site and
purchase specific products.
● Predict which products customers are most and least likely to buy in the future.
The objective of web analytics is to serve as a business metric for promoting specific products
to the customers who are most likely to buy them and to determine which products a specific
customer is most likely to purchase. This can help improve the ratio of revenue to marketing
costs.
In addition to these features, web analytics may track the clickthrough and drilldown
behavior of customers within a website, determine the sites from which customers most often
arrive, and communicate with browsers to track and analyze online behavior. The results of
web analytics are provided in the form of tables, charts and graphs.
Setting goals. The first step in the web analytics process is for businesses to determine goals
and the end results they are trying to achieve. These goals can include increased sales,
customer satisfaction and brand awareness. Business goals can be both quantitative and
qualitative.
Collecting data. The second step in web analytics is the collection and storage of data.
Businesses can collect data directly from a website or web analytics tool, such as Google
Analytics. The data mainly comes from Hypertext Transfer Protocol requests -- including
data at the network and application levels -- and can be combined with external data to
interpret web usage. For example, a user's Internet Protocol address is typically associated
with many factors, including geographic location and clickthrough rates.
Processing data. The next stage of the web analytics funnel involves businesses processing
the collected data into actionable information.
Identifying key performance indicators (KPIs). In web analytics, a KPI is a quantifiable
measure to monitor and analyze user behavior on a website. Examples include bounce rates,
unique users, user sessions and on-site search queries.
Developing a strategy. This stage involves implementing insights to formulate strategies that
align with an organization's goals. For example, search queries conducted on-site can help an
organization develop a content strategy based on what users are searching for on its website.
Experimenting and testing. Businesses need to experiment with different strategies in order
to find the one that yields the best results. For example, A/B testing is a simple strategy to help
learn how an audience responds to different content. The process involves creating two or
more versions of content and then displaying it to different audience segments to reveal
which version of the content performs better.
Log file analysis, also known as log management, is the process of analyzing data gathered
from log files to monitor, troubleshoot and report on the performance of a website. Log files
hold records of virtually every action taken on a network server, such as a web server, email
server, database server or file server.
Page tagging is the process of adding snippets of code into a website's HyperText Markup
Language code using a tag management system to track website visitors and their
interactions across the website. These snippets of code are called tags. When businesses add
these tags to a website, they can be used to track any number of metrics, such as the number
of pages viewed, the number of unique visitors and the number of specific products viewed.
Web analytics tools
Web analytics tools report important statistics on a website, such as where visitors came from,
how long they stayed, how they found the site and their online activity while on the site. In
addition to web analytics, these tools are commonly used for product analytics, social media
analytics and marketing analytics.
Google Analytics. Google Analytics is a web analytics platform that monitors website traffic,
behaviors and conversions. The platform tracks page views, unique visitors, bounce rates,
referral Uniform Resource Locators, average time on-site, page abandonment, new vs.
returning visitors and demographic data.
Optimizely. Optimizely is a customer experience and A/B testing platform that helps
businesses test and optimize their online experiences and marketing efforts, including
conversion rate optimization.
Kissmetrics. Kissmetrics is a customer analytics platform that gathers website data and
presents it in an easy-to-read format. The platform also serves as a customer intelligence tool,
as it enables businesses to dive deeper into customer behavior and use this information to
enhance their website and marketing campaigns.
Crazy Egg. Crazy Egg is a tool that tracks where customers click on a page. This information
can help organizations understand how visitors interact with content and why they leave the
site. The tool tracks visitors, heatmaps and user session recordings.
an unstructured nature (i.e. usually text) that satisfies an information need from within large
collections stored on computers.
□ For example, information retrieval takes place when a user enters a query into the system.
▪ In MongoDB, for example, the user can work from the command line (the mongo shell) and run commands such as: show databases ◻ to display all the databases, and use StudentsDetails (the name of our database) ◻ to switch to it if it is already in the list, or to start working with a new database otherwise.
□ Information retrieval is no longer an activity that only librarians, professional searchers, etc. engage in; everyday tasks such as classifying e-mails so that they can be placed directly into particular folders are also forms of information retrieval.
□ An IR system has the ability to represent, store, organize, and access information items.
□ Keywords are the terms people search for in search engines, and retrieved results are ranked by relevance, much as employees in an organization are ranked by their posting/position: owner of the organization / project manager ◻ team leader ◻ workers / team members (a small inverted-index sketch follows at the end of this section).
□ The Web provides greater access to networks due to digital communication and gives free access to publish on a larger medium.
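To make the ideas above concrete, here is a toy Python sketch of an inverted index, the basic data structure many IR systems use to represent, store, organize and access documents by mapping keywords to the documents that contain them (the document collection below is invented):

from collections import defaultdict

# Toy document collection (invented).
docs = {
    1: "distributed database systems store data at multiple sites",
    2: "information retrieval finds relevant documents for a query",
    3: "a query is matched against an index of keywords",
}

# Build an inverted index: keyword -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every keyword in the query."""
    terms = query.lower().split()
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(search("query keywords"))   # -> {3}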
Web Trends in The Coming Years:
□ When the Internet was introduced back in the 1980s, the sole purpose of it was
to communicate data locally on an inter-connected wired network for research
purposes.
□ Since then, it has expanded and evolved in bits and pieces.
□ The internet now holds a very strong place in our lives, and without it, our
lives seem impossible.
□ The internet of today runs in all the domains of our life, from a simple search to entire industry sectors. Some of the web technologies expected to shape the coming years include:
1. WebRTC
2. Internet of Things (IoT)
❖ Internet of Things (IoT) ◻ The Internet of Things is considered the backbone of the modern internet, since it is through IoT that consumers, governments and businesses are able to interact with the physical world.
□ This will help problems to be solved in a much better and more engaging way.
□ The vision of an advanced and closely connected internet cannot be realized without the use of these smart devices.
□ These smart devices need not necessarily be computerized devices; they can also be otherwise non-computerized devices such as fans, fridges, air conditioners, etc.
□ These devices will be given the potential to create user-specific data that can be
optimized for better user experience and to increase human productivity.
? The goal of IoT is to form a network of internet-connected devices, which
can interact internally for better usage.
? Many developed countries have already started using IoT, and a common
example is the use of light sensors in public places.
? Whenever a vehicle or object passes along the road, the first street light lights up and triggers all the other internally connected lights on that road, creating a smarter, energy-saving model (a small simulation sketch follows at the end of this list).
? Around 35 billion devices were connected to the internet in 2020, and the number of connections is expected to go up to 50 billion by 2030.
? Thus, IoT is one of the key emerging web technologies of the coming decades.
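A very small Python sketch of the street-light example above (device names and behaviour are invented for illustration): when the sensor on the first light detects a vehicle, the internally connected lights switch on in sequence.

# Hypothetical chain of internally connected street lights.
class StreetLight:
    def __init__(self, name, next_light=None):
        self.name = name
        self.next_light = next_light
        self.on = False

    def vehicle_detected(self):
        """Sensor event: light up and forward the event along the chain."""
        self.on = True
        print(f"{self.name} switched on")
        if self.next_light:
            self.next_light.vehicle_detected()

# Build a road with three connected lights and simulate a passing vehicle.
light3 = StreetLight("light-3")
light2 = StreetLight("light-2", light3)
light1 = StreetLight("light-1", light2)
light1.vehicle_detected()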
❖ Progressive Web Apps ◻ The smartphones we use today are loaded with apps, and users are free to download or remove any app depending on their liking.
? But what if we did not have to download or remove an app at all to use its services?
? The idea behind progressive web apps is very much along these lines.
? Such apps would run on the smartphone's screen and enable us to use or try any app we like without actually downloading it.
□ It is a combination of web and app technology to give the user a much smoother
experience.
? The advantage of using progressive web apps is that users do not face the hassle of downloading and updating the app from time to time, thus saving data.
□ Also, the app companies would not need to release the app for every updated version.
□ This would also eliminate the complexity of creating responsive apps, as progressive web apps can be used on any device and give the same experience regardless of screen size.
? Further development of progressive web apps could also enable users to use them in offline mode, paving the way for those who are not connected to the internet.
□ The ease of use and availability will increase, benefiting the user and making life much simpler.
□ A very common example is the ‘Try Now’ feature in the Google Play store for specific apps; it uses more or less the same idea as progressive web apps to run an app without actually downloading it.
❖ Social Networking via Virtual Reality ◻ The rise of virtual reality in the last few years is due to its ability to bridge the gap between the real and the virtual.
? The same idea of virtual reality is now being considered for use with social networking.
? Social networking, i.e. interacting with people over long distances, forms the base, and virtual reality is layered on top of it.
□ Social networking sites are devising ways so that users are not confined to communicating online but also get access to the world of virtual reality.
□ Video calling and conferencing would no longer remain a flat visual experience but would become a complete 360-degree experience.
□ The user will be able to experience much more than plain communication and will be able to interact in a much better way.
? This idea of mixing social networking with virtual reality might be a challenging one, but the kind of user experience one could get would be amazing.
? The world’s largest social networking company, Facebook, started developing such a platform back in 2014 and successfully created a virtual environment where users could not just communicate but also feel their surroundings; however, the platform has not been opened to the public yet.
□ These web trends will arrive in the coming years, and the availability of these technologies will once again prove that the internet is not stagnant but is always improving to provide a better user experience.
□ Improving these technologies will make the internet take an even more essential place in our lives than the one it holds now.