Data Integration A Theoretical Perspective

The document discusses data integration from a theoretical perspective. It introduces the concepts of a global schema, source schemas, and the mapping between them. It also discusses issues like query processing, dealing with inconsistent data sources, and reasoning about queries in data integration systems.


Data Integration: A Theoretical Perspective

Conference Paper · January 2002


DOI: 10.1145/543613.543644 · Source: DBLP



Data Integration: A Theoretical Perspective

Maurizio Lenzerini
Dipartimento di Informatica e Sistemistica
Università di Roma “La Sapienza”
Via Salaria 113, I-00198 Roma, Italy
[email protected]

ABSTRACT

Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.

1. INTRODUCTION

Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data [60, 61, 89]. The problem of designing data integration systems is important in current real-world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This tutorial is focused on some of these theoretical issues, with special emphasis on the following topics.

The data integration systems we consider in this work are characterized by an architecture based on a global schema and a set of sources. The sources contain the real data, while the global schema provides a reconciled, integrated, and virtual view of the underlying sources. Modeling the relation between the sources and the global schema is therefore a crucial aspect. Two basic approaches have been proposed for this purpose. The first approach, called global-as-view, requires that the global schema be expressed in terms of the data sources. The second approach, called local-as-view, requires the global schema to be specified independently of the sources, and the relationships between the global schema and the sources are established by defining every source as a view over the global schema. Our goal is to discuss the characteristics of the two modeling mechanisms, and to mention other possible approaches.

Irrespective of the method used for the specification of the mapping between the global schema and the sources, one basic service provided by the data integration system is to answer queries posed in terms of the global schema. Given the architecture of the system, query processing in data integration requires a reformulation step: the query over the global schema has to be reformulated in terms of a set of queries over the sources. In this tutorial, such a reformulation problem will be analyzed for both the case of local-as-view, and the case of global-as-view mappings. A main theme will be the strong relationship between query processing in data integration and the problem of query answering with incomplete information.

Since sources are in general autonomous, in many real-world applications the problem arises of mutually inconsistent data sources. In practice, this problem is generally dealt with by means of suitable transformation and cleaning procedures applied to data retrieved from the sources. In this tutorial, we address this issue from a more theoretical perspective.

Finally, there are several tasks in the operation of a data integration system where the problem of reasoning on queries (e.g., checking whether two queries are equivalent) is relevant. Indeed, query containment is one of the basic problems in database theory, and we will discuss several notions generalizing this problem to a data integration setting.

The paper is organized as follows. Section 2 presents our formalization of a data integration system. In Section 3 we discuss the various approaches to modeling. Sections 4 and 5 present an overview of the methods for processing queries in the local-as-view and in the global-as-view approach, respectively. Section 6 discusses the problem of dealing with inconsistent sources. Section 7 provides an overview of the problem of reasoning on queries. Finally, Section 8 concludes the paper by mentioning some open problems, and several research issues related to data integration that are not addressed in the tutorial.

2. DATA INTEGRATION FRAMEWORK


In this section we set up a logical framework for data integration. We restrict our attention to data integration systems based on a so-called global schema (or, mediated schema). In other words, we refer to data integration systems whose aim is combining the data residing at different sources, and providing the user with a unified view of these data. Such a unified view is represented by the global schema, and provides a reconciled view of all data, which can be queried by the user. Obviously, one of the main tasks in the design of a data integration system is to establish the mapping between the sources and the global schema, and such a mapping should be suitably taken into account in formalizing a data integration system.

It follows that the main components of a data integration system are the global schema, the sources, and the mapping. Thus, we formalize a data integration system I in terms of a triple $\langle G, S, M \rangle$, where

• G is the global schema, expressed in a language $L_G$ over an alphabet $A_G$. The alphabet comprises a symbol for each element of G (i.e., relation if G is relational, class if G is object-oriented, etc.).

• S is the source schema, expressed in a language $L_S$ over an alphabet $A_S$. The alphabet $A_S$ includes a symbol for each element of the sources.

• M is the mapping between G and S, constituted by a set of assertions of the forms

$$ q_S \leadsto q_G, $$
$$ q_G \leadsto q_S $$

where $q_S$ and $q_G$ are two queries of the same arity, respectively over the source schema S, and over the global schema G. Queries $q_S$ are expressed in a query language $L_{M,S}$ over the alphabet $A_S$, and queries $q_G$ are expressed in a query language $L_{M,G}$ over the alphabet $A_G$. Intuitively, an assertion $q_S \leadsto q_G$ specifies that the concept represented by the query $q_S$ over the sources corresponds to the concept in the global schema represented by the query $q_G$ (similarly for an assertion of type $q_G \leadsto q_S$). We will discuss several ways to make this intuition precise in the following sections.

Intuitively, the source schema describes the structure of the sources, where the real data are, while the global schema provides a reconciled, integrated, and virtual view of the underlying sources. The assertions in the mapping establish the connection between the elements of the global schema and those of the source schema.

Queries to I are posed in terms of the global schema G, and are expressed in a query language $L_Q$ over the alphabet $A_G$. A query is intended to provide the specification of which data to extract from the virtual database represented by the integration system.

The above definition of data integration system is general enough to capture virtually all approaches in the literature. Obviously, the nature of a specific approach depends on the characteristics of the mapping, and on the expressive power of the various schema and query languages. For example, the language $L_G$ may be very simple (basically allowing the definition of a set of relations), or may allow for various forms of integrity constraints to be expressed over the symbols of $A_G$. Analogously, the type (e.g., relational, semistructured, etc.) and the expressive power of $L_S$ vary from one approach to another.

We now specify the semantics of a data integration system. In what follows, a database (DB) for a schema T is simply a collection of sets, one for each symbol in the alphabet of T (e.g., one relation for every relation schema of T, if T is relational, or one set of objects for each class of T, if T is object-oriented, etc.). We also make a simplifying assumption on the domain for the various sets. In particular, we assume that the structures constituting the databases involved in our framework (both the global database and the source databases) are defined over a fixed domain Γ.

In order to assign semantics to a data integration system I = $\langle G, S, M \rangle$, we start by considering a source database for I, i.e., a database D that conforms to the source schema S and satisfies all constraints in S. Based on D, we now specify what the information content of the global schema G is. We call global database for I any database for G. A global database B for I is said to be legal with respect to D, if:

• B is legal with respect to G, i.e., B satisfies all the constraints of G;

• B satisfies the mapping M with respect to D.

The notion of B satisfying the mapping M with respect to D depends on how to interpret the assertions in the mapping. We will see in the next section that several approaches are conceivable. Here, we simply note that, whatever the interpretation of the mapping, in general, several global databases exist that are legal for I with respect to D. This observation motivates the relationship between data integration and databases with incomplete information [91], which will be discussed in several ways later on in the paper.

Finally, we specify the semantics of queries posed to a data integration system. As we said before, such queries are expressed in terms of the symbols in the global schema of I. In general, if q is a query of arity n and DB is a database, we denote by $q^{DB}$ the set of tuples (of arity n) in DB that satisfy q.

Given a source database D for I, the answer $q^{I,D}$ to a query q in I with respect to D, is the set of tuples t of objects in Γ such that $t \in q^B$ for every global database B that is legal for I with respect to D. The set $q^{I,D}$ is called the set of certain answers to q in I with respect to D.

Note that, from the point of view of logic, finding certain answers is a logical implication problem: check whether it logically follows from the information on the sources that t satisfies the query. The dual problem is also of interest: finding the so-called possible answers to q, i.e., checking whether $t \in q^B$ for some global database B that is legal for I with respect to D. Finding possible answers is a consistency problem: check whether assuming that t is in the answer set of q does not contradict the information on the sources.
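As an illustration only, the definitions of certain and possible answers given in this section can be sketched in Python as an intersection and a union over the legal global databases. The sketch assumes that the legal global databases can be listed explicitly as a finite collection, which the framework does not guarantee in general; the function names, the dictionary encoding of a database, and the sample data are hypothetical.

```python
def certain_answers(query, legal_dbs):
    """Tuples that `query` returns in every legal global database."""
    results = [set(query(db)) for db in legal_dbs]
    if not results:
        # Assumption of this sketch: at least one legal database is supplied.
        raise ValueError("need at least one legal global database")
    return set.intersection(*results)


def possible_answers(query, legal_dbs):
    """Tuples that `query` returns in at least one legal global database."""
    results = [set(query(db)) for db in legal_dbs]
    return set().union(*results)


# Hypothetical data: two legal global databases over one unary relation `r`.
legal_dbs = [
    {"r": {(1,), (2,)}},
    {"r": {(1,), (3,)}},
]


def query(db):
    """The query that returns the whole extension of `r`."""
    return db["r"]


certain = certain_answers(query, legal_dbs)
possible = possible_answers(query, legal_dbs)
print(certain)           # {(1,)}
print(sorted(possible))  # [(1,), (2,), (3,)]
```

Only the tuple present in both databases is a certain answer, while every tuple occurring in some database is a possible answer.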
3. MODELING

One of the most important aspects in the design of a data integration system is the specification of the correspondence between the data at the sources and those in the global schema. Such a correspondence is modeled through the notion of mapping as introduced in the previous section. It is exactly this correspondence that will determine how the queries posed to the system are answered.

In this section we discuss mappings which can be expressed in terms of first order logic assertions. Mappings going beyond first order logic are briefly discussed in Section 6.

Two basic approaches for specifying the mapping in a data integration system have been proposed in the literature, called local-as-view (LAV), and global-as-view (GAV), respectively [89, 60]. We discuss these approaches separately. We then end the section with a comparison of the two kinds of mapping.

3.1 Local as view

In a data integration system I = $\langle G, S, M \rangle$ based on the LAV approach, the mapping M associates to each element s of the source schema S a query $q_G$ over G. In other words, the query language $L_{M,S}$ allows only expressions constituted by one symbol of the alphabet $A_S$. Therefore, a LAV mapping is a set of assertions, one for each element s of S, of the form

$$ s \leadsto q_G $$

From the modeling point of view, the LAV approach is based on the idea that the content of each source s should be characterized in terms of a view $q_G$ over the global schema. A notable case of this type is when the data integration system is based on an enterprise model, or an ontology [58]. This idea is effective whenever the data integration system is based on a global schema that is stable and well-established in the organization. Note that the LAV approach favors the extensibility of the system: adding a new source simply means enriching the mapping with a new assertion, without other changes.

To better characterize each source with respect to the global schema, several authors have proposed more sophisticated assertions in the LAV mapping, in particular with the goal of establishing the assumption holding for the various source extensions [1, 53, 65, 24]. Formally, this means that in the LAV mapping, a new specification, denoted as(s), is associated to each source element s. The specification as(s) determines how accurate the knowledge on the data satisfying the sources is, i.e., how accurate the source is with respect to the associated view $q_G$. Three possibilities have been considered¹:

• Sound views. When a source s is sound (denoted by as(s) = sound), its extension provides any subset of the tuples satisfying the corresponding view $q_G$. In other words, given a source database D, from the fact that a tuple is in $s^D$ one can conclude that it satisfies the associated view over the global schema, while from the fact that a tuple is not in $s^D$ one cannot conclude that it does not satisfy the corresponding view. Formally, when as(s) = sound, a database B satisfies the assertion $s \leadsto q_G$ with respect to D if

$$ s^D \subseteq q_G^B $$

Note that, from a logical point of view, a sound source s with arity n is modeled through the first order assertion

$$ \forall \mathbf{x}\ s(\mathbf{x}) \rightarrow q_G(\mathbf{x}) $$

where $\mathbf{x}$ denotes variables $x_1, \ldots, x_n$.

• Complete views. When a source s is complete (denoted by as(s) = complete), its extension provides any superset of the tuples satisfying the corresponding view. In other words, from the fact that a tuple is in $s^D$ one cannot conclude that such a tuple satisfies the corresponding view. On the other hand, from the fact that a tuple is not in $s^D$ one can conclude that such a tuple does not satisfy the view. Formally, when as(s) = complete, a database B satisfies the assertion $s \leadsto q_G$ with respect to D if

$$ s^D \supseteq q_G^B $$

From a logical point of view, a complete source s with arity n is modeled through the first order assertion

$$ \forall \mathbf{x}\ q_G(\mathbf{x}) \rightarrow s(\mathbf{x}) $$

• Exact views. When a source s is exact (denoted by as(s) = exact), its extension is exactly the set of tuples of objects satisfying the corresponding view. Formally, when as(s) = exact, a database B satisfies the assertion $s \leadsto q_G$ with respect to D if

$$ s^D = q_G^B $$

From a logical point of view, an exact source s with arity n is modeled through the first order assertion

$$ \forall \mathbf{x}\ s(\mathbf{x}) \leftrightarrow q_G(\mathbf{x}) $$

¹ In some papers, for example [24], different assumptions on the domain of the database (open vs. closed) are also taken into account.

Typically, in the literature, when the specification of as(s) is missing, source s is considered sound. This is also the assumption we make in this paper.

Information Manifold [62], and the system presented in [78] are examples of LAV systems. Information Manifold expresses the global schema in terms of a Description Logic [8], and adopts the language of conjunctive queries as query languages $L_Q$ and $L_{M,G}$. The system described in [78] uses an XML global schema, and adopts XML-based query languages for both user queries and queries in the mapping. More powerful schema languages for expressing the global schema are reported in [42, 59, 22, 21]. In particular, [42, 59] discuss the case where various forms of relational integrity constraints are expressible in the global schema, including functional and inclusion dependencies, whereas [22, 21] consider a setting where the global schema is expressed in terms of Description Logics [11], which allow for the specification of various types of constraints.
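The three assumptions on source extensions discussed in this section (sound, complete, exact) amount to set comparisons between the source extension and the extension of the associated view in a candidate global database. The following Python sketch illustrates this under the assumption that both extensions are available as finite sets of tuples; the function name `satisfies` and the sample data are hypothetical, and the default of "sound" mirrors the convention stated in the text for a missing as(s).

```python
def satisfies(source_ext, view_ext, assumption="sound"):
    """Check one LAV assertion for a candidate global database.

    source_ext: extension of the source s in the source database D
    view_ext:   extension of the view q_G in the global database B
    """
    s, v = set(source_ext), set(view_ext)
    if assumption == "sound":
        return s <= v   # every source tuple satisfies the view
    if assumption == "complete":
        return s >= v   # every tuple satisfying the view is in the source
    if assumption == "exact":
        return s == v   # both inclusions hold
    raise ValueError(f"unknown assumption: {assumption}")


# Hypothetical extensions: the source knows one of the two view tuples.
source_ext = {("a",)}
view_ext = {("a",), ("b",)}

checks = {a: satisfies(source_ext, view_ext, a)
          for a in ("sound", "complete", "exact")}
print(checks)  # {'sound': True, 'complete': False, 'exact': False}
```

The same comparisons apply to a GAV assertion if the roles are read as "extension of the query over the sources" and "extension of the global element".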
3.2 Global as view

In the GAV approach, the mapping M associates to each element g in G a query $q_S$ over S. In other words, the query language $L_{M,G}$ allows only expressions constituted by one symbol of the alphabet $A_G$. Therefore, a GAV mapping is a set of assertions, one for each element g of G, of the form

$$ g \leadsto q_S $$

From the modeling point of view, the GAV approach is based on the idea that the content of each element g of the global schema should be characterized in terms of a view $q_S$ over the sources. In some sense, the mapping explicitly tells the system how to retrieve the data when one wants to evaluate the various elements of the global schema. This idea is effective whenever the data integration system is based on a set of sources that is stable. Note that, in principle, the GAV approach favors the system in carrying out query processing, because it tells the system how to use the sources to retrieve data. However, extending the system with a new source is now a problem: the new source may indeed have an impact on the definition of various elements of the global schema, whose associated views need to be redefined.

To better characterize each element of the global schema with respect to the sources, more sophisticated assertions in the GAV mapping can be used, in the same spirit as we saw for LAV. Formally, this means that in the GAV mapping, a new specification, denoted as(g) (either sound, complete, or exact) is associated to each element g of the global schema. When as(g) = sound (resp., complete, exact), a database B satisfies the assertion $g \leadsto q_S$ with respect to a source database D if

$$ q_S^D \subseteq g^B \quad (\text{resp., } q_S^D \supseteq g^B,\ q_S^D = g^B) $$

The logical characterization of sound views and complete views in GAV is therefore through the first order assertions

$$ \forall \mathbf{x}\ q_S(\mathbf{x}) \rightarrow g(\mathbf{x}), \qquad \forall \mathbf{x}\ g(\mathbf{x}) \rightarrow q_S(\mathbf{x}) $$

respectively.

It is interesting to observe that the implicit assumption in many GAV proposals is the one of exact views. Indeed, in a setting where all the views are exact, there are no constraints in the global schema, and a first order query language is used as $L_{M,S}$, a GAV data integration system enjoys what we can call the “single database property”, i.e., it is characterized by a single database, namely the global database that is obtained by associating to each element the set of tuples computed by the corresponding view over the sources. This motivates the widely shared intuition that query processing in GAV is easier than in LAV. However, it should be pointed out that the single database property only holds in such a restricted setting.

In particular, the possibility of specifying constraints in G greatly enhances the modeling power of GAV systems, especially in those situations where the global schema is intended to be expressed in terms of a conceptual data model, or in terms of an ontology [16]. In these cases, the language $L_G$ is in fact sufficiently powerful to allow for specifying, either implicitly or explicitly, various forms of integrity constraints on the global database.

Most current data integration systems follow the GAV approach. Notable examples are TSIMMIS [51], Garlic [30], COIN [52], MOMIS [10], Squirrel [92], and IBIS [17]. Analogously to the case of LAV systems, these systems usually adopt simple languages for expressing both the global and the source schemas. IBIS is the only system we are aware of that takes into account integrity constraints in the global schema.

3.3 Comparison between GAV and LAV

The LAV and the GAV approaches are compared in [89] from the point of view of query processing. Generally speaking, it is well known that processing queries in the LAV approach is a difficult task. Indeed, in this approach the only knowledge we have about the data in the global schema is through the views representing the sources, and such views provide only partial information about the data. Since the mapping associates to each source a view over the global schema, it is not immediate to infer how to use the sources in order to answer queries expressed over the global schema. On the other hand, query processing looks easier in the GAV approach, where we can take advantage of the fact that the mapping directly specifies which source queries correspond to the elements of the global schema. Indeed, in most GAV systems, query answering is based on a simple unfolding strategy.

From the point of view of modeling the data integration system, the GAV approach provides a specification mechanism that has a more procedural flavor with respect to the LAV approach. Indeed, while in LAV the designer may concentrate on declaratively specifying the content of the source in terms of the global schema, in GAV, one is forced to specify how to get the data of the global schema by means of queries over the sources. A thorough analysis of the differences/similarities of the two approaches from the point of view of modeling is still missing. A first attempt is reported in [19, 18], where the authors address the problem of checking whether a LAV system can be transformed into a GAV one, and vice-versa. They deal with transformations that are equivalent with respect to query answering, i.e., that enjoy the property that queries posed to the original system have the same answers when posed to the target system. Results on query reducibility from LAV to GAV systems may be useful, for example, to derive a procedural specification from a declarative one. Conversely, results on query reducibility from GAV to LAV may be useful to derive a declarative characterization of the content of the sources starting from a procedural specification. We briefly discuss the notions of query-preserving transformation, and of query-reducibility between classes of data integration systems.

Given two integration systems I = $\langle G, S, M \rangle$ and I′ = $\langle G', S, M' \rangle$ over the same source schema S and such that all elements of G are also elements of G′, I′ is said to be query-preserving with respect to I, if for every query q to I and for every source database D, we have that

$$ q^{I,D} = q^{I',D} $$

In other words, I′ is query-preserving with respect to I if, for each query over the global schema of I and each source database, the certain answers to the query with respect to the source database that we get from the two integration
systems are identical. A class $C_1$ of integration systems is query-reducible to a class $C_2$ of integration systems if there exists a function $f : C_1 \rightarrow C_2$ such that, for each $I_1 \in C_1$ we have that $f(I_1)$ is query-preserving with respect to $I_1$.

With the two notions in place, the question of query reducibility between LAV and GAV is studied in [18] within a setting where views are considered sound, the global schema is expressed in the relational model, and the queries used in the integration systems (both the queries on the global schema, and the queries in the mapping) are expressed in the language of conjunctive queries. The results show that in such a setting neither of the two transformations is possible. On the contrary, if one extends the framework, allowing for integrity constraints in the global schema, then reducibility holds in both directions. In particular, inclusion dependencies and a simple form of equality-generating dependencies suffice for a query-preserving transformation from a LAV system into a GAV one, whereas single head full dependencies are sufficient for the other direction. Both transformations result in a query-preserving system whose size is linearly related to the size of the original one.

Although in this paper we mainly refer to the LAV and GAV approaches to data integration, it is worth noticing that more general types of mappings have also been discussed in the literature. For example, [49] introduces the so-called GLAV approach. In GLAV, the relationships between the global schema and the sources are established by making use of both LAV and GAV assertions. More precisely, in a GLAV mapping as introduced in [49], every assertion has the form $q_S \leadsto q_G$, where $q_S$ is a conjunctive query over the source schema, and $q_G$ is a conjunctive query over the global schema. A database B satisfies the assertion $q_S \leadsto q_G$ with respect to a source database D if $q_S^D \subseteq q_G^B$. Thus, the GLAV approach models a situation where sources are sound. Interestingly, the technique presented in [19, 18] can be extended for transforming any GLAV system into a GAV one. The key idea is that a GLAV assertion can be transformed into a GAV assertion plus an inclusion dependency. Indeed, for each assertion

$$ q_S \leadsto q_G $$

in the GLAV system (where the arity of both queries is n), we introduce a new relation symbol r of arity n in the global schema of the resulting GAV system, and we associate to r the sound view $q_S$ by means of

$$ r \leadsto q_S $$

plus the inclusion dependency

$$ r \subseteq q_G. $$

Now, it is immediate to verify that the above inclusion dependency can be treated exactly with the same technique introduced in the LAV to GAV transformation, and therefore, from the GLAV system we can obtain a query-preserving GAV system whose size is linearly related to the size of the original system.

4. QUERY PROCESSING IN LAV

In this section we discuss query processing in the LAV approach. From the definition given in Section 3, it is easy to see that answering queries in LAV systems is essentially an extended form of reasoning in the presence of incomplete information [91]. Indeed, when we answer a query over the global schema on the basis of a LAV mapping, we know only the extensions of the views associated to the sources, and this provides us with only partial information on the global database. As we already observed, in general, there are several possible global databases that are legal for the data integration system with respect to a given source database. This observation holds even for a setting where only sound views are allowed in the mapping. The problem is even more complicated when sources can be modeled as complete or exact views. In particular, dealing with exact sources essentially means applying the closed world assumption on the corresponding views [1, 85].

The following example rephrases an example given in [1]. Consider a data integration system I with global relational schema G containing (among other relations) a binary relation couple, and two constants Ann and Bill. Consider also two sources female and male, respectively with associated views

$$ \mathit{female}(f) \leadsto \{ f, m \mid \mathit{couple}(f, m) \} $$
$$ \mathit{male}(m) \leadsto \{ f, m \mid \mathit{couple}(f, m) \} $$

and consider a source database D with $\mathit{female}^D = \{\mathrm{Ann}\}$ and $\mathit{male}^D = \{\mathrm{Bill}\}$, and assume that there are no constraints imposed by a schema. If both sources are sound, we only know that some couple has Ann as its female component and Bill as its male component. Therefore, the query

$$ Q = \{ x, y \mid \mathit{couple}(x, y) \} $$

asking for all couples would return an empty answer, i.e., $Q_c^{I,D} = \emptyset$. However, if both sources are exact, we can conclude that all couples have Ann as their female component and Bill as their male component, and hence that (Ann, Bill) is the only couple, i.e., $Q_c^{I,D} = \{(\mathrm{Ann}, \mathrm{Bill})\}$.

Since in LAV, sources are modeled as views over the global schema, the problem of processing a query is traditionally called view-based query processing. Generally speaking, the problem is to compute the answer to a query based on a set of views, rather than on the raw data in the database [89, 60].

There are two approaches to view-based query processing, called view-based query rewriting and view-based query answering, respectively. In the former approach, we are given a query q and a set of view definitions, and the goal is to reformulate the query into an expression of a fixed language $L_R$ that refers only to the views and provides the answer to q. The crucial point is that the language in which we want the rewriting is fixed, and in general coincides with the language used for expressing the original query. In a LAV data integration setting, query rewriting aims at reformulating, in a way that is independent of the current source database, the original query in terms of a query to the sources. Obviously, it may happen that no rewriting in the target language $L_R$ exists that is equivalent to the original query. In this case, we are interested in computing a so-called maximally contained rewriting, i.e., an expression that captures the original query in the best way.
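The couple example earlier in this section can be checked by brute force. The Python sketch below is an illustration under explicit assumptions, not a query answering algorithm. It assumes a two-element domain {Ann, Bill}, so that all candidate extensions of couple can be enumerated, and it reads the view for female (resp., male) as the projection of couple on its first (resp., second) component, as the prose of the example suggests. All function names are hypothetical.

```python
from itertools import combinations, product

# Assumption: a tiny fixed domain, so every global database can be listed.
DOMAIN = ["Ann", "Bill"]


def all_couple_relations(domain):
    """Yield every possible extension of the binary relation `couple`."""
    pairs = list(product(domain, repeat=2))
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            yield frozenset(subset)


def is_legal(couple, female_ext, male_ext, assumption):
    """Check both LAV assertions under one assumption (sound or exact)."""
    females = {f for f, _ in couple}  # view associated with source `female`
    males = {m for _, m in couple}    # view associated with source `male`
    if assumption == "sound":
        return female_ext <= females and male_ext <= males
    if assumption == "exact":
        return female_ext == females and male_ext == males
    raise ValueError(f"unsupported assumption: {assumption}")


def certain_couples(female_ext, male_ext, assumption, domain=DOMAIN):
    """Certain answers to Q = couple: intersect over all legal databases."""
    legal = [c for c in all_couple_relations(domain)
             if is_legal(c, female_ext, male_ext, assumption)]
    return set(frozenset.intersection(*legal)) if legal else set()


sound = certain_couples({"Ann"}, {"Bill"}, "sound")
exact = certain_couples({"Ann"}, {"Bill"}, "exact")
print(sound)  # set()
print(exact)  # {('Ann', 'Bill')}
```

Under the sound assumption both {(Ann, Bill)} and {(Ann, Ann), (Bill, Bill)} are legal, so no pair belongs to every legal database; under the exact assumption {(Ann, Bill)} is the only legal extension.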
Sound     CQ      CQ≠     PQ      Datalog   FOL
CQ        PTIME   coNP    PTIME   PTIME     undec.
CQ≠       PTIME   coNP    PTIME   PTIME     undec.
PQ        coNP    coNP    coNP    coNP      undec.
Datalog   coNP    undec.  coNP    undec.    undec.
FOL       undec.  undec.  undec.  undec.    undec.

Exact     CQ      CQ≠     PQ      Datalog   FOL
CQ        coNP    coNP    coNP    coNP      undec.
CQ≠       coNP    coNP    coNP    coNP      undec.
PQ        coNP    coNP    coNP    coNP      undec.
Datalog   undec.  undec.  undec.  undec.    undec.
FOL       undec.  undec.  undec.  undec.    undec.

Table 1: Complexity of view-based query answering (rows: language of the views; columns: language of the query)

In view-based query answering, besides the query q and the view definitions, we are also given the extensions of the views. The goal is to compute the set of tuples t such that the knowledge on the view extensions logically implies that t is an answer to q, i.e., t is in the answer to q in all the databases that are consistent with the views. It is easy to see that, in a LAV data integration framework, this is exactly the problem of computing the certain answers to q with respect to a source database.

Notice the difference between the two approaches. In query rewriting, query processing is divided into two steps, where the first one re-expresses the query in terms of a given query language over the alphabet of the view names, and the second one evaluates the rewriting over the view extensions. In query answering, we do not pose any limitations on how queries are processed, and the only goal is to exploit all possible information, in particular the view extensions, to compute the answer to the query.

A large number of results have been reported for both approaches. We first focus on view-based query answering.

Query answering has been extensively investigated in recent years [1, 53, 43, 66, 4, 21]. A comprehensive framework for view-based query answering, as well as several interesting results, is presented in [53]. The framework considers various assumptions for interpreting the view extensions with respect to the corresponding definitions (closed, open, and exact view assumptions). In [1], an analysis of the complexity of the problem under the different assumptions is carried out for the case where the views and the queries are expressed in terms of various languages (conjunctive queries without and with inequalities, positive queries, Datalog, and first-order queries). The complexity is measured with respect to the size of the view extensions (data complexity). Table 1 summarizes the results presented in [1]. Note that, for the query languages considered in that paper, the exact view assumption complicates the problem. For example, the data complexity of query answering for the case of conjunctive queries is PTIME under the sound view assumption, and coNP-complete for exact views. This can be explained by noticing that the exact view assumption introduces a form of negation, and therefore it may force one to reason by cases on the objects stored in the views.

In [24], the problem is studied for a setting where the global schema models a semistructured database, i.e., a labeled directed graph. It follows that both the user queries, and the queries used in the LAV mapping should be expressed in a query language for semistructured data. The main difficulty arising in this context is that languages for querying semistructured data enable expressing regular-path queries [2, 15, 45]. A regular-path query asks for all pairs of nodes in the database connected by a path conforming to a regular expression, and therefore may contain a restricted form of recursion. Note that, when the query contains unrestricted recursion, both view-based query rewriting and view-based query answering become undecidable, even when the views are not recursive [43].

Table 2 summarizes the results presented in [24]. Both data complexity and expression complexity (complexity with respect to the size of the query and the view definitions) are taken into account. All upper bound results have been obtained by automata-theoretic techniques. In the analysis, a further distinction is proposed for characterizing the domain of the database (open vs. closed domain assumption). In the closed domain assumption we assume that the global database contains only objects stored in the sources. The results show that none of the cases can be solved in polynomial time (unless P = NP). This can be explained by observing that the need for considering various forms of incompleteness expressible in the query language (due to union and transitive closure) is a source of complexity for query answering. Obviously, under closed domain, our knowledge is more accurate than in the case of the open domain assumption, and this rules out the need for some combinatorial reasoning. This provides the intuition of why under closed domain the problem is “only” coNP-complete in all cases, for data, expression, and combined complexity. On the other hand, under open domain, we cannot exclude the possibility that the database contains more objects than those known to be in the views. For combined complexity, this means that we are forced to reason about the definition of the query and the views. Indeed, the problem cannot be less complex than comparing two regular path queries, and this explains the PSPACE lower bound. Interestingly, the table shows that the problem does not exceed the PSPACE complexity. Moreover, the data complexity remains in coNP, and therefore, although we are using a query language that is powerful enough to express a (limited) form of recursion, the problem is no more complex than in the case of disjunctions of conjunctive queries [1].

While regular-path queries represent the core of any query language for semistructured data, their expressive power is limited. Several authors point out that extensions are required for making them useful in real settings (see for example [14, 15, 80]). Indeed, the results in [24] have been extended to query languages with the inverse operator [26], and to the class of unions of conjunctive regular-path queries in [28].

Turning our attention to view-based query rewriting, several recent papers investigate the rewriting question for different classes of queries. The problem is investigated for the case of conjunctive queries (with or without arithmetic comparisons) in [66, 84], for disjunctive views in [4], for queries with aggregates in [87, 37, 56], for recursive queries and nonrecursive views in [43], for queries expressed in Description Logics in [9], for regular-path queries and their extensions
in [23, 26, 27], and in the presence of integrity constraints in [59, 44]. Rewriting techniques for query optimization are described, for example, in [34, 3, 88], and in [46, 80, 82] for the case of path queries in semistructured data.

domain   views       data    expression   combined
closed   all sound   coNP    coNP         coNP
closed   all exact   coNP    coNP         coNP
closed   arbitrary   coNP    coNP         coNP
open     all sound   coNP    PSPACE       PSPACE
open     all exact   coNP    PSPACE       PSPACE
open     arbitrary   coNP    PSPACE       PSPACE

Table 2: Complexity of view-based query answering for regular-path queries

We already noted that view-based query rewriting and view-based query answering are different problems. Unfortunately, their similarity sometimes gives rise to a sort of confusion between the two notions. Part of the problem comes from the fact that when the query and the views are conjunctive queries, the best possible rewriting is expressible as a union of conjunctive queries, which is basically the same language as the one of the original query and views. However, for other query languages this is not the case. Abstracting from the language used to express the rewriting, we can define a rewriting of a query with respect to a set of views as a function that, given the extensions of the views, returns a set of tuples that is contained in the answer set of the query in every database consistent with the views. We call the rewriting that returns exactly the set of all such tuples the perfect rewriting of the query with respect to the views. Observe that, by evaluating the perfect rewriting over given view extensions, one obtains the same set of tuples provided by view-based query answering, i.e., in data integration terminology, the set of certain answers to the query with respect to the view extension. Hence, the perfect rewriting is the best rewriting one can obtain, given the available information on both the definitions and the extensions of the views.

An immediate consequence of the relationship between perfect rewriting and query answering is that the data complexity of evaluating the perfect rewriting over the view extensions is the same as the data complexity of answering queries using views. Typically, one is interested in queries that can be evaluated in PTIME (i.e., are PTIME functions in data complexity), and hence we would like rewritings to be PTIME as well. For queries and views that are conjunctive queries (without union), the perfect rewriting is a union of conjunctive queries and hence is PTIME [1]. However, already for very simple query languages containing union the perfect rewriting is not PTIME in general. Hence, for such languages it would be interesting to characterize which instances of query rewriting admit a perfect rewriting that is PTIME. By establishing a tight connection between view-based query answering and constraint-satisfaction problems, it is argued in [27] that this is a difficult task.

5. QUERY PROCESSING IN GAV

Most GAV data integration systems do not allow integrity constraints in the global schema, and assume that views are exact. It is easy to see that, under these assumptions, query processing can be based on a simple unfolding strategy. When we have a query q over the alphabet $A_G$ of the global schema, every element of $A_G$ is substituted with the corresponding query over the sources, and the resulting query is then evaluated at the sources. As we said before, such a strategy suffices mainly because the data integration system enjoys the single database property. Notably, the same strategy applies also in the case of sound views.

However, when the language $L_G$ used for expressing the global schema allows for integrity constraints, and the views are sound, then query processing in GAV systems becomes more complex. Indeed, in this case, integrity constraints can in principle be used in order to overcome incompleteness of data at the sources. The following example shows that, by taking into account foreign key constraints, one can obtain answers that would be missed by simply unfolding the user query.

Let I = ⟨G, S, M⟩ be a data integration system, where G is constituted by the relations

    employee(Ecode, Ename, Ecity)
    company(Ccode, Cname)
    employed(Ecode, Ccode)

and the constraints

    key(employee) = {Ecode}
    key(company) = {Ccode}
    employed[Ecode] ⊆ employee[Ecode]
    employed[Ccode] ⊆ company[Ccode]

The source schema S consists of three sources. Source $s_1$, of arity 4, contains information about employees with their code, name, city, and date of birth. Source $s_2$, of arity 2, contains codes and names of companies. Finally, source $s_3$, of arity 2, contains information about employment in companies. The mapping M is defined by

    employee ⇝ { x, y, z | s1(x, y, z, w) }
    company  ⇝ { x, y | s2(x, y) }
    employed ⇝ { x, w | s3(x, w) }

Now consider the following user query q, asking for codes of employees:

    { x | employee(x, y, z) }

    $s_1^D$:   12   calvin   rome        21
               15   alice    hong kong   24
    $s_2^D$:   AF   hotdog corp.
               BN   banana ltd.
    $s_3^D$:   12   AF
               16   BN

Figure 1: Extension of sources for the example

Suppose that the data stored in the source database D are those depicted in Figure 1: by simply unfolding q we obtain the answer {12, 15}, i.e., the codes stored in $s_1^D$. However, due to the integrity constraint employed[Ecode] ⊆ employee[Ecode], we know that 16 is the code of a person, even if it does not appear in $s_1^D$. The correct answer to q is therefore {12, 15, 16}. Observe that we
do not know any value for the attributes of the employee whose Ecode is 16.

Given a source database D, let us call “retrieved global database” the global database that is obtained by populating each relation r in the global schema according to the mapping, i.e., by populating r with the tuples obtained by evaluating the query that the mapping associates to r. In general, integrity constraints may be violated in the retrieved global database (e.g., the retrieved global database for the above example). Regarding key constraints, let us assume that the query that the mapping associates to each global schema relation r is such that the data retrieved for r do not violate the key constraint of r. In other words, the management of key constraints is left to the designer (see next section for a discussion on this subject). On the other hand, the management of foreign key constraints cannot be left to the designer, since it is strongly related to the incompleteness of the sources. Moreover, since foreign keys are interrelation constraints, they cannot be dealt with in the GAV mapping, which, by definition, works on each global relation in isolation.

The assumption of sound views asserts that the tuples retrieved for a relation r are a subset of the tuples that the system assigns to r; therefore, we may think of completing the retrieved global database by suitably adding tuples in order to satisfy foreign key constraints, while still conforming to the mapping. When a foreign key constraint is violated, there are several ways of adding tuples to the retrieved global database to satisfy such a constraint. In other words, in the presence of foreign key constraints in the global schema, the semantics of a data integration system must be formulated in terms of a set of databases, instead of a single one. Since we are interested in the certain answers $q^{I,D}$ to a query q, i.e., the tuples that satisfy q in all global databases that are legal for I with respect to D, the existence of several such databases complicates the task of query answering.

In [17], a system called IBIS is presented, which takes into account key and foreign key constraints over the global relational schema. The system uses the foreign key constraints in order to retrieve data that could not be obtained in traditional data integration systems. The language for expressing both the user query and the queries in the mapping is the one of unions of conjunctive queries. To process a query q, IBIS expands q by taking into account the foreign key constraints on the global relations appearing in the atoms. Such an expansion is performed by viewing each foreign key constraint $r_1[X] \subseteq r_2[Y]$, where X and Y are sets of h attributes and Y is a key for $r_2$, as a logic programming [77] rule

    $r_2'(\vec{X}, f_{h+1}(\vec{X}), \ldots, f_n(\vec{X})) \leftarrow r_1'(\vec{X}, X_{h+1}, \ldots, X_m)$

where each $f_i$ is a Skolem function, $\vec{X}$ is a vector of h variables, and we have assumed for simplicity that the attributes involved in the foreign key are the first h ones. Each $r_i'$ is a predicate, corresponding to the global relation $r_i$, defined by the above rules for foreign key constraints, together with the rule

    $r_i'(X_1, \ldots, X_n) \leftarrow r_i(X_1, \ldots, X_n)$

Once such a logic program $\Pi_G$ has been defined, it can be used to generate the expanded query $expand_q$ associated to the original query q. This is done by performing a partial evaluation [40] with respect to $\Pi_G$ of the body of q', which is the query obtained by substituting in q each predicate $r_i$ with $r_i'$. In the partial evaluation tree, a node is not expanded anymore either when no atom in the node unifies with a head of a rule, or when the node is subsumed by (i.e., is more specific than) one of its predecessors. In the latter case, the node gets an empty node as a child; intuitively this is because such a node cannot provide any answer that is not already provided by its more general predecessor. These conditions guarantee that the construction of the partial evaluation tree for a query always terminates. Then, the expansion $expand_q$ of q is a union of conjunctive queries whose body is constituted by the disjunction of all nonempty leaves of the partial evaluation tree. It is possible to show that, by unfolding $expand_q$ according to the mapping, and evaluating the resulting query over the sources, one obtains exactly the set of certain answers of q to I with respect to D [17].

6. INCONSISTENCIES BETWEEN SOURCES

The formalization of data integration presented in the previous sections is based on a first order logic interpretation of the assertions in the mapping, and, therefore, is not able to cope with inconsistencies between sources. Indeed, if in a data integration system I = ⟨G, S, M⟩, the data retrieved from the sources do not satisfy the integrity constraints of G, then no global database exists for I, and query answering becomes meaningless. This is the situation occurring when data in the sources are mutually inconsistent. In practice, this situation is generally dealt with by means of suitable transformation and cleaning procedures to be applied to data retrieved by the sources (see [12, 50]). In this section, we address the problem from a more theoretical perspective.

Several recent papers aim at formally dealing with inconsistencies in databases, in particular for providing informative answers even in the case of a database that does not satisfy its integrity constraints (see, for example, [13, 6, 7, 54]). Although interesting, such results are not specifically tailored to the case of different consistent data sources that are mutually inconsistent, which is the case of interest in data integration. This case is addressed in [76], where the authors propose an operator for merging databases under constraints. Such an operator allows one to obtain the maximal amount of information from each database by means of a majority criterion used in case of conflict. However, the approach described in [76] also does not take explicitly into account the notion of mapping as introduced in our data integration setting.

In data integration, according to the definition of mapping satisfaction as given in Section 3, it may be the case that the data retrieved from the sources cannot be reconciled in the global schema in such a way that both the constraints of the global schema, and the mapping are satisfied. For example, this happens when a key constraint specified for the relation r in the global schema is violated by the tuples retrieved by the view associated to r, since the assumption of sound views does not allow us to disregard tuples from
r with duplicate keys. If we do not want to conclude in this case that no global database exists that is legal for I with respect to D, we need a different characterization of the mapping. In particular, we need a characterization that allows us to support query processing even when the data at the sources are incoherent with respect to the integrity constraints on the global schema.

A possible solution is to characterize the data integration system I = ⟨G, S, M⟩ (with M = {$r_1$ ⇝ $V_1$, . . . , $r_n$ ⇝ $V_n$}), in terms of those global databases that

1. satisfy the integrity constraints of G, and

2. approximate at best the satisfaction of the assertions in the mapping M, i.e., that are as sound as possible.

In other words, the integrity constraints of G are considered strong, whereas the mapping is considered soft. Given a source database D for I, we can now define an ordering between the global databases for I as follows. If $B_1$ and $B_2$ are two databases that are legal with respect to G, we say that $B_1$ is better than $B_2$ with respect to D, denoted as $B_1 \gg_D B_2$, if there exists an assertion $r_i$ ⇝ $V_i$ in M such that

- $(r_i^{B_1} \cap V_i^D) \supset (r_i^{B_2} \cap V_i^D)$, and

- $(r_j^{B_1} \cap V_j^D) \supseteq (r_j^{B_2} \cap V_j^D)$, for all $r_j$ ⇝ $V_j$ in M with $j \neq i$;

Intuitively, this means that there is at least one assertion for which $B_1$ satisfies the sound mapping better than $B_2$, while for no other assertion $B_2$ is better than $B_1$. In other words, $B_1$ approximates the sound mapping better than $B_2$.

It is easy to verify that the relation $\gg_D$ is a partial order. With this notion in place, we can now define the notion of B satisfying the mapping M with respect to D in our setting: a database B that is legal with respect to G satisfies the mapping M with respect to D if B is maximal with respect to $\gg_D$, i.e., for no other global database B' that is legal with respect to G, we have that $B' \gg_D B$.

The notion of legal database for I with respect to D, and the notion of certain answer remain the same, given the new definition of satisfaction of mapping. It is immediate to verify that, if there exists a legal database for I with respect to D under the first order logic interpretation of the mapping, then the new semantics and the old one coincide, in the sense that, for each query q, the set $q^{I,D}$ of certain answers computed under the first order semantics coincides with the set of certain answers computed under the new semantics presented here.

The problem of inconsistent sources in data integration is addressed in [64], in particular for the case where:

• the global schema is a relational schema with key and foreign key constraints,

• the mapping is of type GAV,

• the query language $L_{M,S}$ is the language of unions of conjunctive queries,

• the views in the mapping are intended to be sound.

In such a setting, an algorithm is proposed for computing the certain answers of a query in the new semantical framework presented above. The algorithm checks whether a given tuple t is a certain answer to a query q with respect to a given source database D in coNP data complexity (i.e., with respect to the size of D). Based on this result, the problem of computing the certain answers in the presented framework can be shown to be coNP-complete in data complexity.

7. REASONING ON QUERIES

Recent work addresses the problem of reasoning on queries in data integration systems. The basic form of reasoning on queries is checking containment, i.e., verifying whether one query returns a subset of the result computed by the other query in all databases. Most of the results on query containment concern conjunctive queries and their extensions. In [33], NP-completeness has been established for conjunctive queries, in [63, 90], $\Pi^p_2$-completeness of containment of conjunctive queries with inequalities is proved, and in [86] the case of queries with the union and difference operators is studied. For various classes of Datalog queries with inequalities, decidability and undecidability results are presented in [35] and [90], respectively. Other papers consider the case of query containment in the presence of various types of constraints [5, 39, 32, 69, 71, 70, 20], and for regular-path queries and their extensions [47, 25, 28, 41].

Besides the usual notion of containment, several other notions have been introduced related to the idea of comparing queries in a data integration setting, especially in the context of the LAV approach.

In [79], a query is said to be contained in another query relative to a set of sources modeled as views, if, for each extension of the views, the certain answers to the former query are a subset of the certain answers to the latter. Note that this reasoning problem is different from the usual containment checking: here we are comparing the two queries with respect to the certain answers computable on the basis of the views available. The difference becomes evident if one considers a counterexample to relative containment: $Q_1$ is not contained in $Q_2$ relative to views V if there is a tuple t and an extension E of V, such that for each database DB consistent with E (i.e., a database DB such that the result $V^{DB}$ of evaluating the views over DB is exactly E), t is an answer of $Q_1$ to DB, but there is a database DB' consistent with E such that t is not an answer of $Q_2$ to DB'. In other words, $Q_1$ is not contained in $Q_2$ relative to views V if there are two databases DB and DB' such that $V^{DB} = V^{DB'}$ and $Q_1^{DB} = Q_2^{DB'}$.

In [79], it is shown that the problem of checking relative containment is $\Pi^p_2$-complete in the case of conjunctive queries and views. In [74], such results are extended to the case where views have limited access patterns.

In [72], the authors introduce the notion of “p-containment”
(where “p” stands for power): a view set V is said to be p-contained in another view set W, i.e., W has at least the answering power of V, if W can answer all queries that can be answered using V.

The notion of “information content” of materialized views is studied in [57] for a restricted class of aggregate queries, with the goal of devising techniques for checking whether a set of views is sufficient for completely answering a given query based on the views.

One of the ideas underlying the above mentioned papers is the one of losslessness: a set of views is lossless with respect to a query, if, no matter what the database is, we can answer the query by solely relying on the content of the views. This question is relevant for example in mobile computing, where we may be interested in checking whether a set of cached data allows us to derive the requested information without accessing the network, or in data warehouse design, in particular for the view selection problem [36], where we have to measure the quality of the choice of the views to materialize in the data warehouse. In data integration, losslessness may help in the design of the data integration system, in particular, by selecting a minimal subset of sources to access without losing query-answering power.

The definition of losslessness relies on that of certain answers: a set of views is lossless with respect to a query, if for every database, we can answer the query over that database by computing the certain answers based on the view extensions. It follows that there are at least two versions of losslessness, namely, losslessness under the sound view assumption, and losslessness under the exact view assumption.

The first version is obviously weaker than the second one. If views V are lossless with respect to a query Q under the sound view assumption, then we know that, from the intensional point of view, V contain enough information to completely answer Q, even though the possible incompleteness of the view extensions may prevent us from obtaining all the answers that Q would get from the database. On the other hand, if V are lossless with respect to a query Q under the exact view assumption, then we know that they contain enough information to completely answer Q, both from the intensional and from the extensional point of view.

In [29], the problem of losslessness is addressed in a context where both the query and the views are expressed as regular path queries. It is shown that, in the case of the sound view assumption, the problem is solvable by a technique that is based on searching for a counterexample to losslessness, i.e., two databases that are both coherent with the view extensions, and that differ in the answers to the query. Differently from traditional query containment, the search for a counterexample is complicated by the presence of a quantification over all possible view extensions. The key observation in [29] is that, under the sound view assumption, we can restrict our attention to counterexamples that are linear databases, and this allows devising a method that uses, via automata-theoretic techniques, the known connection between view-based query answering and constraint satisfaction [27]. As far as the computational complexity is concerned, the problem is PSPACE-complete with respect to the view definitions, and EXPSPACE-complete with respect to the query.

It is interesting to observe that, for the case of exact views, the search for a counterexample cannot be restricted to linear databases. Actually, the question of losslessness under the exact view assumption is largely unexplored. To the best of our knowledge, the problem is open even for a setting where both the query and the views are conjunctive queries.

8. CONCLUSIONS

The aim of this tutorial was to provide an overview of some of the theoretical issues underlying data integration. Several interesting problems remain open in each of the topics that we have discussed. For example, more investigation is needed for a deep understanding of the relationship between the LAV and the GAV approaches. Open problems remain on algorithms and complexity for view-based query processing, in particular for the case of rich languages for semistructured data, for the case of exact views, and for the case of integrity constraints in the global schema. Query processing in GAV with constraints has been investigated only recently, and interesting classes of constraints have not been considered yet. The treatment of mutually inconsistent sources, and the issue of reasoning on queries present many open research questions.

Moreover, data integration is such a rich field that several important related aspects not addressed here can be identified, including the following.

• How to build an appropriate global schema, and how to discover inter-schema [31] and mapping assertions (LAV or GAV) in the design of a data integration system (see, for instance, [83]).

• How to (automatically) synthesize wrappers that present the data at the sources in a form [] that is suitable for their use in the mapping.

• How to deal with possible limitations in accessing the sources, both in LAV [84, 67, 68] and in GAV [75, 48, 73, 74].

• How to incorporate the notions of quality (data quality, quality of answers, etc.) [81], and data cleaning [12] into a formal framework for data integration.

• How to learn rules that allow for automatically mapping data items in different sources (for example, for inferring that two key values in different sources actually refer to the same real-world object [38]).

• How to go beyond the architecture based on a global schema, so as, for instance, to model data exchange, transformation, and cooperation rather than data integration (see, e.g., [55]), or to devise information integration facilities for the Semantic Web.

• How to optimize the evaluation of queries posed to a data integration system [3].

We believe that each of the above issues is characterized by interesting research problems still to investigate.
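
As a concrete companion to the foreign key example of Section 5, the following Python sketch replays it on the source extensions of Figure 1. It is only an illustration under stated assumptions: it materializes placeholder tuples in the retrieved global database instead of expanding the query by partial evaluation as IBIS does, it handles only the constraint employed[Ecode] ⊆ employee[Ecode], and all function names as well as the encoding of Skolem terms as Python tuples are hypothetical. The second query, which also asks for names, is not in the text and is added to show that the unknown attributes of employee 16 never reach the certain answers.

```python
# Source extensions copied from Figure 1.
S1 = {(12, "calvin", "rome", 21), (15, "alice", "hong kong", 24)}
S2 = {("AF", "hotdog corp."), ("BN", "banana ltd.")}
S3 = {(12, "AF"), (16, "BN")}


def retrieved_global_database():
    """Populate each global relation by evaluating its GAV mapping query."""
    return {
        "employee": {(x, y, z) for (x, y, z, _w) in S1},
        "company": set(S2),
        "employed": set(S3),
    }


def complete_foreign_key(db):
    """Add placeholder employee tuples so that employed[Ecode] is contained
    in employee[Ecode].

    Unknown attribute values are Skolem terms, encoded here as tuples such as
    ("f_name", 16); this encoding is an assumption of the sketch.
    """
    known = {ecode for (ecode, _name, _city) in db["employee"]}
    extra = {
        (ecode, ("f_name", ecode), ("f_city", ecode))
        for (ecode, _ccode) in db["employed"]
        if ecode not in known
    }
    return {**db, "employee": db["employee"] | extra}


def is_skolem(value):
    """Skolem terms are the only tuple-valued attribute values in this sketch."""
    return isinstance(value, tuple)


def certain(answers):
    """Keep only the answer tuples that contain no Skolem term."""
    return {t for t in answers if not any(is_skolem(v) for v in t)}


def employee_codes(db):
    # q = { x | employee(x, y, z) }
    return {(x,) for (x, _y, _z) in db["employee"]}


def employee_names(db):
    # { x, y | employee(x, y, z) }, a second query added for illustration
    return {(x, y) for (x, y, _z) in db["employee"]}


retrieved = retrieved_global_database()
completed = complete_foreign_key(retrieved)

unfolded_codes = certain(employee_codes(retrieved))  # plain unfolding
certain_codes = certain(employee_codes(completed))   # foreign key taken into account
certain_names = certain(employee_names(completed))   # employee 16 has no known name

print(sorted(unfolded_codes))
print(sorted(certain_codes))
print(sorted(certain_names))
```

With the extension of Figure 1, plain unfolding sees only the codes stored in s1, while the completed database also yields code 16; the name query drops the tuple for 16 because its name is a Skolem term.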
9. ACKNOWLEDGMENTS [10] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni,
R. Guidetti, G. Malvezzi, M. Melchiori, and M. Vincini.
I warmly thank Diego Calvanese, Giuseppe De Giacomo Information integration: the MOMIS project demon-
and Moshe Y. Vardi, with whom I carried out most of stration. In Proc. of the 26th Int. Conf. on Very Large
my research work on data integration during the last years. Data Bases (VLDB 2000), 2000.
Also, I thank all colleagues that I have been working with
in several data integration projects, in particular Andrea [11] A. Borgida. Description logics in data management.
Calı̀, Domenico Lembo, Daniele Nardi, Riccardo Rosati, and IEEE Trans. on Knowledge and Data Engineering,
all participants to the ESPRIT LTR project “DWQ (Data 7(5):671–682, 1995.
Warehouse Quality)”, and the MIUR (Italian Ministry of
University and Research) project “D2I (From Data To In- [12] M. Bouzeghoub and M. Lenzerini. Introduction to the
formation)”. special issue on data extraction, cleaning, and reconcil-
iation. Information Systems, 26(8):535–536, 2001.
Finally, this is the first paper I can dedicate to my son
Domenico, and for that I want to thank his mother. [13] F. Bry. Query answering in information systems with
integrity constraints. In IFIP WG 11.5 Working Conf.
10. REFERENCES on Integrity and Control in Information System. Chap-
man & Hall, 1997.
[1] S. Abiteboul and O. Duschka. Complexity of answering
queries using materialized views. In Proc. of the 17th [14] P. Buneman. Semistructured data. In Proc. of the 16th
ACM SIGACT SIGMOD SIGART Symp. on Principles ACM SIGACT SIGMOD SIGART Symp. on Principles
of Database Systems (PODS’98), pages 254–265, 1998. of Database Systems (PODS’97), pages 117–121, 1997.

[2] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and [15] P. Buneman, S. Davidson, G. Hillebrand, and D. Su-
J. L. Wiener. The Lorel query language for semistruc- ciu. A query language and optimization technique for
tured data. Int. J. on Digital Libraries, 1(1):68–88, unstructured data. In Proc. of the ACM SIGMOD Int.
1997. Conf. on Management of Data, pages 505–516, 1996.
[3] S. Adali, K. S. Candan, Y. Papakonstantinou, and V. S. [16] A. Calı̀, D. Calvanese, G. De Giacomo, and M. Lenz-
Subrahmanian. Query caching and optimization in dis- erini. Accessing data integration systems through con-
tributed mediator systems. In Proc. of the ACM SIG- ceptual schemas. In Proc. of the 20th Int. Conf. on Con-
MOD Int. Conf. on Management of Data, pages 137– ceptual Modeling (ER 2001), 2001.
148, 1996.
[4] F. N. Afrati, M. Gergatsoulis, and T. Kavalieros. An- [17] A. Calı̀, D. Calvanese, G. De Giacomo, and M. Lenz-
swering queries using materialized views with disjunc- erini. Data integration under integrity constraints. In
tion. In Proc. of the 7th Int. Conf. on Database Theory Proc. of the 14th Conf. on Advanced Information Sys-
(ICDT’99), volume 1540 of Lecture Notes in Computer tems Engineering (CAiSE 2002), 2002. To appear.
Science, pages 435–452. Springer, 1999.
[18] A. Calı̀, D. Calvanese, G. De Giacomo, and M. Lenz-
[5] A. V. Aho, Y. Sagiv, and J. D. Ullman. Equivalence erini. On the expressive power of data integration sys-
among relational expressions. SIAM J. on Computing, tems. Submitted for pubblication, 2002.
8:218–246, 1979.
[6] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In Proc. of the 18th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’99), pages 68–79, 1999.

[19] A. Calì, G. De Giacomo, and M. Lenzerini. Models of information integration: Turning local-as-view into global-as-view. In Foundations of Models for Information Integration. On line proceedings, http://www.fmldo.org/FMII-2001, 2001.
[7] M. Arenas, L. E. Bertossi, and J. Chomicki. Specifying and querying database repairs using logic programs with exceptions. In Proc. of the 4th Int. Conf. on Flexible Query Answering Systems (FQAS’00), pages 27–41. Springer, 2000.

[20] D. Calvanese, G. De Giacomo, and M. Lenzerini. On the decidability of query containment under constraints. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’98), pages 149–158, 1998.
[8] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2002. To appear.

[21] D. Calvanese, G. De Giacomo, and M. Lenzerini. Answering queries using views over description logics knowledge bases. In Proc. of the 17th Nat. Conf. on Artificial Intelligence (AAAI 2000), pages 386–391, 2000.

[9] C. Beeri, A. Y. Levy, and M.-C. Rousset. Rewriting queries using views in description logics. In Proc. of the 16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’97), pages 99–108, 1997.

[22] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Description logic framework for information integration. In Proc. of the 6th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR’98), pages 2–13, 1998.
[23] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Rewriting of regular expressions and regular path queries. In Proc. of the 18th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’99), pages 194–204, 1999.

[35] S. Chaudhuri and M. Y. Vardi. On the equivalence of recursive and nonrecursive Datalog programs. In Proc. of the 11th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’92), pages 55–66, 1992.
[24] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Answering regular path queries using views. In Proc. of the 16th IEEE Int. Conf. on Data Engineering (ICDE 2000), pages 389–398, 2000.

[36] R. Chirkova, A. Y. Halevy, and D. Suciu. A formal perspective on the view selection problem. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001), pages 59–68, 2001.
[25] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Containment of conjunctive regular path queries with inverse. In Proc. of the 7th Int. Conf. on Principles of Knowledge Representation and Reasoning (KR 2000), pages 176–185, 2000.

[37] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In Proc. of the 18th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’99), pages 155–166, 1999.
[26] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Query processing using views for regular path queries with inverse. In Proc. of the 19th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2000), pages 58–66, 2000.

[38] W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 201–212, 1998.
[27] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. View-based query processing and constraint satisfaction. In Proc. of the 15th IEEE Symp. on Logic in Computer Science (LICS 2000), pages 361–371, 2000.

[39] D. S. Johnson and A. C. Klug. Testing containment of conjunctive queries under functional and inclusion dependencies. J. of Computer and System Sciences, 28(1):167–189, 1984.
[28] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. View-based query answering and query containment over semistructured data. In Proc. of the 8th Int. Workshop on Database Programming Languages (DBPL 2001), 2001.

[29] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Lossless regular views. In Proc. of the 21st ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2002), pages 58–66, 2002.

[30] M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. F. Cody, R. Fagin, M. Flickner, A. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J. H. Williams, and E. L. Wimmers. Towards heterogeneous multimedia information systems: The Garlic approach. In Proc. of the 5th Int. Workshop on Research Issues in Data Engineering – Distributed Object Management (RIDE-DOM’95), pages 124–131. IEEE Computer Society Press, 1995.

[31] T. Catarci and M. Lenzerini. Representing and using interschema knowledge in cooperative information systems. J. of Intelligent and Cooperative Information Systems, 2(4):375–398, 1993.

[32] E. P. F. Chan. Containment and minimization of positive conjunctive queries in OODB’s. In Proc. of the 11th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’92), pages 202–211, 1992.

[33] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In Proc. of the 9th ACM Symp. on Theory of Computing (STOC’77), pages 77–90, 1977.

[40] G. De Giacomo. Intensional query answering by partial evaluation. J. of Intelligent Information Systems, 7(3):205–233, 1996.

[41] A. Deutsch and V. Tannen. Optimization properties for classes of conjunctive regular path queries. In Proc. of the 8th Int. Workshop on Database Programming Languages (DBPL 2001), 2001.

[42] O. Duschka. Query Planning and Optimization in Information Integration. PhD thesis, Stanford University, 1997.

[43] O. M. Duschka and M. R. Genesereth. Answering recursive queries using views. In Proc. of the 16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’97), pages 109–116, 1997.

[44] O. M. Duschka and A. Y. Levy. Recursive plans for information gathering. In Proc. of the 15th Int. Joint Conf. on Artificial Intelligence (IJCAI’97), pages 778–784, 1997.

[45] M. F. Fernandez, D. Florescu, J. Kang, A. Y. Levy, and D. Suciu. Catching the boat with Strudel: Experiences with a web-site management system. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 414–425, 1998.

[46] M. F. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas. In Proc. of the 14th IEEE Int. Conf. on Data Engineering (ICDE’98), pages 14–23, 1998.
[34] S. Chaudhuri, S. Krishnamurthy, S. Potamianos, and K. Shim. Optimizing queries with materialized views. In Proc. of the 11th IEEE Int. Conf. on Data Engineering (ICDE’95), Taipei (Taiwan), 1995.

[47] D. Florescu, A. Levy, and D. Suciu. Query containment for conjunctive queries with regular expressions. In Proc. of the 17th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’98), pages 139–148, 1998.
[48] D. Florescu, A. Y. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence of limited access patterns. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 311–322, 1999.

[62] T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava. The Information Manifold. In Proceedings of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments, pages 85–91, 1995.

[49] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In Proc. of the 16th Nat. Conf. on Artificial Intelligence (AAAI’99), pages 67–73. AAAI Press/The MIT Press, 1999.

[50] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. An extensible framework for data cleaning. Technical Report 3742, INRIA, Rocquencourt, 1999.

[63] A. C. Klug. On conjunctive queries containing inequalities. J. of the ACM, 35(1):146–160, 1988.

[64] D. Lembo, M. Lenzerini, and R. Rosati. Source inconsistency and incompleteness in data integration. In Proc. of the 9th Int. Workshop on Knowledge Representation meets Databases (KRDB 2002), 2002.
[51] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. D. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. J. of Intelligent Information Systems, 8(2):117–132, 1997.

[52] C. H. Goh, S. Bressan, S. E. Madnick, and M. D. Siegel. Context interchange: New features and formalisms for the intelligent integration of information. ACM Trans. on Information Systems, 17(3):270–293, 1999.

[53] G. Grahne and A. O. Mendelzon. Tableau techniques for querying information sources through global schemas. In Proc. of the 7th Int. Conf. on Database Theory (ICDT’99), volume 1540 of Lecture Notes in Computer Science, pages 332–347. Springer, 1999.

[54] G. Greco, S. Greco, and E. Zumpano. A logic programming approach to the integration, repairing and querying of inconsistent databases. In Proc. of the 17th Int. Conf. on Logic Programming (ICLP’01), volume 2237 of Lecture Notes in Artificial Intelligence, pages 348–364. Springer, 2001.

[65] A. Y. Levy. Obtaining complete answers from incomplete databases. In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB’96), pages 402–412, 1996.

[66] A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views. In Proc. of the 14th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’95), pages 95–104, 1995.

[67] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB’96), 1996.

[68] A. Y. Levy, A. Rajaraman, and J. D. Ullman. Answering queries using limited external query processors. In Proc. of the 15th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’96), pages 227–237, 1996.

[69] A. Y. Levy and M.-C. Rousset. CARIN: A representation language combining Horn rules and description logics. In Proc. of the 12th Eur. Conf. on Artificial Intelligence (ECAI’96), pages 323–327, 1996.
[55] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer? In Proc. of the Int. Workshop on the Web and Databases (WebDB’01), 2001.

[70] A. Y. Levy and M.-C. Rousset. Combining Horn rules and description logics in CARIN. Artificial Intelligence, 104(1–2):165–209, 1998.

[56] S. Grumbach, M. Rafanelli, and L. Tininini. Querying aggregate data. In Proc. of the 18th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’99), pages 174–184, 1999.

[71] A. Y. Levy and D. Suciu. Deciding containment for queries with complex objects. In Proc. of the 16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’97), pages 20–31, 1997.

[57] S. Grumbach and L. Tininini. On the content of materialized aggregate views. In Proc. of the 19th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2000), pages 47–57, 2000.

[72] C. Li, M. Bawa, and J. D. Ullman. Minimizing view sets without losing query-answering power. In Proc. of the 8th Int. Conf. on Database Theory (ICDT 2001), pages 99–103, 2001.

[58] M. Gruninger and J. Lee. Ontology applications and design. Communications of the ACM, 45(2):39–41, 2002.

[73] C. Li and E. Chang. Query planning with limited source capabilities. In Proc. of the 16th IEEE Int. Conf. on Data Engineering (ICDE 2000), pages 401–412, 2000.
[59] J. Gryz. Query folding with inclusion dependencies. In Proc. of the 14th IEEE Int. Conf. on Data Engineering (ICDE’98), pages 126–133, 1998.

[60] A. Y. Halevy. Answering queries using views: A survey. Very Large Database J., 10(4):270–294, 2001.

[74] C. Li and E. Chang. On answering queries in the presence of limited access patterns. In Proc. of the 8th Int. Conf. on Database Theory (ICDT 2001), pages 219–233, 2001.
[61] R. Hull. Managing semantic heterogeneity in databases: A theoretical perspective. In Proc. of the 16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’97), 1997.

[75] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. D. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 564–566, 1998.
[76] J. Lin and A. O. Mendelzon. Merging databases under constraints. Int. J. of Cooperative Information Systems, 7(1):55–76, 1998.

[91] R. van der Meyden. Logical approaches to incomplete information. In J. Chomicki and G. Saake, editors, Logics for Databases and Information Systems, pages 307–356. Kluwer Academic Publisher, 1998.
[77] J. W. Lloyd. Foundations of Logic Programming (Second, Extended Edition). Springer, Berlin, Heidelberg, 1987.

[78] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries on heterogeneous data sources. In Proc. of the 27th Int. Conf. on Very Large Data Bases (VLDB 2001), pages 241–250, 2001.

[92] G. Zhou, R. Hull, R. King, and J.-C. Franchitti. Using object matching and materialization to integrate heterogeneous databases. In Proc. of the 3rd Int. Conf. on Cooperative Information Systems (CoopIS’95), pages 4–18, 1995.

[79] T. D. Millstein, A. Y. Levy, and M. Friedman. Query containment for data integration systems. In Proc. of the 19th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2000), pages 67–75, 2000.

[80] T. Milo and D. Suciu. Index structures for path expressions. In Proc. of the 7th Int. Conf. on Database Theory (ICDT’99), volume 1540 of Lecture Notes in Computer Science, pages 277–295. Springer, 1999.

[81] F. Naumann, U. Leser, and J. C. Freytag. Quality-driven integration of heterogenous information systems. In Proc. of the 25th Int. Conf. on Very Large Data Bases (VLDB’99), pages 447–458, 1999.

[82] Y. Papakonstantinou and V. Vassalos. Query rewriting using semistructured views. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1999.

[83] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. Very Large Database J., 10(4):334–350, 2001.

[84] A. Rajaraman, Y. Sagiv, and J. D. Ullman. Answering queries using templates with binding patterns. In Proc. of the 14th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS’95), 1995.

[85] R. Reiter. On closed world data bases. In H. Gallaire and J. Minker, editors, Logic and Databases, pages 119–140. Plenum Publ. Co., New York, 1978.

[86] Y. Sagiv and M. Yannakakis. Equivalences among relational expressions with the union and difference operators. J. of the ACM, 27(4):633–655, 1980.

[87] D. Srivastava, S. Dar, H. V. Jagadish, and A. Levy. Answering queries with aggregation using views. In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB’96), pages 318–329, 1996.

[88] O. G. Tsatalos, M. H. Solomon, and Y. E. Ioannidis. The GMAP: A versatile tool for physical data independence. Very Large Database J., 5(2):101–118, 1996.

[89] J. D. Ullman. Information integration using logical views. In Proc. of the 6th Int. Conf. on Database Theory (ICDT’97), volume 1186 of Lecture Notes in Computer Science, pages 19–40. Springer, 1997.

[90] R. van der Meyden. The Complexity of Querying Indefinite Information. PhD thesis, Rutgers University, 1992.
