Lecture 09
Data Integration
May 30th, 2002
Agenda/Administration
• Project demo scheduling.
• Reading pointers for exam.
What is Data Integration
• Providing
– Uniform (same query interface to all sources)
– Access to (queries; eventually updates too)
– Multiple (we want many, but 2 is hard too)
– Autonomous (DBA doesn’t report to you)
– Heterogeneous (data models are different)
– Structured (or at least semi-structured)
– Data Sources (not only databases).
The Problem: Data Integration
[Figure: the mybooks.com mediated schema (Books, Inventory, Orders, Shipping, Reviews) over data sources labeled Morgan-Kaufman, Prentice-Hall, East, West, Orders, Customer, FedEx, UPS, Reviews, NY Times, alt.books.reviews. System components: optimizer, execution engine, data source catalog. Open question: which data model?]
Sources can be: relational, hierarchical (IMS), structured files, web sites.
Research Projects
• Garlic (IBM),
• Information Manifold (AT&T)
• Tsimmis, InfoMaster (Stanford)
• The Internet Softbot/Razor/Tukwila (UW)
• Hermes (Maryland)
• DISCO, Agora (INRIA, France)
• SIMS/Ariadne (USC/ISI)
Industry
• Nimble Technology
• Enosys Markets
• IBM starting to announce stuff
• BEA marketing announcing stuff too.
Dimensions to Consider
• How many sources are we accessing?
• How autonomous are they?
• Meta-data about sources?
• Is the data structured?
• Queries or also updates?
• Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
Outline
• Wrappers
• Semantic integration and source descriptions:
– Modeling source completeness
– Modeling source capabilities
• Query optimization
• Query execution
• Peer-data management systems
• Creating schema mappings
Wrapper Programs
• Task: to communicate with the data sources
and do format translations.
• They are built w.r.t. a specific source.
• They can sit either at the source or at the
mediator.
• Often hard to build (very little science).
• Can be “intelligent”: perform source-
specific optimizations.
Example
Transform:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
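The transformation above can be sketched as a small wrapper function. This is only an illustration, not how any particular system builds wrappers: the function name wrap_book is hypothetical, and the code assumes the source always puts the title in <b>, each author in <i>, and "Publisher, Year" on the last line.

```python
import re
from xml.sax.saxutils import escape


def wrap_book(record: str) -> str:
    """Translate one HTML-style book record into a <book> XML element."""
    # Assumption: the title is the only <b> element in the record.
    title = re.search(r"<b>(.*?)</b>", record, re.S).group(1).strip()
    # Assumption: every <i> element holds exactly one author.
    authors = [a.strip() for a in re.findall(r"<i>(.*?)</i>", record, re.S)]
    # Assumption: the last non-empty line reads "Publisher, Year".
    last_line = [ln for ln in record.splitlines() if ln.strip()][-1]
    publisher, year = (part.strip() for part in last_line.rsplit(",", 1))

    lines = ["<book>", f"  <title>{escape(title)}</title>"]
    lines += [f"  <author>{escape(a)}</author>" for a in authors]
    lines += [
        f"  <publisher>{escape(publisher)}</publisher>",
        f"  <year>{escape(year)}</year>",
        "</book>",
    ]
    return "\n".join(lines)


record = """<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999"""
print(wrap_book(record))
```

The hard-coded layout assumptions are the reason a wrapper is tied to one specific source: a change in the page layout breaks it.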
Data Source Catalog
• Contains all meta-information about the
sources:
– Logical source contents (books, new cars).
– Source capabilities (can answer SQL queries)
– Source completeness (has all books).
– Physical properties of source and network.
– Statistics about the data (like in an RDBMS)
– Source reliability
– Mirror sources
– Update frequency.
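A catalog entry can be pictured as a plain record with one field per kind of meta-information. The sketch below is a minimal illustration under stated assumptions: the class SourceEntry, its field names, the helper functions, and the source names source_a, source_b, source_c are all hypothetical, and only a few of the properties listed above are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class SourceEntry:
    """One catalog record (hypothetical layout)."""
    name: str
    contents: Set[str]                                      # logical contents
    answers_sql: bool = False                               # capability
    complete_for: Set[str] = field(default_factory=set)     # completeness
    mirrors: List[str] = field(default_factory=list)        # mirror sources
    update_hours: Optional[float] = None                    # update frequency


def relevant_sources(catalog: List[SourceEntry], topic: str) -> List[str]:
    """Names of sources whose logical contents cover the topic."""
    return [s.name for s in catalog if topic in s.contents]


def complete_sources(catalog: List[SourceEntry], topic: str) -> List[str]:
    """Names of sources known to hold all data for the topic."""
    return [s.name for s in catalog if topic in s.complete_for]


catalog = [
    SourceEntry("source_a", {"books"}, answers_sql=True, complete_for={"books"}),
    SourceEntry("source_b", {"books", "new cars"}),
    SourceEntry("source_c", {"new cars"}, mirrors=["source_b"]),
]

print(relevant_sources(catalog, "books"))
print(complete_sources(catalog, "books"))
```

With a record like this, a query about books never needs to contact source_c, and a system could stop after source_a because that source is marked complete for books.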
Content Descriptions
• User queries refer to the mediated schema.
• Data is stored in the sources in a local
schema.
• Content descriptions provide the semantic
mappings between the different schemas.
• Data integration system uses the
descriptions to translate user queries into
queries on the sources.
Desiderata from Source Descriptions
• Expressive power: distinguish between
sources with closely related data. Hence, be
able to prune access to irrelevant sources.
• Easy addition: make it easy to add new data
sources.
• Reformulation: be able to reformulate a user
query into a query on the sources efficiently
and effectively.
Reformulation Problem
• Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
• Find:
– A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers to Q given
the sources.
Approaches to Specifying Source Descriptions
• Global-as-view: express the mediated
schema relations as a set of views over the
data source relations
• Local-as-view: express the source relations
as views over the mediated schema.
• The two approaches can be combined at no additional cost.
Global-as-View
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Sources: S1(title, dir, year, genre), S2(title, dir, year, genre),
S3(title, dir), S4(title, year, genre).
Create View Movie AS
select * from S1
union
select * from S2
union
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
where S3.title = S4.title
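With global-as-view, reformulation amounts to unfolding the view definition, which an ordinary SQL engine does by itself. The sketch below shows this with Python's built-in sqlite3 module. It is an illustration only: the four sources are simulated as local tables, and the rows ("Movie A", "Dir X", and so on) are made-up placeholders.

```python
import sqlite3


def build_mediator() -> sqlite3.Connection:
    """Create stand-in source tables and the Movie view over them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE S1(title, dir, year, genre);
        CREATE TABLE S2(title, dir, year, genre);
        CREATE TABLE S3(title, dir);
        CREATE TABLE S4(title, year, genre);

        CREATE VIEW Movie AS
            SELECT title, dir, year, genre FROM S1
            UNION
            SELECT title, dir, year, genre FROM S2
            UNION
            SELECT S3.title, S3.dir, S4.year, S4.genre
            FROM S3 JOIN S4 ON S3.title = S4.title;
        """
    )
    # Placeholder rows, invented for illustration.
    conn.execute("INSERT INTO S1 VALUES ('Movie A', 'Dir X', 1999, 'drama')")
    conn.execute("INSERT INTO S2 VALUES ('Movie C', 'Dir Z', 2001, 'drama')")
    conn.execute("INSERT INTO S3 VALUES ('Movie B', 'Dir Y')")
    conn.execute("INSERT INTO S4 VALUES ('Movie B', 1999, 'comedy')")
    return conn


def titles_in_year(conn: sqlite3.Connection, year: int) -> list:
    """A user query posed over the mediated relation Movie only."""
    rows = conn.execute(
        "SELECT title FROM Movie WHERE year = ? ORDER BY title", (year,)
    )
    return [title for (title,) in rows]


conn = build_mediator()
print(titles_in_year(conn, 1999))
```

The user query never mentions S1 through S4; the engine expands Movie into the union over the sources, including the join of S3 and S4.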
Global-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create Source S1 as
select *
from Cites
given paper1
Create Source S2 as
select paper1
from Cites
• Problem:
– Few and unreliable statistics about the data.
– Unexpected (possibly bursty) network transfer
rates.
– Generally, unpredictable environment.
• General solution: (research area)
– Adaptive query processing.
– Interleave optimization and execution. As you
get to know more about your data, you can
improve your plan.
Tukwila Data Integration System
Novel components:
– Event handler
– Optimization-execution loop
Double Pipelined Join (Tukwila)
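One common way to realize a double pipelined join is a symmetric hash join: keep a hash table for each input, and as each tuple arrives, insert it into its own table and probe the other. The sketch below follows that idea. It is not Tukwila's implementation; in particular, it assumes both hash tables fit in memory, and it simulates interleaved network arrival by reading the two inputs in round-robin order.

```python
from collections import defaultdict
from itertools import zip_longest


def double_pipelined_join(left, right, left_key, right_key):
    """Yield (l, r) pairs with equal keys as soon as both tuples have arrived."""
    left_table = defaultdict(list)    # key -> left tuples seen so far
    right_table = defaultdict(list)   # key -> right tuples seen so far
    done = object()                   # marks an exhausted input

    # Assumption: round-robin reading stands in for data-driven arrival.
    for l, r in zip_longest(left, right, fillvalue=done):
        if l is not done:
            k = left_key(l)
            left_table[k].append(l)
            # Probe the right table with the newly arrived left tuple.
            for match in right_table.get(k, []):
                yield (l, match)
        if r is not done:
            k = right_key(r)
            right_table[k].append(r)
            # Probe the left table with the newly arrived right tuple.
            for match in left_table.get(k, []):
                yield (match, r)


left = [(1, "a"), (2, "b")]
right = [(2, "x"), (1, "y"), (2, "z")]
for pair in double_pipelined_join(left, right, lambda t: t[0], lambda t: t[0]):
    print(pair)
```

Each matching pair is produced exactly once, when the later of its two tuples arrives, so the first results appear without waiting for either input to finish.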
[Figure: Emergency Workers (EW), Portland Fire District (PFD), Vancouver Fire District (VFD), National Guard, Washington State.]
[Figure: schema elements agent-name, agent-phone, house, name, phone, connected by 1-1 and non 1-1 mappings.]
Why Matching is Difficult
• Structures represent same entity differently
– different names => same entity:
• area & address => location
– same names => different entities:
• area => location or square-feet
• Intended semantics is typically subjective!
– IBM Almaden Lab = IBM?
• Schema, data and rules never fully capture semantics!
– not adequately documented, certainly not for machine
consumption.
• Often hard for humans (committees are formed!)
Desiderata from Proposed Solutions
• Accuracy, efficiency, ease of use.
• Realistic expectations:
– Unlikely to be fully automated. Need user in the loop.
• Some notion of semantics for mappings.
• Extensibility:
– Solution should exploit additional background
knowledge.
• “Memory”, knowledge reuse:
– System should exploit previous manual or
automatically generated matchings.
– Key idea behind LSD.
Learning for Mapping
• Context: generating semantic mappings between
a mediated schema and a large set of data source
schemas.
• Key idea: generate the first mappings manually,
and learn from them to generate the rest.
• Technique: multi-strategy learning (extensible!)
• L(earning) S(ource) D(escriptions) [SIGMOD 2001].
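The multi-strategy idea can be sketched with two simple base learners whose scores are averaged. This is not LSD's actual set of learners or its combination method: the classes NameLearner, ValueLearner, and Matcher are hypothetical stand-ins, the averaging rule is an assumption, and the training data (element names and sample values) is invented for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def shape(value: str) -> str:
    """Reduce a value to its format, e.g. '(206) 555-0100' -> '(9) 9-9'."""
    return re.sub(r"\d+", "9", re.sub(r"[A-Za-z]+", "a", value))


class NameLearner:
    """Scores a label by how similar the element name is to names seen before."""
    def __init__(self):
        self.names = defaultdict(list)

    def train(self, name, values, label):
        self.names[label].append(name.lower())

    def score(self, name, values, label):
        seen = self.names.get(label, [])
        return max((SequenceMatcher(None, name.lower(), n).ratio() for n in seen),
                   default=0.0)


class ValueLearner:
    """Scores a label by how many value formats were seen for it before."""
    def __init__(self):
        self.shapes = defaultdict(set)

    def train(self, name, values, label):
        self.shapes[label].update(shape(v) for v in values)

    def score(self, name, values, label):
        shapes = {shape(v) for v in values}
        known = self.shapes.get(label, set())
        return len(shapes & known) / len(shapes) if shapes else 0.0


class Matcher:
    """Combines base learners; new learners can be added to the list."""
    def __init__(self, learners):
        self.learners = learners
        self.labels = set()

    def train(self, name, values, label):
        self.labels.add(label)
        for learner in self.learners:
            learner.train(name, values, label)

    def predict(self, name, values):
        def combined(label):  # assumption: plain average of learner scores
            scores = [l.score(name, values, label) for l in self.learners]
            return sum(scores) / len(scores)
        return max(sorted(self.labels), key=combined)


matcher = Matcher([NameLearner(), ValueLearner()])
# Manually mapped elements of a first source (invented examples).
matcher.train("contact-name", ["Ann Lee"], "agent-name")
matcher.train("contact-phone", ["(206) 555-0100"], "agent-phone")
# Elements of a new source, mapped automatically.
print(matcher.predict("phone", ["(360) 555-0199"]))
print(matcher.predict("name", ["Cy Day"]))
```

Extensibility here means only that another learner with the same train and score methods can be appended to the list passed to Matcher.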
Data Integration (a simple PDMS)
Find houses with four bathrooms priced under $500,000
mediated schema
Query reformulation
and optimization.