Data Warehousing/Mining Comp 150 DW Semistructured Data: Instructor: Dan Hebert

This document provides an overview of semistructured data and its applications. It defines semistructured data as data that has no rigid schema or places only loose constraints on data. Examples given include web data, data exchange formats, and some database systems. The document then discusses motivations for working with semistructured data such as data integration and browsing. Finally, it introduces models for representing semistructured data, including the Object Exchange Model, and approaches for querying semistructured data.

Data Warehousing/Mining

Comp 150 DW
Semistructured Data

Instructor: Dan Hebert

Semistructured Data
• Everything that has no rigid schema
– Schema is contained within the data (self-describing), OR
– No separate schema, OR
– Schema exists but places only loose constraints on data
• Emerged as an important topic for a variety of reasons
– Many data sources, like the WWW, which we would like to treat as databases but cannot for the lack of a schema
– Desirable to have an extremely flexible format for data exchange between disparate databases
– May want to view structured data as semistructured data for the purpose of browsing
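
A minimal sketch of what "self-describing" with only loose constraints can look like, using plain Python dictionaries. The records and field names are hypothetical and chosen only for illustration.

```python
# Hypothetical records: each one carries its own field names (self-describing),
# and the two records do not share a fixed set of fields.
records = [
    {"title": "Casablanca", "year": 1942, "cast": ["Bogart", "Bergman"]},
    {"title": "Play It Again, Sam", "director": {"name": "Ross"}},
]


def field_names(record):
    """Return the set of top-level field names a record describes itself with."""
    return set(record)


# The only "schema" available is what can be observed in the data itself.
observed = set().union(*(field_names(r) for r in records))
common = set.intersection(*(field_names(r) for r in records))
```

Here `observed` is the loose union of all fields seen, while `common` shows how little the records are guaranteed to share.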

Motivation
• Some data really is unstructured/semistructured
– World Wide Web
– Data exchange formats
– Some exotic database management systems, e.g., ACeDB, popular with biologists
• Data integration
• Browsing

Motivation - World Wide Web
• Why do we want to treat the Web as a database?
– To maintain integrity
– To query based on structure (as opposed to content)
– To introduce some “organization”
• But the Web has no structure. The best we can say is that it is an enormous graph.

Motivation - Data Formats
• Much (probably most) of the world’s data is in data formats
• These are formats defined for the interchange and archiving of data
• Data formats vary in generality. ASN.1 and XDR are quite general
• Scientific data formats tend to be “fixed schemas”
• The textual representation given by data formats is sometimes not immediately translatable into a standard relational/object-oriented representation

Motivation - Data Integration
• Goal is to integrate all types of information, including unstructured information
– Irregular, missing information, structure not fully known, dynamic schema evolution, etc.
• Traditional data models and languages are not well suited
– Cannot accommodate heterogeneous data sets (different types and structures), etc.
– Difficult to build software that will easily convert between two disparate models
• OEM (Object Exchange Model)
– Semistructured data model from the TSIMMIS project at Stanford
– Internal data structure for exchange of data between DBMSs
– Used by other systems: e.g., Windows 95 registry, Lotus Notes

Motivation - Browsing
• To query a database one needs to understand the schema.
• However, schemas have opaque terminology, and the user may want to start by querying the data with little or no knowledge of the schema.
– Where in the database is the string “Casablanca” to be found?
– Are there integers in the database greater than 2^16?
– What objects in the database have an attribute name that starts with “act”?
• While extensions to relational query languages have been proposed for such queries, there is no generic technique for interpreting them.
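
The three browsing questions above can be sketched as a generic walk over nested data that needs no schema. This is only an illustration in Python. The sample `db`, its labels, and the helper names are assumptions, not part of the course material.

```python
def walk(node, path=()):
    """Yield (label_path, value) for every atomic value in nested dicts/lists."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from walk(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)
    else:
        yield path, node


def find_string(db, text):
    """Where in the database is the given string found?"""
    return [p for p, v in walk(db) if v == text]


def integers_greater_than(db, bound):
    """Which integers in the database exceed the bound?"""
    return [v for _, v in walk(db)
            if isinstance(v, int) and not isinstance(v, bool) and v > bound]


def labels_starting_with(db, prefix):
    """Which attribute names on any path start with the prefix?"""
    return sorted({label for p, _ in walk(db) for label in p
                   if label.startswith(prefix)})


# Hypothetical, irregular sample data.
db = {"entry": [
    {"movie": {"title": "Casablanca", "actors": ["Bogart", "Bacall"], "budget": 950000}},
    {"movie": {"title": "Play it again, Sam", "actor": "Allen", "year": 1972}},
]}
```

For example, `find_string(db, "Casablanca")` reports the label path leading to the string, and `labels_starting_with(db, "act")` finds both `actor` and `actors` even though the two entries are structured differently.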

The Model
• Represent data as some kind of graph-like or tree-like model
– Cycles are allowed, but we usually refer to them as trees
– Several different approaches with minor differences (easy to convert)
    • Data on labels or edges, nodes carry information or not
• Straightforward to encode relational and object-oriented databases
– Issue: object identity
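
One way to picture the encoding of a relational database in a tree model is sketched below. The `(label, children)` representation and the function name `relation_to_tree` are assumptions made for this example. The slides do not prescribe a concrete encoding.

```python
def relation_to_tree(name, columns, rows):
    """Encode a relation as a tree: relation -> tuple -> column -> value.

    Every node is a (label, value) pair; a non-leaf value is a list of nodes.
    """
    return (name, [
        ("tuple", [(col, val) for col, val in zip(columns, row)])
        for row in rows
    ])


# Hypothetical one-row relation.
tree = relation_to_tree("movies", ["title", "year"], [("Casablanca", 1942)])

# Note on object identity: nodes here have no identifiers, so two identical
# rows (or one object shared by two parents) cannot be told apart.
```

The closing comment points at the object-identity issue: a pure tree of labels and values has no way to say that two subtrees are the same object unless node identifiers are added.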

Querying Semistructured Data
• There are (at least) three approaches to this problem
– Add arbitrary features to SQL or to your favorite query language
– Find some principled approach to programs that are based on the type of the data
– Represent the graph (or whatever the structure is) as appropriate predicates and use some variety of Datalog on that structure

The “Extend SQL” Approach
• In fact it is an attempt to extend the philosophy of OQL and comprehension syntax to these new structures
• It is the approach taken in the design of UnQL and also of Lorel
• Looks very similar to OQL (path expressions)

Example
select Entry.Movie.Title
from DB
where Entry.Movie.Director...

Syntax Issues
• Need (path) variables to tie paths and edges together
• Paths of arbitrary length
– “Find all strings in db”
– “Find whether ‘Allen’ acted in ‘Casablanca’”
– Need regular expressions to constrain paths
• Rich set of overloadings for operators to deal with comparisons of objects with values and of values with sets
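
A rough sketch of constraining paths of arbitrary length with a regular expression, written in Python rather than in any of the query languages named in these slides. Paths are flattened to dotted label strings. The data and helper names are hypothetical.

```python
import re


def label_paths(node, prefix=""):
    """Yield (dotted_label_path, atomic_value) pairs for nested dicts/lists."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from label_paths(child, f"{prefix}.{label}" if prefix else label)
    elif isinstance(node, list):
        for child in node:
            yield from label_paths(child, prefix)
    else:
        yield prefix, node


def select(node, path_regex):
    """Return the values whose whole label path matches the regular expression."""
    pattern = re.compile(path_regex)
    return [v for p, v in label_paths(node) if pattern.fullmatch(p)]


# Hypothetical movie object with actors at different depths.
movie = {"title": "Casablanca",
         "cast": {"lead": {"actor": "Bogart"}, "actor": "Bergman"}}

# Any path, of any length, that ends in the label "actor".
actors = select(movie, r"(.*\.)?actor")
allen_acted = "Allen" in actors
```

The path expression plays the role of a path variable: it binds to every route through the data that ends in `actor`, however deeply nested.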

Underlying Computational Strategy
• Model the graph as a relational database and use a relational query language.
– Database is one large relation (node-id, label, node-id)
– Used by the Stanford group in LORE/LOREL
• Complications
– Labels are from a heterogeneous set of types; need more than one relation
– Additional relations if info is to be stored in nodes
– Various navigation issues
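
The edge relation (node-id, label, node-id) can be sketched in Python as a list of triples, with a path expression evaluated as a chain of join steps. The node ids, the separate `values` relation for leaf data, and the function names are assumptions for illustration. This is not how LORE is implemented.

```python
# Edge relation: (source node id, label, target node id).
edges = [
    (0, "Entry", 1),
    (1, "Movie", 2),
    (2, "Title", 3),
    (2, "Director", 4),
]

# Additional relation for information stored in (leaf) nodes.
values = {3: "Casablanca", 4: "Curtiz"}


def follow(nodes, label):
    """One join step: nodes reachable from `nodes` over an edge with `label`."""
    return {dst for src, lab, dst in edges if src in nodes and lab == label}


def eval_path(root, labels):
    """Evaluate a path expression such as Entry.Movie.Title from the root."""
    nodes = {root}
    for label in labels:
        nodes = follow(nodes, label)
    return sorted(values[n] for n in nodes if n in values)


titles = eval_path(0, ["Entry", "Movie", "Title"])
```

Each label in the path costs one self-join of the edge relation, which hints at why navigation over long or unbounded paths is listed among the complications.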

Semistructured Data - Case Study
Object Exchange Model

OEM Features
• Common model for heterogeneous information exchange, self-describing
• Each object: OID, Label, Type, Value
– OID: unique identifier or NULL
– Label: character string descriptor
– Type: atomic data type or set
– Value: atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL

Representing Semistructured Data Using OEM
<collection, {b1, a1, ...}>
b1: <book, {t, a}>
t: <title, “Database and ...”>
a: <author, {n, p}>
n: <name, “Jeff Ullman”>
p: <picture, “/gifs/ullman.gif”>
a1: <article, {v, w, x}>
v: <author, “Gio Wiederhold”>
w: <title, “Mediators in the …”>
x: <journal, “IEEE Computer”>
...
[Figure callouts: the identifier before each colon (e.g., b1) is a memory address; the first component (e.g., book) is the label; {t, a} is a set value; “Database and ...” is an atomic value]
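
As a sketch only, the four OEM fields can be mirrored with a small Python dataclass and a dictionary keyed by OID. The type name `"string"` for the atomic values is an assumption, since the slide does not state the atomic types. The `children` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class OEMObject:
    oid: Optional[str]             # unique identifier, or None for NULL
    label: str                     # character string descriptor
    type: str                      # atomic type name (assumed "string") or "set"
    value: Union[str, List[str]]   # atomic value, or OIDs of referenced objects


# The book part of the example above, keyed by OID.
store = {o.oid: o for o in [
    OEMObject("b1", "book", "set", ["t", "a"]),
    OEMObject("t", "title", "string", "Database and ..."),
    OEMObject("a", "author", "set", ["n", "p"]),
    OEMObject("n", "name", "string", "Jeff Ullman"),
    OEMObject("p", "picture", "string", "/gifs/ullman.gif"),
]}


def children(oid):
    """Dereference a set-valued object; atomic objects have no children."""
    obj = store[oid]
    return [store[c] for c in obj.value] if obj.type == "set" else []
```

Because set values hold object references rather than copies, the same object could be a member of several sets, which is what makes the structure a graph rather than a strict tree.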

An OEM Query Language: OEM-QL
• Logic-based language for OEM
– Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
P :-
P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
<answer_title T> :-
<book {<author “Jeff Ullman”> <title T>}>
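
The match-bind-construct idea behind these two rules can be approximated in Python. This is a loose analogy, not OEM-QL semantics. Objects are modelled as `(label, value)` pairs, the book's author is simplified to an atomic value, and the label `article` (singular) is assumed so that it matches the sample data. All helper names are hypothetical.

```python
# An OEM object is modelled as (label, value); a set value is a list of objects.
db = [
    ("book", [("title", "Database and ..."), ("author", "Jeff Ullman")]),
    ("article", [("author", "Gio Wiederhold"),
                 ("title", "Mediators in the …"),
                 ("journal", "IEEE Computer")]),
]


def has_child(obj, label, value):
    """True if the set-valued object contains the sub-object <label value>."""
    return isinstance(obj[1], list) and (label, value) in obj[1]


def child_values(obj, label):
    """All bindings for a variable placed in a <label VAR> sub-pattern."""
    if not isinstance(obj[1], list):
        return []
    return [v for (l, v) in obj[1] if l == label]


# Roughly: P :- P:<article {<journal "IEEE Computer">}>
articles = [o for o in db
            if o[0] == "article" and has_child(o, "journal", "IEEE Computer")]

# Roughly: <answer_title T> :- <book {<author "Jeff Ullman"> <title T>}>
answers = [("answer_title", t)
           for o in db
           if o[0] == "book" and has_child(o, "author", "Jeff Ullman")
           for t in child_values(o, "title")]
```

The first rule returns existing objects bound to `P`, while the second constructs new `answer_title` objects from the bindings of `T`.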

Semistructured Data - Case Study
WWW Extraction

Problem
• Lots of valuable information on the Web
– irregular structure
– highly dynamic
• Embedded in HTML
• Limited query facilities

Data Extraction Tool
• Flexible, easy to use
• Accommodate virtually any HTML source
• Interface with existing system, e.g., data warehouse, user interface for querying
[Figure: boxes labelled World Wide Web, Extractor, WH Integrator, and Data Warehouse; further labels: Query, Specification]

Approach
• Extract Web data into OEM format
– Query using OEM-QL
• Python-based, configurable parser
• Declarative description of HTML source
– location of data on page
– how to package data into OEM
• “Regular expression”-like syntax
• Human intelligence rather than A.I.

Extractor Specification

Consists of commands of the form:

[ “variable(s)”, “source”, “pattern” ]

HTML Source File
<HTML>
<HEAD>
...
<TABLE>
<TR>
<TH><I> header 1 </I></TH>
<TH><I> header 2 </I></TH>
<TH><I> header 3 </I></TH>
</TR>
<TR>
<TD> text 1 </TD>
<TD><A HREF=http://www.stuff/> text 2 </A></TD>
<TD> text 3 </TD>
</TR>
...
</TABLE>
...
</BODY>
</HTML>

Specification File
[
[“root”, “get('http://www.example.test/')”, “#” ],
[“__tempvar1”, “root”, “*<table>#</table>*” ],
[“__tempvar2”, “split (__tempvar1,’</tr>’)”, “#” ],
[“rows”, “__tempvar2[1:-1]”, “#” ],
[“header1,header2_url,header2,header3”, “rows”,
“*<td>#</td>*<a*href=#>#</a>*<td>#</td>*”]
]

Result OEM Object
<root complex {
  <rows complex {
    <header1 string “text 1”>
    <header2_url string “http://www.stuff”>
    <header2 string “text 2”>
    <header3 string “text 3”>
  }>
  <rows complex {
    ...
  }>
  ...
}>

Basic Syntax: Variable
• variable(l:p:t)
– optional parameters for specification of corresponding OEM object
    • l: label name
    • t: type
    • p: parent object
• _variable
– temporary data structure, does not appear as OEM object

Basic Syntax: Source
• split(variable,token)
– creates a list with multiple elements using token as the element separator
• get(URL)
– obtain contents of HTML file at address URL
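
A small Python sketch of what `split` followed by a slice such as `[1:-1]` does to a table. It assumes `split` behaves like Python's `str.split`, and the sample table is invented. `get` is left out because it needs network access.

```python
def split(text, token):
    """Create a list of elements, using token as the element separator."""
    return text.split(token)


# Hypothetical table body: one header row followed by two data rows.
table = "<TR><TH>h1</TH></TR><TR><TD>a</TD></TR><TR><TD>b</TD></TR>"

pieces = split(table, "</TR>")

# pieces[0] is the header row and pieces[-1] is whatever follows the last
# </TR> (here the empty string), so [1:-1] keeps only the data rows.
rows = pieces[1:-1]
```

This mirrors the `rows` command in the specification file, where the first and last elements of the split list are dropped before the row pattern is applied.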

Basic Syntax: Patterns
• token1 # token2
– match and store current input (between tokens)
• token1 * token2
– match, don’t store current input (between tokens)
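
One plausible reading of the `#` and `*` pattern syntax is a translation into an ordinary regular expression, sketched below in Python. This is not the extractor's actual implementation. Case-insensitive matching, whitespace stripping, and the names `compile_pattern` and `extract` are all assumptions.

```python
import re


def compile_pattern(pattern):
    """Translate '#' (match and store) and '*' (match, don't store) to a regex."""
    parts = []
    for piece in re.split(r"([#*])", pattern):
        if piece == "#":
            parts.append("(.*?)")            # stored: becomes a capture group
        elif piece == "*":
            parts.append(".*?")              # skipped: matched but not stored
        else:
            parts.append(re.escape(piece))   # literal token
    # Assumption: tags match regardless of case, and input may span lines.
    return re.compile("".join(parts), re.DOTALL | re.IGNORECASE)


def extract(names, source, pattern):
    """Bind the comma-separated variable names to the stored parts of source."""
    match = compile_pattern(pattern).fullmatch(source)
    if match is None:
        return None
    return {n: v.strip() for n, v in zip(names.split(","), match.groups())}


# One data row, modelled on the HTML source file shown earlier.
row = ("<TR>\n<TD> text 1 </TD>\n"
       "<TD><A HREF=http://www.stuff/> text 2 </A></TD>\n"
       "<TD> text 3 </TD>\n")

fields = extract("header1,header2_url,header2,header3", row,
                 "*<td>#</td>*<a*href=#>#</a>*<td>#</td>*")
```

Using non-greedy matches keeps each `#` bound to the text up to the nearest following token, which is what lets one pattern pick out the three cells and the link target in order.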

Syntactic Sugar
• Functions for extracting commonly used HTML constructs
– extract_table(variable),pattern
– split_table_row(variable)
– split_table_column(variable)
– extract_list(variable),pattern
– split_list(variables)

Advanced Features
• Customization of output
– structure, label names, data type, ...
• Extraction across multiple HTML pages
• Graceful recovery from parse errors
– resume parsing using next input from source
• Multiple patterns in single command
– follow different parse tree depending on structure in source

Sample Extraction Scenario

...

Extracted OEM Data
root complex {
  temperature complex {
    city_temp complex {
      country string “Austria”
      city_url url http://www…
      city string “Vienna”
      weather_today string “snow”
      high_today string “-2”
      low_today string “-7”
      weather_tom string “snow”
      high_tomorrow string “-2”
      low_tomorrow string “-7”
    }
    city_temp complex {
      country string “Belgium”
      city_url url http://www…
      city string “Brussels”
      …
    }
    …
  }
}
OEM-QL query:
<city C {<high H> <low L>}> :-
<temperature {<city_temp
{<country “Germany”> <city C> <high_today H> <low_today L>}>}>

Evaluation
• Better than
– writing programs
– YACC, PERL, etc.
– A.I.
• Can do better
– GUI tool to simplify the generation of extractor specification
– Machine learning or data mining techniques to automatically infer structure...
