Advanced Database Systems
XML Data Management
Instructor
Eric Umuhoza, PhD
[email protected]
@EricUmuhoza
Acknowledgement: I am grateful to Dr. Sara Comai, professor of DB at Politecnico di Milano for
allowing me to reuse her slides.
XML
eXtensible Markup Language
Data representation format proposed by W3C (WWW Consortium) for
Web documents, such as:
books,
product catalogs,
order forms,
messages
The Origin of XML
Original idea: a meta-language used to specify markup languages
As in HTML
XML data are contained in documents
data properties are expressed with mark-ups
XML was designed to describe data and to focus on what data are
HTML was designed to display data and to focus on how data look
HTML vs XML
HTML:
<h1>The Idea Methodology</h1><br>
<ul>
<li>by S. Ceri, P. Fraternali </li>
<li> Addison-Wesley</li>
<li> US$ 49 </li>
</ul>
XML:
<bib>
<book>
<title>The Idea Methodology </title>
<author> S. Ceri </author>
<author> P. Fraternali </author>
<pub>Addison-Wesley</pub>
<price> US$ 49 </price>
</book>
</bib>
Advantages of XML
XML allows data to be separated from presentation
XML can be used to exchange data between incompatible systems
XML can be used to share data (plain text)
XML can be used to store data (in XML files)
XML is used to exchange data
With XML, data can be exchanged between incompatible systems
In the real world, computer systems and databases contain data in
incompatible formats. One of the most time-consuming challenges for
developers has been to exchange data between such systems over the
Internet.
Converting the data to XML can greatly reduce this complexity and
create data that can be read by many different types of applications.
XML is widely used for exchanging financial information between
businesses over the Internet.
XML is used to share data
With XML, plain text files can be used to share data
Since XML data are stored in plain text format, XML provides a
software- and hardware-independent way of sharing data.
This makes it much easier to create data that different applications can
work with. It also makes it easier to expand or upgrade a system to new
operating systems, servers, applications, and new browsers.
XML is used to store data
With XML, plain text files can be used to store data
XML data are also stored in files or in databases. Applications can be
written to store and retrieve information from the store, and generic
applications can be used to display the data.
Data management extensions include data models (DTD, XSD) and query
languages (XQuery, XSLT)
Data management occurs
o Within native systems (eXist, Galax, ISI-XQ, BaseX, ...)
o Within relational systems (Oracle, DB2, SQL Server)
XML can make your data more useful
With XML, data are available to more users
Since XML is independent of hardware, software and applications, you
can make your data available to clients other than standard HTML
browsers.
Other clients and applications can access your XML files as data
sources, just as they access databases. Your data can be made
available to all kinds of "reading machines".
XML is the mother of new special-purpose languages.
o E.g. the Wireless Markup Language (WML), used to mark up
Internet applications for handheld devices like mobile phones, is
written in XML
Syntax
The syntax rules of XML are very simple and very strict. The rules are
very easy to learn, and very easy to use.
Because of this, creating software that can read and manipulate XML
is very easy.
XML documents use a self-describing and simple syntax.
Example (1)
An example XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tim</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
[Figure: tree view of the document. The root "note" has the children "to", "from", "heading" and "body", with the values "Tim", "John", "Reminder" and "Don't forget me …".]
Example (2)
The first line in the document - the XML declaration - defines the XML
version and the character encoding used in the document.
In this case the document conforms to the 1.0 specification of XML and
uses the ISO-8859-1 (Latin-1/West European) character set.
The next line describes the root element of the document
This document is a note:
o <note>
Example (3)
The next 4 lines describe 4 child elements of the root
to, from, heading, and body:
<to>Tim</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
And finally the last line defines the end of the root element:
</note>
Can you detect from this example that the XML document contains
a Note to Tim from John? Don’t you agree that XML is indeed quite
self-descriptive?
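The structure just described can also be read back by a program. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (the slides do not prescribe any parser); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The note document from the example, given as bytes so that the
# declared ISO-8859-1 encoding is handled by the parser.
NOTE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tim</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(NOTE)

# Name of the root element and its child elements in document order.
root_tag = root.tag
children = [(child.tag, child.text) for child in root]

print(root_tag)
for tag, text in children:
    print(f"{tag}: {text}")
```

The element names alone are enough to tell which value is the recipient and which is the sender, which is what "self-descriptive" means here.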
XML Syntax (1)
All XML elements must have a closing tag
Note: You might have noticed from the previous example that the XML
declaration did not have a closing tag. This is not an error. The declaration
belongs to the document prolog; it is not an XML element, so it must not
have a closing tag.
XML tags are case sensitive (unlike HTML)
The tag <Letter> is different from the tag <letter>.
Opening and closing tags must therefore be written with the same case:
o <Message>This is incorrect</message>
o <message>This is correct</message>
The syntax for comments in XML is the same as that of HTML
<!-- This is a comment -->
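These rules can be checked with any XML parser. A small sketch, assuming Python's standard xml.etree.ElementTree; the helper name parse_error is hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

def parse_error(text):
    """Return the parser's error message, or None if the text parses."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)

# Tag names are case sensitive: <Message> is not closed by </message>.
mismatched = parse_error("<Message>This is incorrect</message>")
matched = parse_error("<message>This is correct</message>")

# Every element needs a closing tag.
unclosed = parse_error("<message>No closing tag")

# Comments use the same syntax as in HTML and do not affect parsing.
commented = parse_error("<message><!-- This is a comment -->ok</message>")

for label, error in [("mismatched case", mismatched), ("matched case", matched),
                     ("unclosed", unclosed), ("with comment", commented)]:
    print(label, "->", error or "well formed")
```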
XML Syntax (2)
All XML elements must be properly nested
All XML documents must contain a single root element
All other elements must be within this root element.
All elements can have sub elements (child elements). Sub elements must be
correctly nested within their parent element:
o <root> <child> <subchild>.....</subchild> </child> </root>
Attribute values must always be quoted -- it is illegal to omit quotation
marks around attribute values
o Incorrect: <note date=12/11/2002>
o Correct: <note date="12/11/2002">
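The nesting, single-root and quoting rules can be exercised in the same way. This is a sketch under the assumption that Python's standard xml.etree.ElementTree is used as the parser; is_well_formed and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

CASES = {
    "properly nested": "<root><child><subchild>x</subchild></child></root>",
    "overlapping tags": "<root><child><subchild>x</child></subchild></root>",
    "two root elements": "<note>a</note><note>b</note>",
    "unquoted attribute": "<note date=12/11/2002></note>",
    "quoted attribute": '<note date="12/11/2002"></note>',
}

results = {label: is_well_formed(text) for label, text in CASES.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'rejected'}")
```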
Elements
An XML element is everything from (including) the element's start tag to
(including) the element's end tag.
Elements can have different content types
simple content
element content
mixed content
empty content.
An element can also have attributes.
Example
<book> <!-- root element, with element content -->
<title> My First XML</title> <!-- simple element -->
<prod id="33-657" media="paper"></prod> <!-- empty element, with attributes -->
<chapter> Introduction to XML <!-- mixed content -->
<para> What is HTML </para>
<para> What is XML </para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>
Example
Book is the root element.
Title, prod, and chapter are child elements of book.
Book is the parent element of title, prod, and chapter.
Title, prod, and chapter are siblings because they have the same parent.
Book has element content, because it contains other elements.
Chapter has mixed content because it contains both text and other
elements.
Para has simple content (or text content) because it contains only text.
Prod has empty content, because it contains neither text nor child elements.
Only the prod element has attributes. The attribute named id has the
value "33-657". The attribute named media has the value "paper".
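The four content types can be told apart mechanically. The sketch below assumes Python's standard xml.etree.ElementTree, in which text directly inside an element is split between the element's .text and the .tail of each child; content_type is a hypothetical helper, and the document is a shortened copy of the book example.

```python
import xml.etree.ElementTree as ET

BOOK = """<book>
<title> My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter> Introduction to XML
<para> What is HTML </para>
<para> What is XML </para>
</chapter>
</book>"""

def content_type(elem):
    """Classify an element as simple, element, mixed or empty content."""
    has_children = len(elem) > 0
    # Text directly inside elem: its own .text plus the .tail of each child.
    pieces = [elem.text] + [child.tail for child in elem]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"

root = ET.fromstring(BOOK)
kinds = {elem.tag: content_type(elem) for elem in root.iter()}
prod_attributes = root.find("prod").attrib

print(kinds)
print(prod_attributes)
```

Whitespace-only text between child elements is treated as insignificant here, which is why book counts as element content rather than mixed content.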
Element naming
XML elements must follow these naming rules:
Names can contain letters, numbers, and other characters
Names must not start with a number or punctuation character
Names must not start with the letters xml (or XML, Xml, ...)
Names cannot contain spaces
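As a rough illustration, the four rules above can be approximated with a regular expression. This is a deliberately simplified, ASCII-only sketch in Python: the real XML name grammar also allows many non-ASCII letters and the colon, and both NAME_RE and is_valid_name are hypothetical names.

```python
import re

# First character: a letter or underscore; then letters, digits,
# underscores, hyphens or periods. No spaces anywhere.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def is_valid_name(name):
    """Approximate check of the naming rules listed above (ASCII only)."""
    if name.lower().startswith("xml"):
        return False  # names beginning with xml, XML, Xml, ... are reserved
    return NAME_RE.fullmatch(name) is not None

for candidate in ["first_name", "book-title", "1st", "-x", "xmlNote", "first name"]:
    print(candidate, "->", is_valid_name(candidate))
```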
Element naming
Take care when "inventing" element names and follow these
simple rules:
Any name can be used and no words are reserved (apart from the xml
prefix), but the idea is to make names descriptive. Names with an
underscore separator are nice.
o Examples: <first_name>, <last_name>.
Avoid "-" and "." in names. For example, if you name something
"first-name", it could be a mess if your software tries to subtract
name from first.
Element names can be as long as you like, but don't exaggerate.
Names should be short and simple, like this:
o <book_title> not like this: <the_title_of_the_book>.
Attributes
XML elements can have attributes.
From HTML you will remember this: <IMG SRC="computer.gif">. The
SRC attribute provides additional information about the IMG element.
In HTML (and in XML) attributes provide additional information about
elements:
<img src="computer.gif"> <a href="demo.asp">
Attributes often provide information that is not a part of the data. In the
example below, the file type is irrelevant to the data, but important to the
software that wants to manipulate the element:
<file type="gif">computer.gif</file>
Attributes
Quote Styles, "female" or 'female'?
Attribute values must always be enclosed in quotes, either single
or double. For a person's sex, the person tag can be:
<person sex="female"> or <person sex='female'>
Note
o If the attribute value itself contains double quotes, it is necessary to
use single quotes, like in this example:
<gangster name='George "Shotgun" Ziegler'>
o If instead the attribute value itself contains single quotes, it is
necessary to use double quotes, like in this example:
<gangster name="George 'Shotgun' Ziegler">
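Both quote styles yield the same attribute value once parsed. A short sketch, assuming Python's standard xml.etree.ElementTree; the elements are written in self-closing form so that each snippet is a complete document, which is an adaptation of the start tags shown above.

```python
import xml.etree.ElementTree as ET

# Double and single quotes are interchangeable delimiters.
double = ET.fromstring('<person sex="female"/>')
single = ET.fromstring("<person sex='female'/>")
same_value = double.get("sex") == single.get("sex")

# The delimiter not used for quoting may appear inside the value.
outer_single = ET.fromstring("""<gangster name='George "Shotgun" Ziegler'/>""")
outer_double = ET.fromstring("""<gangster name="George 'Shotgun' Ziegler"/>""")

print(same_value)
print(outer_single.get("name"))
print(outer_double.get("name"))
```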
Attributes
Should you avoid using attributes?
Some of the problems with using attributes are:
attributes cannot contain multiple values (child elements can)
attributes are not easily expandable (for future changes)
attributes cannot describe structures (child elements can)
attributes are more difficult to manipulate by program code
attribute values are not easy to test against a Document Type
Definition (DTD)
o which is used to define the legal elements of an XML document
Elements vs Attributes
Try to use elements to describe data.
Use attributes only to provide information that is not
relevant to the data or for metadata.
Example: ID references can be used to access XML elements
Name conflicts
Since element names in XML are not predefined, a name
conflict will occur when two different documents use the
same element names.
This XML document carries information in a table:
o <table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table>
This XML document carries information about a table (a piece of
furniture):
o <table> <name>African Coffee Table</name> <width>80</width>
<length>120</length> </table>
If these two XML documents were added together, there
would be an element name conflict because both
documents contain a <table> element with different
content and definition.
Namespaces
Name conflicts are solved by using a prefix
This XML document carries information in a table:
<h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
</h:table>
This XML document carries information about a piece of
furniture:
<f:table> <f:name>African Coffee Table</f:name>
<f:width>80</f:width> <f:length>120</f:length> </f:table>
Now there will be no name conflict because the two
documents use a different name for their <table> element
(<h:table> and <f:table>).
The prefix helped us create two different types of <table>
elements
Uniform Resource Identifiers (URIs)
A Uniform Resource Identifier (URI) is a string of characters
which identifies an Internet resource. The most common kind of URI
is the Uniform Resource Locator (URL), which identifies a resource
by its Internet address. Another, less common kind of URI
is the Uniform Resource Name (URN). Usually URLs are
used.
Namespace References
This XML document carries information in a table:
<h:table xmlns:h="http://www.w3.org/TR/html4/"> <h:tr>
<h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table>
This XML document carries information about a piece of
furniture:
<f:table xmlns:f="http://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name> <f:width>80</f:width>
<f:length>120</f:length> </f:table>
The XML Namespace (xmlns) Attribute
The XML namespace attribute is placed in the start tag of an
element and has the following syntax:
xmlns:namespace-prefix="namespaceURI"
When a namespace is defined in the start tag of an element,
all child elements with the same prefix are associated with the
same namespace.
Note that the address used to identify the namespace is not
used by the parser to look up information. The only purpose is to
give the namespace a unique name. However, very often
companies use the namespace as a pointer to a real Web page
containing information about the namespace.
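To see that the parser treats the namespace URI purely as a name, the two table fragments can be placed in one document. This is a sketch assuming Python's standard xml.etree.ElementTree, which reports each qualified name as {namespaceURI}localname; the wrapping root element and the NS mapping are additions made for the example.

```python
import xml.etree.ElementTree as ET

# Both <table> elements side by side; the <root> wrapper is hypothetical.
DOC = """<root>
<h:table xmlns:h="http://www.w3.org/TR/html4/">
<h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
</h:table>
<f:table xmlns:f="http://www.w3schools.com/furniture">
<f:name>African Coffee Table</f:name>
<f:width>80</f:width> <f:length>120</f:length>
</f:table>
</root>"""

root = ET.fromstring(DOC)

# The prefix is replaced by the URI it was bound to; nothing is downloaded.
tags = [child.tag for child in root]

# Prefixes used in queries are local to this mapping, not to the document.
NS = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/furniture"}
cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.find("f:table/f:name", NS).text

print(tags)
print(cells)
print(furniture_name)
```

The two table elements end up with different expanded names, so they can no longer be confused.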
XML
A Well Formed XML document has correct XML syntax
A Well Formed XML document is a document that conforms to
the XML syntax rules that were described
A Valid XML document also conforms to a DTD
A Valid XML document is a Well Formed XML document, which
also conforms to the rules of a Document Type Definition (DTD)
XML Document Type Definition (DTD)
The purpose of a Document Type Definition is to define the
legal building blocks of an XML document. It defines the
document structure with a list of legal elements.
A DTD can be declared inline in the XML document, or as an
external reference.
If the DTD is included in the XML source file, it should be
wrapped in a DOCTYPE definition with the following syntax:
<!DOCTYPE root-element [element-declarations]>
https://www.w3schools.com/xml/xml_dtd.asp
Why use a DTD?
With a DTD, each of your XML files can carry a description of its
own format with it.
With a DTD, independent groups of people can agree to use a
common DTD for interchanging data.
Your application can use a standard DTD to verify that the
data you receive from the outside world is valid.
You can also use a DTD to verify your own data.
Internal DTD
<?xml version="1.0"?>
<!DOCTYPE note
[ <!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)> ]>
<note>
<to>Tim</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
DTD
The DTD above is interpreted like this:
!DOCTYPE note (in line 2) defines that this is a document of the
type note.
!ELEMENT note (in line 3) defines the note element as having
four elements: "to,from,heading,body".
!ELEMENT to (in line 4) defines the to element to be of the type
"#PCDATA".
!ELEMENT from (in line 5) defines the from element to be of the
type "#PCDATA".
and so on.....
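Carrying a DTD does not by itself make a document valid: a non-validating parser reads the declarations but does not enforce them. The sketch below assumes Python's standard xml.etree.ElementTree, which does not validate against DTDs; the shortened note is an invented counter-example that is well formed but violates the declared content of note.

```python
import xml.etree.ElementTree as ET

# Well formed, but not valid: the DTD requires to, from, heading and body,
# and only <to> is present.
INVALID_NOTE = """<?xml version="1.0"?>
<!DOCTYPE note [
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
<note>
<to>Tim</to>
</note>"""

# Parsing succeeds because only well-formedness is checked here.
root = ET.fromstring(INVALID_NOTE)
children = [child.tag for child in root]

print(root.tag, children)
```

Checking validity requires a validating parser, which is outside what this sketch shows.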
External DTD
If the DTD is external to the XML file, it should be referenced from a
DOCTYPE declaration with the following syntax:
<!DOCTYPE root-element SYSTEM "filename">
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note> <to>Tim</to> <from>John</from> <heading>Reminder</heading>
<body>Don't forget me this weekend!</body> </note>
This is a copy of the file "note.dtd" containing the DTD:
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
Declaring an element
In the DTD, XML elements are declared with an element
declaration.
An element declaration has the following syntax:
<!ELEMENT element-name category>
or
<!ELEMENT element-name (element-content)>
Declaring an element
Empty elements
Empty elements are declared with the category keyword EMPTY:
o <!ELEMENT element-name EMPTY>
o example: <!ELEMENT br EMPTY>
Elements with only character data
Elements with only character data are declared with #PCDATA inside
parentheses:
o <!ELEMENT element-name (#PCDATA)>
Elements with any contents
Elements declared with the category keyword ANY, can contain any
combination of parsable data:
o <!ELEMENT element-name ANY>
Elements with children
Elements with one or more children are defined with the
name of the children elements inside parentheses:
<!ELEMENT element-name (child-element-name)>
<!ELEMENT element-name (child-element-name, child-element-name, ...)>
example: <!ELEMENT note (to, from, heading, body)>
When children are declared in a sequence separated by
commas, the children must appear in the same sequence in
the document.
In a full declaration, the children must also be declared, and
the children can also have children.
One, min one
Declaring only one occurrence of the same element
<!ELEMENT element-name (child-name)>
<!ELEMENT note (message)>
The example declaration above declares that the child element
message must occur once, and only once inside the "note"
element.
Declaring at least one occurrence of the same element
<!ELEMENT element-name (child-name+)>
<!ELEMENT note (message+)>
The + sign in the example above declares that the child element
message must occur one or more times inside the "note"
element.
Zero or more, zero or one
Declaring zero or more occurrences of the same element
<!ELEMENT element-name (child-name*)>
<!ELEMENT note (message*)>
The * sign in the example above declares that the child element
message can occur zero or more times inside the "note" element.
Declaring zero or one occurrences of the same element
<!ELEMENT element-name (child-name?)>
<!ELEMENT note (message?)>
The ? sign in the example above declares that the child element
message can occur zero or one times inside the "note" element
Alternative and mixed content
Declaring either/or content
example:<!ELEMENT note (to,from,header,(message|body))>
The example above declares that the "note" element must contain
a "to" element, a "from" element, a "header" element, and either a
"message" or a "body" element.
Declaring mixed content
example:<!ELEMENT note (#PCDATA|to|from|header|message)*>
The example above declares that the "note" element can contain
zero or more occurrences of parsed character data, "to", "from",
"header", or "message" elements
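The sequence, choice and occurrence operators read much like a regular expression over the names of the child elements. The following Python sketch makes that analogy concrete; it is an illustration, not how a real validator is implemented. It handles only element-name models (no #PCDATA, EMPTY or ANY), and model_to_regex and satisfies are hypothetical helpers.

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model):
    """Translate a DTD children model, e.g. "(to,from)", into a regex."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a token "name;" so names cannot run together.
    tokens = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", compact)
    # A comma means "followed by", which is plain concatenation in a regex.
    # The operators | + * ? and the parentheses keep their meaning.
    return tokens.replace(",", "")

def satisfies(model, children):
    """True if the list of child element names matches the content model."""
    sequence = "".join(name + ";" for name in children)
    return re.fullmatch(model_to_regex(model), sequence) is not None

print(satisfies("(message)", ["message"]))
print(satisfies("(message+)", []))
print(satisfies("(message*)", []))
print(satisfies("(message?)", ["message", "message"]))
print(satisfies("(to,from,header,(message|body))", ["to", "from", "header", "body"]))
print(satisfies("(to,from,header,(message|body))", ["from", "to", "header", "body"]))
```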
Declaring attributes
An attribute declaration has the following syntax:
<!ATTLIST element-name attribute-name attribute-type default-value>
Example of a declaration for <payment type="check" />:
<!ATTLIST payment type CDATA "check">
Attribute type
The attribute-type can have the following values:
CDATA          The value is character data
(en1|en2|..)   The value must be one from an enumerated list
ID             The value is a unique id
IDREF          The value is the id of another element
IDREFS         The value is a list of other ids
NMTOKEN        The value is a valid XML name token
NMTOKENS       The value is a list of valid XML name tokens
ENTITY         The value is an entity
ENTITIES       The value is a list of entities
NOTATION       The value is a name of a notation
xml:           The value is a predefined xml value
Default values
The default-value can have the following values:
Value The default value of the attribute
#REQUIRED The attribute value must be included in the element
#IMPLIED The attribute does not have to be included
#FIXED value The attribute value is fixed
Example of attribute declarations
<!ELEMENT PRODUCT ( ………… )>
<!ATTLIST PRODUCT
code ID #REQUIRED
label CDATA #IMPLIED
status (available|unavailable) "available" >
A simple DTD
<!DOCTYPE NEWSPAPER [
<!ELEMENT NEWSPAPER (ARTICLE+)>
<!ELEMENT ARTICLE (HEADLINE,BYLINE,LEAD,BODY,NOTES)>
<!ELEMENT HEADLINE (#PCDATA)>
<!ELEMENT BYLINE (#PCDATA)>
<!ELEMENT LEAD (#PCDATA)>
<!ELEMENT BODY (#PCDATA)>
<!ELEMENT NOTES (#PCDATA)>
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED
EDITOR CDATA #IMPLIED
DATE CDATA #IMPLIED
EDITION CDATA #IMPLIED >
]>
For more examples:
https://www.w3schools.com/xml/xml_dtd_examples.asp
References
https://www.w3schools.com/xml/xml_dtd.asp
https://www.w3schools.com/xml/xml_dtd_examples.asp