XML
Introduction
XML stands for Extensible Markup Language and is a text-based markup language derived from Standard
Generalized Markup Language (SGML). This tutorial will teach you the basics of XML. The tutorial is divided
into sections such as XML Basics, Advanced XML, and XML tools. Each of these sections contains related topics
with simple and useful examples.
XML tags identify the data and are used to store and organize it, whereas HTML tags specify how the data
is displayed. XML is not going to replace HTML in the near future, but
it introduces new possibilities by adopting many successful features of HTML.
There are three important characteristics of XML that make it useful in a variety of systems and solutions −
XML is extensible − XML allows you to create your own self-descriptive tags, or language, that suits
your application.
XML carries the data, does not present it − XML allows you to store the data irrespective of how it
will be presented.
XML is a public standard − XML was developed by an organization called the World Wide Web
Consortium (W3C) and is available as an open standard.
XML Usage
What is Markup?
XML is a markup language that defines a set of rules for encoding documents in a format that is both
human-readable and machine-readable. So what exactly is a markup language? Markup is information added to a
document that enhances its meaning in certain ways, in that it identifies the parts and how they relate to each
other. More specifically, a markup language is a set of symbols that can be placed in the text of a document to
demarcate and label the parts of that document.
The following example shows how XML markup looks when embedded in a piece of text −
This snippet includes the markup symbols, or the tags such as <message>...</message> and <text>... </text>.
The tags <message> and </message> mark the start and the end of the XML code fragment. The tags <text>
and </text> surround the text Hello, world!.
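The snippet described above can be reconstructed from its description. The sketch below is an illustration rather than part of the original tutorial: it assumes Python's standard xml.etree.ElementTree module, parses that markup, and reads back the tag names and the enclosed text.

```python
import xml.etree.ElementTree as ET

# Markup reconstructed from the description in the text.
snippet = "<message><text>Hello, world!</text></message>"

message = ET.fromstring(snippet)  # root element: <message>
text = message.find("text")       # child element: <text>

print(message.tag)  # message
print(text.tag)     # text
print(text.text)    # Hello, world!
```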
A programming language consists of grammar rules and its own vocabulary which is used to create computer
programs. These programs instruct the computer to perform specific tasks. XML does not qualify to be a
programming language as it does not perform any computation or algorithms. It is usually stored in a simple
text file and is processed by special software that is capable of interpreting XML.
SYNTAX
In this chapter, we will discuss the simple syntax rules to write an XML document. Following is a complete
XML document –
You can notice there are two kinds of information in the above example −
Markup, like <contact-info>
The text, or the character data, Tutorials Point and (040) 123-4567.
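As a rough illustration of the two kinds of information, the sketch below separates markup from character data with Python's standard xml.etree.ElementTree module. Only <contact-info> and the two text values come from the tutorial; the child element names <company> and <phone> are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the child element names are assumed.
document = """<?xml version="1.0"?>
<contact-info>
   <company>Tutorials Point</company>
   <phone>(040) 123-4567</phone>
</contact-info>"""

root = ET.fromstring(document)

# Markup: the element names, starting with the root.
markup = [element.tag for element in root.iter()]
# Character data: the text enclosed by each child element.
character_data = [child.text for child in root]

print(markup)
print(character_data)
```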
The following diagram depicts the syntax rules to write different types of markup and text in an XML
document.
XML Declaration
The XML document can optionally have an XML declaration. It is written as follows
Where version is the XML version and encoding specifies the character encoding used in the document.
Syntax Rules for XML Declaration
The XML declaration is case sensitive and must begin with "<?xml" where "xml" is written in
lower-case.
If the document contains an XML declaration, it must be the first statement of the XML
document.
The HTTP protocol can override the value of encoding that you put in the XML declaration.
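The "first statement" rule can be checked with a parser. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the <note> element and the helper name parses are illustrative, not taken from the tutorial. The second document differs only by a blank line placed before the declaration.

```python
import xml.etree.ElementTree as ET

valid = '<?xml version="1.0" encoding="UTF-8"?><note>ok</note>'
# Same document, but a blank line precedes the declaration.
invalid = "\n" + valid

def parses(document):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses(valid))    # True
print(parses(invalid))  # False
```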
An XML file is structured by several XML-elements, also called XML-nodes or XML-tags. The names of
XML-elements are enclosed in angle brackets < > as shown below −
Syntax Rules for Tags and Elements
Element Syntax − Each XML-element needs to be closed, either with a matching end tag
(<element>...</element>) or, for an element without content, with a single self-closing tag (<element/>).
Nesting of Elements − An XML-element can contain multiple XML-elements as its children, but the child
elements must not overlap, i.e., an end tag of an element must have the same name as that of the most recent
unmatched start tag.
The following example shows incorrectly nested tags
Root Element − An XML document can have only one root element. For example, following is not a correct
XML document, because both the x and y elements occur at the top level without a root element
Case Sensitivity − The names of XML-elements are case-sensitive. That means the name of the start and the
end elements need to be exactly in the same case.
For example, <contact-info> is different from <Contact-Info>
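The rules above can be exercised against a real parser. The following sketch assumes Python's standard xml.etree.ElementTree module; the sample documents and the helper name well_formed_error are made up for illustration. Each broken sample violates exactly one of the rules on nesting, root element, or case.

```python
import xml.etree.ElementTree as ET

def well_formed_error(document):
    """Return None if the document parses, else the parser's error message."""
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as error:
        return str(error)

samples = {
    "properly nested": "<company><name>x</name></company>",
    "overlapping tags": "<company><name>x</company></name>",
    "two root elements": "<x>a</x><y>b</y>",
    "case mismatch": "<contact-info></Contact-Info>",
}

for label, document in samples.items():
    # Only the first sample is accepted; the others report an error.
    print(label, "->", well_formed_error(document))
```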
XML Attributes
An attribute specifies a single property for the element, using a name/value pair. An XML-element can have
one or more attributes. For example
Attribute names are written without quotation marks, whereas attribute values must always appear
in quotation marks (single or double). The following example demonstrates incorrect XML syntax
In the above syntax, the attribute value is not defined in quotation marks.
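A small sketch of the quoting rule, assuming Python's standard xml.etree.ElementTree module. The element <a>, the attribute href, and the URL are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Quoted attribute values are accepted, with double or single quotes.
double_quoted = ET.fromstring('<a href="http://www.example.com/">Example</a>')
single_quoted = ET.fromstring("<a href='http://www.example.com/'>Example</a>")
print(double_quoted.attrib)

# The same element with an unquoted value is rejected by the parser.
try:
    ET.fromstring("<a href=http://www.example.com/>Example</a>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(unquoted_accepted)  # False
```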
XML References
References usually allow you to add or include additional text or markup in an XML document. References
always begin with the symbol "&" which is a reserved character and end with the symbol ";". XML has two
types of references −
Entity References − An entity reference contains a name between the start and the end delimiters.
For example, &amp; where amp is the name. The name refers to a predefined string of text and/or
markup.
Character References − These references, such as &#65;, contain a hash mark (“#”)
followed by a number. The number always refers to the Unicode code of a character. In this case, 65
refers to the letter "A".
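Both kinds of reference are resolved by the parser before the application sees the text. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a made-up <text> element:

```python
import xml.etree.ElementTree as ET

# &amp; is an entity reference; &#65; is a character reference.
element = ET.fromstring("<text>Tom &amp; Jerry, grade &#65;</text>")

print(element.text)  # Tom & Jerry, grade A
```

Character references can also be written in hexadecimal, so &#x41; denotes the same character as &#65;.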
XML Text
The names of XML-elements and XML-attributes are case-sensitive, which means the name of start and end
elements need to be written in the same case. To avoid character encoding problems, all XML files should be
saved as Unicode UTF-8 or UTF-16 files.
Whitespace characters like blanks, tabs and line-breaks between XML-attributes inside a tag are not
significant. Whitespace between XML-elements is passed on by the parser, but most applications ignore it.
Some characters are reserved by the XML syntax itself. Hence, they cannot be used directly. To use them, some
replacement-entities are used, which are listed below
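XML predefines five replacement entities for its reserved characters. The sketch below lists them and checks each one with Python's standard library; using xml.sax.saxutils.escape for the conversion is a choice made for this example, not something the tutorial prescribes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five entities predefined by XML and the characters they stand for.
predefined = {"&lt;": "<", "&gt;": ">", "&amp;": "&", "&quot;": '"', "&apos;": "'"}

# The parser turns every entity back into its character.
decoded = {entity: ET.fromstring(f"<t>{entity}</t>").text for entity in predefined}

# escape() replaces &, < and > so the text is safe as element content.
raw = "if a < b & b > c"
escaped = escape(raw)
print(escaped)  # if a &lt; b &amp; b &gt; c

round_trip = ET.fromstring(f"<t>{escaped}</t>").text
```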
XML Documents
An XML document is a basic unit of XML information composed of elements and other markup in an orderly
package. An XML document can contain a wide variety of data, for example, a database of numbers, numbers
representing a molecular structure, or a mathematical equation.
Document Prolog comes at the top of the document, before the root element. This section contains −
XML declaration
Document type declaration
You can learn more about XML declaration in this chapter − XML Declaration
Document Elements are the building blocks of XML. These divide the document into a hierarchy of sections,
each serving a specific purpose. You can separate a document into multiple sections so that they can be
rendered differently, or used by a search engine. The elements can be containers, with a combination of text
and other elements.
XML Declaration
This chapter covers XML declaration in detail. XML declaration contains details that prepare an XML
processor to parse the XML document. It is optional, but when used, it must appear in the first line of the XML
document.
Syntax
Each parameter consists of a parameter name, an equals sign (=), and a parameter value inside quotes.
The following table shows the above syntax in detail
Parameter    Value                        Description
Encoding     UTF-8, UTF-16,               It defines the character encoding used in the
             ISO-10646-UCS-2,             document. UTF-8 is the default encoding used.
             ISO-10646-UCS-4,
             ISO-8859-1 to ISO-8859-9,
             ISO-2022-JP, Shift_JIS,
             EUC-JP
Standalone   yes or no                    It informs the parser whether the document relies on
                                          the information from an external source, such as
                                          external document type definition (DTD), for its
                                          content. The default value is set to no. Setting it
                                          to yes tells the processor there are no external
                                          declarations required for parsing the document.
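To see how a parser reports these parameters, the sketch below uses the XmlDeclHandler callback of Python's standard xml.parsers.expat module. The <root/> element and the handler name on_declaration are placeholders for this illustration.

```python
import xml.parsers.expat

declaration = {}

def on_declaration(version, encoding, standalone):
    # expat reports standalone as 1 ("yes"), 0 ("no") or -1 (not given).
    declaration.update(version=version, encoding=encoding, standalone=standalone)

parser = xml.parsers.expat.ParserCreate()
parser.XmlDeclHandler = on_declaration
parser.Parse(b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><root/>', True)

print(declaration)
```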
Rules
XML Tags
Let us learn about one of the most important parts of XML, the XML tags. XML tags form the foundation of XML.
They define the scope of an element in XML. They can also be used to insert comments, declare settings
required for parsing the environment, and to insert special instructions.
We can broadly categorize XML tags as follows
Start Tag
The beginning of every non-empty XML element is marked by a start-tag. Following is an example of start-tag
End Tag
Every element that has a start tag should end with an end-tag. Following is an example of end-tag
Note that end tags include a solidus ("/") before the name of the element.
Empty Tag
The text that appears between start-tag and end-tag is called content. An element which has no content is
termed as empty. An empty element can be represented in two ways as follows −
A start-tag immediately followed by an end-tag, such as <hr></hr>
A complete empty-element tag, such as <hr />
Empty-element tags may be used for any element which has no content.
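The two ways of writing an empty element are equivalent to a parser. A brief sketch, assuming Python's standard xml.etree.ElementTree module and an illustrative <hr> element:

```python
import xml.etree.ElementTree as ET

paired = ET.fromstring("<hr></hr>")  # start-tag immediately followed by end-tag
compact = ET.fromstring("<hr />")    # empty-element tag

for element in (paired, compact):
    # Neither form has text content or child elements.
    print(element.tag, element.text, len(element))

# Both forms serialize to the same markup.
same_output = ET.tostring(paired) == ET.tostring(compact)
```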
Following are the rules that need to be followed to use XML tags −
Rule 1
XML tags are case-sensitive. Pairing a start tag <address> with the end tag </Address> is an example of wrong
syntax, because the two tags differ in case, which is treated as an error in XML.
Following code shows a correct way, where we use the same case to name the start and the end tag.
Rule 2
XML tags must be closed in an appropriate order, i.e., an XML tag opened inside another element must be
closed before the outer element is closed. For example
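Both rules amount to a stack discipline: every end tag must match the most recently opened tag, including its case. The function below is a deliberately simplified sketch of that idea in Python. It is not how a real XML parser works (it ignores comments, CDATA sections, and namespaces), and the name first_nesting_error and the sample tags are invented for illustration.

```python
import re

# Matches start tags, end tags and empty-element tags (simplified).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def first_nesting_error(markup):
    """Return a message for the first nesting problem, or None if there is none."""
    open_tags = []
    for match in TAG.finditer(markup):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue                      # <name/> opens and closes at once
        if not closing:
            open_tags.append(name)        # start tag: remember it
        elif not open_tags or open_tags.pop() != name:
            return f"unexpected </{name}>"
    if open_tags:
        return f"unclosed <{open_tags[-1]}>"
    return None

print(first_nesting_error("<outer><inner></inner></outer>"))  # None
print(first_nesting_error("<outer><inner></outer></inner>"))  # unexpected </outer>
print(first_nesting_error("<address></Address>"))             # unexpected </Address>
```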
XML Elements
XML elements can be defined as building blocks of an XML document. Elements can behave as containers to hold text,
elements, attributes, media objects or all of these.
Each XML document contains one or more elements, the scope of which is either delimited by start and end
tags, or for empty elements, by an empty-element tag.
Syntax
where,
element-name is the name of the element. The name and its case in the start and end tags must match.
attribute1, attribute2 are attributes of the element separated by white spaces. An attribute defines
a property of the element. It associates a name with a value, which is a string of characters. An
attribute is written as
name is followed by an = sign and a string value inside double(" ") or single(' ') quotes.
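The parts of the element syntax can be assembled programmatically. This is a sketch assuming Python's standard xml.etree.ElementTree module; the element name address, the attributes type and city, and all values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical names: "address", "type" and "city" are illustrative only.
address = ET.Element("address", {"type": "office", "city": "Hyderabad"})
address.text = "Illustrative content"

# Start tag with attributes, content, then the matching end tag.
serialized = ET.tostring(address, encoding="unicode")
print(serialized)

# An element with no content is written as an empty-element tag.
empty = ET.tostring(ET.Element("br"), encoding="unicode")
print(empty)  # <br />
```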
Empty Element
XML Attributes
This chapter describes the XML attributes. Attributes are part of XML elements. An element can have multiple
unique attributes. Attributes give more information about XML elements. To be more precise, they define
properties of elements. An XML attribute is always a name-value pair.
Syntax
Attributes are used to distinguish among elements of the same name, when you do not want to create a new
element for every situation. Hence, the use of an attribute can add a little more detail in differentiating two or
more similar elements.
In the above example, we have categorized the plants by including the attribute category and assigning different
values to each of the elements. Hence, we have two categories of plants, one flowers and the other shrubs. Thus,
we have two plant elements with different attributes.
You can also observe that we have declared this attribute at the beginning of XML.
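The plant example can be approximated as follows. The <garden> wrapper and the plant names are assumptions; only the plant element, the category attribute, and the values flowers and shrubs come from the text. The sketch uses Python's standard xml.etree.ElementTree module to select elements by attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built around the plant/category description.
garden = ET.fromstring("""
<garden>
   <plant category="flowers">rose</plant>
   <plant category="shrubs">boxwood</plant>
</garden>
""")

# Same element name, told apart by the value of the category attribute.
flowers = [p.text for p in garden.findall("plant[@category='flowers']")]
shrubs = [p.text for p in garden.findall("plant[@category='shrubs']")]

print(flowers)  # ['rose']
print(shrubs)   # ['boxwood']
```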
Attribute Types
Attribute Type   Description
StringType       It takes any literal string as a value. CDATA is a StringType. CDATA is character data.
                 This means, any string of non-markup characters is a legal part of the attribute.
TokenizedType    This is a more constrained type. The validity constraints noted in the grammar are
                 applied after the attribute value is normalized. The TokenizedType attributes are
                 given as −
                 ID − It is used to specify the element as unique.
                 IDREF − It is used to reference an ID that has been named for another
                 element.
                 IDREFS − It is used to reference several IDs, given as a space-separated list.
                 ENTITY − It indicates that the attribute will represent an external entity in
                 the document.
                 ENTITIES − It indicates that the attribute will represent external entities in
                 the document.
                 NMTOKEN − It is similar to CDATA with restrictions on what data can be part
                 of the attribute.
                 NMTOKENS − It is a space-separated list of NMTOKEN values, each with the same
                 restrictions.
EnumeratedType   This has a list of predefined values in its declaration, out of which it must assign one
                 value. There are two types of enumerated attribute: NotationType and Enumeration.
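Attribute types are declared in a DTD with an ATTLIST declaration. The sketch below uses the AttlistDeclHandler callback of Python's standard xml.parsers.expat module to read the declared types back. expat is a non-validating parser, so it reports the declarations without enforcing them. The document, the attribute names, and the handler name on_attlist are all hypothetical.

```python
import xml.parsers.expat

# Hypothetical document with an internal DTD declaring three attributes.
document = b"""<?xml version="1.0"?>
<!DOCTYPE garden [
  <!ELEMENT garden (plant*)>
  <!ELEMENT plant (#PCDATA)>
  <!ATTLIST plant
      id       ID      #REQUIRED
      category CDATA   #IMPLIED
      code     NMTOKEN #IMPLIED>
]>
<garden><plant id="p1" category="flowers" code="A-1">rose</plant></garden>"""

declared_types = {}

def on_attlist(element_name, attribute_name, attribute_type, default, required):
    # Called once for every attribute declared in an ATTLIST.
    declared_types[attribute_name] = attribute_type

parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist
parser.Parse(document, True)

print(declared_types)
```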
XML – DTD
The XML Document Type Definition, commonly known as DTD, is a way to describe an XML language precisely.
DTDs check the vocabulary and the validity of the structure of XML documents against the grammatical rules of
the appropriate XML language.
An XML DTD can be either specified inside the document, or it can be kept in a separate document and then
linked separately.
Syntax
Basic syntax of a DTD is as follows
Internal DTD
A DTD is referred to as an internal DTD if elements are declared within the XML file. In that case, the
standalone attribute in the XML declaration can be set to yes, which means the declaration works
independently of an external source.
Syntax
Following is the syntax of internal DTD
where root-element is the name of root element and element-declarations is where you declare the elements.
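As a hedged illustration of this syntax, the sketch below feeds a document with an internal DTD to Python's standard xml.parsers.expat module and inspects the document type declaration through the StartDoctypeDeclHandler callback. The address document, its element names, and its values are made up for this example.

```python
import xml.parsers.expat

# Hypothetical document: the DTD sits between [ and ] inside <!DOCTYPE ...>.
document = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE address [
  <!ELEMENT address (name, phone)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
]>
<address><name>Example Name</name><phone>(040) 123-4567</phone></address>"""

doctype = {}

def on_doctype(name, system_id, public_id, has_internal_subset):
    # name is the root element; has_internal_subset is true for an internal DTD.
    doctype.update(root=name, internal=bool(has_internal_subset))

parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.Parse(document, True)

print(doctype)
```

Because expat does not validate, the element declarations are read but not enforced; checking a document against its DTD requires a validating parser.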
Example
Following is a simple example of internal DTD
Let us go through the above code −
Start Declaration − Begin the XML declaration with the following statement.