BASIC XML SYNTAX
 XML markup describes and provides structure to the content of an XML document
or data packet.
 The tag markup syntax of XML is very similar to HTML (both are based upon SGML), with angle
brackets used to delimit tags.
 All tags begin with a less-than sign (<) and end with a greater-than sign (>).
 Unlike HTML, XML is case-sensitive, including element tags and attribute values, that is:
 <Invoice> ≠ <INVOICE> ≠ <invoice> ≠ <INvoice>
 Characters
 Because XML is intended for worldwide use, characters are not limited to the 7-bit ASCII
character set. XML uses most of the characters that are defined in the 16-bit Unicode character
set (currently congruent with ISO/IEC 10646). There are two Unicode formats that are used as
the basis of XML characters: UTF-8 and UTF-16. XML allows the use of almost any character
encoding that can be mapped to Unicode (such as EBCDIC, Big5, etc.). There are numerous
other character encodings that can be used with some XML tools, but UTF-8 and UTF-16
support is required of all XML processors.
 The current Unicode specification can be found at: http://www.unicode.org, and
ISO/IEC 10646 documentation can be ordered at http://www.iso.ch. The UTF
acronym can mean "Unicode Transformation Format" (according to Unicode), or
"UCS Transformation Format" (in IEC or IETF documents) - essentially they mean the
same thing, since Unicode and ISO/IEC 10646 are nearly identical.
 UTF-8 is commonly used in North America and Europe, since the first 128
character values map directly to 7-bit US-ASCII (conversely any 7-bit ASCII string is
valid UTF-8). UTF-8 is a multi-byte encoding, with character values represented in
one to six bytes as originally defined in RFC 2279 (the later RFC 3629 limits it to four).
This encoding is less popular in Asia, since most Asian characters and ideographs
need three bytes each in UTF-8, compared with two in UTF-16.
 UTF-8 is described at: http://www.ietf.org/rfc/rfc2279.txt
 The UTF-16 encoding uses 16-bit values for characters, with the full range of 65,536 possible 16-bit
values being split into two parts. There are 63,486 values available to represent single 16-bit
character values. The other 2,048 values are reserved to provide paired 16-bit code values for an
additional 1,048,544 character values. These are called surrogate pairs, but at the time of writing
none of these values were in use.
 UTF-16 is described at: http://www.ietf.org/rfc/rfc2781.txt
 These are relatively new standards, and so much of the world's text isn't yet stored in Unicode.
However, Unicode was designed to be a superset of most existing character encodings, and so the
conversion of legacy data to Unicode is straightforward. For example, converting ASCII to the UTF-16
form of Unicode merely requires stuffing a zero into the high-order byte of the 16-bit
character, and simply preserving the low-order byte as is. Of course, this means that twice
the storage space is required, compared to the same text in ASCII. As noted above, 7-bit
ASCII doesn't even need conversion to be treated as the UTF-8 encoding.
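The byte-level claims above can be checked with a short Python sketch. The sample strings are hypothetical, chosen only for illustration, and big-endian UTF-16 is assumed so that no byte-order mark is written.

```python
# Hypothetical sample strings, chosen only for illustration.
ascii_text = "Invoice"
cjk_char = "\u4e2d"            # a CJK ideograph inside the 16-bit range
supplementary = "\U00010000"   # first code point beyond the 16-bit range

# 7-bit ASCII bytes are already valid UTF-8, byte for byte.
utf8_bytes = ascii_text.encode("utf-8")
same_as_ascii = utf8_bytes == ascii_text.encode("ascii")

# Big-endian UTF-16 places a zero high-order byte before each ASCII byte.
utf16_bytes = ascii_text.encode("utf-16-be")
high_bytes = utf16_bytes[0::2]
low_bytes = utf16_bytes[1::2]

# Encoded lengths in bytes.
cjk_utf8_len = len(cjk_char.encode("utf-8"))
cjk_utf16_len = len(cjk_char.encode("utf-16-be"))
surrogate_pair = supplementary.encode("utf-16-be")  # two 16-bit code values
```

The UTF-16 form of the ASCII text is exactly twice as long as the UTF-8 form, and the supplementary character occupies one surrogate pair of four bytes.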
SPECIAL MARKUP CHARACTERS
 Five characters have special meaning in XML mark-up:
 < - Less-than sign (left angle bracket)
 > - Greater-than sign (right angle bracket)
 & - Ampersand
 ' - Apostrophe (single quotation mark)
 " - Quotation mark (double quotation mark)
 Use &lt; for <
 Use &gt; for >
 Use &amp; for &
 Use &apos; for '
 Use &quot; for "
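As a minimal sketch of these five replacements, Python's standard xml.sax.saxutils module can be used. Its escape() function handles &, < and > by default; the two quote characters are supplied here as an extra mapping. The sample string is hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quote characters are
# passed in explicitly as additional replacements.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}

raw = """<a href="x">Tom & Jerry's</a>"""   # hypothetical sample text
escaped = escape(raw, QUOTES)
restored = unescape(escaped, UNQUOTES)
print(escaped)
```

Unescaping the result gives back the original string.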
ELEMENTS
 An element is XML's basic container for content - it may contain character data,
other elements, and/or other markup (comments, PIs, entity references, etc.).
Since they represent discrete objects, elements can be thought of as the "nouns"
of XML.
 Elements are delimited with a start-tag and an end-tag. If an element has no
content, it is known as an empty element, and may be represented with either a
start-tag/end-tag pair or using an abbreviation: the empty-element tag. Unlike
the looser syntax of HTML and SGML, the end-tag cannot be omitted, except
when using an empty-element tag.
 All three types of tags are shown in this example:
 <html> <!-- start-tag -->
 <img src="logo.png" /> <!-- empty-element tag -->
 </html> <!-- end-tag -->
 Each of these tags consists of the element type name (this must be a valid XML
name) enclosed within a pair of angle brackets (< >). Let's look at XML tags in more detail.
TAGS
 The opening delimiter of an element is called the start-tag. Start-tags consist of an element type name, and
perhaps some attributes (which we'll look at later in this chapter), enclosed within a pair of angle brackets.
 We can think of start-tags as "opening" a container - which is then "closed" with an end-tag. End-tags consist of a
forward slash (/) followed by an element type name, enclosed within the usual angle brackets.
 The name in an end-tag must match the element name in the corresponding start-tag. Everything between the start-tag
and the end-tag of an element is contained within that element. The following are legal pairs of start- and end-tags:
 <Invoice> ... </Invoice>
 <INVOICE> ... </INVOICE>
 <INVOICE > ... </INVOICE >
 <Wrox:Invoice> ... </Wrox:Invoice>
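The matching rule can be tried out with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree; the helper name is_well_formed is an assumption made for this illustration, not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

matching = is_well_formed("<Invoice>...</Invoice>")
trailing_space = is_well_formed("<INVOICE >...</INVOICE >")
case_mismatch = is_well_formed("<Invoice>...</INVOICE>")   # names differ in case
missing_end = is_well_formed("<Invoice>...")               # end-tag omitted
```

The first two inputs are accepted, while the case mismatch and the missing end-tag are both rejected.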
EMPTY-ELEMENT TAGS
 Empty elements are those that have no content, though there may be associated
attributes. Let's say that we wanted to explicitly indicate certain points within our
XML data (see the next section). We could just add a start- and end-tag pair
without any text between, for example:
 <point></point>
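The abbreviated empty-element tag form of the same element is <point/>. A minimal sketch with Python's xml.etree.ElementTree suggests that a parser reports both spellings identically; the x and y attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

pair = ET.fromstring("<point></point>")
abbreviated = ET.fromstring("<point/>")
with_attrs = ET.fromstring('<point x="1" y="2"/>')   # hypothetical attributes

# Both spellings yield an element with no text and no children.
same = (
    pair.tag == abbreviated.tag
    and pair.text is None and abbreviated.text is None
    and len(pair) == 0 and len(abbreviated) == 0
)
attrs = with_attrs.attrib
```

An empty element can still carry attributes, as the third case shows.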
THE STRUCTURE OF XML DATA
 All XML data must conform to both syntax requirements and a simple container
structure. Such data is known as well formed (see relevant section later in this
chapter for more details). All well-formed XML documents consist of
one to three parts:
 An optional prolog, which may contain important information about the rest of the data.
 The body, which consists of one or more elements in the form of a hierarchical tree.
 An optional "miscellaneous" epilog that follows the element tree.
 These parts, and the unfamiliar syntax in the following illustration, will be
described in greater detail later in this chapter.
 Prolog
 <?xml version="1.0"?>
 <!-- Comments and/or PIs allowed here -->
 <!DOCTYPE textfile SYSTEM "http://www.mySite.com/MyDTDs/Textfile.dtd">
 <!-- Comments and/or PIs allowed here -->
 <textfile>
 <line>A Simple Example</line>
 <line>by Yours Truly</line>
 <line>This is the 3rd line of a simple 5-line text file.</line>
 <line>..the middle line..</line>
 <line>And lastly, a final line of text.</line>
 <EOF/>
 </textfile>
 The body sub-tree always has a single root node called the document element (sometimes referred to as the root element) - if not, the data is not well-formed XML!
 Any well-formed XML document must be a simple hierarchical tree with a single root node, called the "document root". This document tree contains a secondary tree of elements, with its own singular root node, called the "document element".
 The document root of each XML document is also the main point of attachment for the document's description using a DTD or Schema (see Chapters 5 and 6 for more about these). A Processing Instruction (PI - more about these later) is often used to attach a stylesheet as well (see Chapter 9).
 Since well-formed XML data has a tree structure, it can be modeled and manipulated as a tree. A standard model for this approach is the W3C Document Object Model (DOM), which will be discussed in Chapter 11.
 Now let's look at the body of the XML document in greater depth.
 The Document Element
 This element is the parent of all other elements in the tree, and thus it may not be contained in any other element. Because the document root and the document element are not the same thing, it is better not to refer to the document element as the "root element" (even though it is the root of the element sub-tree).
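The distinction between document root and document element can be observed in a DOM tree. The sketch below uses Python's xml.dom.minidom, on the assumption that its Document node plays the role of the document root; the sample data is shortened from the textfile illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    "<!-- prolog comment -->"
    "<textfile><line>A Simple Example</line><EOF/></textfile>"
    "<!-- epilog comment -->"
)

# The Document node holds the comments as well as the one document element.
root_children = [node.nodeName for node in doc.childNodes]
document_element = doc.documentElement.tagName

# A second top-level element breaks well-formedness.
try:
    minidom.parseString("<textfile/><textfile/>")
    second_root_accepted = True
except ExpatError:
    second_root_accepted = False
```

The document root has three children here, only one of which is an element.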
 String Literals
 String literals are used for the values of attributes, internal entities, and external identifiers. All string literals in XML are enclosed by delimiter pairs, using either an apostrophe (') or a quotation mark ("). The one restriction upon these literals is that the character used for the delimiters may not appear within the literal - if an apostrophe appears in the literal, the quotation mark delimiter must be used, and vice versa.
 "string"
 'string'
 "..Jack's cow said &quot;moo&quot;"
 '..Jack&apos;s cow said "moo"'
ATTRIBUTES
 If elements are the "nouns" of XML, then attributes are its "adjectives".
 Often there is some information about an element that we wish to attach to it, as opposed to including it as a string inside the element, or one of its children. This can be done using attributes, each of which consists of a name-value pair. Both start-tags and empty-element tags may include attributes within the tag. Attribute values must always be string literals, so either of the two delimiters may enclose the value.
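A minimal sketch of both delimiters, reusing the "Jack's cow" literals from the String Literals examples. The <cow> element and its says attribute are hypothetical names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Quotation-mark delimiters: the apostrophe is literal, quotes are escaped.
double_quoted = ET.fromstring('<cow says="Jack\'s cow said &quot;moo&quot;"/>')

# Apostrophe delimiters: the quotes are literal, the apostrophe is escaped.
single_quoted = ET.fromstring("<cow says='Jack&apos;s cow said \"moo\"'/>")

value_a = double_quoted.get("says")
value_b = single_quoted.get("says")
```

Both forms hand the application the same attribute value.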
ELEMENTS VS. ATTRIBUTES
 The decision to use an element versus an attribute is not a simple one. Much discussion and argument has occurred about this topic on both the XML-L and XML-DEV lists. Some argue that attributes should never be used - that they add unnecessary processing complexity, and that anything that can be represented as an attribute would be better contained within a child element.
 Others extol the advantage of being able to validate attribute values and assign default values using a DTD. Despite superficial appearances, experiments with generic data compression (such as gzip, zlib, or LZW) have shown that neither form has an inherent advantage for data storage or transmission.
CHARACTER DATA
 Character data is plain text that contains no element tags or other markup, except perhaps character and entity references.
 Remember too that, because XML is intended for worldwide use, text means Unicode, not just ASCII (see the "Characters" section earlier in this chapter).
 The ampersand (&) and less-than (<) characters are used as XML's opening delimiters, and thus may never appear in their literal form (except in CDATA sections, which are discussed later). If these characters are needed within character data, they must be escaped using the entity references &amp; or &lt;. It is not necessary to escape the other markup characters (like >), but they may be escaped (using &gt; in this case), if only for the sake of consistency within the character data.
 These escape sequences are part of the set of five such strings defined by the XML specification, and implemented in all compliant XML parsers.
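These escaping rules can be seen in action with a short sketch, using Python's xml.etree.ElementTree. The <note> element and the helper name parses are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Entity references are resolved; a literal > needs no escaping.
note = ET.fromstring("<note>AT&amp;T: 1 &lt; 2, and 3 > 2</note>")
text = note.text

# A bare ampersand or less-than sign in character data is rejected.
bare_ampersand_ok = parses("<note>AT&T</note>")
bare_less_than_ok = parses("<note>1 < 2</note>")
```

The application receives the unescaped characters, never the entity references themselves.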
WHITESPACE
 Whitespace is an important linguistic concept for both human and computer
languages. Only four characters are treated as whitespace in XML data:
horizontal tab (#x9), line feed (#xA), carriage return (#xD), and space (#x20).
 XML's rule for handling whitespace is very simple: all whitespace characters
(except for the CR character) within the content are preserved by the parser and
passed unmodified to the application, while whitespace within element tags and
attribute values may be removed. This is unlike the rampant removal of
whitespace carried out in HTML browsers.
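As a sketch of this behavior, Python's xml.etree.ElementTree (built on the Expat parser) shows content whitespace surviving and attribute whitespace being normalized. The <line> element and note attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Whitespace in element content reaches the application unchanged,
# except that a CR LF pair is normalized to a single LF.
content = ET.fromstring(
    "<line>  two  spaces\tand a tab\r\nnext line </line>"
).text

# Literal tabs and line breaks inside an attribute value become spaces.
attr = ET.fromstring('<line note="a\tb\nc"/>').get("note")
```

Leading, trailing and repeated spaces in the content are all kept.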
SPECIAL-PURPOSE MARKUP
 We've already discussed just about every aspect of XML syntax that is necessary
to create well-formed XML data (elements, attributes, and character/entity
references). There are three additional syntactic constructs that deviate from the
familiar syntax of tags (<tagname>) or entity references (&ref;). These are:
 Comments
 Processing Instructions (PIs)
 CDATA sections
COMMENTS
 It is often useful to insert notes, or comments, into a document. These comments
might provide a revision log, historical notes, or any other sort of meta-data that
would be meaningful to the creator and editors of a document (serving to
enhance its human readability), but aren't truly part of the document's content.
Comments may appear anywhere in a document outside of other markup (that is,
you can't put a comment in the middle of a start- or end-tag).
 The basic syntax of an XML comment is:
 <!--...comment text...-->
PROCESSING INSTRUCTIONS (PIS)
 XML, like SGML, is a descriptive markup language, and so it does not presume to try to explain how to actually process an element or its contents. This is a powerful advantage in that it provides presentation flexibility, and OS- and application-independence. However, there are times when it is desirable to pass processing hints (or perhaps some script code) to the application along with the document. The Processing Instruction (PI) is the mechanism that XML provides for this purpose.
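A PI is written as <?target data?>, where the target names the application the instruction is meant for. The sketch below reads a stylesheet PI with Python's xml.dom.minidom; the file name style.xsl and the <doc> element are hypothetical.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?><doc/>'
)

# The PI precedes the document element, so it is the first child
# of the Document node.
pi = doc.firstChild
is_pi = pi.nodeType == Node.PROCESSING_INSTRUCTION_NODE
target, data = pi.target, pi.data
```

The parser passes the data through as an uninterpreted string; making sense of it is left to the application.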
CDATA SECTIONS
 CDATA sections are a method of including text that contains characters that
would otherwise be interpreted as markup. This feature is primarily useful to
authors who wish to include examples of XML markup in their documents (like the
examples in this book). This is probably the only good reason to include CDATA
sections in a document, since almost all advantages of XML are lost when using
these sections.
 The basic syntax of a CDATA section is:
 <![CDATA[...]]>
 For example, everything inside this CDATA section is taken literally:
 <![CDATA[&Warn; - &Disclaimer; &lt;&copy; 2001 &USCG; &amp; &USN; &gt; ]]>
 Without a CDATA section, the same character data has to be escaped:
 <example>&amp;Warn; - &amp;Disclaimer; &amp;lt;&amp;copy; 2001 &amp;USCG; &amp;amp; &amp;USN; &amp;gt; </example>
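The equivalence between a CDATA section and escaped character data can be checked with a small sketch in Python's xml.etree.ElementTree. The sample content is hypothetical and shorter than the example above.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring(
    "<example><![CDATA[<line>Tom & Jerry</line>]]></example>"
)
with_escapes = ET.fromstring(
    "<example>&lt;line&gt;Tom &amp; Jerry&lt;/line&gt;</example>"
)

cdata_text = with_cdata.text
escaped_text = with_escapes.text
child_count = len(with_cdata)   # the <line> inside CDATA is text, not an element
```

Both documents give the application the same string, and neither contains a child element.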
DOCUMENT STRUCTURE
 Prolog
 The prolog is the appetizer - used to signal the beginning of XML data. It describes the data's
character encoding, and provides some other configuration hints to the XML parser and
application.
 XML Declaration
 All XML documents should begin with an XML Declaration. This declaration is not required in
most XML documents, but it serves to explicitly identify the data as XML, and does permit
some optimizations when processing the document. If the XML data uses an encoding other
than UTF-8 or UTF-16, then an XML Declaration with the correct encoding must be used.
 If this declaration is included, then the string literal "<?xml " must be the very first six characters
of the document - no preceding whitespace or embedded comments are allowed.
 While this declaration looks exactly like a processing instruction, strictly speaking
it is not a PI (it is a unique declaration defined by the XML 1.0 REC). Nevertheless,
the XML Declaration uses PI-like delimiters and an attribute-like parameter syntax
that is similar to the one used in element tags (either " or ' may be used to delimit
the value strings). For example:
 <?xml version="1.0" encoding='utf-8' standalone="yes"?>
 <?xml version='1.0' encoding='utf-8'?>
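The role of the encoding parameter and the "very first characters" rule can be sketched with Python's xml.etree.ElementTree. The <w> element and the helper name parse_text are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

def parse_text(data):
    """Return the document element's text, or None if the data is rejected."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# Byte 0xE9 is "é" in ISO-8859-1 but not a valid UTF-8 sequence by itself.
declared = parse_text(
    b'<?xml version="1.0" encoding="iso-8859-1"?><w>caf\xe9</w>'
)
undeclared = parse_text(b"<w>caf\xe9</w>")

# Whitespace in front of the declaration is not allowed.
leading_space = parse_text(b' <?xml version="1.0"?><w>cafe</w>')
```

Only the first document is accepted: the second is read as UTF-8 by default and fails, and the third breaks the rule that nothing may precede the declaration.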
DOCUMENT TYPE DECLARATION
 This should not be confused with the DTD (Remember: Document Type
Definition)! Rather, the Document Type Declaration can refer to an external DTD
and/or contain part of the DTD.
 Body
 This is, of course, the main course of the XML data, which we've discussed at length in terms of its components: elements, attributes, character data, etc. It is worth reiterating that the body may contain comments, PIs, and/or whitespace characters interleaved with elements and character data. The elements must comprise a hierarchical tree, with a single root node.
EPILOG
 The XML epilog is the dessert with potentially unpleasant consequences! It may
include comments, PIs, and/or whitespace. Comments and whitespace don't cause
any significant problems. However, it is unclear whether PIs in the epilog should
be applied to the elements in the preceding XML data, or a subsequent XML
document (if any). This may well be a solution in search of a problem, or it may
just be a problem in and of itself. XML does not define any end-of-document
indicator, and many applications will use the document element end-tag for this
purpose. In this case, the epilog is never read, let alone processed.
 This is a "real design error" as considered by Tim Bray (one of the XML 1.0 REC
editors). It is probably inadvisable to use it without a very compelling reason - and
the prior knowledge that it will likely not be interoperable with other XML
applications.
VALID XML
 Any XML data object is considered valid XML if it is well formed, and it meets
certain further validity constraints and matches a grammar describing the
document's content. Like SGML, XML can provide such a description of document
structure in the form of an XML Schema or a DTD.
 The SGML equivalent of a well-formed document is known as tag-valid. The SGML
equivalent of a valid document is type-valid.
XML PARSERS
 In addition to specifying the syntax of XML, the W3C described some of the
behavior of the lower tier of XML's client architecture (the XML processor or
parser).
 Parser Levels
 Two levels of parser ("processor") behavior are defined in the XML 1.0 REC:
 Non-validating - ensures that the data is well-formed XML, but need not resolve any external resources
 Validating - ensures both well-formedness and validity using a DTD, and must resolve external resources
 Parser Implementations
 There are two different implementation approaches to processing the XML data:
 Event-driven parser - Processes XML data sequentially, handling components one at a time
 Tree-based parser - Constructs a tree representation of the entire document and provides access to individual nodes in the tree (can be constructed on top of an event-driven parser)
 Much quasi-religious argument has occurred about this dichotomy, but each approach has its merits. Like so many other real-world problems, XML processing may have vastly different requirements, and thus different approaches may be best for different situations.
EVENT-DRIVEN PARSERS
 The event-driven model should be quite familiar to programmers of modern GUI
interfaces and operating systems. In this case, the XML parser executes a call-back
to the application for each component of the XML data: element (with
attributes), character data, processing instructions, notation, or comments. It's up
to the application to handle the XML data as it is provided via the call-backs - the
XML parser does not maintain the element tree structure, or any of the data after
it has been parsed. The event-driven method requires very modest system
resources, even for extremely large documents; and because of its simple, low-level
access to the structure of the XML data, provides great flexibility in handling
the data within the XML application.
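The call-back style described above can be sketched with Python's xml.sax module, used here as one example of an event-driven interface. The EventLogger class and the n attribute are hypothetical names introduced for this illustration.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records each call-back the parser makes, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def characters(self, content):
        if content.strip():              # skip whitespace-only chunks
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(
    b'<textfile><line n="1">A Simple Example</line><EOF/></textfile>',
    handler,
)
events = handler.events
```

The parser keeps no tree: anything the application wants to remember, such as the events list here, has to be stored by the handler itself.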
TREE-BASED PARSERS
 One of the most widely used structures in software engineering is the simple
hierarchical tree. All well-formed XML data is defined to be such a tree, and thus
common and mature algorithms may be used to traverse the nodes of an XML
document, search for content, and/or edit the document tree. These tree
algorithms have the advantage of years of academic and commercial
development.
 XML parsers that use this approach generally conform to the W3C's Document
Object Model (DOM). The DOM is a platform- and language-neutral interface that
allows manipulation of tree-structured documents. On the other hand, the DOM
tree must be built in memory before the document can be manipulated - high-performance
virtual memory support is imperative for larger documents! Once
the tree is built, an application may access the DOM via a related API.
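A minimal sketch of the tree-based approach, using Python's xml.dom.minidom as one lightweight DOM implementation. The sample data is shortened from the textfile illustration, and the appended line is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<textfile><line>A Simple Example</line>"
    "<line>by Yours Truly</line></textfile>"
)

# Traverse: collect the text of every <line> element.
lines = [el.firstChild.data for el in doc.getElementsByTagName("line")]

# Edit: append a new <line> element to the document element.
new_line = doc.createElement("line")
new_line.appendChild(doc.createTextNode("A final line"))
doc.documentElement.appendChild(new_line)

serialized = doc.documentElement.toxml()
```

Because the whole tree is held in memory, nodes can be visited in any order and edited in place, at the cost the paragraph above describes for large documents.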

More Related Content

PPTX
Basic XML
PDF
XML Introduction
PDF
SQL Server - Querying and Managing XML Data
PPT
Introduction to XML
PPT
01 Xml Begin
Basic XML
XML Introduction
SQL Server - Querying and Managing XML Data
Introduction to XML
01 Xml Begin

What's hot (20)

PPT
Introduction to XML
PPT
Xml Java
PPTX
XML Introduction
PDF
Xml tutorial
 
PPT
XML and DTD
PPT
Xml 215-presentation
PPT
Introduction to XML
PPTX
Extensible Markup Language (XML)
PPTX
Xml ppt
PPTX
Fergus Fahey - DRI/ARA(I) Training: Introduction to EAD - Introduction to XML
PPTX
Introduction to xml
PPTX
Xml dtd
PPTX
XML, DTD & XSD Overview
PPT
Introduction to XML
PPTX
XML | Computer Science
PPTX
XML-Extensible Markup Language
Introduction to XML
Xml Java
XML Introduction
Xml tutorial
 
XML and DTD
Xml 215-presentation
Introduction to XML
Extensible Markup Language (XML)
Xml ppt
Fergus Fahey - DRI/ARA(I) Training: Introduction to EAD - Introduction to XML
Introduction to xml
Xml dtd
XML, DTD & XSD Overview
Introduction to XML
XML | Computer Science
XML-Extensible Markup Language
Ad

Similar to Basic xml syntax (20)

PPTX
PPTX
Sgml and xml
PPTX
Web Development Course - XML by RSOLUTIONS
PDF
xml introduction in web technologies subject
PPTX
PDF
XMLin Web development and Applications.pdf
PPTX
PHP XML
DOC
Web Technology XML Attributes and elementsUnit 3.doc
PPT
PPT
XML Presentation-2
PPTX
Unit 5 xml (1)
PDF
XML for Web Development Information tech.pdf
PPT
Introduction to xml
PDF
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XML
PPTX
DOCX
Oracle soa xml faq
PPTX
xml.pptx
PPTX
Sgml and xml
Web Development Course - XML by RSOLUTIONS
xml introduction in web technologies subject
XMLin Web development and Applications.pdf
PHP XML
Web Technology XML Attributes and elementsUnit 3.doc
XML Presentation-2
Unit 5 xml (1)
XML for Web Development Information tech.pdf
Introduction to xml
M.FLORENCE DAYANA WEB DESIGN -Unit 5 XML
Oracle soa xml faq
xml.pptx
Ad

More from Raghu nath (20)

PPTX
Mongo db
PDF
Ftp (file transfer protocol)
PDF
MS WORD 2013
PDF
Msword
PDF
Ms word
PDF
Javascript part1
PDF
Regular expressions
PDF
Selection sort
PPTX
Binary search
PPTX
JSON(JavaScript Object Notation)
PDF
Stemming algorithms
PPTX
Step by step guide to install dhcp role
PPTX
Network essentials chapter 4
PPTX
Network essentials chapter 3
PPTX
Network essentials chapter 2
PPTX
Network essentials - chapter 1
PPTX
Python chapter 2
PPTX
python chapter 1
PPTX
Linux Shell Scripting
PPTX
Mongo db
Ftp (file transfer protocol)
MS WORD 2013
Msword
Ms word
Javascript part1
Regular expressions
Selection sort
Binary search
JSON(JavaScript Object Notation)
Stemming algorithms
Step by step guide to install dhcp role
Network essentials chapter 4
Network essentials chapter 3
Network essentials chapter 2
Network essentials - chapter 1
Python chapter 2
python chapter 1
Linux Shell Scripting

Recently uploaded (20)

PPTX
Institutional Correction lecture only . . .
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
master seminar digital applications in india
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
Basic Mud Logging Guide for educational purpose
PDF
01-Introduction-to-Information-Management.pdf
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
Computing-Curriculum for Schools in Ghana
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
Pharma ospi slides which help in ospi learning
PDF
Complications of Minimal Access Surgery at WLH
Institutional Correction lecture only . . .
Microbial diseases, their pathogenesis and prophylaxis
master seminar digital applications in india
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Basic Mud Logging Guide for educational purpose
01-Introduction-to-Information-Management.pdf
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
O7-L3 Supply Chain Operations - ICLT Program
STATICS OF THE RIGID BODIES Hibbelers.pdf
Pre independence Education in Inndia.pdf
Anesthesia in Laparoscopic Surgery in India
Computing-Curriculum for Schools in Ghana
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Final Presentation General Medicine 03-08-2024.pptx
FourierSeries-QuestionsWithAnswers(Part-A).pdf
2.FourierTransform-ShortQuestionswithAnswers.pdf
human mycosis Human fungal infections are called human mycosis..pptx
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Pharma ospi slides which help in ospi learning
Complications of Minimal Access Surgery at WLH

Basic xml syntax

  • 2.  XML markup describes and provides structure to the content of an XML document or data packet.
  • 3.  The tag markup syntax of XML is very similar to HTML (both are based upon SGML), with angle brackets used to delimit tags.  All tags begin with a less-than sign (<) and end with a greater-than sign (>).  Unlike HTML, XML is case-sensitive, including element tags and attribute values, that is:<Invoice> ( <INVOICE>  ( <invoice> ( <INvoice>  Characters  Because XML is intended for worldwide use, characters are not limited to the 7-bitASCII character set. XML uses most of the characters that are defined in the 16-bit Unicode character set (currently congruent with ISO/IEC 10646).There are two Unicode formats that are used as the basis of XML characters: UTF-8 and UTF-16. XML allows the use of almost any character encoding that can be mapped to Unicode (such as EBCDIC, Big5, etc.).There are numerous other character encodings that can be used with some XML tools, but UTF-8 and UTF-16 support is required of all XML processors.
  • 4.  The current Unicode specification can be found at: http://www.unicode.org, and ISO/IEC 10646 documentation can be ordered at http://www.iso.ch.The UTF acronym can mean "UnicodeTransformation Format" (according to Unicode), or  "UCSTransformation Format" (in IEC or IETF documents) - essentially they mean the same thing, since Unicode and ISO/IEC 10646 are nearly identical.  UTF-8 is commonly used in North America and Europe, since the first 128 character values map directly to 7-bit US-ASCII (conversely any 7-bit ASCII string is valid UTF-8). UTF-8 is a multi-byte encoding, with character values represented in one to six bytes.This encoding is less popular in Asia, since most Asian characters and ideographs require the longest encoded forms.
  • 5.  UTF-8 is described at: http://www.ietf.org/rfc/rfc2279.txt The UTF-16 encoding uses 16-bit values for characters, with the full range of 65,536 possible 16-bit values being split into two parts.There are 63,486 values available to represent single 16-bit character values. The other 2,048 values are reserved to provide paired 16-bit code values for an additional 1,048,544 character values.These are called surrogate pairs, but so far none of these values are being used.  UTF-16 is described at: http://www.ietf.org/rfc/rfc2781.txtThese are relatively new standards, and so much of the world's text isn't yet stored in Unicode. However, it was designed to be a superset of most existing character encodings, and so the conversion of legacy data to Unicode is straightforward. For example, convertingASCII to the UTF-16 form of Unicode merely requires stuffing a zero into the high-order byte of the 16-bit character, and simply preserving the low-order byte as is. Of course, this means that twice the storage space is required, compared to the same text in ASCII.As noted above, 7-bit ASCII doesn't even need conversion to be treated as the UTF-8 encoding
  • 6. SPECIAL MARKUP CHARACTERS  Five characters have special meaning in XML mark-up:  < - Less-than sign (left angle bracket)  > - Greater-than sign (right angle bracket)  & - Ampersand  ' - Apostrophe (single quotation mark)  " - Quotation mark (double quotation mark)  Use &lt; for <  Use &gt; for >  Use &amp; for &  Use &apos; for ‘ and Use &quot; for "
  • 7. ELEMENTS  An element is XML's basic container for content - it may contain character data, other elements, and/or other markup (comments, PIs, entity references, etc.). Since they represent discrete objects, elements can be thought of as the "nouns" of XML.  Elements are delimited with a start-tag and an end-tag. If an element has no content, it is known as an empty element, and may be represented with either a start-tag/end-tag pair or using an abbreviation: the empty-element tag. Unlike the looser syntax of HTML and SGML, the end-tag cannot be omitted, except when using an empty-element tag.
  • 8.  All three types of tags are shown in this example:  <html> <!-- start-tag -->  <img src="logo.png" /> <!-- empty-element tag -->  </html> <!-- end-tag -->  Each of these tags consists of the element type name (this must be a valid XML name) enclosed within a pair of angle brackets  (< >). Let's look at XML tags in more detail.
  • 9.  <html> <!-- start-tag -->  <img src="logo.png" /> <!-- empty-element tag -->  </html> <!-- end-tag -->
  • 10. TAGS  The opening delimiter of an element is called the start-tag. Start-tags are comprised of an element type name, and perhaps some attributes (which we'll look at later in this chapter), enclosed within a pair of angle brackets.  We can think of start-tags as "opening" a container - which is then "closed" with an end-tag. End-tags are comprised of a forward slash (/) followed by an element type name, enclosed within the usual angle brackets.  The name in an end-tag must match the element name in a corresponding start-tag. Everything between the start-tag and the end-tag of an element is contained within that element.The following are legal pairs of start- and end-tags:  <Invoice> ... </Invoice>  <INVOICE> ... </INVOICE>  <INVOICE > ... </INVOICE >  <Wrox:Invoice> ... </Wrox:Invoice>
  • 11. EMPTY-ELEMENTTAGS  Empty elements are those that have no content, though there may be associated attributes. Let's say that we wanted to explicitly indicate certain points within our XML data (see the next section). We could just add a start- and end-tag pair without any text between  for example:  <point></point>
  • 12. THE STRUCTURE OF XML DATA  All XML data must conform to both syntax requirements and a simple container structure. Such data is known as well formed (see relevant section later in this chapter for more details). All well-formed XML documents can be comprised of one to three parts:  An optional prolog, which may contain important information about the rest of the data. The body, which consists of one or more elements in the form a hierarchical tree. An optional "miscellaneous" epilog that follows the element tree.These parts, and the unfamiliar syntax in the following illustration, will be described in greater detail later in this chapter.  Prolog  <?xml version="1.0"?>
  • 13.  <!-- Comments and/or PIs allowed here -->  <!DOCTYPE textfile SYSTEM "http://www.mySite.com/MyDTDs/Textfile.dtd">  <!-- Comments and/or PIs allowed here -->
  • 14.  <textfile>  <line>A Simple Example</line>  <line> byYoursTruly</line>  <line>This is the 3rd line of a simple 5-line text file.</line>  <line>..the middle line..</line>  <line>And lastly, a final line of text.</line>  <EOF/>  </textfile>
  • 17.  The body sub-tree always has a single root node called the document element (sometimes referred to as the root element) - if not,  the data is not well-formedXML!  Any well-formedXML document must be a simple hierarchical tree with a single root node, called the "document root". This  document tree contains a secondary tree of elements, with its own singular root node, called the "document element".  The document root of each XML document is also the main point of attachment for the document's description using a DTD or  Schema (see Chapters 5 and 6 for more about these). A Processing Instruction (PI - more about these later) is often used to attach  a stylesheet as well (see Chapter 9).  Since well-formedXML data has a tree structure, it can be modeled and manipulated as a tree. A standard model for this  approach is theW3C Document Object Model (DOM), which will be discussed in Chapter 11.  Now let's look at the body of the XML document in greater depth
  • 18.  The Document Element  This element is the parent of all other elements in the tree, and thus it may not be contained in any other element. Because the document root and the document element are not the same thing, it is better not to refer to the document element as the "root element" (even though it is the root of the element sub-tree).
  • 20.  String Literals  String literals are used for the values of attributes, internal entities, and external identifiers. All string literals in XML are  enclosed by delimiter pairs, using either an apostrophe (') or a quotation mark ("). The one restriction upon these literals is that  the character used for the delimiters may not appear within the literal - if an apostrophe appears in the literal, the quotation mark  delimiter must be used, and vice versa.
  • 21.  "string"  'string'  "..Jack's cow said &quot;moo&quot;"  '..Jack&apos;s cow said "moo"'
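The delimiter rule is easy to get wrong by hand, so libraries usually pick the delimiter for you. As one possible illustration, Python's standard `xml.sax.saxutils.quoteattr` chooses a delimiter that does not clash with the value and falls back to `&quot;` when both kinds of quote are present. The sample strings are adapted from the literals above.

```python
from xml.sax.saxutils import quoteattr

# No quotes inside: the quotation mark is used as the delimiter.
plain = quoteattr("string")

# Only quotation marks inside: the apostrophe becomes the delimiter.
with_quotes = quoteattr('..Jack said "moo"')

# Both kinds inside: the quotation mark delimits, inner ones are escaped.
with_both = quoteattr('..Jack\'s cow said "moo"')

for literal in (plain, with_quotes, with_both):
    print(literal)
```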
  • 22. ATTRIBUTES  If elements are the "nouns" of XML, then attributes are its "adjectives".  Often there is some information about an element that we wish to attach to it, as opposed to including it as a string inside the element, or one of its children. This can be done using attributes, each of which consists of a name-value pair. Both start-tags and empty-element tags may include attributes within the tag. Attribute values must always be string literals, so the attribute value can use either of the two delimiters.
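A short sketch of attributes as name-value pairs, using Python's standard `xml.etree.ElementTree`. The `invoice` element and its attribute names are hypothetical; one value uses each delimiter.

```python
import xml.etree.ElementTree as ET

# Hypothetical empty-element tag with two attributes, one per delimiter style.
invoice = ET.fromstring("""<invoice number="1234" customer='Jack'/>""")

# The parser exposes the attributes as a mapping of names to string values.
print(invoice.attrib)
print(invoice.get("customer"))
```

Once parsed, the choice of delimiter is no longer visible: both values arrive as ordinary strings.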
  • 23. ELEMENTS VS. ATTRIBUTES  The decision to use an element versus an attribute is not a simple one. Much discussion and argument has occurred about this topic on both the XML-L and XML-DEV lists. Some argue that attributes should never be used: that they add unnecessary processing complexity, and that anything that can be represented as an attribute would be better contained within a child element. Others extol the advantage of being able to validate attribute values and assign default values using a DTD. Experiments with generic data compression (such as gzip, zlib, or LZW) have shown that, despite superficial appearances, neither form has an inherent advantage for data storage or transmission.
  • 24. CHARACTER DATA  Character data is plain text that contains no element tags or other markup, except perhaps character and entity references. Remember too that, because XML is intended for worldwide use, text means Unicode, not just ASCII (see the "Characters" section earlier in this chapter).  The ampersand (&) and less-than (<) characters are used as XML's opening delimiters, and thus may never appear in their literal form (except in CDATA sections, which are discussed later). If these characters are needed within character data, they must be escaped using the entity references &amp; and &lt;. It is not necessary to escape the other markup characters (like >), but they may be escaped (using &gt; in this case), if only for the sake of consistency within the character data.  These escape sequences are part of the set of five such strings defined by the XML specification, and implemented in all compliant XML parsers.
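The escaping rule can be sketched with Python's standard `xml.sax.saxutils` helpers. The sample text and the `code` element name are invented for the example; `escape` replaces `&`, `<`, and `>` by default.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

raw = "if (a < b && b > 0)"

# Replace the markup delimiters with the predefined entity references.
escaped = escape(raw)
print(escaped)

# A parser turns the references back into the original characters.
element = ET.fromstring(f"<code>{escaped}</code>")
print(element.text)

# unescape() performs the same reversal without a parser.
print(unescape(escaped) == raw)
```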
  • 25. WHITESPACE  Whitespace is an important linguistic concept for both human and computer languages. Only four characters are treated as whitespace in XML data: horizontal tab (U+0009), line feed (U+000A), carriage return (U+000D), and space (U+0020).  XML's rule for handling whitespace is very simple: all whitespace characters (except for the CR character) within the content are preserved by the parser and passed unmodified to the application, while whitespace within element tags and attribute values may be removed. This is unlike the rampant removal of whitespace carried out in HTML browsers.
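The whitespace rule can be observed directly. This sketch, assuming Python's standard `xml.etree.ElementTree` (built on the Expat parser), uses a made-up `note` element whose content contains leading spaces and a CR LF pair, and whose attribute value contains a tab and a line feed.

```python
import xml.etree.ElementTree as ET

source = '<note tags="a\tb\nc">  two leading spaces\r\nand a CR LF pair </note>'
element = ET.fromstring(source)

# Content whitespace is kept, except that CR LF is normalized to a single LF.
print(repr(element.text))

# Inside the attribute value, the tab and line feed are each turned into a space.
print(repr(element.get("tags")))
```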
  • 26. SPECIAL-PURPOSE MARKUP  We've already discussed just about every aspect of XML syntax that is necessary to create well-formed XML data (elements, attributes, and character/entity references). There are three additional syntactic constructs that deviate from the familiar syntax of tags (<tagname>) or entity references (&ref;). These are:  Comments  Processing Instructions (PIs)  CDATA sections
  • 27. COMMENTS  It is often useful to insert notes, or comments, into a document. These comments might provide a revision log, historical notes, or any other sort of meta-data that would be meaningful to the creator and editors of a document (serving to enhance its human readability), but aren't truly part of the document's content. Comments may appear anywhere in a document outside of other markup (that is, you can't put a comment in the middle of a start- or end-tag).  The basic syntax of an XML comment is:  <!--...comment text...-->
  • 28. PROCESSING INSTRUCTIONS (PIS)  XML, like SGML, is a descriptive markup language, and so it does not presume to try to explain how to actually process an element or its contents. This is a powerful advantage in that it provides presentation flexibility, and OS- and application-independence. However, there are times when it is desirable to pass processing hints (or perhaps some script code) to the application along with the document. The Processing Instruction (PI) is the mechanism that XML provides for this purpose.
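As an illustration of how an application sees a PI, the sketch below uses Python's standard `xml.dom.minidom` to read a stylesheet-attaching PI. The stylesheet name `textfile.xsl` is a hypothetical value chosen for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="textfile.xsl"?>'
    "<textfile><line>A Simple Example</line></textfile>"
)

# The XML Declaration is not a node, so the PI is the first child.
pi = doc.firstChild
print(pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE)
print(pi.target)  # the name that identifies the intended application
print(pi.data)    # everything else, passed through uninterpreted
```

The parser hands over the target and the raw data string; making sense of the data is left entirely to the application.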
  • 29. CDATA SECTIONS  CDATA sections are a method of including text that contains characters that would otherwise be interpreted as markup. This feature is primarily useful to authors who wish to include examples of XML markup in their documents (like the examples in this book). This is probably the only good reason to include CDATA sections in a document, since almost all advantages of XML are lost when using these sections.
  • 30.  The basic syntax of a CDATA section is:  <![CDATA[...]]>  <![CDATA[&Warn; - &Disclaimer; &lt;&copy; 2001 &USCG; &amp; &USN; &gt; ]]>  <example>&amp;Warn; - &amp;Disclaimer; &amp;lt;&amp;copy; 2001 &amp;USCG; &amp;amp; &amp;USN;  &amp;gt;  </example>
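The equivalence between a CDATA section and escaped character data can be checked with a parser. This is a small sketch using Python's standard `xml.etree.ElementTree`; the sample content is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Markup-like text protected by a CDATA section.
with_cdata = ET.fromstring(
    "<example><![CDATA[<line>Jack & Jill</line>]]></example>"
)

# The same text written with entity references instead.
with_refs = ET.fromstring(
    "<example>&lt;line&gt;Jack &amp; Jill&lt;/line&gt;</example>"
)

print(with_cdata.text)
print(with_cdata.text == with_refs.text)
print(len(with_cdata))  # number of child elements
```

Both forms deliver the same string to the application, and the `<line>` inside the CDATA section is never treated as an element.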
  • 31. DOCUMENT STRUCTURE  Prolog  The prolog is the appetizer, used to signal the beginning of XML data. It describes the data's character encoding, and provides some other configuration hints to the XML parser and application.  XML Declaration  All XML documents should begin with an XML Declaration. This declaration is not required in most XML documents, but it serves to explicitly identify the data as XML, and does permit some optimizations when processing the document. If the XML data uses an encoding other than UTF-8 or UTF-16, then an XML Declaration with the correct encoding must be used.  If this declaration is included, then the string literal "<?xml " must be the very first six characters of the document; no preceding whitespace or embedded comments are allowed.
  • 32.  While this declaration looks exactly like a processing instruction, strictly speaking it is not a PI (it is a unique declaration defined by the XML 1.0 REC). Nevertheless, the XML Declaration uses PI-like delimiters and an attribute-like parameter syntax that is similar to the one used in element tags (either " or ' may be used to delimit the value strings). For example:  <?xml version="1.0" encoding='utf-8' standalone="yes"?>  <?xml version='1.0' encoding='utf-8'?>
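Two of the rules for the XML Declaration can be demonstrated with a short sketch, here using Python's standard `xml.etree.ElementTree` on byte strings. The helper `parses` and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Return True if data is accepted as well-formed XML."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# The declaration must be the very first thing in the document.
at_start = parses(b"<?xml version='1.0' encoding='utf-8'?><doc/>")
after_space = parses(b" <?xml version='1.0'?><doc/>")

# Data in another encoding needs a declaration that names it.
declared = b"<?xml version='1.0' encoding='iso-8859-1'?><name>Andr\xe9</name>"
undeclared = b"<name>Andr\xe9</name>"

print(at_start, after_space)
print(ET.fromstring(declared).text, parses(undeclared))
```

Without the declaration the parser assumes UTF-8, and the lone byte `0xE9` is not valid UTF-8, so the undeclared document is rejected.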
  • 33. DOCUMENT TYPE DECLARATION  This should not be confused with the DTD (Remember: Document Type Definition)! Rather, the Document Type Declaration can refer to an external DTD and/or contain part of the DTD.  Body  This is, of course, the main course of the XML data, which we've discussed at length in terms of its components: elements, attributes, character data, etc. It is worth reiterating that the body may contain comments, PIs, and/or whitespace characters interleaved with elements and character data. The elements must comprise a hierarchical tree, with a single root node.
  • 34. EPILOG  The XML epilog is the dessert with potentially unpleasant consequences! It may include comments, PIs, and/or whitespace. Comments and whitespace don't cause any significant problems. However, it is unclear whether PIs in the epilog should be applied to the elements in the preceding XML data, or to a subsequent XML document (if any). This may well be a solution in search of a problem, or it may just be a problem in and of itself. XML does not define any end-of-document indicator, and many applications will use the document element end-tag for this purpose. In this case, the epilog is never read, let alone processed.  Tim Bray (one of the XML 1.0 REC editors) considers the epilog a "real design error". It is probably inadvisable to use it without a very compelling reason, and without the prior knowledge that it will likely not be interoperable with other XML applications.
  • 35. VALID XML  Any XML data object is considered valid XML if it is well formed, meets certain further validity constraints, and matches a grammar describing the document's content. Like SGML, XML can provide such a description of document structure, in the form of an XML Schema or a DTD.  The SGML equivalent of a well-formed document is known as tag-valid. The SGML equivalent of a valid document is type-valid.
  • 36. XML PARSERS  In addition to specifying the syntax of XML, the W3C described some of the behavior of the lower tier of XML's client architecture (the XML processor or parser).  Parser Levels  Two levels of parser ("processor") behavior are defined in the XML 1.0 REC:  Non-validating - ensures that the data is well-formed XML, but need not resolve any external resources.  Validating - ensures both well-formedness and validity using a DTD, and must resolve external resources.
  • 37.  Parser Implementations  There are two different implementation approaches to processing the XML data:  Event-driven parser - Processes XML data sequentially, handling components one at a time.  Tree-based parser - Constructs a tree representation of the entire document and provides access to individual nodes in the tree (can be constructed on top of an event-driven parser).  Much quasi-religious argument has occurred about this dichotomy, but each approach has its merits. Like so many other real-world problems, XML processing may have vastly different requirements, and thus different approaches may be best for different situations.
  • 38. EVENT-DRIVEN PARSERS  The event-driven model should be quite familiar to programmers of modern GUIs and operating systems. In this case, the XML parser executes a callback to the application for each component of the XML data: element (with attributes), character data, processing instructions, notation, or comments. It's up to the application to handle the XML data as it is provided via the callbacks; the XML parser does not maintain the element tree structure, or any of the data after it has been parsed. The event-driven method requires very modest system resources, even for extremely large documents; and because of its simple, low-level access to the structure of the XML data, it provides great flexibility in handling the data within the XML application.
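A minimal sketch of the callback style, assuming Python's standard `xml.sax` module as the event-driven parser. The handler class `LineCollector` is a made-up name, and the input is a shortened version of the earlier text-file example.

```python
import xml.sax


class LineCollector(xml.sax.ContentHandler):
    """Collect the text of every <line> element as the events arrive."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = None  # holds text chunks while inside a <line>

    def startElement(self, name, attrs):
        if name == "line":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "line":
            self.lines.append("".join(self._buffer))
            self._buffer = None


handler = LineCollector()
xml.sax.parseString(
    b"<textfile><line>A Simple Example</line>"
    b"<line>..the middle line..</line><EOF/></textfile>",
    handler,
)
print(handler.lines)
```

The parser keeps nothing once a callback returns; whatever state the application needs (here, the buffer and the list) it has to maintain itself.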
  • 39. TREE-BASED PARSERS  One of the most widely used structures in software engineering is the simple hierarchical tree. All well-formed XML data is defined to be such a tree, and thus common and mature algorithms may be used to traverse the nodes of an XML document, search for content, and/or edit the document tree. These tree algorithms have the advantage of years of academic and commercial development.  XML parsers that use this approach generally conform to the W3C's Document Object Model (DOM). The DOM is a platform- and language-neutral interface that allows manipulation of tree-structured documents. On the other hand, the DOM tree must be built in memory before the document can be manipulated; high-performance virtual memory support is imperative for larger documents! Once the tree is built, an application may access the DOM via a related API.
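For comparison with the event-driven style, here is a small sketch of the tree-based approach using Python's standard `xml.dom.minidom`, one DOM implementation among many. The document content is invented for the example.

```python
from xml.dom import minidom

# The whole document is parsed into an in-memory tree first.
doc = minidom.parseString(
    "<textfile><line>first</line><line>second</line></textfile>"
)
root = doc.documentElement

# Search the tree for content.
texts = [line.firstChild.data for line in root.getElementsByTagName("line")]
print(texts)

# Edit the tree: append a new <line> element with a text node.
new_line = doc.createElement("line")
new_line.appendChild(doc.createTextNode("third"))
root.appendChild(new_line)

serialized = root.toxml()
print(serialized)
```

Random access and in-place editing come at the cost of holding the entire document in memory, which is the trade-off described above.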