Basic xml syntax

 XML markup describes and provides structure to the content of an XML document
or data packet.

 The tag markup syntax of XML is very similar to HTML (both are based upon SGML), with angle
brackets used to delimit tags.
 All tags begin with a less-than sign (<) and end with a greater-than sign (>).
 Unlike HTML, XML is case-sensitive, including element tags and attribute values; that is:
 <Invoice> ≠ <INVOICE> ≠ <invoice> ≠ <INvoice>
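
 The case-sensitivity rule can be checked with a short sketch. It assumes Python and its standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration and is not part of the original material.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag names match exactly, including case.
matching = is_well_formed("<Invoice>42</Invoice>")

# <Invoice> and <INVOICE> are different names, so the end-tag does not match.
mismatched = is_well_formed("<Invoice>42</INVOICE>")

print(matching, mismatched)
```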
 Characters
 Because XML is intended for worldwide use, characters are not limited to the 7-bit ASCII
character set. XML uses most of the characters that are defined in the Unicode character
set (currently congruent with ISO/IEC 10646). There are two Unicode formats that are used as
the basis of XML characters: UTF-8 and UTF-16. XML allows the use of almost any character
encoding that can be mapped to Unicode (such as EBCDIC, Big5, etc.). Numerous
other character encodings can be used with some XML tools, but UTF-8 and UTF-16
support is required of all XML processors.

 The current Unicode specification can be found at http://www.unicode.org, and
ISO/IEC 10646 documentation can be ordered at http://www.iso.ch. The UTF
acronym can mean "Unicode Transformation Format" (according to Unicode), or
"UCS Transformation Format" (in IEC or IETF documents) - essentially they mean the
same thing, since Unicode and ISO/IEC 10646 are nearly identical.
 UTF-8 is commonly used in North America and Europe, since the first 128
character values map directly to 7-bit US-ASCII (conversely, any 7-bit ASCII string is
valid UTF-8). UTF-8 is a multi-byte encoding, with character values represented in
one to six bytes under RFC 2279 (the later RFC 3629 restricts this to four). This encoding
is less popular in Asia, since most Asian characters and ideographs need three bytes
in UTF-8, compared with two in UTF-16.

 UTF-8 is described at: http://www.ietf.org/rfc/rfc2279.txt

 The UTF-16 encoding uses 16-bit values for characters, with the full range of 65,536
possible 16-bit values being split into two parts. There are 63,486 values available to
represent single 16-bit character values. The other 2,048 values are reserved to provide
paired 16-bit code values for an additional 1,048,544 character values. These are called
surrogate pairs; when this material was written, none of these values had yet been assigned.

 UTF-16 is described at: http://www.ietf.org/rfc/rfc2781.txt

 These are relatively new standards, and so much of the world's text isn't yet stored in
Unicode. However, Unicode was designed to be a superset of most existing character encodings,
and so the conversion of legacy data to Unicode is straightforward. For example, converting
ASCII to the UTF-16 form of Unicode merely requires stuffing a zero into the high-order byte
of the 16-bit character, and simply preserving the low-order byte as is. Of course, this means
that twice the storage space is required, compared to the same text in ASCII. As noted above,
7-bit ASCII doesn't even need conversion to be treated as the UTF-8 encoding.
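
 The byte-level claims above can be illustrated with Python's built-in codecs. This is only a sketch: the sample characters (a musical clef and a CJK ideograph) are arbitrary choices, not examples from the original text.

```python
# Plain 7-bit ASCII text is already valid UTF-8, byte for byte.
ascii_text = "Invoice"
utf8_bytes = ascii_text.encode("utf-8")
same_as_ascii = utf8_bytes == ascii_text.encode("ascii")

# In big-endian UTF-16, each ASCII character gains a zero high-order byte,
# so the text takes twice the storage.
utf16_bytes = ascii_text.encode("utf-16-be")
high_bytes = utf16_bytes[0::2]
low_bytes = utf16_bytes[1::2]

# A character beyond the first 65,536 code points needs a surrogate pair
# (two 16-bit code units) in UTF-16.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, an arbitrary example
clef_units = len(clef.encode("utf-16-be")) // 2

# A CJK ideograph takes three bytes in UTF-8 but two in UTF-16.
ideograph = "\u4e2d"
ideograph_sizes = (len(ideograph.encode("utf-8")),
                   len(ideograph.encode("utf-16-be")))

print(same_as_ascii, high_bytes, low_bytes, clef_units, ideograph_sizes)
```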

SPECIAL MARKUP CHARACTERS
 Five characters have special meaning in XML mark-up:
 < - Less-than sign (left angle bracket)
 > - Greater-than sign (right angle bracket)
 & - Ampersand
 ' - Apostrophe (single quotation mark)
 " - Quotation mark (double quotation mark)
 Use &lt; for <
 Use &gt; for >
 Use &amp; for &
 Use &apos; for ' and &quot; for "
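
 As a sketch of how these five references are applied in practice, Python's standard xml.sax.saxutils module provides escape() and unescape(). The sample sentence and the two lookup tables are assumptions made for this illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >; the two quote characters are
# supplied as optional extra replacements.
QUOTES = {"'": "&apos;", '"': "&quot;"}
UNQUOTES = {"&apos;": "'", "&quot;": '"'}

raw = """Jack's cow said "moo" & 1 < 2"""
escaped = escape(raw, QUOTES)
restored = unescape(escaped, UNQUOTES)

print(escaped)
```

 Escaping and then unescaping returns the original text, so the round trip loses nothing.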

ELEMENTS
 An element is XML's basic container for content - it may contain character data,
other elements, and/or other markup (comments, PIs, entity references, etc.).
Since they represent discrete objects, elements can be thought of as the "nouns"
of XML.
 Elements are delimited with a start-tag and an end-tag. If an element has no
content, it is known as an empty element, and may be represented with either a
start-tag/end-tag pair or using an abbreviation: the empty-element tag. Unlike
the looser syntax of HTML and SGML, the end-tag cannot be omitted, except
when using an empty-element tag.

 All three types of tags are shown in this example:
 <html> 
 <img src="logo.png" /> 
 </html> 
 Each of these tags consists of the element type name (this must be a valid XML
name) enclosed within a pair of angle brackets (< >). Let's look at XML tags in more detail.


TAGS
 The opening delimiter of an element is called the start-tag. Start-tags consist of an element type name, and
perhaps some attributes (which we'll look at later in this chapter), enclosed within a pair of angle brackets.
 We can think of start-tags as "opening" a container - which is then "closed" with an end-tag. End-tags consist of a
forward slash (/) followed by an element type name, enclosed within the usual angle brackets.
 The name in an end-tag must match the element name in a corresponding start-tag. Everything between the start-tag
and the end-tag of an element is contained within that element. The following are legal pairs of start- and end-tags:
 <Invoice> ... </Invoice>
 <INVOICE> ... </INVOICE>
 <INVOICE > ... </INVOICE >
 <Wrox:Invoice> ... </Wrox:Invoice>

EMPTY-ELEMENT TAGS
 Empty elements are those that have no content, though there may be associated
attributes. Let's say that we wanted to explicitly indicate certain points within our
XML data (see the next section). We could just add a start- and end-tag pair
without any text between them, for example:
 <point></point>
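
 The two ways of writing an empty element can be compared with a small sketch using Python's xml.etree.ElementTree. The x and y attributes and the is_empty helper are hypothetical names added for illustration.

```python
import xml.etree.ElementTree as ET

# A start-tag/end-tag pair with nothing between, and the abbreviated
# empty-element tag, describe the same empty element.
pair = ET.fromstring("<point></point>")
abbreviated = ET.fromstring("<point/>")

# An empty element may still carry attributes.
with_attrs = ET.fromstring('<point x="3" y="4"/>')


def is_empty(element):
    """An element is empty when it has no text and no child elements."""
    return element.text is None and len(element) == 0


print(is_empty(pair), is_empty(abbreviated), with_attrs.attrib)
```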

THE STRUCTURE OF XML DATA
 All XML data must conform to both syntax requirements and a simple container
structure. Such data is known as well formed (see the relevant section later in this
chapter for more details). All well-formed XML documents consist of
one to three parts:
 An optional prolog, which may contain important information about the rest of the data.
 The body, which consists of one or more elements in the form of a hierarchical tree.
 An optional "miscellaneous" epilog that follows the element tree.
 These parts, and the unfamiliar syntax in the following illustration, will be
described in greater detail later in this chapter.
 Prolog
 <?xml version="1.0"?>
 <!DOCTYPE textfile SYSTEM "http://www.mySite.com/MyDTDs/Textfile.dtd">

 <textfile>
 <line>A Simple Example</line>
 <line> by Yours Truly</line>
 <line>This is the 3rd line of a simple 5-line text file.</line>
 <line>..the middle line..</line>
 <line>And lastly, a final line of text.</line>
 <EOF/>
 </textfile>

 The body sub-tree always has a single root node called the document element (sometimes referred to as the root
element) - if not, the data is not well-formed XML!
 Any well-formed XML document must be a simple hierarchical tree with a single root node, called the "document root".
This document tree contains a secondary tree of elements, with its own singular root node, called the "document element".
 The document root of each XML document is also the main point of attachment for the document's description using a
DTD or Schema (see Chapters 5 and 6 for more about these). A Processing Instruction (PI - more about these later) is often
used to attach a stylesheet as well (see Chapter 9).
 Since well-formed XML data has a tree structure, it can be modeled and manipulated as a tree. A standard model for this
approach is the W3C Document Object Model (DOM), which will be discussed in Chapter 11.
 Now let's look at the body of the XML document in greater depth.
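
 The single-root requirement can be sketched with Python's xml.etree.ElementTree. The shortened textfile data and the helper name parses are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if `text` is accepted as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One document element containing the rest of the tree: well formed.
one_root = "<textfile><line>A</line><line>B</line><EOF/></textfile>"

# Two top-level elements and no single document element: not well formed.
two_roots = "<line>A</line><line>B</line>"

root = ET.fromstring(one_root)
child_tags = [child.tag for child in root]

print(parses(one_root), parses(two_roots), root.tag, child_tags)
```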

 The Document Element
 This element is the parent of all other elements in the tree, and thus it may not be
contained in any other element. Because the document root and the document element are not
the same thing, it is better not to refer to the document element as the "root element"
(even though it is the root of the element sub-tree).

 String Literals
 String literals are used for the values of attributes, internal entities, and external
identifiers. All string literals in XML are enclosed by delimiter pairs, using either an
apostrophe (') or a quotation mark ("). The one restriction upon these literals is that
the character used for the delimiters may not appear within the literal - if an
apostrophe appears in the literal, the quotation mark delimiter must be used, and vice versa.
If both characters are needed, the one that matches the delimiter must be escaped.

 "string"
 'string'
 "..Jack's cow said &quot;moo&quot;"
 '..Jack&apos;s cow said "moo"'

ATTRIBUTES
 If elements are the "nouns" of XML, then attributes are its "adjectives".
 Often there is some information about an element that we wish to attach to it, as
opposed to including it as a string inside the element or one of its children. This can be
done using attributes, each of which consists of a name-value pair. Both start-tags
and empty-element tags may include attributes within the tag. Attribute values
must always be string literals, so the attribute value can use either of the two delimiters.
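
 A brief sketch of attributes as name-value pairs, using Python's xml.etree.ElementTree. The cow element and its attribute names are hypothetical. Each value uses the delimiter that does not occur inside it.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: one value is delimited by quotation marks because
# it contains an apostrophe, another by apostrophes because it contains
# quotation marks.
doc = """<cow owner="Jack's farm" says='"moo"' legs="4"/>"""
cow = ET.fromstring(doc)

# Attributes are exposed as a mapping from name to string value.
owner = cow.get("owner")
says = cow.get("says")
legs = cow.get("legs")  # still a string; XML attribute values are literals

print(owner, says, legs)
```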

ELEMENTS VS. ATTRIBUTES
 The decision to use an element versus an attribute is not a simple one. Much
discussion and argument has occurred about this topic on both the XML-L and XML-DEV lists.
Some argue that attributes should never be used - that they add unnecessary
processing complexity, and that anything that can be represented as an attribute
would be better contained within a child element. Others extol the advantage of being
able to validate attribute values and assign default values using a DTD. Experiments
with generic data compression (such as gzip, zlib, or LZW) have shown that, despite
superficial appearances, neither form has an inherent advantage for data storage or
transmission.

CHARACTER DATA
 Character data is plain text that contains no element tags or other markup, except perhaps character and entity
references. Remember too that, because XML is intended for worldwide use, text means Unicode, not just ASCII (see the
"Characters" section earlier in this chapter).
 The ampersand (&) and less-than (<) characters are used as XML's opening delimiters, and thus may never appear in
their literal form (except in CDATA sections, which are discussed later). If these characters are needed within character
data, they must be escaped using the entity references &lt; or &amp;. It is not necessary to escape the other markup
characters (like >), but they may be escaped (using &gt; in this case), if only for the sake of consistency within the
character data.
 These escape sequences are part of the set of five such strings defined by the XML specification, and implemented in all
compliant XML parsers.

WHITESPACE
 Whitespace is an important linguistic concept for both human and computer
languages. Only four characters are treated as whitespace in XML data:
 Horizontal tab (HT, U+0009)
 Line feed (LF, U+000A)
 Carriage return (CR, U+000D)
 Space (U+0020)
 XML's rule for handling whitespace is very simple: all whitespace characters
(except for the CR character) within the content are preserved by the parser and
passed unmodified to the application, while whitespace within element tags and
attribute values may be removed. This is unlike the rampant removal of
whitespace carried out in HTML browsers.
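
 The whitespace rule can be observed with a short sketch using Python's xml.etree.ElementTree, which is built on the Expat parser. The note element and its id attribute are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Extra spaces inside the tags, a line feed inside the attribute value,
# and a CR LF pair plus a tab inside the content.
doc = '<note  id="a\nb"  >  two  spaces\r\nnext line\t</note  >'
note = ET.fromstring(doc)

# Whitespace inside the tags is discarded; the line feed inside the
# attribute value is normalized to a single space.
attribute_value = note.get("id")

# Whitespace in the content is passed through, except that the CR of the
# CR LF pair is dropped, leaving a single line feed.
content = note.text

print(repr(attribute_value), repr(content))
```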

SPECIAL-PURPOSE MARKUP
 We've already discussed just about every aspect of XML syntax that is necessary
to create well-formed XML data (elements, attributes, and character/entity
references). There are three additional syntactic constructs that deviate from the
familiar syntax of tags (<tagname>) or entity references (&ref;). These are:
 Comments
 Processing Instructions (PIs)
 CDATA sections

COMMENTS
 It is often useful to insert notes, or comments, into a document. These comments
might provide a revision log, historical notes, or any other sort of meta-data that
would be meaningful to the creator and editors of a document (serving to
enhance its human readability), but aren't truly part of the document's content.
Comments may appear anywhere in a document outside of other markup (that is,
you can't put a comment in the middle of a start- or end-tag).
 The basic syntax of an XML comment is:
 <!-- ...comment text... -->


PROCESSING INSTRUCTIONS (PIS)
 XML, like SGML, is a descriptive markup language, and so it does not presume to
try to explain how to actually process an element or its contents. This is a powerful
advantage in that it provides presentation flexibility, and OS- and application-independence.
However, there are times when it is desirable to pass processing hints (or perhaps
some script code) to the application along with the document. The Processing Instruction
(PI) is the mechanism that XML provides for this purpose.
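
 As a sketch of how an application might see comments and PIs, the following uses Python's standard xml.dom.minidom module. The comment text, the stylesheet name invoice.xsl, and the Invoice element are hypothetical.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<!-- revision 2: prices updated -->'
    '<?xml-stylesheet type="text/xsl" href="invoice.xsl"?>'
    '<Invoice/>'
)

# Comments are kept as separate nodes, apart from the element content.
comments = [n.data for n in doc.childNodes if n.nodeType == Node.COMMENT_NODE]

# A PI is delivered as a target name plus an uninterpreted data string;
# what to do with it is left entirely to the application.
instructions = [
    (n.target, n.data)
    for n in doc.childNodes
    if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]

print(comments, instructions)
```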

CDATA SECTIONS
 CDATA sections are a method of including text that contains characters that
would otherwise be interpreted as markup. This feature is primarily useful to
authors who wish to include examples of XML markup in their documents (like the
examples in this book). This is probably the only good reason to include CDATA
sections in a document, since almost all advantages of XML are lost when using
these sections.

 The basic syntax of a CDATA section is:
 <![CDATA[...]]>
 <![CDATA[&Warn; - &Disclaimer; <© 2001 &USCG; & &USN; > ]]>
 <example>&Warn; - &Disclaimer; &lt;&copy; 2001
&USCG; &amp; &USN;
 &gt;
 </example>
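
 A minimal sketch, using Python's xml.etree.ElementTree, of how a CDATA section and the equivalent escaped character data reach the application as the same text. The sample content is an assumption chosen for brevity.

```python
import xml.etree.ElementTree as ET

# Markup characters inside a CDATA section are taken literally.
with_cdata = ET.fromstring("<example><![CDATA[<b>bold</b> & more]]></example>")

# The same text written with entity references instead.
with_refs = ET.fromstring("<example>&lt;b&gt;bold&lt;/b&gt; &amp; more</example>")

print(with_cdata.text)
print(with_cdata.text == with_refs.text)
```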

DOCUMENT STRUCTURE
 Prolog
 The prolog is the appetizer - used to signal the beginning of XML data. It describes the data's
character encoding, and provides some other configuration hints to the XML parser and
application.
 XML Declaration
 All XML documents should begin with an XML Declaration. This declaration is not required in
most XML documents, but it serves to explicitly identify the data as XML, and does permit
some optimizations when processing the document. If the XML data uses an encoding other
than UTF-8 or UTF-16, then an XML Declaration with the correct encoding must be used.
 If this declaration is included, then the string literal "<?xml " must be the very first six characters
of the document - no preceding whitespace or embedded comments are allowed.

 While this declaration looks exactly like a processing instruction, strictly speaking
it is not a PI (it is a unique declaration defined by the XML 1.0 REC). Nevertheless,
the XML Declaration uses PI-like delimiters and an attribute-like parameter syntax
that is similar to the one used in element tags (either " or ' may be used to delimit
the value strings). For example:
 <?xml version="1.0" encoding='utf-8' standalone="yes"?>
 <?xml version='1.0' encoding='utf-8'?>
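
 Two of the rules above can be sketched with Python's xml.etree.ElementTree: an encoding named in the declaration is honored, and nothing may precede the declaration. The name element and its content are made up for this example.

```python
import xml.etree.ElementTree as ET

# Bytes as they might be stored in ISO-8859-1 (Latin-1); 0xE9 is "é".
# The declaration tells the parser how to decode them.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><name>Ren\xe9</name>'
name = ET.fromstring(latin1_doc).text

# Anything before "<?xml ", even a single space, is rejected.
try:
    ET.fromstring(b' <?xml version="1.0"?><name>x</name>')
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False

print(name, leading_space_ok)
```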

DOCUMENT TYPE DECLARATION
 This should not be confused with the DTD (remember: Document Type
Definition)! Rather, the Document Type Declaration can refer to an external DTD
and/or contain part of the DTD.
 Body
 This is, of course, the main course of the XML data, which we've discussed at
length in terms of its components: elements, attributes, character data, etc. It is worth
reiterating that the body may contain comments, PIs, and/or whitespace characters
interleaved with elements and character data. The elements must comprise a
hierarchical tree, with a single root node.

EPILOG
 The XML epilog is the dessert with potentially unpleasant consequences! It may
include comments, PIs, and/or whitespace. Comments and whitespace don't cause
any significant problems. However, it is unclear whether PIs in the epilog should
be applied to the elements in the preceding XML data, or to a subsequent XML
document (if any). This may well be a solution in search of a problem, or it may
just be a problem in and of itself. XML does not define any end-of-document
indicator, and many applications will use the document element end-tag for this
purpose. In this case, the epilog is never read, let alone processed.
 Tim Bray (one of the XML 1.0 REC editors) considers the epilog a "real design error".
It is probably inadvisable to use it without a very compelling reason - and
the prior knowledge that it will likely not be interoperable with other XML
applications.

VALID XML
 Any XML data object is considered valid XML if it is well formed, meets
certain further validity constraints, and matches a grammar describing the
document's content. Like SGML, XML can provide such a description of document
structure in the form of an XML Schema or a DTD.
 The SGML equivalent of a well-formed document is known as tag-valid. The SGML
equivalent of a valid document is type-valid.

XML PARSERS
 In addition to specifying the syntax of XML, the W3C described some of the
behavior of the lower tier of XML's client architecture (the XML processor or
parser).
 Parser Levels
 Two levels of parser ("processor") behavior are defined in the XML 1.0 REC:
 Non-validating - ensures that the data is well-formed XML, but need not resolve
any external resources
 Validating - ensures both well-formedness and validity using a DTD, and must
resolve external resources
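
 The difference between the two levels can be seen with a non-validating parser. The sketch below uses Python's xml.etree.ElementTree, which is built on the non-validating Expat parser. The internal DTD subset and the page element are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The internal DTD subset says <textfile> may contain only <line> elements,
# yet the body contains an undeclared <page> element. The document is
# therefore well formed but not valid.
doc = """<?xml version="1.0"?>
<!DOCTYPE textfile [
  <!ELEMENT textfile (line+)>
  <!ELEMENT line (#PCDATA)>
]>
<textfile><page>not allowed by the DTD</page></textfile>"""

# A non-validating parser checks well-formedness only, so this succeeds.
root = ET.fromstring(doc)
first_child = root[0].tag

print(root.tag, first_child)
```

 A validating parser would be expected to report the page element as a validity error.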

 Parser Implementations
 There are two different implementation approaches to processing the XML data:
 Event-driven parser - Processes XML data sequentially, handling components one at a
time
 Tree-based parser - Constructs a tree representation of the entire document and provides
access to individual nodes in the tree (can be constructed on top of an event-driven parser)
 Much quasi-religious argument has occurred about this dichotomy, but each approach has
its merits. Like so many other real-world problems, XML processing may have vastly different
requirements, and thus different approaches may be best for different situations.

EVENT-DRIVEN PARSERS
 The event-driven model should be quite familiar to programmers of modern GUI
interfaces and operating systems. In this case, the XML parser executes a call-back
to the application for each component of the XML data: element (with
attributes), character data, processing instructions, notation, or comments. It's up
to the application to handle the XML data as it is provided via the call-backs - the
XML parser does not maintain the element tree structure, or any of the data after
it has been parsed. The event-driven method requires very modest system
resources, even for extremely large documents; and because of its simple, low-level
access to the structure of the XML data, provides great flexibility in handling
the data within the XML application.
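
 The call-back model can be sketched with Python's standard xml.sax module, one of several event-driven interfaces. The LineCollector class is a hypothetical handler written for this example. It keeps only the text of line elements and retains nothing else from the document.

```python
import xml.sax


class LineCollector(xml.sax.ContentHandler):
    """Hypothetical handler that keeps only the text of <line> elements."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = None  # None while outside a <line> element

    def startElement(self, name, attrs):
        if name == "line":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect the pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "line" and self._buffer is not None:
            self.lines.append("".join(self._buffer))
            self._buffer = None


handler = LineCollector()
xml.sax.parseString(
    b"<textfile><line>A Simple Example</line>"
    b"<line>..the middle line..</line><EOF/></textfile>",
    handler,
)
print(handler.lines)
```

 The parser itself keeps no tree. Any state the application needs, such as the buffer here, has to be maintained in the handler.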

TREE-BASED PARSERS
 One of the most widely used structures in software engineering is the simple
hierarchical tree. All well-formed XML data is defined to be such a tree, and thus
common and mature algorithms may be used to traverse the nodes of an XML
document, search for content, and/or edit the document tree. These tree
algorithms have the advantage of years of academic and commercial
development.
 XML parsers that use this approach generally conform to the W3C's Document
Object Model (DOM). The DOM is a platform and language-neutral interface that
allows manipulation of tree-structured documents. On the other hand, the DOM
tree must be built in memory before the document can be manipulated - high-performance
virtual memory support is imperative for larger documents! Once
the tree is built, an application may access the DOM via a related API.
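
 A minimal sketch of the tree-based approach, using Python's xml.dom.minidom (a lightweight DOM implementation in the standard library). The shortened textfile data and the added third line are assumptions for illustration.

```python
from xml.dom import minidom

# The whole document is parsed into an in-memory tree first.
dom = minidom.parseString(
    "<textfile><line>first</line><line>second</line><EOF/></textfile>"
)

# Search the tree for content.
texts = [line.firstChild.data for line in dom.getElementsByTagName("line")]

# Edit the tree: insert a new <line> element just before <EOF/>.
new_line = dom.createElement("line")
new_line.appendChild(dom.createTextNode("third"))
root = dom.documentElement
root.insertBefore(new_line, dom.getElementsByTagName("EOF")[0])

serialized = root.toxml()
print(texts)
print(serialized)
```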
