XML

Updated Feb 27, 2019

Overview

Extensible Markup Language (XML) is a simplified subset of Standard Generalized Markup Language (SGML). HyperText Markup Language (HTML) is also descended from SGML rather than from XML, although XHTML reformulates HTML as an XML application. XML is a generic way of wrapping textual data in matching opening and closing tags that indicate its meaning. XML filenames typically end in ".xml".

A simple XML document might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Instance list -->
<vms>
  <vm>
    <vmid>0101af9811012</vmid>
    <type>t1.nano</type>
  </vm>
  <vm>
    <vmid>0102bg8908023</vmid>
    <type>t1.micro</type>
  </vm>
</vms>

This example simulates information you might receive from a cloud computing management API that is listing virtual machine instances.
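
As a rough sketch of how a client might read such a response, the snippet below parses the sample document with Python's standard xml.etree.ElementTree module. The names DOC and instances are illustrative only and do not come from any real API.

```python
import xml.etree.ElementTree as ET

# The sample document, held as bytes so the encoding declaration is honored.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- Instance list -->
<vms>
  <vm>
    <vmid>0101af9811012</vmid>
    <type>t1.nano</type>
  </vm>
  <vm>
    <vmid>0102bg8908023</vmid>
    <type>t1.micro</type>
  </vm>
</vms>
"""

root = ET.fromstring(DOC)  # root is the <vms> element

# Collect (vmid, type) for every <vm> element in the document.
instances = [(vm.findtext("vmid"), vm.findtext("type")) for vm in root.iter("vm")]

for vmid, vm_type in instances:
    print(vmid, vm_type)
```

The comment line and the declaration are skipped by the parser; only the element tree is returned.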

XML Document Body

The body of an XML document contains data enclosed in tag pairs. An opening tag is written as a name between < and >, and the matching closing tag as the same name between </ and >.

  • Data Elements:

    • Each element is enclosed in symmetrical tag pairs.

      <name>Example Name</name>
  • Nested Structure:

    • Tags can contain other tags, forming a tree structure.

      <person>
        <name>John</name>
        <age>30</age>
      </person>
  • Root Tag:

    • The document is always wrapped in an outermost tag pair, called the root.

      <people>
        <person>
          <name>John</name>
          <age>30</age>
        </person>
      </people>
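
The tree structure described above can also be built in code. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative. It also checks that a document with two outermost elements is rejected, since a well-formed document has exactly one root.

```python
import xml.etree.ElementTree as ET

# Build the tree from the root downward.
people = ET.Element("people")                 # root element
person = ET.SubElement(people, "person")      # child of the root
ET.SubElement(person, "name").text = "John"   # leaf elements holding data
ET.SubElement(person, "age").text = "30"

# Serialize to a string (no indentation is added by default).
xml_text = ET.tostring(people, encoding="unicode")
print(xml_text)

# Two top-level elements do not form a well-formed document.
try:
    ET.fromstring("<person/><person/>")
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False
print(two_roots_ok)
```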

User-Defined Tag Names

XML tag names can be created by the user, making it important to choose clear, descriptive names for data elements and their relationships. Note that when consuming XML from an API, the tag names are usually predefined by the API and documented for reference.

  • Names with Purpose:

    • Use meaningful names that clearly represent the data's role.

      <vm>
        <vmid>001</vmid>
        <type>Linux</type>
      </vm>
  • Repetitive Tags:

    • Tags can repeat to represent multiple instances of the same element type.

      <vms>
        <vm>
          <vmid>001</vmid>
          <type>Linux</type>
        </vm>
        <vm>
          <vmid>002</vmid>
          <type>Windows</type>
        </vm>
      </vms>
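
One practical consequence of repeated tags, sketched here with Python's standard xml.etree.ElementTree module: asking for all vm children returns a list whether the document holds two of them or only one, so consuming code can treat both cases the same way. The sample strings and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

many = ET.fromstring(
    "<vms>"
    "<vm><vmid>001</vmid><type>Linux</type></vm>"
    "<vm><vmid>002</vmid><type>Windows</type></vm>"
    "</vms>"
)
one = ET.fromstring("<vms><vm><vmid>001</vmid><type>Linux</type></vm></vms>")

# findall() always returns a list of the matching direct children.
ids_many = [vm.findtext("vmid") for vm in many.findall("vm")]
ids_one = [vm.findtext("vmid") for vm in one.findall("vm")]

print(ids_many)
print(ids_one)
```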

Special Character Encoding

In XML, data is represented as readable text, so characters that are part of XML syntax must be encoded when they appear in the data itself.

Example: The characters < and & cannot be used directly in data fields because they start tags and entity references, and > is usually encoded along with them for symmetry. They are written as &lt;, &amp;, and &gt;.

<example>&lt;tag&gt;Content&lt;/tag&gt;</example>

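The five named entities &lt;, &gt;, &amp;, &apos;, and &quot; are predefined by XML itself and are always available. Other named entities familiar from HTML, such as &nbsp;, are not defined unless the document's DTD declares them, so XML that follows a schema or interacts with APIs like NETCONF should avoid them. Any character can instead be written as a numeric character reference, like &#60; for < and &#62; for >.

Example: For a less-than symbol, you can write &#60; instead of <.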

<example>&#60;tag&#62;Content&#60;/tag&#62;</example>

To simplify special character encoding, you can use CDATA sections to enclose entire raw character strings. Inside a CDATA section, < and & are treated as literal text; the only sequence that cannot appear is the terminator ]]>.

Example:

<greeting><![CDATA[Hello, world! This is a test: 1 < 2 and 3 > 2.]]></greeting>

This method lets you include complex characters without disrupting the XML structure.
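
To see that these encodings are interchangeable, the sketch below uses Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree) to encode a string and then parse three variants of the same element. The sample text and variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "1 < 2 & 3 > 2"

# escape() replaces &, <, and > with the predefined named entities.
escaped = escape(raw)
print(escaped)

# The same text written three ways: named entities, numeric references, CDATA.
named = ET.fromstring("<example>" + escaped + "</example>").text
numeric = ET.fromstring("<example>1 &#60; 2 &#38; 3 &#62; 2</example>").text
cdata = ET.fromstring("<example><![CDATA[1 < 2 & 3 > 2]]></example>").text

# All three decode to the original string.
print(named == numeric == cdata == raw)
```

When generating XML, letting a library do the escaping is usually safer than substituting characters by hand.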

XML Prologue

The XML prologue opens with the XML declaration, which, when present, must be the very first thing in the file and is written between <? and ?>. It starts with the name xml, followed by attribute-like settings for the version and character encoding. The most common version is "1.0". "UTF-8" is the usual encoding and is what parsers assume when none is declared, although "UTF-16" may also be used.

<?xml version="1.0" encoding="UTF-8"?>

Including the prologue ensures that XML documents are reliably interpreted by parsers, editors, and other software tools.
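
The effect of the declaration can be sketched with Python's standard xml.etree.ElementTree module. The first part writes a document with an explicit declaration (the xml_declaration argument assumes Python 3.8 or later); the second parses a UTF-16 document, where the declared encoding and the byte order mark tell the parser how to decode the bytes. The element name and text are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "caf\u00e9"  # contains one non-ASCII character

# Serialize to bytes with an explicit XML declaration.
utf8_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# Build a UTF-16 document by hand; Python's "utf-16" codec prepends a
# byte order mark, and the declaration names the encoding.
utf16_source = '<?xml version="1.0" encoding="UTF-16"?><greeting>caf\u00e9</greeting>'
utf16_bytes = utf16_source.encode("utf-16")

# Both byte strings decode to the same text.
from_utf8 = ET.fromstring(utf8_bytes).text
from_utf16 = ET.fromstring(utf16_bytes).text
print(from_utf8 == from_utf16)
```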

Comments in XML

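XML files can include comments, using the same <!-- ... --> convention used in HTML documents. A comment can appear between elements or within element content, but not inside a tag or before the XML declaration, and its text must not contain a double hyphen (--). For example:

<!-- This is an XML comment. -->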

XML Attributes

XML allows you to embed attributes within tags to convey additional information. In the example below, the XML version number and character encoding are specified in the XML declaration, while the vmid and type values, which were child elements in the earlier example, are carried as attributes of the vm tags:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Instance list -->
<vms>
  <vm vmid="0101af9811012" type="t1.nano"/>
  <vm vmid="0102bg8908023" type="t1.micro"/>
</vms>

Key Points:

  • Attribute values must be in quotes, either single or double.
  • Each attribute name must be unique within an element.
  • An element with no content can use the shorthand notation with a trailing slash, such as <vm/> in place of <vm></vm>.
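
The key points above can be checked with a short sketch using Python's standard xml.etree.ElementTree module. The helper is_well_formed is a hypothetical name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<vms>'
    '<vm vmid="0101af9811012" type="t1.nano"/>'
    '<vm vmid="0102bg8908023" type="t1.micro"/>'
    '</vms>'
)

# Attributes are exposed as a dictionary on each element.
first = root[0]
print(first.attrib)
print(first.get("type"))


def is_well_formed(text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


quoted_ok = is_well_formed('<vm vmid="1"/>')              # quoted value
unquoted_ok = is_well_formed('<vm vmid=1/>')              # missing quotes
duplicate_ok = is_well_formed('<vm vmid="1" vmid="2"/>')  # repeated name

# The shorthand and the explicit pair both describe an empty element.
short_form = ET.fromstring("<vm/>")
long_form = ET.fromstring("<vm></vm>")
print(quoted_ok, unquoted_ok, duplicate_ok)
```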

XML Namespaces

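Some XML messages need to reference specific namespaces, which define a set of tag names and their usage and keep them from colliding with identically named tags from other vocabularies. Namespaces are defined by organizations and standards bodies and are identified by Uniform Resource Identifiers (URIs), which can be URLs or Uniform Resource Names (URNs). The identifier only needs to be unique: a parser does not fetch anything from it, and a URN in particular names the namespace without implying any physical location.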

Example:

<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <kill-session>
    <session-id>4</session-id>
  </kill-session>
</rpc>

In this example, the xmlns attribute declares the NETCONF base 1.0 namespace as the default for the <rpc> element and everything nested inside it, and the message-id attribute gives the message ID. The <kill-session> element instructs the remote entity to terminate the session identified by <session-id>.
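
Namespaces affect how elements are looked up in code. In the sketch below, which uses Python's standard xml.etree.ElementTree module, every tag under the default namespace is stored with the namespace URI attached, so searches need a prefix mapping. The prefix nc and the variable names are arbitrary choices made for this example.

```python
import xml.etree.ElementTree as ET

RPC = """<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <kill-session>
    <session-id>4</session-id>
  </kill-session>
</rpc>"""

# Map a local prefix of our choosing to the namespace URI.
NS = {"nc": "urn:ietf:params:xml:ns:netconf:base:1.0"}

root = ET.fromstring(RPC)

# The parser stores the tag as {namespace-uri}local-name.
print(root.tag)

# Unprefixed attributes are not placed in the default namespace.
message_id = root.get("message-id")

# A search without the namespace finds nothing; with the mapping it works.
unqualified = root.find("kill-session")
session_id = root.findtext("nc:kill-session/nc:session-id", namespaces=NS)
print(message_id, unqualified, session_id)
```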

Interpreting XML

Returning to the instance list from the attributes section, we can see a one-dimensional array of objects called vms, each delimited by a vm tag. Each instance carries two key-value pairs: a unique instance ID and a VM server type.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Instance list -->
<vms>
  <vm vmid="0101af9811012" type="t1.nano"/>
  <vm vmid="0102bg8908023" type="t1.micro"/>
</vms>

A similar Python data structure can be represented as follows:

vms = [
    {
        "vmid": "0101af9811012",
        "type": "t1.nano"
    },
    {
        "vmid": "0102bg8908023",
        "type": "t1.micro"
    }
]

XML does not explicitly indicate that a set of tags and data should be treated as a list, so the reader has to infer the writer's intention. This ambiguity is reduced in more recent formats like JSON and YAML, where the repeated <vm> tags are replaced by an explicit list, written with square brackets ([ ]) in JSON. Loading such a document results in a Python list of 'vm objects,' each containing two key/value pairs.
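
The inference step can be made explicit in code. This minimal sketch uses Python's standard xml.etree.ElementTree and json modules and assumes that every vm child of the root belongs in the list; that decision is made by the code, not by anything in the XML itself.

```python
import json
import xml.etree.ElementTree as ET

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- Instance list -->
<vms>
  <vm vmid="0101af9811012" type="t1.nano"/>
  <vm vmid="0102bg8908023" type="t1.micro"/>
</vms>
"""

root = ET.fromstring(DOC)

# We choose to treat the repeated <vm> children as a list of dictionaries.
vms = [dict(vm.attrib) for vm in root.findall("vm")]

# In JSON the list is explicit: the output starts and ends with square brackets.
as_json = json.dumps(vms)
print(as_json)
```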