The Document Type Definition

The remainder of this chapter focuses on the DTD: how to structure a DTD and how to create a DTD for your documents. As mentioned previously, the DTD acts as a rule book that allows authors to create new documents of the same type and with the same characteristics as a base document. For example, suppose a DTD was created for use by the medical community. Documents created with this DTD could contain items such as a patient's name, medications, medical history, and so on. The information could easily be read by any medical institution that used an XML-based document system. This system would not only provide a standardized document format for all organizations, but it would also provide a format to be used within departments of a single organization. The same document format could be used by doctors, nurses, administrative staff, pharmacists, specialists, and others. Another DTD advantage, which will be discussed shortly, is that a DTD can be modified to suit the needs of a particular application. This is where the notion of subclassing comes in.

DTD Structure

A DTD can comprise two parts: an external DTD subset and an internal DTD subset. An external DTD subset is a DTD that exists outside the content of the document, which is usually the case when a common DTD is used, as in the medical example above. An internal DTD subset is a DTD that is included within the XML document. A document can contain one or both types of subsets. If a document contains both, the internal subset is processed first and takes precedence over any external subset. This is beneficial when an author is using an external DTD but wants to customize some parts of the DTD for a specific application. We'll look at an example of this in the section Class DTDs.
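
As a preview, a declaration that combines both subsets might look like the following sketch. It uses the document type declaration syntax introduced in the next paragraphs and an entity declaration of the kind reviewed later in this chapter. The SEASON entity is a hypothetical name used only for illustration, and the sketch assumes that the external DTD file is stored locally as Wldflr.dtd.

<!DOCTYPE catalog SYSTEM "Wldflr.dtd" [
  <!-- Internal subset: read first, so this declaration is used even if
       the external subset declares an entity with the same name. -->
  <!ENTITY SEASON "Spring">
]>

Everything between the square brackets is the internal subset; everything in Wldflr.dtd is the external subset.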

If you want to include an internal DTD subset in your document, you simply write it directly in the document type declaration. An external DTD subset, however, must be included via a DTD reference, which tells the processor where to find the external subset by specifying the name of the DTD file. The DTD reference also contains information about the creator of the DTD, about the purpose of the DTD, and about the language used. This is demonstrated in the declaration below:

<!DOCTYPE catalog PUBLIC "-//flowers//DTD Standard //EN"
  "http://www.wildflowers.com/dtd/Wldflr.dtd">

Creating a Simple DTD

Before we get into too much detail, let's create a document with a simple DTD to see what one looks like. We'll modify the Memo document created in Chapter 2 so that it is an Email document with an internal DTD subset. The new document, which you can find in the Chap04\Lst4_1.xml file on the companion CD, is shown in Code Listing 4-1.

Code Listing 4-1.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>
  <!ELEMENT TO (#PCDATA)>

  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

<EMAIL>
  <TO>Jodie@msn.com</TO>
  <FROM>Bill@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello, World!</BODY>
</EMAIL>

Notice that the code contains additional information in the document type declaration. This is the internal DTD subset, and it identifies the elements that are allowed in the document and the type of data they can contain. If you run this document by displaying the HTML page found in the Chap04\Lst4_1.htm file on the companion CD, and then clicking the Start button, the document looks similar to the one that was created in Chapter 2 (without the DTD), as shown in Figure 4-2.

Figure 4-2. The Email document with an internal DTD subset.

NOTE
The HTML page that displays the XML document is not discussed in detail here, but it is similar to the HTML page we created in Chapter 2. You can use the Lst4_1.htm page to view all the sample XML documents in this chapter. Simply change the filename in the xmlDoc.load statement from Lst4_1.xml to the filename of the XML document you want to display.

This document differs from the document created in Chapter 2 because in this case the XML processor is validating the document against the DTD. In other words, the HTML page that displays the document is parsing the document using a validating processor. This simply means that the processor is checking the document against the DTD to make sure that everything used in the document is allowed by the DTD.

NOTE
Not every XML processor is a validating processor. A nonvalidating processor can sometimes be useful as a performance improvement when the processor does not need to read and validate against a DTD, as is the case with many applications.

To see how validation works, let's add an element to the document that is not part of the DTD. We'll add a Signature element near the end of the document, as shown below.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

<EMAIL>
  <TO>Jodie@msn.com</TO>
  <FROM>Bill@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello, World!</BODY>
  <SIGNATURE>Bill</SIGNATURE>
</EMAIL>

If you try to run this document, you will see an error message similar to the following because the processor does not find a declaration for the Signature element in the DTD:

Element content is invalid according to the DTD/Schema.

NOTE
In addition to processing the document via an HTML page, you can simply run the XML code through a validating parser to see the results. In this example, the validating parser is activated from a command-line interface. To learn how to do this, see the "Using Msxml from the Command Line" section in the Introduction of this book.

Back to the original XML file. This time, we'll change part of the DTD. Notice the first element declaration concerns the Email element:

<!ELEMENT EMAIL (TO, FROM, CC, SUBJECT, BODY)>

Within this line of code, in parentheses, is a list of the other content elements that the document can contain. This list is called a content model, and it identifies the child elements that the Email element must contain and the order of those child elements. (For more details on the content model, see the next section.) If you remove the Subject element from the content model, the DTD looks like this:

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO, FROM, CC, BODY)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

As you might expect, running the document causes the processor to return an error because the document did not follow the specified content model. Let's go back to the original document to change the order of the From element and the Cc element so that the bottom portion of the document looks like this:

<EMAIL>
  <TO>Jodie@msn.com</TO>
  <CC>Philip@msn.com</CC>
  <FROM>Bill@msn.com</FROM>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello, World!</BODY>
</EMAIL>

Again, when you try to run the document, an error is returned because the processor got one element when it was expecting another.

By now it should be clear to you that the DTD acts as a strict rule book for the XML document. Because the rules are strict, it is important that you take care while authoring your DTDs if you plan to use them. The DTD shown in this chapter has been simple, not to mention rigid. The rest of this chapter looks at the other pieces that can be added to a DTD to make it more robust and flexible.

Element Declarations

Each element declaration contains the element name and the type of data the element contains, called its content specification, which consists of one of four types:

A list of other elements, called the content model

The keyword EMPTY

The keyword ANY

Mixed content

The Empty-Element Declaration

To declare that an element cannot contain any content, you can use the keyword EMPTY in the element declaration, as shown here:

<!ELEMENT TEST EMPTY>

A Test element in a document containing the above declaration could never contain content and would be required to be an empty element, such as <TEST/>. Although it might seem that empty elements would not be useful, they can contain attributes that provide meaningful content or they can provide specific functions in a document. The <BR> tag in HTML is an example of an empty-element tag. The <BR> tag tells an HTML processor to insert a line break in a document, but the element never contains content.

The Any-Element Declaration

At the opposite end of the scale is the ANY content specification. If an element declaration uses the keyword ANY for the content specification, that type of element can contain any content allowed by the DTD in any order. The any-element declaration looks like this:

<!ELEMENT TEST ANY>

Mixed Content

The content specification can also be a single set of alternatives in which the alternatives are separated by pipe symbols (|). For example:

<!ELEMENT EXAMPLE (#PCDATA|x|y|z)*>

The use of #PCDATA, the child element names (x, y, and z in this example), the pipe symbol (|), and the asterisk (*) is discussed in detail below.
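
A minimal sketch of a document that satisfies this declaration follows. The declarations for the x, y, and z elements are assumptions added so that the document is complete; the original declaration does not say what those elements contain.

<?xml version="1.0"?>

<!DOCTYPE EXAMPLE [
  <!ELEMENT EXAMPLE (#PCDATA|x|y|z)*>
  <!-- Assumed declarations for the child elements. -->
  <!ELEMENT x (#PCDATA)>
  <!ELEMENT y (#PCDATA)>
  <!ELEMENT z (#PCDATA)>
]>

<EXAMPLE>Some text, then <y>a y element</y>, more text, <x>an x element</x>, and <y>another y element</y>.</EXAMPLE>

Because the group is followed by an asterisk, text and the listed child elements can be mixed in any order and any number of times. Here the z element never appears, and the document should still be valid.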

Data Types

XML stays relatively simple when it comes to including data types, but some tricky issues are worth a look. In document content, XML allows for parsed character data (declared with the keyword #PCDATA, as shown above) and character data (declared with the keyword CDATA). Parsed character data is text that the processor scans for markup, so any tags and entity references it contains are recognized and processed. Character data is ordinary text that is not scanned for markup, so it can include characters normally reserved for markup. XML processors assume that content in an XML file is parsed character data by default. (The exception to this is attribute data, which is generally character data. This is covered in detail later in this chapter.)

While parsed character data is usually used in the content of an XML document, character data can be used when an author wants to include data that does not get parsed. For example, examine the usage of a character data section in the following document:

<?xml version="1.0"?>

<LESSON>
  <TITLE>Working with XML Markup</TITLE>
  <EXAMPLE>
    <![CDATA[<ELEMENT>A sample element</ELEMENT>]]>
  </EXAMPLE>
</LESSON>

The data in the Example element will be displayed as <ELEMENT>A sample element</ELEMENT>, and the markup tags will not be parsed. As shown here, to declare a section as character data, you must mark the beginning of the section with the sequence <![CDATA[ and mark the end with the sequence ]]>. Any data that resides inside this set of markers will be interpreted as straight unparsed data.
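
For comparison, here is a sketch of the same Example element written without a character data section. It relies only on the predefined entities &lt; and &gt; that are built into XML, and it should be displayed the same way as the version above.

<EXAMPLE>
  &lt;ELEMENT&gt;A sample element&lt;/ELEMENT&gt;
</EXAMPLE>

For a short string, either form works. A character data section tends to be easier to read when the text contains many markup characters.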

Structure Symbols

XML uses a set of symbols for specifying the structure of an element declaration. You have already seen some of these symbols, such as the pipe and the comma. Table 4-1 identifies each of the available symbols, the purpose of each symbol, an example of how each is used, and what each symbol means.

Table 4-1 Element Declaration Symbols

Symbol	Purpose	Example	Meaning
Parentheses	Encloses a sequence, a group of elements, or a set of alternatives	(content1, content2)	Element must contain the sequence content1 and content2.
Comma	Separates items in a sequence and identifies the order in which they must appear	(content1, content2, content3)	Element must contain content1, content2, and content3 in the specified order.
Pipe	Separates items in a group of alternatives	(content1 | content2 | content3)	Element must contain either content1, content2, or content3.
Question mark	Indicates that an item must appear one time or not at all	content1?	Element might contain content1. If content1 does appear, it must appear only once.
Asterisk	Indicates that the item can appear as many times as the author wants, including not at all	content1*	Element can contain content1. If it appears, it can appear once or more.
Plus sign	Indicates that an item must appear once or more	content1+	Element must contain content1 at least once, but it can appear more than once.
No symbol	Indicates that exactly one item must appear	content1	Element must contain content1.

Let's look at a simple example by adding to the content model from our example document.

<!ELEMENT EMAIL (TO+, FROM, CC*, SUBJECT?, BODY?)>

This declaration indicates that:

The To element is required and can appear more than once.

The From element must appear exactly once.

The Cc element is optional and can appear any number of times.

The Subject element is optional, but it can appear only once if included.

The Body element is optional, but it can appear only once if included.
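
The rules above can be sketched with a document that uses the revised content model. This example is an illustration rather than one of the chapter's listings: it repeats the To element and leaves out the optional Cc and Subject elements, and a validating processor should accept it.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO+, FROM, CC*, SUBJECT?, BODY?)>
  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>
  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

<EMAIL>
  <!-- TO+ allows one or more To elements. -->
  <TO>Jodie@msn.com</TO>
  <TO>Philip@msn.com</TO>
  <!-- FROM has no symbol, so exactly one is required. -->
  <FROM>Bill@msn.com</FROM>
  <!-- CC* and SUBJECT? are both omitted here. -->
  <BODY>Hello, World!</BODY>
</EMAIL>

The order still matters. Moving the From element above the To elements, or adding a second From element, should cause a validation error.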

Attributes

In addition to defining the structure of an element and the kind of content it contains, you can associate attributes with an element. Attributes provide additional information about the element or the content of that element. If you work with HTML, you are familiar with attributes. Take, for example, the following HTML code:

<HTML>

  <HEAD>
    <TITLE>Database Web Site</TITLE>
  </HEAD>

  <BODY>
    <A HREF="http://mspress.microsoft.com">
      Click here for a Web link
    </A>
    <BR>
    <IMG SRC="Schemas2.gif" border="0" ALT="A Schema Map">
  </BODY>

</HTML>

The Anchor element (indicated by the <A> tag) and the Image element (indicated by the <IMG> tag) both contain attributes: the Anchor element contains the HREF attribute, and the Image element contains the attributes SRC, BORDER, and ALT. These attributes provide additional information of use mostly to the browser. Notice that the Anchor element contains the content, Click here for a Web link, but the Image element appears to be an empty element—it contains no visible content. In reality, however, this is not entirely true. While the element does not contain content, the SRC attribute is a filename that tells the processor which file to display. So in this case, a graphic file is displayed on the Web page. This example presents an important aspect about attributes. Attributes can, and usually do, contain important information that is not part of the content of the element. This means that, while the results of the attribute are usually visible, often the attribute itself is more important to the XML processor than it is to the person viewing the content.

Attribute Declarations

In XML, attributes are declared in the DTD using the following syntax:

<!ATTLIST ElementName AttributeName Type Default>

Here <!ATTLIST> is the tag that identifies an attribute declaration. The ElementName entry is the name of the element to which the attribute(s) apply. The AttributeName entry, obviously, is the name of the attribute. The Type entry identifies the type of attribute being declared. The Default entry specifies the default for the attribute.

NOTE
The attribute declaration can be located anywhere in the DTD, but keeping the attribute declaration close to the corresponding element declaration can make the DTD easier for humans to read. You can also include multiple attribute declarations for a single element. In this case, the processor will combine all the declarations into one big list. If the processor encounters more than one declaration for the same attribute, only the first one will be counted.

Table 4-2 lists the types of attributes that are available in XML.

Table 4-2 Attribute Types in XML

Attribute Type	Usage
CDATA	Only character data can be used in the attribute.
ENTITY	Attribute value must refer to an external binary entity declared in the DTD.
ENTITIES	Same as ENTITY, but allows multiple values separated by white space.
ID	Attribute value must be a unique identifier. If a document contains ID attributes with the same value, the processor should generate an error.
IDREF	Value must be a reference to an ID declared elsewhere in the document. If the attribute does not match the referenced ID value, the processor should generate an error.
IDREFS	Same as IDREF, but allows multiple values separated by white space.
NMTOKEN	Attribute value is any mixture of name token characters, which must be letters, numbers, periods, dashes, colons, or underscores.
NMTOKENS	Same as NMTOKEN, but allows multiple values separated by white space.
NOTATION	Attribute value must refer to a notation declared elsewhere in the DTD. Declaration can also be a list of notations. The value must be one of the notations in the list. Each notation must have its own declaration in the DTD.
Enumerated	Attribute value must match one of the included values. For example: <!ATTLIST MyElement MyAttribute (content1|content2) #IMPLIED>.

The final part of the attribute declaration is the default for the attribute value. The default can come in one of four types. Table 4-3 shows the available attribute defaults.

Table 4-3 Attribute Defaults

Default	Usage
#REQUIRED	Every element containing this attribute must specify a value for that attribute. A missing value results in an error.
#IMPLIED	This attribute is optional. The processor can ignore this attribute if no value is found.
#FIXED fixedvalue	This attribute must have the value fixedvalue. If the attribute is not included in the element, fixedvalue is assumed.
default	Identifies a default value for an attribute. If the element does not include the attribute, the value default is assumed.
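
As a sketch of how the ID and IDREF types from Table 4-2 work with the #REQUIRED and #IMPLIED defaults from Table 4-3, consider the following small document. The THREAD element and the MSGID and REPLYTO attributes are hypothetical names that do not appear in the chapter's sample files.

<?xml version="1.0"?>

<!DOCTYPE THREAD [
  <!ELEMENT THREAD (EMAIL+)>
  <!ELEMENT EMAIL (#PCDATA)>
  <!-- MSGID and REPLYTO are hypothetical attribute names. -->
  <!ATTLIST EMAIL
    MSGID ID #REQUIRED
    REPLYTO IDREF #IMPLIED>
]>

<THREAD>
  <EMAIL MSGID="m1">Hello, World!</EMAIL>
  <EMAIL MSGID="m2" REPLYTO="m1">Hello to you, too.</EMAIL>
</THREAD>

Every Email element must carry a MSGID value, and no two values can be the same. The REPLYTO attribute can be left out, but when it is present its value must match a MSGID value somewhere in the document. Changing REPLYTO to a value such as m3 should cause a validation error.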

Let's take a look at how attributes are used by adding some attribute declarations to the DTD of the sample document:

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"

    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>

  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">

  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

In this example, attributes have been added to two elements, Email and the new element Bcc. The first attribute added to the Email element is LANGUAGE. The LANGUAGE attribute can contain one of several options. The attribute will contain the default value, Western, if none other is specified. The next attribute in the Email element is ENCRYPTED. This attribute must contain character data, and since the default is #IMPLIED, the processor will simply ignore this attribute if no value is specified. The last attribute in the Email element is PRIORITY. The PRIORITY attribute can have any one of the three values NORMAL, LOW, and HIGH. The default value is NORMAL.

The HIDDEN attribute has been included for the Bcc element. The HIDDEN attribute is a CDATA type, and since it has a default of #FIXED, the default value is specified following the keyword #FIXED. This attribute must always have the value specified in the DTD, in this case TRUE.

NOTE
Even though the attribute name is HIDDEN and the value is TRUE, the XML processor does not actually know what that means. In other words, the word "HIDDEN" has no special meaning in XML; it simply happens to be the attribute name used. It will be up to the application to know what to do with the attribute and its value.
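
The effect of the #FIXED default can be sketched with three versions of the Bcc element, each checked against the HIDDEN declaration shown above. The comments describe the behavior expected from a validating processor; the exact error text depends on the processor.

<!-- Attribute omitted: the processor supplies HIDDEN="TRUE". -->
<BCC>Naomi@msn.com</BCC>

<!-- Attribute present with the fixed value: allowed. -->
<BCC HIDDEN="TRUE">Naomi@msn.com</BCC>

<!-- Attribute present with any other value: a validation error. -->
<BCC HIDDEN="FALSE">Naomi@msn.com</BCC>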

How Attributes Work in an XML Document

Let's put the DTD together with the rest of the document to see how the attributes look in the document. The code, which you can find in the Chap04\Lst4_2.xml file on the companion CD, is shown in Code Listing 4-2.

Code Listing 4-2.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>

  <!ATTLIST EMAIL

    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>

  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">

  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>
]>

<EMAIL LANGUAGE="Western" ENCRYPTED="128" PRIORITY="HIGH">
  <TO>Jodie@msn.com</TO>
  <FROM>Bill@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <BCC>Naomi@msn.com</BCC>
  <SUBJECT>My First DTD</SUBJECT>
  <BODY>Hello, World!</BODY>
</EMAIL>

Note that the Email element in the code listing includes all of the attributes and specifies a value for each one. In this code listing, the Bcc element includes no attribute. Since the HIDDEN attribute has a default of #FIXED, the processor will assume the value from the DTD. When the document is displayed, it appears as shown in Figure 4-3.

Figure 4-3. An XML document with attributes.

If you've noticed that this document looks the same as the document without attributes, you've been paying attention! This figure illustrates the idea that attributes often provide more information to the processor and the application than they provide to the user. Although the attributes in this document can affect how the content appears (for example, when a different value is supplied for the LANGUAGE attribute), it is up to the application to make use of the information that's provided by the document. Since none of the attributes used in our example changed the appearance of the content, there was no visible difference.

Entities

Recall from Chapter 3 the concepts of physical structure and entities. In addition to the general entity discussed in that chapter, another kind of entity exists called the parameter entity. This section covers both types of entities in detail and describes how you declare entities in a DTD. First let's review general entities.

General Entities: A Review

You know that entities are used as containers for content and that the content can reside in the XML document (as an internal entity) or outside the document in an external file (an external entity). Most entities must be declared in the DTD. (You'll recall that some predefined entities are already built into XML and are used to display characters normally used for markup.) Entity declarations follow the same basic syntax used by other declarations:

<!ENTITY EntityName EntityDefinition>

Entities in the DTD can be parsed or unparsed. Parsed entities, or text entities, contain text that becomes part of the XML document. Unparsed entities, or binary entities, are usually references to an external binary file. Unparsed entities can also be text that is not parsable, so it is best to think of unparsed entities as items that are not intended to be treated as XML.

Internal Entities

Internal entities are declared in the DTD and contain the content that will be used in the document. This line adds an internal entity called SIGNATURE to the example XML document:

<!ENTITY SIGNATURE "Bill">

This entity will be added to the DTD, and (as you will see in the "Entity References" section later in this chapter) whenever that entity is referenced in the document, it will be replaced with the content of the entity (Bill).

External Entities: SYSTEM and PUBLIC Keywords

Here's an external entity declaration you can add to the DTD. This entity references an external GIF file and will appear in the body of the XML document:

<!ENTITY IMAGE1 SYSTEM "Xmlquot.gif" NDATA GIF>

Notice that the external entity declaration differs from the internal entity declaration: it uses an additional keyword (SYSTEM) following the entity name (IMAGE1).

NOTE
External entities can also reference other XML files. For example, if you were putting together a book in XML, the master book document could use entity references to the chapters. An entity declaration for a chapter might be <!ENTITY CHAP01 SYSTEM "Chapter1.xml">. Using such a declaration would greatly reduce the size of the master document and allow the individual chapter files to be self-contained.
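
A sketch of such a master document follows. The BOOK element and the Chapter2.xml file are hypothetical, and the example contains no element declarations, so it is meant to be read by a nonvalidating processor. It also assumes that each chapter file contains well-formed content that can legally appear inside the BOOK element.

<?xml version="1.0"?>

<!DOCTYPE BOOK [
  <!-- Each entity points to a separate chapter file. -->
  <!ENTITY CHAP01 SYSTEM "Chapter1.xml">
  <!ENTITY CHAP02 SYSTEM "Chapter2.xml">
]>

<BOOK>
  &CHAP01;
  &CHAP02;
</BOOK>

Keep in mind that a nonvalidating processor is allowed to skip external entities, so the chapters might not be pulled in by every processor.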

An external entity declaration can include the SYSTEM keyword or the PUBLIC keyword. Many DTDs are developed locally—that is, they are developed for a specific organization or business or they are used only for a specific Web site. In this case, the SYSTEM keyword should be used. The SYSTEM keyword is followed by a URI (Uniform Resource Identifier) that tells the processor where to find the object referenced in the declaration. In the example above, the filename was used because the code is for local use. In the following declaration, the URI is a Web address that points to the location of the referenced file:

<!ENTITY IMAGE1 SYSTEM
  "http://www.XMLCo.com/Images/Xmlquot.gif" NDATA GIF>

Some DTDs are established standards that are available to a wide range of users. In this case, the PUBLIC keyword should be used, followed by the public identifier, which the processor can use to locate the resource if a standards library is available. Following the public identifier is a URI, similar to the URI used with the SYSTEM keyword in the preceding example. A declaration that uses the PUBLIC keyword might look like this:

<!ENTITY IMAGE1 PUBLIC "-//XMLCo//TEXT Standard images//EN"
  "http://www.XMLCo.com/Images/Xmlquot.gif" NDATA GIF>

External Entities: Notations and Notation Declarations

Again, consider the entity declaration:

<!ENTITY IMAGE1 SYSTEM "Xmlquot.gif" NDATA GIF>

A notation (NDATA GIF) appears at the end of the declaration. This notation tells the processor what type of object is being referenced. At this point, if you simply added the entity declaration to the DTD and ran it through a processor, you'd get an error like the following:

Declaration `IMAGE1' contains reference to undefined notation `GIF'.

The error results because the entity declaration is referencing a binary file type and the processor has not been told what to do with the binary file. Remember that this is an unparsed entity that the processor cannot "understand." In this case, the notation must be declared as a notation declaration. A notation declaration tells the processor how to deal with a specific binary file type.

NOTE
Although the information in a notation declaration will usually identify a "helper" application, the XML specification does not require this. The processor passes this information on to the processing application; it doesn't care whether the application can understand it or not. The information in the declaration could also be used for other purposes, such as to provide a message to the application that could then be displayed to the user.

Notation declarations follow this format:

<!NOTATION GIF SYSTEM "Iexplore.exe">

This declaration tells the processor that whenever it encounters a GIF file in the DTD, it should use the Iexplore.exe program to process the file.
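
Putting the pieces together, here is a sketch of how the notation declaration, the unparsed entity, and a reference to that entity might fit in one document. Because an unparsed entity cannot be referenced with the &EntityName; syntax, it is named as the value of an attribute of type ENTITY (see Table 4-2). The PICTURE element and its SOURCE attribute are hypothetical names chosen for this illustration, and the content model is shortened to keep the example small.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (BODY, PICTURE)>
  <!ELEMENT BODY (#PCDATA)>

  <!-- PICTURE and SOURCE are hypothetical names. -->
  <!ELEMENT PICTURE EMPTY>
  <!ATTLIST PICTURE
    SOURCE ENTITY #REQUIRED>

  <!NOTATION GIF SYSTEM "Iexplore.exe">
  <!ENTITY IMAGE1 SYSTEM "Xmlquot.gif" NDATA GIF>
]>

<EMAIL>
  <BODY>Hello, World!</BODY>
  <PICTURE SOURCE="IMAGE1"/>
</EMAIL>

The processor does not open the GIF file itself. It reports the entity and its notation to the application, which decides what to do with them.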

We'll add a simple entity declaration to the sample DTD:

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>

  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">

  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>

  <!ENTITY SIGNATURE "Bill">
  
]>

The DTD now includes an entity declaration. But as with other declarations, this is not much good unless it is used, or referenced, in the actual XML document.

Entity References

You'll recall from Chapter 3 that entity references use a specific syntax within a document: &EntityName;. When a processor encounters an entity reference in a document, the reference tells the processor that it should replace that reference with the content declared in the entity. Code Listing 4-3 shows changes to the XML sample document (which you can find in the Chap04\Lst4_3.xml file on the companion CD) and adds an entity reference:

Code Listing 4-3.

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>

  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">

  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>

  <!ENTITY SIGNATURE "Bill">
]>

<EMAIL LANGUAGE="Western" ENCRYPTED="128" PRIORITY="HIGH">
  <TO>Jodie@msn.com</TO>
  <FROM>&SIGNATURE;@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <BCC>Naomi@msn.com</BCC>
  <SUBJECT>Sample Document with Entity References</SUBJECT>

  <BODY>
    Hello, this is &SIGNATURE;.
    Take care, -&SIGNATURE;
  </BODY>
</EMAIL>

Notice that in this code, an entity reference to the entity SIGNATURE appears wherever the word Bill should appear. Figure 4-4 shows how the document looks when it is processed and displayed.

Figure 4-4. An XML document that uses entities.

To briefly demonstrate the power of entities, let's change the SIGNATURE entity declaration to the following:

<!ENTITY SIGNATURE "Colleen">

When the document is processed, at each location in which the SIGNATURE entity is referenced, the content will appear changed, as shown in Figure 4-5.

Figure 4-5. Changing an entity declaration changes content throughout the document.

Parameter Entities

Although parameter entities work in much the same way that general entities work, they have one important syntactical difference. Parameter entities use the percent symbol (%) in both the declaration and the reference. In the entity declaration, the percent symbol follows the keyword !ENTITY but precedes the entity name, as shown here. (Note that white space is required before and after the % symbol.)

<!ENTITY % ENCRYPTION "CDATA #IMPLIED">

This entity can now be referenced elsewhere in the DTD. For example:

<!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED %ENCRYPTION;
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

Notice that the parameter entity reference (%ENCRYPTION;) uses the same basic format used by the general entity reference, except that the % replaces the &. Also notice that a space is not required following the % in the entity reference.

NOTE
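Parameter entities are restricted to the DTD. You cannot reference a parameter entity within an XML document element. In addition, within the internal DTD subset a parameter entity reference can appear only between declarations, not inside one, so a reference placed inside a declaration (like the one shown above) must be located in an external DTD subset.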

As you can see, parameter entities can be a powerful way to create your own shorthand in your DTDs and make them more concise and better organized. These entities should be used with caution, however, since they can create complexity within a document that makes it difficult to manage. For example, you could reference several other parameter entities inside a single parameter entity declaration. As the author, you must be sure that those references actually point to something and that the content is valid.
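
As a sketch of this kind of shorthand, the following fragment defines one parameter entity that holds two attribute definitions and then reuses it for two elements. The ADDRESS.ATTS entity, the NAME and TYPE attributes, and the Email.dtd filename are all hypothetical. The fragment is assumed to live in an external DTD file, because a parameter entity reference inside a declaration is not allowed in the internal DTD subset.

<!-- Contents of a hypothetical external DTD file, Email.dtd -->
<!ENTITY % ADDRESS.ATTS
  "NAME CDATA #IMPLIED
   TYPE (Personal|Business) 'Personal'">

<!ELEMENT TO (#PCDATA)>
<!ATTLIST TO %ADDRESS.ATTS;>

<!ELEMENT FROM (#PCDATA)>
<!ATTLIST FROM %ADDRESS.ATTS;>

Changing the entity declaration changes the attribute list of both elements at once, which is the benefit and also the risk of this technique.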

The IGNORE and INCLUDE Keywords

The IGNORE and INCLUDE keywords can be used by authors to turn portions of the DTD "on" or "off." IGNORE and INCLUDE are used in the DTD to create conditions in the document that are suitable for various purposes. For example, using IGNORE and INCLUDE allows an author to test various structures while tracking the variations. IGNORE and INCLUDE are used in much the same way that CDATA is used:

<![IGNORE [DTD section]]>
<![INCLUDE [DTD section]]>

Neither keyword can appear inside a declaration, and each DTD section must include an entire declaration or a set of declarations, comments, and white space. Here is an example of how the keywords can be used:

<![IGNORE[<!ELEMENT BCC (#PCDATA)>
<!ATTLIST BCC
  HIDDEN CDATA #FIXED "TRUE">]]>
<![INCLUDE[<!ELEMENT SUBJECT (#PCDATA)>]]>

This fragment tells the processor to ignore the Bcc element and attribute list and to include the Subject element. As you look over this code, you might think that the IGNORE keyword seems useful but that the INCLUDE keyword seems unnecessary. You could accomplish the same effect by eliminating the INCLUDE keyword. However, INCLUDE proves its worth any time you want to quickly change what is being included or ignored in the document. Consider the following code, which changes both keywords into parameter entities and then places content within the appropriate sections:

<!ENTITY % SECURE "IGNORE">
<!ENTITY % UNSECURE "INCLUDE">
<![%SECURE; [any number of declarations go here]]>
<![%UNSECURE; [any number of declarations go here]]>

In this case, the various declarations can be turned on or off easily by changing their placement or by modifying the entity declarations.
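
A filled-in version of this pattern might look like the following sketch. It reuses the BCC and SUBJECT declarations from the earlier fragment and assumes the code sits in an external DTD file. Which declarations count as "secure" is an assumption made only for illustration.

```xml
<!-- Swap these two values to reverse which section is active. -->
<!ENTITY % SECURE "IGNORE">
<!ENTITY % UNSECURE "INCLUDE">

<![%SECURE;[
  <!-- Skipped while SECURE is "IGNORE". -->
  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">
]]>

<![%UNSECURE;[
  <!-- Read while UNSECURE is "INCLUDE". -->
  <!ELEMENT SUBJECT (#PCDATA)>
]]>
```

Only the two entity declarations at the top need to change; the sections themselves stay untouched.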

Processing Instructions

Processing instructions (PIs) provide instructions for the application that's processing the document. PIs usually appear in the document prolog, but they can also appear within the document content or after the document element. The most familiar use of this syntax is the XML declaration included at the top of our sample XML documents:

<?xml version="1.0"?>

PIs are written with the sequence <?, followed by the PI name, followed by a value or instruction, and concluded with ?>. The name, or PI Target, identifies which application should be looking at the PI. See section 2.6 in the XML 1.0 specification for more details about PIs.

NOTE
XML reserves PI names that begin with the letters xml, in any combination of uppercase and lowercase, for its own use. Apart from this restriction, PIs can be used to send instructions to any application that is processing the document.

Here are examples of other PIs:

<?AVI CODEC="VIDEO1" COLORS="256"?>
<?WAV COMPRESSOR="ADPCM" BITS="8" RESOLUTION="16"?>
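
To illustrate where PIs can sit, the sketch below places one of these PIs in the prolog and the other inside the document element. The AVI and WAV targets are taken from the examples above. How an application would act on them is an assumption, since the processor simply passes PIs through to the application.

```xml
<?xml version="1.0"?>
<!-- A PI in the prolog, before the document element. -->
<?AVI CODEC="VIDEO1" COLORS="256"?>
<EMAIL>
  <TO>Jodie@msn.com</TO>
  <FROM>Bill@msn.com</FROM>
  <!-- A PI inside element content. -->
  <?WAV COMPRESSOR="ADPCM" BITS="8" RESOLUTION="16"?>
</EMAIL>
```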

Comments

Comments are one of a DTD's "miscellaneous" parts. Although comments are not required, they are widely used to make a document more readable to authors. You can add comments to explain the purpose of a certain section of the DTD, to indicate what references mean, and for other purposes. Comments come in handy as reminders of your coding intentions if you need to go back to the DTD and make changes later or if another author works on the document. Comments are not restricted to the DTD and can be used throughout a document. Since comments benefit only the human reader, an XML processor does not treat them as part of the document's character data. Comments appear between comment tags (<!-- and -->) and can include any combination of text, markup, and symbols, except a double hyphen (--), which is reserved for the closing comment tag. (See section 2.5 in the XML 1.0 specification for more details about comments.) The following code shows how a comment might look in a document:

<?xml version="1.0"?>

<!DOCTYPE EMAIL [
<!-- This document could be used as an email template. -->
  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>

NOTE
It is considered good practice to provide well-commented documents and DTDs. Having said that, the sample code in this book will not use many comments because of space considerations.
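
As a small illustration of what a comment may contain, the sketch below shows a comment that wraps a declaration, followed by a reminder of the one character sequence to avoid. The wording of both comments is hypothetical.

```xml
<!-- Markup inside a comment is not processed: <!ELEMENT BCC (#PCDATA)> -->

<!-- Not allowed: two hyphens in a row inside the comment text.
     Use a single hyphen or other punctuation instead. -->
```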

External DTDs

You've probably noticed that the sample document (shown in Code Listing 4-3 in the "Entity References" section) has grown quite large. You've also probably noticed that the majority of the document space is taken up by the DTD. You can separate documents and DTDs to make them a bit easier to work with. After creating a separate DTD, you can reference it within any document.

To separate the DTD portion of our sample XML document, you simply cut the DTD portion and paste it into a new text file. The new filename should have the extension .dtd.

Code Listing 4-4 shows the stand-alone DTD file named Lst4_4.dtd:

Code Listing 4-4.

<?xml version="1.0" encoding="UTF-8"?>

  <!ELEMENT EMAIL (TO+, FROM, CC*, BCC*, SUBJECT?, BODY?)>
  <!ATTLIST EMAIL
    LANGUAGE (Western|Greek|Latin|Universal) "Western"
    ENCRYPTED CDATA #IMPLIED
    PRIORITY (NORMAL|LOW|HIGH) "NORMAL">

  <!ELEMENT TO (#PCDATA)>
  <!ELEMENT FROM (#PCDATA)>
  <!ELEMENT CC (#PCDATA)>

  <!ELEMENT BCC (#PCDATA)>
  <!ATTLIST BCC
    HIDDEN CDATA #FIXED "TRUE">

  <!ELEMENT SUBJECT (#PCDATA)>
  <!ELEMENT BODY (#PCDATA)>

  <!ENTITY SIGNATURE "Bill">

If you compare the internal DTD in the previous example with this newly created external DTD, you'll find that the declarations are exactly alike. For this DTD to work, however, you must add a reference to it in the sample XML document. This is shown in Code Listing 4-5 (also in Chap04\Lst4_5.xml on the companion CD). The reference to the new DTD file appears in the document type declaration on the second line:

Code Listing 4-5.

<?xml version="1.0"?>
<!DOCTYPE EMAIL SYSTEM "Lst4_4.dtd">

<EMAIL LANGUAGE="Western" ENCRYPTED="128" PRIORITY="HIGH">
  <TO>Jodie@msn.com</TO>
  <FROM>&SIGNATURE;@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <BCC>Naomi@msn.com</BCC>
  <SUBJECT>Sample Document with External DTD</SUBJECT>

  <BODY>
    Hello, this is &SIGNATURE;.
    Take care, -&SIGNATURE;
  </BODY>
</EMAIL>

Separating the DTD from the document greatly reduces the size of the XML document file and provides some other benefits. Now that the DTD is a separate file, it can be used in other documents by anyone who has access to it. Another author can create a document using the same structure with completely different content. And because the new document would follow the DTD, it could be read by any application that knows how to process that DTD.
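
For instance, a second author might write a much shorter message against the same DTD. The sketch below is hypothetical: the subject and the choice of sender and recipient are invented for illustration, and Lst4_4.dtd is assumed to be in the same folder as the document.

```xml
<?xml version="1.0"?>
<!DOCTYPE EMAIL SYSTEM "Lst4_4.dtd">

<!-- LANGUAGE is omitted, so its default value from the DTD applies. -->
<EMAIL PRIORITY="LOW">
  <TO>Philip@msn.com</TO>
  <FROM>Naomi@msn.com</FROM>
  <SUBJECT>Lunch on Friday</SUBJECT>
</EMAIL>
```

The CC, BCC, and BODY elements are left out, which the DTD permits because each is declared as optional.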

This brings us back to the concept that opened this chapter—document objects.

Class DTDs

It should be clear to you now that using a DTD can allow you to create a document with the same basic properties as the original but for a different purpose. This brings us back to the concept of inheritance. By creating a DTD that is used by many documents, you are creating a base class DTD. This base class DTD is the rule book upon which all other documents of this class will be based. Every author who uses the DTD must obey the rules outlined in the DTD—almost. By using a combination of internal and external DTDs, an author can subclass a document and change some properties. Code Listing 4-6 demonstrates subclassing. (You can find this document in Chap04\Lst4_6.xml on the companion CD.)

Code Listing 4-6.

<?xml version="1.0"?>
<!DOCTYPE EMAIL SYSTEM "Lst4_4.dtd" [
  <!ENTITY SIGNATURE "Joe">
]>

<EMAIL LANGUAGE="Western" ENCRYPTED="128" PRIORITY="HIGH">
  <TO>Jodie@msn.com</TO>
  <FROM>&SIGNATURE;@msn.com</FROM>
  <CC>Philip@msn.com</CC>
  <BCC>Naomi@msn.com</BCC>
  <SUBJECT>Sample Document with External DTD</SUBJECT>

  <BODY>
    Hello, this is &SIGNATURE;.
    Take care, -&SIGNATURE;
  </BODY>
</EMAIL>

This document overrides the external, or base class, DTD with an internal DTD subset. If an XML processor encounters both an internal DTD and an external DTD, it uses the first declaration that it finds—the one in the internal DTD. In other words, the first one in wins. By declaring an entity with the name SIGNATURE and giving it the value Joe, the document above overrides the entity in the class DTD of the same name. Now any time the entity reference SIGNATURE is used in this document, the value Joe will appear instead of Bill, which appeared in the original DTD.
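
The same first-one-wins rule applies to attribute definitions, so a subclass can also change a default value. The sketch below is a hypothetical variation on Code Listing 4-6, not a listing from the companion CD. It redefines the PRIORITY attribute so that it defaults to HIGH.

```xml
<?xml version="1.0"?>
<!DOCTYPE EMAIL SYSTEM "Lst4_4.dtd" [
  <!-- Read before Lst4_4.dtd, so this definition of PRIORITY is the binding one. -->
  <!ATTLIST EMAIL
    PRIORITY (NORMAL|LOW|HIGH) "HIGH">
]>

<!-- No PRIORITY attribute here; a processor that reads the DTD supplies HIGH. -->
<EMAIL>
  <TO>Jodie@msn.com</TO>
  <FROM>Joe@msn.com</FROM>
</EMAIL>
```

Element declarations cannot be overridden this way. Declaring the same element type twice is a validity error, so subclassing is limited to entities and attribute definitions.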

Required Markup Declaration

As stated in Chapter 3, an XML processor does not need to read or process a DTD in order to handle a well-formed document. While such practice might be fine in many situations, in some cases this can cause problems. For example, every external entity must be declared, even in well-formed documents. In this case, the processor might not need to process an external DTD, but it might need to process an internal DTD so that the necessary entity declarations will be properly read and dealt with.
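
A minimal sketch of that situation follows. The document has no external DTD and declares no elements, so it is not meant to be validated, yet the processor still has to read the internal subset to resolve the SIGNATURE reference. SIGNATURE here is an internal entity, borrowed from the earlier listings to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE EMAIL [
  <!-- The only declaration: without it, &SIGNATURE; could not be resolved. -->
  <!ENTITY SIGNATURE "Bill">
]>

<EMAIL>
  <FROM>&SIGNATURE;@msn.com</FROM>
</EMAIL>
```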

Still other cases might exist in which all the DTDs must be processed for the document to be properly interpreted. To deal with such situations, XML includes in the XML declaration a required markup declaration or RMD. The RMD tells the processor how it should deal with the DTD. The RMD can have one of three values:

NONE, which indicates that the document can be processed without reading any part of the DTD, neither internal nor external.

INTERNAL, which specifies that the processor must process the internal DTD if it's available.

ALL, which specifies that the processor must read and process any available internal and external DTDs.

An example of how the RMD is used is shown below. In this case, the processor knows that it need not consult any DTD:

<?xml version="1.0" RMD="NONE"?>

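If no RMD is declared, ALL is assumed by the processor. Be aware that the RMD comes from the early working drafts of XML. The final XML 1.0 Recommendation replaced it with the standalone document declaration, written as standalone="yes" or standalone="no" in the XML declaration, and a conforming XML 1.0 processor will report an RMD attribute as an error.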

Vocabularies

Vocabularies represent a practical use of the topics covered in this chapter. An XML vocabulary is the set of elements, together with their structure, for a specific document type. Vocabularies are defined in a DTD that serves as the rule book for that vocabulary. Vocabularies are currently in use both on the Internet and in some organizations and businesses. One of the first and probably best-known vocabularies is the Channel Definition Format (CDF) used to define Web pages that are designed to be sent automatically, or "pushed," to client users.

Vocabularies are well suited to vertical applications and are likely to be used to develop data interchange systems for specific industries, such as telecommunications, pharmaceuticals, and the legal establishment, to name a few. Vocabularies are also well suited for more horizontal applications, such as the information push application mentioned above. As of this writing, several vocabularies exist or are in development. Following are some of these vocabularies with descriptions of how they might be used.

Channel Definition Format

Channel Definition Format (CDF) is used to describe the behavior of Web pages in a push model of delivery. CDF is currently used by Microsoft Internet Explorer and describes such processes as download schedule, channel bar display, page usage, and frequency of updates.
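
As a rough idea of what a CDF file looks like, the sketch below describes a channel with a daily update schedule. The URLs and titles are placeholders, and the element and attribute names are given from memory of Microsoft's CDF documentation, so check them against that reference before relying on them.

```xml
<?xml version="1.0"?>
<CHANNEL HREF="http://www.example.com/news/index.htm">
  <TITLE>Example News</TITLE>
  <ABSTRACT>Placeholder description shown for the channel.</ABSTRACT>

  <!-- Download schedule: check for updates once a day. -->
  <SCHEDULE>
    <INTERVALTIME DAY="1"/>
  </SCHEDULE>

  <!-- One page that belongs to the channel. -->
  <ITEM HREF="http://www.example.com/news/today.htm">
    <TITLE>Today's Headlines</TITLE>
  </ITEM>
</CHANNEL>
```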

Open Financial Exchange

Open Financial Exchange (OFX) is currently an SGML application that software packages use to communicate with financial institutions. OFX will soon be based on XML.

Open Software Description

Open Software Description (OSD) is a data format used to allow updating and installation of software via the Internet. This is especially useful for notifying users when new versions of software are available and providing a mechanism for users to obtain the programs over the Internet.

Electronic Data Interchange

Electronic Data Interchange (EDI) is currently used worldwide for data exchange and transaction support. In its current implementation, however, it can be used only by organizations that have been set up to exchange information using compatible systems. XML can greatly expand the reach of EDI and make it more accessible to a larger number of organizations. Efforts are currently under way to move EDI to an XML-based format.