Specific and Generalized Markup Languages

Two types of markup languages are in use today: specific markup and generalized markup. A specific markup language produces code that is tied to a particular application or device, and it is usually built to serve one particular need. A generalized markup language describes the structure and meaning of the text in a document, but it does not define how the text should be used. In other words, the language is not made for any one application and is generic enough to be used in many different ones. Documents described with a generalized markup language are usually portable to more applications than those described with a specific markup language. Let's examine these concepts in more detail.

Specific Markup Languages

The examples of markup shown so far in this chapter have demonstrated specific markup: the HTML and RTF languages were each developed for a specific purpose. HTML has the specific purpose of formatting documents for the Web, and RTF has a similar purpose, that of text formatting. As you saw in the previous examples, however, the markup code for HTML and the markup code for RTF look quite different, even though both languages were created for similar purposes. As you might have guessed, the two languages are not interchangeable. A processor that understands RTF will not understand HTML, and vice versa.

NOTE
Even though markup languages are not interchangeable, a single application might still be able to read and display several different kinds of markup. For example, with the correct filters installed, Word can read and display RTF, HTML, Plain Text, WordPerfect, Microsoft Works, and other types of documents. Note that different processing software is required for each markup language. Also, the markup codes are not interchangeable within a document. An RTF document, for example, must contain only RTF markup or the text will not display as intended.

Many markup languages have served quite well as document formatting tools for printing or for the Web. However, they do not perform as well at describing the data they contain or at providing contextual information for the data. Let's look at the markup for our RTF memo example again.

{\rtf1\ansi\ansicpg1252\deff0\deftab720{\fonttbl
  {\f0\fswiss MS Sans Serif;}
  {\f1\froman\fcharset2 Symbol;}
  {\f2\froman Times New Roman;}}
{\colortbl\red0\green0\blue0;}
\deflang1033\pard\plain\f2\fs20\b To: \plain\f2\fs20 Jodie
\par \plain\f2\fs20\b From: \plain\f2\fs20 Bill
\par \plain\f2\fs20\b Cc: \plain\f2\fs20 Philip
\par \plain\f2\fs20\b Subject: \plain\f2\fs20 Chapter 1
\par 
\par What do you think of the format so far?
\par }

Notice that every tag describes how the text should be formatted but tells us nothing about the kind of text data included in the document. We could replace all the text in the document and nothing in the markup would record that this was originally a memo. This is possible because many markup languages are created for the specific purpose of describing text formatting and layout, not for other purposes such as defining a certain structure of data or providing a way to interchange incompatible data formats. Such specificity results in several limitations common to specific markup languages:

Authors are limited to a particular set of tags. If this set of tags does not meet a need, authors must find a workaround or live with the limitation.

A document might not be portable to other applications. Because the data is not self-describing, it cannot easily be used for any purpose other than the one for which it was intended.

The language probably has a proprietary way of marking up text that is not compatible with other markup languages. This can create confusion and extra work for authors who must use several languages to accommodate different applications.
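To make the first two limitations concrete, here is a minimal Python sketch that removes the formatting codes from two lines of the RTF memo shown earlier. The regular expression is a simplification that covers only the control words appearing in those lines; it is not a general RTF parser, and the names RTF_LINES, CONTROL_WORD, and strip_formatting are made up for this illustration.

```python
import re

# Two lines taken from the RTF memo example.
RTF_LINES = [
    r"\pard\plain\f2\fs20\b To: \plain\f2\fs20 Jodie",
    r"\par \plain\f2\fs20\b From: \plain\f2\fs20 Bill",
]

# Simplified pattern (an assumption, not the full RTF grammar): a backslash,
# letters, an optional numeric parameter, and an optional delimiting space.
CONTROL_WORD = re.compile(r"\\[a-z]+-?\d* ?")

def strip_formatting(line):
    """Remove every control word and return the text that is left."""
    return CONTROL_WORD.sub("", line)

plain = [strip_formatting(line) for line in RTF_LINES]
for line in plain:
    print(line)
```

Once the formatting is gone, only character strings such as "To: Jodie" remain. Nothing in the markup itself identified Jodie as the recipient; that meaning lived entirely in the visible label.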

NOTE
Self-describing documents, examined in later chapters, basically provide data about data (also called metadata) so that the data in the documents can stand apart from the formatting that describes how the documents are displayed. For example, a document might contain information in the form of a number. A self-describing document might identify the number as an age, the age as that of a tree, the tree as part of a reforestation project, and so on.
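As a rough illustration of the tree example in the note, the following Python sketch reads a small self-describing fragment and reports each value together with the chain of element names that describes it. The element names, the unit attribute, and the value 42 are invented for this sketch; they do not come from any real document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical self-describing fragment: the number 42 is identified as an
# age, the age as that of a tree, and the tree as part of a project.
DOCUMENT = """
<reforestation-project>
  <tree>
    <age unit="years">42</age>
  </tree>
</reforestation-project>
"""

def labelled_values(element, path=()):
    """Yield (path, text) for every element that directly contains text."""
    path = path + (element.tag,)
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        yield from labelled_values(child, path)

root = ET.fromstring(DOCUMENT.strip())
values = list(labelled_values(root))
for path, text in values:
    print(" > ".join(path), "=", text)
```

The number never appears without its context: a program that knows nothing about formatting can still tell what the value means.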

When electronic documents first began to make a big impact on information delivery, it was already clear that these kinds of limitations would cause problems down the road. This encouraged the use of generalized markup languages.

Generalized Markup Languages

In 1969, Dr. C. F. Goldfarb (an attorney who went to work for IBM) and two of his colleagues, Edward Mosher and Raymond Lorie, proposed a method of describing text that was not specific to an application or a device. The method had two basic parts:

The markup should describe the structure of a document and not its formatting or style characteristics.

The syntax of the markup should be strictly enforced so that the code can clearly be read by a software program or by a human.

The result of these suggestions was the Document Composition Facility Generalized Markup Language (DCF GML, or GML for short) developed for IBM. GML was the precursor to the Standard Generalized Markup Language (SGML) that was adopted as a standard by the International Organization for Standardization (ISO) in 1986.

NOTE
The ISO was founded in 1947 and comprises some 130 member countries. The ISO exists to "promote the development of standardization and related activities in the world with a view to facilitating the international exchange of goods and services, and to developing cooperation in the spheres of intellectual, scientific, technological and economic activity." The ISO's work results in a set of published standards that are used throughout the world. ISO standards affect fields ranging from telecommunications to agriculture to entertainment. You will often see a published standard referenced by its ISO number. (ISO 8879, for example, is the SGML standard.) For more information on the ISO, see http://www.iso.ch/welcome.html.

The SGML standard brought some important changes to text markup. In addition to providing a way to lay out the structure of a document, SGML added provisions for:

Identifying the characters to be used in a document. This makes it easier to ensure that a processor can understand everything in a document by allowing a document to specify which character set it is using (ISO 646 or ISO 8859, for example).

Providing a way to identify objects that will be used throughout a document. These objects, called entities, are convenient to use when pieces of text or other data appear in several places in a document. Because an entity is declared in one place in the document, any change to that declaration is reflected in all occurrences of the entity throughout the document. (Entities will be discussed in Chapters 3 and 4.)

Providing a way to incorporate external data into a document. This allows data that might not be text to be used in the document.
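The entity mechanism in the second provision can be sketched in a few lines of Python. This is only an analogy for how a processor substitutes declared text for each reference; it is not how an SGML parser is actually implemented, and the entity names author and title, along with the helper expand_entities, are hypothetical.

```python
import re

# Hypothetical entity declarations: entity name -> replacement text.
ENTITIES = {
    "author": "Bill",
    "title": "Chapter 1",
}

# An entity reference is written as &name; in the document text.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, entities):
    """Replace each &name; reference with its declared text.

    References to names that were never declared are left untouched.
    """
    def substitute(match):
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)

body = "&author; wrote &title;. Comments on &title; go to &author;."
print(expand_entities(body, ENTITIES))
```

Changing the single declaration for title changes every place the reference appears, which is the convenience the provision describes.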

Now let's look at how our memo document might appear as an SGML document:

<!DOCTYPE MEMO PUBLIC "-//BJP//DTD MEMO//EN">
<MEMO>
  <TO>Jodie
  <FROM>Bill
  <CC>Philip
  <SUBJECT>Chapter 1
  <BODY>What do you think of the format so far?
</MEMO>

If you take a close look at this code, you'll see some elements that look similar to those in the markup we have already covered—and you'll see some differences as well. First let's look at the similarities.

This document should look similar to the HTML version of our memo document in the section "A Look at HTML Markup". If you look back at that version, you will see a similar document type (DOCTYPE) declaration at the top of the document and a similar tagging format. For example, the Memo element includes both opening and closing tags. For the most part, the content (the text between tags) is also the same as that of the HTML document, and for good reason: HTML is an application of SGML. That is, HTML was defined using the SGML standard, so many of the details of SGML carry through to HTML, though not all of them. Now let's look at how the two versions differ.

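First of all, notice that many of the SGML elements do not include closing tags. SGML lets a DTD declare a closing tag optional where the end of the element can be inferred from context, and we assume the memo DTD does so here. The closing tags could just as easily have been included for any of the elements. For example, I can add a closing tag to the Body element without changing the meaning of the SGML code: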

<!DOCTYPE MEMO PUBLIC "-//BJP//DTD MEMO//EN">
<MEMO>
  <TO>Jodie
  <FROM>Bill
  <CC>Philip
  <SUBJECT>Chapter 1
  <BODY>What do you think of the format so far?</BODY>
</MEMO>

HTML also supports this type of minimization technique. While this might not seem important to you now, you'll see its relevance later when we discuss XML.
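To see why an omitted closing tag loses no information, consider this Python sketch, which writes the missing tags back in. It hard-codes one assumption about the memo: each child of MEMO holds only text, so the next opening tag, or the closing tag of MEMO, implies the end of the child that is currently open. A real SGML parser takes such rules from the DTD; the names CHILDREN, TOKEN, and add_end_tags are invented here, and the document type declaration is left out for brevity.

```python
import re

# Assumption: these elements contain only text, never other elements.
CHILDREN = {"TO", "FROM", "CC", "SUBJECT", "BODY"}

# A token is a start tag, an end tag, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z]+)>|([^<]+)")

def add_end_tags(source):
    """Return the markup with every omitted child end tag written out."""
    out = []
    open_child = None  # name of the child element still awaiting its end tag
    for closing, name, text in TOKEN.findall(source):
        if text:
            # Keep text inside a child; drop whitespace between elements.
            if open_child and text.strip():
                out.append(text.strip())
            continue
        name = name.upper()
        if open_child is not None and not (closing and name == open_child):
            out.append(f"</{open_child}>")  # the inferred end tag
        open_child = None
        out.append(f"<{closing}{name}>")
        if not closing and name in CHILDREN:
            open_child = name
    return "".join(out)

MINIMIZED = """<MEMO>
  <TO>Jodie
  <FROM>Bill
  <CC>Philip
  <SUBJECT>Chapter 1
  <BODY>What do you think of the format so far?
</MEMO>"""

EXPLICIT = MINIMIZED.replace("so far?", "so far?</BODY>")

print(add_end_tags(MINIMIZED))
print(add_end_tags(MINIMIZED) == add_end_tags(EXPLICIT))
```

Both versions of the memo normalize to the same fully tagged markup, which is why adding the closing Body tag does not change the meaning of the document.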

The biggest difference between the SGML and HTML versions is that nothing in the SGML document indicates how the data should look. The markup does, however, identify the structure of the document. Notice that some content has been removed, specifically the address labels (To:, From:, and so on). This could be done safely because that information is now carried by the document structure. In fact, the DTD outlines all the rules for what types of elements can exist in this type of document, where they can appear, and what kinds of data they can contain. The processor can read the document and, based on the structure and context, output the data in an appropriate way.
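The following Python sketch suggests how a processor might turn the memo's structure back into something readable. It assumes the memo has already been parsed into (element name, content) pairs, and the LABELS table and both render functions are hypothetical presentation rules that belong to the processor, not to the document.

```python
from html import escape

# The memo's structure as (element name, content) pairs, in document order.
MEMO = [
    ("TO", "Jodie"),
    ("FROM", "Bill"),
    ("CC", "Philip"),
    ("SUBJECT", "Chapter 1"),
    ("BODY", "What do you think of the format so far?"),
]

# Hypothetical presentation rule: which label to show for which element.
LABELS = {"TO": "To", "FROM": "From", "CC": "Cc", "SUBJECT": "Subject"}

def render_text(memo):
    """Render the memo as plain text, adding labels from the structure."""
    lines = []
    for name, content in memo:
        if name in LABELS:
            lines.append(f"{LABELS[name]}: {content}")
        else:
            lines.append("")  # blank line before unlabeled body text
            lines.append(content)
    return "\n".join(lines)

def render_html(memo):
    """Render the same memo as HTML paragraphs with bold labels."""
    parts = []
    for name, content in memo:
        if name in LABELS:
            parts.append(f"<p><b>{LABELS[name]}:</b> {escape(content)}</p>")
        else:
            parts.append(f"<p>{escape(content)}</p>")
    return "\n".join(parts)

print(render_text(MEMO))
print(render_html(MEMO))
```

The same structured data feeds two different outputs, and the labels that were removed from the document reappear because the processor knows what each element means.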