
18.2.3- Writing DTDs

  by NT Community Manager.
Last Updated  by Jim Minatel.  


Writing DTDs

Document Type Definitions are part of the original core XML 1.0 specification. They are written in a language called Extended Backus-Naur Form, or EBNF for short. In order to learn about DTDs we will develop one for our sample books.xml file. As we said at the beginning of this chapter, the DTD needs to declare the rules of the markup language:

 

  • Declare what exactly constitutes markup
  • Declare exactly what our markup means

 

Practically speaking, this means that we have to give details of each of the elements, their order, and say what attributes (and other types of markup) they can take.

 

DTDs are an example of what is known as a schema, but this should not be confused with XML Schemas, which offer similar, extended functionality but are written in XML syntax.

The DTD can be declared internally (actually in the XML document) within a Document Type Declaration (note that, to avoid total confusion, we do not shorten this term to DTD!). This is where the terminology starts to get confusing. The document type declaration appears in the XML file and tells a processing application that the file has been written according to a particular document type definition. Alternatively, the DTD can be an external file. In this case, a document type declaration within the XML document will 'point' to this external DTD.

Referencing a DTD From an XML Document

In order that many XML documents can be written according to a single DTD, the DTD for our books example would be external. So, we need to add a document type declaration to the books.xml example, so that a processing application knows that it has been written according to the books document type definition:

 

<!DOCTYPE books SYSTEM "books.dtd">

 

Here, books is the name of the root element and the name of the document type definition. In this case we have followed it with the keyword SYSTEM and the URI of the DTD, a value that a processing application could use to validate the document against the DTD. This, of course, means that there must be an instance of it available from that location.

 

A URI is a Uniform Resource Identifier. This could be a URL, but it doesn't have to be, as long as it identifies the resource uniquely and allows the processing application to locate it.

 

As we are just trying this out as a test we could just keep the DTD in the same folder as the XML document. However, if we were to make it available to all we would have to give a location for it that would be available to any application. So we might choose:

 

<!DOCTYPE books SYSTEM "http://www.wrox.com/DTDlibrary/books.dtd">

 

As we discussed earlier, it is possible to include the DTD within the document type declaration (in the XML document). In other words, the rules in the DTD could be placed within this declaration, rather than in a separate file. However, in most cases you will want to reference an external file, so we will only look at this. After all, there is no point copying the DTD into several files, if you can just have it in one place.

 

To do this we also add the standalone attribute to the XML declaration of the XML document (which comes directly before the document type declaration – remember that nothing is allowed to come before the XML declaration, not even white space).

 

<?xml version="1.0" standalone="no" ?>

<!DOCTYPE books SYSTEM "books.dtd">

 

If the value of the standalone attribute is no, this indicates that there may be an external DTD (or internally declared external parameter entities – but do not worry about this second option until you get more involved in creating complex XML documents). If the value is yes, then there are no other dependencies and the file can truly stand on its own.

 

It is very easy to get confused between Document Type Definitions and Document Type Declarations… To clarify, just remember that a document type declaration either refers to an external document type definition, as in the example we are about to see, or else it actually contains one in the form of markup declarations.
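To make the second case concrete, here is a minimal sketch of a document type declaration that contains its markup declarations directly, in what is called the internal subset. The note element and its content are hypothetical and are not part of the books example.

```xml
<?xml version="1.0" standalone="yes" ?>
<!-- Hypothetical example: the declarations sit between [ and ] inside the DOCTYPE -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<note>The rules for this document travel with it.</note>
```

Because nothing outside the file is needed to process it, standalone can be set to yes here.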

Writing a DTD for the Books Example

Creating your own markup language using a DTD need not be excessively complicated. Here is the external DTD for our books example. As you can see, it is very simple.

 

<!ELEMENT books (book+)>
<!ELEMENT book (title, ISBN, authors, description?, price+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price EMPTY>
<!ATTLIST price
  US  CDATA     #REQUIRED>

 

You can write it in a simple text editor, just as we did with the XML document. Alternatively, there is software that will help you to create DTDs; the W3C maintains a list of schema tools at http://www.w3.org/XML/Schema#Tools.

 

Let's take a closer look at this.

 

<!ELEMENT is used to declare elements, in the format:

 

<!ELEMENT name (contents)>

 

Here, name gives the name of the element, and contents describes what type of data can be included and which elements can be nested inside that element. The books element must include the element book at least once, denoted by the use of the + symbol (which indicates one or more instances).

 

<!ELEMENT books (book+)>

 

The book element, declared in this line:

 

<!ELEMENT book (title, ISBN, authors, description?, price+)>

must include exactly one instance of each of the title, ISBN and authors elements, and at least one price element, in that particular order. The question mark after the description element means that this element is optional. We then have to define each of these elements individually. Here is a brief summary of the operators we can use to describe element content:

 

Symbol    Usage
,         Strict ordering
|         Selection: exactly one of the alternatives (can be used in conjunction with +, * and ?)
+         Repetition (minimum of 1)
*         Repetition (minimum of 0)
?         Optional
()        Grouping
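
As a rough illustration of how these operators combine, the following declarations are hypothetical variations and are not part of books.dtd. The editors, backmatter, appendix and em names are assumptions made up for the example, and each declaration should be read on its own rather than as one DTD.

```xml
<!-- Choice inside a sequence: a title, then either authors or editors, then one or more prices -->
<!ELEMENT book (title, (authors | editors), price+)>

<!-- Zero or more appendix elements may follow a required title -->
<!ELEMENT backmatter (title, appendix*)>

<!-- Mixed content: text interleaved with em elements, in any order and any number -->
<!ELEMENT description (#PCDATA | em)*>
```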

 

Next we see the line:

 

<!ELEMENT title (#PCDATA)>

 

This indicates that the title element can contain character data, indicated by #PCDATA (parsed character data). The # symbol prevents PCDATA from being interpreted as an element name. The authors element, by contrast, can contain one or more author elements:

 

<!ELEMENT authors (author+)>

 

The author elements contain character data, as does the description element.

 

When we came to the price element in our books.xml file, there were no closing tags; the element was an empty element. It did, however, have an attribute to indicate its currency. This was how it looked in our books.xml example:

 

   <price US="$49.99"/>

 

So we need to declare the element as being empty, and also declare the attribute that it can take. First we will use this line:

 

<!ELEMENT price EMPTY>

 

to indicate that the element's name is price, but that it is an empty element. Then we have to declare the attribute using an <!ATTLIST declaration, which gives the data types or possible values and the default values for the element's attributes:

 

<!ATTLIST price
  US  CDATA     #REQUIRED>

Each attribute has three components: a name (e.g. US), the type of information to be passed (in this case character data, CDATA), and the default value (in this case there is not one, but we are required to provide a value).
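
The same three components appear in every attribute definition. As a hedged sketch, here are some other common combinations of type and default; the UK and binding attributes are hypothetical and do not appear in books.xml.

```xml
<!-- US:      any character data; a value must always be supplied -->
<!-- UK:      any character data; the attribute may be omitted -->
<!-- binding: one of the two listed values; "paperback" is assumed if it is omitted -->
<!ATTLIST price
  US       CDATA                   #REQUIRED
  UK       CDATA                   #IMPLIED
  binding  (hardback | paperback)  "paperback">
```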

 

That covers the example DTD, books.dtd, for the books.xml example. You will find it with the rest of the code for this chapter. If you want to create one yourself, you can simply use a text editor, such as Notepad (or Notepad2, as mentioned in What Is XML?), and save the file as "books.dtd" (which here has the same name as the root element).
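
For reference, a document along the following lines would be expected to validate against a DTD like the one above. Only the price value comes from the earlier example; the title, ISBN, author and description values are placeholders, and ISBN is assumed to be declared as (#PCDATA).

```xml
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE books SYSTEM "books.dtd">
<books>
  <book>
    <title>Placeholder Title</title>
    <ISBN>0-000-00000-0</ISBN>
    <authors>
      <author>First Author</author>
      <author>Second Author</author>
    </authors>
    <!-- description is optional and could be left out -->
    <description>Placeholder description.</description>
    <price US="$49.99"/>
  </book>
</books>
```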

 

Obviously, if you have a well-formed instance of an element in a document, but do not declare it in the DTD, then it cannot be validated. An element is only valid if:

 

  • There is a declaration for the element type in the DTD which has a name matching that of the element itself
  • There are declarations for all of the element types, attributes and their value types in the DTD
  • The data type of the content matches that of the content model defined in the declaration (e.g. PCDATA)
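
To see these conditions from the other side, here is a sketch of a fragment that is well-formed but would not be expected to validate against the books DTD. The values and the publisher element are hypothetical.

```xml
<book>
  <title>Placeholder Title</title>
  <!-- ISBN is missing, but the content model requires it after title -->
  <authors>
    <author>First Author</author>
  </authors>
  <!-- publisher has no declaration in the DTD -->
  <publisher>Placeholder Press</publisher>
  <!-- the required US attribute is missing -->
  <price/>
</book>
```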

 

We have just created our own XML application, containing our own markup language for exchanging data about books. However, it is worth noting that there are other types of schema on the horizon. The W3C is working on a version of schemas written in XML rather than Extended Backus-Naur Form, to be called XML Schemas.

XML Schemas

XML Schemas have several advantages over their DTD counterparts. The group working on the specification has looked at several proposals, which you can see if you want to get an idea of what XML Schemas are going to be like. The main ones are XML-Data and Document Content Description. Links to both can be found, with all of the submissions and specifications in progress, on the W3C site at http://www.w3.org/tr/.

 

There are a number of reasons why these XML Schemas will be an advantage over DTDs. Firstly, they use XML syntax rather than Extended Backus-Naur Form, which many people find difficult to learn. Secondly, if you needed to parse the schema (we will look at parsers shortly), it will be possible to do so using an existing XML parser, rather than having to use a special parser. Another strong advantage is the ability to specify data types in XML Schemas, for the content of elements and attributes. This means that applications using the content will not have to convert it into the appropriate data type from a string. Think about an application that has to add two numbers together, or perform a calculation on a date: it would not have to convert this data to the appropriate type, from a string, before it could perform the calculation. There will be other advantages too, such as support for namespaces, which we will meet shortly. Also, XML Schemas can be extended, whereas DTDs cannot simply be extended once written.
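
As a sketch of what typed content can look like, the fragment below uses the syntax of the W3C XML Schema language as it was eventually standardized, which may differ from the proposals mentioned above. The pages and published elements are hypothetical and are not part of the books example.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The content must be a whole number greater than zero -->
  <xs:element name="pages" type="xs:positiveInteger"/>
  <!-- The content must be a date such as 2003-05-01 -->
  <xs:element name="published" type="xs:date"/>
</xs:schema>
```

An application reading a document validated against such a schema could treat pages as a number rather than as a string.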

Even HTML Has Schemas

Being an SGML application, HTML has several SGML DTDs (at least a strict and loose one for each version), and the coming XHTML specification has an XML DTD (as opposed to an SGML DTD). XHTML is a new version of HTML that is designed as an XML application, as opposed to an SGML application. This means that you will be able to parse XHTML documents using an XML parser. You can view an HTML DTD at http://www.w3.org/TR/REC-html40/loose.dtd . According to the HTML standard you should include the following line:

 

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">

 

It tells the user agent which HTML DTD the document is written against. However, it is often left out because, practically speaking, it is not necessary, and if you are using browser-specific tags, which deviate from the specification, it may cause unpredictable results.
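
A document type declaration that uses PUBLIC can also carry a system identifier that gives a location for the DTD. As an illustration, and to the best of our knowledge, these are the declarations given in the HTML 4.0 and XHTML 1.0 specifications for the loose HTML DTD and the strict XHTML DTD respectively. Each would appear at the top of its own document.

```xml
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
  "http://www.w3.org/TR/REC-html40/loose.dtd">

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```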


Copyright © 2003 by Wiley Publishing, Inc.
