3.2 HTML, SGML, and XML
HTML is the primary format used for
Web documents. As I said earlier, HTML is a simple standard for
describing the semantic content of textual data. The idea of
describing a text's semantics rather than its
appearance comes from an older standard called the
Standard Generalized Markup Language (SGML). Standard HTML is an
application of SGML. SGML was invented in the mid-1970s by Charles
Goldfarb, Edward Mosher, and Raymond Lorie at IBM. SGML is now an
International Organization for Standardization (ISO) standard,
specifically ISO 8879:1986.

SGML and, by inheritance, HTML are based on the notion of design by
meaning rather than design by appearance. You don't
say that you want some text printed in 18-point type; you say that it
is a top-level heading (<H1> in HTML).
Likewise, you don't say that a word should be placed
in italics. Rather, you say it should be emphasized
(<EM> in HTML). It is left to the browser to
determine how to best display headings or emphasized text.

The tags used to mark up the text are case-insensitive. Thus,
<STRONG> is the same as
<strong> is the same as
<Strong> is the same as
<StrONg>. Some tags have a matching end-tag
to define a region of text. An end-tag is the same as the start-tag,
except that the opening angle bracket is followed by a
/. For example: <STRONG>this text
is strong</STRONG>;
<EM>this text is
emphasized</EM>. The entire text from the
beginning of the start-tag to the end of the end-tag is called an
element.
Thus, <STRONG>this text is
strong</STRONG> is a STRONG
element.

HTML elements may nest but they should not overlap. The first line in
the following example is standard-conforming. The second line is not,
though many browsers accept it nonetheless:

<STRONG><EM>Jack and Jill went up the hill</EM></STRONG>
<STRONG><EM>to fetch a pail of water</STRONG></EM>

Some elements have additional attributes that are encoded as
name-value pairs on the start-tag. The <H1>
tag and most other paragraph-level tags may have an
ALIGN attribute that says whether the header
should be centered, left-aligned, or right-aligned. For example:

<H1 ALIGN=CENTER> This is a centered H1 heading </H1>

The value of an attribute may be enclosed in double or single quotes,
like this:

<H1 ALIGN="CENTER"> This is a centered H1 heading </H1>
<H2 ALIGN='LEFT'> This is a left-aligned H2 heading </H2>

In HTML 4, quotes may be omitted only when the value consists solely
of letters, digits, hyphens, periods, underscores, and colons; a value
with embedded spaces, for instance, must be quoted. When
processing HTML, you need to be prepared for attribute values that do
and don't have quotes.

There have been several versions of HTML over the years. The current
standard is HTML 4.0, most of which is supported by current web
browsers, with occasional exceptions. Furthermore, several companies,
notably Netscape, Microsoft, and Sun, have added nonstandard
extensions to HTML. These include blinking text, inline movies,
frames, and, most importantly for this book, applets. Some of these
extensions (for example, the <APPLET>
tag) are allowed but deprecated in HTML 4.0. Others, such as
Netscape's notorious
<BLINK>, come out of left field and have no
place in a semantically-oriented language like HTML.

HTML 4.0 may be the end of the line, aside from minor fixes. The W3C
has decreed that HTML is getting too bulky to layer more features on
top of. Instead, new development will focus on
XML, a semantic
language that allows page authors to create the elements they need
rather than relying on a few fixed elements such as
P and LI. For example, if
you're writing a web page with a price list, you
would likely have an SKU element, a
PRICE element, a MANUFACTURER
element, a PRODUCT element, and so forth. That
might look something like this:

<PRODUCT MANUFACTURER="IBM">
  <NAME>Lotus Smart Suite</NAME>
  <VERSION>9.8</VERSION>
  <PLATFORM>Windows</PLATFORM>
  <PRICE CURRENCY="US">299.95</PRICE>
  <SKU>D05WGML</SKU>
</PRODUCT>

This looks a lot like HTML, in much the same way that Java looks like
C. There are elements and attributes. Tags are set off by
< and >. Attributes are
enclosed in quotation marks, and so forth. However, instead of being
limited to a finite set of tags, you can create all the new and
unique tags you need. Since no browser can know in advance all the
different elements that may appear, a stylesheet
is used to describe how each of the items should be displayed.

XML has another advantage over HTML that may not be obvious from this
simple example. HTML can be quite sloppy. Elements are opened but not
closed. Attribute values may or may not be enclosed in quotes.
XML tightens all this up. It lays
out very strict requirements for the syntax of a well-formed XML
document, and it requires that browsers reject all malformed
documents. Browsers are not allowed to fix the problem and make a
good-faith effort to display what they think the author meant. They
must simply report the error. Furthermore, an XML document may have a
Document Type Definition (DTD), which can impose additional
constraints on valid documents. For example, a DTD may require that
every PRODUCT element contain exactly one
NAME element. This has a number of advantages, but
the key one here is that XML documents are far easier to parse than
HTML documents. As a programmer, you will find it much easier to work
with XML than HTML.

XML can be used both for pure XML pages and for embedding new kinds
of content in HTML and XHTML. For example, the Mathematical Markup
Language, MathML, is an XML application for including mathematical
equations in web pages. SMIL, the Synchronized Multimedia Integration
Language, is an XML application for including timed multimedia such
as slide shows and subtitled videos on web pages. More recently, the
W3C has released several versions of XHTML. This language uses the
familiar HTML vocabulary (p for paragraphs,
tr for table rows, img for
pictures, and so forth) but requires the document to follow
XML's stricter rules: all attribute values must be
quoted; every start-tag must have a matching end-tag; elements can
nest but cannot overlap; etc. For a lot more information about XML,
see XML in a Nutshell by Elliotte Rusty Harold
and W. Scott Means (O'Reilly).
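
The difference between a forgiving HTML browser and a strict XML parser is easy to see from Java. The following is a minimal sketch, not a definitive implementation: it assumes the JAXP parser bundled with the JDK (the javax.xml.parsers package), and the class name WellFormedCheck and its two methods are hypothetical names chosen for this illustration. It parses a shortened version of the PRODUCT example from this section, then shows that the overlapping tags and unquoted attribute values many HTML browsers tolerate are rejected as malformed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the name and methods are illustrative only.
class WellFormedCheck {

    // Parses the text as XML; throws SAXException if it is not well-formed.
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors rather than printing them.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns true if the text parses cleanly, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String product =
              "<PRODUCT MANUFACTURER=\"IBM\">"
            + "<NAME>Lotus Smart Suite</NAME>"
            + "<PRICE CURRENCY=\"US\">299.95</PRICE>"
            + "</PRODUCT>";

        // A well-formed document: read an attribute and an element's text.
        Element root = parse(product).getDocumentElement();
        System.out.println(root.getAttribute("MANUFACTURER"));
        System.out.println(
            root.getElementsByTagName("PRICE").item(0).getTextContent());

        // Overlapping elements: accepted by many HTML browsers, fatal in XML.
        System.out.println(
            isWellFormed("<STRONG><EM>to fetch</STRONG></EM>"));
        // An unquoted attribute value is also a well-formedness error.
        System.out.println(
            isWellFormed("<H1 ALIGN=CENTER>heading</H1>"));
    }
}
```

Run as a single-file program, this should print IBM, 299.95, false, and false on successive lines. The point of the design is that the program never has to guess what a sloppy author meant: either the parser hands back a complete document tree, or it throws an exception.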