Markup language
The
Standard Generalized Markup Language
(
SGML
;
ISO
8879:1986) is a standard for defining generalized
markup languages
for documents. ISO 8879 Annex A.1 states that generalized markup is "based on two
postulates
":
[1]
- Declarative: Markup should describe a document's structure and other attributes rather than specify the processing that needs to be performed, because it is less likely to conflict with future developments.
- Rigorous: In order to allow markup to take advantage of the techniques available for processing, markup should rigorously define objects like programs and
databases
.
DocBook
SGML and
LinuxDoc
are examples which used SGML tools.
Standard versions
[
edit
]
SGML is an
ISO
standard: "ISO 8879:1986 Information processing ? Text and office systems ? Standard Generalized Markup Language (SGML)", of which there are three versions:
- Original
SGML
, which was accepted in October 1986, followed by a minor Technical Corrigendum.
- SGML (ENR)
, in 1996, resulted from a Technical Corrigendum to add
extended naming rules
allowing arbitrary-language and -script markup.
- SGML (ENR+WWW or WebSGML)
, in 1998, resulted from a
Technical Corrigendum
to better support XML and WWW requirements.
SGML is part of a trio of enabling ISO standards for
electronic documents
developed by
ISO/IEC JTC 1/SC 34
[1]
[2]
(ISO/IEC Joint Technical Committee 1, Subcommittee 34 ? Document description and processing languages) :
- SGML (ISO 8879) ? Generalized markup language
- SGML was reworked in 1998 into
XML
, a successful
profile
of SGML. Full SGML is rarely found or used in new projects.
- DSSSL
(ISO/IEC 10179) ? Document processing and styling language based on
Scheme
.
- HyTime
? Generalized
hypertext
and scheduling.
[3]
- HyTime was partially reworked into W3C
XLink
. HyTime is rarely used in new projects.
SGML is supported by various technical reports, in particular
- ISO/IEC TR 9573 ? Information processing ? SGML support facilities ? Techniques for using SGML
[4]
- Part 13: Public entity sets for mathematics and science
- In 2007, the W3C MathML working group agreed to assume the maintenance of these entity sets.
History
[
edit
]
SGML descended from
IBM
's
Generalized Markup Language
(GML), which
Charles Goldfarb
, Edward Mosher, and Raymond Lorie developed in the 1960s. Goldfarb, editor of the international standard, coined the "GML" term using their surname initials.
[5]
Goldfarb also wrote the definitive work on SGML syntax in "The SGML Handbook".
[6]
The syntax of SGML is closer to the
COCOA
format.
[
clarification needed
]
As a document markup language, SGML was originally designed to enable the sharing of
machine-readable
large-project documents in government, law, and industry. Many such documents must remain readable for several decades?a long time in the
information technology
field. SGML also was extensively applied by the military, and the aerospace, technical reference, and industrial publishing industries. The advent of the
XML
profile has made SGML suitable for widespread application for small-scale, general-purpose use.
Document validity
[
edit
]
SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of ISO 8879 (from the public draft
[7]
):
A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.
A
type-valid
SGML document is defined by the standard as
An SGML document in which, for each document instance, there is an associated
document type declaration
(DTD) to whose DTD that instance conforms.
A
tag-valid
SGML document is defined by the standard as
An SGML document, all of whose document instances are fully tagged. There need not be a
document type declaration
associated with any of the instances. Note: If there is a
document type declaration
, the instance can be parsed with or without reference to it.
Terminology
[
edit
]
Tag-validity
was introduced in SGML (ENR+WWW) to support
XML
which allows documents with no DOCTYPE declaration but which can be parsed without a grammar, or documents which have a DOCTYPE declaration that makes no
XML Infoset
contributions to the document. The standard calls this
fully tagged
.
Integrally stored
reflects the
XML
requirement that elements end in the same entity in which they started.
Reference-free
reflects the
HTML
requirement that entity references are for special characters and do not contain markup. SGML validity commentary, especially commentary that was made before 1997 or that is unaware of SGML (ENR+WWW), covers
type-validity
only.
The SGML emphasis on validity supports the requirement for generalized markup that
markup should be rigorous.
(ISO 8879 A.1)
Syntax
[
edit
]
An SGML document may have three parts:
- the SGML Declaration,
- the Prologue, containing a DOCTYPE declaration with the various
markup declarations
that together make a
Document Type Definition
(DTD), and
- the instance itself, containing one top-most element and its contents.
An SGML document may be composed from many
entities
(discrete pieces of text). In SGML, the entities and element types used in the document may be specified with a DTD, the different character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the
concrete syntax
of the document.
Although full SGML allows implicit markup and some other kinds of tags, the
XML
specification (s4.3.1) states:
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and
processing instructions
, all of which are indicated in the document by explicit markup.
For introductory information on a basic, modern SGML syntax, see
XML
. The following material concentrates on features not in XML and is not a comprehensive summary of SGML syntax.
Optional features
[
edit
]
SGML generalizes and supports a wide range of markup languages as found in the mid 1980s. These ranged from terse
Wiki
-like syntaxes to
RTF
-like bracketed languages to
HTML
-like matching-tag languages. SGML did this by a relatively simple default
reference concrete syntax
augmented with a large number of optional features that could be enabled in the SGML Declaration. Not every SGML parser can necessarily process every SGML document. Because each processor's
System Declaration
can be compared to the document's
SGML Declaration
it is always possible to know whether a document is supported by a particular processor.
Many SGML features relate to markup minimization. Other features relate to
concurrent (parallel) markup
(CONCUR), to linking processing attributes (LINK), and to embedding SGML documents within SGML documents (SUBDOC).
The notion of customizable features was not appropriate for Web use, so one goal of
XML
was to minimize optional features. However, XML's well-formedness rules cannot support Wiki-like languages, leaving them unstandardized and difficult to integrate with non-text information systems.
Concrete and abstract syntaxes
[
edit
]
The usual (default) SGML
concrete syntax
resembles this example, which is the default
HTML
concrete syntax:
<QUOTE
TYPE=
"example"
>
typically
something
like
<ITALICS>
this
</ITALICS>
</QUOTE>
SGML provides an
abstract syntax
that can be
implemented
in many different types of
concrete syntax
. Although the markup norm is using
angle brackets
as start- and end-tag
delimiters
in an SGML document (per the standard-defined
reference concrete syntax
), it is possible to use other characters?provided a suitable concrete syntax is defined in the document's
SGML declaration
.
[8]
For example, an SGML interpreter might be programmed to parse GML, wherein the tags are delimited with a left
colon
and a right
full stop
, and an
:e
prefix denotes an end tag:
:xmp.Hello, world:exmp.
. According to the reference syntax, letter case (upper- or lower-case) is not distinguished in tag names, so the three tags
<quote>
,
<QUOTE>
, and
<quOtE>
are equivalent. (A concrete syntax might change this rule via the NAMECASE NAMING declarations.)
Markup minimization
[
edit
]
SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML processors need not support every available feature, thus allowing applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid structures. XML is intolerant of syntax omissions, and does not require a DTD for checking well-formedness.
OMITTAG
[
edit
]
Both start tags and end tags may be omitted from a document instance, provided:
- the OMITTAG feature is enabled in the SGML Declaration,
- the DTD indicates that the tags are permitted to be omitted,
- (for start tags) the element has no associated required (
#REQUIRED
) attributes, and
- the tag can be unambiguously inferred by context.
For example, if OMITTAG YES is specified in the SGML Declaration (enabling the OMITTAG feature), and the DTD includes the following declarations:
<!ELEMENT
chapter
-
-
(
title
,
section
+)
>
<!ELEMENT
title
o
o
(
#PCDATA
)
>
<!ELEMENT
section
-
-
(
title
,
subsection
+)
>
then this excerpt:
<chapter>
Introduction
to
SGML
<section>
The
SGML
Declaration
<subsection>
...
which omits two
<title>
tags and two
</title>
tags, would represent valid markup.
Omitting tags is optional ? the same excerpt could be tagged like this:
<chapter><title>
Introduction
to
SGML
</title>
<section><title>
The
SGML
Declaration
</title>
<subsection>
...
and would still represent valid markup.
Note: The OMITTAG feature is unrelated to the tagging of elements whose declared content is
EMPTY
as defined in the DTD:
<!ELEMENT
image
-
o
EMPTY
>
Elements defined like this have no end tag, and specifying one in the document instance would result in invalid markup. This is syntactically different from
XML
empty elements in this regard.
SHORTREF
[
edit
]
Tags can be replaced with delimiter strings, for a terser markup, via the SHORTREF feature. This markup style is now associated with
wiki markup
, e.g. wherein two equals-signs (==), at the start of a line, are the "heading start-tag", and two equals signs (==) after that are the "heading end-tag".
SHORTTAG
[
edit
]
SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature, do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks?either double
" "
(LIT) or single
' '
(LITA)?so that the previous markup example could be written:
<QUOTE
TYPE=
example
>
typically
something
like
<ITALICS>
this
<
/>
</QUOTE>
One feature of SGML markup languages is the "presumptuous empty tagging", such that the empty end tag
</>
in
<ITALICS>this</>
"inherits" its value from the nearest previous full start tag, which, in this example, is
<ITALICS>
(in other words, it closes the most recently opened item). The expression is thus equivalent to
<ITALICS>this</ITALICS>
.
Another feature is the
NET
(Null End Tag) construction:
<ITALICS/this/
, which is structurally equivalent to
<ITALICS>this</ITALICS>
.
Other features
[
edit
]
Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags:
can be written as
<QUOTE//
wherein the first
slash
( / ) stands for the NET-enabling "start-tag close" (NESTC), and the second slash stands for the NET.
NOTE:
XML defines NESTC with a
/
, and NET with an
>
(angled bracket)?hence the corresponding construct in XML appears as
<QUOTE/>
.
The third feature is 'text on the same line', allowing a markup item to be ended with a line-end; especially useful for headings and such, requiring using either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:
<!ELEMENT
lines
(
line
*)
>
<!ELEMENT
line
O
-
(
#PCDATA
)
>
<!ENTITY
line-tagc
"</line>"
>
<!SHORTREF
one-line
"&#RE;&#RS;"
line-tagc
>
<!USEMAP
one-line
line
>
(and "&#RE;&#RS;" is a short-reference delimiter in the concrete syntax), then:
<lines>
first
line
second
line
</lines>
is equivalent to:
<lines>
<line>
first
line
</line>
<line>
second
line
</line>
</lines>
Formal characterization
[
edit
]
SGML has many features that defied convenient description with the popular formal
automata theory
and the contemporary
parser
technology of the 1980s and the 1990s. The standard warns in Annex H:
The SGML
model group
notation was deliberately designed to resemble the
regular expression
notation of
automata
theory, because automata theory provides a theoretical foundation for some aspects of the notion of conformance to a content model. No assumption should be made about the general applicability of automata to content models.
A report on an early implementation of a parser for basic SGML, the Amsterdam SGML Parser,
[9]
notes
the DTD-grammar in SGML must conform to a notion of unambiguity which closely resembles the LL(1) conditions
and specifies various differences.
There appears to be no definitive classification of full SGML against a known class of
formal grammar
. Plausible classes may include
tree-adjoining grammars
and
adaptive grammars
.
XML is described as being generally parsable like a
two-level grammar
for non-validated XML and a
Conway
-style pipeline of
coroutines
(
lexer
,
parser
, validator) for valid XML.
[10]
The SGML productions in the ISO standard are reported to be LL(3) or LL(4).
[11]
XML-class subsets are reported to be expressible using a
W-grammar
.
[12]
According to one paper,
[13]
and probably considered at an
information set
or
parse tree
level rather than a character or delimiter level:
The class of documents that conform to a given SGML document
grammar
forms an LL(1) language. ... The SGML document grammars by themselves are, however, not LL(1) grammars.
The SGML standard does not define SGML with formal data structures, such as
parse trees
; however, an SGML document is constructed of a
rooted directed acyclic graph
(RDAG) of physical storage units known as "
entities
", which is parsed into a RDAG of structural units known as "elements". The physical graph is loosely characterized as an
entity tree
, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an
element tree
, but the ID/IDREF markup allows arbitrary arcs.
The results of parsing can also be understood as a data tree in different notations; where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.
The SGML standard describes it in terms of
maps
and
recognition modes
(s9.6.1). Each entity, and each element, can have an associated
notation
or
declared content type
, which determines the kinds of references and tags which will be recognized in that entity and element. Also, each element can have an associated
delimiter map
(and
short reference map
), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a
state machine
switching between recognition modes. During parsing, there is a stack of maps that configure the
scanner
, while the
tokenizer
relates to the recognition modes.
Parsing involves traversing the dynamically-retrieved entity graph, finding/implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively ? to
recognize
lexical structures, and actively ? to
generate
missing structures and tags that the DTD has declared optional. End- and start- tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single, possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.
SGML uses the term
validation
for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow
tag omission
; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML
without
a DTD (e.g. simple XML), is a grammar or a language; SGML
with
a DTD is a
metalanguage
. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism
is
a metalanguage.
SGML has an abstract syntax implemented by many possible concrete syntaxes; however, this is not the same usage as in an
abstract syntax tree
and as in a
concrete syntax tree
. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The
XML Infoset
corresponds more to the programming language notion of abstract syntax introduced by
John McCarthy
.
Derivatives
[
edit
]
The
W3C
XML (Extensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the World Wide Web. In addition to disabling many SGML options present in the reference syntax (such as omitting tags and nested subdocuments) XML adds a number of additional restrictions on the kinds of SGML syntax. For example, despite enabling SGML shortened tag forms, XML does not allow unclosed start or end tags. It also relied on many of the additions made by the WebSGML Annex. XML currently is more widely used than full SGML. XML has lightweight
internationalization
based on
Unicode
. Applications of XML include
XHTML
,
XQuery
,
XSLT
,
XForms
,
XPointer
,
JSP
,
SVG
,
RSS
,
Atom
,
XML-RPC
,
RDF/XML
, and
SOAP
.
HTML
[
edit
]
While HTML (Hyper Text Markup Language) was developed partially independently and in parallel with SGML, its creator,
Tim Berners-Lee
, intended it to be an application of SGML.
[
citation needed
]
The design of HTML was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application; however, the HTML markup language has many legacy- and exception-handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 ? SGML.
[14]
The charter for the 2006 revival of the
World Wide Web Consortium
HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML
'
".
[15]
Although HTML syntax closely resembles SGML syntax with the default
reference
concrete syntax
,
HTML5
abandons any attempt to define HTML as an SGML application, explicitly defining its own parsing rules,
[16]
which more closely match existing implementations and documents. It does, however, define an alternative
XHTML
serialization, which conforms to XML and therefore to SGML as well.
[17]
The second edition of the
Oxford English Dictionary
(OED) is entirely marked up with an SGML-based markup language using the
LEXX
text editor.
[18]
The third edition is marked up as XML.
Others
[
edit
]
Other document markup languages are partly related to SGML and XML, but?because they cannot be parsed or validated or other-wise processed using standard SGML and XML tools?they are not considered either SGML or XML languages; the
Z Format
markup language for typesetting and documentation is an example.
Several modern programming languages support tags as primitive token types, or now support Unicode and
regular expression
pattern-matching. An example is the
Scala programming language
.
Applications
[
edit
]
Document markup languages defined using SGML are called "applications" by the standard; many pre-XML SGML applications were proprietary property of the organizations which developed them, and thus unavailable in the World Wide Web. The following list is of pre-XML SGML applications.
- Text Encoding Initiative
(TEI) is an academic consortium that designs, maintains, and develops technical standards for digital-format textual representation applications.
- DocBook
is a markup language originally created as an SGML application, designed for authoring technical documentation; DocBook currently is an XML application.
- CALS
(Continuous Acquisition and Life-cycle Support) is a US Department of Defense (DoD) initiative for electronically capturing military documents and for linking related data and information.
- HyTime
defines a set of hypertext-oriented element types that allow SGML document authors to build hypertext and multimedia presentations.
- EDGAR
(Electronic Data-Gathering, Analysis, and Retrieval) system effects automated collection, validation, indexing, acceptance, and forwarding of submissions, by companies and others, who are legally required to file data and information forms with the US Securities and Exchange Commission (SEC).
- LinuxDoc
. Documentation for Linux packages has used the LinuxDoc SGML DTD and Docbook XML DTD.
- AAP DTD
is a
document type definition
for
scientific
documents, defined by the
Association of American Publishers
.
- ISO 12083
, a successor to AAP DTP, is an international SGML standard for document interchange between authors and publishers.
- SGMLguid
was an early SGML document type definition created, developed and used at
CERN
.
Open-source implementations
[
edit
]
Significant
open-source
implementations of SGML have included:
- ASP-SGML
- ARC-SGML
, by Standard Generalized Markup Language Users', 1991, C language
- SGMLS
, by James Clark, 1993, C language
- Project YAO
, by Yuan-ze Institute of Technology, Taiwan, with Charles Goldfarb, 1994.
- SP and Jade
by James Clark, C++ language
SP and Jade, the associated DSSSL processors, are maintained by the
OpenJade
project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at
SUNET
. The original HTML parser class, in Sun System's implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.
See also
[
edit
]
References
[
edit
]
- ^
a
b
ISO (5 March 2008).
"JTC 1/SC 34 ? Document description and processing languages"
. ISO
. Retrieved
2009-12-25
.
- ^
ISO JTC1/SC34.
"JTC 1/SC 34 ? Document Description and Processing Languages"
. Retrieved
2009-12-25
.
{{
cite web
}}
: CS1 maint: numeric names: authors list (
link
)
- ^
ISO/IEC 10744
?
Hytime
- ^
"ISO/IEC TR 9573"
(PDF)
.
ISO
. 1991
. Retrieved
5 December
2017
.
- ^
Goldfarb, Charles F.
(1996).
"The Roots of SGML ? A Personal Recollection"
. Archived from
the original
on December 20, 2012
. Retrieved
July 7,
2007
.
- ^
Goldfarb, Charles F.
(1990).
The SGML Handbook
. Clarendon Press.
ISBN
9780198537373
.
- ^
"Terms and Definitions of ISO 8879 draft"
.
- ^
Wohler, Wayne (July 21, 1998).
"SGML Declarations"
. Retrieved
August 17,
2009
.
- ^
Egmond (December 1989).
"The Implementation of the Amsterdam SGML Parser"
(PDF)
.
- ^
Carroll, Jeremy J. (November 26, 2001).
"CoParsing of RDF & XML"
(PDF)
.
Hewlett-Packard
. Retrieved
October 9,
2009
.
- ^
"SGML: Grammar Productions"
.
- ^
"Re: Other whitespace problems was Re: Whitespace rules (v2)"
.
- ^
Bruggemann-Klein.
"Compiler-Construction Tools and Techniques for SGML parsers: Difficulties and Solutions"
.
- ^
"HTML 4?4 Conformance: requirements and recommendations"
. Retrieved
2009-12-30
.
- ^
Lilley, Chris
;
Berners-Lee, Tim
(February 6, 2009).
"HTML Working Group Charter"
. Retrieved
April 19,
2007
.
- ^
"HTML5 ? Parsing HTML documents"
.
World Wide Web Consortium
. October 28, 2014
. Retrieved
June 29,
2015
.
- ^
Dubost, Karl (January 15, 2008).
"HTML 5, one vocabulary, two serializations"
.
Questions & Answers blog
.
W3C
. Retrieved
February 25,
2009
.
- ^
Cowlishaw, M. F.
(1987). "LEXX?A programmable structured editor".
IBM Journal of Research and Development
.
31
(1).
IBM
: 73.
doi
:
10.1147/rd.311.0073
.
External links
[
edit
]
ISO
standards
by standard number
|
---|
|
1?9999
| |
---|
10000?19999
| |
---|
20000?29999
| |
---|
30000+
| |
---|
|