| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
65
Electronic Document Interchange and Distribution Based on the
Portable Document Format, an Open Interchange Format
Stephen N. Zilles and Richard Cohn
Adobe Systems Incorporated
Abstract
The ability to interchange information in electronic form is key
to the effectiveness of the national information infrastructure
initiative (NII). Our long history with documents, their use, and
their management makes the interchange of electronic documents a
core component of any information infrastructure. From the very
beginning of networking, both local and global, documents have been
primary components of the information flow across the networks.
However, the interchange of documents has been limited by the lack
of common formats that can be created and received anywhere within
the global network. Electronic document interchange had been stuck
in a typewriter world while printed documents increased in visual
sophistication. In addition, much of the work on electronic
documents has been on their production and not on their
consumption. Yet there are many readers/consumers for every
author/producer. Only recently have we seen the emergence of
formats that are portable, independent of computer platform, and
capable of representing essentially all visually rich
documents.
Electronic document production can be viewed as having two
steps: the creation of the content of the document and the
composition and formatting of the content into a final form
presentation. Electronic interchange may take place after either
step. In this paper we present the requirements for interchange of
final form documents and describe an open document format, the
portable document format (PDF), that meets those requirements.
There are a number of reasons why it is important to be able to
interchange formatted documents. There are legal requirements to be
able to reference particular lines of a document. There is a legacy
of printed documents that can be converted to electronic form.
There are important design decisions that go into the presentation
of the content that can be captured only in the final form. The
portable document format is designed to faithfully represent any
document, including documents with typographic text, tabular data,
pictorial images, artwork, and figures. In addition, it extends the
visual presentation with electronic aids such as annotation
capabilities, hypertext links, electronic tables of contents, and
full word search indexes. Finally, PDF is extensible and will
interwork with formats for electronic interchange of the document
content, such as the HyperText Markup Language (HTML) used in the
World Wide Web.
Background
A solution to the problem of electronic document interchange
must serve all the steps of document usage. The solution must
facilitate the production of electronic documents, and, what is
more important, it must facilitate the consumption of these
documents. Here, consumption includes viewing, reading, printing,
reusing, and annotating the documents. There are far more readers
than authors for most documents. Serving consumers has a much
bigger economic impact than does serving authors. Replacing paper
distribution with electronic distribution increases timeliness,
reduces use of natural resources, and produces greater efficiency
and productivity. It also allows the power of the computer to be
applied to aiding the consumption of the document; for example,
hyperlinks to other documents and searches for words and phrases
can greatly facilitate finding the documents and portions thereof
that interest the consumer.
From a consumption point of view, the critical need in
electronic document interchange is the ability to view, print, or
read the document everywhere that someone has access to it. If the
document can also be revised or edited then that is so much the
better, but it is not required for use of the document.
Final Form and Revisable Form Interchange
There are two different ways to produce interchangeable
electronic documents. Although there are many steps in the
production of a visually rich document, the process of composition
and layout partitions production into two parts. Composition and
layout is the process which takes a representation of the content
of a document and places that content onto a two-dimensional space
(or sequence of two-dimensional spaces), usually called pages.
In the process of composition and layout, a number of decisions,
called formatting decisions, are made: which fonts in which sizes
and weights are used for which pieces of content, where on the page
the content is placed, whether there are added content fragments
such as headers and footers, and so on. These formatting decisions
may be made automatically based on rules provided to the
composition and layout process; they may be made by a human
designer interacting with the composition and layout process; or
they may be made using a combination of these two approaches.
The representation of the content before composition and layout
is called revisable form. Revision is relatively easy because the
formatting decisions have not been made or have only been
tentatively made and can be revised when changes occur. The
representation of the content after composition and layout is
called final form.
The interchange of electronic documents can be done either in
revisable form or in final form. If the revisable form is
interchanged, then either the formatting decisions must be made
entirely by the consumer of the electronic document or the rules
for making the formatting decisions must be interchanged with the
revisable form electronic document. If the first approach is
chosen, then there is no way to guarantee how the document will
appear to the consumer. Even if the second approach is chosen,
existing formatting languages do not guarantee identical final form
output when given the same revisable form input. Some formatting
decisions are always left to the consumer's composition and layout
software. Therefore, different composition and layout processes may
produce different final form output.
The interchange of revisable form electronic documents can meet
many authors' needs. Both the Standard Generalized Markup Language
(SGML) and the HyperText Markup Language (HTML) are successfully
used to interchange significant and interesting documents. But
there are cases where these formats are not sufficient to meet the
needs of the author. For these cases, interchange of final form
electronic documents is necessary.
Requirement for Page Fidelity
The key problem with interchanging only revisable form documents
is the inability to guarantee page fidelity. Page fidelity means
that a given revisable form document will always produce the same
final form output no matter where it is processed. There are a
number of reasons why page fidelity is required.
The most obvious reason is that the composition and layout
process involves a human designer's decisions. Only in the final
form is it possible to capture these decisions. These formatting
decisions are important in the presentation of the represented
information. This is quite obvious in advertisements, where design
plays an important role in the effectiveness of communicating the
intended message. It is perhaps less obvious but equally important
in the design of other information presentations. For example, the
placement of text in association with graphical structures, such as
in a map of the Washington, D.C., subway system (Figure 1), will
greatly affect whether the presentation can be understood. In
addition, formatting rules may not adequately describe complex
information presentations, such as mathematics or complex tables,
which may need to be hand designed for
effective communication. (Figure 2 has simple examples of
mathematics and tables.) Finally, the composition and layout design
may reflect the artistic expression of the designer, making the
document more pleasing to read (Figure 3).
Figure 1
Metro system map. SOURCE: Courtesy of Washington Metropolitan
Area Transit Authority.
The rich tradition of printed documents has established the
practice of using page and line numbers to reference portions of
documents, for legal and other reasons. These references will work
for electronic documents only if page fidelity can be guaranteed.
Many governmental bodies have built these references into their
procedures and require that they be preserved, in electronic as
well as in paper documents.
The final set of cases does not require page fidelity; they are
just more simply handled with a final form representation than a
revisable one. Documents that exist only in paper form, legacy
documents, can be scanned and their content recognized to produce
an electronic document. Although this recognized content could be
represented in many forms, it is simplest when the content is
represented in final form. Then it is not necessary to decide, for
each piece of recognized content, what part of the original
document content, such as body text, header, or footer, it belonged
to. Since the final form is preserved, the reader of the document
can correctly perform the recognition function. (See Figure 2 for
an example of a legacy page that would be difficult to categorize
correctly.)
Finally, preparing a document as an SGML or HTML document
typically involves a substantial amount of markup of sections of
content to ensure that a rule-driven composition and layout process
will produce the intended effect. For documents that are of
transient existence, it may be far simpler to produce the
composition and layout by hand and to interchange the final form
representation than to spend time tuning the document for a
rule-driven process.
Figure 2
Requirements for a Final Form Document Interchange Format
For a final form representation to be a suitable electronic
document interchange format for the NII, it should meet a number of
requirements:
• It should have an open, published specification
with sufficient detail to allow multiple implementations. There are
many definitions of openness, but the key component of them all is
that independent implementations of the specification are possible.
This gives the users of the format some guarantee of there being
reasonable cost products that support the format. With multiple
implementations, the question of interoperability is important.
This can be facilitated, although not guaranteed, by establishing
conformance test suites for the open specification.
• It should provide page fidelity. This is a complex
requirement. For text, this means representing the fonts, sizes,
weights, spacing, alignment, leading, and so on that are used in
the composition and layout of the original document. For graphics,
this means representing the shapes, the lines and curves, whether
they are filled or stroked, any shading or gradients, and all the
scaling, rotations, and positioning of the graphic elements.
For images, this means doing automatic conversion from
the resolution of the sample space of the image to the resolution
of the device, representing both binary and continuous tone images,
with or without compression of the image data. For all of the
above, the shapes and colors must be preserved where defined in
device- and resolution-independent terms.
Figure 3
• It should provide a representation for electronic
navigation. The format should be able to express hyperlinks within
and among documents. These links should be able to refer to
documents in formats other than this final form format, such as
HTML documents, videos, or animations. The format should be able to
represent a table of contents or index using links into the
appropriate places in the document. The format should also allow
searches for words or phrases within the document and positioning
at successive hits.
• It should be resource independent. The ability to
view an electronic document should not depend on the resources
available where it is viewed. There are two parts to this
requirement. A standard set of resources, such as a set of standard
fonts, can be defined and required at every site of use. These
resources need not be transmitted with the document. For resources,
such as fonts, that are not in the standard set, there must be
provision for inclusion of the resource in the document.
• It should provide access to the content of the
document for the visually disabled. This means, at a minimum, being
able to provide ASCII text that is marked up with tags that are
compliant with the International Standard ISO-12083 (ICADD DTD).
The ICADD (International Committee for Accessible Document Design)
markup must include the structural and navigational information
that is present in the document.
• It should be possible to create electronic
documents in this format using a wide range of document generation
processes. Ideally, any application that can generate final form
output should be usable to generate an electronic document in this
format. There should also be a means for paper documents to be
converted into this format.
• It should be platform independent; that is, the
format should be independent of the choice of hardware and
operating systems, and transportable from one platform to any other
platform. It should also be independent of the authoring
application; the authoring application should not be required to
view the electronic document.
• It should effectively use storage. In particular,
it should use relevant, data type-dependent compression
techniques.
• It should scale well. Applications using the
format should perform nearly as well on huge (20,000-page)
documents as they do on small ones; the performance on a large
document or a (large) collection of smaller documents should be
similar. This requirement implies that any component of the
electronic document be randomly accessible.
• It should integrate with other NII technologies.
It should be able to represent hyperlinks to other documents using
the Universal Resource Identifiers (URIs) defined by the World Wide
Web (WWW) and it should be identifiable by a URI. It should be
possible to encrypt an electronic document both for privacy and for
authentication.
• It should be possible to revise and reuse the
documents represented in the format. Editing is a broad notion that
begins with the ability to add annotations, electronic "sticky
notes," to documents. At the next level, preplanned replacement of
objects, such as editing fields on a form, might be allowed. Above
that, one might allow replacement, extraction and/or deletion of
whole pages, and, finally, arbitrary revision of the content of
pages.
• It should be extensible, meaning that new data
types and information can be added to the representation without
breaking previous consumer applications that accept the format.
Examples of extensions would be adding a new data object for sound
annotations or adding information that would allow editing of some
object.
These requirements might be satisfied in a number of different
ways. We describe below a particular format, the portable document
format (PDF), which has been developed by Adobe Systems Inc., and
the architectures that make interchange of electronic documents
practical.
An Architectural Basis for Interchange
There are three basic architectures that facilitate interchange
of electronic documents: (1) the architecture for document
preparation, (2) the architecture of the document representation,
and (3) the architecture for extension. These three architectures
are illustrated in Figure 4. The architecture for document
preparation is shown on the left-hand side of the figure and
encompasses both electronic and paper document preparation
processes. The right-hand side of the figure shows consumption of
prepared documents. The portable document format (PDF) is the
architecture for document representation and is the link between
these components. The right-hand side of the figure shows
consumption at two levels. There is an optional cataloguing step
that builds full text indexes for one or a collection of documents.
Above this, viewing and printing PDF documents are shown. The
architecture for extension is indicated by the "search" plug-in,
which allows access to indexes built by the cataloguing
process.
The Architecture for Document Preparation
To be effective, any system for electronic document interchange
must be able to capture documents in all the forms in which they
are generated. This is facilitated by the recent shift to
electronic preparation of documents, but it must include a pathway
for paper documents as well.
Figure 4
Unlike the wide range of forms that revisable documents can
take, there are relatively few final form formats in use today.
(This is another reason that it makes sense to have a final form
interchange format.) Since, historically, final forms were prepared
for printing, one can capture documents in final form by replacing
or using components of the printing architectures of the various
operating systems. Two such approaches have been used with PDF: (1)
in operating systems with selectable print drivers, adding a print
driver that generates PDF and (2) translating existing visually
rich print formats to PDF. Both these pathways are shown in the
upper left corner of Figure 4. PDFWriter is the replacement print
driver and the Acrobat Distiller translates PostScript (PS)
language files into PDF.
Some operating systems, such as the Mac OS and Windows, have a
standard interface, the GUI (graphical user interface), which can
be used by any application both to display information on the
screen and to print what is displayed. By replacing the print
driver that lies beneath the GUI it is possible to capture the
document that would have been printed and to convert it into the
electronic document interchange format.
For operating systems without a GUI interface to printing and
for applications that choose to generate their print format
directly, the PostScript language is the industry standard for
describing visually rich final form documents. Therefore, the
electronic document interchange format can be created by
translating, or "distilling," PostScript language files. This
distillation process converts the print description into a form
more suitable for viewing and electronic navigation. The PostScript
language has been extended, for distillation, to allow information
on navigation to be included with the print description, allowing
the distillation process to automatically generate navigational
aids.
The above two approaches to creation of PDF documents work with
electronic preparation processes. But there is also an archive of
legacy documents that were never prepared electronically or are not
now available in electronic form. For these documents there is a
third pathway to PDF, shown in the lower left corner of Figure 4.
Paper documents can be scanned and converted to raster images.
These images are then fed to a recognition program, Acrobat
Capture, that identifies the textual and nontextual parts of the
document. The textual parts are converted to coded text with
appropriate font information, including font name, weight, size,
and posture. The nontextual parts remain as images. This process
produces an electronic representation of the paper document that
has the same final form as the original and is much smaller than
the scanned version. Because the paper document is a final form,
the same final form format can be used, without loss of fidelity,
for paper documents and for the electronically generated
documents.
Page fidelity is important. The current state of recognition
technology, though very good, is not infallible; there are always
some characters that cannot be identified with a high level of
confidence. Because the PDF format allows text and image data to be
freely intermixed, characters or text fragments whose recognition
confidence falls below an adjustable level can be placed in the PDF
document as images. These images can then be read by a human reader
even if a mechanical reader could not interpret them. (Figure 2
shows a document captured by this process.)
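The confidence rule described above can be sketched in a few lines. This is only an illustration, not the behavior of Acrobat Capture: the Fragment type, the 0.0 to 1.0 confidence scale, and the 0.9 default threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str          # the recognizer's best guess for this fragment
    confidence: float  # hypothetical 0.0 to 1.0 recognition confidence
    image: bytes       # the scanned pixels for the same fragment

def place_fragments(fragments, threshold=0.9):
    """Keep confident fragments as coded text and the rest as images."""
    items = []
    for frag in fragments:
        if frag.confidence >= threshold:
            items.append(("text", frag.text))
        else:
            # Low confidence: store the pixels so a human can still read them.
            items.append(("image", frag.image))
    return items

page = place_fragments([
    Fragment("Portable", 0.99, b"<pixels-1>"),
    Fragment("D0cument", 0.42, b"<pixels-2>"),
])
```

Raising the threshold trades file size for safety: more fragments stay as images, but fewer misrecognized words reach the reader as text.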
The Architecture of the Document Representation
There is more to the architecture of the document representation
than meeting the above requirements for a final form
representation. Architectures need to be robust and flexible if
they are to be useful over a continuing span of years. PDF has such
an architecture.
The PDF architecture certainly meets these requirements, as will
be clear below. Most importantly, PDF has an open specification:
the Portable Document Format Reference Manual (ISBN
0-201-62628-4) has been published for several years, and
implementations have been produced by several vendors.
PDF also goes beyond the final form requirements. For example,
the content of PDF files can be randomly accessed and the files
themselves can be generated in a single pass through the document
being converted to PDF form. In addition, incremental changes to a
PDF file require only incremental additions to the file rather than
a complete rewrite of the file. These are aspects of PDF that are
important with respect to the efficiency of the generation and
viewing processes.
The PDF file format is based on long experience both with a
practical document interchange format and with applications that
were constructed on top of that format. Adobe Illustrator is a
graphics design program whose intermediate file format is based on
the PostScript language. By making the intermediate format also be
a print format, the output of Adobe Illustrator could easily be
imported into other applications because they could print the
objects without having to interpret the semantics. In addition,
because the underlying semantics were published, these objects
could be read by other applications when required. The lessons
learned in the development of Adobe Illustrator went into the
design of PDF.
PDF, like the Adobe Illustrator file format, is based on the
PostScript language. PDF uses the PostScript language imaging
model, which has proven itself over 12 years of experience as being
capable of faithfully representing visually rich documents. Yet the
PostScript file format was designed for printing, not for
interactive access. To improve system performance for interactive
access, PDF has a restructured and simplified description
language.
Experience with the PostScript language has shown that, although
having a full programming language capability is useful, a properly
chosen set of high-level combinations of the PostScript language
primitives can be used to describe most, if not all, final form
pages. Therefore, PDF has a fixed vocabulary of high-level
operators that can be more efficiently implemented than arbitrary
combinations of the lower-level primitives.
The User Model for PDF Documents
The user sees a PDF document as a collection of pages. Each page
has a content portion that represents the final form of that page
and a number of virtual overlays that augment the page in various
ways. For example, there are overlay layers for annotations, such
as electronic sticky notes, voice annotations, and the like. There
are overlay layers for hyperlinks to other parts of the same
document or hyperlinks to other documents and other kinds of
objects, such as video segments or animations. There is an overlay
layer that identifies the threading of the content of articles from
page to page and from column to column. Each of the overlay layers
is associated with the content portion geometrically. Each overlay
object has an associated rectangle that encompasses the portion of
content associated with the object.
Each of the layers is independent of the others. This allows
information in one layer to be extracted, replaced, or imported
without affecting the other layers. This facilitates exporting
annotations made on multiple
copies of a document sent out for review and then reimporting
all the review annotations into a single document for responding to
the reviewers' comments. This also makes it possible to define
hyperlinks and threads on the layout of a document that only has
the text portion present and then to replace the text-only pages
with pages that include the figures and images to create the
finished document.
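The independence of the layers can be illustrated with a small model. The sketch below is not the PDF representation itself; the Overlay and Page types, the layer names, and the helper functions are hypothetical, chosen only to show that one layer can be exported, merged, or left alone while the content portion is replaced.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Overlay:
    rect: tuple    # (x0, y0, x1, y1): region of page content the object covers
    payload: str   # note text, link target, and so on

@dataclass
class Page:
    content: str                                # stand-in for the final form content
    layers: dict = field(default_factory=dict)  # layer name -> list of Overlay

def export_layer(page, name):
    """Copy one layer out of a page without touching the others."""
    return list(page.layers.get(name, []))

def import_layer(page, name, overlays):
    """Add overlay objects to one layer of a page."""
    page.layers.setdefault(name, []).extend(overlays)

def hits(page, name, x, y):
    """Overlay objects in a layer whose rectangle contains the point (x, y)."""
    return [o for o in page.layers.get(name, [])
            if o.rect[0] <= x <= o.rect[2] and o.rect[1] <= y <= o.rect[3]]

master = Page("text only", {"links": [Overlay((72, 700, 200, 712), "page 12")]})
copy_a = Page("text only", {"notes": [Overlay((72, 600, 300, 620), "Unclear sentence")]})
copy_b = Page("text only", {"notes": [Overlay((72, 400, 300, 420), "Add a citation")]})

# Gather the review annotations from every copy into the master document.
for review_copy in (copy_a, copy_b):
    import_layer(master, "notes", export_layer(review_copy, "notes"))

# Replace the text-only content; the link and note layers are untouched.
master.content = "text with figures"
```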
In addition to the page-oriented navigational layers, there are
two document-level navigation aids. There is a set of bookmark or
outline objects that allow a table of contents or index to be
defined into the set of pages. Each bookmark is a link to a
particular destination in the document. A destination specifies the
target page and the area on that page that is the target for
display. Destinations can be specified directly or named and
referred to by name. Using named destinations, especially for links
to other documents, allows the other documents to be revised
without invalidating the destination reference.
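The benefit of named destinations can be seen in a short sketch. The table layout and the resolve function are assumptions made for illustration; they are not the PDF encoding of destinations.

```python
def resolve(destination, named):
    """Return (page_index, rect) for a direct or a named destination."""
    if isinstance(destination, str):
        return named[destination]   # named: look it up in the target document
    return destination              # direct: already a (page_index, rect) pair

# Name table kept by the target document.
named = {"results": (4, (0, 500, 612, 792))}

direct_link = (4, (0, 500, 612, 792))   # stored in the referring document
named_link = "results"                  # also stored in the referring document

# The target document is revised and the section moves two pages later.
# Only the target's name table changes.
named["results"] = (6, (0, 300, 612, 592))
```

After the revision the named link follows the section to its new page, while the direct link still points at the old page and area.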
Finally, associated with each page is an optional thumbnail
image of the page content. These thumbnails can be arrayed in
sequence in a slide sorter array and can be used both to navigate
among pages and to reorder, move, delete, and insert pages within
and among documents.
The Abstract Model of a PDF Document: A Tree
Abstractly, the PDF document is represented as a series of
trees. A primary tree represents the set of pages and secondary
trees represent the document-level objects described in the user
model. Each page is itself a small tree with a branch for the
representation of the page content; a branch for the resources,
such as fonts and images used on the page; a branch for the
annotations and links defined on the page; and a branch for the
optional thumbnail image. The page content is represented as a
sequence of high-level PostScript language imaging model operators.
The resources used are represented as references to resource
objects that can be shared among pages. There is an array of
annotation and link objects.
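The shape of this abstract model can be sketched with a few record types. The class and field names are illustrative assumptions rather than PDF key names; the operator strings in the page content (BT, Tf, Tj, ET) are text operators from the PDF content language.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Font:
    name: str

@dataclass
class Page:
    content: list                                    # sequence of imaging operators
    resources: dict = field(default_factory=dict)    # name -> shared resource object
    annotations: list = field(default_factory=list)  # annotation and link objects
    thumbnail: Optional[bytes] = None                # optional thumbnail image

@dataclass
class Document:
    pages: list = field(default_factory=list)        # primary tree
    outlines: list = field(default_factory=list)     # secondary tree: bookmarks

helvetica = Font("Helvetica")   # one resource object, referenced by both pages

doc = Document(
    pages=[
        Page(content=["BT", "/F1 12 Tf", "(Page one) Tj", "ET"],
             resources={"F1": helvetica}),
        Page(content=["BT", "/F1 12 Tf", "(Page two) Tj", "ET"],
             resources={"F1": helvetica}),
    ],
    outlines=[("Page two", 1)],   # bookmark title and target page index
)
```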
The Representation of the Abstract Tree
The abstract document tree is represented in terms of the
primitive building blocks of the PostScript language. There are
five simple objects and three complex objects. The simple objects
are the null object (which is a placeholder), the Boolean object
(which is either true or false), the number object (which is an
integer or fixed point), the string object (which has between 0 and
65535 octets), and the name object (which is a read-only
string).
The three complex objects are arrays, dictionaries, and streams.
Arrays are sequences of 0 to 65535 objects that may be of mixed type
and may include other arrays. Dictionaries are sets of up to 65535
key-value pairs where the key is a name and the value is any
object. Streams are composed of a dictionary and an arbitrary
sequence of octets. The dictionary allows the content of the
streams to be encoded and/or compressed to improve space
efficiency. Encoding algorithms are used to limit the range of
octets that appear in the representation of the stream. Those
defined in PDF are ASCIIHex (each hex digit is represented as an
octet) and ASCII85 (each four octets of the stream are represented
as five octets). These both produce octet strings restricted to the
7-bit ASCII graphic character space. Compression algorithms are
used to reduce storage requirements. Those defined in PDF are LZW
(licensed from Unisys), Run length, CCITT Group 3 and Group 4 FAX
and DCT (JPEG).
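The effect of the two encodings can be checked with Python's standard binascii and base64 modules, which implement hexadecimal and Ascii85 coding. This is a sketch of the size and character-range behavior only; it does not add the end-of-data markers that a PDF stream would carry.

```python
import base64
import binascii

data = bytes(range(8))   # eight arbitrary octets, some of them non-printable

# ASCIIHex: every octet becomes two hex digits, each stored as one octet.
hex_encoded = binascii.hexlify(data).upper()

# ASCII85: every four octets become five printable octets.
a85_encoded = base64.a85encode(data)

# Both encodings can be reversed to recover the original octets.
hex_decoded = binascii.unhexlify(hex_encoded)
a85_decoded = base64.a85decode(a85_encoded)

def is_ascii_graphic(octets):
    """True if every octet is a printable 7-bit ASCII graphic character."""
    return all(33 <= b <= 126 for b in octets)
```

ASCIIHex doubles the size of the data, while ASCII85 expands it by only one quarter, which is why the latter is usually preferred when a 7-bit-safe file is needed.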
The terminal nodes of the abstract tree are represented by
simple objects and streams. The nonterminal nodes are represented
by arrays and dictionaries. The branches (arcs) of the tree are
represented in one of two ways. The simplest way is that the
referenced object is directly present in the nonterminal node
object. This is called a direct object. The second form of branch
is an indirect object reference. Objects can be made into indirect
objects by giving the (direct) object an object number and a
generation number. These indirect objects can then be referenced by
using the object number and generation number in place of the
occurrence of the direct object.
Indirect objects and indirect object references allow objects to
be shared. For example, a font used on several pages need only be
stored once in the document. They also allow the values of certain
keys, such as the
length of a stream, to be deferred until the value is known.
This property is needed to allow PDF to be produced in one pass
through the input to the PDF generation process.
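Both uses of indirect objects can be shown with the PDF syntax for them: "n g obj ... endobj" defines an indirect object and "n g R" refers to it. The helper functions and the particular object numbers below are assumptions made for this sketch.

```python
def indirect(num, gen, body):
    """Wrap a direct object as an indirect object definition."""
    return f"{num} {gen} obj\n{body}\nendobj\n"

def ref(num, gen=0):
    """An indirect reference to object (num, gen)."""
    return f"{num} {gen} R"

# Sharing: the font is stored once and referenced from two pages.
font = indirect(5, 0, "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>")
page_a = indirect(3, 0, f"<< /Type /Page /Resources << /Font << /F1 {ref(5)} >> >> >>")
page_b = indirect(4, 0, f"<< /Type /Page /Resources << /Font << /F1 {ref(5)} >> >> >>")

# Deferral: the stream dictionary only points at its length. The length
# object is written after the stream, once the number of octets is known.
data = "BT /F1 12 Tf (Hello) Tj ET"
stream = indirect(6, 0, f"<< /Length {ref(7)} >>\nstream\n{data}\nendstream")
length = indirect(7, 0, str(len(data)))

body = font + page_a + page_b + stream + length
```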
The PDF File Structure
Indirect objects and indirect object references do not allow
direct access to the objects. This problem is solved by the PDF
file structure. There are four parts to the file structure. The
first part is a header, which identifies the file as being a PDF
file and indicates the version of PDF being used in the file. The
second part is the body, which is a sequence of indirect objects.
The third part is the cross-reference table. This table is a
directory that maps object numbers to offsets in the (body of the)
file structure. This allows direct access to the indirectly
referenced objects.
The final part is the Trailer, which serves several purposes. It
is the last thing in the file and it has the offset of the
corresponding cross-reference table. It also has a dictionary
object. This dictionary gives the size of the cross-reference table.
It indicates which indirect object is the root of the document
tree. It indicates which object is the "info dictionary," a set of
keys that allow attributes to be associated with the document.
These keys include such information as author, creation date,
etc.
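The four parts can be seen in a minimal writer that emits a file in one pass and records each object's offset as it goes. This is a sketch under stated assumptions: the build_pdf function is hypothetical, every object has generation number 0, the version in the header is arbitrary, and the output has not been checked against any viewer.

```python
def build_pdf(objects, root=1):
    """Write header, body, cross-reference table, and trailer in one pass.

    objects is a list of object bodies; they are numbered 1..n in order.
    Returns the file octets and the list of recorded object offsets.
    """
    out = bytearray(b"%PDF-1.1\n")                      # 1. header
    offsets = []
    for num, body in enumerate(objects, start=1):       # 2. body
        offsets.append(len(out))
        out += f"{num} 0 obj\n{body}\nendobj\n".encode("ascii")
    xref_at = len(out)                                  # 3. cross-reference table
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"                     # entry 0 heads the free list
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root {root} 0 R >>\n"
            f"startxref\n{xref_at}\n%%EOF\n").encode("ascii")   # 4. trailer
    return bytes(out), offsets

pdf, offsets = build_pdf([
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
])

def read_object(pdf, offsets, num):
    """Direct access: jump to the recorded offset instead of scanning the file."""
    start = offsets[num - 1]
    end = pdf.index(b"endobj", start) + len(b"endobj")
    return pdf[start:end]
```

A reader starts from the end of the file, follows the startxref offset to the table, and from there can reach any object without reading the rest of the body.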
Finally, the trailer dictionary can have an ID key whose value
has two parts. Both parts are typically hash functions applied to
parts of the document and key information about the document. The
first hash is created when the document is first stored; it is
never modified after that. The second is changed whenever the
document is stored. By storing these IDs with file specifications
referencing the document, one can more accurately determine that
the document retrieved via a given file specification is the
document that is desired.
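The behavior of the two-part ID can be sketched as follows. The choice of MD5 and of the inputs to the hash are assumptions for illustration only; the text above says just that the parts are typically hash functions of the document and of key information about it.

```python
import hashlib

def make_id(document_bytes, info_bytes, previous=None):
    """Return the (first, second) ID pair to store when a document is saved.

    previous is the pair from the last save, or None on the first save.
    """
    digest = hashlib.md5(document_bytes + info_bytes).hexdigest()
    first = previous[0] if previous else digest   # fixed at first save
    return (first, digest)                        # second part changes each save

id_v1 = make_id(b"draft one", b"Title: Report")
id_v2 = make_id(b"draft two", b"Title: Report", previous=id_v1)

def same_document(stored_id, found_id):
    """Same lineage if the first parts agree."""
    return stored_id[0] == found_id[0]

def same_version(stored_id, found_id):
    """Same saved state if both parts agree."""
    return stored_id == found_id
```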
The trailer is structured to allow PDF files to be incrementally
updated. This allows PDF files to be edited, say deleting some
pages or adding links or annotations, without having to rewrite the
entire file. For large files, this can be a significant savings.
This is accomplished by adding any new indirect objects after the
existing final trailer and appending a new cross-reference table
and trailer to the end of the file. The new cross-reference table
provides access to the new objects and hides any deleted objects.
This mechanism also provides a form of "undo" capability. Move the
end of the file back to the last byte of the previous trailer and
all changes made since that trailer was written will be
removed.
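Incremental update and the resulting undo can be sketched on raw octets. The functions below are hypothetical; the base file is a stand-in rather than a complete PDF, all objects are given generation number 0, and the undo assumes that the end-of-file marker "%%EOF" never occurs inside stream data.

```python
EOF_MARK = b"%%EOF\n"

def append_update(pdf, changed, prev_xref_at, size, root=1):
    """Append new or replaced objects, then a new table and trailer.

    changed maps object numbers to object bodies. The original octets
    are left exactly as they were.
    """
    out = bytearray(pdf)
    entries = {}
    for num, body in sorted(changed.items()):
        entries[num] = len(out)
        out += f"{num} 0 obj\n{body}\nendobj\n".encode("ascii")
    xref_at = len(out)
    out += b"xref\n"
    for num, off in entries.items():
        # One subsection per object: first object number, then entry count.
        out += f"{num} 1\n{off:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {size} /Root {root} 0 R /Prev {prev_xref_at} >>\n"
            f"startxref\n{xref_at}\n%%EOF\n").encode("ascii")
    return bytes(out)

def undo_last_update(pdf):
    """Cut the file back to the last octet of the previous trailer."""
    last = pdf.rfind(EOF_MARK)
    prev = pdf.rfind(EOF_MARK, 0, last)
    if prev == -1:
        return pdf                      # no earlier trailer to return to
    return pdf[:prev + len(EOF_MARK)]

# Stand-in for an existing file whose table starts at offset 9.
base = (b"%PDF-1.1\n% stand-in for the original body, table, and trailer\n"
        b"startxref\n9\n%%EOF\n")
updated = append_update(base, {4: "<< /Type /Annot >>"}, prev_xref_at=9, size=5)
```

The cost of an edit is proportional to the size of the change, not to the size of the file, because nothing before the old trailer is rewritten.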
The purpose of the generation numbers in the indirect object
definition and reference is to allow reuse of table entries in the
cross-reference table when objects are deleted. This keeps the
cross-reference table from growing arbitrarily large. Any indirect
object reference is looked up in the endmost cross-reference table
in the document. If the generation number in that cross-reference
table does not match the generation number in the indirect
reference, then the referenced object no longer exists, the
reference is bad, and an error is reported. Deleted or unused
entries in the cross-reference table are threaded on a list of free
entries.
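A simplified in-memory model of this bookkeeping is sketched below
in Python. It is not the on-disk table layout: the class and method
names are hypothetical, and the choice to raise the generation
number at the moment an entry is freed is an assumption of the
sketch. It shows a freed entry being threaded on a free list, reused
for a new object, and rejecting a stale reference.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class XRefEntry:
    offset: int                      # byte offset of the object while in use
    generation: int                  # generation number valid for this entry
    in_use: bool = True
    next_free: Optional[int] = None  # next entry on the free list, if free


class XRefTable:
    def __init__(self) -> None:
        self.entries: Dict[int, XRefEntry] = {}
        self.free_head: Optional[int] = None

    def add(self, offset: int) -> Tuple[int, int]:
        """Store a new object, reusing a free entry when one exists."""
        if self.free_head is not None:
            number = self.free_head
            entry = self.entries[number]
            self.free_head = entry.next_free
            entry.offset = offset
            entry.in_use = True
            entry.next_free = None
        else:
            number = len(self.entries) + 1
            self.entries[number] = XRefEntry(offset, 0)
        return number, self.entries[number].generation

    def delete(self, number: int) -> None:
        """Free an entry and bump its generation so old references mismatch."""
        entry = self.entries[number]
        if not entry.in_use:
            raise LookupError(f"object {number} is already free")
        entry.in_use = False
        entry.generation += 1
        entry.next_free = self.free_head
        self.free_head = number

    def resolve(self, number: int, generation: int) -> int:
        """Return the offset for a reference, or report a bad reference."""
        entry = self.entries.get(number)
        if entry is None or not entry.in_use or entry.generation != generation:
            raise LookupError(f"bad reference: {number} {generation} R")
        return entry.offset
```

With this model, deleting object 1 and then adding a new object
reuses entry 1 with generation 1, so the table stays the same size
while a leftover reference to generation 0 is reported as an error.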
Resources
The general form and representation of a PDF file have been
outlined. There are, however, several areas that need further
detail. The page content representation is designed to refer to a
collection of resources external to the pages. These include the
representations of color spaces, fonts, images, and shareable
content fragments. For device-independent color spaces, the color
space resource contains the information needed to map colors in
that color space to the standard CIE 1931 XYZ color space and
thereby ensure accurate reproduction across a range of devices.
Images are represented as arrays of sample values that come from a
specified color space and may be compressed. Page content fragments
are represented as content language subroutines that can be
referred to from content. For example, a corporate logo might be
used on many pages, but the content operators that draw the logo
need be stored only once as a resource.
Typically, however, the resources that are most critical for
ensured reproduction are the font resources. The correct fonts are
needed to be able to faithfully reproduce the text as it was
published. PDF has a three-level approach to font resources. First,
there is a set of 13 fonts (12 textual fonts and 1 symbol font)
that must be
available to the viewer. These fonts can be assumed to exist at
any consumer of a PDF document. For other fonts, there are two
solutions. The fonts may be embedded within the document or
substitutions may be made for the fonts if they are not available
on the consumer's platform. Fonts that are embedded may be either
Adobe Type 1 fonts or TrueType fonts and may be the full font or a
subset of the font sufficient to display the document in which they
are embedded.
The font architecture divides the font representation into three
separate parts: the font dictionary, the font encoding, and the
font descriptor. The font dictionary represents the font and may
refer to a font encoding and/or a font descriptor. The font
encoding maps octet values that occur in a string into the names of
the glyphs in a font. The font descriptor has the metrics of the
font, including the width and height of glyphs and attributes such
as the weight of stems, whether it is italic, the height of
lower-case letters, and so on. The font shape data, if included,
are part of the font descriptor. If the font shape data are not
included, then the other information in the font descriptor can be
used to provide substitute fonts. Substitute fonts work for textual
fonts and replace the expected glyphs with glyphs that have the
same width, height, and weight as the original glyphs. If page
fidelity is required, then the font shape data should be embedded;
but font substitution can be used to reduce document size where the
omitted fonts are either expected at the consumer's location or
font substitution is adequate for reading the document.
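The three-part division can be pictured with the Python sketch
below. All class, field, and function names are hypothetical, the
descriptor holds only a few of the metrics mentioned above, and the
order of preference (embedded shape data, then a font available on
the consumer's platform, then substitution) is an assumption of the
sketch rather than a rule stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class FontDescriptor:
    stem_weight: float
    italic: bool
    x_height: float                     # height of lower-case letters
    widths: Dict[str, float]            # glyph name -> width
    shape_data: Optional[bytes] = None  # embedded font program, if any


@dataclass
class FontDictionary:
    name: str
    encoding: Optional[Dict[int, str]] = None  # octet value -> glyph name
    descriptor: Optional[FontDescriptor] = None


def glyph_names(font: FontDictionary, text: bytes) -> List[str]:
    """Map each octet of a string to a glyph name through the font encoding."""
    encoding = font.encoding or {}
    return [encoding.get(octet, ".notdef") for octet in text]


def rendering_strategy(font: FontDictionary, available: Set[str]) -> str:
    """Decide how a consumer could obtain glyph shapes for this font."""
    descriptor = font.descriptor
    if descriptor is not None and descriptor.shape_data is not None:
        return "use embedded font"
    if font.name in available:
        return "use font available to the viewer"
    if descriptor is not None:
        return "substitute using descriptor metrics"
    return "no faithful rendering possible"


# Made-up font for illustration; the metric values are not from a real font.
example = FontDictionary(
    name="ExampleSans",
    encoding={65: "A", 66: "B"},
    descriptor=FontDescriptor(80.0, False, 500.0, {"A": 667.0, "B": 667.0}),
)
```

For the example font, which carries metrics but no shape data, a
viewer that has only Helvetica available would fall back to
substitution, using the widths in the descriptor to keep the line
layout intact.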
Hyperlinks
The hyperlink mechanism has two parts: the specification of
where the link is and the specification of where the link goes. The
first specification is given as a rectangular area defined on the
page content. The second specification is called an action. There
are a number of different action types. The simplest is moving to a
different destination within the same document. A more complex
action is moving to a destination in another PDF document. The
destination may be a position in the document, a named destination,
or the beginning of an article thread. Instead of making another
PDF document available for viewing, the external reference may
launch an application on a particular file, such as a fragment of
sound or a video. All these external references use
platform-independent file names, which may be relative to the file
containing the reference document, to refer to external
entities.
The URL (Uniform Resource Locator), as defined for the World
Wide Web, is another form of allowed reference to an external
document. A URL identifies a file (or part thereof) that may be
anywhere in the electronically reachable world. When the URL is
followed, the object retrieved is typed and then a program that can
process that type is invoked to display the object. Any type for
which there is a viewing program, including PDF, can thereby be
displayed.
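The two-part link model, an active rectangle plus an action, can be
sketched in Python as follows. The class names, the rectangle
convention (lower-left and upper-right corners), and the sample
values are assumptions made for illustration; they cover the action
kinds described above and do not reproduce the PDF dictionaries
themselves.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class GoTo:
    destination: str  # position or named destination in the same document


@dataclass
class GoToRemote:
    file: str         # platform-independent name of another PDF document
    destination: str


@dataclass
class Launch:
    file: str         # file handed to another application, such as a sound


@dataclass
class FollowURL:
    url: str          # reference to an object anywhere on the network


Action = Union[GoTo, GoToRemote, Launch, FollowURL]


@dataclass
class Link:
    rect: Tuple[float, float, float, float]  # (left, bottom, right, top)
    action: Action


def action_at(links: List[Link], x: float, y: float) -> Optional[Action]:
    """Return the action of the first link whose rectangle contains the point."""
    for link in links:
        left, bottom, right, top = link.rect
        if left <= x <= right and bottom <= y <= top:
            return link.action
    return None


# Hypothetical links on one page.
page_links = [
    Link((72, 700, 200, 715), GoTo("chapter-2")),
    Link((72, 650, 200, 665), GoToRemote("../specs/other.pdf", "intro")),
    Link((72, 600, 200, 615), FollowURL("http://www.example.com/report.pdf")),
]
```

A viewer would call action_at with the pointer position and then
dispatch on the type of the returned action; a click outside every
rectangle returns None and is ignored.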
Extensibility
PDF is particularly extensible. It is constructed from simple
building blocks; the PDF file is a tree constructed from leaves
that are simple data types or streams and with arrays and
dictionaries as the nonterminal nodes. In general, additional keys
can be added to dictionaries without affecting viewers that do not
understand the new keys. These additional keys may be used to add
information needed to control and/or represent new content object
types and to define editing on existing objects. Because of the
flexibility of the extension mechanism, a registry has been defined
to help avoid key name conflicts that might arise when several
extensions are simultaneously present.
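One way to picture this behavior is the Python sketch below. A
viewer keeps the keys it understands and passes over the rest, while
a registry hands out distinct prefixes so that keys from different
extensions cannot collide. The prefix scheme, the class KeyRegistry,
and the sample keys are assumptions for illustration; the text says
only that a registry exists.

```python
from typing import Any, Dict, Set, Tuple


def split_known(
    dictionary: Dict[str, Any], known: Set[str]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate the keys a viewer understands from those it should ignore."""
    understood = {k: v for k, v in dictionary.items() if k in known}
    ignored = {k: v for k, v in dictionary.items() if k not in known}
    return understood, ignored


class KeyRegistry:
    """Assigns each key prefix to exactly one owner."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def register(self, prefix: str, owner: str) -> None:
        current = self._owners.setdefault(prefix, owner)
        if current != owner:
            raise ValueError(f"prefix {prefix!r} already belongs to {current!r}")

    def qualified(self, prefix: str, name: str) -> str:
        if prefix not in self._owners:
            raise KeyError(f"prefix {prefix!r} is not registered")
        return f"{prefix}_{name}"


# Hypothetical vendor and key names.
registry = KeyRegistry()
registry.register("ACME", "Acme Corp")
extension_key = registry.qualified("ACME", "ReviewStatus")

annotation = {"Type": "Annot", "Rect": [0, 0, 10, 10], extension_key: "approved"}
understood, ignored = split_known(annotation, {"Type", "Rect"})
```

A viewer that knows only Type and Rect still displays the
annotation, while the extension key remains in the dictionary for
any program that does understand it.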
The Architecture for Extensions
The third component architecture for final form electronic
document interchange is the extension architecture for consumers of
the electronic documents. Viewing a PDF document is always possible
and the PDF specification defines what viewing means. But, if there
are extensions within the PDF file, there must be a
way to give semantic interpretation to the extension data. In
addition, vendors may want to integrate a PDF consumer application,
such as Acrobat Exchange, with their applications. For example, a
service that provides information on stocks and bonds may want to
seamlessly display well-formatted reports on particular stocks.
This service would like to include the display of PDF documents
with their non-PDF information. Providing semantics both for
extension data and for application integration can be accomplished
using the extension architecture for PDF viewers.
It is reasonable to look at PDF viewers as operating system
extensions. These viewers provide a basic capability to view and
print any PDF document. By extending the view and print application
programming interfaces (APIs), more powerful applications can be
constructed on top of the basic view and print capabilities of a
PDF viewer. These extended applications, called plug-ins, can
access extended data stored in the PDF file, change the viewer's
user interface, extend the functionality of PDF, create new link
and action types, and define limited editing of PDF files.
The client search mechanism shown in Figure 4 was done as a
plug-in to Acrobat Exchange. The search plug-in presumes that
collections of PDF documents have been indexed using Acrobat
Catalog. The plug-in is capable of accessing the resulting indexes,
retrieving selected documents, and highlighting occurrences that
match the search criteria. This plug-in is shipped with Acrobat
Exchange but could be replaced by other vendors with another
mechanism for building indexes and retrieving documents. Hence, PDF
files can be incorporated into many document management
systems.
Deployment and the Future
The process of PDF deployment has already begun. One can find a
variety of documents in PDF form on the World Wide Web, on CD-ROMs,
and from other electronic sources. These documents range from tax
forms from the Internal Revenue Service, to color newspapers,
commercial advertisements and catalogs, product drawings and
specifications, and standard business and legal documents.
Use of PDF is likely to increase as more document producers
understand the technology and learn that it is well adapted to
current document production processes. The greatest barrier to
expansion of consumption is awareness on the part of the consumers.
There are free viewers for PDF files, the Acrobat Readers,
available for most major user platforms (DOS, Sun UNIX, Macintosh
OS, Windows) and more support is coming. These viewers are
available on-line through a variety of services, are embedded in
CD-ROMs, and are distributed on diskette.
At this level, the barrier to deployment is primarily education.
But there are also opportunities to improve the quality of
electronic document interchange. Some examples of these
improvements are better support for navigational aids; support for
other content types, such as audio and video; support for a
structural decomposition of the document, as is done in SGML; and
support for a higher level of document editing.
Current document production processes naturally produce the
final form of the document, but they do not necessarily enable
navigation aids such as hyperlinks and bookmarks/tables of
contents. The document production architecture does provide a
pathway for this information to be passed to the distillation
process and through the print drivers. As producers enable this
pathway in their document production products, it will become
standard to automatically translate the representation of
navigational information in a document production product into the
corresponding PDF representation of navigational aids.
Another direction for future development is the inclusion of
additional content types within a PDF file. (There is already
support for referencing foreign content types stored in separate
files via the hyperlink mechanism.) Some of the obvious content
types that should be included are audio, video, and animation.
There is also a need for orchestrating multiple actions/events when
content is expanded beyond typical pages. Much of the barrier to
inclusion of these other content types is in the lack of standard
formats for these content types. Because PDF is designed to run
across all platforms, there is a particular need for standards that
are capable of being implemented in all environments. For example,
standards that require hardware assists are not as useful as
standards that can be helped by hardware assists but do not require
them.
A final form interchange format guarantees viewability of the
information in a document, but it does not necessarily provide for
reuse or revision of the information. Structural information
representations, such as SGML and HTML, can simplify reuse, but
they do not capture the decisions of human layout designers. Best
would be a format that allowed both views: the final form view for
browsing and reading, and the ability to recover the structural
form for re-purposing, editing, or structure-based content
retrieval. PDF will be extended to allow the formatted content to
be related to the structural information from which it was produced
and to allow that structured information to be retrieved for
further use.
Clearly the final form document contains some of the information
that is needed to edit the document, but it is equally clear that
without extensions to represent structure as well as form the
document may not contain information about how the components of
the final form were created and how they might be changed. Such
simple things as what text was automatically generated, what
elements were grouped together to be treated as a whole, and into
what containers text was flowed need not be represented in the
final form.
The PDF representation was constructed from a set of primitive
building blocks that are also suitable for representing structural
and other information needed for editing. Augmenting the final form
with this kind of information, using these powerful and flexible
building blocks, would allow the final form document format to
offer revisability. As a simple example, one might use PDF to
represent a photo album as a collection of scanned images placed on
pages. A simple editor might be defined that allows these photos to
be reused, say, to make a greeting card by combining text with
images selected from the photo album. The greeting cards thus
constructed could be represented in PDF using extensions that allow
editing of the added text. More complex editing tasks can be
accommodated by capturing more information about the editing
context within the PDF file generated by the editing application.
For some applications, the PDF file might be the only storage
format needed; it would be both revisable and final form.
Conclusion
The business case for final form electronic document interchange
is relatively straightforward. There are significant savings to be
achieved simply by replacing paper distribution with electronic
distribution, whether or not the document is printed at the
receiving site. The key success factor is whether the document can
be consumed once received. Consumption most often means access to
the document's contents in the form in which they were published.
This can be achieved by having a small number of final form
interchange formats (preferably one) and universal distribution of
viewers for these formats. The portable document format (PDF) is a
more than suitable final form interchange format with freely
distributable viewers.
For practical interchange, there must be tools to conveniently
produce the interchange format from existing (and future) document
production processes. The interchange format must be able to be
transmitted through the electronic networks and included on disks,
diskettes, CD-ROMs, and other physical distribution media. It must
be open to allow multiple implementations and to ensure against the
demise of any particular implementation. Finally, it must be
extensible to allow growth with the changing requirements of
information distribution. PDF meets all of these requirements.
PDF provides a universal format for distributing information as
an electronic document. The information can always be viewed and
printed. And, with extensions, it may be edited and integrated with
other information system components.