65
Electronic Document Interchange and Distribution Based on the Portable Document Format, an Open Interchange Format

Stephen N. Zilles and Richard Cohn
Adobe Systems Incorporated

Abstract

The ability to interchange information in electronic form is key to the effectiveness of the national information infrastructure initiative (NII). Our long history with documents, their use, and their management makes the interchange of electronic documents a core component of any information infrastructure. From the very beginning of networking, both local and global, documents have been primary components of the information flow across the networks. However, the interchange of documents has been limited by the lack of common formats that can be created and received anywhere within the global network. Electronic document interchange had been stuck in a typewriter world while printed documents increased in visual sophistication. In addition, much of the work on electronic documents has been on their production and not on their consumption. Yet there are many readers/consumers for every author/producer. Only recently have we seen the emergence of formats that are portable, independent of computer platform, and capable of representing essentially all visually rich documents.

Electronic document production can be viewed as having two steps: the creation of the content of the document and the composition and formatting of the content into a final form presentation. Electronic interchange may take place after either step. In this paper we present the requirements for interchange of final form documents and describe an open document format, the portable document format (PDF), that meets those requirements. There are a number of reasons why it is important to be able to interchange formatted documents. There are legal requirements to be able to reference particular lines of a document. There is a legacy of printed documents that can be converted to electronic form. There are important design decisions that go into the presentation of the content that can be captured only in the final form. The portable document format is designed to faithfully represent any document, including documents with typographic text, tabular data, pictorial images, artwork, and figures. In addition, it extends the visual presentation with electronic aids such as annotation capabilities, hypertext links, electronic tables of contents, and full word search indexes. Finally, PDF is extensible and will interwork with formats for electronic interchange of the document content, such as the HyperText Markup Language (HTML) used in the World Wide Web.

Background

A solution to the problem of electronic document interchange must serve all the steps of document usage. The solution must facilitate the production of electronic documents, and, what is more important, it must facilitate the consumption of these documents. Here, consumption includes viewing, reading, printing, reusing, and annotating the documents. There are far more readers than authors for most documents. Serving consumers has a much bigger economic impact than does serving authors. Replacing paper distribution with electronic distribution increases timeliness, reduces use of natural resources, and produces greater efficiency and productivity. It also allows the power of the computer to be applied to aiding the consumption of the document; for example, hyperlinks to other documents and searches for words and phrases can greatly facilitate finding the documents and portions thereof that interest the consumer.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



From a consumption point of view, the critical need in electronic document interchange is the ability to view, print, or read the document everywhere that someone has access to it. If the document can also be revised or edited then that is so much the better, but it is not required for use of the document.

Final Form and Revisable Form Interchange

There are two different ways to produce interchangeable electronic documents. Although there are many steps in the production of a visually rich document, the process of composition and layout partitions production into two parts. Composition and layout is the process that takes a representation of the content of a document and places that content onto a two-dimensional space (or sequence of two-dimensional spaces), usually called pages. In the process of composition and layout, a number of decisions, called formatting decisions, are made: which fonts in which sizes and weights are used for which pieces of content, where on the page the content is placed, whether there are added content fragments such as headers and footers, and so on. These formatting decisions may be made automatically based on rules provided to the composition and layout process; they may be made by a human designer interacting with the composition and layout process; or they may be made using a combination of these two approaches.

The representation of the content before composition and layout is called revisable form. Revision is relatively easy because the formatting decisions have not been made or have only been tentatively made and can be revised when changes occur. The representation of the content after composition and layout is called final form. The interchange of electronic documents can be done either in revisable form or in final form.

If the revisable form is interchanged, then either the formatting decisions must be made entirely by the consumer of the electronic document or the rules for making the formatting decisions must be interchanged with the revisable form electronic document. If the first approach is chosen, then there is no way to guarantee how the document will appear to the consumer. Even if the second approach is chosen, existing formatting languages do not guarantee identical final form output when given the same revisable form input. Some formatting decisions are always left to the consumer's composition and layout software. Therefore, different composition and layout processes may produce different final form output.

The interchange of revisable form electronic documents can meet many authors' needs. Both the Standard Generalized Markup Language (SGML) and the HyperText Markup Language (HTML) are successfully used to interchange significant and interesting documents. But there are cases where these formats are not sufficient to meet the needs of the author. For these cases, interchange of final form electronic documents is necessary.

Requirement for Page Fidelity

The key problem with interchanging only revisable form documents is the inability to guarantee page fidelity. Page fidelity means that a given revisable form document will always produce the same final form output no matter where it is processed. There are a number of reasons why page fidelity is required.

The most obvious reason is that the composition and layout process involves a human designer's decisions. Only in the final form is it possible to capture these decisions. These formatting decisions are important in the presentation of the represented information. This is quite obvious in advertisements, where design plays an important role in the effectiveness of communicating the intended message. It is perhaps less obvious but equally important in the design of other information presentations.
For example, the placement of text in association with graphical structures, such as in a map of the Washington, D.C., subway system (Figure 1), will greatly affect whether the presentation can be understood. In addition, formatting rules may not adequately describe complex information presentations, such as mathematics or complex tables, which may need to be hand designed for effective communication. (Figure 2 has simple examples of mathematics and tables.) Finally, the composition and layout design may reflect the artistic expression of the designer, making the document more pleasing to read (Figure 3).

Figure 1 Metro system map. SOURCE: Courtesy of Washington Metropolitan Area Transit Authority.

The rich tradition of printed documents has established the practice of using page and line numbers to reference portions of documents, for legal and other reasons. These references will work for electronic documents only if page fidelity can be guaranteed. Many governmental bodies have built these references into their procedures and require that they be preserved, in electronic as well as in paper documents.

The final set of cases does not require page fidelity; they are just more simply handled with a final form representation than a revisable one. Documents that exist only in paper form, legacy documents, can be scanned and their content recognized to produce an electronic document. Although this recognized content could be represented in many forms, it is simplest when the content is represented in final form. Then it is not necessary to decide, for each piece of recognized content, what part of the original document content, such as body text, header, or footer, it belonged to. Since the final form is preserved, the reader of the document can correctly perform the recognition function. (See Figure 2 for an example of a legacy page that would be difficult to categorize correctly.)

Finally, preparing a document as an SGML or HTML document typically involves a substantial amount of markup of sections of content to ensure that a rule-driven composition and layout process will produce the intended effect. For documents that are of transient existence, it may be far simpler to produce the composition and layout by hand and to interchange the final form representation than to spend time tuning the document for a rule-driven process.

Figure 2

Requirements for a Final Form Document Interchange Format

For a final form representation to be a suitable electronic document interchange format for the NII, it should meet a number of requirements:

• It should have an open, published specification with sufficient detail to allow multiple implementations. There are many definitions of openness, but the key component of them all is that independent implementations of the specification are possible. This gives the users of the format some guarantee of there being reasonable cost products that support the format. With multiple implementations, the question of interoperability is important. This can be facilitated, although not guaranteed, by establishing conformance test suites for the open specification.
• It should provide page fidelity. This is a complex requirement. For text, this means representing the fonts, sizes, weights, spacing, alignment, leading, and so on that are used in the composition and layout of the original document. For graphics, this means representing the shapes, the lines and curves, whether they are filled or stroked, any shading or gradients, and all the scaling, rotations, and positioning of the graphic elements. For images, this means doing automatic conversion from the resolution of the sample space of the image to the resolution of the device, representing both binary and continuous tone images, with or without compression of the image data. For all of the above, the shapes and colors must be preserved where defined in device- and resolution-independent terms.
• It should provide a representation for electronic navigation. The format should be able to express hyperlinks within and among documents. These links should be able to refer to documents in formats other than this final form format, such as HTML documents, videos, or animations. The format should be able to represent a table of contents or index using links into the appropriate places in the document. The format should also allow searches for words or phrases within the document and positioning at successive hits.
• It should be resource independent. The ability to view an electronic document should not depend on the resources available where it is viewed. There are two parts to this requirement. A standard set of resources, such as a set of standard fonts, can be defined and required at every site of use. These resources need not be transmitted with the document. For resources, such as fonts, that are not in the standard set, there must be provision for inclusion of the resource in the document.
• It should provide access to the content of the document for the visually disabled. This means, at a minimum, being able to provide ASCII text that is marked up with tags that are compliant with the International Standard ISO-12083 (ICADD DTD). The ICADD (International Committee for Accessible Document Design) markup must include the structural and navigational information that is present in the document.
• It should be possible to create electronic documents in this format using a wide range of document generation processes. Ideally, any application that can generate final form output should be usable to generate an electronic document in this format. There should also be a means for paper documents to be converted into this format.
• It should be platform independent; that is, the format should be independent of the choice of hardware and operating systems, and transportable from one platform to any other platform. It should also be independent of the authoring application; the authoring application should not be required to view the electronic document.
• It should effectively use storage. In particular, it should use relevant, data type-dependent compression techniques.
• It should scale well. Applications using the format should perform nearly as well on huge (20,000-page) documents as they do on small ones; the performance on a large document or a (large) collection of smaller documents should be similar. This requirement implies that any component of the electronic document be randomly accessible.
• It should integrate with other NII technologies. It should be able to represent hyperlinks to other documents using the Universal Resource Identifiers (URIs) defined by the World Wide Web (WWW) and it should be identifiable by a URI. It should be possible to encrypt an electronic document both for privacy and for authentication.
• It should be possible to revise and reuse the documents represented in the format. Editing is a broad notion that begins with the ability to add annotations, electronic "sticky notes," to documents. At the next level, preplanned replacement of objects, such as editing fields on a form, might be allowed. Above that, one might allow replacement, extraction and/or deletion of whole pages, and, finally, arbitrary revision of the content of pages.
• It should be extensible, meaning that new data types and information can be added to the representation without breaking previous consumer applications that accept the format. Examples of extensions would be adding a new data object for sound annotations or adding information that would allow editing of some object.

Figure 3

These requirements might be satisfied in a number of different ways. We describe below a particular format, the portable document format (PDF), which has been developed by Adobe Systems Inc., and the architectures that make interchange of electronic documents practical.

An Architectural Basis for Interchange

There are three basic architectures that facilitate interchange of electronic documents: (1) the architecture for document preparation, (2) the architecture of the document representation, and (3) the architecture for extension. These three architectures are illustrated in Figure 4. The architecture for document preparation is shown on the left-hand side of the figure and encompasses both electronic and paper document preparation processes. The right-hand side of the figure shows consumption of prepared documents. The portable document format (PDF) is the architecture for document representation and is the link between these components. The right-hand side of the figure shows consumption at two levels. There is an optional cataloguing step that builds full text indexes for one or a collection of documents. Above this, viewing and printing PDF documents are shown. The architecture for extension is indicated by the "search" plug-in, which allows access to indexes built by the cataloguing process.

The Architecture for Document Preparation

To be effective, any system for electronic document interchange must be able to capture documents in all the forms in which they are generated. This is facilitated by the recent shift to electronic preparation of documents, but it must include a pathway for paper documents as well.

Figure 4

Unlike the wide range of forms that revisable documents can take, there are relatively few final form formats in use today. (This is another reason that it makes sense to have a final form interchange format.) Since, historically, final forms were prepared for printing, one can capture documents in final form by replacing or using components of the printing architectures of the various operating systems. Two such approaches have been used with PDF: (1) in operating systems with selectable print drivers, adding a print driver that generates PDF and (2) translating existing visually rich print formats to PDF. Both these pathways are shown in the upper left corner of Figure 4. PDFWriter is the replacement print driver and the Acrobat Distiller translates PostScript (PS) language files into PDF.

Some operating systems, such as the Mac OS and Windows, have a standard interface, the GUI (graphical user interface), which can be used by any application both to display information on the screen and to print what is displayed. By replacing the print driver that lies beneath the GUI it is possible to capture the document that would have been printed and to convert it into the electronic document interchange format.

For operating systems without a GUI interface to printing and for applications that choose to generate their print format directly, the PostScript language is the industry standard for describing visually rich final form documents. Therefore, the electronic document interchange format can be created by translating, or "distilling," PostScript language files. This distillation process converts the print description into a form more suitable for viewing and electronic navigation. The PostScript language has been extended, for distillation, to allow information on navigation to be included with the print description, allowing the distillation process to automatically generate navigational aids.

The above two approaches to creation of PDF documents work with electronic preparation processes. But there is also an archive of legacy documents that were never prepared electronically or are not now available in electronic form. For these documents there is a third pathway to PDF, shown in the lower left corner of Figure 4. Paper documents can be scanned and converted to raster images. These images are then fed to a recognition program, Acrobat Capture, that identifies the textual and nontextual parts of the document. The textual parts are converted to coded text with appropriate font information, including font name, weight, size, and posture. The nontextual parts remain as images. This process produces an electronic representation of the paper document that has the same final form as the original and is much smaller than the scanned version. Because the paper document is a final form, the same final form format can be used, without loss of fidelity, for paper documents and for the electronically generated documents.

Page fidelity is important. The current state of recognition technology, though very good, is not infallible; there are always some characters that cannot be identified with a high level of confidence. Because the PDF format allows text and image data to be freely intermixed, characters or text fragments whose recognition confidence falls below an adjustable level can be placed in the PDF document as images. These images can then be read by a human reader even if a mechanical reader could not interpret them. (Figure 2 shows a document captured by this process.)

The Architecture of the Document Representation

There is more to the architecture of the document representation than meeting the above requirements for a final form representation. Architectures need to be robust and flexible if they are to be useful over a continuing span of years. PDF has such an architecture.

The PDF architecture certainly meets these requirements, as will be clear below. Most importantly, PDF has an open specification: the Portable Document Format Reference Manual (ISBN 0-201-62628-4) has been published for several years, and implementations have been produced by several vendors. PDF also goes beyond the final form requirements. For example, the content of PDF files can be randomly accessed and the files themselves can be generated in a single pass through the document being converted to PDF form. In addition, incremental changes to a PDF file require only incremental additions to the file rather than a complete rewrite of the file. These are aspects of PDF that are important with respect to the efficiency of the generation and viewing processes.

The PDF file format is based on long experience both with a practical document interchange format and with applications that were constructed on top of that format. Adobe Illustrator is a graphics design program whose intermediate file format is based on the PostScript language. By making the intermediate format also be a print format, the output of Adobe Illustrator could easily be imported into other applications because they could print the objects without having to interpret the semantics. In addition, because the underlying semantics were published, these objects could be read by other applications when required. The lessons learned in the development of Adobe Illustrator went into the design of PDF.

PDF, like the Adobe Illustrator file format, is based on the PostScript language. PDF uses the PostScript language imaging model, which has proven itself over 12 years of experience as being capable of faithfully representing visually rich documents. Yet the PostScript file format was designed for printing, not for interactive access. To improve system performance for interactive access, PDF has a restructured and simplified description language. Experience with the PostScript language has shown that, although having a full programming language capability is useful, a properly chosen set of high-level combinations of the PostScript language primitives can be used to describe most, if not all, final form pages. Therefore, PDF has a fixed vocabulary of high-level operators that can be more efficiently implemented than arbitrary combinations of the lower-level primitives.

The User Model for PDF Documents

The user sees a PDF document as a collection of pages. Each page has a content portion that represents the final form of that page and a number of virtual overlays that augment the page in various ways. For example, there are overlay layers for annotations, such as electronic sticky notes, voice annotations, and the like. There are overlay layers for hyperlinks to other parts of the same document or hyperlinks to other documents and other kinds of objects, such as video segments or animations. There is an overlay layer that identifies the threading of the content of articles from page to page and from column to column.
Each of the overlay layers is associated with the content portion geometrically. Each overlay object has an associated rectangle that encompasses the portion of content associated with the object. Each of the layers is independent of the others. This allows information in one layer to be extracted, replaced, or imported without affecting the other layers. This facilitates exporting annotations made on multiple copies of a document sent out for review and then reimporting all the review annotations into a single document for responding to the reviewers' comments. This also makes it possible to define hyperlinks and threads on the layout of a document that only has the text portion present and then to replace the text-only pages with pages that include the figures and images to create the finished document.

In addition to the page-oriented navigational layers, there are two document-level navigation aids. There is a set of bookmark or outline objects that allow a table of contents or index to be defined into the set of pages. Each bookmark is a link to a particular destination in the document. A destination specifies the target page and the area on that page that is the target for display. Destinations can be specified directly or named and referred to by name. Using named destinations, especially for links to other documents, allows the other documents to be revised without invalidating the destination reference. Finally, associated with each page is an optional thumbnail image of the page content. These thumbnails can be arrayed in sequence in a slide sorter array and can be used both to navigate among pages and to reorder, move, delete, and insert pages within and among documents.

The Abstract Model of a PDF Document: A Tree

Abstractly, the PDF document is represented as a series of trees. A primary tree represents the set of pages and secondary trees represent the document-level objects described in the user model. Each page is itself a small tree with a branch for the representation of the page content; a branch for the resources, such as fonts and images used on the page; a branch for the annotations and links defined on the page; and a branch for the optional thumbnail image. The page content is represented as a sequence of high-level PostScript language imaging model operators.
The resources used are represented as references to resource objects that can be shared among pages. There is an array of annotation and link objects.

The Representation of the Abstract Tree

The abstract document tree is represented in terms of the primitive building blocks of the PostScript language. There are five simple objects and three complex objects. The simple objects are the null object (which is a placeholder), the Boolean object (which is either true or false), the number object (which is an integer or fixed point), the string object (which has between 0 and 65535 octets), and the name object (which is a read-only string). The three complex objects are arrays, dictionaries, and streams. Arrays are sequences of 0 to 65535 objects that may be of mixed type and may include other arrays. Dictionaries are sets of up to 65535 key-value pairs where the key is a name and the value is any object. Streams are composed of a dictionary and an arbitrary sequence of octets. The dictionary allows the content of the streams to be encoded and/or compressed to improve space efficiency.

Encoding algorithms are used to limit the range of octets that appear in the representation of the stream. Those defined in PDF are ASCIIHex (each hex digit is represented as an octet) and ASCII85 (each four octets of the stream are represented as five octets). These both produce octet strings restricted to the 7-bit ASCII graphic character space. Compression algorithms are used to reduce storage requirements. Those defined in PDF are LZW (licensed from Unisys), run length, CCITT Group 3 and Group 4 FAX, and DCT (JPEG).

The terminal nodes of the abstract tree are represented by simple objects and streams. The nonterminal nodes are represented by arrays and dictionaries. The branches (arcs) of the tree are represented in one of two ways. The simplest way is that the referenced object is directly present in the nonterminal node object. This is called a direct object.
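As a rough illustration of the two encodings described above, the following Python sketch uses the standard library's hex and Ascii85 routines. They are assumed here to be close enough to the PDF filters to show the size ratios; PDF-specific framing, such as end-of-data markers, is not reproduced, and the sample octets and the helper name `is_7bit_graphic` are arbitrary choices made for this example.

```python
import base64
import binascii

# Eight arbitrary octets, including values outside the 7-bit range.
data = bytes([0x25, 0x50, 0x44, 0x46, 0x80, 0xFF, 0x00, 0x10])

# Hex encoding: every input octet becomes two hex digits, one octet each.
hex_encoded = binascii.hexlify(data)

# Ascii85 encoding: every group of four input octets becomes five octets.
a85_encoded = base64.a85encode(data)


def is_7bit_graphic(encoded: bytes) -> bool:
    """True if every octet is a printable, non-space 7-bit ASCII character."""
    return all(0x21 <= b <= 0x7E for b in encoded)


print(len(data), len(hex_encoded), len(a85_encoded))
print(is_7bit_graphic(hex_encoded), is_7bit_graphic(a85_encoded))
print(base64.a85decode(a85_encoded) == data)
```

The hex form doubles the size, while the Ascii85 form grows it by only a quarter, which suggests why a format would offer both: one is trivially simple, the other more compact.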
The second form of branch is an indirect object reference. Objects can be made into indirect objects by giving the (direct) object an object number and a generation number. These indirect objects can then be referenced by using the object number and generation number in place of the occurrence of the direct object. Indirect objects and indirect object references allow objects to be shared. For example, a font used on several pages need only be stored once in the document. They also allow the values of certain keys, such as the
length of a stream, to be deferred until the value is known. This property is needed to allow PDF to be produced in one pass through the input to the PDF generation process.

The PDF File Structure

Indirect objects and indirect object references do not, by themselves, provide direct access to the objects. This problem is solved by the PDF file structure. There are four parts to the file structure. The first part is a header, which identifies the file as being a PDF file and indicates the version of PDF being used in the file. The second part is the body, which is a sequence of indirect objects. The third part is the cross-reference table. This table is a directory that maps object numbers to offsets in the (body of the) file structure. This allows direct access to the indirectly referenced objects.

The final part is the trailer, which serves several purposes. It is the last thing in the file and it has the offset of the corresponding cross-reference table. It also has a dictionary object. This dictionary gives the size of the cross-reference table. It indicates which indirect object is the root of the document tree. It indicates which object is the "info dictionary," a set of keys that allow attributes to be associated with the document. These keys include such information as author, creation date, etc. Finally, the trailer dictionary can have an ID key whose value has two parts. Both parts are typically hash values computed from parts of the document and key information about the document. The first hash is created when the document is first stored; it is never modified after that. The second is changed whenever the document is stored. By storing these IDs with file specifications referencing the document, one can more accurately determine that the document retrieved via a given file specification is the document that is desired.

The trailer is structured to allow PDF files to be incrementally updated.
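The four-part layout can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a conforming PDF writer: `build_pdf` and `find_object` are hypothetical helper names, object numbers are assumed to run contiguously from 1, every generation number is 0, and the info dictionary and ID key are omitted.

```python
def build_pdf(objects, root):
    """Assemble header, body, cross-reference table, and trailer.

    objects maps object numbers (assumed to be 1..n) to the already
    serialized text of each object; root is the number of the root object.
    """
    assert sorted(objects) == list(range(1, len(objects) + 1))
    out = bytearray(b"%PDF-1.1\n")                 # header: file type and version
    offsets = {}
    for num in sorted(objects):                    # body: a sequence of indirect objects
        offsets[num] = len(out)
        out += f"{num} 0 obj\n{objects[num]}\nendobj\n".encode("ascii")
    xref_pos = len(out)                            # cross-reference table starts here
    size = len(objects) + 1
    out += f"xref\n0 {size}\n".encode("ascii")
    out += b"0000000000 65535 f \n"                # entry 0 heads the free list
    for num in range(1, size):
        out += f"{offsets[num]:010d} 00000 n \n".encode("ascii")
    out += f"trailer\n<< /Size {size} /Root {root} 0 R >>\n".encode("ascii")
    out += f"startxref\n{xref_pos}\n%%EOF\n".encode("ascii")
    return bytes(out)


def find_object(pdf, num):
    """Go from the end of the file to one object via the cross-reference table."""
    tail = pdf[pdf.rindex(b"startxref"):].split()
    xref_pos = int(tail[1])                        # trailer gives the table offset
    lines = pdf[xref_pos:].split(b"\n")
    first, count = map(int, lines[1].split())      # subsection header, e.g. "0 3"
    if not first <= num < first + count:
        raise KeyError(num)
    offset, _generation, kind = lines[2 + num - first].split()
    if kind != b"n":
        raise KeyError(num)                        # free entry: no such object
    start = int(offset)
    end = pdf.index(b"endobj", start) + len(b"endobj")
    return pdf[start:end]


objs = {
    1: "<< /Type /Catalog /Pages 2 0 R >>",
    2: "<< /Type /Pages /Kids [] /Count 0 >>",
}
pdf = build_pdf(objs, root=1)
print(find_object(pdf, 2).decode("ascii"))
```

`find_object` never scans the body. It reads the table offset from the end of the file and jumps straight to the requested object, which is the direct access the cross-reference table is meant to provide.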
Incremental updating allows PDF files to be edited, say by deleting some pages or adding links or annotations, without having to rewrite the entire file. For large files, this can be a significant saving. This is accomplished by adding any new indirect objects after the existing final trailer and appending a new cross-reference table and trailer to the end of the file. The new cross-reference table provides access to the new objects and hides any deleted objects. This mechanism also provides a form of "undo" capability: if the end of the file is moved back to the last byte of the previous trailer, all changes made since that trailer was written are removed.

The purpose of the generation numbers in the indirect object definition and reference is to allow reuse of table entries in the cross-reference table when objects are deleted. This keeps the cross-reference table from growing arbitrarily large. Any indirect object reference is looked up in the endmost cross-reference table in the document. If the generation number in that cross-reference table does not match the generation number in the indirect reference, then the referenced object no longer exists, the reference is bad, and an error is reported. Deleted or unused entries in the cross-reference table are threaded on a list of free entries.

Resources

The general form and representation of a PDF file have been outlined. There are, however, several areas that need further detail. The page content representation is designed to refer to a collection of resources external to the pages. These include the representations of color spaces, fonts, images, and shareable content fragments. For device-independent color spaces, the color space resource contains the information needed to map colors in that color space to the standard CIE 1931 XYZ color space and thereby ensure accurate reproduction across a range of devices. Images are represented as arrays of sample values that come from a specified color space and may be compressed.
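To make the idea of mapping a device-independent color space to CIE XYZ concrete, here is a small Python sketch of one common form of such a mapping: a per-component gamma followed by a 3 x 3 matrix. The function name and the numeric values (a gamma of 2.2 and the widely quoted matrix for Rec. 709 primaries with a D65 white point) are assumptions chosen for illustration; the paper does not specify them.

```python
def calibrated_rgb_to_xyz(rgb, gamma, matrix):
    """Map calibrated RGB components in [0, 1] to CIE XYZ tristimulus values."""
    # Step 1: undo the per-component gamma to obtain linear intensities.
    linear = [component ** g for component, g in zip(rgb, gamma)]
    # Step 2: combine the linear components with a 3 x 3 matrix.
    return tuple(
        sum(matrix[row][col] * linear[col] for col in range(3))
        for row in range(3)
    )


# Illustrative parameters only (assumed, not taken from the paper).
GAMMA = (2.2, 2.2, 2.2)
MATRIX = (
    (0.4124, 0.3576, 0.1805),   # contributions to X
    (0.2126, 0.7152, 0.0722),   # contributions to Y (luminance)
    (0.0193, 0.1192, 0.9505),   # contributions to Z
)

white = calibrated_rgb_to_xyz((1.0, 1.0, 1.0), GAMMA, MATRIX)
mid_gray = calibrated_rgb_to_xyz((0.5, 0.5, 0.5), GAMMA, MATRIX)
print(white, mid_gray)
```

Because the gamma and matrix travel with the document as data, a consumer can reproduce the intended color without knowing anything about the device on which the document was authored.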
Page content fragments are represented as content language subroutines that can be referred to from content. For example, a corporate logo might be used on many pages, but the content operators that draw the logo need be stored only once as a resource. Typically, however, the resources that are most critical for ensured reproduction are the font resources. The correct fonts are needed to be able to faithfully reproduce the text as it was published. PDF has a three-level approach to font resources. First, there is a set of 13 fonts (12 textual fonts and 1 symbol font) that must be
available to the viewer. These fonts can be assumed to exist at any consumer of a PDF document. For other fonts, there are two solutions. The fonts may be embedded within the document or substitutions may be made for the fonts if they are not available on the consumer's platform. Fonts that are embedded may be either Adobe Type 1 fonts or TrueType fonts and may be the full font or a subset of the font sufficient to display the document in which they are embedded.

The font architecture divides the font representation into three separate parts: the font dictionary, the font encoding, and the font descriptor. The font dictionary represents the font and may refer to a font encoding and/or a font descriptor. The font encoding maps octet values that occur in a string into the names of the glyphs in a font. The font descriptor has the metrics of the font, including the width and height of glyphs and attributes such as the weight of stems, whether it is italic, the height of lower-case letters, and so on. The font shape data, if included, are part of the font descriptor. If the font shape data are not included, then the other information in the font descriptor can be used to provide substitute fonts. Substitute fonts work for textual fonts and replace the expected glyphs with glyphs that have the same width, height, and weight as the original glyphs. If page fidelity is required, then the font shape data should be embedded; but font substitution can be used to reduce document size where the omitted fonts are either expected at the consumer's location or font substitution is adequate for reading the document.

Hyperlinks

The hyperlink mechanism has two parts: the specification of where the link is and the specification of where the link goes. The first specification is given as a rectangular area defined on the page content. The second specification is called an action. There are a number of different action types.
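The two-part structure of a hyperlink can be modeled in a few lines of Python. This is only a sketch: the `Link` class, the `link_at` function, and the dictionary used to describe an action are hypothetical names and shapes invented for this example, not part of the PDF representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """A hyperlink: where the link is (a rectangle) and where it goes (an action)."""
    rect: tuple    # (left, bottom, right, top) in page coordinates
    action: dict   # hypothetical description of what following the link does

    def contains(self, x: float, y: float) -> bool:
        left, bottom, right, top = self.rect
        return left <= x <= right and bottom <= y <= top


def link_at(links, x: float, y: float) -> Optional[Link]:
    """Return the first link whose rectangle contains the point, if any."""
    for link in links:
        if link.contains(x, y):
            return link
    return None


page_links = [
    Link(rect=(72, 700, 200, 720), action={"kind": "goto", "page": 12}),
    Link(rect=(72, 650, 200, 670), action={"kind": "goto", "page": 30}),
]

hit = link_at(page_links, 100, 710)
print(hit.action if hit else "no link here")
```

Keeping the rectangle separate from the action means that new action types can be introduced without changing how a viewer decides which link the reader selected.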
The simplest action is moving to a different destination within the same document. A more complex action is moving to a destination in another PDF document. The destination may be a position in the document, a named destination, or the beginning of an article thread. Instead of making another PDF document available for viewing, the external reference may launch an application on a particular file, such as a fragment of sound or a video. All these external references use platform-independent file names, which may be relative to the file that contains the reference, to refer to external entities.

The URL (Uniform Resource Locator), as defined for the World Wide Web, is another form of allowed reference to an external document. A URL identifies a file (or part thereof) that may be anywhere in the electronically reachable world. When the URL is followed, the object retrieved is typed and then a program that can process that type is invoked to display the object. Any type for which there is a viewing program, including PDF, can thereby be displayed.

Extensibility

PDF is particularly extensible. It is constructed from simple building blocks; the PDF file is a tree constructed from leaves that are simple data types or streams and with arrays and dictionaries as the nonterminal nodes. In general, additional keys can be added to dictionaries without affecting viewers that do not understand the new keys. These additional keys may be used to add information needed to control and/or represent new content object types and to define editing on existing objects. Because of the flexibility of the extension mechanism, a registry has been defined to help avoid key name conflicts that might arise when several extensions are simultaneously present.

The Architecture for Extensions

The third component architecture for final form electronic document interchange is the extension architecture for consumers of the electronic documents.
Viewing a PDF document is always possible and the PDF specification defines what viewing means. But, if there are extensions within the PDF file, there must be a
way to give semantic interpretation to the extension data. In addition, vendors may want to integrate a PDF consumer application, such as Acrobat Exchange, with their applications. For example, a service that provides information on stocks and bonds may want to seamlessly display well-formatted reports on particular stocks. This service would like to include the display of PDF documents with their non-PDF information. Providing semantics both for extension data and for application integration can be accomplished using the extension architecture for PDF viewers.

It is reasonable to look at PDF viewers as operating system extensions. These viewers provide a basic capability to view and print any PDF document. By extending the view and print application programming interfaces (APIs), more powerful applications can be constructed on top of the basic view and print capabilities of a PDF viewer. These extended applications, called plug-ins, can access extended data stored in the PDF file, change the viewer's user interface, extend the functionality of PDF, create new link and action types, and define limited editing of PDF files.

The client search mechanism shown in Figure 4 was done as a plug-in to Acrobat Exchange. The search plug-in presumes that collections of PDF documents have been indexed using Acrobat Catalog. The plug-in is capable of accessing the resulting indexes, retrieving selected documents, and highlighting occurrences that match the search criteria. This plug-in is shipped with Acrobat Exchange but could be replaced by other vendors with another mechanism for building indexes and retrieving documents. Hence, PDF files can be incorporated into many document management systems.

Deployment and the Future

The process of PDF deployment has already begun. One can find a variety of documents in PDF form on the World Wide Web, on CD-ROMs, and from other electronic sources.
These documents range from tax forms from the Internal Revenue Service, to color newspapers, commercial advertisements and catalogs, product drawings and specifications, and standard business and legal documents. Use of PDF is likely to increase as more document producers understand the technology and learn that it is well adapted to current document production processes. The greatest barrier to expansion of consumption is awareness on the part of the consumers. There are free viewers for PDF files, the Acrobat Readers, available for most major user platforms (DOS, Sun UNIX, Macintosh OS, Windows) and more support is coming. These viewers are available on-line through a variety of services, are embedded in CD-ROMs, and are distributed on diskette. At this level, the barrier to deployment is primarily education. But there are also opportunities to improve the quality of electronic document interchange. Some examples of these improvements are better support for navigational aids; support for other content types, such as audio and video; support for a structural decomposition of the document, as is done in SGML; and support for a higher level of document editing. Current document production processes naturally produce the final form of the document, but they do not necessarily enable navigation aids such as hyperlinks and bookmarks/tables of contents. The document production architecture does provide a pathway for this information to be passed to the distillation process and through the print drivers. As producers enable this pathway in their document production products, it will become standard to automatically translate the representation of navigational information in a document production product into the corresponding PDF representation of navigational aids. Another direction for future development is the inclusion of additional content types within a PDF file. 
(There is already support for referencing foreign content types stored in separate files via the hyperlink mechanism.) Some of the obvious content types that should be included are audio, video, and animation. There is also a need for orchestrating multiple actions/events when content is expanded beyond typical pages. Much of the barrier to inclusion of these other content types is in the lack of standard formats for these content types. Because PDF is designed to run across all platforms, there is a particular need for standards that are capable of being implemented in all environments. For example, standards that require hardware assists are not as useful as standards that can be helped by hardware assists but do not require them.
A final form interchange format guarantees viewability of the information in a document, but it does not necessarily provide for reuse or revision of the information. Structural information representations, such as SGML and HTML, can simplify reuse, but they do not capture the decisions of human layout designers. Best would be a format that allowed both views: the final form view for browsing and reading, and the ability to recover the structural form for re-purposing, editing, or structure-based content retrieval. PDF will be extended to allow the formatted content to be related to the structural information from which it was produced and to allow that structured information to be retrieved for further use.

Clearly the final form document contains some of the information that is needed to edit the document, but it is equally clear that without extensions to represent structure as well as form the document may not contain information about how the components of the final form were created and how they might be changed. Such simple things as what text was automatically generated, what elements were grouped together to be treated as a whole, and into what containers text was flowed need not be represented in the final form. The PDF representation was constructed from a set of primitive building blocks that are also suitable for representing structural and other information needed for editing. Augmenting the final form with this kind of information, using these powerful and flexible building blocks, would allow the final form document format to offer revisability.

As a simple example, one might use PDF to represent a photo album as a collection of scanned images placed on pages. A simple editor might be defined that allows these photos to be reused, say, to make a greeting card by combining text with images selected from the photo album. The greeting cards thus constructed could be represented in PDF using extensions that allow editing of the added text.
More complex editing tasks can be accommodated by capturing more information about the editing context within the PDF file generated by the editing application. For some applications, the PDF file might be the only storage format needed; it would be both revisable and final form.

Conclusion

The business case for final form electronic document interchange is relatively straightforward. There are significant savings to be achieved simply by replacing paper distribution with electronic distribution, whether or not the document is printed at the receiving site. The key success factor is whether the document can be consumed once received. Consumption most often means access to the document's contents in the form in which they were published. This can be achieved by having a small number of final form interchange formats (preferably one) and universal distribution of viewers for these formats. The portable document format (PDF) is a more than suitable final form interchange format with freely distributable viewers.

For practical interchange, there must be tools to conveniently produce the interchange format from existing (and future) document production processes. The interchange format must be able to be transmitted through the electronic networks and included on disks, diskettes, CD-ROMs, and other physical distribution media. It must be open to allow multiple implementations and to ensure against the demise of any particular implementation. Finally, it must be extensible to allow growth with the changing requirements of information distribution. All of these requirements are met by PDF.

PDF provides a universal format for distributing information as an electronic document. The information can always be viewed and printed. And, with extensions, it may be edited and integrated with other information system components.