8
Research Behind Everyday Computation

Computer science research has led to the emergence of entirely new capabilities used by millions of ordinary people today. Using computers to typeset text was probably foreseeable, but the ease with which anyone can become a small-scale publisher by using the Web was harder to envision. Automating accountants’ ledger paper was perhaps natural, but the ability to disintermediate the financial-analysis function—moving control from a central data-processing department to a spreadsheet on an executive’s desk—could not have been. Creating an index of documents stored at various Internet locations so that they could be shared by a community of physics researchers was, perhaps, not astounding, but leveraging the power of distributed access, the universal name-space of the uniform resource locator (URL), and easily usable browsers together with new algorithms for searching has led to an explosion of information-retrieval opportunities that has changed forever the way we do business.

Computers now perform most of the document-preparation tasks that used to be handled by secretaries: typing drafts, correcting drafts, positioning figures, formatting final copy, and the like. Ullman traces the 30-year evolution of text-formatting programs, from the quick-and-dirty typesetters that graduate students created to avoid typing their theses to the powerful document formatters of the present day.

Similarly, Foley shows how the spreadsheet, which began as a program to help accountants manage ledger sheets, astonished the



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Computer Science: Reflections on the Field, Reflections from the Field

computer-science world by becoming the “killer application” that transferred effective control of computing to non-programmers in all disciplines.

One of the principal personal uses of computers today is searching the Internet’s wealth of online information. The idea of information retrieval is far from new, but the ability to dedicate computing power to amplifying human effort has led to new wrinkles, such as attempts to establish the relevance of information based on the number of people referring to it. Norvig explains the technology behind modern Internet searches.
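The idea of ranking by references can be made concrete with a tiny sketch. The following is my own simplified illustration of link-based ranking in the spirit of PageRank, not any search engine’s production algorithm; the function name, damping factor, and toy link graph are all invented for the example:

```python
# Toy illustration of link-based relevance ranking (in the spirit of
# PageRank; not any search engine's actual algorithm). A page's score
# depends on how many pages refer to it, weighted by their own scores.

def rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it refers to."""
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                # A reference passes along a share of the referrer's score.
                new[target] += damping * score[page] / len(outgoing)
        score = new
    return score

# Three pages: A and C both refer to B, so B ranks highest.
scores = rank({"A": ["B"], "B": ["C"], "C": ["B"]})
print(max(scores, key=scores.get))
```

Iterating the score-passing step until it settles captures the intuition that a reference from a well-referenced page counts for more than a reference from an obscure one.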

HOW YOU GOT MICROSOFT WORD
Jeffrey Ullman, Stanford University and Gradience Corporation

Those born after about 1975 cannot imagine the cumbersome process by which formal documents, ranging from business letters to books, were once produced. Drafts were hand-written and then typed by a secretary using a typewriter. If corrections were needed, small errors could be handled by erasing or by applying whiteout, followed by typing of the correct words. But length-changing errors required that the entire page be retyped and the error-checking process be repeated.

When computers first became widespread in businesses and schools, many people started typing their documents on punch cards. Similar to the cards used in the famous Florida vote of 2000, these cards represented letters when certain holes were punched out, creating “chad.” A document such as a thesis could be typed on cards, fed to the computer, and printed line-for-line on a printer. Early printers were like typewriters, but faster. This arrangement solved the problem of handling small errors, since only one or a few cards would have to be repunched.

Several early computer-science students saw the potential for doing better, especially as they confronted the daunting task of typing their PhD theses. In 1964, Jerry Saltzer at the Massachusetts Institute of Technology created for this purpose a pair of programs: Typeset, an early form of text editor, and Runoff, an improved formatting system. Bob Balzer, at Carnegie Mellon University, created similar software, called LEAD (List, Edit, and Display), at about the same time. Programs like Runoff and LEAD not only reproduced what was on punch cards, but also formatted the document by placing on one line as many words as would fit, typically a maximum of 80 characters per line. This advance solved the problem of having to retype large amounts of text when words were inserted or deleted. Special commands, which were not part of the text, could be inserted to control matters such as justification, that is, alignment of text on the right, as in a book, as well as on the left.

Word-Processing Software at Bell Labs

The Typeset/Runoff system inspired a group of researchers at Bell Laboratories, leading to Joe Ossanna’s NROFF (new runoff) program. A “macro” capability was added to allow repetitive formatting concepts, such as section headers or indented paragraphs, to be defined once and used easily many times. The “MS” macro package by Mike Lesk was widely used as a standard for formatting of text. But the printed output

device was still a souped-up typewriter, with its single size and font of letters.

In the early 1970s, work on typesetting at Bell Labs had progressed to the point where the labs purchased a CAT phototypesetting machine, of the kind then used to produce newspapers, for example. This device allowed printing in several fonts and font sizes. Unlike today’s laser printers, it printed only on photographic paper, which was then reproduced by one of several processes, such as a “Xerox machine.” NROFF evolved into TROFF (typesetter runoff), which allowed documents to be created and then typeset on film. For the first time, it became possible for casual authors to produce a letter or report that looked as if it were part of a book.

However, TROFF was not up to the requirement faced by many scientists and engineers to set mathematical text easily. While the CAT printer could, for example, print a subscript in the correct (smaller) size and below the line of text, there was no convenient way to tell it to do so. EQN, a program written by Brian Kernighan, solved that problem by taking expressions that represented how the equation or formula should look (e.g., “X sub 1” for an X with a subscript 1, X1) and turning them into TROFF, while leaving everything that wasn’t an equation intact, for later processing by TROFF. EQN is in effect an atypical programming language: it does not let you write general-purpose algorithms, as the best-known programming languages do, but it helps you deal with a particular, difficult problem, that of describing how mathematical expressions should look on the page. It is an excellent example of how the effective design of a language or notation for a problem leads to tools that people find useful and powerful.

A key borrowing of EQN from the more traditional programming languages is recursion, one of the central themes of computing, in which things are defined partially in terms of themselves. For example, EQN allows one to say not only “X sub 1” but “X sub anything” or even “anything sub anything,” where the “anythings” can themselves have subscripts or any of the other operations whereby mathematical expressions are built up, such as horizontal lines for division. Thus, one can say things like “X sub {n sub i}” to represent an X whose subscript is itself a subscripted n sub i.

Several other components of modern word processors got their start in the same Bell Labs group. The first spell-checkers were developed to compare words in the document with a list of “acceptable” words, modified by rules for plurals, tense modifications, and so on. Also, Lorinda Cherry developed the first grammar-checker (DICT, for “diction”) by applying a table of rules to the document.
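The recursive idea behind EQN can be illustrated with a toy translator. This is only a sketch of the principle (EQN’s real grammar and TROFF output are far richer); the tokenizing convention and TeX-style rendering here are my own choices for the example:

```python
# Toy illustration of EQN's recursive idea (not EQN's real grammar or
# output). "sub" nests to arbitrary depth, so "X sub { n sub i }" becomes
# an X whose subscript is itself subscripted. We translate such
# expressions into TeX-style notation to make the recursion visible.

def parse(tokens):
    """Parse a token list into a nested (base, subscript) structure."""
    node = parse_atom(tokens)
    while tokens and tokens[0] == "sub":
        tokens.pop(0)                      # consume "sub"
        node = (node, parse_atom(tokens))  # the subscript is parsed recursively
    return node

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "{":                         # braces group a whole sub-expression
        inner = parse(tokens)
        tokens.pop(0)                      # consume the matching "}"
        return inner
    return tok                             # a plain symbol like "X" or "n"

def render(node):
    """Render the nested structure in TeX-like form."""
    if isinstance(node, tuple):
        base, sub = node
        return f"{render(base)}_{{{render(sub)}}}"
    return node

print(render(parse("X sub { n sub i }".split())))   # → X_{n_{i}}
```

Because parse calls itself for anything inside braces, “anything sub anything” works to any depth, which is exactly the power the essay describes.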

Typesetting Research at Xerox PARC

In the mid-1970s, a team of researchers at Xerox PARC (Palo Alto Research Center) made several important contributions to the way we now create documents. One was the development of the laser printer. While the Xerox PARC researchers could and did build a laser printer, they could not do so at a price people would pay. They estimated that it would cost over $100,000 at that time to build the printer we now buy for a few hundred dollars. Nevertheless, in order to verify their conjecture that the existence of these devices would change the way people dealt with documents, they built prototypes of a laser printer, called Dover, and gave them to several universities. As envisioned, the Dover changed markedly the way students and faculty prepared papers. Also as envisioned, the cost of building such devices has dropped dramatically over the years, and the dream of a laser printer on every desk has become real.

A second innovation from the PARC group was the WYSIWYG (“wizzy-wig,” for “what you see is what you get”) editor. A prototype text editor called Bravo allowed users to see the document they were creating. In one sense, the differences between Bravo and TROFF were small. For example, each allowed you to specify that certain text should be italic. In Bravo, as in MSWord and other text-processing systems today, one could change fonts, say to italics, as one typed. In contrast, TROFF required you to type a command (“.it”) to say “change to italics.” Bravo let you see that you were really typing italic text (and would also let you see if you forgot to return to roman text when you should), while with TROFF, you could only tell whether you got it right when you printed your document later. On the other hand, Bravo was less adept at setting complex mathematical formulas, since it did not possess the recursive description capabilities of EQN, and in fact the successors of Bravo up to this day have not recaptured the power of EQN in WYSIWYG mode.

While it may seem obvious that, for non-mathematical typesetting, WYSIWYG editors are a wonderful tool, the existence of such editors depends on developments in a number of different areas. Processors had to become fast enough that it became economical to devote a processor to a single task like editing. Computer graphics technology needed to advance to a state where it was possible to offer the user a video terminal as a display unit. Operating systems needed to incorporate windows as a primitive concept and manage rapid communication and drawing of characters between keyboard, processor, and display.

The TeX Project

At about the same time as the Xerox activities, Don Knuth at Stanford began a broad study of how computers can improve the preparation and typesetting of documents. By this time, it was clear that the font was a critical computer object. Descriptions of the letters in a font have to be developed so that the same look can be displayed on a video terminal and also printed, perhaps at a different size, on one of many different printing devices. For this purpose, Knuth developed a system called Metafont for describing the shapes of letters and other characters in a way that allows easy generation of the font in different sizes and slants.

Knuth also developed TeX, a unified system for typesetting that incorporates the capabilities of TROFF and of several specialized systems such as EQN. In addition, TeX increases the care with which paragraphs and pages are laid out. TeX’s paragraphing algorithm looks at the words of an entire paragraph before deciding on line breaks, and thereby improves the look of the text compared with “greedy” algorithms that look at text a line at a time and fit whatever they can on each line in succession.

The TeX paragraphing algorithm is an excellent example of a general computational approach called “dynamic programming.” It works roughly as follows. First, we need a measure of how well words fit on a line. A typical approach, even for greedy algorithms, is to decide on the optimum space between words and then assign a “badness” to a line proportional to the square of the deviations of the spaces from the optimum. The objective then becomes to minimize the total badness of all the lines in a paragraph. Now, the number of ways to break a paragraph of, say, 100 words into 10 lines is astronomical; it is the number of ways to pick 9 line breaks among the 99 positions between the words, which is “99 choose 9,” or about 1.7 trillion. It is clearly not feasible to consider all possible ways to break paragraphs.

Fortunately, as is often the case in computer science, the obvious way to do something is neither the best way nor the only way. In this case, dynamic programming let Knuth find an algorithm that takes time proportional to the length of the paragraph, rather than exponential in its length. The dynamic-programming algorithm fills out a table of least badnesses, not just for the paragraph itself, but for all paragraphs that could be formed from a tail, or suffix, of the sequence of words in the paragraph. That is, for each i, starting at 1 and going up to the number of words in the paragraph, the entry for i is the least badness of any division of the last i words into lines. Suppose we have computed this value for 1, 2, . . ., up to i – 1. To decide on the best line breaks for the last i words, we consider, for all k small enough that k words can fit on one line, the cost of placing the first k of the i words on one line (using the formula for the “badness” of

any particular line), plus the badness of the best breaking for the remaining i – k words (as given by the table). As soon as i reaches the total number of words in the paragraph, we know the cost of the best line breaking, and we can easily figure out what that breaking is by storing in the table, as we go, the winning value of k for each i.

In addition, one of Knuth’s students, Frank Liang, invented an elegant algorithm for choosing the points at which words should be hyphenated. The dictionary tells us how to hyphenate in a straightforward way: it lists every word it knows and shows where the hyphens may go. But in addition to using a lot of space, that approach is of little use if you’ve just invented a word like “cyberpunk,” or you need to hyphenate a proper noun. Liang’s approach was to summarize the rules of hyphenation, using a simple language for expressing those rules and their priorities. Imagine a collection of rules: it’s good to hyphenate here; it’s not good to hyphenate there. For instance, it is generally a bad idea to hyphenate between the “a” and “t” in “. . . at . . . .” But if the word happens to fit the pattern “. . . ation . . . ,” then it is actually a good idea to hyphenate between the “a” and “t,” if the word has to be hyphenated at all. Each rule has a number associated with it, and this number is given to those points between letters to which the rule applies. Thus, a point might acquire several different numbers, from different rules. If so, take the largest of the numbers and ignore the rest. The magic of Liang’s algorithm is this: if the number associated with a point is odd, then you may hyphenate there; if it is even, you may not. For example, the rule about “. . . at . . .” might receive a low, even number, like 2, suggesting that, in the absence of any more specific pattern, we should not hyphenate there. The rule about “. . . ation . . . ,” on the other hand, would have a high, odd number, such as 9. Thus, the rule strongly suggests that “a-tion” is a good hyphenation, and it overrules the more general idea that “a-t” is bad. Were we to have a specific word in mind with the pattern “. . . ation . . .” where we did not want the hyphen between the “a” and “t” (there are no such examples in common English), we could make that word a pattern with a still higher, even number, like 24.

Word Processing Today

If you are familiar with MSWord or a similar application such as WordPerfect, you will know that many of the ideas described here are available to you. The application breaks paragraphs into lines in a way that allows alignment at both left and right (although it does not use the TeX algorithm for doing so). It allows changes of font at the press of a button on the screen. It allows global changes in the way elements such as sections are displayed, just as the early Bell Labs macro packages did. It

has a rule-based system to catch many grammatical errors, as DICT did, and catches spelling errors in a similar way. It has a WYSIWYG interface like the old Bravo editor, and it easily passes its document to a laser printer, a descendant of the Dover printers of 25 years ago. The system provides a virtually unlimited number of fonts. Perhaps as a reflection of how prevalent word processing has become in ordinary discourse, neither Microsoft nor other manufacturers of word-processing applications have felt the need to integrate mathematical typesetting nearly as closely as EQN did. Ironically, many scientists and engineers still prefer to write using TeX (which has all the power of EQN and more) or its variant LaTeX, a system developed by Leslie Lamport.
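Returning to TeX’s paragraphing: the suffix-table dynamic program described earlier can be sketched in a few lines. This is a simplified illustration, not TeX’s actual implementation; it uses squared leftover space on each line as a stand-in for the squared-deviation badness discussed in the text, and the function and variable names are mine:

```python
# Simplified sketch of dynamic-programming line breaking (not TeX's actual
# algorithm or badness function). best[i] holds the least total badness of
# laying out words i..n-1, exactly the suffix table described in the text.
# Assumes every individual word fits on a line.

def break_paragraph(word_lengths, line_width):
    n = len(word_lengths)
    INF = float("inf")

    def badness(i, j):
        # Width of words i..j-1 separated by single spaces.
        width = sum(word_lengths[i:j]) + (j - i - 1)
        if width > line_width:
            return INF        # these words do not fit on one line
        if j == n:
            return 0          # the last line of a paragraph is never stretched
        return (line_width - width) ** 2

    best = [INF] * n + [0]    # best[n] = 0: nothing left to lay out
    split = [0] * n           # winning end-of-line position for each suffix
    for i in range(n - 1, -1, -1):        # shorter suffixes first
        for j in range(i + 1, n + 1):
            b = badness(i, j)
            if b == INF:
                break                     # longer first lines only get worse
            if b + best[j] < best[i]:
                best[i] = b + best[j]
                split[i] = j

    # Recover the actual breaks from the stored winning splits.
    lines, i = [], 0
    while i < n:
        lines.append((i, split[i]))
        i = split[i]
    return best[0], lines
```

For example, with word lengths [3, 2, 4, 5] and a line width of 8, the function returns the minimum total badness together with the chosen (start, end) word range for each line. Because each suffix’s answer is computed once and reused, the work grows only linearly in the number of words, rather than combinatorially as in the “99 choose 9” count above.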

VISICALC, SPREADSHEETS, AND PROGRAMMING FOR THE MASSES, OR “HOW A KILLER APP WAS BORN”
James D. Foley, Georgia Institute of Technology

An important part of computing is the study of human-computer interaction (HCI). Individuals and organizations generally buy computer hardware and software expecting that the computer will allow them to do something entirely new, or more accurately or completely or quickly or economically or with more enjoyment than without the computer. The role of HCI is to develop computer capabilities that match what people want to do, so that the computer can be useful to individuals and organizations.[1]

At least three important software innovations of the last 30 years have been HCI tours de force:

The development of window managers at SRI and PARC, which were precursors to the Macintosh and Microsoft window-manager software that allowed millions of people to easily use personal computers.

The development of the first Web browser by Tim Berners-Lee in 1991;[2] the development of Mosaic (the first widely available browser) at the University of Illinois in 1993; and their subsequent progeny, Netscape Navigator and Internet Explorer, which made the World Wide Web accessible to millions of people, as opposed to the thousands who had been using it originally.

VisiCalc, developed by Dan Bricklin and Bob Frankston in 1979. Prior to VisiCalc, personal computers were bought mostly by hobbyists, by game players, and to teach programming. The VisiCalc spreadsheet program (whose commercial successors include Microsoft’s Excel) caused many financial analysts, business people, accountants, and students to buy personal computers (first the Apple II and later the IBM PC). As such, VisiCalc was the “killer app” that dramatically increased sales of these computers.

[1] See, for example, Brad A. Myers, 1998, “A Brief History of Human Computer Interaction Technology,” ACM Interactions 5(2):44-54. Available online at http://www-2.cs.cmu.edu/~mulet/papers/uihistory.tr.html.
[2] See http://www.w3.org/History.html.

VisiCalc

In this essay we focus on VisiCalc, for several reasons. First, VisiCalc did spark the sales of personal computers. Second, VisiCalc introduced

many of its users to computer programming without their ever realizing they were doing computer programming, thereby providing a powerful tool in an easy-to-use package. And finally, as with window managers and Mosaic, VisiCalc could not have happened without a considerable body of prior computer science research.

Why did VisiCalc spark PC sales? Why did people rush to buy a computer in order to use VisiCalc? Simply because VisiCalc met a market need (a human need) to organize and calculate numbers more rapidly using an electronic spreadsheet than one could do by hand with a paper-based spreadsheet and an adding machine or an electronic calculator. There was a pent-up demand, and VisiCalc together with the PC met that demand. As time went on, other uses for VisiCalc developed; many of us organize lists of addresses and other information in the convenient row-and-column format of our favorite spreadsheet, perhaps never calculating a single number. These uses may not have been what Bricklin and Frankston envisioned, but they are typical of the way in which tools are adapted by their users to meet needs not envisioned by the tools’ designers.

What about this notion that VisiCalc users are actually programming? Well, the “equations” of VisiCalc, by which one states that a particular cell of the spreadsheet is to have a value computed from other cells, are essentially the same as the assignment statements and expressions used in all modern programming languages.
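To make the parallel concrete, here is a minimal sketch of spreadsheet-style cells (an illustration only; the class and method names are mine, and VisiCalc’s internals were quite different). Unlike an ordinary assignment, a cell’s formula is re-evaluated whenever the cell is read, so it always reflects the current values of the cells it depends on:

```python
# Minimal sketch of spreadsheet-style cells (illustration only, not
# VisiCalc). A cell holds either a constant or a formula; reading a cell
# evaluates its formula against the current sheet, so changes to inputs
# propagate without re-entering the formula.

class Sheet:
    def __init__(self):
        self.cells = {}

    def set(self, name, value):
        """Store a constant, or a formula given as a function of the sheet."""
        self.cells[name] = value

    def get(self, name):
        value = self.cells[name]
        return value(self) if callable(value) else value

sheet = Sheet()
sheet.set("B6", 10)
sheet.set("B7", 20)
sheet.set("C4", 2)
sheet.set("C5", 3)
sheet.set("D9", 1)
# The equation "= (B6 + B7)/(C4 + C5) - D9" from the text, placed in B9:
sheet.set("B9", lambda s: (s.get("B6") + s.get("B7"))
                          / (s.get("C4") + s.get("C5")) - s.get("D9"))

print(sheet.get("B9"))   # (10 + 20)/(2 + 3) - 1 = 5.0
sheet.set("B6", 35)      # change an input; B9 need not be re-entered
print(sheet.get("B9"))   # (35 + 20)/(2 + 3) - 1 = 10.0
```

The second read of B9 shows the spreadsheet behavior described in the footnote below: the equation responds to changes in its inputs, where an ordinary assignment statement would not.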
The VisiCalc equation “= (B6 + B7)/(C4 + C5) – D9” placed in cell B9 has the same effect[3] as the C programming language statement “B9 = (B6 + B7)/(C4 + C5) – D9”, the Pascal statement “B9 := (B6 + B7)/(C4 + C5) – D9”, or the Java statement “B9 = (B6 + B7)/(C4 + C5) – D9”. But, referring back to the concept of the universal machine (see Kleinberg and Papadimitriou in Chapter 2), this limited form of programming in VisiCalc is not the equivalent of a general programming capability. VisiCalc allowed and encouraged its users to program without realizing that they were programming, using simple-to-understand formulae that could either be typed into a spreadsheet cell or be partially (or, in some cases, completely) entered simply by pointing at the cells in the spreadsheet that appear in the equation. Hence, a reduction in the “fear of programming” that often discouraged people from putting computers to use.

How did VisiCalc draw on computer science research? In multiple ways:

[3] Strictly speaking, as soon as any of the variables such as B6 or C5 changes, the equation is re-evaluated and the new value is placed in cell B9, whereas with the programming languages the assignment statement is executed only when the program’s flow of control comes to the statement. In this sense, the VisiCalc equation is a more powerful construct than the programming-language assignment statement.

1. By using a visual, “what you see is what you get” (WYSIWYG) user interface, in which one sees the spreadsheet and can directly manipulate cells by pointing at them.

Dan Bricklin states in his Web pages[4] that he was aware of and influenced by the early research work of Doug Engelbart on text editors using a mouse, and that he had probably seen a demonstration of Xerox PARC’s Alto computer system, which was in turn also influenced by Engelbart. Engelbart and PARC are the critical early developers of the key elements of contemporary window-oriented user interfaces found in the Macintosh and Microsoft operating systems.

Engelbart’s passion was to augment the human intellect by providing computer tools to help organize information, compose and modify documents, and work with others at a distance while doing so. With this vision, he and a group of up to 17 researchers developed NLS (oN-Line System) between 1962 and 1969 at the Stanford Research Institute. Along the way, they invented the mouse as a way to interact with the computer by pointing rather than by the much slower process of moving a cursor with key strokes; developed tiled windows (non-overlapping, as opposed to the overlapping windows of Microsoft Windows and the Apple Macintosh operating system); developed a text editor with commands to move, copy, and delete individual characters, words, or blocks of text (see Ullman in Chapter 8); and provided hyperlinks (just like those in the World Wide Web) and key-word searches (similar to the search-engine concept of Google and Alta Vista). Engelbart’s research style is not unusual: set a big, hard-to-achieve goal, and create the multiple tools needed to achieve that goal.

The same thing happened at Xerox PARC. The Alto workstation software developed at PARC starting in the early 1970s was the prototype for the Apple Macintosh and Microsoft Windows user interfaces. Alto had many of the features seen in these and other contemporary window interfaces: windows, icons, menus, copy and paste commands, and direct-manipulation text and graphics editors.

2. By using “programming by example” to simplify and speed up the process of entering equations.

VisiCalc was the first widely used instance of “programming by example,” in which the user does programming by demonstrating operations to the computer rather than by specifying what is to be done via

[4] See http://www.bricklin.com/visicalc.htm.

some programming language. For example, to indicate that a cell should always contain the sum of several other cells, one would point at the cell that is to hold the sum, and then successively at the cells that are to be summed.

The idea of programming by example was developed in a slightly different form by Moshe Zloof of IBM Research a few years earlier. Zloof wanted to allow users to specify database queries without having to learn a query programming language (such as SQL). His insight was to recognize that if a user gave just an example of the query results, such as a list of part numbers, descriptions, inventory quantities, and selling prices, then the query-language statements to obtain all of the results could be generated from the example. VisiCalc applied the concept in a slightly different way than Zloof did, thus both drawing on and contributing to the storehouse of computer science knowledge (a typical pattern for CS and other research).

Specifically, VisiCalc allows the user to enter a formula (such as the one given above) by pointing at the cells that occur in the formula. So to enter the formula B2 + C5 in cell B1, the user points at cell B1, types = to indicate that a formula is being specified, then points at cell B2, then points at cell C5, and ends by typing the enter key. The + in the formula is assumed and entered automatically, as the most common choice. If the product is desired rather than the sum, the user types “*” after pointing at cell B2 and before pointing at cell C5. This seemingly simple idea greatly simplifies the process of entering formulae, and it corresponds nicely to the intuition that, given a choice between pointing at something and typing its name (in the case at hand, the spreadsheet cell), we humans will tend to point rather than name, as the easier of the two alternatives.

3. By applying finite-state-machine notation as a tool for designing the program: what inputs were valid in each “state,” what response the computer would make to each input, and what new program “state” would result.

The understanding of how to use state diagrams for specifying the behavior of interactive programs was first developed by William Newman in 1968, as a researcher at Imperial College, London. He, in turn, drew on the fundamental notion of finite state machines that had been developed as yet another computer science concept. The idea as applied to VisiCalc is that the program can be in one of a finite number of states. In each state, only certain user actions (such as entering information into a cell, or selecting a command) are possible. Each such action takes the program to another state (or back to the same state) and at the same time causes some change in what the user sees on the display.
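This state-machine style of interface design can be sketched with a small table-driven fragment. The three states and their inputs here are invented for illustration, not VisiCalc’s actual design:

```python
# Illustrative finite state machine for a spreadsheet-like interface
# (invented states and inputs, not VisiCalc's actual design). Each
# (state, input) pair maps to the next state plus a display effect;
# inputs with no entry are simply invalid in that state.

TRANSITIONS = {
    ("ready", "point_at_cell"):            ("cell_selected", "highlight the cell"),
    ("ready", "type_text"):                ("ready", "ignore keystrokes"),
    ("cell_selected", "type_="):           ("entering_formula", "show formula bar"),
    ("cell_selected", "type_text"):        ("entering_value", "echo the text"),
    ("entering_formula", "point_at_cell"): ("entering_formula", "append cell name"),
    ("entering_formula", "press_enter"):   ("ready", "store formula, redisplay"),
    ("entering_value", "press_enter"):     ("ready", "store value, redisplay"),
}

def run(inputs, state="ready"):
    """Feed a sequence of user actions through the machine."""
    for action in inputs:
        if (state, action) not in TRANSITIONS:
            print(f"{state}: '{action}' is not valid here")
            continue
        state, effect = TRANSITIONS[(state, action)]
        print(f"-> {state}: {effect}")
    return state

# Entering a formula by pointing, as described in item 2 above:
final = run(["point_at_cell", "type_=", "point_at_cell",
             "point_at_cell", "press_enter"])
```

Writing the design down as a transition table makes it easy to check that every input has a defined response in every state, which is precisely what the notation was used for.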

4. By using language-parsing techniques.

Computer scientists have developed techniques for taking expressions such as the earlier example, (B6 + B7)/(C4 + C5) – D9, stored in cell B9, and actually performing the calculations in the appropriate sequence. With the example expression, the actual sequence of arithmetic operations would be:

Add cells B6 and B7, saving the result in a place called TEMP1.
Add cells C4 and C5, saving the result in a place called TEMP2.
Divide TEMP1 by TEMP2, saving the result in TEMP1.
Subtract cell D9 from TEMP1, saving the result in cell B9.

This may seem simple, but try writing out a general procedure that will work with an arbitrary expression, with many parentheses and divisions and multiplications! Not so easy, is it?

The Next “Killer App”

What will be the successor to VisiCalc, to windowing systems, and to Web browsers, providing new capabilities to millions and tens of millions of computer users? There are many possibilities. I believe that whatever the next major development, the next “killer app,” might be, it will:

Meet an individual or organizational need.

Draw on the research of many, many computer scientists.

Leverage the existing hardware and software infrastructure, and accelerate its growth, just as VisiCalc leveraged and accelerated the availability of PCs, e-mail leveraged the Internet, and Web browsers leveraged the World Wide Web protocols.

We also note that only Web browsing was well anticipated as a killer app: anticipated by Vannevar Bush in the 1940s and researched by a number of prominent computer scientists from the late 1960s into the 1980s.

Many candidates for the next “killer app” recognize that the now-common use of computers by people sitting at a desk may not be the major way of using computers in the future. A fundamentally new direction for computing is evolving: I call this direction off-the-desk-top computing, in the sense that it will not be desk-top computers but other forms of computers that will create the next revolution. Others call it ubiquitous or embedded computing, in the sense that computing will be everywhere (i.e., ubiquitous) and embedded in all manner of devices and
equipment. This is already happening. Contemporary cell phones have a computer as powerful as the desk-top computers of 10 years ago. TV set-top boxes have as much computing power as the powerful engineering workstations of 5 years ago. The typical automobile has 10 to 20 computers with aggregate power comparable to a contemporary desk-top PC.

The most widespread hardware and software infrastructure that might be leveraged is the wireless phone system. It is already pervasive, and it is computer intensive. It is already heavily used for short text messages, for games, and to obtain information. Given the small size of wireless phones and their keypads, the integration of wireless functionality into personal digital assistants, and the clear benefit of voice recognition on small devices, a trend can surely be predicted. More generally, voice recognition will become more and more pervasive as algorithms improve and as computing power and memory become ever more available.

Another hardware and software infrastructure that will be leveraged is the World Wide Web itself. With vast amounts of information online and more and more of the world’s citizenry connected, the use of intelligent assistants to find information, to make sense of it, and to put it to use will surely expand. Current examples include comparison shopping, money management, and travel arrangements. Life has become more complex and fast-moving; perhaps computers can help simplify it in some ways.

Automobiles already have dozens of computers, many networked together, each serving one of two purposes: operating the car, or entertaining and communicating with the driver and passengers. Many cars already have built-in wireless phones, global positioning systems, limited voice recognition, and navigation aids.
Many people spend tens of hours a week in a car, and there are significant personal and societal benefits to further leveraging the automotive computing infrastructure to make driving safer, to connect us with the outside world, and to entertain us. We spend significant amounts of time at home. Many predictions are made for computer-based home entertainment systems and the broader “wired home,” but the home has a very weak hardware and software infrastructure—nothing like that of the car or the typical office working environment or the wireless phone. I believe that homes will ultimately have networked appliances, switches, lights, motion sensors, cameras, microphones, and so on. While we do not yet have a killer app, we have candidates such as the use of computers to allow older people to live alone at home for more years before having to live with their children or in a nursing home; energy management; and home security. We should not neglect new uses of the Internet and the Web. Amy Bruckman, in her essay in Chapter 7, nicely likens our use of the Web to
the early days when the automobile was still known as the “horseless buggy.” At that time, nearly a century ago, one could not imagine the cars and highways of today. While we can and should be proactive in exploring these possibilities and their implications for our lives, it is also true that only time will tell.

INTERNET SEARCHING

Peter Norvig, Google Inc.

In 300 BC, the Library of Alexandria held some half million scrolls and served as a cultural center for the world, uniting the thoughts of Greek, Egyptian, Macedonian, and Babylonian culture. According to the Letter of Aristeas, the library had “a large budget in order to collect, if possible, all the books in the world.” Whole fields of study, such as geometry and grammar, grew from the scholars who had the fortune to travel to and study at Alexandria.

In 2004 AD, the Internet forms a distributed library of billions of pages, one that is accessible to anyone, anywhere in the world, at the click of a mouse. Every day hundreds of millions of “trips” to the library start with a query to an Internet search engine. If I want to research a topic such as “Library of Alexandria,” I can type those words to a search engine and in less than a second have access to a list of 31,000 pages, sorted by their usefulness to the query. This unprecedented ease of access to information has revolutionized the way research is done by students, scientists, journalists, shoppers, and others. It opens up an online marketplace of products, services, and ideas that benefits both information providers and seekers; sellers and buyers; consumers and advertisers.

How is it that an Internet search engine can find the answers to a query so quickly? It is a four-step process:

1. Crawling the Web, following links to find pages;
2. Indexing the pages to create an index from every word to every place it occurs;
3. Ranking the pages so the best ones show up first; and
4. Displaying the results in a way that is easy for the user to understand.

Crawling is conceptually quite simple: starting at some well-known sites on the Web, recursively follow every hypertext link, recording the pages encountered along the way. In computer science this is called the transitive closure of the link relation.
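That recursive traversal can be sketched as a breadth-first search. The toy in-memory link graph below stands in for real HTTP fetches, and the URLs are invented:

```python
# Crawling as transitive closure: breadth-first traversal of the link relation.
# The link graph is a toy in-memory stand-in for fetching and parsing real pages.
from collections import deque

LINKS = {  # page -> pages it links to (hypothetical URLs)
    "a.example/":      ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/":      ["c.example/"],
    "c.example/":      [],
}

def crawl(seeds):
    """Return every page reachable from the seed set by following links."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for target in LINKS.get(page, []):   # a real crawler would fetch and parse here
            if target not in seen:           # never revisit a page
                seen.add(target)
                frontier.append(target)
    return seen

print(sorted(crawl(["a.example/"])))
```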
However, the conceptual simplicity hides a large number of practical complications: sites may be busy or down at one point and come back to life later; pages may be duplicated at multiple sites (or with different URLs at the same site) and must be dealt with accordingly; many pages have text that does not conform to the standards for HTML, HTTP redirection, robot exclusion, or other protocols; some information is hard to access because it is hidden behind a form, Flash animation, or Javascript program. Finally, the necessity of
crawling 100 million pages a day means that building a crawler is an exercise in distributed computing, requiring many computers that work together and schedule their actions so as to get to all the pages without overwhelming any one site with too many requests at once.

A search engine’s index is similar to the index in the back of a book: it is used to find the pages on which a word occurs. There are two main differences: the search engine’s index lists every occurrence of every word, not just the important concepts, and the number of pages is in the billions, not hundreds. Various techniques of compression and clever representation are used to keep the index “small,” but it is still measured in terabytes (millions of megabytes), which again means that distributed computing is required.

Most modern search engines index link data as well as word data. It is useful to know how many pages link to a given page, and what the quality of those pages is. This kind of analysis is similar to citation analysis in bibliographic work and helps establish which pages are authoritative. Algorithms such as PageRank and HITS are used to assign a numeric measure of authority to each page. For example, the PageRank algorithm says that the rank of a page is a function of the sum of the ranks of the pages that link to it. If we let PR(p) be the PageRank of page p, Out(p) be the number of outgoing links from page p, Links(p) be the set of pages that link to page p, and N be the total number of pages in the index, then we can define PageRank by

    PR(p) = r/N + (1 – r) × Σ over q in Links(p) of PR(q)/Out(q)

where r is a parameter that indicates the probability that a user will choose not to follow a link, but will instead restart at some other page. The r/N term means that each of the N pages is equally likely to be the restart point, although it is also possible to use a smaller subset of well-known pages as the restart candidates.
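The definition above can be computed by repeated substitution. The four-page link graph below is invented for illustration, and r = 0.15 is just a conventional choice:

```python
# Iterating the PageRank equation on a hypothetical four-page link graph.
# Out-links: A -> {B, C}, B -> {A, C}, C -> {D}, D -> {C}.

links_to = {          # Links(p): the set of pages that link TO page p
    "A": {"B"},
    "B": {"A"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
out_count = {"A": 2, "B": 2, "C": 1, "D": 1}   # Out(p): outgoing links from p

r = 0.15                                       # probability of restarting at a random page
N = len(links_to)
pr = {p: 1.0 / N for p in links_to}            # start from a uniform distribution

for _ in range(50):                            # repeated substitution converges quickly here
    pr = {p: r / N + (1 - r) * sum(pr[q] / out_count[q] for q in links_to[p])
          for p in links_to}

print({p: round(pr[p], 3) for p in sorted(pr)})
```

Page C, which collects links from three pages, ends up with the highest rank, and the ranks sum to 1 because they form a probability distribution over pages.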
Note that the formula for PageRank is recursive: PR appears on both the right- and left-hand sides of the equation. The equation can be solved by iterating it several times, or by employing algorithms that compute the principal eigenvector of a (4-billion-by-4-billion) matrix.

The two steps above are query independent: they do not depend on the user’s query and thus can be done before a query is issued, with the cost shared among all users. This is why a search takes a second or less, rather than the days it would take if a search engine had to crawl the Web anew for each query.

We now consider what happens when a user types a query. Consider the query [“National Academies” computer science], where the square brackets denote the beginning and end of the query, and the
quotation marks indicate that the enclosed words must be found as an exact phrase match. The first step in responding to this query is to look in the index for the hit lists corresponding to each of the four words “National,” “Academies,” “computer,” and “science.” These four lists are then intersected to yield the set of pages that mention all four words. Because “National Academies” was entered as a phrase, only hits where these two words appear adjacent and in that order are counted. The result is a list of 19,000 or so pages.

The next step is ranking these 19,000 pages to decide which ones are most relevant. In traditional information retrieval this is done by counting the number of occurrences of each word, weighting rare words more heavily than frequent ones, and normalizing for the length of the page. A number of refinements on this scheme have been developed, so it is common to give more credit for pages where the words occur near each other, where the words are in bold or large type or in a title, or where the words occur in the anchor text of a link that points to the page. In addition, the query-independent authority of each page is factored in. The result is a numeric score for each page that can be used to sort them best-first. For our four-word query, most search engines agree that the Computer Science and Telecommunications Board home page at www7.nationalacademies.org/cstb/ is the best result, although one preferred the National Academies news page at www.nas.edu/topnews/ and one inexplicably chose a year-old news story that mentioned the Academies.

The final step is displaying the results. Traditionally this is done by listing a short description of each result in rank-sorted order. The description will include the title of the page and may include additional information such as a short abstract or excerpt from the page.
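Hit-list intersection with a phrase check can be sketched with a tiny positional index. The pages (numbered 1 to 3) and word positions below are invented for illustration:

```python
# Intersecting hit lists, with an adjacency check for a quoted phrase.
# The miniature positional index is invented: word -> {page: [positions]}.

index = {
    "national":  {1: [0, 9], 2: [4],  3: [7]},
    "academies": {1: [1],    2: [12], 3: [8]},
    "computer":  {1: [3],            3: [0]},
    "science":   {1: [4],            3: [1]},
}

def pages_with_phrase(w1, w2):
    """Pages where w2 occurs at the position immediately after w1 (exact phrase)."""
    both = set(index[w1]) & set(index[w2])
    return {p for p in both
            if any(pos + 1 in index[w2][p] for pos in index[w1][p])}

def answer(phrase, other_words):
    result = pages_with_phrase(*phrase)
    for word in other_words:          # intersect with each remaining hit list
        result &= set(index[word])
    return result

print(sorted(answer(("national", "academies"), ["computer", "science"])))
```

Page 2 mentions both phrase words but not adjacently, so it is dropped; only pages where “national academies” appears as a phrase and the other words also occur survive the intersection.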
Some search engines generate query-independent abstracts while others customize each excerpt to show at least some of the words from the query. Displaying this kind of query-dependent excerpt means that the search engine must keep a copy of the full text of the pages (in addition to the index) at a cost of several more terabytes. Some search engines attempt to cluster the result pages into coherent categories or folders, although this technology is not yet mature. Studies have shown that the most popular uses of computers are e-mail, word processing, and Internet searching. Of the three, Internet searching is by far the most sophisticated example of computer science technology. Building a high-quality search engine requires extensive knowledge and experience in information retrieval, data structure design, user interfaces, and distributed systems implementation. Future advances in searching will increasingly depend on statistical natural-language processing (see Lee in Chapter 6) and machine-learning (see Mitchell in Chapter 6) techniques. With so much data—billions of
pages, tens of billions of links, and hundreds of millions of queries per day, it makes sense to use data-mining approaches to improve the system automatically. For example, several search engines now do spelling correction of user queries. It turns out that the vast amount of correctly and incorrectly spelled text available to a search engine makes it possible to build a better spelling corrector than traditional dictionary-based techniques allow. It is likely that there are other examples of text understanding that can be accomplished better with a data-oriented approach; this is an area that search engines are just beginning to explore.

Recommended Reading

Chakrabarti, S., B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, 1999, “Hypersearching the Web,” Scientific American, June, pp. 54-60.
