Software Development at Microsoft
MARVIN M. THEIMER
Microsoft Research
Redmond, Washington
BIG SOFTWARE SYSTEMS ARE HARD TO BUILD
There are countless examples of software projects that have consumed vast numbers of resources and then been scrapped (Saltzer, 2000). In fact, one study indicates that about 30 percent of all projects get scrapped, while 50 percent are delivered with significant budget overruns, shipping delays, or a significant fraction of their functionality left out (Standish Group, 1995).
Microsoft's business is designing, building, and shipping large software systems and applications. An extreme example is the Windows 2000 operating system, which contains roughly 29 million lines of code and which required the efforts of some 4,000 people to bring to fruition (Freeman, 1999). Most of Microsoft's other products also contain millions of lines of code and have development teams that number in the hundreds.
There are two key factors that make developing software systems so difficult. The first is the complexity of the potential interactions among all components of a system. Complex interactions make it difficult to test or verify that a system meets all the requirements of its design specifications. The second is the high rate of change to which current software systems are subjected. High rates of system evolution imply constant redesigns and reimplementations of many system components.
Complexity of Interactions
The number of possible system configurations an operating system or application may have to run on is huge. This creates a problem for testing that is exacerbated because many of the underlying hardware and third-party software products a system must interact with do not implement required specifications in a fully correct manner. In order to work correctly, the system must be able to work around many of the resulting problems, since a vendor often might be unwilling or unable to change the relevant aspects of their product. And if a vendor is of any size or importance, the alternative of not working with their offerings might be unacceptable.
A second kind of interaction complexity stems from the fact that software engineering technologies are currently unable to encode all the various kinds of dependencies between interacting system components in a systematic, machine-checkable manner. Technologies such as type-safe programming languages have enabled the elimination of certain kinds of programming errors; however, many kinds of dependencies are still only documented as internal coding comments that can, at best, be checked by ad hoc means. Examples include locking conventions for concurrent access to shared data and fault-handling conventions for dealing with exceptional cases, such as out-of-memory errors.
Rapid System Evolution
Exacerbating the problem of building complex systems is rapid system evolution, a fundamental aspect of current high-tech markets, which implies that the requirements for a successful product might change frequently and sometimes dramatically. For example, at Microsoft about 30 percent of the contents of the specification for a particular version of a product typically changes during its development cycle (Cusumano and Selby, 1995).
In addition to market evolution, hardware capabilities consistently double every year or two. A consequence is that, every 3 to 7 years, a system may require fundamental redesign in order to take advantage of the roughly 10x improvement in hardware resources that has become available.
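The 3-to-7-year figure follows from the doubling rate, and the arithmetic can be checked in a few lines. The Python sketch below is illustrative only: the function name `years_for_improvement` is hypothetical, and it assumes idealized, perfectly steady doubling, which real hardware trends only approximate.

```python
import math


def years_for_improvement(factor: float, doubling_period_years: float) -> float:
    """Years for capacity to grow by `factor` if it doubles every `doubling_period_years`."""
    # A 10x improvement needs log2(10), or about 3.32, doublings.
    doublings = math.log2(factor)
    return doublings * doubling_period_years


# Assumed doubling periods of one and two years ("every year or two").
for period in (1.0, 2.0):
    years = years_for_improvement(10.0, period)
    print(f"doubling every {period:g} year(s): 10x after about {years:.1f} years")
```

With these assumptions the two cases come out at roughly 3.3 and 6.6 years, which is consistent with the "every 3 to 7 years" range quoted above.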
SOFTWARE DEVELOPMENT AT MICROSOFT
Managing Development Projects
Cusumano and Selby (1995) give a good description of the key principles that enable development projects at Microsoft to remain manageable. Paraphrasing, these principles are as follows:
• Have a short vision statement that also defines what a product is not supposed to offer.
• Guide feature selection and prioritization with models and data about user activities.
• Have multiple incremental milestones, buffering between milestones, and integrated development and bug fixing.
• Define a modular, horizontal design architecture with a correspondingly decoupled project structure.
• Control the project's scope via fixed project resources.
One of the keys to successful product management is a means for controlling what features a product should include and in what order they should be implemented. Very early on, projects define a short vision statement, the purpose of which is to give a succinct, coherent description of what the product is and, equally importantly, what the product is not. This statement prevents products from evolving into things that are entirely different from what they were meant to be. However, the vision statement still allows for a far larger set of possible features than there are developer resources available to implement.
The choice of which features to include and how important they are is determined by creating models of user activities. A user activity is defined to be a more or less self-contained activity, such as writing a letter or doing a business financial model. By examining which sets of features are needed for any given user activity, a project can ensure that the set of features actually implemented will allow users to successfully complete some set of their normal activities while avoiding the provision of feature sets that only partially support various activities.
A common problem with development projects is that people do not like to report bad news until the last possible minute. Similarly, if bug testing and fixing get put off until the later stages of a project, nasty surprises may pop up at a time too late to fix. To avoid this, Microsoft projects have multiple incremental milestones and integrated development and bug fixing.
That is, products are built in multiple increments, with each incremental version having to reach a level of quality that, in theory, is “ready to ship.” A key aspect of defining a realistic project schedule is the inclusion of time buffers between incremental milestones, typically representing about one-third of the total time scheduled. These buffers are solely for dealing with unforeseen circumstances, such as the late delivery of a needed system component by another project team.
In order to manage the complexity of interacting system components, as well as of interacting developers, the architecture of a product is structured to be as modular and “horizontally decoupled” as possible. Modules are forced to interact through narrow, well-defined interfaces so that developers can design and implement each module separately and in an independent manner. Each module is assigned to a single developer or to a small team of developers so that there is a well-defined understanding of who is responsible for each piece of the project.
Limiting the size of the team working on a module is of critical importance. Since the overhead of coordination and communication grows with the square of the team size, teams typically consist of only a few people. For example, the entire Windows 2000 file system, which consists of about 200,000 lines of code, was mostly done by a 4-person team. One way to extend the leverage of a small team is to staff it with very good people. The best developers are considered to be as much as 10 times better than an average developer. Consequently, the most difficult parts of a system are typically given to small teams consisting primarily of very senior developers.
Finally, in order to ensure that they do not go on endlessly or grow ever larger, projects are given fixed resources. That is, a project is typically forced to cut functionality rather than acquire additional resources or substantially delay its shipping schedule. This is one of the most important differences between how Microsoft develops software and how various famous failed software projects were run (Saltzer, 2000). A key reason why this approach is acceptable is Microsoft's strategy of continual incremental upgrades for all of its products: cutting functionality from a project is a much less onerous task when the option exists of providing it in the next product release cycle.
Software Development Strategies
The following principles, described in Cusumano and Selby (1995) and paraphrased here, describe the key software development strategies Microsoft employs:
• Employ many small teams of developers in parallel but require them to synchronize each other's changes on a frequent basis.
• Always have a working version.
• Require everyone to use a common set of tools and have everyone colocated physically.
• Test continuously.
All of these principles represent ways to manage the complexity of interactions. For example, although projects try to structure their architecture as an assembly of more or less independent modules, the interactions between them still must be constantly checked and tested. Hence, a new master version of the product is built and tested daily to check for unexpected problems.
Similarly, changes made in each module are propagated to everyone else as soon as possible so that unforeseen interaction problems can be flushed out early. A key requirement for this style of development is that there always be a working version of the product. Although this might not be possible for the very earliest incremental versions of a new product, it is maintained in all other cases at Microsoft.
Another form of system complexity comes from component interaction requirements that are encoded as coding comments. It is vital to have quick access to the developers of any given component and to avoid communication problems stemming from the use of incompatible tools by different groups and projects. The advantage of colocation of developers using a common set of tools cannot be overemphasized. Consider, for example, when some component is used in a way that was never envisioned by its developers; the only truly effective way to understand any puzzling interactions it might exhibit is to go talk to the developers and perhaps even sit down with them and explore a live example of the problem.
Probably the most important strategy Microsoft uses for software development is continuous testing in as many circumstances as possible. Microsoft employs roughly as many dedicated test personnel as software developers. Testers are paired with developers and work together closely throughout the lifetime of a project.
Large-scale testing is made possible by automated test scripts and by software tools that simplify their deployment. Automated test scripts enable arbitrary users to run a variety of regression and stress tests without having to know much about the underlying code. Test scripts are run by “everyone.” Limited regression and stress tests are run as “quick” tests by a developer before he or she checks any code into a project's master version. Full tests are run against each new daily build of the master version by as many people as possible, on as many different machine configurations as possible. Even upper management participates.
Both developers and testers use a wide variety of tools to help them test their software. Debugging versions of a product typically contain tens of thousands of checked assertion statements. Various program analysis tools have been developed to detect such things as the use of uninitialized variables. Debugging versions also typically contain code that checks for memory allocation errors and corrupted data structures. Yet another technique used is fault injection.
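To make the idea of fault injection concrete, here is a minimal Python sketch. It is not a description of Microsoft's tooling. Every name in it (`InjectedFault`, `with_fault_injection`, `allocate_buffer`, `read_record`) is hypothetical, and it assumes the simplest possible scheme: a wrapper that makes a call fail artificially with a fixed probability, so that the caller's error handling can be exercised.

```python
import random


class InjectedFault(Exception):
    """Raised by the injector in place of a real failure (hypothetical type)."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap `func` so that each call fails artificially with probability `failure_rate`."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def allocate_buffer(size):
    """Hypothetical subsystem call that normally succeeds."""
    return bytearray(size)


def read_record(allocate, size):
    """Caller under test: it must survive an allocation failure by returning None."""
    try:
        return allocate(size)
    except InjectedFault:
        return None


rng = random.Random(0)  # fixed seed so a failing run can be repeated exactly
flaky_allocate = with_fault_injection(allocate_buffer, failure_rate=0.3, rng=rng)
results = [read_record(flaky_allocate, 16) for _ in range(1000)]
failures = sum(r is None for r in results)
print(f"{failures} of {len(results)} calls hit an injected fault and were handled")
```

Seeding the random generator is a deliberate choice in this sketch: when an injected fault exposes a bug, the same sequence of faults can be replayed while debugging.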
Code is added to a system to artificially cause faults to occur in various subsystems and to produce incorrect input parameters to and output results from called functions.
When a product is nearing completion, it is tested via actual use outside the project but internal to Microsoft. Finally, a beta version is sent out to a large audience of external volunteers. Given the extraordinary number of test cases that can occur for large software systems such as Windows 2000, beta tests involving hundreds of thousands of users are necessary to explore even a fraction of all the possible system configurations that can occur in practice.
LIFE IN THE INTERNET WORLD
Microsoft manages to routinely deliver large, complex software systems and applications. However, with the advent of its .NET Internet initiative, its software development process will be facing some fundamental new challenges. Chief among these will be extremely short development cycles and the requirement that the systems it will be providing be able to run continuously, irrespective of any faults that might occur, any hardware changes that are needed, and any software upgrades that are done.
Short development cycles are the norm in the Internet world: companies adapt their business models to changing market requirements on a frequent basis, and things such as fixes for security holes have to be deployed as soon as possible after someone reports them. This makes extensive testing almost impossible and requires the creation of a new generation of tools that will assist in getting software “right the first time.”
Continuous operation will require that developers focus on things like reliability, maintainability, and dynamic scalability, in addition to all their usual concerns. Unfortunately, these are so-called crosscutting issues that tend to affect the design of almost every component in a system.
Web services are also becoming more and more distributed in their implementation, with pieces of an application often running concurrently on several different machines. This will require the creation of an entire new generation of debugging, monitoring, and testing tools that can coordinate and analyze the behavior of activities spanning multiple machines.
REFERENCES
Cusumano, M., and R. Selby. 1995. Microsoft Secrets: How the World's Most Powerful Software Company Creates Technology, Shapes Markets, and Manages People. New York: Free Press.
Freeman, E. 1999. Building gargantuan software: 4,000 programmers do Windows 2000. Scientific American Presents 10(4): 28–31.
Saltzer, J. H. 2000. Coping with complexity. Invited talk at the 17th ACM Symposium on Operating Systems Principles, Kiawah, S.C., December 12–15, 1999. A brief summary and pointer to the presentation slides is presented in Operating Systems Review 34(2): 7–8.
Standish Group. 1995. CHAOS (Application Project and Failure). Standish Group Study. West Yarmouth, Mass.: Standish Group.