Data: The Basis for New Knowledge
As the basis for exploration and rigorous analysis of observed phenomena, systematically collected data are critical to investigations of the economic and social impact of information technology. Moreover, more timely data collection and analysis are likely to be useful in informing future policy decisions. Several data-related issues arise for researchers working at the intersection of information technology and socioeconomic issues. As discussion at the workshop made clear, among the most important is the need for more extensive, more timely, and new sources of relevant data.
3.1 Types and Uses of Data
Social scientists collect data from and about a variety of social units, ranging in degree of aggregation from individual human beings to corporations, economies, and nations (Box 3.1). Their time perspective may be historical or longitudinal (Box 3.2). The kinds of data collected and the methods of collection also depend on the overall purpose of the study. Special-purpose data sets are generally constructed by researchers to address a particular question, such as the extent to which information technology contributes to economic productivity. Multipurpose data sets can be used to study a wide variety of issues in a range of social science disciplines. As such, multipurpose data sets are part of the infrastructure of social science research.1 The scope and range of such data sets are illustrated by the top-level categories of the data in the archives maintained by the Inter-university Consortium for Political and Social Research (see Box 3.3).
• As personal observers (e.g., verbal "think-aloud" responses gathered in studies of decision making and problem solving; responses to interviews and questionnaires)
• As corporate or community representatives (e.g., descriptions of corporate deployment of information technology elicited in interviews with chief information officers)
• As performers (e.g., scores from educational testing; patterns of participation in groups)
• As decision makers (e.g., data revealing consumer preferences or choices)
Documents and Records: Historical and Contemporary
• Diaries (e.g., details of personal situations)
• Media content (e.g., indicators of cultural themes)
• Commercial records (e.g., data on the diffusion of the telephone)
• Public records (e.g., data from birth and death records indicating population changes over time)
• Performance measures collected from publicly reported data such as earnings reports; other financial measures
• Product performance data
• Data on voter turnout or library circulation rates as indicators of citizen participation
• Labor market statistics
• National and regional economic statistics
3.1.1 Data from Experiments
Experiments involve setting up control and experimental groups that differ only with respect to the presence or absence of the effect being studied and thus permit researchers to conclude that a difference in outcomes in the two groups is actually due to the difference in treatment. The HomeNet project is an experiment that examines the impacts of computers in the home (see section 2.1.1, "Computer Use in the Home"). The Internet Demand Experiment (INDEX) at the University of California, Berkeley, is attempting to measure user demand for Internet "quality of service" by offering different price-quality combinations and observing what users choose.2 Offering an actual choice is likely to lead to more accurate results than is asking hypothetical questions.
Case study: in-depth study of one social unit, such as a family, a school, an organizational work group, or a political campaign. The researcher uses multiple means of data collection (observation, interviews, document analysis) to develop a rich understanding of the interplay of factors operating in a single social setting.
Cross-sectional study: study in which data are collected on a relatively small number of variables from a relatively large number of social units at one point in time, often from questionnaires or existing records. The researcher often uses statistical techniques to characterize how variables are associated with one another.
Panel study: study in which data are collected on the same variables from the same social units at repeated points in time; supports investigating the impact of particular events that occur over the time course selected (e.g., the impact of a presidential candidate's debate on a panel of voters or the impact of an advertising campaign on a panel of consumers) as well as trends over time.
Experimental study: study in which the researcher uses random assignment techniques to allocate social units to different treatment regimes or experiences. Experiments establish control groups and experimental groups that differ only with respect to the presence or absence of the effect being studied. When random assignment is achieved, and social units have the same experiences in all ways except for the experimental treatment, and there is a measured difference in outcomes associated with the different treatments, then the researcher has evidence for concluding that the difference in treatment actually caused the difference in outcomes.
• Community, Urban Studies
• Conflict, Aggression, Violence
• Economic Behavior, Attitudes
• Elites and Leadership
• Geography and Environment
• Government Structures, Policies
• Health Care, Facilities
• Legislative, Deliberative Bodies
• Mass Political Behavior, Attitudes
• Social Institutions, Behavior
3.1.2 Panel Data
Panel data are especially valuable because they enable answering questions about both cross-sectional and time-series variation. For example, a panel study of families can address such cross-sectional questions as how children's access to computers in the home is related to their educational performance in school (holding constant other factors). It can also address such time-series questions as whether high school students' educational performance in school is related to their access to computers in the home (holding constant other factors) and how this relationship is affected by the age at which students first had home access. Panel data also allow application of statistical techniques that control for unobserved effects varying across a population. For example, the kinds of software and hardware available in the home have changed substantially over the past 15 years and are continuing to do so. An analysis of how the age at which students first had home access to computing affects later educational performance in school must take into account changes over time in the particular computing resources available.
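The fixed-effects logic described above can be sketched in a few lines of code. This is an illustrative toy, not PSID code: the student records are invented, and the "within" estimator shown is just one standard way to control for unobserved, time-invariant differences across units.

```python
# Sketch of a "within" (fixed-effects) estimator, the standard panel-data
# technique for controlling unobserved effects that vary across units but
# are fixed over time. All data values are invented for illustration.
from collections import defaultdict

# (student_id, years_of_home_computer_access, test_score), several waves each
panel = [
    ("s1", 1, 60.0), ("s1", 2, 63.0), ("s1", 3, 67.0),
    ("s2", 0, 50.0), ("s2", 1, 52.0), ("s2", 2, 55.0),
    ("s3", 2, 70.0), ("s3", 3, 72.0), ("s3", 4, 75.0),
]

def within_estimator(rows):
    """Demean x and y within each unit, then fit the slope of y on x by OLS.

    Demeaning removes any time-invariant, unit-specific effect (family
    wealth, innate ability, ...) from both variables, so the slope reflects
    only within-student variation over time.
    """
    by_unit = defaultdict(list)
    for unit, x, y in rows:
        by_unit[unit].append((x, y))
    sxy = sxx = 0.0
    for obs in by_unit.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    return sxy / sxx  # score gain per extra year of access, net of fixed effects

print(round(within_estimator(panel), 2))  # -> 2.83
```

A cross-sectional regression on the same rows would confound access with the unobserved student effects; the within transformation is what the panel structure makes possible.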
A number of these multipurpose sets of panel data are collected by both private research groups and the federal government. Federal longitudinal studies include the Current Employment Statistics program3 and the National Longitudinal Surveys (Bureau of Labor Statistics, 1998).
The versatility and potential range of uses of multipurpose longitudinal studies are illustrated by the Panel Study of Income Dynamics (PSID), administered by the Institute for Social Research4 at the University of Michigan. The original purpose of the PSID, a longitudinal study begun in 1968 of a representative sample of U.S. individuals and the family units in which they reside, was to study factors influencing economic variables such as income, wealth, and earnings. Recently the PSID undertook five major initiatives: (1) studies of data quality; (2) a re-contact initiative; (3) coding of data on census tract, mortality, and relationships; (4) supplements on wealth and health; and (5) early file release through the Internet. These initiatives have increased the cumulative response rate of the PSID; demonstrated its continuing value as a representative sample of the U.S. population; added enormously to the stock of knowledge about important areas such as health and wealth; added detailed information on the residential areas in which respondents live, on mortality, and on relationships among family members; and increased the accessibility of the data to users. As a consequence of low attrition rates and the success of re-contact efforts, the sample size grew dramatically over the period, from about 7,000 core households to almost 8,700.
An extension of the PSID, the new parent-child survey component that will include time-use questions covering use of computers by children both at home and in school, illustrates the sort of valuable information on use and impacts of information technology that these extensive surveys can provide (see Box 3.4).
The Panel Study of Income Dynamics (PSID) is currently supplementing its core data collection with data on parents and their 0- to 12-year-old children: the PSID Parent-Child Survey. The objective is to provide researchers with a comprehensive, nationally representative, and longitudinal database of children and their families that will enable study of the dynamic process of childhood development.
The additions to the core data set include the following: (1) reliable, age-graded assessments of the cognitive, behavioral, and health status of 3,500 children (including about 550 immigrant children) in 2,500 families, obtained from the mother, a second caregiver, an absent parent, the teacher, the school administrator, and the child; (2) a comprehensive accounting of parental and caregiver time inputs to children as well as other aspects of the way children and adolescents spend their time; (3) teacher-reported use of time in elementary and preschool programs; and (4) measures of use of resources other than time (for example, the learning environment in the home, teacher and administrator reports of school resources, and decennial census-based measurement of neighborhood resources). (The survey questions may be found at the PSID Web site at ‹http://www.umich.edu/~psid/›.)
The data include those entered in two home-based time diaries for each child age 0 to 12, covering both school and nonschool days. There are also data from teacher-reported school-day diaries for about 75 percent of the children. The home-based and school-day diaries include a special coding for computer-related activities. In the home-based diary, children can report time spent with TV, video games, or computers. In the school-day diary, children's time can be recorded as having been spent with computers as the instructional mode. Among the parameters considered in the class time segments are the length of time spent; who was present; whether the teacher was with the child; whether the activity involved groups, the whole class, or only the individual; and the teacher's assessment of the level of the child's involvement. Collection and analysis of time-use data entered in a diary are an established and valid method for measuring actual time use; this approach gives more accurate results than do respondents' reports about their allocation of time to different activities over a week.
The Parent-Child Survey data will be released to the public in fall 1998, as soon as they are cleaned (erroneous or nonsensical data eliminated) and documented. The data collection will support studies of the ways in which time, money, technology, and social capital at the family, school, and neighborhood levels, as well as parental psychological resources and sibling characteristics, are linked to the cognitive and behavioral development of children. The researchers plan to reinterview the children in 1999, again including time diary measures of computer use.
3.1.3 Data from Time-Use Studies
Data obtained from time-use studies, which can take the form of cross-sectional, panel, or experimental studies, can help answer questions such as what people do with computers. Current information on that and related topics is
rather limited. One source of such data is a 1997 Price Waterhouse Consumer Technology Survey (Price Waterhouse, 1997), which asked 1,010 consumers how they spent their time using computers. Twenty-five percent of the consumers had Internet access from the home. On average, 43 percent of their time using computers was spent accessing the Internet for research; 34 percent for e-mail; 9 percent for game playing; 5 percent for reading online magazines and newspapers; 4 percent for online chat; 2 percent for online banking; and 1 percent each for two-way voice communications and online shopping.
Although these numbers are suggestive of how computers are used by consumers, they certainly are not definitive, given that they describe computer use only by adults at a particular point in time. Ongoing studies that examine users of different ages and from different population groups, along the lines of the parent-child time-use studies referred to above, would be very helpful. It might also prove useful to apply cluster analysis to discover patterns of usage that do not emerge from averages over predetermined income, class, age, or population groups.
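As a purely illustrative sketch of the cluster-analysis idea, the following toy k-means groups users by the shares of their computer time devoted to different activities instead of averaging within predetermined demographic groups. The usage shares, activity categories, and cluster count are all invented for the example.

```python
# Minimal k-means in pure Python: group users by usage profile so that
# patterns (e.g., "researchers" vs. "gamers") emerge from the data itself.
# All usage shares below are invented for illustration.

users = [  # shares of computer time: (internet_research, email, games)
    (0.70, 0.20, 0.10), (0.65, 0.25, 0.10), (0.60, 0.30, 0.10),
    (0.10, 0.15, 0.75), (0.15, 0.10, 0.75), (0.05, 0.20, 0.75),
]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic initialization (first k points)."""
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean distance)
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # recompute each center as the mean of its group (keep old center if empty)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers, groups

centers, groups = kmeans(users, k=2)
for g in groups:
    print(len(g), "users:", g)
```

On this toy data the two recovered clusters correspond to the research-heavy and game-heavy users, a split that a single population-wide average of the shares would hide.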
"Metadata" are data about data, such as compendia or collections of data sets. The Statistical Abstract of the United States (U.S. Bureau of the Census, 1992) and Historical Statistics of the United States: Colonial Times to 1970, Bicentennial Edition (U.S. Bureau of the Census, 1975) are two well-known examples. A vitally important part of the research infrastructure, these publications as well as metadata sites on the World Wide Web such as STAT-USA (‹http://www.stat-usa.gov›) are enormously helpful to researchers, teachers, journalists, and policy makers even though they generally do not present new data (Box 3.5).
Metadata are valuable because they are selective and authoritative; moreover, they provide a context that assists users in interpreting the data selected, together with warnings about pitfalls and possible misinterpretations and references to the debates and sources containing alternative measures. Compilation of metadata requires exacting research, intensive review, and refereeing by the nation's and the world's best experts.

There are some who suggest that compendia of data will no longer be useful or necessary in a computer-intensive future, when anyone with a computer and a modem will be able to download whatever data series is of interest nearly instantaneously and often without charge. These commentators may be correct about the future ease of access to data in digital format, but they are surely wrong to suggest the imminent obsolescence of compendia. Collections of compiled data will not become redundant when the entire Internet in effect becomes one gigantic repository for statistics. Indeed, with the decline in the cost of computer power one can expect the volume of available data to reach unprecedented levels. This avalanche of alternatives will make research tools like compendia more valuable, not less so.
Unfortunately, in recent years government statistical agencies, operating under severe financial constraints, often have given the development of new compendia or the maintenance of existing metadata sets a low priority. The Historical Statistics volume, for example, has not been updated or revised for a quarter-century; several years ago the Bureau of the Census abandoned its plans to take on this effort. Instead, the project has been taken up by a private publisher (Cambridge University Press) in collaboration with a team of more than 70 scholars whose volunteer contributions of time and expertise indicate how important the revision of Historical Statistics is to the research community.5
Both public and private foundations devoted to funding research have tended to resist underwriting the costs of preparing metadata sets, perhaps because they view such projects as mere digitizing and collating efforts requiring little or no scholarship or research.6 Although the Historical Statistics project may prove to be commercially viable, efforts to revise and update other less widely used data compendia may not be able to attract the aid of a private-sector sponsor. Collaboration between government statistical agencies and experts in academia and industry in preparing these resources might be facilitated by direct contracts and grants or through informal partnerships between the agencies and experts from the scholarly and business communities.
3.2 Availability of and Access to Data
In general, researchers must conduct their work within limited budgets and also face the need to preserve long-term continuity in studies while capturing rapidly changing phenomena. Owing partly to constraints on federal data gathering (see section 3.2.3 below), government and academic researchers have been relatively slow to refocus their data collection efforts on the emerging social and economic impacts of information technology. Myriad private-sector groups have responded more quickly to businesses' appetite for timely information about the technological challenges they confront. For example, private-sector market research organizations run household or consumer panels that administer monthly surveys. Particularly given the time lag required for careful analysis, investigation of rapid changes in peoples' responses to technology requires that social scientists have better access to data from a variety of sources. Social scientists need better access to each other's data as well as to information that is collected by the private sector and government.
3.2.1 Data Collected by the Private Sector
The significant private-sector resources devoted to data collection could be of great value to researchers and policy makers if properly leveraged. Trade associations such as the Semiconductor Industry Association gather detailed data on industry output, prices, employment practices, demand forecasts, and managers' key concerns. Rather than attempting to duplicate or replace these efforts, it would be useful to coordinate collection of data for research with the data gathering of industry groups. Resources could be pooled and greater cooperation fostered among participants. Respondents to surveys are most diligent, for example, when they can expect some return on their efforts. The prospect of obtaining feedback, typically in the form of aggregate results, is often an important incentive.
The fact that a private group is interested in gathering, or already has gathered, certain data suggests that the information is perceived by managers as having real value. Indeed, in some cases private-sector client groups may be interested in helping to disseminate the results of research to at least a selected audience, thus increasing the overall impact of the research. Often a consulting firm will broadly release at least a summary of a research study in order to bolster the firm's reputation, although the proprietary nature of the results may inhibit wide dissemination in their full form.
A major concern associated with the use of data collected by the private sector is that private firms often lack academic standards of quality control such as peer review. Consultants, trade magazines, and industry groups may be less than rigorous about survey design, sample selection, or other biases in the data, and as a result their data may be unreliable or misleading.
One approach to improving the reliability of findings is to use data from multiple independent sources to the extent possible. For instance, in his study of information technology and productivity, Lichtenberg (1995) drew on data from two distinct private-sector sources on firms' capital investment in information technology. Although the correlation was far from perfect, the overall econometric results were quite similar regardless of which data source was used, making the results more credible.
Another possible approach to ensuring quality is to work closely with private data-collection firms, although private groups generally want to keep data and results private and available for the exclusive use of clients. Nevertheless, in at least one instance, a team of researchers struck a bargain with a media group, according to which the research team was to design several annual surveys, supervise the sampling and data collection, and then conduct the analysis itself. The media group paid all the costs of this undertaking and turned the data over to the research team; in exchange, the researchers wrote a sequence of articles summarizing the latest publishable findings, which were then presented each year
in a special issue based on the surveys. Such collaboration is currently the exception, as is private-sector commitment to these sorts of long-term research endeavors.
Indeed, a difficulty in working with private-sector groups is that they are often focused on whatever topic is currently "hot"; yesterday's news, it seems, is of only academic interest. The practical result is that time series spanning more than a few years are hard to obtain, which hampers statistical analysis. Another drawback is that private-sector data collectors may well change the definitions used in surveys and the nature of the groups sampled; again, the focus is often on how new data relate to the latest management question, not on how recently collected data relate to past data. In fact, many such firms do not even attempt to preserve data for more than a year after collection, as Brynjolfsson and Kemerer learned when they sought to estimate the value that consumers placed on various software features so that they could determine how the quality-adjusted price of spreadsheets had evolved over time (Brynjolfsson and Kemerer, 1996).7
3.2.2 The Need for Firm-level Data
As observed by Ronald Coase more than 50 years ago, firms are the dominant way of organizing economic activity (Coase, 1937). Any complete understanding of the economic and social impacts of information technology requires examining activities at the level of individual firms. Unfortunately, there are significant gaps in the available data that describe this level, forcing researchers to make extrapolations from other types of data to try to answer important questions about the effects of technologies' use. For instance, one recent research study developed a theory of how companies' growth would be affected by new technologies but could only test it using industry-level data. As the authors lamented: "Each industry contains thousands to tens of thousands of firms, so it may seem odd to take industries as firms. Unfortunately, there are no firm-level data sets that span the economy" (Basu et al., 1997).
Although important insights can be gained from assessing industry-level data, trends at this level of aggregation may be quite different from trends at the level of the firm. For instance, income inequality could be increasing overall in the economy even if gaps in wage levels within every individual firm were being reduced (if, for example, firms "outsourced" noncore work while specializing in narrower functions). In fact, important questions about causality, learning, and lags in observed effects are best analyzed by studying a cross section of firms over time. One-time cross-sectional studies of firms will not suffice.
A few firm-level longitudinal data sets do exist, such as Standard and Poor's Compustat databases, which provide extensive financial data on publicly traded firms, including their sales, stock prices, and employment statistics. More detailed firm-level data sets have been assembled in Europe, such as the data set assembled by the Industriens Utrednings Institut (Industrial Institute for Economic and Social
Research) in Stockholm, Sweden, and data sets collected by the Institut für Wirtschaftsforschung (Institute for Economic Research) in Munich, Germany.
Many more finely focused firm-level data sets also exist for specific industries and purposes in the United States and abroad. One is the data set derived from a minimum wage survey of U.S. firms. Each of these data sets has proven useful for addressing certain research questions. However, very few of the firm-level panel studies include data on information technology or important organizational variables. In addition, the results of studies that examine smaller, more focused samples of firms within specific industries cannot be readily generalized to other industries or the broader economy.
An economical approach to assembling a broader firm-level data set is to build on existing data sets and link them together. For instance, to address questions about information technology and productivity, Brynjolfsson and Hitt (1996) combined data from Compustat with private-sector data from International Data Corporation as well as data obtained in their own surveys. This approach enabled them to identify a significant correlation between use of information technology and firm-level productivity that could not be discerned from conflicting case studies or coarser, economy-wide data.
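The linking strategy described above amounts to a join on a shared firm identifier. Below is a minimal sketch of that idea; the firm IDs and figures are invented, and no real Compustat or International Data Corporation data are involved.

```python
# Sketch of linking two firm-level sources by a shared identifier, in the
# spirit of combining financial data with separately collected IT-spending
# data. All IDs and numbers are invented for illustration.

financials = {  # firm_id -> (sales in millions, employees); source A
    "F001": (120.0, 800),
    "F002": (45.0, 300),
    "F003": (300.0, 2100),
}
it_spending = {  # firm_id -> IT capital stock in millions; source B
    "F001": 6.5,
    "F003": 21.0,
    "F004": 2.2,  # appears only in source B, so it drops out of the join
}

# Inner join: keep only firms observed in both sources, so every linked
# record carries both the production inputs and the IT measure.
linked = {
    fid: {"sales": fin[0], "employees": fin[1], "it_capital": it_spending[fid]}
    for fid, fin in financials.items()
    if fid in it_spending
}
for fid, rec in sorted(linked.items()):
    rec["sales_per_employee"] = rec["sales"] * 1e6 / rec["employees"]
    print(fid, rec)
```

The price of the inner join is sample attrition: firms present in only one source are lost, which is one reason linked firm-level panels are smaller than either parent data set.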
A number of lessons can be learned from prior work with firm-level data:
• Firms or business units are always changing and reorganizing, posing challenges for measurement and data collection parallel to those arising from changes in the set of individuals that constitute a household in studies such as the Panel Study of Income Dynamics. Of course, if they are followed, the spinoffs of such changes are themselves potentially very interesting (e.g., in the study of the formation of new enterprises).
• Firms need feedback. For instance, most firms greatly value information that enables benchmarking: knowledge of where one's own firm stands in relation to an aggregate of anonymous peers. The opportunity to obtain such information was the main incentive provided for respondents in the case of the Institute for Economic Research study. Respondents in such data collection surveys could even be given access to the database itself, although this approach can substantially increase the workload associated with maintaining the data.8
• Firms are often very heterogeneous. As a result, for some purposes it makes sense to focus on firms that have something in common, such as an industry group. In other cases, it may be best to seek multiple respondents from the same firm, each of whom may have a different perspective. Even when one individual is compiling the data for a firm, it must be understood that the information may derive from a set of individuals with knowledge of different functional areas of the enterprise.
Once firm-level data are compiled, they can often be usefully linked to data at other levels of aggregation, both higher and lower. For instance, industry- and
firm-level data can be combined to address such questions as whether productive use of information technology correlates with one type of organizational structure in retailing and a different one in high-technology manufacturing.
In other cases additional insights may be gained by combining firm-level data with finer-detail information about individuals in those firms. This approach (constructing a study covering a sample of firms that also includes data on a sample of individual employees in those firms) was used with remarkable success by Greenan and Mairesse (1996) to measure the correlation between computerization and productivity in firms. Although Greenan and Mairesse had complete production data for a large sample of French firms that enabled them to estimate a variety of productivity measures, they did not have any direct data on the extent of computerization at those firms. Instead, they combined the firm-level production data with data from a separate survey of individuals, which asked whether respondents used a computer at work and what the name of their employer was. They found that many of the firms in their first data set matched one or more individuals in the second data set. If a sampled employee used a computer at work, this was taken as evidence that the firm was more computerized than its competitors. Although matching the data in this way was very difficult and provided a fairly weak indication of the effects of computerization, the researchers were able to establish an overall positive correlation between computerization and productivity for the French firms.
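This style of match, linking individual survey responses to firm records through the employer name the respondent wrote down, can be sketched as follows. The firm names, survey responses, and the crude normalization rule are all hypothetical; real record linkage of this kind is far messier.

```python
# Sketch of matching a survey of individuals to firm-level records via
# reported employer names. Everything here is invented for illustration.
import re

firms = {"ACME SA": {"output": 50.0}, "DURAND ET FILS": {"output": 12.0}}

survey = [  # (employer name as written by respondent, uses computer at work?)
    ("Acme  s.a.", True),
    ("durand et fils", False),
    ("Unknown Corp", True),   # no matching firm record: silently unmatched
]

def normalize(name):
    """Crude normalization: drop punctuation, collapse spaces, uppercase."""
    no_punct = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", no_punct).strip().upper()

index = {normalize(n): n for n in firms}  # normalized name -> canonical name

for raw_name, uses_pc in survey:
    key = normalize(raw_name)
    if key in index:
        firm = firms[index[key]]
        # any matched computer user flags the firm as relatively computerized
        firm["computerized"] = firm.get("computerized", False) or uses_pc

for name, rec in firms.items():
    print(name, rec)
```

As the text notes, a single matched employee is a weak signal of firm-wide computerization; the sketch shows why the resulting indicator is binary and noisy rather than a measure of intensity.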
3.2.3 Data Collected by Government
The federal government collects a vast amount of data, much of which is readily available via printed or computer-accessible media. Among the advantages of federally collected data are its high quality and objectivity, its accessibility for use by the public, and its free availability as material in the public domain that can be used without raising intellectual property concerns. The FedStats Web site (‹http://www.fedstats.gov›), launched in 1997 by the Federal Interagency Council on Statistical Policy, is a directory of data collected by the U.S. federal government and available online.
Nevertheless, the availability of data has been curtailed in recent years due to budget cuts, government reforms, and policy changes. Overall federal government collection of data is restricted by both the statutory goals of the Paperwork Reduction Act of 1995 and Administration targets for reducing the burden of collecting information. Structural and regulatory changes have also reduced the availability of standardized, public data describing the telecommunications sector. For example, deregulation of the telecommunications industry has reduced the quantity and availability of data on telephony, and deregulation of terminal equipment (e.g., telephone instruments) led the FCC to stop collecting data on such equipment. In addition, following privatization of the Internet and the end of government funding for the NSF-run Internet backbone, data were no longer
available on the size and characteristics of Internet traffic.9 It is ironic that the communications industry, an object of intense scrutiny by policy makers, is more poorly measured now than in the past.
In ways that are important to setting communications policy, efforts to reduce the data collection burdens imposed by the federal government and to reduce the role of regulation in telecommunications are at odds with the need for good data on the telecommunications infrastructure and the changing nature of consumer use of new technologies. If social science researchers are to gain insights into what information technology-related changes are taking place within the home, how Americans invent ways to interconnect, and how access to new communications media can affect economic growth and civic participation, then more, not less, statistical data on the penetration and uses of media need to be collected, starting with use of the telephone (Box 3.6).
3.3 New Types of Data
3.3.1 Documenting the Effects of Technology Deployment
Many social institutions (schools, libraries, hospitals, municipalities) are going online. Institutions may collect and report basic measures of use, such as the number of times their online resources are accessed ("hits"), but often local resources are not devoted to using such information to systematically document the dynamics and the social and individual effects of system deployments.
Individual institutions typically lack the time, expertise, motivation, and perspective to document the dynamics and effects of change resulting from the use of information technology. Indeed, even the first step, measuring access to and use of online resources, is a nontrivial problem. Since each component of a Web page (a graphic, text, or other item) results in a separate hit, hits as a measure of use will give different counts depending on the details of the content's design. Collecting meaningful data on use, especially where cross-comparisons are to be made, depends on systematically defining such things as visitors, users, and the like.10
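The hits-versus-visitors problem can be made concrete with a toy server log. The log lines and the rule for what counts as a "page" are invented for illustration; real log analysis must also define sessions, handle caching, and so on.

```python
# Why raw "hits" overstate use: each page component produces its own log
# line, so hit counts depend on page design. Page views and unique visitors
# are more comparable across sites. Log entries are invented.

log = [  # (visitor_ip, requested_resource)
    ("10.0.0.1", "/index.html"),
    ("10.0.0.1", "/logo.gif"),    # component of the same page: another "hit"
    ("10.0.0.1", "/style.css"),   # and another
    ("10.0.0.2", "/index.html"),
    ("10.0.0.2", "/logo.gif"),
    ("10.0.0.1", "/catalog.html"),
]

PAGE_SUFFIXES = {".html"}  # count only page documents as page views

hits = len(log)
page_views = sum(1 for _, res in log
                 if any(res.endswith(s) for s in PAGE_SUFFIXES))
visitors = len({ip for ip, _ in log})

print(hits, page_views, visitors)  # 6 hits, but 3 page views and 2 visitors
```

Redesigning the page with more graphics would inflate the hit count without any change in actual use, which is exactly why cross-institution comparisons need agreed definitions of visitors and page views.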
Externally supported comparative research projects exploring the effects of technology deployments could be enormously useful to at least four audiences. Policy makers and citizens would be able to understand the benefits (and costs) of online access to information and online interactions with social institutions. Technologists and managers would be able to understand the effects of different technology configurations and deployment strategies. Scholars would be able to test and revise existing theories of institutional participation with new kinds of data. Future generations of scholars and citizens would be able to study this transition period, as institutions experiment with different modes of online service.
Consider a specific example. The Gates Library Foundation, established by Microsoft Chief Executive Officer Bill Gates, will provide $200 million, matched
BOX 3.6 Challenges of Collecting Data on the Use of the Telephone
To understand the difficulties in answering the simplest questions about the use of information technology, consider basic data on household use of telephony. Households that have telephone service constitute the conceptual basis for all measures of universal service. The most widely used measure is the percentage of households with telephone service, sometimes referred to as telephone "penetration."1 Yet this measure, though seemingly straightforward, can harbor multiple definitions, and studies designed to measure it are subject to errors.
Prior to the 1980 census, precise calculation of telephone subscribership (i.e., one definition of penetration) was of little concern. In the days of one phone, one household, one service provider, telephone penetration was traditionally measured by dividing the number of residential telephone lines by the number of households. As households added second lines and as the number of second homes increased, measurement based on the number of residential lines became subject to a large margin of error. By 1980, the penetration according to the traditional measure (residential lines divided by the number of households) reached 96 percent in the United States, whereas the number of households that reported having telephones in the 1980 census lagged at 92.9 percent.
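The divergence between the two measures is simple arithmetic. The figures below are invented for illustration (they are not the actual 1980 counts); the point is only that line counting overstates penetration once some households hold multiple lines.

```python
# Hypothetical counts, chosen only to show the direction of the bias.
households = 80_000_000
households_with_phone = 74_300_000
second_lines = 3_000_000  # additional lines in households already counted once

residential_lines = households_with_phone + second_lines

# Traditional measure: lines divided by households.
line_based = residential_lines / households
# Survey measure: households reporting telephone service.
survey_based = households_with_phone / households

print(f"line-based: {line_based:.3f}, survey-based: {survey_based:.3f}")
```

With any positive number of second lines, the line-based figure exceeds the survey-based figure, mirroring the 96 percent versus 92.9 percent gap described above.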
In 1980, the Federal Communications Commission (FCC) requested that the Bureau of the Census include questions on telephone penetration as part of its Current Population Survey (CPS), which monitors demographic trends between decennial censuses. For national studies, use of the CPS has several advantages: (1) it is conducted every month by an independent and expert agency, (2) the sample is large, and (3) the questions are consistent. Thus, changes in the results can be compared over time with a great deal of confidence.
Unfortunately, however, the telephone penetration results of the CPS cannot be directly compared with the figures on telephone penetration obtained in either the 1980 or 1990 census. Differences in the sampling and survey methodologies are a source of discrepancies.
Although the CPS is conducted every month, not all of the questions are included every month. Since the sample is staggered, the information that is reported for
1 According to the Bureau of the Census, "A household includes the related family members and all the unrelated persons, if any, such as lodgers, foster children, wards, or employees who share the housing unit. A person living alone in a housing unit, or a group of unrelated persons sharing a housing unit as partners, is also counted as a household. … The figures for number of households are not strictly comparable from year to year. In general the definitions of household for 1790, 1900, 1930, 1940, 1950, 1960, and 1970 are similar. Very minor differences result from the fact that in 1950, 1960, and 1970, housing units with 5 or more lodgers were excluded from the count of households, whereas in 1930 and 1940, housing units with 11 lodgers or more were excluded, and in 1790 and in 1900, no precise definition of the maximum allowable number of lodgers was made." (U.S. Bureau of the Census, 1975)
According to the CPS, "A household consists of all the persons who occupy a house, an apartment, or other group of rooms, or a room, which constitutes a housing unit. A group of rooms or a single room is regarded as a housing unit when it is occupied as separate living quarters; that is, when the occupants do not live and eat with any other person in the structure, and when there is direct access from the outside through a common hall. The count of households excludes persons living in group quarters, such as rooming houses, military barracks, and institutions. Inmates of institutions (mental hospitals, rest homes, correctional institutions, etc.) are not included in the survey." (U.S. Bureau of the Census, 1993)
any given month actually reflects responses over the preceding 4 months. Aggregated summaries of the responses are reported to the FCC, based on the surveys conducted through March, July, and November of each year. Also, the questions in the CPS were written long before the breakup of AT&T and reflect the realities of the monopoly era, when having a telephone also meant having service. But in the post-divestiture era encompassed by the 1990 census, the question "Is there a telephone in this house/apartment?" inadvertently focuses on the telephone as an instrument, when the real issue is the presence of telephone service. Therefore, one potential source of statistical bias stems from a literal response to this question. In the case of the census, a respondent with a telephone but no service could truthfully answer yes and confound the results with an upward bias; and since there is no follow-up to the census, the upward bias would go uncorrected. In the case of the CPS, follow-up questions and surveys may correct for this bias;2 however, they contain the potential for a downward bias. The follow-up, a telephone call repeated in subsequent months, will catch a household that originally had telephone service and lost it, but will not catch a household that did not originally have telephone service but subsequently received it; thus, the downward bias.
For the researcher, another difficulty is that the census is not strictly comparable with the CPS. The differences, some correctable and some inherent, result in a gap in the final numbers. According to the 1990 census, 94.8 percent of all households in the United States had telephones. However, CPS data showed penetration at 93.3 percent for 1990. This difference, which represents nearly 1.4 million households, is statistically significant and appears to indicate that the CPS may be on the low side of the actual penetration rate, whereas the census may be on the high side.
Collecting comparable data on the use of the Internet, e-mail, or other new information technology, and measuring the penetration of telecommunications in an increasingly heterogeneous environment, clearly present a substantial challenge. For example, reliance on CPS data would carry with it an inherent bias against the use of wireless and mobile services for telecommunications purposes. Address-based measurement excludes the presence of new wireless technologies if they are used as substitutes for wired service to the home.
2 The Current Population Survey includes households in the survey for the same 4 consecutive months in 2 consecutive years.
by $200 million in software provided by Microsoft, to connect the nation's public libraries to the Internet (Lohr, 1997). Since not all libraries will go online at the same time, the program offers the opportunity for a range of comparisons. Longitudinal studies could document changes in a variety of social welfare indicators (perhaps, for example, changes in circulation rates, civic participation, and consumer awareness) as a function of Internet access and use. More pragmatically, they could document how different deployment strategies (e.g., location of Internet stations; access and use policies; tie-ins with school, civic group, or municipal
government programs) were associated with patterns of use and effects. These findings would be extremely useful to the later-deploying libraries.
Similar longitudinal comparative studies should be designed and conducted to understand the dynamics and effects of increasing numbers of other social institutions going online. Projects that deploy new technologies, especially prototypes, should be encouraged to build in the capture of such data. Digital libraries, distance learning, and efforts to use information technology to enhance government services ("digital government") would all be valuable areas in which to incorporate study of the sociological and economic impacts of information technology.
3.3.2 Data on Social Interactions from the Internet
A great deal of social behavior is visible on the Internet. For example, one can see how many Usenet groups or public distribution lists exist on what topics, and what the level of activity is on each. Snapshots of publicly accessible social behavior could be captured and made available to social scientists studying group behavior on the Internet.11 It might even be possible to study the complete corpus of communication within a group of Usenet news or e-mail "listserves." However, collecting data from the Internet presents technological difficulties and may also raise legal questions.12
In addition to Usenet groups and e-mail lists, another source of data is illustrated by Kaminer (1997), who has used the UNIX logs of natural scientists to obtain data on their use of a variety of Internet features (telnet, FTP, and so on). Using a multivariate approach, he has shown that increased use of the Internet increases a scholar's research productivity (publications per year), other things being held constant.13
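The kind of multivariate, "other things held constant" analysis described here can be sketched on synthetic data. Everything below is invented for illustration (it is not Kaminer's data or model); the sketch uses the standard device of partialling out a control variable to recover the coefficient on Internet use.

```python
import random
from statistics import mean

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a simple regression on x (partialling out x)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(1)
n = 500
internet = [random.uniform(0, 20) for _ in range(n)]    # hours of Internet use
experience = [random.uniform(1, 30) for _ in range(n)]  # a control variable
# Synthetic "publications per year" with a built-in 0.15 effect of Internet use.
pubs = [1.0 + 0.15 * i + 0.05 * e + random.gauss(0, 0.5)
        for i, e in zip(internet, experience)]

# Frisch-Waugh: the multivariate coefficient on Internet use equals the slope
# of pubs residuals on internet residuals, after partialling out experience.
coef = slope(residuals(experience, internet), residuals(experience, pubs))
print(coef)  # close to the 0.15 built into the synthetic data
```

Recovering a known coefficient from synthetic data is a useful sanity check before applying the same machinery to real usage logs.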
Systematic longitudinal data on group behavior on the Internet would be a valuable resource for social scientists studying the formation and diffusion of electronic communities. Such data are in some measure the electronic equivalent of the town records that historians have used to document and understand 19th-century community formation and development. But these electronic data on Internet behavior are ephemeral. Unless they are collected and archived now, they will disappear, and researchers will have no systematic record of how group behavior on the Internet is growing and evolving over time. What databases should be developed to support research in these areas?
Tools used to analyze Usenet and e-mail data might also be applied to Web-based systems and emerging software systems for collaboration. Application of these approaches to new technologies will in some cases require a new focus on data collection methodologies. For example, methods are needed to factor out distorting artifacts such as the use of proxies to access Web resources or the activities of indexing robots.
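A minimal sketch of the kind of filtering described above, using hypothetical log records and deliberately naive robot signatures (real robot detection is considerably more involved):

```python
# Hypothetical records: (client_address, user_agent, path).
records = [
    ("10.0.0.1",          "Mozilla/4.0",  "/paper.html"),
    ("10.0.0.2",          "IndexBot/1.0", "/paper.html"),
    ("10.0.0.2",          "IndexBot/1.0", "/data.html"),
    ("proxy.example.org", "Mozilla/4.0",  "/paper.html"),
    ("10.0.0.3",          "Mozilla/4.0",  "/data.html"),
]

ROBOT_MARKERS = ("bot", "crawler", "spider")  # illustrative signatures only

def is_robot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)

human = [r for r in records if not is_robot(r[1])]

# Note the remaining distortion: a shared proxy makes many users look like one
# client, so counting distinct addresses still misestimates the number of people.
print(len(records), len(human), len({c for c, _, _ in human}))
```

Robot filtering removes one artifact but leaves the proxy problem intact, which is why the text calls for new data collection methodologies rather than simple log counting.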
3.3.3 The Internet as a Window into How Commercial Transactions Are Conducted
In his position paper "Electronic Interactions" in Appendix B of this volume, Paul Resnick suggests that the Internet would also permit study of a number of interesting topics in how commercial transactions are conducted. For example:
• Recommendations and referrals can help people to find interesting information and vendors. There is a need for continued research on techniques for gathering and processing recommendations (this is sometimes called collaborative filtering). Compilation of "grand challenge" data sets of recommendations would help this field advance.
• The structure of negotiation protocols and the availability of information about past behavior of participants will affect the kinds of outcomes that are possible. Economists have theoretical results regarding many simplified negotiation scenarios, but there is a need for interdisciplinary research to apply and extend these results to practical problems of protocol design.
• In the transaction consummation phase, much effort has focused on secure payment systems. Some transactions, however, require a physical consummation (mailing of a product, for example) and hence must rely on trust in some form. Research can explore the role of reputations in creating trustworthy (though not completely secure) contract consummation. Such transactions may also have lower transaction costs than secure payment systems, even in the realm of purely electronic transactions.
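The collaborative filtering mentioned in the first bullet can be sketched in a few lines. The ratings below are invented for illustration; real recommender systems use far larger data sets and more robust similarity measures.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"camera": 5, "printer": 1, "scanner": 4},
    "bob": {"camera": 4, "printer": 2, "scanner": 5, "modem": 4},
    "eve": {"camera": 1, "printer": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("ann"))  # "modem": bob, whose tastes resemble ann's, rated it highly
```

A "grand challenge" data set of the sort the bullet calls for would let competing techniques like this one be compared on common ground.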
3.4 Time And Tools For Gathering Data
3.4.1 The Time Required to Do Good Social Science
It is important to recognize that systematically gathering and analyzing social science data are very time-intensive tasks. The need for time can lead to difficulties in synchronization of attention to the object of study, information technology effects, and ways of studying information technology. Although information technology is developing at a very rapid pace, the speed at which social science data have been acquired has changed little in the last few decades. Analysis of quantitative data, after it has been acquired and prepared, has certainly been speeded up by the widespread use of computers, but analysis of qualitative data has been much less affected.
A few examples illustrate the time required for data collection and analysis.
• As previous research efforts have shown (see, e.g., Orr, 1990), it can take significant time for a researcher to enter a community, gain the trust of its members,
and begin to understand how community members interact with one another and with technology.
• The Homenet field experiment to understand how families use Internet technology was begun in 1994 and is still under way (Kraut et al., 1996; Kiesler et al., 1997). It has taken significant time to recruit families (both those who will receive technology and those who will serve as a control group), to acquire and deploy technology and train people to use it, and to administer questionnaires and conduct interviews and home visits. Data collection must be repeated at regular intervals in order to investigate changes over time. The systems that are deployed in the Homenet study are not yet obsolete, but they are aging. If researchers give participants newer technology, they compromise their understanding of how people's use of a particular technology changes over time. If they do not give participants newer technology, their findings, particularly any negative ones, can be dismissed because they were based on old technology.
• It takes at least 6 months after an interface is implemented, debugged, and regarded as stable to conduct a laboratory study of people's social behavior in responding to that interface (e.g., Sproull et al., 1996). Thus, interface designers can develop new interfaces more rapidly than researchers can collect data on social responses to each particular interface.
These observations suggest that some mismatch is unavoidable between technology development and social science research. Researchers studying the effects of widespread deployment of a new technology must wait for widespread deployment to occur. By that time, however, the technology is no longer new. For example, although electronic mail was invented in the 1960s, the first research on e-mail's effects on patterns of organizational communication was not published until 20 years later, when noticeable numbers of organizations were beginning to use it routinely.
In addition to lags imposed by the need to wait until technology is widely deployed, there are the normal delays inherent in social science research. Data collection and careful data set preparation take time, especially for large data sets.
Sometimes the mismatch between the development of technology and research on its effects can lead to criticism of studies for relying on older data, even though more recent data of equivalent quality were not available. Examples include the RAND Corporation's analysis of home access to computing, which is based on the 1989 and 1993 Current Population Survey data (Anderson et al., 1995), and Attewell's analysis of the effects of home computers on educational performance, which is based on data from the 1988 National Educational Longitudinal Study (Attewell and Battle, 1997). However, in many cases the mismatch does not matter. For example, researchers investigating the impacts of alternative models of investment in technology would find data on corporate expenditures on information technology enormously useful; such data would
certainly be treated as proprietary when current but after a relatively brief time would be considered obsolete for corporate decision-making purposes. In some cases, such as historical studies, older data is better. For instance, data about early adopters of technology can certainly be useful in better exploring advantages that may accrue to early entrants in an industry (the so-called first-mover advantage).
3.4.2 Appropriate Subject Pools and Instrumentation
In addition to benefiting from improved access to data (see section 3.2 above), social scientists exploring the impacts of information technology also would benefit from better access to appropriate subject pools for behavioral studies. Most university-based subject pools, which do not operate during exam periods and breaks, are best suited to short-term studies of the behavior of individuals. To study the effects of technologies designed for groups, researchers need access to groups. Creating groups composed of strangers in an experimental laboratory will not allow researchers to understand the long-term effects of technologies that require or cause substantial changes in organizational procedures governing how people work together.
In addition, social scientists need new instrumentation for more rapid data collection. Data collection may be an area for fruitful collaboration between social scientists and technologists exploring, for example, Internet-based survey and interviewing technologies. Use of these new sources of data would also require attention to the privacy concerns they raise.
3.5 Approaches To Meeting Requirements For Data
Based on discussions at the workshop and in position papers submitted by workshop participants, the workshop steering committee developed a set of approaches to meeting requirements for data needed to advance research on the impacts of information technology. Listed below, these approaches are intended as illustrations of ways to enable researchers, in concert with government and the private sector, to address the need for more extensive, more timely, and new sources of data.
• Making data related to the social and economic impacts of computing and communications available to the research community through a clearinghouse. A clearinghouse would provide documentation and archiving of data sets (but not an evaluation of data quality). It might be necessary to develop incentives for researchers to contribute data to this archive. Journals could make submission of data to an appropriate clearinghouse a requirement for publication. For example, journals publishing research in the biological sciences typically require authors to deposit genetic sequences and similar data in publicly accessible
databases such as the National Institutes of Health genetic sequence database, GenBank.14 The terms of federal research funding could also support efforts for a clearinghouse. For example, the National Science Foundation expects grantees to share research data with other researchers (with safeguards in place for the rights of experimental subjects and the like). Depositing research data in a clearinghouse could be an efficient way for researchers to satisfy this expectation. An explicit expectation that data be deposited in appropriate data banks could also be added as a condition of receiving grants.
In addition, a clearinghouse could encourage comparability, in both format and research methodology, across data sets and the reuse of data, especially if academic researchers and commercial data sources were to collaborate on defining standards.
A possible model for such a clearinghouse is the Inter-university Consortium for Political and Social Research located within the Institute for Social Research at the University of Michigan. Funded by subscribing member institutions, it provides access to a large archive of computerized social science data. A clearinghouse could also derive support from grant funding and charges for access to data sets.
Both the archiving and standards-setting functions would enable increased secondary use of data sets, which would of course depend on the social science community's ease of access to data in a clearinghouse. Joint work between social scientists and technologists could lead to building new kinds of data clearinghouses and new tools and techniques for making use of them.
• Exploring ways for researchers to gain access to private-sector data. Commercial data on firms' capital investment in information technology is of considerable value to researchers examining the social and economic impacts of computing and communications. Consultant, trade magazine, and industry group data is another valuable resource (see section 3.2.1).
Overall, however, several factors impede collaboration between researchers and the private sector. First, data on individual firms is often protected because of competitive concerns. One remedy is for social scientists and the commercial sector to explore aggregation and other ways of hiding individual corporate identity. Second, incentives for collaboration by private-sector firms typically are lacking, although one way of providing them is to establish an agreement whereby researchers who use private-sector data then make research results available in a useful form to the firm or organization that supplied the data. To both protect proprietary interests and increase incentives, strengthened institutional relationships between the research community and industry associations would be valuable.
It is important to note that private-sector sources of data may have a number of possible limitations, including a lack of consistent definitions and methods over time and the tendency of private-sector firms to preserve only the current information.
In many of these cases closer working relationships between researchers and the private sector can provide solutions (see section 3.2.1 for examples).
• Increasing data collection efforts by government. As described in section 3.2.3, deregulation and privatization can reduce the quantity and availability of data on telecommunications and computing at a time when more, not less, information is needed to guide policy decisions. Budget constraints and government reforms to reduce information gathering burdens also have reduced data collection. In addition, fewer resources have been available for analyzing data and for making the data publicly available.
Workshop participants noted that the loss of such data sources inhibits social science explorations of the social and economic impacts of computing and communications. At a minimum, government decision makers need to be aware of the cost of losing such data. In some cases they may be able to find other ways to gather valuable information. For example, additional questions might be added to the Census Bureau's Current Population Survey (CPS) to measure wireless phone or Internet use (see Box 3.6). The National Telecommunications and Information Administration took this approach in the study reported in "Falling Through the Net" (NTIA, 1995), which reported on computer and modem use and explored the demographics of telephone, computer, and modem use in terms of population density, ethnicity, age, and economic status. However, this approach has been taken only once, because it is expensive to add supplemental questions to the CPS.
• Exploring the development of new multipurpose data sets by the research community. To what extent can multipurpose data sets based on such techniques as user diaries prove helpful to researchers examining the social impacts of information technology? Careful observational methods have been critical to specific deep organizational studies in such areas as computer-supported cooperative work (e.g., the study of Xerox technicians described in section 2.3.4). However, such research has not typically relied on multipurpose public data sets. As the body of observational data grows, it may be possible to start development of such nonquantitative multipurpose data sets. A precedent is the Human Relations Area Files at Yale University, which consist of ethnographic extracts organized in various categories. Given a rich enough corpus of observational data in a given domain, it may prove both possible and valuable to invest in the creation of new, qualitative multipurpose data sets.
• Establishing stronger ties with industry associations to facilitate collaborative research. In general, proprietary concerns are likely to impede collaboration between academic researchers and private-sector firms. Yet to explore topics such as the relationships among organizational structure, the use of information technology, and productivity, researchers need access to firm- and process-level data that typically is not public. Nonpublic data relevant to other social and economic research includes details of pricing, employment, demand forecasts
and the like. Lack of experience in collaborative work between the two communities is another barrier.
Industry associations are a possible bridge between the communities, to allow each to benefit from the resources of the other. One approach might involve sponsorship of forums where academics and industry people can meet to discuss common interests. These events can create serendipitous opportunities for cooperation that could not have been predicted or planned in formal brokering. Industry associations might also help connect the academic and industry communities in more formal roles such as the following:
- As an intermediary that aggregates or otherwise makes proprietary data anonymous so that firms will be more comfortable about providing it to outsiders;
- As a matchmaker in bringing together industry and the research community to work on projects of mutual interest;
- As a depository for research results based on industry-provided data (if research reports are readily available to them, industries may be more willing to provide data); and
- As a sponsor of research on topics of interest to the membership.
Note that limited financial resources may place constraints on such collaboration. Since trade associations are unlikely to have in-house resources to cover the administrative costs of a research study collaboration, it may prove necessary to structure such a project on a break-even basis for the association.
• Exploring, in workshop sessions, uses of the Internet as a source of data on social interactions. As described in section 3.3.2, the Internet can provide a wealth of information on group and community behavior. It would be very useful to convene a workshop of technologists, social scientists, academics, and representatives of commercial interests to discuss and resolve such issues as the following:
- How to develop indicators of group behavior that are publicly available on the Internet;
- The feasibility of commercial services providing data such as those on use of their forums and chat rooms;
- Appropriate sampling and estimation procedures;
- Appropriate publishing and archiving procedures;
- Standards for data collection and exchange; and
- How to establish relationships with possible providers of information (e.g., search engines or newsgroup archives).
Such an endeavor would also need to address ethical and privacy issues associated with data collection, archiving, and reporting as well as the proprietary interests of commercial Internet services.
1. Many of them were developed for other users, but some have been able to provide useful input to social science studies of information technology. Multipurpose data sets are often collected by organizations dedicated to this task such as the Institute for Social Research (ISR) at the University of Michigan (see ‹http://www.isr.umich.edu/›), the National Opinion Research Center at the University of Chicago (see ‹http://www.norc.uchicago.edu›), and several other research organizations. Many such multipurpose data sets are maintained by organizations like the Inter-university Consortium for Political and Social Research (ICPSR, see ‹http://www.icpsr.umich.edu/›), which is supported by member-university subscriptions. ICPSR, located within the Institute for Social Research at the University of Michigan, is a membership-based, not-for-profit organization serving member colleges and universities in the United States and abroad. Data sets can be found online at ‹http://www.isr.umich.edu›.
2. For details see the project description and related documents available online at ‹http://www.INDEX.berkeley.edu›.
3. Data sets and reports are available online at ‹http://stats.bls.gov/cesprog.htm›.
4. Information on the Institute for Social Research can be found online at ‹http://www.icpsr.umich.edu›.
5. Work to prepare the new edition (Susan Carter, Scott Gartner, Michael Haines, Alan Olmstead, Richard Sutch, and Gavin Wright, editors, Historical Statistics of the United States from Colonial Times to the Present, Millennial Edition, Cambridge University Press, in preparation, scheduled for publication in 2000) has received partial support from the National Science Foundation and the Alfred P. Sloan Foundation.
6. A workshop participant noted that university and college promotion committees seem to give little weight to faculty contributions to such projects, perhaps because they, too, share the widespread misunderstanding of the value of metadata and the scholarly research required to create a metadata set.
7. Brynjolfsson and Kemerer knew that Software Digest, published by National Software Testing Labs (NSTL), a private firm, had regularly reviewed all the major spreadsheet products and conducted detailed feature evaluations. By matching these data with price data from Dataquest, another private firm, they could estimate the values that consumers placed on various software features.
The difficulties came in trying to get historical data. Only intervention by the president of Dataquest following a chance meeting at an industry dinner finally led the researchers to historical data on prices. Obtaining data on software features required a different sort of intervention. To save on storage space, NSTL had simply erased all information regarding previous years' spreadsheet products because there was no market for that information. Repeated queries to various managers of the firm, as well as efforts to find the back issues in libraries, were unsuccessful. Finally, a mid-level employee of the firm came to the rescue. On his own initiative, he had stockpiled back issues of Software Digest in the basement of his home, along with thousands of other magazines. He agreed to ship several large boxes with the relevant issues to Brynjolfsson and Kemerer, where they were duly re-entered into a computer database.
8. Dealing with people's inquiries for data can be time consuming because people often need different data formats, more detailed documentation, and follow-up explanations of what the variables mean; make requests for related data; and have other requirements that need attention. At a minimum, such database support activities involve one or more conversations with people at each participating company.
9. The NSFNET backbone was originally the core of the Internet, and thus much of the total Internet traffic passed through it, making useful measurement of total Internet traffic possible. When this backbone was replaced by a new architecture, data on total traffic became harder to acquire. Moreover, the backbone's architecture had been designed, in order to satisfy the traffic-measurement deliverables in the agreement with NSF, to allow more measurements than are possible with the off-the-shelf router components used post-NSFNET.
10. Web advertising is one area where an effort has been made to develop useful definitions of access and use (Novak and Hoffman, 1997).
11. For several years in the early 1990s Brian Reid, an employee of the Digital Equipment Corporation, collected and posted on the Internet data on Usenet groups. Each month he used a sampling plan to report estimates of Usenet readership and message traffic for all Usenet groups. Researchers were able to use Reid's data to track growth in overall group membership over time, track the relative popularity of different groups at any one point in time, or identify groups worthy of further study (e.g., Sproull and Faraj, 1995). Reid stopped collecting and reporting Usenet data in June 1995.
12. Reid's study of Usenet traffic is a case in point. One reason Reid stopped collecting Usenet data was that its relevance declined when the World Wide Web was invented and the ratio of quality material to junk declined markedly. Another reason was the evolution of the way Usenet data was distributed, which made the study's measurement techniques increasingly statistically meaningless. However, the ultimate reason for ending the study was not technological, but rather was related to the threat of legal challenges over privacy issues surrounding collection, analysis, and publication of the Usenet data (personal communication, Brian Reid, Digital Equipment Corporation, 1998).
13. This work replicated findings by Hesse et al. (1993) derived from analysis of electronic survey data.
14. GenBank, an annotated collection of publicly available DNA sequences, is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan, the European Molecular Biology Laboratory data library, and GenBank. More information on GenBank may be found online at ‹http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html›.