Read "Copyright in the Digital Era: Building Evidence for Policy" at NAP.edu

Suggested Citation:"4 Data Infrastructure for an Empirical Approach to Copyright Policy Research." National Research Council. 2013. Copyright in the Digital Era: Building Evidence for Policy. Washington, DC: The National Academies Press. doi: 10.17226/14686.

4

Data Infrastructure for an Empirical Approach to Copyright Policy Research

Although the empirical research described in the previous chapter suggests that independent research on the copyright system’s impact on creativity and innovation can provide significant insights for policy makers, the availability of such research is very limited; and for questions on which some research exists, it is clearly at an early stage of development. The paucity of independent research can be explained by many factors, but the committee’s deliberations repeatedly returned to one key bottleneck—the quality and quantity of data across all of the principal content media—books, movies, recorded music, newspapers, and software. Categories include data on such matters as the costs of production, marketing, and distribution; prices of products and quantities sold; ancillary sources of revenue for creators such as live performances; consumption behavior; patterns of access, including unauthorized access, to copyrighted works; licensing terms and the efficacy of licensing arrangements; and the costs and efficacy of anti-piracy technologies and legal enforcement measures.

The situation with respect to copyright is analogous to discussions of the impact of the patent system some 15 years ago. There was no paucity of theory, but the difficulty of subjecting these theories to systematic and detailed empirical analysis meant that the debates went largely unresolved. There was even widespread skepticism that empirical research was feasible, let alone useful. This state of affairs has changed significantly over the past two decades. Most importantly, a number of key data sources were made available or created, spawning a diverse literature on the operation and impact of the patent system. An important early effort was the establishment at the National Bureau of Economic Research (NBER) of the first publicly available patent dataset that incorporated both accessible patent citation data and links to Compustat data on individual firms (Jaffe and Trajtenberg, 2002). Extensive surveys of corporate R&D managers by researchers first at Yale University (Levin et al., 1987) and later at Carnegie-Mellon University (Cohen et al., 2000) provided the first systematic data on how patents are used relative to other means of creating competitive advantage in different industries. Public agencies such as the National Science Foundation and, in recent years, the U.S. Patent and Trademark Office itself, have taken further steps to expand patent-related data collection and analysis. A robust empirical research agenda in the copyright area will require data associated with the activities of very different stakeholders (originating artists, performers, and companies that publish and disseminate copyrighted works) as well as much more detailed user data that capture patterns of digitized material consumption and distribution across population groups.

The availability of systematic data and the emergence of a community of investigators able to identify the strengths and weaknesses of particular data sources for addressing particular issues were keys to an empirically oriented understanding of the patent system that has clearly influenced policy making in the area. The committee believes that creating a similar data infrastructure platform around copyright and enabling a community of investigators to study and engage directly in policy debates in the area of copyright would be immensely valuable.

Empirical copyright research has been undertaken in the past although not on a sustained basis. Issues similar to today’s debates about anti-piracy measures arose at the dawning of the digital age over two decades ago. With the advent of digital audio tape (DAT) technology, the record industry and the consumer electronics industry diverged on the need for government intervention. Both sides produced consumer surveys and studies supporting their points of view. The non-partisan Office of Technology Assessment (OTA), created to provide Congress with authoritative analysis of complex technical issues, sponsored theoretical, empirical, and survey research that addressed consumer patterns as well as the concerns about infringing use of home recording technology. Although the legislation growing out of this work—the Audio Home Recording Act of 1992, P.L. 102-563, 106 Stat. 4237—was soon eclipsed by more effective digital copying and playback technologies (e.g., computer ripping of audio files from CDs and MP3 players), the OTA studies, in particular its consumer survey, provided an objective basis for anticipating consumer behavior and evaluating policy options (U.S. Congress Office of Technology Assessment, 1989).

The analogy to empirical patent research has limitations. Unlike patents, copyrighted works have no comprehensive repository. Measuring their value using sales or usage data is challenging because such data are unknown, dispersed, or privately owned. Owing to the vast, decentralized, and often private nature of the data, the costs and benefits of the collection process are often difficult to know. In some cases, such as orphan works, collection is simply infeasible. Thus, before describing some types of research projects that might be profitably undertaken, we outline in this chapter both key opportunities and formidable challenges associated with acquiring and using data related to copyright and identify some promising data resources to support policy-relevant empirical studies.

OPPORTUNITIES AND CHALLENGES ARISING FROM DIGITAL TECHNOLOGY

Copyright policy is most contentious and in flux in the digital realm. The introduction of CDs, DVDs, MP3 files, user-generated content (UGC) websites, web-based content aggregators, and now streaming music and radio has created challenges for the interpretation and enforcement of copyright law not only in the music industry but also in other copyright-intensive industries such as newspapers, software, and film. Digital technology also enables rapid changes in the nature of consumption, which can expand rapidly in new areas and contract just as swiftly in others.

The implications for data collection are also profound. Most promising, the process of digitizing and digitally distributing expressive works generates a digital data trail that can then be used by researchers to study copyright policy. File-sharing is a prime example. By its design, file-sharing software requires an accounting infrastructure that keeps track of users connected to the system, including their location, operating system type and speed, as well as information on which files are being shared by whom in what way. These data are ostensibly public, although collecting, organizing, and making data amenable to systematic research takes considerable effort. Several studies have collected different chunks of such file-sharing data and used them to telling effect. Such direct comprehensive data-based analysis of music sharing would have been impossible in a world where users swapped CDs and purchased bootleg copies from local dealers.

Although infringing use of music has been the phenomenon most thoroughly studied using this digital data trail, it is not inconceivable that similar methods could be applied to other industries as they become increasingly digitized. E-books provide a prime example. In a world where readers increasingly consume written content on digital devices, usage data now exist that would have been prohibitively expensive to collect in the analog age. Software logs routinely collect information not only on sales of books downloaded from centralized repositories like Amazon, but also on whether and when a particular book was read, how quickly it was read, and so forth. Similar analyses could be done on e-magazines and blogs, where it is now possible to measure time spent on a particular article or blog post and click-through rates of particular hypertext links. In the context of streaming video, YouTube and Netflix collect data on user behavior including repeat consumption and the location and time of consumption. All of this information, if routinely collected by private and public entities and systematically organized, would be invaluable to the study of copyright in the digital age, as well as other aspects of the digital economy. Of course, proper use of these data will require taking steps to protect the privacy of consumers.

On the other hand, collecting such microdata for research remains a considerable challenge. Perhaps the biggest challenge lies in the fact that data about the creation, consumption, and distribution of digital media increasingly reside in the hands of private entities whose incentives diverge from those of researchers. Even if such data were available, constructing pseudo-experimental research designs places an additional burden on data when, as is usually the case, researchers are unable to directly run experiments. Finally, the problem of “free” goods is particularly salient in the digital domain. E-magazines and blogs are often free to read, free applications for smartphones abound, and free music and video are widely available. In such cases, it becomes hard to place a dollar value on such goods, compounding the difficulty of estimating consumer or producer surplus in these industries. This section highlights the practical and conceptual challenges inherent in the collection of digital copyright-related data and its use in carefully designed research.

Incentives of Data Owners

Data collection can be costly. Firms and industries have some motivation to collect such information in the pursuit of profit maximization and industry-focused advocacy. To out-compete rivals they will want to keep some information proprietary, but in some cases they will be open to selectively sharing data that will help their industry in policy advocacy. They might also design studies and surveys to shape public or political elite perceptions in ways that favor their policy agenda. The home recording controversy described earlier is a good example. What private data holders do not have at present is an incentive to act in concert to share data with researchers whose results they do not control.

These challenges will undoubtedly persist as the Internet and digital technologies continue to evolve. For that reason, we believe that the policy agenda must begin with a multi-faceted, robust, broad-based, forward-looking data collection foundation.

Challenges of Research Design

Even if some of the adverse data-sharing incentives of data owners could be negotiated, credible research requires well-conceived research designs. The ideal approach is to experimentally subject a treatment group to a particular policy while leaving another, similar “control” group untouched, then to estimate the impact of the policy using relevant outcome variables. This simple comparative approach would work if we could experimentally expose, say, half a population to an opportunity to engage in infringing use of copyrighted content. But this may not be feasible. We assume that the people we would observe engaging in infringement are likely those with a high level of interest in the work. However, research into whether this assumption is valid may be a threshold step in this inquiry. For example, it may be that some people access the work without authorization merely for the purpose of skimming, sampling, or other initial inquiry much as one would use a précis, index, or other aid. Gaining access to data while simultaneously implementing a credible research design is often a considerable challenge. Nevertheless, the more data collection is expanded, the more it will be possible to implement better research designs (Angrist and Pischke, 2008).
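
As a minimal sketch of the treatment-and-control comparison described above, the following Python fragment randomly assigns units to two groups and estimates a policy effect as a difference in mean outcomes. The function names, the seed, and the outcome values are hypothetical and chosen only for illustration; they are not drawn from any study cited in this chapter.

```python
import random
from statistics import mean


def assign_groups(unit_ids, seed=0):
    """Randomly split units into a treatment half and a control half."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def estimated_effect(treated_outcomes, control_outcomes):
    """Difference in mean outcomes between the two groups."""
    return mean(treated_outcomes) - mean(control_outcomes)


# Hypothetical study: eight people, half of them exposed to the policy.
treated_ids, control_ids = assign_groups(range(8))

# Hypothetical outcome variable (e.g., purchases per person), for illustration only.
treated_outcomes = [3, 5, 4, 6]
control_outcomes = [6, 7, 5, 6]

effect = estimated_effect(treated_outcomes, control_outcomes)
print(effect)
```

Random assignment is what justifies reading the difference in means as an effect of the policy; when researchers cannot assign exposure themselves, the same arithmetic no longer carries that interpretation without further assumptions.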

The copyright context may well be a source of pseudo-experimental comparisons. As a general rule, books and musical works published in 1923 are now in the public domain while some works produced a year later are not, making it possible for simple comparisons to provide important insight into the effect of copyright, although this may be complicated by the fact that there are often several editions of the same title. If copyright protection inhibits use, or if being in the public domain promotes over-use, then the works still under copyright protection should see less use. As useful as this insight may be, a researcher of course still needs data on usage or other outcomes of interest. In particular, careful research designs must reflect the fact that copyrighted material is heterogeneous and ensure that “apples to apples” comparisons are being made when the objective is to determine the impact of copyright law on the creation, diffusion, and use of those works.
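
The cutoff comparison described above might be sketched as follows. This is an illustration under stated assumptions, not a method taken from the report: the record fields (`title`, `year`, `uses`), the one-year window, and the usage counts are hypothetical, and keeping only the earliest edition of each title is just one possible way to handle multiple editions. The cutoff year is passed in as a parameter and uses the value given in the text.

```python
from statistics import mean


def earliest_editions(records):
    """Keep one record per title: the earliest edition (an assumed rule)."""
    earliest = {}
    for rec in records:
        title = rec["title"]
        if title not in earliest or rec["year"] < earliest[title]["year"]:
            earliest[title] = rec
    return list(earliest.values())


def usage_gap(records, last_public_domain_year, window=1):
    """Mean use of works just before the cutoff minus mean use just after."""
    works = earliest_editions(records)
    public = [w["uses"] for w in works
              if last_public_domain_year - window < w["year"] <= last_public_domain_year]
    protected = [w["uses"] for w in works
                 if last_public_domain_year < w["year"] <= last_public_domain_year + window]
    if not public or not protected:
        raise ValueError("need works on both sides of the cutoff")
    return mean(public) - mean(protected)


# Hypothetical usage counts, for illustration only.
records = [
    {"title": "A", "year": 1923, "uses": 120},
    {"title": "A", "year": 1924, "uses": 40},  # later edition of the same title
    {"title": "B", "year": 1923, "uses": 80},
    {"title": "C", "year": 1924, "uses": 60},
    {"title": "D", "year": 1924, "uses": 50},
]

gap = usage_gap(records, last_public_domain_year=1923)
print(gap)
```

Restricting the comparison to a narrow window around the cutoff is one way to keep the two groups of works similar in age, which is part of what an “apples to apples” comparison requires.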

Free Goods

The challenges of incorporating the impact of digital technology into GDP are particularly troublesome in the case of digital goods and services whose price is zero. To see why, consider the usual approach to adjusting for quality. Suppose that technical change has allowed the price of lettuce to fall from $3.50 to $2.00 from 2009 to 2010 and that demand is perfectly inelastic, i.e., the quantity remains constant. While the total nominal sales of lettuce would decrease from 2009 to 2010, we can easily make a price index adjustment. Using the 2009 prices for the same good, GDP would have been higher, and so we can use these quality-adjusted prices to characterize the impact of technical change on the lettuce industry.

If a good that formerly had a price becomes free, however, there is no procedure for incorporating it into GDP statistics. Suppose that in 2009, there were many sales of music CDs, but by 2010 consumers relied exclusively on infringing downloads, possibly in much higher volume. As customers download music without cost from the Internet in place of purchasing music CDs, both the price and quantity of music purchases disappear from GDP calculations. There is no simple price adjustment that will allow us to link the 2009 and 2010 distribution and account for the change in price. Instead, the entire category of music sales simply disappears from the GDP estimation. A concrete example is the decline in sales of printed encyclopedias, initially attributed to the rise of Encarta, which was recorded as a drop in GDP, while the rise of Wikipedia, which displaced Encarta, is absent from the GDP statistics. Similarly, there is no direct accounting in GDP for the rise of online media services such as the New York Times or Washington Post except for the indirect sales generated through advertising revenue. This mismatch in the quantity of digital output and its mis-measurement in copyright-relevant industries makes empirical analysis extremely hard to implement.
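
The lettuce adjustment, and the way it breaks down for a good whose price falls to zero, can be sketched in a few lines of Python. The prices of lettuce come from the text; the quantities, the 2009 CD price, and the rule that unpaid units go unrecorded are simplifying assumptions made only for this illustration.

```python
def nominal_sales(price, quantity):
    """Sales valued at the price actually paid in that year."""
    return price * quantity


def constant_price_sales(base_price, quantity):
    """Sales revalued at the base-year (2009) price."""
    return base_price * quantity


def recorded_quantity(price, quantity):
    """Assumption: market statistics record only paid transactions,
    so units obtained at a zero price are not observed at all."""
    return quantity if price > 0 else None


LETTUCE_UNITS = 1_000  # hypothetical; the text says only that quantity is constant

lettuce_2009 = nominal_sales(3.50, LETTUCE_UNITS)
lettuce_2010 = nominal_sales(2.00, LETTUCE_UNITS)
lettuce_2010_at_2009_prices = constant_price_sales(3.50, LETTUCE_UNITS)

# Music in 2010: the price is zero, so no transaction is recorded.
music_2010_units = recorded_quantity(0.0, 5_000)  # hypothetical download volume
music_2010_at_2009_prices = (
    None
    if music_2010_units is None
    else constant_price_sales(12.00, music_2010_units)  # hypothetical 2009 CD price
)

print(lettuce_2009, lettuce_2010, lettuce_2010_at_2009_prices)
print(music_2010_at_2009_prices)
```

For lettuce, revaluing 2010 output at 2009 prices restores the 2009 level even though nominal sales fell. For music, the same adjustment has nothing to operate on, because under the stated assumption neither a price nor a quantity is observed.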

Despite the formidable challenges of measuring the value of free goods, their increasing importance in many digital contexts requires that new research methods be developed and implemented. Contingent valuation, randomized control trials, and quasi-experimental settings are all potential methods for helping to determine what value consumers and other stakeholders ascribe to free goods on the Internet. Companies like Google have been measuring and benchmarking the impact of digital content. A website’s PageRank or reputation on the Web translates into how much attention or time it can expect to get from consumers, which translates into how much ad revenue it can demand from advertisers. These links fall short of scientific rigor, and it is debatable whether ad revenue captures all the values and if not, what the correct methodology should be.

Measuring the Impact of Digital Technology

Many of the topics in copyright policy research require measuring some aspect of the transition to a digital age. The measurement questions are central in some cases, secondary in others; but measuring the emergence of digital technology is an underdeveloped field attracting a level of effort woefully small in comparison to its social and economic importance. A very large scale government enterprise measures GDP, the flow of pecuniary goods and services. Nothing comparable in scale or sophistication is devoted to measuring the shifts toward digital goods and digital distribution.

The symptoms of underdevelopment are apparent in many aspects of U.S. policy. For example, the recently issued 360-page National Broadband Plan contains information from only a few statistical studies authored by neutral third parties, primarily academics. It contains little in the way of statistical analysis of the consequences of various policy options. This is not attributable to inadequate staff effort but reflects the inchoate state of economic research about digital infrastructure and digitization more broadly, in particular, the absence of an organized community of researchers with a large and well-developed body of knowledge.

So incomplete a data foundation would be unthinkable in other infrastructure contexts. Every congressional bill supporting transportation infrastructure, for example, is accompanied by a forecast for the economic growth it will generate and the number of jobs it will create. Nothing comparable can be done for legislation shaping the information infrastructure because there is not even a simple measure of the size of the digital economy nor any apparatus in place to project its growth.

Many initiatives to improve measurement of the digital economy were launched in the 1990s—at the Bureau of Economic Analysis, Census Bureau, Bureau of Labor Statistics (BLS), and the National Telecommunications and Information Administration. A few of these have survived, for example, a survey about the labor market for information technology workers, and an estimate of the scale of electronic commerce, called E-Stats. Others did not survive, however—for example, household and business surveys of broadband supply, adoption, and use.

Unlike in other developed countries, the best information about the online behavior of the U.S. population came not from a government-sponsored survey but instead from a private foundation, the Pew Internet and American Life Project. Although the Pew survey has been useful, especially in tracking social behavior online, its scale is limited, ranging from a little more than one thousand to several thousand households at a time. With these sample sizes the survey could only gauge general trends and gain some insight into their variance. It is incapable of achieving what the BLS survey, involving 80,000-100,000 households, does well: providing a picture of variance across populations in different regions with different gender, age, skill level, educational, and ethnic profiles.
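
To illustrate why sample size matters for subgroup estimates, the sketch below applies the standard margin-of-error formula for a proportion under simple random sampling. The split into 50 equal-sized cells and the two sample sizes (2,000 and 90,000 households, picked from within the ranges mentioned above) are illustrative assumptions; actual surveys use more complex designs.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion, simple random sample."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def households_per_cell(total_households, n_cells):
    """Households available per subgroup if the sample splits evenly."""
    return total_households // n_cells


N_CELLS = 50  # hypothetical number of region-by-demographic cells

small_cell = households_per_cell(2_000, N_CELLS)   # a Pew-scale sample
large_cell = households_per_cell(90_000, N_CELLS)  # a BLS-scale sample

small_moe = margin_of_error(small_cell)
large_moe = margin_of_error(large_cell)

print(small_cell, round(small_moe, 3))
print(large_cell, round(large_moe, 3))
```

Under these assumptions each cell of the smaller survey holds 40 households, with a margin of error of roughly 15 percentage points, while each cell of the larger survey holds 1,800 households, with a margin of roughly 2 percentage points.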

WHAT DATA ARE NEEDED AND AVAILABLE, ACCESSIBLE, OR COULD BE CREATED?

Public discourse about copyright would benefit from a range of innovative institutions contributing to measurement efforts. What types of publicly accessible databanks would contribute to research efforts? What standards for data in this area would contribute to building further research? What data remain locked in proprietary vaults but could be unlocked by a standard process for protecting privacy while informing research? What is not being systematically measured but could be?

Assessing the health of the copyright system requires, at a minimum, documenting both the supply side and the demand side of the market for each content area—books, movies, recorded music, newspapers, software, etc. On the supply side, this means determining the number of products, and new products, available in each year, and the prices of each of the products. Generally a harder task is quantifying the consumer side of the market, not only the quantities sold but also the amount of use that each product gets. Harder still, but vital for answering important policy questions, is ascertaining the volume of unpaid use of each product over time. Because many copyright industries derive much of their revenue from ancillary activities, it would be useful to know about revenue flows to producers from these activities, including, for example, live performance revenue for musicians and speaking fees for authors. With data of these sorts, one could begin to address the following questions: What has happened to revenue? To what extent has unpaid consumption displaced sales? What has happened to the flow of new creative works?

To study the role of each agent in the digital economy—creator, marketer, distributor, and consumer—three categories of data are needed. These include data that are currently available to the public but not extensively studied in the context of digital technology; data that exist but for whatever reason are not available to the general public; and data that do not currently exist but can be created.

Existing Accessible Data

We have found a wide range of data sources from government agencies to private institutions that can be used to measure the impact of copyright in the digital age. Most of the data are published on an annual or quarterly basis, although a few reports have been released on a one-time basis. First, we will examine data related to Internet use in general from public and private institutions. Then we will look at the relevant data sources for digital copyright in particular. See Table 4-1 for an annotated bibliography of these data sources.

The most comprehensive public domain report on the behaviors and demographics of Internet users is the Federal Communications Commission’s High-Speed Services for Internet Access, which focuses on the status of broadband in the United States. It shows the number of consumers connected on broadband through DSL, cable modem, FTTP, and satellite. The report further breaks down each population group into seven tiers both in terms of upload and download speeds. It also includes a geographical mapping of connection speeds on a state-by-state basis.

The Pew Research Center publishes an annual report that shows the number of Internet users by gender, race, age, household income, education, and community type. This report includes data on broadband and wireless penetration as well as the percentage of Internet users who carry out certain activities online such as reading the news or playing games. Together, the Federal Communications Commission and Pew reports describe some aspects of the user dynamics of the digital world and could be used to model different aspects of consumer behavior online.

Private firms collect a great deal of information on products, prices, and volumes of paid consumption (see Table 4-2). Nielsen, for example, collects very detailed data on the quantities of books and music recordings sold as far back as 2001 in the case of books and the 1990s in the case of music. Nielsen also conducts a quarterly survey, the A2/M2 Three Screen Report, that tracks the penetration of broadband, HDTV, DVR, and smartphones. In addition, the report contains the number of users for and the hours spent on TV, Internet, and mobile phones broken down by age demographics. Although some researchers have gained access to Nielsen data, they have not been widely used because of the restrictive terms on which they are available.

Movie box office revenue data are available from the Internet Movie Database (IMDb) and Box Office Mojo, among other sources. Information on sales of discs is available from Opus and other providers. The RIAA now provides substantial data on its member companies’ current and historical sales activity.

Perhaps the biggest void is data on the volume of unpaid consumption, yet that, too, is changing. Big Champagne has tracked the popularity of copyright-protected works through unpaid distribution channels for a decade. And Google’s recently developed Transparency Report portal provides real-time and historical data on take-down and user data requests.

Another regularly published report, by the International Data Corporation (IDC), shows the size and growth of digital data over time. The numbers include both historical values and future projections as far into the future as 2020. The report shows the cost of information management, the percentages of Internet data that require various levels of security, and the number of people using social networks. The State of the Internet is a quarterly report published by Akamai that provides country-level Internet data. The statistics include Internet attack traffic, average connection speed, and number of unique IP addresses. The same data is available on a state-by-state basis for the United States.

TABLE 4-1 Data Requirements for Copyright Analysis: An Illustrative Framework

	Supply	Demand
Music	• data on new records, music tracks including professional, semi-professional, and amateur recordings • number of concerts (with details on venues, capacities, etc.) • information on quality of new music recorded • copyright status of recorded work	• number of new tickets sold • music video plays on YouTube and elsewhere • radio airplay and listening times (including online streaming services like Pandora and Spotify) • record and Internet sales data • data on unauthorized use
Performance Artists	• information on the careers, activities, and income of dancers, performers, musical artists, etc.	• information on the consumption of artistic performances of various types, and the impact of digitization on that.
Original artistic productions	• information on the careers, activities, and income of originating artists including fine artists, architects, designers, sculptors, etc.	• information on the consumption of art by museums, collectors, galleries, corporations and the general public
Scientific papers and research reports	• data on scientific researchers • data on the activities and finances of scientific publishers	• data on use of prior research by scientific researchers, by professional practitioners who rely on scientific findings (e.g., physicians), and by the general public
Movies	• data on new movies, video clips released • quality measures of new video content • copyright status of recorded work	• data from videos taken down from YouTube, Ustream and other video content sites • cinema attendance numbers • home movie watching including internet purchases, video rentals, streaming movie services, set-top box consumption, etc. • data on unauthorized use
Software	• data on the amount of software produced, value of such software, and its diffusion in various formats	• data on the use and extension of software by users in both the private and public sectors • data on user-generated software, including software from open source movement
Content	• data on publication of new content, by publication type (magazine, newspapers, blogs, websites, etc.) • copyright status of work	• readership figures • time and money spent in consuming content • ad revenue for publishers • data on unauthorized use

Some copyright data from government and academic institutions have not yet been analyzed. The online U.S. Copyright Office Database contains roughly 20 million records of works registered since 1978 by creators of books, music, films, maps, software, etc. Each record contains the date of creation, date of publication, and basis of the copyright claim. Pre-1978 Copyright Office records are being digitized back to 1923. Another source of copyright data is the Stanford Copyright Renewal Database, which contains renewals of copyrighted books between 1950 and 1992. Each record shows the title, author, renewal date, and renewing entity.

Another category of government data, important for understanding copyright enforcement, consists of civil infringement suits filed in U.S. Federal District Courts and criminal prosecutions for infringement. These filings provide a record of plaintiffs, defendants, and judgments for cases that proceed through litigation. A private firm, Lex Machina, is preparing copyright litigation data in a form that should be useful to researchers.

We have also identified data in the private sector that can advance our understanding of the impact of copyright laws. The RIAA publishes an annual Music Consumer Profile report that estimates the market size

TABLE 4-2 Existing Data Sources and Stakeholders

Agents	Database Name	Source	Frequency	Description	URL
Consumers	A2/M2	Nielsen	Quarterly	Analyzes consumer behavior on video-related media including TV, Internet, and mobile phones. Discusses what consumers watch, how much time they spend, and how trends are changing.	http://en-us.nielsen.com/main/insights/nielsen_a2m2_three
Consumers	The Diverse and Exploding Digital Universe	IDC	One-time	Calibrates size and growth of digital data through 2011. Also explores the impact of scientific industries as well as the environmental footprints of digitization.	http://www.emc.com/collateral/analyst-reports/diverse-exploding-digitaluniverse.pdf
Consumers	The Digital Universe Decade	IDC	One-time	Estimates the size and growth of the digital universe through 2020. Also looks at the cost to manage information, security issues, and the prevalence of social networks.	http://www.emc.com/collateral/demos/microsites/idc-digitaluniverse/iview.htm
Consumers	High-speed services for Internet access	FCC	Semi-annual	Provides summary of subscribership data filed by annual providers of high-speed services. Includes details about subscribership differences among census tracts.	http://www.fcc.gov/wcb/iatd/comp.html
Consumers	Survey data	Pew Internet	Annual	Shows the current demographics of Internet users and the activities they do online. Describes the frequencies of Internet activities.	http://www.pewinternet.org/Data-Tools/Download-Data.aspx
Consumers	Soundscan, Social Media Report, Television Report	Nielsen	Ongoing	Overall sales and viewership figures on a variety of media platforms, including CD, DVD, consumption of social media, etc. Scattered across multiple reports and Nielsen channels.
Creators	Copyright records	U.S. Copyright Office	Ongoing	Catalogs all registered books, music, art, periodicals, and other works. Includes the date of creation, basis of claim, previous registration, and claimant.	http://www.copyright.gov/records/
Creators	Copyright renewal database	Stanford University	One-time	Provides searchable copyright renewal records for books published between 1923 and 1963. Contains information on renewing entity, renewal date, and registration date.	http://collections.stanford.edu/copyrightrenewals
Distributors	10-K and 10-Q reports	Media distribution companies	Annually and quarterly	Includes financial data such as net income, revenue, and cost of goods sold. Also discloses special events such as CEO departure, bankruptcy, and business risks. Available on company websites and from financial information services.
Copiers	Digital Music Report 2010	IFPI	Annual	One measure of the incidence of global music revenue and the impact of unauthorized use across different domains. Imperfect sales suggest that the decline in global music revenue is a result of unauthorized use with certain regions suffering more than others.	http://www.ifpi.org/content/library/DMR2010.pdf
Regulators	Music Consumer Profile	RIAA	Annual	Provides benchmark on genre, format, age, and gender of music consumers. Estimates the overall size of the music industry.	http://www.riaa.com/keystatistics.php

Agents	Database Name	Source	Frequency	Description	URL
Regulators	The State of the Internet	Akamai	Quarterly	Includes data gathered across Akamai's global server network about attack traffic, connection speeds, Internet penetration, and broadband adoption. Also aggregates publicly available news and events.	http://www.akamai.com/stateoftheinternet
Regulators	Fair Use on the Internet	Library of Congress	One-time	Assesses the merits of the fair use argument for actions on the Internet. Highlights the difficulty in creating a general guideline for fair use on the Internet.	http://www.fas.org/irp/crs/RL31423.pdf
Researchers/Inventors	Web of Science	Thomson Reuters	Ongoing	Scientific publications and citations	http://thomsonreuters.com/products_services/science/science_products/a-z/web_of_science/
Researchers/Inventors	USPTO Patent Database	USPTO	Ongoing	Scientific publications and citations	http://patft.uspto.gov/

for the industry. The market figures are broken down by genre, format, age and gender of consumer, and channel of sales. In addition, music recording companies publish annual financial 10-K reports that contain profit margins and revenue numbers. The same is true of publicly held companies in other copyright-intensive industries such as film, publishing, and software. These data can shed light on the stakes involved for copyright regulation.

Lastly, there are one-time reports published by government institutions and special interest groups to address the issue of digital copyright. The Library of Congress published the Fair Use on the Internet report in 2002, which contains a list of court cases that can help define what is considered fair use and what is not. The International Federation of the Phonographic Industry (IFPI) Digital Music Report 2010 estimates the revenues lost due to music infringement in select countries around the world. Estimates include global revenues for games, music, films, newspapers, and magazines. The report also provides a list of legal music providers for each country.

Existing Data with Limited Access

Massive amounts of copyright-related data exist but are not readily available for public use, for multiple reasons. For example, the records of customer purchases on eBay or Amazon.com can be used to study online consumer behavior. Due to privacy issues, these data are not easily accessible by research institutions and have limited use even for keepers of the data. Another example is the volume and content of peer-to-peer file transfers that take place over the Internet. Some of that information exists on peer-to-peer network servers that operate in questionable legal realms and some on individual personal computer hard drives. For these types of data, the first challenge is simply to identify the sources; the next is to overcome the legal barriers to access and agree on protocols to protect privacy; and the last is to aggregate the data in one place.

Currently Non-existent Data

A full understanding of the digital economy will eventually require collection of additional data that currently do not exist. These data may not be quantitative or even quantifiable. In the Internet realm, with little control and regulation, the data collection process presents many technological challenges. Examples of such data of interest include systematic measures of copyright enforcement, radio playlists for all stations, and licensed use of musical works in television and movies.

Closing the Gap

We have three suggestions to advance research to inform evidence-based policy making. First, we need to attract social science researchers’ attention to the questions we have identified. By forming a cohort of researchers from a wide variety of disciplines and by supporting them with a robust and comprehensive data infrastructure we can make significant progress on a wide variety of policy issues relevant to copyright.

Second, public and private grant-making organizations should fund efforts to build the data infrastructure that research in this area requires. They could convene a representative group of researchers, for example, under the auspices of NBER, to further identify, characterize, and prioritize data sources. Funding agencies could then assist researchers in negotiating access to such data and in some cases fund their acquisition from industry stakeholders, perhaps through a research consortium. In many cases, private firms hold data that may be recent enough for some research purposes but obsolete commercially. They might be induced to release these to researchers on a rolling basis.

Third, as we have observed, the federal government needs to expand the collection of data on the digital economy as well as on intangible assets such as intellectual property holdings and their use. This should take several forms. First, agencies such as the Bureau of Labor Statistics and the Bureau of the Census should consider adding copyright-related information to regularly conducted surveys of businesses and consumers. One prime example would be revising the Bureau of Labor Statistics Time Use Survey to address questions of digital consumption in a contemporary way. In the current survey, there is no measurement of time spent listening to music exclusively rather than in combination with other activities. Although private sector sources of data are important, as we have noted, there are significant limitations of current surveys, and the availability of such data is limited for researchers. The Bureau of Economic Analysis of the Commerce Department has very limited resources to acquire the types of business data described above that could be extremely useful in understanding the landscape of intangible assets.

The committee proposes a more ambitious approach. Agencies such as the Bureau of the Census, Bureau of Economic Analysis, National Science Foundation, U.S. Patent and Trademark Office, and the Copyright Office should form an interagency group that, along with expert advisors, would study the advisability and feasibility of an ongoing and systemic national business survey of intellectual property. Like the Business R&D and Innovation Survey (BRDIS), the IP survey would include samples of businesses in the service and manufacturing sectors. It would probe uses (e.g., licensing) and holdings of intellectual property and costs of acquisition and maintenance. Because of the nature of the production of digital goods, including the prominence of user-generated content, the business survey should be complemented, if at all feasible, by a detailed consumer survey of user-generated content and use. This would include, among other things, measurement of the amount of production and distribution of digital content by non-business entities (i.e., by users), and also measurement of the consumption of such content by both business and the population at large.

Unlike BRDIS, these surveys could be conducted periodically, such as every five years. The Census Bureau or the National Science Foundation would issue periodic reports of aggregated data, but detailed data would be available to qualified licensed researchers on the same basis as other business confidential information, through the Census data centers. Such surveys could never answer all of the research questions we pose in Chapter 3 but would be a considerable advance on the status quo, greatly contributing to our ongoing efforts to better understand the stock and flow of intangible assets in the economy.

We cast this proposal as a study recommendation because of the constraints of our charge and limitations of our expertise. Although a survey would be especially important for understanding copyrights because of the lack of a formal registration requirement, it would make little sense to mount a survey of copyrights alone, neglecting patents and trademarks. Nevertheless, other forms of intellectual property are outside our statement of work. Equally important, we are not in a position to judge two very important considerations that could render either or both surveys impracticable: the burden they would impose on respondents (e.g., the need for businesses to conduct patent and copyright searches) and the resources needed by agencies charged with carrying them out. The federal statistical agencies generally are tightly budget constrained and are having to cut back activities.

The gap between what would be ideal in terms of data requirements for a thriving research agenda around copyright and what exists currently is large. Building easily accessible and comprehensive datasets relevant to the study of copyright-relevant industries is crucial for the development of a research community based around copyright issues. We hope the categories of data described in this chapter will help focus efforts to obtain and create high quality datasets for addressing some of the key policy questions described in this report.
