Data for Science and Society: The Second National Conference on Scientific and Technical Data

U.S. National Committee for CODATA
National Research Council
Improving the Data Policy Framework


Obtaining Descriptive Data to Describe Database Use and Users:
Policy Issues and Strategies

Charles McClure

     My job this afternoon is to talk a little bit about problems related to obtaining descriptive data and understanding what these databases are, who is using them, and how they are using them, and to try to convince some of you who are producers of these databases that we need your help--"we" being the people who purchase and use the databases that you have been talking about today. There are two questions I specifically want to address.

     The first is access to and use of scientific and technical (S&T) data so that interdisciplinary basic and applied research can be improved. How can access be improved? The argument is that access cannot be improved until we have a better sense of who uses our databases, how they use them, for what purposes, and so forth. The other question is, How do we measure and evaluate productivity and performance, and the management and use of these S&T databases, not only across disciplines but also across different user sectors? This is a very difficult problem. The fact of the matter is that right now we are very limited in our ability to describe who uses what kinds of databases, and we really cannot do much comparison between or among use of different databases. What are some of the topics of interest here?

     Are there core sets of use and user data that database vendors should be providing? I use the term database "producers" for both the public and the private sector. The same arguments that we can make for one of the private databases, we can make, for example, for the U.S. Geological Survey. How do use and user data affect our database pricing? What is the cost per user, cost per session of some of these databases? Is it whatever the market will bear? How can use and users of databases be compared across different database vendors? Another question focuses on the range and type of user and use statistics that are needed to assess and measure productivity. This is important because from the library perspective, for example, decisions have to be made about what databases get purchased and what databases are made available to users. In the academic world, there is a finite pie of monies that can be dedicated to the purchase and use of these databases.

Figure 12.1


     Figure 12.1 identifies three key problems similar to the issue areas. Many of the database producers assume they know the users and what the uses are. In a number of research projects with which we have been involved, these assumptions usually are wrong unless there are very small, discrete bodies of scientists. The third issue--how can use and users of databases be compared across different database vendors--will be addressed later.

     For the moment, let us talk about network resources and use that terminology. The Association of Research Libraries (ARL) includes about 120 or 130 of the largest academic and research libraries in the United States and Canada. They are very concerned about the limited ability to obtain a range of data from vendors and producers about who is using these databases and what the costs are, and about getting some commonality among definitions and terms. A recent ARL report defined a networked information resource as a commercially available, electronic information resource (e.g., library database, full-text service, e-journal) funded or enabled by the library, which is made available to authorized users through a network (e.g., LAN, WAN, dial-in).1 One database provider might use the term "session." What is a session? Another database vendor might use the term "log-ins," and another one might use "hits" or something else. There is a lack of commonality on what these terms are and what they mean.

     So what are some of the key problems? Paul Uhlir encouraged me to talk about opportunities rather than problems, but the fact of the matter is that there are problems (see Figure 12.1). First, the people who purchase and use these databases oftentimes do not control the data, which is very different from the old days. Libraries, for example, could count circulations. That is not true with databases. Second, accurate, reliable, and longitudinal data that describe use, users, accesses, downloads, sessions, turn-aways, and other related data are often difficult, if not impossible, to obtain from various vendors. Then if data can be obtained from some database producers or vendors, they frequently are not comparable. An access is different from a log-in, which is different from a visit, and so on.

     What are some impacts of these problems in terms of policy and policymaking? The first problem is that the inability to describe use and users accurately injures our ability to have a base set of data from which we can describe and compare use among these databases. Some of you who are producers of databases would like other people to use your databases. Without adequate software and reporting techniques, we have no idea of who uses these databases or of how they could be improved. Secondly, these problems hinder our ability to specify and analyze a whole range of policy issues, which then results in policymaking by opinion. These policy issues include copyright, breaches of security perhaps, and pricing--to name a few. Third, pricing of databases is a very hot topic; not all databases, as we know, are free. Purchasers find it increasingly difficult to justify the costs for databases in terms of use or impact. There are also intellectual property rights, privacy issues, data protection, and so forth. Basically the lack of an empirical body of knowledge that describes database use and access impinges on our ability to have intelligent conversations about a range of policy issues.

     Now, let us talk about why vendors may not provide data. First, it might be that database producers do not produce such data and statistics even for themselves. In these instances, they know not what they don't know, and some of these folks are easily educated. A second reason is that licensing agreements with individual organizations do not require database vendors to produce such data. For those of you who have not lived in the licensing world, licensing agreements with database vendors can be very interesting. Some licensing agreements require a lawyer to understand what you are allowed and are not allowed to do.

     A number of purchasers of databases now include language in the licensing agreement that says, "I expect you to provide for me on a regular basis these specific types of data on use, users, downloads, turn-aways, whatever." The database people sometimes will say, "Okay, we can do that," but they don't ever do it; alternatively, they can say, "Forget it, we are not going to do that." So in licensing agreements, oftentimes we have people who are very sophisticated making these agreements with purchasers who are very unsophisticated.

     Another point, in defense of database producers, is that there are no agreed-upon definitions for many of these terms, such as turn-aways, sessions, and hits. There are games that we play with these terms and so forth. Until we get some agreed-upon definitions, licensers may agree to provide you with certain data, but the data you end up getting are not what you thought you were going to get. For example, what you thought you were going to get as a session count turns out to be something else.

     Some other reasons that vendors do not provide data include the manner in which databases are organized. Some databases were created technically in such a way that it is very difficult for them to actually produce those statistics. Another problem is that libraries and other people who have these databases have competing kinds of data that they want vendors to provide, and vendors and producers of these databases correctly say, "Wait a minute, I cannot provide 20 different reporting formats and types of data to 20 different types of requesting organizations." One of the issues that the ARL is working on is for the libraries to come together and say, "Here is what we mean when we want counts of sessions, of log-ins, of turn-aways, or whatever." For those of you who are producers of databases in the scientific and technical area, I assure you that there will be increased pressure in the near term to provide data to better describe who is using which databases, what they are using, and the various kinds of counts that I mentioned earlier.

     Of course, another reason not to provide the information is that vendors may consider these data as proprietary. "I don't want you to know who is using my database because another company will take that as a competitive advantage and use those data against me to build or market their database." There is a whole set of reasons why this happens, and purchasers do need to remember that the database world is very competitive.

     In the library context, all of the traditional kinds of counts such as circulation are oftentimes stagnant or declining in many cases, whereas anything that is electronic or Web based--anything that is downloaded from databases--is increasing astronomically. We know this, but we are unable to count it well. Part of the reason libraries and other users of these databases cannot produce such statistics is that we have yet to work out how best to collect them and how to agree on what some of these definitions are.

     Let's turn to measuring some of the impacts and benefits. We need to know what it is we want to measure from these databases. We then have to agree on how we want to measure it. We ought to have a good reason for measuring it. We need to know by whom the data are going to be used, and I think one of the issues here is that the need for this kind of data is very much lodged within this economic context. There is a finite pie here of how much money can be spent on purchasing databases. If you are a library or another type of organization that purchases databases, you want to make sure that those databases best meet the needs and purposes of your users. In order to do this, these and related kinds of data are going to be needed. We are only beginning the process of learning how to measure various types of database services and use.

     It can be done. However, the larger scientific community needs to be able to have an ongoing dialogue with database producers to build these kinds of mechanisms into the database so that such data can be regularly reported (see Figure 12.2).

Figure 12.2


     One of the models that we are beginning to consider measures efficiency, effectiveness, quality, impact, and usefulness (see Figure 12.3) against aspects of the network (e.g., technical infrastructure, information content, support, management). You can have great information content via the network (or database) but still have a network infrastructure that injures the ability to access and use that database. So these are some of the criteria that we are developing. This model seems to be very useful in thinking about and developing measures (see Figure 12.4). In most of these cells, we are developing some individual measures that are quite useful in describing database use and uses.

Figure 12.3


Figure 12.3b


Figure 12.4


     Figure 12.4 describes network components (e.g., technical infrastructure, content, and types of services provided). Some of these databases not only give you resources, but provide you with a range of interactive services, such as current awareness. They can include a wide range of value-added services that make them more or less desirable if you are making decisions on what kind of databases to buy.

     Extensiveness, for example, asks how much of a service is being provided--for instance, the number of Web page accesses per week, remote sessions, et cetera. An efficiency measure might be cost per session. We found in our research thus far that the cost per access to a session on one database compared to another database in the same area with the same basic content can vary considerably for the same kind of access. These are all criteria that we are beginning to use to judge the usefulness of these various databases.

     There are different types of Web log files that are essential for assessing database use, but I don't want to spend a lot of time talking about log-file analysis. All the databases that are housed in a Web environment have log files. There are four standard text-based log files: access logs, agent logs, error logs, and referral logs. They provide a huge amount of data in terms of access: Internet Protocol (IP) addresses, who is getting on, and for how long. You may be aware of techniques called threading in which one can track the way in which a certain IP address gets on, which pages it goes to, and how long it stays at each page. This is an excellent method to find out how your database is actually being used. We all know that there are problems with log files and getting statistics off these files. There are firewalls. There are proxy problems. There are caching problems. Nonetheless, log-file analysis is a powerful tool that we have only begun to employ in assessing the use of networked information services.
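The threading technique described above can be sketched in a few lines of code. The sketch below is illustrative only: it assumes access-log lines in the common log format, and the 30-minute inactivity cutoff used to split one IP address's trail into sessions is an assumed convention, not anything prescribed in this talk.

```python
"""Sketch: 'threading' a Web access log by IP address.

Assumptions (not from the talk): common log format lines, GET requests,
and a 30-minute inactivity gap as the session boundary.
"""
import re
from collections import defaultdict
from datetime import datetime, timedelta

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<page>\S+)[^"]*"'
)
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff

def thread_sessions(lines):
    """Group requests by IP, then split each IP's trail into sessions
    wherever the gap between consecutive requests exceeds SESSION_GAP."""
    by_ip = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            # Drop the timezone offset and parse the timestamp.
            ts = datetime.strptime(m["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
            by_ip[m["ip"]].append((ts, m["page"]))
    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()
        current = [hits[0]]
        for hit in hits[1:]:
            if hit[0] - current[-1][0] > SESSION_GAP:
                sessions.append((ip, current))
                current = []
            current.append(hit)
        sessions.append((ip, current))
    return sessions  # each session: (ip, [(timestamp, page), ...])
```

From each session one can then read off the page path and the time spent per page--exactly the "which pages, how long" picture described above--subject to the firewall, proxy, and caching caveats already noted.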

     There are also a number of Web statistics. These are the kinds of statistics being bandied about: the number of sessions completed on individual databases, broken down by demographic characteristics, by IP address, by length of session, and by number of turn-aways. Other statistics include full-text downloads; some databases allow full-text e-mails or forwarding of files, counted by type, by organization, by file size in bytes, and by titles of records browsed; and there is a whole range of cost data that I won't get into. Basically, however, we are talking about things such as cost per session, cost per full-text download, cost of an on-site versus a remote session, and many more.
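The unit-cost statistics just mentioned are simple ratios of a license fee to vendor-reported counts. The sketch below shows the arithmetic; all of the dollar figures and counts are invented for illustration, not data from the talk.

```python
"""Sketch: unit-cost statistics for a licensed database.

Cost per session and cost per full-text download are the license fee
divided by the corresponding vendor-reported count. All numbers below
are hypothetical.
"""

def unit_costs(license_fee, sessions, downloads):
    """Return (cost per session, cost per full-text download)."""
    return license_fee / sessions, license_fee / downloads

# Two hypothetical databases with the same fee and similar content:
per_session_a, per_download_a = unit_costs(12_000.00, 4_000, 1_500)
per_session_b, per_download_b = unit_costs(12_000.00, 1_000, 600)
# Cost per session comes out to $3.00 for A versus $12.00 for B --
# a fourfold spread, illustrating the kind of variation described
# earlier for databases with the same basic content.
```

The same pattern extends to on-site versus remote sessions or any other count a vendor reports, which is why comparable, consistently defined counts matter so much for these comparisons.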

     For those of us interested in assessing the quality, performance, impact, and use of these databases, we need to work with those of you who produce databases so that the potential of producing and using these data is realized. If you are interested in ensuring that your database is successful and does meet user needs, you should be as interested in these kinds of data as, for instance, the library community is.

     I want to point out that there is a group called the International Coalition of Library Consortia; its Web site is worth a look. The coalition has suggested standards and guidelines for gathering statistics on databases. I certainly applaud this effort. However, there is a problem of getting people to agree on basic definitions and terminology, and while I say, "Congratulations that this has been done," it is now time to update and move ahead with additional measures, definitions, and data collection procedures.

     I pulled off a couple of principles that they generated. One is called "comparative statistics." The information provider should provide comparative statistics to give consortia or users a context in which to analyze statistics at aggregate institutional levels. Another principle addresses access, delivery mechanisms, and report formats. It states that access to statistical reports should be provided via Web-based reporting systems. A number of database vendors allow you to go directly to their Web site where you can immediately see use statistics right then for your particular organization.

     A project that we are completing currently is developing a national set of performance measures and statistics to describe information use in the network environment.2 We just finished field-testing a range of statistics, some of which I have described to you. The findings suggest that there is limited agreement from database vendors about which statistics are going to be made available. There are differing definitions and terms to describe basic database activities--especially in the area of use and users.

     To some degree, I think we are at the point where we may need a little fascism and simply have a national group say that this is how we are going to have to count this stuff. Currently, there is a bit of a Tower of Babel, and there is a varying range of knowledge. Some people don't know why we should bother doing this. Some people don't care. Some people think it costs too much money, and so forth.

     I also want to point out that the findings from some work that we have been doing with the federal government suggest that a number of agencies have produced a range of databases that do not comply with, or at least do not consider, a whole range of federal information policy requirements such as those in the Government Performance and Results Act. There are a number of Privacy Act and Freedom of Information Act (FOIA) policies, and other federal information policy guidelines that apply to the Web environment and to database production, et cetera. We have completed a compendium of these laws and policy guidelines that affect Web-based services in the federal government, and it should be published by September 2000. There is a degree of ignorant bliss regarding these various policy statements, which may come back to bite federal database producers in the near future.

     There are other research efforts in place to produce statistics and performance measures to assess database use and users. The National Commission on Libraries and Information Science has a project entitled "Testing National Public Library Electronic Use Performance Measures."3 The Association of Research Libraries is mounting a project called "Usage Measures for Electronic Resources."4 I would like to suggest to you that there are some people out in the library world concerned about the difficulties getting these kinds of data from vendors. Some of the largest ARL libraries may begin to reconsider where they buy database services because of the producer's failure to provide such use and user data. Such possible decisions could result in "some serious economic impacts" (to quote the previous speaker).

     I also want to talk about the importance of standards. I think that after all is said and done with standards, more is said than done. Earlier, I kiddingly said, "Maybe we need some fascist decision making here." Well, it may be, but it would be a better world if there could be some agreement among not only the library community, but also the database producers and vendors, as to definitions and procedures for producing these statistics.

     There also is this issue of national data collection agencies. We have federal agencies that are supposed to go out and collect national data describing who owns what databases and how they are being used, yet such data largely are not being collected. So this is a serious problem, and it still comes down to the basic conference themes that were pointed to in the beginning of this session: Who is accessing these databases, for what reasons, and are they successful?

     Some parting shots: Agreeing on, collecting, and reporting scientific and technical database statistics are going to take some time and some resources--and are going to require some collaborative help among key stakeholders. This is going to be an interesting problem, like the rest of the problems we have been talking about today, because as the information technology environment changes so will the processes and the techniques that we need to collect the data to describe database use and users. How many of you still want to collect data about gopher use? I don't think we count it too often anymore. We are in an environment in which some of these statistics are going to be needed for 2 to 3 years, and right now we are looking at a process that takes 3 to 5 years for us to figure out which statistics to agree upon; so, by the time the agreement is reached, we may not need the statistic anymore.

     We have to do better than this. Database vendors and providers--whether they are government, private sector, or organizations like university consortia--need to begin thinking about agreeing to agree. Collaboration among policymakers, vendors, the academic networking community, and others is essential. Progress is being made. We need, frankly, more attention and visibility. Lastly, debates about a range of policy issues that have come up today would be better informed if we had data to describe some of the use and users of these databases. Most database providers think that they have a good database; they think they have great data and information in the database and that people can easily find it. This may not be true--oftentimes they simply don't know.

     A final point I want to emphasize is that we cannot make informed decisions about the quality of access without having some of these kinds of data that we are talking about this afternoon. Now is the time to start working together to develop such statistics and performance measures. The future quality and development of these databases depends on an ongoing program of assessment--an assessment that includes the use of nationally agreed-upon statistics and performance measures for services and resources in the networked environment.


1 Association of Research Libraries (ARL). 1999. Networked Information Resources (Spec Kit 253), Washington, D.C.

2 J.C. Bertot, C.R. McClure, and J. Ryan. 2000. Public Library Networked Services: A Guide for Using Statistics and Performance Measures. American Library Association, Chicago. See also "Developing National Public Library Statistics for the Networked Environment" at <>.

3 See for additional information.

4 See for additional information.

Copyright 2001 the National Academy of Sciences