Black Mesa Technologies
As a complement to the presentations on data citation in the life, natural, and social sciences, this presentation will discuss data citation in the humanities. The reaction of some people to that topic will be to say: “What?! When did humanists start working with data? What is considered to be data in the humanities?” To respond to these questions, I will start with a little background on what counts as data in the digital humanities, then survey current practice with respect to data citation in this domain, concluding with some remarks on intellectual issues with data citation in the humanities.
Data in the digital humanities
Humanities scholars started using machine-readable data in 1948, when Father Roberto Busa began work on the Index Thomisticus. This index was a full-text concordance of every word of every work published by St. Thomas Aquinas, as well as every work that historically had ever been attributed to Aquinas (even those that Busa felt confident were not actually by Aquinas). In his theological research, Busa was particularly interested in the concept of the eucharistic presence, which required an exhaustive list of Aquinas’ use of the preposition in. Since prepositions tend to have very high frequency in natural-language texts, such a list is very hard to prepare with note cards.
During the 1950s and 1960s, a great many individual texts were encoded and work of various kinds was performed with them. Often, concordances were made; sometimes stylistic studies were performed. In the early 1960s, Henry Kučera and Nelson Francis created the Brown Corpus of American English, a one-million-word corpus drawn from works published in 1961.
The journal Computers and the Humanities began publication in 1966, and in the 1970s, organizational structures formed to support the field: the Association for Literary and Linguistic Computing (1973) and the Association for Computers and the Humanities (1978). Also in 1978, the Lancaster-Oslo-Bergen Corpus of British English was published, created as a pendant to the Brown Corpus of American English.
Typically, because of the amount of work involved, the first texts to be encoded were the sacred texts. Busa, as a theologian, found Aquinas worth the effort of 30 years of work (the Index Thomisticus was finally published in 1978), and the Bible too was one of the earliest texts put into electronic form.
Of course, the classics were not far behind; classicists are accustomed to very labor-intensive work with their texts. The Thesaurus Linguae Graecae project was started in the 1970s with the aim of making a comprehensive digital library of Greek writing in all genres from Homer to the fall of Constantinople in 1453; it issued its first CD-ROM in 1985.

1 Presentation slides are available at http://www.sites.nationalacademies.org/PGA/brdi/PGA_064019.
What counts as data in the humanities? The humanistic disciplines seek, in general, better understanding of human culture, which means that pretty much anything created by humans is a plausible object of study. Data in these fields may include:
• digitized editions of major works;
• transcriptions of manuscripts;
• thematic collections (e.g., author, period, genre);
• language corpora (balanced or opportunistic; monolingual or multilingual [parallel structure or parallel-text translation equivalents]);
• images of artworks (e.g., Rossetti, Blake, de Young Museum ImageBase).
Digital representations of pre-existing artifacts now often take multi-media forms, e.g., scans plus transcriptions. Human culture did not end when humans built computers, however, so digital artifacts and born-digital objects are also objects of study for humanities disciplines. Scholars are studying digital art forms, hypertexts, interactive games, databases, and digital records of any kind. There is no human artifact that is a priori unsuitable as an object of historical or cultural critical study.
If humanists have been creating large, labor-intensive, expensive digital resources for six decades, the question arises: is anyone taking care of the materials thus created? The answer is yes and no.
Publishers of digital editions presumably have a commercial interest in retaining an electronic copy of the edition. (I do not actually have any evidence that they are aware of having that interest or that they are acting on that interest, but I hope some of them are, because to the extent that publishers regard digital editions as commercial products, they have historically not wanted to deposit the editions with a repository for long-term holding.) Individual projects also have an interest in the preservation of the materials they create, but individual projects sometimes suffer from the illusion that they are going to live forever, with the consequence that they are often taken by surprise when their funding runs out. So neither publishers nor individual data-creating projects have, as a rule, been reliable long-term custodians of humanities data. From as early as the 1970s, there have been projects to collect electronic texts and to preserve them in an archive, beginning with Project Libri at Dartmouth College, which no longer exists. Fortunately, before Project Libri was terminated, its managers deposited all their texts with the Oxford Text Archive, founded a few years later in 1976, which does still exist. There are currently many electronic text centers in university libraries, which have retained digital objects. As a rule, however, most library-based electronic text centers are interested in displaying and making accessible things that they have digitized and are less interested in acquiring digital resources from elsewhere. Also, there are (at least in theory) digital repositories of various kinds, including institutional digital repositories.
Although there has been a long history of this work, there is nothing in the humanities like the network of social science data archives that we see with the Inter-university Consortium for Political and Social Research (ICPSR) in the United States, the Economic and Social Research Council (ESRC) Data Archive in the United Kingdom (UK), or the Danish Data Archive, and their various analogues in other countries. The Arts and Humanities Data Service in the UK essentially tried to fill that role, and did so very successfully for the eleven years it was funded, before it was terminated. (It should be noted that the ESRC Data Archive has since been renamed the UK Data Archive and describes itself as holding research data in “the social sciences and humanities.”)
Current data citation practice in the humanities
The current status of data citation practice in the humanities is mixed. There are some hopeful signs. For example, the TEI Guidelines explicitly require internal metadata for the electronic object itself, and not just for the exemplar of the electronic object. So in principle digital humanists should be familiar with the idea that an electronic object representing a non-digital artifact is distinct from its source and needs to be documented and cited in its own right. In theory, at least, people should know what to cite.
Also, most citation styles used by humanists now have been revised to allow the inclusion of IRIs (internationalized resource identifiers) in the citation, which ought to be helpful. Some citation styles, of course, refer not to IRIs but to URIs (uniform resource identifiers) or URLs (uniform resource locators) instead, which will irritate those of us who believe IRIs should be the identifiers of choice. But the principle of providing an identifier for an electronic object is at least recognized.
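For readers unfamiliar with the distinction, the relationship between IRIs and URIs is mechanical: RFC 3987 defines a mapping that turns any IRI into an equivalent ASCII-only URI by percent-encoding its non-ASCII characters as UTF-8. The following is a minimal sketch in Python of that mapping, assuming an http-style IRI whose hostname the standard-library IDNA codec can handle; it is an illustration, not a full RFC 3987 implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Map an IRI to an equivalent ASCII-only URI by percent-encoding
    non-ASCII characters as UTF-8 (the mapping of RFC 3987, section 3.1).
    The hostname is converted with the IDNA codec; reserved delimiters
    in each component are left intact."""
    parts = urlsplit(iri)
    host = parts.netloc.encode("idna").decode("ascii") if parts.netloc else ""
    return urlunsplit((
        parts.scheme,
        host,
        quote(parts.path, safe="/%"),    # keep path separators and existing escapes
        quote(parts.query, safe="=&%"),  # keep query delimiters
        quote(parts.fragment, safe="%"),
    ))

print(iri_to_uri("http://example.org/Platón/República"))
# → http://example.org/Plat%C3%B3n/Rep%C3%BAblica
```

The point for citation practice is that an IRI is the more general identifier: a citation style that records IRIs loses nothing, since the URI form can always be derived when an ASCII-only context requires it.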
In preparation for this presentation, I examined a random sample of papers in the field, looking to see whether fairly complete digital resources lie behind these papers and, if so, how those resources are cited. There are, as might be expected, several patterns. Ideally, one might want to see published resources used in the papers, with explicit citations in the references, so that services like ISI will find them and the data citations will show up in a citation index. As far as I can tell, however, this is a purely theoretical category: I found no instances of it in my small sample.
The closest thing I found in the sample to this ideal practice was papers which mention published resources that are explicitly described, sometimes with a URL pointing to the item, but with no entry for the resource in the references. Sometimes the references instead include a reference to a related paper, which may indicate both a desire to cite the work and a discomfort with citing resources which do not take traditional scholarly forms, or perhaps uncertainty about how to cite data resources directly.
In other cases, papers mentioned resources that are clearly identifiable as objects, that clearly have an identity of their own, and that have not been published. The resources are explicitly mentioned and acknowledged in the text, but naturally enough they are not cited because they have not been published. There is not even the equivalent of a personal-communication citation.
There also are many resources that clearly must exist (unless the paper is a complete forgery) but are unpublished. These resources are implicit in the argument of the paper, but they remain uncited because, clearly, the author thinks of them as analogous to working notes.
In summary, the situation for data citation in the humanities is completely confused.
Some problems in humanities data citation
Some of the problems arising for data citation in the humanities are problems already discussed in the natural and social science context. Others may be particular to the humanities.
If I, as an author, want to cite a resource, how should it be cited? What exactly should be cited? Am I citing the entire British National Corpus? Am I citing a particular sample of the British National Corpus? Am I citing the archive from which I got the British National Corpus? In the case of the British National Corpus, those distinctions are very clear. In other cases, the distinction is not clear at all. When working with digital resources which combine and recombine with each other in unpredictable ways, scholars will find philosophical questions of identity taking on an unwontedly urgent practical aspect.
Second, it is sometimes difficult to locate reliable metadata concerning a resource one might want to cite. Without a physical title page, it may be challenging to identify the title of a data resource, or the names of those intellectually responsible for its content, or the nature of their contribution. In many cases, it is difficult to identify a publisher or a date of publication. Those responsible for distributing a data resource may not regard themselves as the publishers of the work, because they do not regard themselves as engaged in publishing in the conventional sense. (This problem may be familiar to those who have tried to cite microfilms created for individual scholars by individual photographers.) It is not even clear whether familiar roles like “publisher” have the same relevance for electronic resources as they do for print materials. If the familiar division of labor among publishers, distributors, repositories, libraries and archives is, fundamentally, a way of organizing the management of information, we may expect those roles to be recognizable in the digital world. If, on the contrary, that familiar division of labor reflects only a way of organizing the management of paper and other physical objects, then the digital world may well converge on a different and incommensurable set of roles.
In many ways (the challenge of locating reliable metadata among them), digital objects seem to be in an incunabular phase: like the earliest printed books, digital objects lack established conventions for identifying the object or those responsible for it. In pessimistic moments, an observer might fear that the situation is even worse than that, and that the creation and dissemination of digital objects does not resemble the creation and dissemination of early printed books so much as it resembles the scribal transmission of manuscripts.
A third challenge for citation of humanities data resources is that many of those involved, whether as producers or as consumers of resources, want turn-key systems, not a set of tools and materials which leave them to their own devices. This perfectly understandable desire for ease of use tends in practice to lead to tight coupling (both in the technical sense, and psychologically in producers and consumers) of the data resource, the software used to provide access to the resource, and the user interface of that software. (There are notable exceptions, including the
Perseus Digital Library, which has over the years released its data with different software, as the computational infrastructure available to its users has changed.) When a data resource is tightly coupled with a particular piece of software, it can be difficult to distinguish the one from the other, with consequent difficulties for anyone who would like to cite the data resource itself, and not the data-plus-software combination.
Several factors may be identified which tend to inhibit data citation in the humanities.
Fear of copyright issues: When digital resources are constructed without copyright clearance (perhaps on the theory that they are for personal use only), the creators of the resources will understandably hesitate to publish them, or even to cite them explicitly. As a senior figure in the field wrote to me: “I think you will still find plenty of people saying ‘we ran a stylometric analysis on a corpus which has these properties, but we cannot let you see the actual corpus because we did not obtain the copyright.’”
Anti-scientism: Citing data resources may seem foreign to the culture of humanistic scholarship, an eruption into the humanities of natural-scientific practices and perhaps a symptom of science envy, to be discouraged as naive and unhelpful.
Citation chains: Print has (reasonably) well-established conventions for re-publication and citation of earlier publications. Not so digital resources, which may include refinements, revisions, elaborations, subsets, derivations, annotations, and so on, often made without any explicit reference to the sources from which they were derived. This is not unknown in print culture, of course: some editions of classic authors provide no information about the copy text used in the edition. But it does make it hard to trace the provenance of some digital resources, and when resource creators fail to cite the prior resources they use, it is not surprising if users of the later resources also fail to cite the resource when they use it.
Versioning: Large humanities projects typically make multiple passes over the same material. In the future, it is not unlikely that early results will be published (under pressure from the Web culture and from funders). If there are multiple versions of a resource from the same source, at least two problems may be expected:
• Will the metadata for the resource label the version and explain the nature of its relation to other versions of the resource?
• What will be the unit of change? Will changes be clumped into groups in the way familiar from print editions? Or will the resource change continuously (in which case, will it be possible to pluck out a given state of the resource at a given instant to represent a version of the resource)?
Quiddity: Large humanities projects typically produce multiple forms of the same material: a reading text; a text-critical variorum text; a text with literary annotations; linguistic annotations (glosses for cruxes? parse trees? …); or a formalization of propositional content.
Which of these is the thing I am publishing? And which of these is the thing I am citing?
Longevity: Finally, there is the question of longevity. It is well known that the half-life of citations is much longer in the humanities than in the natural sciences. We have been cultivating a culture of citation and referencing in the West for about 2,000 years, since the Alexandrian era.
Our current citation practice may be 400 years old. The http scheme, by comparison, is about 19 years old. It is a long reach to assume, as some do, that http URLs are an adequate mechanism for all citations of digital (and non-digital!) objects. It is not unreasonable for scholars to be skeptical of the use of URLs to cite data of any long-term significance, even if they are interested in citing the data resources they use.
Moderated by Herbert van de Sompel
PARTICIPANT: This is a question for Mary Vardigan. When you have data systems that are based on surveys, do you have to include in the metadata how exactly the survey was conducted? The reason I am asking is that a couple of years ago, my city paid a contractor to do a big survey of citizen satisfaction. It was done classically with randomly drawn phone numbers, and when I looked into it, there was no list of people with cell phones. As a result, cell phone users, especially the Hispanic population, were vastly under-sampled. How do you deal with this kind of issue?
DR. VARDIGAN: That is a good question. The sampling information is part of the important documentation and metadata that we distribute with every dataset we make available. It is important in assessing data quality. At ICPSR, we do not assess data quality ourselves; it is the community that will determine whether the sample is adequate and scientifically sound. It is important, therefore, to have that descriptive information about how the survey was conducted.
DR. BORGMAN: This session has taught us about what might be generic solutions across the disciplines, as well as what could be specific. So far, I have heard two things in common across them. One is that there are data papers or surrogates, where a journal article that describes the data will be cited in lieu of citing the data per se. The other is that there is deep complexity and confusion in the field. I think it would be good if each of you could highlight what you heard from the other panelists that might work in your field and might be really useful.
DR. SPERBERG-McQUEEN: Another item to add to your list is the issue of granularity and the perceived need to be able to cite parts of datasets as well as entire datasets.
DR. CALLAGHAN: I will emphasize the importance of metadata when it comes to doing data citation. We cannot validate our datasets unless we have the full description of what the numbers are and what they actually mean.
DR. VARDIGAN: I would like to add the issue of versioning, which seems to be common across disciplines, because data do change over time. There are some dynamic datasets that are continually being generated and there may be discipline-specific solutions to this issue that could be deployed across other disciplines if we knew more about them.
DR. BOURNE: The notion of peer review of data is being brought up in different contexts. My sense is that this is something that depends on the maturity of the field that is generating those data. When a field is pretty mature, unlike a new field, I think the community will have come to a good understanding of how the data could be peer reviewed as part of the process.
DR. CHAVAN: I want to emphasize that some of the data that we deal with are very complex, and this affects our ability to cite them properly. What we have done is get involved with a publisher who is innovative and willing to experiment to work on this issue. GBIF currently publishes more than 18,000 datasets, and none of those datasets has more than four of its 64 metadata fields complete. So publishing data is easy, but writing metadata is really a difficult task.
We came up with the approach of publishing the metadata document as a scholarly publication. We made an announcement about three months ago regarding a technical solution and a recognition mechanism that we will put in our papers. In the last three months, we approached about 350 data publishers, who publish 18,000 datasets, and few of them took it up. They said that writing a good metadata document that can actually be published is difficult, because every metadata document will have to go through a review process. This is difficult, and I did it personally: I wrote a metadata document that could qualify for review, and it took about eight hours. This is something that needs to be addressed. Nonetheless, out of the 350 publishers, we were able in three months to convince about ten.
DR. CALLAGHAN: I agree that comprehensive dataset management and metadata work are very challenging tasks. However, we do have good incentives to offer data producers in exchange for their metadata. Metadata are very important in helping data producers understand the complexity of the systems related to data management. When it comes to the peer review of metadata, I do not think that we want to get scientists and journals involved in this process, because it is time consuming, complicated, and technically weighted toward data management professionals. I am of the opinion that it is the job of the data centers to make sure that the metadata are complete. We can do that, and it is well within our area of expertise.
DR. VARDIGAN: Just one more point related to incentivizing data producers to create good metadata. In the United States, we now have the National Science Foundation asking researchers applying for grants to provide data management plans, and metadata are a big component of those plans. We are hoping that this will be a positive influence on what eventually gets deposited into the data centers.
DR. BOURNE: I think that the provision of good metadata is dependent on the reward systems. In some biosciences communities, it is not only that you cannot publish without depositing the data; you must also deposit those data with a fair degree of rigorous metadata. The reward is built in, because you cannot publish without it.
PARTICIPANT: I want to ask all the panelists: what do you see as the chances of establishing some entity that would serve as a registry of unique and persistent identifiers across the different domains?
DR. BOURNE: Can you turn it around? The problem right now is that publishing is a competitive business and no single publisher is going to demand getting something standardized, because there is a risk to their business model. However, if publishers got together and insisted that there should be a standardized metadata description that we can use across the board, there could be a chance for it to happen.
DR. CALLAGHAN: I would say that DOIs are very well accepted, and that is the route that we have chosen to use for our datasets. I would like to add, though, that a DOI should not be the sole basis of the citation. There has to be more information than the DOI, because a DOI is just an alphanumeric string; a person will look at it and not understand anything. Whereas if another part of the citation gives the author name, title, and perhaps other information, it might not be any good for computers, but it will help the humans.
DR. BRASE: I think part of the issue about how much metadata is already contained in an archive depends on the discipline. On the one hand, for example, in the life sciences, there is a large amount of data already in their archives and they have their own way of doing things. Therein lies the problem. Change will be difficult. On the other hand, for long-tail data that do not have a home, I think there is real opportunity because people want to deposit those data and get credit for them. This is where you can standardize the process. That is where DataCite and other similar initiatives can come in.
DR. WILSON: It is important in a lot of fields that we have documentation of the method by which the data were generated. I also want to invite more comment on the division between what are metadata and what are data, because this is not always a clear line; one person’s data may be another’s metadata, and vice versa.
DR. VAN DE SOMPEL: I agree. I think that we still have a huge gap in our understanding of what people consider to be “data.”
PARTICIPANT: I would like to see a universal approach or guidelines in relation to data citation and attribution. I think that it is very encouraging that we have people representing different fields and areas looking at the same set of problems here. Let us just try to be simple and work on things gradually. These discussions and the emerging tools and technologies provide tremendous motivation for publishers and researchers, and I think that this meeting is a very good starting point. We might not come up with the best solution right now, but the direction is very encouraging.
PARTICIPANT: Learning from each other is a useful approach as well. I have been working towards citable references for a long time, and while I have had this subject as a priority, it has been less of a priority in other disciplines. Other disciplines suffer from this syndrome of incrementally refined datasets. For example, sequences from GenBank are refined by many people, and that makes for a complicated co-authorship situation. Have you run into that at ICPSR and, if so, how do you handle such a situation in a cited reference?
DR. VARDIGAN: We have some instances of data that have multiple contributors. Some datasets have over a hundred contributors. We have just used “Name, et al.” to acknowledge the variety of people involved. We do not have a specific approach to deal with such a situation.
PARTICIPANT: I would be very interested if Dr. Vardigan or other colleagues can talk about third-party metadata. For example, if there is a record put somewhere with an appropriate link that states, “this sample did not include cell phones,” this would tell us that the sample is biased. That would be a really useful approach to have.
DR. VARDIGAN: I do not know of anything like that currently in existence, but we all rely on the scientific method. If a paper is published and others decide to make judgments about its merits and publish something themselves about the quality of the data or its content, they can do that. As a data center, ICPSR does write what we consider to be comprehensive metadata references, and we track publications based on our data.