9- Data Citation in the Humanities: What's the Problem?
Michael Sperberg-McQueen1
Black Mesa Technologies
As a complement to the presentations on data citation in the life, natural, and social sciences, this
presentation will discuss data citation in the humanities. The reaction of some people to that
topic will be to say: "What?! When did humanists start working with data? What is considered
to be data in the humanities?" To respond to these questions, I will start with a little background
on what counts as data in the digital humanities, then survey current practice with respect to data
citation in this domain, concluding with some remarks on intellectual issues with data citation in
the humanities.
Data in the digital humanities
Humanities scholars started using machine-readable data in 1948, when Father Roberto Busa
began work on the Index Thomisticus. This index was a full-text concordance of every word of
every work published by St. Thomas Aquinas, as well as every work that historically had ever
been attributed to Aquinas (even those that Busa felt confident were not actually by Aquinas). In
his theological research, Busa was particularly interested in the concept of the eucharistic
presence, which required an exhaustive list of Aquinas' use of the preposition in. Since
prepositions tend to have very high frequency in natural-language texts, such a list is very hard to
prepare with note cards.
During the 1950s and 1960s, a great many individual texts were encoded and work of various
kinds was performed with them. Often, concordances were made; sometimes stylistic studies
were performed. In the early 1960s, Henry Kucera and Nelson Francis created the Brown Corpus
of American English, a one-million-word corpus drawn from works published in 1961.
The journal Computers and the Humanities began publication in 1966, and in the 1970s,
organizational structures formed to support the field: the Association for Literary and Linguistic
Computing (1973) and the Association for Computers and the Humanities (1978). Also in 1978,
the Lancaster-Oslo-Bergen Corpus of British English was published, created as a pendant to the
Brown Corpus of American English.
Typically, because of the amount of work involved, the first texts to be encoded were the sacred
texts. Busa, as a theologian, found Aquinas worth the effort of 30 years of work (the Index
Thomisticus was finally published in 1978), and the Bible too was one of the earliest texts put
into electronic form.
Of course, the classics were not far behind; classicists are accustomed to very labor-intensive
work with their texts. The Thesaurus Linguae Graecae project was started in the 1970s with the
1 Presentation slides are available at http://sites.nationalacademies.org/PGA/brdi/PGA_064019.
60 DEVELOPING DATA ATTRIBUTION AND CITATION PRACTICES AND STANDARDS
aim of making a comprehensive digital library of Greek writing in all genres from Homer to the
fall of Constantinople in 1453; it issued its first CD-ROM in 1985.
What counts as data in the humanities? The humanistic disciplines seek, in general, better
understanding of human culture, which means that pretty much anything created by humans is a
plausible object of study. Data in these fields may include:
· digitized editions of major works;
· transcriptions of manuscripts;
· thematic collections (e.g., author, period, genre);
· language corpora (balanced or opportunistic; monolingual or multilingual [parallel
structure or parallel-text translation equivalents]);
· images of artworks (e.g., Rossetti, Blake, DeYoung Museum ImageBase); and
· maps.
Digital representations of pre-existing artifacts now often take multi-media forms, e.g., scans
plus transcriptions. Human culture did not end when humans built computers, however, so digital
artifacts and born-digital objects are also objects of study for humanities disciplines. Scholars are
studying digital art forms, hypertexts, interactive games, databases, and digital records of any
kind. There is no human artifact that is a priori unsuitable as an object of historical or cultural
critical study.
If humanists have been creating large, labor-intensive, expensive digital resources for six
decades, the question arises: is anyone taking care of the materials thus created? The answer is
yes and no.
Publishers of digital editions presumably have a commercial interest in retaining an electronic
copy of the edition. (I do not actually have any evidence that they are aware of having that
interest or that they are acting on that interest, but I hope some of them are, because to the extent
that publishers regard digital editions as commercial products, they have historically not wanted
to deposit the editions with a repository for long-term holding.) Individual projects also have an
interest in the preservation of the materials they create, but individual projects sometimes suffer
from the illusion that they are going to live forever, with the consequence that they are often
taken by surprise when their funding runs out. So neither publishers nor individual data-creating
projects have, as a rule, been reliable long-term custodians of humanities data.
From as early as the 1970s, there have been projects to collect electronic texts and to preserve
them in an archive, beginning with Project Libri at Dartmouth College, which no longer
exists. Fortunately, before Project Libri was terminated, its managers deposited all their texts
with the Oxford Text Archive, founded a few years later in 1976, which does still exist. There
are currently many electronic text centers in university libraries, which have retained digital
objects. As a rule, however, most library-based electronic text centers are interested in
displaying and making accessible things that they have digitized and are less interested in
acquisition of digital resources from elsewhere. Also, there are (at least in theory) digital
repositories of various kinds, including institutional digital repositories.
Although there has been a long history of this work, there is nothing in the humanities like the
network of social science data archives that we see with the Inter-university Consortium for
Political and Social Research (ICPSR) in the United States, the Economic and Social Research
Council Data Archive in the United Kingdom (UK), or the Danish Data Archive, and their
various analogues in other countries. The Arts and Humanities Data Service in the UK essentially
tried to fill that role and did so very successfully for the eleven years that it was funded
before it was terminated. (It should be noted that the ESRC Data Archive has now been renamed
the UK Data Archive and describes itself as holding research data in "the social sciences
and humanities.")
Current data citation practice in the humanities
The current status of data citation practice in the humanities is mixed. There are some hopeful
signs. For example, the TEI Guidelines explicitly require internal metadata for the electronic
object itself, and not just for the exemplar of the electronic object. So in principle digital
humanists should be familiar with the idea that an electronic object representing a non-digital
artifact is distinct from its source and needs to be documented and cited in its own right. In
theory, at least, people should know what to cite.
Also, most citation styles used by humanists now have been revised to allow the inclusion of
IRIs (internationalized resource identifiers) in the citation, which ought to be helpful. Some
citation styles, of course, refer not to IRIs but to URIs (uniform resource identifiers) or URLs
(uniform resource locators) instead, which will irritate those of us who believe IRIs should be the
identifiers of choice. But the principle of providing an identifier for an electronic object is at
least recognized.
In preparation for this presentation, I examined a random sample of papers in the field, looking
to see whether there are fairly complete digital resources behind these papers, and if so, whether
I could understand or read what they are. There are, as might be expected, several patterns.
Ideally, one might want to see published resources used in the papers, with explicit citations in
the references so services like ISI will find them and so that the data citations will show up in a
citation index. As far as I can tell, however, this is purely a theoretical category: I found no
instances of it in my small sample.
The closest thing found in the sample to this ideal practice was papers which mention
published resources, which are explicitly described, sometimes with a URL pointing to the item,
but with no reference to the resource in the references. Sometimes, the references include instead
a reference to a related paper, which may indicate both a desire to cite the work and a discomfort
with citing resources which do not take traditional scholarly forms, or perhaps uncertainty about
how to cite data resources directly.
In other cases, papers mentioned resources that are clearly identifiable as objects, that clearly
have an identity of their own, and that have not been published. The resources are explicitly
mentioned and acknowledged in the text, but naturally enough they are not cited because they
have not been published. There is not even the equivalent of a personal-communication citation.
There also are many resources that clearly must exist unless the paper is a complete forgery, but
they are unpublished. These resources are implicit in the argument of the paper, but they remain
uncited because, clearly, the author thinks of them as analogous to working notes.
In summary, the situation for data citation in the humanities is completely confused.
Some problems in humanities data citation
Some of the problems arising for data citation in the humanities are problems already discussed
in the natural and social science context. Others may be particular to the humanities.
If I, as an author, want to cite a resource, how should it be cited? What exactly should be cited?
Am I citing the entire British National Corpus? Am I citing a particular sample of the British
National Corpus? Am I citing the archive from which I got the British National Corpus? In the
case of the British National Corpus, those distinctions are very clear. In other cases, the
distinction is not clear at all. When working with digital resources which combine and recombine
with each other in unpredictable ways, scholars will find philosophical questions of identity
taking on an unwontedly urgent practical aspect.
Second, it is sometimes difficult to locate reliable metadata concerning a resource one might
want to cite. Without a physical title page, it may be challenging to identify the title of a data
resource, or the names of those intellectually responsible for its content, or the nature of their
contribution. In many cases, it is difficult to identify a publisher or a date of publication. Those
responsible for distributing a data resource may not regard themselves as the publishers of the
work, because they do not regard themselves as engaged in publishing in the conventional sense.
(This problem may be familiar to those who have tried to cite microfilms created for individual
scholars by individual photographers.) It is not even clear whether familiar roles like "publisher"
have the same relevance for electronic resources as they do for print materials. If the familiar
division of labor among publishers, distributors, repositories, libraries and archives is,
fundamentally, a way of organizing the management of information, we may expect those roles
to be recognizable in the digital world. If, on the contrary, that familiar division of labor reflects
only a way of organizing the management of paper and other physical objects, then the digital
world may well converge on a different and incommensurable set of roles.
In many ways, the challenge of locating reliable metadata among them, digital objects seem to be
in an incunabular phase; like the earliest printed books, digital objects lack established
conventions for identifying the object or those responsible for it. In pessimistic moments, an
observer might fear that the situation is even worse than that, and that the creation and
dissemination of digital objects does not resemble the creation and dissemination of early printed
books so much as it resembles scribal transmission of manuscripts.
A third challenge for citation of humanities data resources is that many of those involved,
whether as producers or as consumers of resources, want turn-key systems, not a set of tools and
materials which leave them to their own devices. This perfectly understandable desire for ease of
use tends in practice to lead to tight coupling (both in the technical sense, and psychologically in
producers and consumers) of the data resource, the software used to provide access to the
resource, and the user interface of that software. (There are notable exceptions, including the
Perseus Digital Library, which has over the years released its data with different software, as the
computational infrastructure available to its users has changed.) When a data resource is tightly
coupled with a particular piece of software, it can be difficult to distinguish the one from the
other, with consequent difficulties for anyone who would like to cite the data resource itself, and
not the data-plus-software combination.
Several factors may be identified which tend to inhibit data citation in the humanities.
Fear of copyright issues: When digital resources are constructed without copyright clearance
(perhaps on the theory that they are for personal use only), the creators of the resources will
understandably hesitate to publish them, or even to cite them explicitly. As a senior figure in the
field wrote to me: I think you will still find plenty of people saying "we ran a stylometric
analysis on a corpus which has these properties, but we cannot let you see the actual corpus
because we did not obtain the copyright."
Anti-scientism: Citing data resources may seem foreign to the culture of humanistic scholarship,
an eruption into the humanities of natural-scientific practices and perhaps a symptom of science
envy, to be discouraged as naïve and unhelpful.
Citation chains: Print has (reasonably) well-established conventions for re-publication and
citation of earlier publications. Not so digital resources, which may include refinements,
revisions, elaborations, subsets, derivations, annotations, and so on, often made without any
explicit reference to the sources from which they were derived. This is not unknown in print
culture, of course: some editions of classic authors provide no information about the copy text
used in the edition. But it does make it hard to trace the provenance of some digital resources,
and when resource creators fail to cite the prior resources they use, it is not surprising if users of
the later resources also fail to cite the resource when they use it.
Versioning: Large humanities projects typically make multiple passes over the same material. In
the future, it is not unlikely that early results will be published (under pressure from the Web
culture and from funders). If there are multiple versions of a resource from the same source, at
least two problems may be expected.
Will the metadata for the resource label the version and explain the nature of its relation to other
versions of the resource?
What will be the unit of change? Will changes be clumped into groups in the way familiar from
print editions? Or will the resource change continuously (in which case, will it be possible to
pluck out a given state of the resource at a given instant to represent a version of the resource)?
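As a rough illustration of the first question, the following Python sketch shows one way metadata might label a version and state its relation to an earlier one. It is only a sketch under stated assumptions: the ResourceVersion class, its field names, and the "Example Corpus" values are hypothetical and do not reflect any existing standard or project.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ResourceVersion:
    """Hypothetical metadata for one citable state of a data resource."""
    title: str
    version: str                        # label chosen by the resource creators
    released: date                      # date on which this state was fixed
    derived_from: Optional[str] = None  # label of the version it revises, if any


def describe(v: ResourceVersion) -> str:
    """Render the version label and its relation to an earlier version."""
    text = f"{v.title}, version {v.version} ({v.released.isoformat()})"
    if v.derived_from is not None:
        text += f", revision of version {v.derived_from}"
    return text


# Invented example values, for illustration only.
first = ResourceVersion("Example Corpus", "1.0", date(2010, 1, 15))
second = ResourceVersion("Example Corpus", "1.1", date(2011, 6, 1), derived_from="1.0")

print(describe(first))
print(describe(second))
```

A record of this kind presupposes that changes are grouped into discrete, labeled states; a continuously changing resource would need some other device, such as citing the date of access.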
Quiddity: Large humanities projects typically make multiple passes over material, and each pass
may yield a different object: reading text; text-critical variorum text; text with literary
annotations; linguistic annotations (glosses for cruxes? parse trees? ...); or formalization of
propositional content.
Which of these is the thing I am publishing? And which of these is the thing I am citing?
Longevity: Finally, there is the question of longevity. It is well known that the half-life of
citations is much longer in the humanities than in the natural sciences. We have been cultivating a
culture of citation and referencing for about 2,000 years in the West, since the Alexandrian era.
Our current citation practice may be 400 years old. The http scheme, by comparison, is about 19
years old. It is a long reach to assume, as some do, that http URLs are an adequate mechanism
for all citations of digital (and non-digital!) objects. It is not unreasonable for scholars to be
skeptical of the use of URLs to cite data of any long-term significance, even if they are interested
in citing the data resources they use.
DISCUSSION BY WORKSHOP PARTICIPANTS
Moderated by Herbert van de Sompel
PARTICIPANT: This is a question for Mary Vardigan. When you have data systems that are
based on surveys, do you have to include in the metadata how exactly the survey was conducted?
The reason I am asking is that a couple of years ago, my city paid a contractor to do a big survey
of citizen satisfaction. It was done classically with randomly drawn phone numbers and when I
looked into it, there was no list of people with cell phones. As a result, cell phones, essentially
used by the Hispanic population, were vastly under-sampled. How do you deal with this kind of
issue?
DR. VARDIGAN: That is a good question. The sampling information is part of the important
documentation and metadata that we distribute with every dataset we make available. It is
important in assessing data quality. At ICPSR, we do not assess data quality ourselves; it is the
community that will determine whether the sample is adequate and scientifically sound. It is
important, therefore, to have that descriptive information about how the survey was conducted.
DR. BORGMAN: We were taught in this session about what might be generic solutions across
the disciplines, as well as what could be specific. So far, I have heard two things in common
across them. One is that there are data papers or surrogates, where a journal article that describes
the data will be cited in lieu of citing the data per se. The other is that there is a deep complexity
and confusion in the field. I think it would be good if each of you could highlight what you heard
from the other panelists that might work in your field and might be really useful.
DR. SPERBERG-MCQUEEN: Another item to add to your list is the issue of granularity and
the perceived need to be able to cite parts as well as entire datasets.
DR. CALLAGHAN: I will emphasize the importance of metadata when it comes to doing data
citation. It is not enough to validate our datasets unless we have got the full description of what
the numbers are and what they actually mean.
DR. VARDIGAN: I would like to add the issue of versioning, which seems to be common
across disciplines, because data do change over time. There are some dynamic datasets that are
continually being generated and there may be discipline-specific solutions to this issue that could
be deployed across other disciplines if we knew more about them.
DR. BOURNE: The notion of peer review of data is being brought up in different contexts. My
sense is that this is something that is dependent on the maturity of the field that is generating
those data. Unlike new fields, when a field is pretty mature, I think the community must have
come to a good understanding of how the data could be peer reviewed as part of the process.
DR. CHAVAN: I want to emphasize that some of the data that we deal with are very complex
and this affects our ability to cite them properly. What we have done is that we got involved with
a publisher who is innovative and willing to experiment to work on this issue. GBIF currently
publishes more than 18,000 datasets and none of those datasets have more than four metadata
fields complete out of 64 lines. So, publishing data is easy, but writing metadata is really a
difficult task.
We came up with the approach of publishing the metadata document as a scholarly publication.
We publicized an announcement about three months ago regarding a technical solution and a
recognition mechanism that we will put in our papers. In the last three months, we approached
about 350 data publishers who publish 18,000 datasets and few of them took it up. They said that
writing a good metadata document that can actually be published is difficult because every
metadata document will have to go to a review process. This is difficult and I personally did it. I
wrote a metadata document that could qualify for review and it took about eight hours. This is
something that needs to be addressed. Nonetheless, out of the 350 publishers, we were able in
three months to convince about ten of them.
DR. CALLAGHAN: I agree that comprehensive dataset management and metadata work are
very challenging tasks. However, we are in the position of having good incentives for data
producers until they give us their metadata. Metadata are very important to help data producers
understand the complexity of the systems related to data management. When it comes to the peer
review of metadata, I do not think that we want to get scientists and journals involved in this
process because it is time consuming, complicated, and technically biased toward data
management professionals. I am of the opinion that it is the job of the data centers to make sure
that the metadata are complete. We can do that and it is well within our area of expertise.
DR. VARDIGAN: Just one more point related to incentivizing data producers to create good
metadata. In the United States, we now have the National Science Foundation asking researchers
applying for grants to provide data management plans, and metadata are a big component of
those plans. We are hoping that this will be a positive influence on what eventually gets
deposited into the data centers.
DR. BOURNE: I think that the provision of good metadata is dependent on the reward systems.
In some biosciences communities, it is not only that you cannot publish without depositing the
data, you must also deposit that data with a fair degree of rigorous metadata. That is also a
reward because you cannot publish without it.
PARTICIPANT: I want to ask all the panelists how do you see the chances of some entity that
would be a registry of unique and persistent identifiers across the different domains?
DR. BOURNE: Can you turn it around? The problem right now is that publishing is a
competitive business and no single publisher is going to demand getting something standardized,
because there is a risk to their business model. However, if publishers got together and insisted
that there should be a standardized metadata description that we can use across the board, there
could be a chance for it to happen.
DR. CALLAGHAN: I would say that DOIs are very well accepted and that is the route that we
have chosen to use as far as our datasets are concerned. I would like to add, though, that a DOI
should not be the sole basis of the citation. There has to be more information alongside the DOI because
a DOI is just an alpha-numeric string. A person will look at it and will not understand anything.
Whereas, if you have another part of the citation that gives the author name, title, and perhaps
other information it might not be any good for computers, but it will help the humans.
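The point that a citation should pair the DOI with human-readable elements can be sketched in a few lines of Python. This is a minimal illustration, not a recommended citation style: the format_citation function, the ordering of elements, the "et al." threshold, and the placeholder DOI 10.1234/example are all assumptions made for the example.

```python
def format_citation(authors, year, title, distributor, doi):
    """Combine human-readable elements with a DOI in one citation string.

    The element order and the abbreviation rule are illustrative choices only.
    """
    if len(authors) > 2:
        # Abbreviate long contributor lists to the first name plus "et al."
        names = f"{authors[0]}, et al."
    else:
        names = " and ".join(authors)
    return f"{names} ({year}). {title}. {distributor}. doi:{doi}"


# Invented values; the DOI below is a placeholder, not a registered identifier.
citation = format_citation(
    ["A. Author", "B. Author", "C. Author"],
    2011,
    "Example Survey Dataset",
    "Example Data Center",
    "10.1234/example",
)
print(citation)
```

The DOI at the end serves machines that need to resolve the object, while the names, date, and title tell a human reader what is being cited.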
DR. BRASE: I think part of the issue about how much metadata is already contained in an
archive depends on the discipline. On the one hand, for example, in the life sciences, there is a
large amount of data already in their archives and they have their own way of doing things.
Therein lies the problem. Change will be difficult. On the other hand, for long-tail data that do
not have a home, I think there is real opportunity because people want to deposit those data and
get credit for them. This is where you can standardize the process. That is where DataCite and
other similar initiatives can come in.
DR. WILSON: It is important in a lot of fields that we have documentation of the method by
which the data was generated. I also want to invite more comment on the division between what are
metadata and what are data, because this is not always a clear line--one person's data may be
another's metadata, and vice versa.
DR. VAN DE SOMPEL: I agree. I think that we still have a huge gap in our understanding of
what people consider to be "data."
PARTICIPANT: I would like to see a universal approach or guidelines in relation to data citation
and attribution. I think that it is very encouraging that we have people representing different
fields and areas looking at the same set of problems here. Let us just try to be simple and work
on things gradually. These discussions and emerging tools and technologies have tremendous
motivation for publishers and researchers, and I think that this meeting is a very good starting
point. We might not come up with the best solution right now, but as time goes on, I think it is
very encouraging.
PARTICIPANT: Learning from each other is a useful approach as well. I have been working
towards citable references for a long time and have had this subject as a priority; it has been less
of a priority in other disciplines. Other disciplines suffer from this syndrome of incrementally
refined datasets. For example, sequences from GenBank are refined by many people and that
makes it a complicated co-authorship situation. Have you run into that in ICPSR and, if so, how
do you handle such a situation in a cited reference?
DR. VARDIGAN: We have some instances of data that have multiple contributors. Some
datasets have over a hundred contributors. We have just used "Name, et al." to acknowledge the
variety of people involved. We do not have a specific approach to deal with such a situation.
PARTICIPANT: I would be very interested if Dr. Vardigan or other colleagues can talk about
third-party metadata. For example, if there is a record put somewhere with an appropriate link
that states, "this sample did not include cell phones," this would tell us that the sample is biased.
That would be a really useful approach to have.
DR. VARDIGAN: I do not know of anything like that currently in existence, but we all rely on
the scientific method. If a paper is published and others decide to make judgments about its
merits and publish something themselves about the quality of the data or its content, they can do
that. As a data center, ICPSR does write what we consider to be comprehensive metadata
references, and we track publications based on our data.
PART THREE
LEGAL, INSTITUTIONAL, AND SOCIO-CULTURAL ASPECTS