1

Introduction

OVERVIEW

Mathematics is facing a pivotal junction where it can either continue to utilize digital mathematics literature in ways similar to traditional printed literature, or it can take advantage of new and developing technology to enable new ways of advancing knowledge. This report details how information contained in individual items within the literature could be readily extracted and linked to create a comprehensive digital mathematics information resource that is more than the sum of its contributing publications. That resource can serve as a platform and focal point for further development of the mathematical knowledge base.

This new system, referred to throughout the report as the Digital Mathematics Library (DML), could support a wide variety of new functionalities and services over aggregated mathematical information, including dramatically improved capabilities for searching, browsing, navigating, linking, computing, visualizing, and analyzing the literature.

STUDY DEFINITION AND SCOPE AND THE COMMITTEE’S APPROACH

The Alfred P. Sloan Foundation commissioned this study and charged the committee to:

  • Evaluate the potential value of a virtual global library of mathematical science publications;


  • Assuming that a stable context for sharing copyrighted information has been achieved, assess the remaining issues to be addressed in setting up such a library;

  • Identify a range of desired capabilities of such a library; and

  • Characterize resource needs.

While a traditional library is perhaps the oldest formal information resource available, the manifestation of libraries has evolved dramatically over the past few decades. In many cases within mathematics, as for other fields of scholarship, buildings housing paper publications have given way to online collections of downloadable documents. While this increased access is not perfect—not all material is readily available to all researchers, and search tools vary from site to site—widespread digitization has made it easier for many to access the mathematical literature. Overall, a much greater proportion of the mathematical literature is available to more people than at any time before. The research libraries, scholarly societies, and other players that curate and steward this material continue to grapple with issues, such as long-term preservation of digital materials, but it is fair to say there exists a fairly comprehensive, distributed “digital library” for mathematics offering a much improved but not fundamentally different version of what existed in the time of printed books and journals.

The committee has thus taken the term library in its charge to mean a system that accumulates and shares knowledge, rather than the more traditional library that houses documents, either digital or physical. The committee’s focus has been on functionality that can meet the needs of mathematicians facing a rapidly expanding and diversifying knowledge base. The committee has largely ignored traditional issues of assembling and stewarding those collections, which are being handled well, for the most part, by the existing distributed digital library.

The committee envisions its target digital library users to be working research mathematicians and advanced graduate students beginning their research careers throughout the world (hence the word global). The library discussed does not specifically target students below the advanced graduate student level or researchers outside of mathematics, although both sets would likely constitute some of the library’s user base. Having a clear understanding of the target user base directly impacts the types of content the library targets and the types of services it provides. The committee also believes that the disciplinary scope of the mathematics that this library could provide is best left undefined for now. Mathematics and the mathematical sciences have diffuse boundaries, and this committee takes no stance on where appropriate content lies. However, this is an issue that will have to be addressed by either a future management organization or the community of users.

The committee believes that there is much room for innovation and progress in the mainstream mathematical information services. To determine which potential areas for innovation are of the most interest to the mathematics community, the committee held three meetings where it heard from outside presenters on issues relevant to mathematics (November 27-28, 2012; February 19-20, 2013; and May 30-31, 2013—agendas for these meetings can be found in Appendix A) and two public data-gathering sessions (at the University of Minnesota on May 6, 2013, and at Northwestern University on May 30, 2013), posted questions on two mathematics discussion forums (MathOverflow[1] and Math 2.0[2]), and wrote a guest entry on Professor Terry Tao’s mathematics blog.[3] The committee also referred to the information shared at the World Digital Mathematics Library workshop held by the International Mathematical Union (IMU) on June 1-3, 2012.[4]

The committee made an assessment of what computers can do today, what computers can help mathematicians to do, and how rapidly these capabilities are likely to grow, if provided with some ongoing focused research funding. The committee’s consensus is that by some combination of machine learning methods and community-based editorial effort, a significant portion of the information and knowledge in the global mathematical corpus could be made available to researchers as linked open data. Broadly defined, linked open data are structured data that are published in such a way that makes it easy to interlink them with other data, thereby making it possible to connect them with information from multiple sources. This connected data can provide a user with a more meaningful query of a subject by consolidating relevant information from a variety of places (e.g., in different research papers) and pulling out specific components that the user might be particularly interested in.
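As a rough illustration of the idea, the sketch below stores statements as subject-predicate-object triples, the basic shape of linked data, and combines two sources through the identifiers they share. Every identifier, predicate name, and record in it is invented for illustration and is not drawn from the report or from any real data set; a production system would presumably rely on RDF tooling rather than plain Python tuples.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object) from two separate sources.
# Every identifier below is made up for illustration.
source_a = [
    ("paper:101", "hasTitle", "On a hypothetical lemma"),
    ("paper:101", "hasAuthor", "person:alice"),
    ("paper:101", "usesConcept", "concept:compact-group"),
]
source_b = [
    ("paper:202", "hasTitle", "Another hypothetical note"),
    ("paper:202", "hasAuthor", "person:bob"),
    ("paper:202", "usesConcept", "concept:compact-group"),
    ("paper:202", "cites", "paper:101"),
]


def merge(*sources):
    """Combine triples from several sources, dropping exact duplicates."""
    return sorted(set().union(*sources))


def subjects_with(triples, predicate, obj):
    """Return all subjects that carry the given predicate/object pair."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


def describe(triples, subject):
    """Collect everything known about one subject, grouped by predicate."""
    facts = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            facts[p].append(o)
    return dict(facts)


graph = merge(source_a, source_b)

# One query now spans both sources because they share a concept identifier.
papers = subjects_with(graph, "usesConcept", "concept:compact-group")
print(papers)
print(describe(graph, "paper:202"))
```

Because both sources refer to the same concept identifier, a single query returns papers from each of them, which is the kind of consolidation across research papers that the paragraph above describes.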
The committee envisions that much of the existing mathematical information can be provided as linked open data through a central organizational entity—referred to in this report as the DML. It should be noted that linked open data are not the only way that this can be accomplished, but they are essentially today’s standard for ontologies and other important representations. The committee believes that the DML should make use of current best practices rather than trying to develop some other alternative, whenever possible.

[1] I. Daubechies, “Math Annotate Platform?,” MathOverflow (question and answer site), February 18, 2013, http://mathoverflow.net/questions/122125/math-annotate-platform.
[2] I. Daubechies, “Math Annotate Platform?,” Math2.0 (discussion forum), February 18, 2013, http://publishing.mathforge.org/discussion/163/.
[3] I. Daubechies, “Planning for the World Digital Mathematical Library,” What’s New (blog by Terence Tao), daily archive for May 8, 2013, http://terrytao.wordpress.com/2013/05/08/.
[4] Many of the materials presented at the International Mathematical Union’s DML workshop can be found at http://ada00.math.uni-bielefeld.de/mediawiki-1.18.1/index.php/, updated April 23, 2013.

STRUCTURE OF THE REPORT

This report consists of five main chapters and several appendices. The rest of this chapter discusses previous digital mathematics library efforts, the universe of mathematical information, relevant conceptual tools, and current mathematical resources. Chapter 2 discusses what is missing from the mathematical information landscape and what gaps the DML would fill, and elaborates on the desired DML capabilities from a user’s perspective. This includes a discussion of what types of features would make the mathematical literature and current resource capability more meaningful to a mathematical researcher. Chapter 3 discusses some of the broad issues that the DML would face during development, including developing partnerships, managing large data sets, navigating open access, and planning for system and data maintenance. Chapter 4 provides a strategic plan for the development of the DML, including a discussion of fundamental principles, the constitution of a governing organization, steps toward initial development, and resources that would be needed. Chapter 5 discusses some details of entity collections and technical considerations for the DML that will be needed to make the features and capabilities discussed in Chapter 2 a reality.

In preparing this report, the committee reviewed many existing digital resources for mathematics, as well as relevant initiatives in some other sciences. A brief discussion of these tools is given in Appendix C.

PREVIOUS DIGITAL MATHEMATICS LIBRARY EFFORTS

The idea of a comprehensive digital mathematics library has been around for decades, and there have been several incarnations of the idea with different foci.
The first step in this vision was retrospective digitization of the older parts of the literature that did not already exist in digital form, and this has largely been achieved (though the quality, and hence utility, of these converted materials varies widely, ranging from simple page scans to carefully proofread markups).

The Cornell University Digital Mathematics Library Planning Project was funded by the National Science Foundation from 2003 to 2004 as a step “toward the establishment of a comprehensive, international, distributed collection of digital information and published knowledge in mathematics.”[5] Its vision statement reads as follows:

    In light of mathematicians’ reliance on their discipline’s rich published heritage and the key role of mathematics in enabling other scientific disciplines, the Digital Mathematics Library strives to make the entirety of past mathematics scholarship available online, at reasonable cost, in the form of an authoritative and enduring digital collection, developed and curated by a network of institutions.

[5] Cornell University Library, Digital Mathematics Library. S.E. Thomas, principal investigator, R.K. Dennis and J. Poland, co-principal investigators, http://www.library.cornell.edu/dmlib/, last updated December 2, 2004.

A follow-up report from the International Mathematical Union (IMU, 2006) shared this vision of a distributed collection of past mathematical scholarship that served the needs of all science, and it encouraged mathematicians and publishers of mathematics to join together in implementing this vision. However, it was clear within a few years that this vision was not going to become a reality soon. As David Ruddy of Project Euclid wrote (Ruddy, 2009):

    The grand vision of a Digital Mathematics Library, coordinated by a group of institutions that establish policies and practices regarding digitization, management, access, and preservation, has not come to pass. The project encountered two related problems: it was overly ambitious, and the approach to realizing it confused local and community responsibilities. While the vision called for a network of distributed, interoperable repositories, the committee approached and planned the project with the goal of building a single, unified library.

At the time of this study, there has been some progress in this vision of a single, unified library in the form of the European Digital Mathematics Library (EuDML) project.[6] The EuDML project, funded from 2010 to 2013 by the European Commission, created a network of 12 European repositories acquiring selected mathematical content for preservation and access and made progress in establishing a single distributed library with a collection of about 225,000 unique items, spanning 2.6 million pages.
The EuDML succeeded in creating a unified metadata framework[7]—which includes items about a document such as the title, authors, abstract, comments, report number, category, journal reference, digital object identifier, Mathematics Subject Classification (MSC), and Association for Computing Machinery (ACM) computing classification—that is shared by these repositories and providing a single point of access to publications in these repositories, albeit with limited rights to search the full text from some sources. Impressive as the EuDML is, when compared to the full size and scope of the universe of published mathematics (described in the next section), and given the essential requirement to integrate with copyrighted materials and the clear desirability and cost-effectiveness of leveraging existing repositories and services, the EuDML experience only emphasizes the difficulties inherent in aiming for a single, centrally managed and truly comprehensive collection of digitized mathematics as the cornerstone for a comprehensive DML. With the advent of recent advances in technology and the advantage of experience gained on EuDML and other projects, the study committee concluded that a more effective approach going forward would be to partner with existing content providers and focus instead on the innovations and elements of shared infrastructure and knowledge management that are not being adequately addressed by other entities (i.e., rather than on central harvesting and aggregation of primary content). The committee believes that this vision is consistent with the original vision of the EuDML, although it was not realized by that project.

[6] T. Bouche, Université de Grenoble, “From EuDML to WDML: Next Steps,” Presentation to the committee on November 27, 2012.
[7] European Digital Mathematics Library, “Appendix, EuDML Metadata Schema (Final)/Tagging Best Practices,” in EuDML Metadata Schema Specification (v2.0-final), https://project.eudml.org/sites/default/files/d36-appendix_uncropped.pdf, accessed January 16, 2014.

Another example of an online resource that helps users connect with knowledge is the National Science Digital Library (NSDL).[8] NSDL is an online educational resource for teaching and learning, with current emphasis on the sciences, technology, engineering, and mathematics. NSDL does not hold content directly—instead, it provides structured metadata about Web-based educational resources held on other sites by providers who contribute this metadata to NSDL for organized search and open access to educational resources via NSDL.org and its services. A discussion of many other efforts and current digital resources can be found in Appendix C.

The Alfred P. Sloan Foundation supported a World Digital Mathematics Library workshop in June 2012,[9] which was planned by the IMU’s Committee on Electronic Information and Communication.
This workshop provided a wealth of information to the committee on the current state of the art and research efforts aimed at making the World Digital Mathematics Library a reality.

[8] National Science Digital Library, http://nsdl.org/, accessed January 16, 2014.
[9] International Mathematical Union, “The Future World Heritage Digital Mathematics Library: Plans and Prospects,” updated April 23, 2013, http://ada00.math.uni-bielefeld.de/mediawiki-1.18.1/index.php/Main_Page.

Much of the straightforward work of assembling digital mathematics libraries has been done (e.g., digitizing material, aggregating it into small to medium-sized collections). The difficulties that the EuDML faced in creating a single large aggregation of mathematics literature and the difficulty of other World Digital Mathematics Library efforts in gaining community support indicate that these challenges are unlikely to be overcome soon. The committee notes that there has been sizable ongoing investment from publishers (both commercial and noncommercial) to retrospectively digitize historical runs of their copyrighted journals and also, in many cases, even earlier historical materials that are now out of copyright, in order to capture comprehensive representations of their journals. However, broad services such as Google Scholar now provide much of the functionality that many of these specialized efforts had hoped to achieve in building comprehensive and coherent collections of the mathematical literature. Such services achieve this functionality by searching across a range of repositories, rather than trying to collect all of the material in one (or a very few) repositories. In the committee’s view, efforts to build centralized comprehensive resources are reaching a point of diminishing returns.

Finding: The construction of mathematical libraries through centralized aggregation of resources has reached a point of diminishing returns, particularly given that much of this construction has been coupled with retrospective digitization efforts.

While there is still a substantial amount of historical (mostly out of copyright) mathematical literature that would benefit from retrospective digitization, or higher quality digitization than has currently been done, the committee does not believe that there is justification for a major new program and investment in this area. In particular, although there is value in modest, sustained investment in existing efforts, these will make only incremental contributions. While the fundamental importance of the heritage literature remains, its size, as a fraction of the overall mathematics literature, is diminishing steadily. No amount of additional retrospective digitization will fundamentally change how the mathematical literature can be used or evolved to meet new research needs.
Moreover, while the historical (e.g., out of copyright) segments of the mathematical literature are valuable, any genuinely meaningful large-scale change in accessing the mathematical literature and knowledge base must encompass not only heritage but also current literature. Thus, the committee believes that a very different set of investments (as described in this report) is where the transformative opportunities await.

The next section provides some more detailed information on the existing landscape of mathematical literature and how much has been digitized.

THE UNIVERSE OF PUBLISHED MATHEMATICAL INFORMATION

Mathematics shares more with the arts than the sciences, in that its primary data are human creations, perhaps representations of ideas in a platonic realm, rather than data derived by observation or measurement of the physical universe. Mathematical information is primarily mined from its own literature or derived by computation. This section describes the state of mathematical publishing and the world of mathematical objects that exist within the publications.

Digital Mathematical Publications

Most of the mathematics literature of the 20th century is now available digitally. Through the Jahrbuch Electronic Research Archive for Mathematics[10] project and the independent efforts of publishers and others, much of the most important mathematical research of the last half of the 19th century also has been digitized. Appendix C provides an overview of the many sources for digitized mathematical source material, including repositories and many other types of sources, whether freely accessible or behind paywalls (and thus only accessible to subscribers). A large part of the mathematics literature in electronic form consists of papers written in the past 20 years. This portion of the literature is searchable and navigable by any user of a library with access to the main subscription services controlled by libraries and publishers.

In addition, a considerable body of the heritage literature in mathematics has been digitized over the past 15 years. The most comprehensive listing of the retro-digitized mathematics literature is Ulf Rehmann’s list of Retrodigitized Mathematics Journals and Monographs,[11] which is a list of titles of serials and books that have been digitized without metadata.[12] Much of this metadata has found its way into indexes maintained by Google, MathSciNet, and Zentralblatt (zbMATH).[13]

The digital corpus of mathematics literature is extensive. The MathSciNet[14] database includes approximately 2.9 million publications from 1940 to the present, with direct links to 1.7 million of them. MathSciNet currently indexes more than 2,000 journal/serial titles and contains about 100,000 books (post 1960). Of the items currently available on MathSciNet, 2.6 million of them are from the 1970s or later, and 1.7 million are from 1990 onward.
The American Mathematical Society has kept track of new journal titles in the field since 1997, and there has been an average growth of about 40 new journal titles per year in mathematics. zbMATH (1931-present) contains more than 3 million publications and currently indexes approximately 3,500 journals.

[10] The Jahrbuch Project, Electronic Research Archive for Mathematics, last modified October 31, 2006, http://www.emis.de/projects/JFM/.
[11] DML: Digital Mathematics Library, http://www.mathematik.uni-bielefeld.de/~rehmann/DML/dml_links.html, accessed January 16, 2014.
[12] Metadata are broadly defined as data about data. In the case of a typical mathematics journal digital publication, metadata may include information such as author, journal name and volume, date of publication, time of file creation, size of file.
[13] zbMATH, http://zbmath.org/, accessed January 16, 2014.
[14] American Mathematical Society, MathSciNet, http://www.ams.org/mathscinet/, accessed January 16, 2014.

The annual production of mathematics papers is more difficult to quantify. There has been a steady increase in the number of math papers added to arXiv[15] over the past 5 years (shown in Table 1-1), although it is not clear from these data if this shows an increase in mathematics publications or an increase in mathematicians’ willingness to post their papers. Annual entries on MathSciNet and the number of mathematics papers listed in Web of Science[16] have both remained relatively constant around 90,000 and 20,000, respectively (see Tables 1-2 and 1-3).

Components of the digitized corpus of mathematics are increasingly included in a variety of stable, well-curated repositories, although access to much of this corpus remains limited by copyright or other intellectual rights restrictions. For example, in terms of retrospectively digitized works cataloged under the subject heading (or subheading) of “mathematics,” the HathiTrust Digital Library[17] includes approximately 40,000 bibliographically distinct resources.[18] Of these, only 6,800 were digitized from public-domain works; the rest were digitized from copyrighted originals. These numbers are a mix of monograph titles and serial titles (a serial title in HathiTrust typically encompasses a complete run of a journal, edited series, or conference publication series). Each serial run could be expected to include tens or even hundreds of issues, with each issue containing at least several articles or papers. In terms of pages, using the HathiTrust repository-wide ratio of pages per bibliographic resource to estimate, this translates to a rough estimate of 25.5 million pages of retrospectively digitized mathematics in HathiTrust with approximately 17 percent (6,800 out of 40,000) digitized from public-domain sources.
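The arithmetic behind the HathiTrust figures can be checked in a few lines. The sketch below uses only the numbers quoted above. The pages-per-resource ratio is not stated in the text, so it is back-computed from the 25.5 million page estimate and should be read as an implied value, not a reported one.

```python
# Figures quoted in the text for HathiTrust "mathematics" holdings.
total_resources = 40_000          # bibliographically distinct resources
public_domain_resources = 6_800   # digitized from public-domain works
estimated_pages = 25_500_000      # rough page estimate given in the text

# Share of resources digitized from public-domain sources.
public_domain_share = public_domain_resources / total_resources

# Implied pages per bibliographic resource (an inference, not a quoted figure).
implied_pages_per_resource = estimated_pages / total_resources

# Resources digitized from copyrighted originals.
copyrighted_resources = total_resources - public_domain_resources

print(f"public-domain share: {public_domain_share:.0%}")
print(f"implied pages per resource: {implied_pages_per_resource:.1f}")
print(f"copyrighted resources: {copyrighted_resources:,}")
```

The implied ratio is large because serial titles, each covering a full run of a journal or series, are counted as single resources alongside monographs.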
The basic trends seem clear: more and more of the corpus of mathematical literature will be in digital form, including some with high-quality markup, specifically those items that are “born” digital or retro-digitized to be in a machine readable format and that use typesetting such as LaTeX or MathML (as opposed to page images of publications). As mentioned before, the fraction of the overall corpus that is pre-1970 is rapidly diminishing due to the relative explosion in the annual rates of publication in recent decades (however, this should in no way be seen as diminishing the fundamental importance of heritage literature).

[15] arXiv, http://arxiv.org/, accessed January 16, 2014.
[16] Thomson Reuters, “Web of Science Core Collection,” http://thomsonreuters.com/web-of-science/, accessed January 16, 2014.
[17] HathiTrust Digital Library, http://www.hathitrust.org/, accessed January 16, 2014.
[18] Current as of September 2013.

TABLE 1-1  Number of Mathematics Papers Added to arXiv Annually Between 2008 and 2012

Year    Mathematics Papers Added to arXiv
2008    14,373
2009    16,319
2010    18,765
2011    21,287
2012    24,176

SOURCE: arXiv, http://arxiv.org/, accessed January 16, 2014.

TABLE 1-2  Number of Articles in Research Journals in MathSciNet Annually Between 2006 and 2012

Publication Year    Entries in MathSciNet
2006                76,187
2007                81,638
2008                86,533
2009                87,279
2010                87,162
2011                89,638
2012                92,191

NOTE: A steady growth of about 3 percent per year is seen.
SOURCE: American Mathematical Society, MathSciNet, http://www.ams.org/mathscinet/, accessed January 16, 2014.

TABLE 1-3  Mathematics Papers Listed in Web of Science Annually Between 2008 and 2012

Year    Mathematics Papers Listed in Web of Science
2008    20,908
2009    22,390
2010    22,079
2011    22,716
2012    23,760

SOURCE: Thomson Reuters, “Web of Science Core Collection,” http://thomsonreuters.com/web-of-science/, accessed January 16, 2014.
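One way to compare the three series is a compound annual growth rate between the first and last year of each table. The sketch below is an illustrative calculation over the table values only. The choice of a compound rate is an assumption made here, since the tables do not say how the "about 3 percent" in the note to Table 1-2 was derived.

```python
# Annual counts copied from Tables 1-1, 1-2, and 1-3.
arxiv = {2008: 14_373, 2009: 16_319, 2010: 18_765, 2011: 21_287, 2012: 24_176}
mathscinet = {
    2006: 76_187, 2007: 81_638, 2008: 86_533, 2009: 87_279,
    2010: 87_162, 2011: 89_638, 2012: 92_191,
}
web_of_science = {
    2008: 20_908, 2009: 22_390, 2010: 22_079, 2011: 22_716, 2012: 23_760,
}


def annual_growth(series):
    """Compound annual growth rate between the first and last year."""
    years = sorted(series)
    first, last = years[0], years[-1]
    return (series[last] / series[first]) ** (1 / (last - first)) - 1


for name, series in [
    ("arXiv", arxiv),
    ("MathSciNet", mathscinet),
    ("Web of Science", web_of_science),
]:
    print(f"{name}: {annual_growth(series):.1%} per year")
```

Computed this way, arXiv additions grow at roughly 14 percent per year, while the MathSciNet and Web of Science counts each grow at a little over 3 percent per year, in line with the note to Table 1-2.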

Objects in the Mathematical Literature

Information found in the mathematical literature is diverse but largely falls into two main categories:

  1. Bibliographic information, such as
     a. Documents (e.g., articles, books, proceedings, talks, diagrams, homepages, blogs, videos);
     b. People (e.g., authors, editors, referees, reviewers);
     c. Events (e.g., discoveries, publications, conferences, talks, births, deaths, degrees, awards);
     d. Organizations (e.g., universities, publishers, journals, libraries, service providers);
     e. Subjects (e.g., major branches of mathematics—algebra, geometry, analysis, topology, probability, statistics—as well as their intersections and interactions and their various subbranches, down to even finer topics and including ubiquitous mathematical terms like “number,” “set”)
  2. Mathematical concepts (e.g., axioms, definitions, theorems, proofs, formulas, equations, numbers, sets, functions) and objects (e.g., groups, rings).

Collecting and aggregating mathematical bibliographic information has been the path many digital libraries and digital resources have taken in the past (Chapter 2 and Appendix C discuss many of these efforts to date). While there are many challenges in collecting this information, the even more difficult work lies in collecting mathematical concepts, which lack the standardization that most bibliographic information has acquired. However, an ability to explore these mathematical objects within the literature offers the potential to uncover currently under-explored connections in mathematics.

The recent National Research Council report The Mathematical Sciences in 2025 (NRC, 2013) discusses the importance of mathematical structures, which are part of the larger mathematical concepts described above:

    A mathematical structure is a mental construct that satisfies a collection of explicit formal rules on which mathematical reasoning can be carried out. . . . What is remarkable is how many interesting mathematical structures there are, how diverse are their characteristics, and how many of them turn out to be important in understanding the real world, often in unanticipated ways. Indeed, one of the reasons for the limitless possibilities of the mathematical sciences is the vast realm of possibilities for mathematical structures. . . . A striking feature of mathematical structures is their hierarchical nature—it is possible to use existing mathematical structures as a foundation on which to build new mathematical structures . . . . Mathematical structures provide a unifying thread weaving through and uniting the mathematical sciences. (pp. 29-30)

Given the size, diversity, and inherent nature of mathematics information in categories 1 and 2 above, it is clearly not sufficient to simply provide undifferentiated access to the universe of mathematics monographs, journal articles, and conference papers. Instead, the online research literature of mathematics must be organized into a well-structured network of resources linked together based on a variety of attributes—bibliographic and topical, of course, but also linked in a highly granular fashion on commonalities of mathematical structures and the shared use of mathematical objects, reasoning, and methodologies. The committee believes that the greatest potential for the DML lies in providing mathematicians access to a well-structured network of information and building services that both enhance and utilize this data. In the context of today’s Web environment, a well-structured network implies adherence to the Semantic Web[19] and linked open data principles and to community-endorsed standards and best practices. While the foundation for such a well-structured network of digital research mathematics exists in established repositories and component digital libraries, the underlying thesauri and ontologies of mathematical objects do not yet exist (or have not yet been given permanence and formal identity), and the agreements on best practices for interoperability and the implementation of linked open data principles in the context of research mathematics repositories have not yet been reached.

CONCEPTUAL TOOLS

General conceptual tools that are used to structure, organize, represent, and share knowledge include the closely related ideas of ontologies, taxonomies, and vocabularies.
There is considerable debate about the precise definitions and differences among these tools, although ontologies (most commonly viewed as a tool for defining some classes of objects, the attributes that these objects may have, and the way in which these objects may be related to each other) are usually seen as the most general formulation (Gruber, 2009). Taxonomies are specific, usually hierarchical, collections of terms that can be used to describe or classify objects in some contexts; examples include subject headings and the naming schemes used in biological systematics. “Controlled” vocabularies are collections of values that can be used to populate specific instances of object attributes within an ontology; in a certain sense, they are equivalent to taxonomies in that they can be used to classify. However, controlled vocabularies are often “flat,” without other internal structure among the possible values, whereas taxonomies commonly include very rich internal hierarchical structure.

Ontologies, vocabularies, and taxonomies work together. As a simple example, a part of an ontology might define a specific class of objects called documents; each of these has attributes that include subjects and languages. One might have a list of possible language values (a controlled vocabulary) associated with the ontology and also a tree structure of subject headings (a taxonomy, though it could also be viewed as a simple vocabulary). For instance, within the mathematical sciences, the widely accepted Bibliographic Ontology20 provides a fairly adequate accounting of the many common relations between objects in categories 1a through 1e listed above. The BibTeX21 schema that describes the structure of BibTeX records defines a similar ontology. The Citation Typing Ontology (CiTO)22 is an ontology for description of the citation relation between documents. The Mathematics Subject Classification (MSC2010)23 provides a very well thought out, largely hierarchical taxonomy for the classification of mathematical documents by subject, and thence for the subjects themselves. OpenMath,24 discussed further in Chapter 5, offers a potential standard for representing the semantics of mathematical objects that is very relevant to the DML’s goals.

The application of such ontologies to a mathematical objects data set can create graphical structures of information that can provide new insights. For instance, citations generate a citation graph, and collaborations generate a collaboration graph. Such graphical structures are commonly embedded in the structure of hyperlinked webpages, thereby connecting literature that was not obviously related otherwise.

19  W3C, “Semantic Web,” http://www.w3.org/standards/semanticweb/.
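The interplay of ontology, controlled vocabulary, and taxonomy described in this section, and the graphs that such relations induce, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Document` class, the language codes, the subject labels (which are not MSC2010 codes), and the sample records do not come from the Bibliographic Ontology, CiTO, or any other standard named above.

```python
from dataclasses import dataclass, field

# Controlled vocabulary: a flat set of allowed language values (illustrative).
LANGUAGES = {"en", "fr", "de", "ru"}

# Taxonomy: a hierarchy of subject headings stored as child -> parent.
# These labels are invented for the example; they are not MSC2010 codes.
SUBJECT_PARENT = {
    "mathematics": None,
    "algebra": "mathematics",
    "group theory": "algebra",
    "analysis": "mathematics",
}


@dataclass
class Document:
    """One class of a toy ontology: its attributes and two relations."""
    key: str
    language: str                                 # value from the vocabulary
    subject: str                                  # node of the taxonomy
    authors: list = field(default_factory=list)   # "written by" relation
    cites: list = field(default_factory=list)     # "cites" relation (keys)

    def __post_init__(self):
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")
        if self.subject not in SUBJECT_PARENT:
            raise ValueError(f"unknown subject: {self.subject}")


def broader_subjects(subject):
    """Walk up the taxonomy from a subject heading to the root."""
    path = []
    while subject is not None:
        path.append(subject)
        subject = SUBJECT_PARENT[subject]
    return path


def citation_graph(docs):
    """Directed graph: document key -> keys of the documents it cites."""
    return {d.key: sorted(d.cites) for d in docs}


def collaboration_graph(docs):
    """Undirected graph: author -> set of coauthors."""
    graph = {}
    for d in docs:
        for a in d.authors:
            graph.setdefault(a, set()).update(b for b in d.authors if b != a)
    return graph


# Hypothetical sample records.
docs = [
    Document("d1", "en", "group theory", authors=["A", "B"]),
    Document("d2", "fr", "algebra", authors=["B", "C"], cites=["d1"]),
]
```

The flat set plays the role of the controlled vocabulary, the parent map plays the role of the taxonomy, and the two graph functions show how the citation and collaboration graphs fall out of the relations that the class declares.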
Development of new ontologies is a complex process requiring a high level of community effort for consensus, even for limited sets of relations. The committee expects that when communities start to curate various digital collections of records of mathematical entities, there will be some “bottom up” development of at least minimal ontologies for these entities, as has already occurred with MSC2010 and OpenMath. The structure of these ontologies will be reflected in the necessary schemas25 for description of the objects they involve, and the graphical relations induced by these ontologies will be of potentially great interest in the process of extracting information and knowledge from mathematical publications.

CURRENT MATHEMATICAL RESOURCES

The management of formal representations of mathematical concepts is known as mathematics knowledge management (Carette and Farmer, 2009). In this report, this issue is viewed more broadly as the management of mathematical information and concepts, both formal and informal, including the bibliographic information and mathematical concepts categories of objects introduced in the previous section, only the latter of which can be usefully regarded as part of mathematics itself.

Bibliographic Resources in Mathematics

Several general bibliographic resources exist, and some of these are described in Appendix C. Among them, mathematicians typically use Google26 and Google Scholar27 most often, although CrossRef28 is “under the hood” whenever a user navigates from one publisher’s site to another by a reference link. While many mathematicians heavily utilize these general information services because of their power and ubiquity, some mathematicians prefer the discipline-specific abstracting and indexing services provided by MathSciNet29 and zbMATH.30 This discipline-specific service preference is partly for historical reasons and partly because the focus and quality of metadata provided by these services in mathematics makes it easier to find publications of interest. Both services offer bibliographic entries in BibTeX,31 which is machine-readable and reusable, for preparation of reference lists for LaTeX32 documents, and, with more technical effort, for publication of online bibliographies in HTML33 or JSON.34 Using search engines with access to well-curated bibliographic metadata and full-text indexing is how most mathematicians find mathematical primary sources today.

20  The Bibliographic Ontology, “Bibliographic Ontology Specification,” dated November 4, 2009, http://bibliontology.com/specification.
21  BibTeX, http://www.bibtex.org/, accessed January 16, 2014.
22  CiTO, the Citation Typing Ontology, dated March 7, 2013, http://purl.org/spar/cito/.
23  Encoded by the Mathematics Subject Classification (MSC2010), American Mathematical Society, http://www.ams.org/mathscinet/msc/msc2010.html, accessed January 16, 2014.
24  OpenMath Society, OpenMath, http://www.openmath.org/, accessed January 16, 2014.
25  A schema is broadly defined as a representation of a plan or theory in the form of an outline or model.
26  Google, https://www.google.com/, accessed January 16, 2014.
27  Google Scholar, http://scholar.google.com/, accessed January 16, 2014.
28  CrossRef, http://www.crossref.org/, accessed January 16, 2014.
29  American Mathematical Society, MathSciNet, http://www.ams.org/mathscinet/, accessed January 16, 2014.
30  zbMATH, http://www.zentralblatt-math.org/zmath/, accessed January 16, 2014.
31  BibTeX, http://www.bibtex.org/, accessed January 16, 2014.
32  LaTeX—A document preparation system, last revised January 10, 2010, http://www.latex-project.org/.
33  “HTML,” Wikipedia, http://en.wikipedia.org/wiki/HTML, accessed January 16, 2014.
34  “Introducing JSON,” http://www.json.org/, accessed January 16, 2014.
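The reuse of BibTeX entries for online bibliographies in JSON, mentioned above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a general BibTeX parser: it handles only a single entry whose field values are wrapped in one level of braces, the function name `bibtex_to_dict` and the citation key are hypothetical, and it does not use any interface of MathSciNet or zbMATH. The sample entry is built from the Aigner and Ziegler item in this chapter’s reference list.

```python
import json
import re

# "@type{key, ...}" with the field list captured as one group.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*)\}", re.DOTALL)
# "name = {value}" where the value contains no nested braces.
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def bibtex_to_dict(entry):
    """Convert one simple BibTeX entry into a plain dictionary."""
    match = ENTRY_RE.search(entry)
    if match is None:
        raise ValueError("not a recognizable BibTeX entry")
    entry_type, key, body = match.groups()
    record = {"type": entry_type.lower(), "key": key}
    for name, value in FIELD_RE.findall(body):
        # Collapse runs of whitespace, including line breaks inside a value.
        record[name.lower()] = " ".join(value.split())
    return record


# Sample entry; the citation key is made up for this example.
SAMPLE = """
@book{aigner2010proofs,
  author    = {Aigner, M. and Ziegler, G. M.},
  title     = {Proofs from THE BOOK},
  publisher = {Springer-Verlag},
  address   = {Berlin},
  year      = {2010},
  doi       = {10.1007/978-3-642-00856-6}
}
"""

record = bibtex_to_dict(SAMPLE)
print(json.dumps(record, indent=2))
```

Real BibTeX files also contain nested braces, quoted values, string macros, and comments, so a production tool would rely on a dedicated parser rather than two regular expressions.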

MathSciNet, zbMATH, and Google Scholar provide complementary and somewhat overlapping services. One distinct difference is that MathSciNet is organized chronologically and referentially, while Google Scholar is based on “importance” as qualified by page ranks or some variant thereof. Both are important and are used in literature searches. MathSciNet is well suited to tasks such as listing all articles by an author or all articles in a specific mathematical field, and it has the high-quality metadata needed for many purposes. Its search capabilities are limited, however, because it searches only over metadata. Google Scholar is often better for searches because it searches over full text, including reference lists, and ranks results better for most purposes. One issue that some mathematicians have with Google Scholar is that searches cannot be limited to mathematics or its subfields. MathSciNet, zbMATH, and Google Scholar combined do a good job of providing conventional discovery over the corpus of traditionally published mathematical literature, but no service currently provides a finer-grained search capability that allows a user to search for mathematical objects or ideas that cannot be easily defined by text search, such as an equation or the evolution of a specific notation. Ideally, a mathematician should have the best of both capabilities through a single interface, but this is challenging because neither MathSciNet nor Google Scholar currently allows its data to be merged with the other’s.

Mathematicians also make extensive use of arXiv as a platform for sharing preprints and keeping up with current research developments. Mathematicians strongly support arXiv in part because the full text is largely indexed and exposed to the Web through search engines.
However, arXiv items are not indexed by services such as MathSciNet or zbMATH; such indexing would help connect these items to the rest of the literature. Search tools associated with distinct subsets of the literature, such as arXiv, publisher-based repositories, library catalogs, and academic institutional repositories, provide overlapping access to the mathematical literature. Unfortunately, the present configuration of these discipline-specific tools does not provide a single information source where mathematicians can find and access information from diverse sources, and the more general information sources often lack the mathematical metadata and details that make mathematics literature easy to search and browse.

Combining data from multiple information resources (e.g., Google, MathSciNet, zbMATH) is complicated. Partnering organizations would have to allow their data to be collected, reused, or recombined on a large scale, which many services are hesitant to do. Even seemingly open resources (such as arXiv) may have legal restrictions on outside data aggregation, depending on what is done with the data. This collaboration would have to be negotiated between potential partners with the goal of creating a unified view of the mathematics literature. Some approaches toward developing partnerships and relevant examples are discussed in Chapter 3.

Given the central importance of bibliographic data searches and the repeated use of bibliographic information by researchers in preparation of research articles, it is essential for the DML to provide adequate bibliographic support tools with access to the best available bibliographic data in mathematics and related fields. Ideally, it should support advanced bibliographic data processing to detect and identify the structure of networks of papers, authors, topics, and the like. The foundations of such bibliographic data processing are provided by the larger existing bibliographic services in mathematics and beyond, especially MathSciNet, zbMATH, and Google Scholar, which are the most commonly used by mathematicians. At present, none of these services provides an application programming interface (API) for programmatic access, and none of them allows its data to be downloaded in bulk, except with severe restrictions on what can be done with it. To provide the greatest benefit to users of a DML, that would have to change. Both EuDML and Microsoft Academic Search provide steps in a positive direction with more or less open bibliographic data stores with an API for access, which allows tools and services to be built over the corpus.

To seriously engage the mathematics world with a digital library system, extensive coverage of mathematical information is essential. The committee considered whether the DML could initially focus on out-of-copyright material, but it concluded that there would not be community support or interest in this approach because it is too limited. On the other hand, much progress has been made in digitizing heritage content, and it is essential that this be integrated with the rest of the math literature base.
Specialized Mathematical Information Resources

General bibliographic services provide limited support for navigating and searching mathematical literature below the top five bibliographic classes (documents, people, events, organizations, subjects) discussed above. Beyond these five universal classes, information storage and retrieval for math-specific entities is fragmented and typically does not have links or references to the main indexing services.35

Research mathematics literature includes a diverse range of special objects (e.g., theorems, lemmas, functions, sequences) that are not represented adequately, or sometimes at all, in full-text indexing and article-level subject classification systems. Currently, these objects are computationally expensive and difficult to recognize through machine-based methods alone. Ontologies of objects, such as reference volumes that enumerate classes of functions, sequences, and other objects, have been developed and curated by mathematicians for centuries. These resources include mathematical handbooks, some of the most famous being the following:

• Abramowitz and Stegun (1972) and the subsequent Digital Library of Mathematical Functions,36
• The Bateman Manuscript,37
• Gradshteyn and Ryzhik (2007),
• Borodin and Salminen (2002), and
• The Princeton Companion to Mathematics (Gowers et al., 2008).

There are also examples of more recently developed resources that provide collections of some mathematical objects, including the following:

• Propositions: Wikipedia’s List of Theorems,38 Mizar39;
• Proofs: Proofs from the Book (Aigner and Ziegler, 2010), Mizar, Coq,40 and others41;
• Numbers: A Dictionary of Real Numbers (Borwein and Borwein, 1990);
• Sequences: The On-Line Encyclopedia of Integer Sequences (OEIS)42;
• Functions: Digital Library of Mathematical Functions,43 Wolfram MathWorld,44 Wolfram Functions Site45;
• Groups, rings, and fields: Wikipedia’s List of Simple Lie Groups,46 Wikipedia’s List of Finite Simple Groups,47 Centre for Interdisciplinary Research in Computational Algebra: Finite Fields,48 Sage’s Finite Fields49;
• Identities: Piezas50; Petkovsek et al. (1996);
• Inequalities: Wikipedia’s List of Inequalities,51 DasGupta (2008); and
• Formulas: Springer LaTeX Search,52 Hijikata et al. (2009), Kohlhase et al. (2012).

From a review of these lists, as well as the resources discussed in Appendix C, it is clear that authors and editors continue to be motivated to create and publish lists of various kinds of mathematical objects. Some of these lists, especially ones like tables of integrals and lists of sequences, provide very useful tools for mathematicians and other users of mathematics, especially when combined with computational resources. Wikipedia currently plays a key role in supporting distributed creation and maintenance of numerous lists of serious interest to mathematicians.

Lists and tables have been an essential part of mathematical research throughout history, and the vast majority of working mathematicians have made use of appropriate tables (or, more recently, the equivalent numerical or symbolic software) in the course of their research. The most basic are numerical tables (e.g., values of logarithms, trigonometric functions, various special functions, zeros of the zeta function, integer sequences). More sophisticated are lists of mathematical objects (e.g., indefinite and definite integrals, finite simple groups, Fourier transforms, partial differential equations and their solutions). At an even higher level are lists of theorems, concepts, etc.

At their most basic, tables provide a simple mechanism for speeding up research. Once one identifies that an object under investigation appears in a table, one can make use of prior knowledge about said object, thereby facilitating either applications or new advances in theory.

35  MathSciNet and zbMATH share the MSC2010 subject classification, which provides some basic filtering of bibliographic data by subject. ArXiv uses a coarser classification, which is, however, easily mapped to sets of top-level MSC2010 categories.
36  NIST Digital Library of Mathematical Functions, 2013, http://dlmf.nist.gov/.
37  “Bateman Manuscript Project,” Wikipedia, last modified July 24, 2013, http://en.wikipedia.org/wiki/Bateman_Manuscript_Project.
38  “List of Theorems,” Wikipedia, last modified December 9, 2013, http://en.wikipedia.org/wiki/List_of_theorems.
39  Mizar Home Page, last modified January 8, 2014, http://mizar.org/.
40  The Coq Proof Assistant, http://coq.inria.fr/, accessed January 16, 2014.
41  “Category:Proof assistants,” Wikipedia, last modified September 21, 2011, http://en.wikipedia.org/wiki/Category:Proof_assistants.
42  On-Line Encyclopedia of Integer Sequences® (OEIS®) Wiki, https://oeis.org/wiki/Welcome, accessed January 16, 2014.
43  NIST Digital Library of Mathematical Functions, 2013, http://dlmf.nist.gov/.
44  Wolfram MathWorld, http://mathworld.wolfram.com/, accessed January 16, 2014.
45  Wolfram Research, Inc., The Wolfram Functions Site, http://functions.wolfram.com/, accessed January 16, 2014.
46  “List of Simple Lie Groups,” Wikipedia, last modified March 30, 2013, http://en.wikipedia.org/wiki/List_of_simple_Lie_groups.
47  “List of finite simple groups,” Wikipedia, last modified December 18, 2013, http://en.wikipedia.org/wiki/List_of_finite_simple_groups.
Compiling a table is an important research contribution in its own right, helping codify the knowledge in a field, point out gaps therein, and inspire new research to fill in and extend what is known. Scanning a table often enables one to spot otherwise obscure patterns, leading to new theorems and new directions of research.

Sara Billey and Bridget Tenner wrote that a database for mathematical theorems would “enhance experimental mathematics, help researchers make unexpected connections between areas of mathematics, and even improve the refereeing process” (Billey and Tenner, 2013, p. 1093). Extensive lists could also enhance search and retrieval of mathematical information and allow for connections to be made between mathematical topics and objects.

Currently, there are no satisfactory indexes of many mathematical objects, including symbols and their uses, formulas, equations, theorems, and proofs; systematically labeling them is challenging and remains an unsolved problem. In many fields where there are more specialized objects (such as groups, rings, fields), there are community efforts to index these, but the resulting indexes are typically not machine-readable, reusable, or easily integrated with other tools, and they often lack editorial effort. The issue, then, is how to identify existing lists that are useful and valuable and provide some central guidance for further development and maintenance of such lists. Chapter 2 of this report discusses some of the user features that could advance mathematics research by increasing connections, and Chapter 5 discusses what collections of entity lists could start making these features and this connectivity a reality.

48  CIRCA, “GAP Instructional Material,” January 2003, http://www-circa.mcs.st-and.ac.uk/gapfinite.php.
49  Sage Development Team, “Finite Fields,” http://www.sagemath.org/doc/reference/rings_standard/sage/rings/finite_rings/constructor.html, accessed January 16, 2014.
50  T. Piezas III, A Collection of Algebraic Identities, https://sites.google.com/site/tpiezas/Home/, accessed January 16, 2014.
51  “List of Inequalities,” Wikipedia, last modified November 28, 2013, http://en.wikipedia.org/wiki/List_of_inequalities.
52  Springer, LaTeX Search, http://www.latexsearch.com/, accessed January 16, 2014.

REFERENCES

Abramowitz, M., and I.A. Stegun, eds. 1972. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover Publications, New York.
Aigner, M., and G.M. Ziegler. 2010. Proofs from THE BOOK. 4th edition. Springer-Verlag, Berlin. doi:10.1007/978-3-642-00856-6.
Billey, S.C., and B.E. Tenner. 2013. Fingerprint databases for theorems. Notices of the AMS 60(8):1034-1039.
Borodin, A.N., and P. Salminen. 2002. Handbook of Brownian Motion—Facts and Formulae. 2nd edition. Probability and Its Applications book series. Birkhäuser Verlag, Basel. doi:10.1007/978-3-0348-8163-0.
Borwein, J., and P. Borwein. 1990. A Dictionary of Real Numbers. Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, Calif. doi:10.1007/978-1-4615-8510-7.
Carette, J., and W.M. Farmer. 2009. A review of mathematical knowledge management. Pp. 233-246 in Intelligent Computer Mathematics. Springer.
DasGupta, A. 2008. A collection of inequalities in probability, linear algebra, and analysis. Pp. 633-687 in Springer Texts in Statistics. Springer, New York. doi:10.1007/978-0-387-75971-5_35.
Gowers, T., J. Barrow-Green, and I. Leader, eds. 2008. The Princeton Companion to Mathematics. Princeton University Press, Princeton, N.J.
Gradshteyn, I.S., and I.M. Ryzhik. 2007. Table of Integrals, Series, and Products. 7th edition. Elsevier/Academic Press, Amsterdam. Translated from the Russian. Translation edited and with a preface by A. Jeffrey and D. Zwillinger.

Gruber, T. 2009. Ontology. Encyclopedia of Database Systems (L. Liu and M. Tamer Özsu, eds.). Springer-Verlag. http://tomgruber.org/writing/ontology-definition-2007.htm.
Hijikata, Y., H. Hashimoto, and S. Nishida. 2009. Search mathematical formulas by mathematical formulas. Pp. 404-411 in Lecture Notes in Computer Science. Volume 5617. doi:10.1007/978-3-642-02556-3_46.
International Mathematical Union. 2006. “Digital Mathematics Library: A Vision for the Future.” http://www.mathunion.org/fileadmin/IMU/Report/dml_vision.pdf. Accessed August 20, 2006.
Kohlhase, M., B.A. Matican, and C.-C. Prodescu. 2012. MathWebSearch 0.5: Scaling an open formula search engine. Pp. 342-357 in Lecture Notes in Artificial Intelligence. Volume 7362. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-31374-5.
National Research Council. 2013. The Mathematical Sciences in 2025. The National Academies Press, Washington, D.C.
Petkovsek, M., H. Wilf, and D. Zeilberger. 1996. A = B. A.K. Peters, Ltd., Wellesley, Mass.
Ruddy, D. 2009. The evolving digital mathematics network. Pp. 3-16 in DML 2009 Towards a Digital Mathematics Library Proceedings (P. Sojka, ed.). Conferences on Intelligent Computer Mathematics, CICM 2009, Grand Bend, Ontario, Canada.