5

Technical Details

This chapter discusses some details of entity collections and technical considerations for the Digital Mathematics Library (DML). The lists discussed in this chapter are reasonable and obvious places for the DML to start developing its entity databases, but these may just be the starting point in an entity collection that is likely to evolve over time with the needs and capabilities of the DML. The ultimate goal of these lists is to provide interesting and nontrivial connections between topics, in particular the user features described in Chapter 2. The committee believes this is best accomplished by the DML organization overseeing the simpler entity collections first, which may have the most impact. These early lists can be managed in a straightforward, flexible, and sustainable way. Once this is achieved, the DML may benefit from moving on to more complex structures, such as ordered lists based on importance, relevance, pedagogical value, historical importance, etc., or to lists that can be (partially) ordered using different criteria and hyperlinked.

ENTITY COLLECTION

This section discusses potential object types that the committee believes should be set up early in DML development, with details about location of relevant data sources and technical and political issues in data acquisition. These objects are divided into two categories: mathematical objects and bibliographic entities. Some of these entities are already data rich and can be developed by collaborating with existing databases and services.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





DEVELOPING A 21ST CENTURY MATHEMATICS LIBRARY

Whenever possible, these areas of least resistance should be targeted first to establish a breadth of content within the DML.

Mathematical Objects

The collection and organization of data on mathematical objects should be a high priority of any DML development. The following entities can be pursued and developed individually or jointly, but cross connections should be noted and exploited whenever possible.

MathTopics

MathTopics would be a collection of mathematical subjects, topics, and terms that includes supporting definitions at various levels of formality and that indicates relations between topics derived from graphical analysis of book and journal data. This collection is practical to begin immediately, and some initial sources of information include MSC2010, Wikipedia, and tables of contents of mathematics books. As an application, MathTopics could be used to provide visualizations of the global structure of mathematical fields and their interactions.

Including information from open encyclopedic resources[1] and metadata records of entries in other encyclopedias behind paywalls[2] would be an extremely helpful service of the DML. Encyclopedic information aggregation has been achieved before in limited cases, as in the case of the National Science Digital Library,[3] which indexed MathWorld and PlanetMath together. With the expansion of the linked open data approach, these cross-connections are happening in other domains as well. To use linked open data, interfaces need to allow cross-connections, and once an encyclopedia is available as linked open data, the data provider no longer has to be involved in the process of creating the combined resource. There are also some commercial entities providing metadata as linked open data,[4] so there is a sense that these connections may be possible in the near future.
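As a rough illustration of how relations between topics might be derived from tables of contents, the following Python sketch counts how often two topic terms appear in the same book. The function names and the sample data are hypothetical, and simple co-occurrence counting is only one possible proximity measure, not a method prescribed by the committee.

```python
from collections import Counter
from itertools import combinations


def topic_cooccurrence(documents):
    """Count how many documents mention each unordered pair of topics.

    `documents` is an iterable of topic lists, e.g. the terms found in
    one book's table of contents (a hypothetical input format).
    """
    counts = Counter()
    for topics in documents:
        # Sort and deduplicate so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(topics)), 2):
            counts[(a, b)] += 1
    return counts


def related_topics(counts, topic, min_count=1):
    """List topics that co-occur with `topic`, strongest links first."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and topic in (a, b):
            related.append((b if a == topic else a, n))
    return sorted(related, key=lambda pair: (-pair[1], pair[0]))


# Illustrative data only; not taken from any real catalog.
books = [
    ["groups", "rings", "fields"],
    ["rings", "fields"],
    ["groups", "topology"],
]
edges = topic_cooccurrence(books)
```

The resulting weighted pairs can serve as edges of a topic graph, which is the kind of structure a visualization or navigation tool would consume.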
The DML could provide dedicated search over this collection, with automated disambiguation of author names and superior subject navigation derived from graphical analysis of various forms of proximity between subjects. Google currently indexes this material but does not provide a means of browsing or navigating the material besides a simple search. Other methods of navigation, such as browsing and faceted search and browse, are very appealing and useful if available, but such systems typically do not have ontologies, and the data are not structured.

For topics that do not already have encyclopedia articles, the DML can flag that an article needs creation, and such an article can then be written in any of the available encyclopedia frameworks. Following the DML principle of not unnecessarily replicating data, and especially not unnecessarily replicating complex editorial structures, the DML would likely benefit from not providing its own encyclopedia publication infrastructure that would compete with and undermine the existing open encyclopedias.

[1] Some encyclopedic resources include Wikipedia, Springer Encyclopedia of Mathematics, StatProb, MathWorld, The Princeton Companion to Mathematics, etc.
[2] Encyclopedia of Statistical Sciences would be of particular interest.
[3] National Science Digital Library, http://nsdl.org/, accessed January 16, 2014.
[4] See a list of linked open data encyclopedia data sets (e.g., http://datahub.io/dataset?q=encyclopedia) or search for encyclopedias in particular domains (e.g., http://datahub.io/dataset?q=biology+encyclopedia).

MathSequences

MathSequences would be a collection of mathematical sequences found throughout the literature. This list is already well developed in the Online Encyclopedia of Integer Sequences (OEIS). The DML could offer to help develop systematic hyperlinking of the text of all references, all author names to MathPeople, and all journal names to MathJournals, and systematic conversion of the data to standard machine-readable formats that can be understood by bibliographic and computational services. The OEIS data set, augmented with such enhancements, would be an example of what the DML should strive for in its data structures for other kinds of mathematical objects. Systematically reconstructing the OEIS as computable linked open data does not appear to be a very difficult task. Moreover, the solutions to difficulties encountered in this process should inform the choice of data schema for other similar collections. The main issue for DML involvement in the OEIS appears to be one of negotiating cooperation between the organizations.
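To make the idea of conversion to a machine-readable format concrete, here is a minimal Python sketch that turns a line-tagged sequence entry into a JSON record. It assumes a simplified version of the OEIS-style text layout (lines beginning with tags such as %I, %S, %N, %K); the output field names are hypothetical and are not a schema proposed in this report.

```python
import json

# A shortened, simplified sample entry used only for illustration.
SAMPLE = """\
%I A000045
%S A000045 0,1,1,2,3,5,8,13,21,34
%N A000045 Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1.
%K A000045 nonn,core,nice
"""


def parse_entry(text):
    """Convert one line-tagged entry into a plain dictionary."""
    record = {"id": None, "name": None, "terms": [], "keywords": []}
    for line in text.splitlines():
        if not line.startswith("%"):
            continue
        # Each line is: tag, sequence identifier, then the value (if any).
        parts = line.split(None, 2)
        tag = parts[0]
        if tag == "%I":
            record["id"] = parts[1]
            continue
        value = parts[2] if len(parts) > 2 else ""
        if tag in ("%S", "%T", "%U"):  # lines carrying sequence terms
            record["terms"] += [int(t) for t in value.split(",") if t.strip()]
        elif tag == "%N":  # human-readable name
            record["name"] = value
        elif tag == "%K":  # comma-separated keywords
            record["keywords"] = value.split(",")
    # A resolvable identifier lets other collections link to the entry.
    record["url"] = "https://oeis.org/" + record["id"] if record["id"] else None
    return record


entry = parse_entry(SAMPLE)
entry_json = json.dumps(entry, indent=2)
```

A record in this shape could then be cross-linked to MathPeople or MathJournals by adding identifier fields, which is where most of the schema design effort would likely go.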
MathFunctions

MathFunctions would be a collection of mathematical functions found throughout the literature. This collection can begin immediately, and the National Institute of Standards and Technology Digital Library of Mathematical Functions (DLMF)[5] and the Wolfram Functions site[6] could provide the basis for a well-structured collection of mathematical special functions. This collection could then be added to MathTopics and used to tag components of articles and papers that discuss or apply specific special functions. This collection would likely take considerable time to populate extensively beyond the DLMF and Wolfram capabilities but could provide a wealth of information once reasonably established.

[5] National Institute of Standards and Technology, Digital Library of Mathematical Functions, Version 1.0.6, release date May 6, 2013, http://dlmf.nist.gov/.
[6] Wolfram Research, Inc., The Wolfram Functions Site, http://functions.wolfram.com/, accessed on January 16, 2014.

MathTransforms

MathTransforms would be a searchable and browsable lookup table for classical transforms (e.g., Laplace, Fourier, Mellin) with links to computational resources. This could begin to be developed immediately in cooperation with DLMF and/or the Wolfram Functions site but would likely develop fully over a longer timeline.

It is useful for mathematicians to be able to search or browse a table of transforms for various purposes: for inspiration, to see what is out there, and to see what might be adapted to a problem at hand. Moreover, such a table has the potential to be hyperlinked to the occurrences of its entries in the mathematical literature, which would be a step toward a more fine-grained indexing of the literature. Especially for rarely used functions and transforms, it is potentially rewarding for users to be able to find quickly where the same function or transform might have been used before. Special functions are often kept out of sight in higher mathematical constructions but have applications to other branches of mathematics. Making it easier for users to follow threads of their occurrences across the literature might easily lead to novel discoveries or unexpected relations between research in different branches of mathematics. Examples of such relations include the unexpected applications of Airy kernels and Painlevé transcendents (functions) in random matrix theory, statistical physics, and elsewhere (Tracy and Widom, 2011; Forrester and Witte, 2012).

MathIdentities

MathIdentities would be an organized table of classical combinatorial identities and methods of reduction and proof of such identities.
There has been huge progress in recent years in computer methods for proving classical combinatorial identities, including closed-form summation formulas. This means that a great many simplifications of algebraic sums and proofs of algebraic identities can be done rapidly and with high reliability by machine.[7] For the same reasons as identified above for tables of functions and transforms, it may be instructive for mathematicians to browse through tables of identities and to follow links to applications of identities in the literature. This collection could begin immediately and progress similarly to MathFunctions and MathTransforms.

[7] See Gould (1972); Wolfram|Alpha (http://www.wolframalpha.com/); the work of Christian Krattenthaler (http://www.mat.univie.ac.at/~kratt/, accessed January 16, 2014), including Mathematica packages HYP and HYPQ for the manipulation and identification of binomial and hypergeometric series and identities (C. Krattenthaler, “HYP and HYPQ,” http://www.mat.univie.ac.at/~kratt/hyp_hypq/hyp.html, accessed January 16, 2014); and Gauthier (1999).

MathSymbols

MathSymbols would be a collection of mathematical symbols with commonly accepted special meanings, to be cross-linked as well as possible with MathTopics, and if possible with place of first usage. Within restricted domains, symbols often acquire stable conventional meanings, and sometimes this is true across all of mathematics. Some work has been done on developing a consensus of mathematical notations across cultures (Libbrecht, 2010), and this Notation Census[8] is a meaningful precursor to what the committee envisions. The collection that the committee envisions for the DML is complex and may require multiple steps. As a first step before embarking on a complete index, the DML could partner with resources such as MathSciNet and zbMATH to create a collection of journal article titles that contain any mathematical symbols. This would provide a core set of symbols with authoritative links to the literature. The meaning of those symbols could be established by a small community-sourcing exercise. The symbols could be linked to MathTopics at the collection level, and then the MathNavigator tool could serve these links to MathTopics entries from a reference to any article anywhere in the mathematical literature that has the same symbol in its title. This might be considered a preliminary exploration before attempting a similar but more ambitious undertaking for formulas or equations.

MathFormulas

MathFormulas would be a collection of mathematical formulas and their variations, initially those of special interest and importance.
This collection could assist in supervised machine learning processes for the creation and maintenance of a larger body of formulas and equations. This is a long-term collection goal, and DLMF, Wolfram, and Springer would be desired partners, especially for the data in Springer’s LaTeX Search. This is an ambitious list to attempt to collect, given the serious challenges posed by superficial variations in the way any given expression might be written (as discussed in Chapter 3). Still, initial progress has been made by several teams of researchers, and the DML could provide a nexus for further research, a forum for tracking advances in this field, and eventually some attempt to create and maintain an authoritative list of at least those formulas considered interesting or important enough to be recognized and assigned an HTTP URL. Further open efforts at both supervised learning relative to these exemplars and unsupervised learning similar to the Springer effort, with linking to the literature, should also be attempted, motivated by applications to formula search as indicated in Chapter 3.

[8] “Notation Census Manifest,” last edited March 9, 2013, http://wiki.math-bridge.org/display/ntns/Notation-Census-Manifest.

MathMedia

MathMedia would be a collection of images, photos, videos, and presentations (or links to such) relating to mathematics. Video clips from conferences and presentations, visualizations of results and simulations, and portraits of mathematicians who contributed to the research field could all be included in the DML and systematically integrated with the mathematics literature. Widespread collection of media entities could begin immediately and would likely continue to evolve.

Many mathematics conferences are already filming and posting speakers’ presentations, and it would be opportunistic for the DML to arrange for these data to be indexed and sorted based on known information such as the title of the presentation, author(s) and presenter(s), date, name of conference and/or section, etc. Other information on the contents of the presentation, which may be more difficult to automatically categorize, can be tagged by community sourcing. In terms of mathematician portraits, there are several collections of images of mathematicians that may be of interest, such as the Oberwolfach Photo Collection,[9] Portraits of Statisticians,[10] Microsoft Academic Search Profiles,[11] Halmos (Beery and Mead, 2012), and Pólya (Alexanderson, 1987).

Bibliographic Entities

The following bibliographic data collection entities are a needed element of the DML, and their collection can begin quickly, and largely be completed, since much of the information is already available elsewhere through existing information resources.
These entities can be viewed as part of the necessary infrastructure of the DML and are key areas for developing partnerships (as discussed in Chapter 3). However, the collection and development of these entities are not meaningful on their own and should only be pursued as part of a larger DML development.

[9] Oberwolfach Photo Collection, http://owpdb.mfo.de/, accessed January 16, 2014.
[10] Portraits and Articles from Biographical Dictionaries, revised July 10, 2013, http://www.york.ac.uk/depts/maths/histstat/people/.
[11] Microsoft Academic Search, “Overview,” http://academic.research.microsoft.com/About/Help.htm, accessed January 16, 2014.

MathPeople

MathPeople would be a lean authority file for mathematical people with links to and selected data from homepages, Wikipedia, MacTutor, Math Genealogy, zbMath Open Author Profiles, Celebratio.org, MathSciNet, and so on. There was an effort by the International Mathematical Union in 2005 to build a Federated World Directory of Mathematicians,[12] but it was abandoned due to copyright and privacy concerns and inadequate federated search technology. More recently, zbMATH Author Profiles and data in Microsoft Academic Search’s Top Authors in Mathematics offer machine access to approximate authority records for about half a million authors in mathematics and related fields, with no apparent legal restriction on further processing of the data. It would be a straightforward application of machine processing and community input to deduplicate these lists, sync them also with the Virtual International Authority Files of all mathematicians, and thereby obtain a combined DML index of all mathematicians, both living and deceased, who have ever published a book or article in mathematics. This data set would include additional information about the mathematicians’ fields, their collaborators, and their numbers of publications. This would then provide a graphical data set with about half a million nodes for authors and editors, and some fraction of that number of nodes for books they wrote and edited, and a few thousand subject nodes. This could be used very quickly as a test bed for application of modern graphical visualization methods to provide subject navigation, and otherwise as a major framework for organizing other facets of DML information.
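The machine-processing half of the deduplication step described above might start from something like the following Python sketch, which groups author records from several sources under a normalized name key. The record layout, source labels, and identifiers are hypothetical, and matching on surname plus first initial is a deliberately crude assumption: it will merge distinct people with similar names, which is exactly where community input and additional evidence (coauthors, fields, publications) would be needed.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce 'Last, First' or 'First Last' to (surname, first initial).

    Assumes a non-empty name; accents are stripped so that 'Pólya'
    and 'Polya' produce the same key.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    if "," in ascii_name:
        last, first = [part.strip() for part in ascii_name.split(",", 1)]
    else:
        parts = ascii_name.split()
        last, first = parts[-1], " ".join(parts[:-1])
    return (last.lower(), first[:1].lower())


def merge_sources(*sources):
    """Group records from (source_name, records) pairs by name key."""
    merged = defaultdict(lambda: {"names": set(), "ids": {}})
    for source_name, records in sources:
        for rec in records:
            entry = merged[name_key(rec["name"])]
            entry["names"].add(rec["name"])
            # Keep every identifier, since one source may hold duplicates.
            entry["ids"].setdefault(source_name, []).append(rec["id"])
    return dict(merged)


# Illustrative records only; the identifiers are made up.
source_a = [{"name": "Pólya, George", "id": "a-001"}]
source_b = [
    {"name": "George Polya", "id": "b-917"},
    {"name": "Paul Halmos", "id": "b-204"},
]
index = merge_sources(("source_a", source_a), ("source_b", source_b))
```

Each merged entry keeps all name variants and source identifiers, so a later human or automated review can split an entry that was merged in error without losing provenance.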
MathSciNet has a high-quality representation of the collaboration graph for mathematical articles, obtained through many years of manual curation of book and article metadata records, and MathSciNet offers a glimpse into this proprietary data set with its computation of minimal distance paths through the collaboration graph from one author to another. These collaborator connections are helpful and allow users to see if an author’s collaborators are working in relevant areas, but they do not provide links to other relevant data. With access to similar information in addition to the other data that the DML is proposing to collect (such as theorems, research areas, homepages), this information becomes significantly more integrated into a larger picture of the mathematics research community.

With suitable graphical visualization, MathPeople could provide users with a sense of the “geography” of mathematics, how the subfields of mathematics are related to each other through the collaborations of authors, and how this structure has evolved over time.

[12] International Mathematical Union, “Personal Homepages and the World Directory of Mathematicians,” http://www.mathunion.org/MPH-EWDM/, last updated December 13, 2012.
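The minimal-distance computation mentioned above for the collaboration graph can be illustrated with a breadth-first search. This is a generic sketch in Python, not a description of how MathSciNet implements its feature; the input format (one author list per paper) and the function names are assumptions made for the example.

```python
from collections import deque


def build_graph(papers):
    """Build an undirected coauthorship graph from lists of authors."""
    graph = {}
    for authors in papers:
        for a in authors:
            # Every author is a node; coauthors on the same paper are neighbors.
            graph.setdefault(a, set()).update(b for b in authors if b != a)
    return graph


def collaboration_path(graph, start, goal):
    """Return a shortest chain of coauthors from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in graph[current]:
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None


# Placeholder author labels, for illustration only.
papers = [["A", "B"], ["B", "C"], ["D"]]
graph = build_graph(papers)
```

The collaboration distance between two authors is then the length of the returned path minus one. Because breadth-first search visits authors in order of distance, the first path found is guaranteed to be a shortest one.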

MathHomepages

MathHomepages would be a table linked to MathPeople, but with indications of depth of content (e.g., curriculum vitae, photo, bibliography). From a user perspective, this may appear to be a simple variation of MathPeople; however, a person can have more than one homepage, each of which may contain references and connections to subjects and collaborators. It would be beneficial for there to be separate tables for homepages and for people and for these to be cross-linked by a general, extensible data architecture, such that the cross-links are easily maintainable and correctible. This is not trivial, and it is illustrative of the maintenance problem for Web-based data. Much of this data could be mined from sources such as the Microsoft Academic Search API and some subject-specific collections on the Web (e.g., for number theory, probability), and easily completed and maintained by Web-spidering, community input, and self-registration. While people stay the same, their homepages and affiliations may change. The relation between people and their homepages could be treated as a simple case of a dynamically changing data set, and methods and interfaces could be developed to respect this aspect. This information would be useful to mathematics researchers because it can help find people with common names and can be useful to the larger DML because it helps with interlinking other data.

MathJournals

MathJournals would be a deduplicated and cleaned list of serials in mathematics, past and present, with indications of online availability and subject coverage.
Most of this data currently exists and is maintained openly by a number of agents (such as Ulf Rehmann, MathSciNet, zbMATH, the Online Computer Library Center).[13] There are several lists of math journals in various places, many of them accessible and reusable, but none of these lists provides easy access to the features that researchers would like, including the following:

• Links to journal homepages whenever they exist;
• Information about the number of articles published and subject areas covered;
• Copyright and rights information for authors; and
• Simple search over the list.

[13] See also UlrichsWeb.com for a proprietary solution across all fields.

The entire math journals list is only a few thousand entries, but the number of readily available attributes of a journal is potentially large and, in principle, unlimited. Some desired capabilities for the journals list that will require some initial work and maintenance are the following:

• Graphical displays (e.g., nodes with size proportional to various journal metrics and locations reflective of their subject coverage, linking to MathTopics, discussed above) that could easily be derived;
• Display of journals by defined metrics (e.g., in cooperation with eigenfactor.org[14]), which uses recently developed methods of network analysis and information theory to evaluate the influence of scholarly journals and for mapping the structure of academic research; and
• Access to identities of all authors who ever published in a journal with links to MathPeople.

These are typical functionalities that the standard abstracting and indexing services could provide but currently do not offer. Aggregating and displaying this information would give users a quick overview of the whole field of mathematics from the point of view of its journal coverage, and graphical relations derived from such information could feed into tools for navigation of mathematical information. While no such navigation tools are currently available, they could easily be built over a MathJournals list, especially if cross-linked to MathPeople (e.g., “authors who published in this journal also published in these other journals”).

MathBooks

MathBooks would be a list of mathematical books at all levels, from elementary to advanced, with links to and selected data from publishers. Some of these data already exist through services such as MathSciNet, zbMATH, OCLC, OpenLibrary, and Ulf Rehmann, but this bibliographic entity is less developed than the previous three discussed in this section.
A plethora of openly accessible metadata about books in all fields has been released in the past few years by academic libraries and library cooperatives.[15] Considering just books in mathematics and related fields, the information in these data releases swamps what is currently available in MathSciNet and zbMATH both in quantity of titles and depth of information about each title.

[14] Eigenfactor, http://www.eigenfactor.org/, accessed January 16, 2014.
[15] Most notable are the British Library release of millions of catalog records in 2010 (British Library, 2010) and the OCLC recommendation to use Open Data Commons Attribution License (ODC-BY) for WorldCat data in August 2012 (OCLC, 2012).

A large number of elementary mathematics books in these releases are not indexed at all by MathSciNet and zbMATH, but they may be of value to students and teachers of mathematics. There is potential to index this collection in ways that would provide novel recommendation and discovery services over mathematics book data for students and teachers as well as researchers and those who apply mathematics in other fields. The process of indexing and cleaning these data, and providing enhanced discovery services over them, should be a fairly routine application of machine learning methods, which could be done as a standalone project and which would provide a first test of DML machine learning capabilities. The general methods involved would not be domain-specific, and they could be applied also to other non-math domain-specific collections. However, mathematics is special in that it already has a well-developed subject ontology for the field, the MSC2010. Cross-linking of the library books data with subject tags from either MathSciNet or zbMATH, and with author identities from MathPeople and the Virtual International Authority File,[16] should aid readers in navigating the universe of mathematical concepts by reference to the statistics of its book data. The DML could also use these data to suggest key textbooks and research texts for specific subjects or theorems.

MathBibliographies

MathBibliographies would be a collection of bibliographies of various topics in mathematics, including personal and subject bibliographies. Initial sources for these data include Celebratio Mathematica,[17] IMS Scientific Legacy,[18] other subject bibliographies, and bibliographies from books contributed by participating publishers. This collection could be cross-linked to MathPeople and MathTopics.
The structure of aggregated collections of such bibliographies could then inform search and navigation services, just as reference lists of articles do already. The key functionality for users is to make it easy for them to select, annotate, and tag bibliographic items. MathSciNet’s MRLookup tool already provides a useful open interface for acquisition of modest-sized bibliographies from data in MathSciNet. Similar data are readily available from Microsoft Academic or Google Scholar, but there is not yet any tool comparable to MRLookup for acquiring data from those sources, and neither is there any good tool for aggregation and deduplication of data from multiple sources, as would typically be necessary to develop the bibliography of any topic where mathematics reaches into other domains.

[16] VIAF: The Virtual International Authority File, http://viaf.org/, accessed January 16, 2014.
[17] Celebratio Mathematica, http://celebratio.org/, accessed January 16, 2014.
[18] IMS Scientific Legacy is a collection of bibliographic information about recipients of awards by the Institute of Mathematical Statistics (http://imstat.org/) and is currently under development in collaboration between IMS and Mathematical Sciences Publishers (http://msp.org/).

MathArticles

MathArticles would be a collection of metadata of journal articles in various topics in mathematics. Some initial sources for these data include MathSciNet, arXiv, Web of Science, Google Scholar, and Microsoft Academic, among others. There would be considerable connections with the other bibliographic entities proposed in this section.

TECHNICAL CONSIDERATIONS

This section lists a number of technical considerations that the committee believes will influence the development of the DML and its information management structures. Some key issues discussed are managing diverse data formats, incorporating math-aware tools and services, appropriately dealing with authority control, and managing client-side versus server-side software. None of these discussions are intended to be overly prescriptive, but to raise issues that the committee feels are very important.

Data Formats

For annotation and sharing of data it is necessary to have a format that fulfills certain requirements as follows:

1. Easy to use and ideally human readable;
2. Can be implemented into any recording, analysis, or management tool;
3. Open and freely available;
4. Inherently extensible and flexible, because science continually changes; and
5. More or less unrestricted; that is, it should not restrict the user or strictly require entries.

At some points, format conventions have to be introduced. This is the process of schema modeling and introduction, which is by now fairly well understood. It is essential to clearly separate format from content. Documentation about formats can be maintained along with the data model, and a place to record and maintain property definitions can be included. For any given list of objects, the expected internal structure of those objects and their expected relations with other objects define an ontology. There are many tools available for creating and maintaining ontologies (as discussed in Chapter 1).

Essentially the same metadata structure can be used for metadata of all kinds of objects, including documents, people, organizations, subjects, or mathematical concepts. The schema for the object is type dependent, with some sub-typing within major types like documents.[19] To the greatest extent possible, existing or adapted schemas can be used. But for mathematical concepts in particular, development of adequate schemas will be a slower process, informed by the success of partners such as Wolfram and OEIS with experience in handling such data and the experience of numerous others in development of math-aware tools and services.

Math-Aware Tools and Services

There currently exist math-aware tools and services that can competently manage mathematical syntax and formatting. Such tools and services are essential for tasks such as conversion between formats that are different mathematically and semantic parsing of mathematical documents. However, many current resources do not functionally handle mathematical notation and syntax, and this limits how the mathematical community can use these resources. Significant interest in better utilizing math-aware tools and services is apparent in the series of Conferences on Intelligent Computer Mathematics.[20] The following is from the announcement of their digital mathematics library conference track, chaired by Petr Sojka:

  Track objective is to provide a forum for development of math-aware technologies, standards, algorithms and formats towards fulfillment of the dream of global digital mathematical library (DML[21]). Computer scientists (D) and librarians of digital age (L) are especially welcome to join mathematicians (M) and discuss many aspects of DML preparation.
Track topics are all topics of mathematical knowledge management and digital libraries applicable in the context of DML building—processing of math knowledge expressed in scientific papers in natural languages, namely: • Math-aware text mining (math mining) and MSC classification; • Math-aware representations of mathematical knowledge; 19  The basic framework for most document types is already provided by the BibTeX ontol- ogy, and easily implementable in JSON as BibJSON or in XML according to some variant of the NLM DTD (http://dtd.nlm.nih.gov/), which is currently used by the EuDML for document records. 20  Conferences on Intelligent Computer Mathematics, last modified July 10, 2013, http:// www.cicm-conference.org/2013/cicm.php. 21  Please note that the DML discussed in this quotation is distinct from the DML vision laid in this report.

• Math-aware computational linguistics and corpora;
• Math-aware tools for [meta]data and full-text processing;
• Math-aware OCR and document analysis;
• Math-aware information retrieval;
• Math-aware indexing and search;
• Authoring languages and tools;
• MathML, OpenMath, TeX and other mathematical content standards;
• Web interfaces for DML content;
• Mathematics on the Web, math crawling and indexing;
• Math-aware document processing workflows;
• Archives of written mathematics;
• DML management, business models;
• DML rights handling, funding, sustainability; and
• DML content acquisition, validation and curation.

DML track is an opportunity to share experience and best practices between projects in many areas (MKM, NLP, OCR, IR, DL, pattern recognition, etc.) that could change the paradigm for searching, accessing, and interacting with the mathematical corpus.[22]

Integrating math-aware tools and services developed by diverse partners may be challenging but would benefit the DML. One math-aware standard of particular interest to DML developments proposed in this report is that of OpenMath,[23] which is an extensible standard for representing the semantics of mathematical objects. The OpenMath website describes its objective as follows:

OpenMath is an emerging standard for representing mathematical objects with their semantics, allowing them to be exchanged between computer programs, stored in databases, or published on the worldwide Web. While the original designers were mainly developers of [...]

Mathematical objects encoded in OpenMath can be

• Displayed in a browser,
• Exchanged between software systems,
• Cut and pasted for use in different contexts,
• Verified as being mathematically sound (or not!), and
• Used to make interactive documents really interactive.
OpenMath is highly relevant for persons working with mathematics on computers, for those working with large documents (e.g., databases, manuals) containing mathematical expressions, and for technical and mathematical publishing. The worldwide OpenMath activities are coordinated within the OpenMath Society, based in Helsinki, Finland. It is coordinated by an executive committee, elected by its members. It organizes regular workshops and hosts a number of electronic discussion lists. The Society brings together tool builders, software suppliers, publishers and authors.

[22] Conferences on Intelligent Computer Mathematics, “Track B: DML,” last modified March 4, 2013, http://www.cicm-conference.org/2013/cicm.php?event=dml.
[23] OpenMath, http://www.openmath.org/, accessed January 16, 2014.

This standard and the community that has developed around it should contribute to development of the DML.

Authority Control

The committee expects continuing advances in authority control[24] over entities and the provision of adequate human-computer interfaces for the semi-automated curation of large digital collections. Some customization of these tools will be necessary to apply them to mathematical objects. However, once the tools are built and the editorial workflows established, these tools and workflows should be largely replicable across multiple distributed nodes in the network of bibliographic data stores contributing to the DML.

The problems of identification and deduplication of conventional bibliographic records are by now largely solved. Solutions and workflows developed by other organizations, such as OCLC and ORCID, should be adopted to the extent that these organizations are willing to share them. In mathematics, existing automated tools such as MRef and MRLookup[25] return similar matches to queries from traditional bibliographic reference data. This enables machine enhancement of reference lists by matching into the MathSciNet database. However, the universe of mathematics information resources of interest to the DML is not limited to traditionally published items alone. Neither ORCID nor MRef is comprehensive in providing identifiers for all mathematicians.
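As a rough illustration of the kind of reference matching described above, the following Python sketch normalizes a free-form citation string and compares it with a small local catalog using the standard library's difflib module. It is not the method used by MRef or MRLookup, whose internals are not described in this report; the catalog, its record identifiers, the function names, and the 0.6 threshold are all assumptions made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", without_accents.lower())
    return " ".join(cleaned.split())


def best_match(query, catalog, threshold=0.6):
    """Return (record_id, score) for the closest catalog entry, or None.

    The threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    normalized_query = normalize(query)
    scored = [
        (record_id,
         SequenceMatcher(None, normalized_query, normalize(reference)).ratio())
        for record_id, reference in catalog.items()
    ]
    record_id, score = max(scored, key=lambda pair: pair[1])
    return (record_id, score) if score >= threshold else None


# Hypothetical local catalog; the identifiers are made up for illustration.
CATALOG = {
    "rec-1": "Tracy, C.A., and H. Widom. 2011. "
             "Painlevé functions in statistical physics.",
    "rec-2": "Gould, H.W. 1972. Combinatorial Identities.",
}
```

Calling best_match("C. Tracy and H. Widom, Painleve functions in statistical physics (2011)", CATALOG) selects rec-1 even though the fields are reordered and the accent is missing. A production system would likely compare individual fields and use statistical models rather than rely on a single string-similarity score.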
The problem of identification and deduplication of various mathematical entities remains a research problem on which more effort will need to be expended before the fullest potential of DML navigation can be realized. Like searching for articles, exploring the citation graph in the DML will need to deal with the “identity problem,” that is, the problem of deciding that two citations are actually to the same article, although the names of authors can be different (e.g., initial instead of full first name), the journal names can be altered (e.g., abbreviations or misspellings), the ordering of terms can be changed, and so on. Another aspect of this problem is determining to what degree lightweight authorities (e.g., MathPeople, mentioned above) can be relied on as supplements to more traditional authorities. It is interesting to note that Google Scholar and Microsoft Academic Search deal with this problem reasonably well by using a statistical modeling approach rather than the more in-depth approach of writing down all possible transformations and then unraveling those transformations.

[24] In library science, authority control is a process that organizes bibliographic information by using a single, distinct name for each topic.
[25] American Mathematical Society, MRLookup, http://www.ams.org/mrlookup, accessed January 16, 2014.

Client-Side Software

The DML would likely benefit from using a combination of client-side software and Web services to provide its content to users. Client-side software can be thought of as a computer application, such as a Web browser, that runs on a user’s local workstation and connects to a server as necessary. If part of the DML were run client-side, a user would download a DML application that would carry out much of the data processing on the user’s machine, thereby lessening the server load on the DML. However, it is not always clear what resources are available on the user’s machine, and users may not like the DML application using their machine’s potentially limited storage and processing ability. Another concern is DML security; if too much of the DML data and processing is pushed client-side, it may become an easy target for unintended manipulation. To balance the security and processing load concerns, the DML may benefit from moving much of the processing layer client-side while keeping the data layer server-side (or accessible only as a Web service that cannot be easily manipulated).
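The split suggested above can be sketched in Python as follows. This is only a toy model under stated assumptions: the DataLayer class stands in for a server-side Web service, count_by_year stands in for processing done on the user's machine, and all names and sample records are hypothetical.

```python
from types import MappingProxyType


class DataLayer:
    """Stand-in for the server side: owns the records, exposes read-only views."""

    def __init__(self, records):
        # Copy on the way in so callers cannot mutate the store afterwards.
        self._records = {rid: dict(fields) for rid, fields in records.items()}

    def fetch(self, record_id):
        # Hand out a read-only view over a copy; attempted writes raise TypeError.
        return MappingProxyType(dict(self._records[record_id]))


def count_by_year(data_layer, record_ids):
    """Client-side processing: aggregate the fetched records locally."""
    counts = {}
    for record_id in record_ids:
        year = data_layer.fetch(record_id)["year"]
        counts[year] = counts.get(year, 0) + 1
    return counts


# Hypothetical sample records for illustration only.
SERVICE = DataLayer({
    "a1": {"title": "Sample article A", "year": 2011},
    "a2": {"title": "Sample article B", "year": 2011},
    "a3": {"title": "Sample article C", "year": 2012},
})
```

Here the aggregation runs entirely on the client, while the record store can only be read through fetch. In a real deployment the data layer would sit behind a network interface with its own access controls, which this sketch does not model.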
There are a number of services that use a mix of client-side software and Web services to provide enhanced document navigation capabilities, some of which may serve as an example of how to set up the DML:

• BibSonomy[26] (very open data and services, great scrapers for acquiring bibliographic metadata from publisher sites),
• CiteULike[27] (could easily go the way of Mendeley, which had a partnership with Springer at one time but has since stalled),
• Connotea,[28]
• Delicious,[29]
• JabRef[30] (desktop bibliography manager, syncs with BibSonomy),
• Mendeley,[31]
• MindMaps,[32]
• Papers,[33]
• Scholarometer,[34] and
• Zotero[35] (open source, but focused on the humanities).

[26] BibSonomy, http://www.bibsonomy.org/, accessed January 16, 2014.
[27] CiteULike, http://www.citeulike.org/, accessed January 16, 2014.
[28] Nature Publishing Group, Connotea, http://www.connotea.org/, discontinued service on March 12, 2013.
[29] Delicious, https://delicious.com/, accessed January 16, 2014.
[30] JabRef, last updated October 29, 2013, http://jabref.sourceforge.net/.
[31] Mendeley, http://www.mendeley.com/, accessed January 16, 2014.
[32] For general concept, see “Mind Map,” Wikipedia, last modified January 14, 2014, http://en.wikipedia.org/wiki/Mind_map. For a specific software implementation, see Docear: The Academic Literature Suite (http://www.docear.org/). Docear (pronounced “dog-ear”) is a free and open academic literature suite that integrates tools to search, organize, and create academic literature into a single application. Docear works seamlessly with many existing tools like Mendeley, Microsoft Word, and Foxit Reader.
[33] Mekentosj B.V., Papers 3, http://www.papersapp.com, accessed January 16, 2014.
[34] Indiana University, Scholarometer, http://scholarometer.indiana.edu/, accessed January 16, 2014.
[35] Zotero, http://www.zotero.org/, accessed January 16, 2014.

REFERENCES

Alexanderson, G.L., ed. 1987. The Pólya Picture Album: Encounters of a Mathematician. Birkhäuser Basel. http://www.springer.com/birkhauser/history+of+science/book/978-1-4612-5376-1.

Beery, J., and C. Mead. 2012. Who’s that mathematician? Images from the Paul R. Halmos Photograph Collection. Loci (January), doi:10.4169/loci003801.

British Library. 2010. British Library to Share Millions of Catalogue Records. Press release. August 23. http://pressandpolicy.bl.uk/Press-Releases/British-Library-to-share-millions-of-catalogue-records-43b.aspx.

Forrester, P.J., and N.S. Witte. 2012. Painleve II in Random Matrix Theory and Related Fields. http://arxiv.org/abs/1210.3381.

Gauthier, B. 1999. HYPERG: A Maple package for manipulating hypergeometric series. Séminaire Lotharingien de Combinatoire 43:S43a.

Gould, H.W. 1972. Combinatorial Identities. University of West Virginia, Morgantown.

Libbrecht, P. 2010. Notations around the world: Census and exploitation. Intelligent Computer Mathematics 6167:398-410.

OCLC (Online Computer Library Center). 2012. OCLC Recommends Open Data Commons Attribution License (ODC-BY) for WorldCat Data. News release. August 6. http://www.oclc.org/news/releases/2012/201248.en.html.

Tracy, C.A., and H. Widom. 2011. Painlevé functions in statistical physics. Publications for the Research Institute of the Mathematical Sciences 47(1):361-374.