OCR for page 91

5
Technical Details
This chapter discusses some details of entity collections and technical considerations for the Digital Mathematics Library (DML). The lists discussed in this chapter are reasonable and obvious places for the DML to start developing its entity databases, but these may just be the starting point in an entity collection that is likely to evolve over time with the needs and capabilities of the DML. The ultimate goal of these lists is to provide interesting and nontrivial connections between topics, in particular the user features described in Chapter 2. The committee believes this is best accomplished by the DML organization overseeing the simpler entity collections first, which may have the most impact. These early lists can be managed in a straightforward, flexible, and sustainable way. Once this is achieved, the DML may benefit from moving on to more complex structures, such as ordered lists based on importance, relevance, pedagogical value, historical importance, etc., or to lists that can be (partially) ordered using different criteria and hyperlinked.
ENTITY COLLECTION
This section discusses potential object types that the committee believes should be set up early in DML development, with details about location of relevant data sources and technical and political issues in data acquisition. These objects are divided into two categories: mathematical objects and bibliographic entities. Some of these entities are already data rich and can be developed by collaborating with existing databases and services.
92 DEVELOPING A 21ST CENTURY MATHEMATICS LIBRARY
Whenever possible, these areas of least resistance should be targeted first to establish a breadth of content within the DML.
Mathematical Objects
The collection and organization of data on mathematical objects should be a high priority of any DML development. The following entities can be pursued and developed individually or jointly, but cross connections should be noted and exploited whenever possible.
MathTopics
MathTopics would be a collection of mathematical subjects, topics, and terms that includes supporting definitions at various levels of formality and that indicates relations between topics derived from graphical analysis of book and journal data. This collection is practical to begin immediately, and some initial sources of information include MSC2010, Wikipedia, and tables of contents of mathematics books. As an application, MathTopics could be used to provide visualizations of the global structure of mathematical fields and their interactions.
Including information from open encyclopedic resources¹ and metadata records of entries in other encyclopedias behind paywalls² would be an extremely helpful service of the DML.
Encyclopedic information aggregation has been achieved before in limited cases, as in the case of the National Science Digital Library,³ which indexed MathWorld and PlanetMath together. With the expansion of the linked open data approach, these cross-connections are happening in other domains as well. To use linked open data, interfaces need to allow cross-connections, and once an encyclopedia is available as linked open data, the data provider no longer has to be involved in the process of creating the combined resource. There are also some commercial entities providing metadata as linked open data,⁴ so there is a sense that these connections may be possible in the near future.
The DML could provide dedicated search over this collection, with automated disambiguation of author names and superior subject navigation derived from graphical analysis of various forms of proximity between subjects. Google currently indexes this material but does not provide a means of browsing or navigating the material besides a simple search. Other methods of navigation, such as faceted search and browsing, are very appealing and useful if available, but such systems typically do not have ontologies, and the data are not structured.
1 Some encyclopedic resources include Wikipedia, Springer Encyclopedia of Mathematics, StatProb, MathWorld, The Princeton Companion to Mathematics, etc.
2 Encyclopedia of Statistical Sciences would be of particular interest.
3 National Science Digital Library, http://nsdl.org/, accessed January 16, 2014.
4 See a list of linked open data encyclopedia data sets (e.g., http://datahub.io/dataset?q=encyclopedia) or search for encyclopedias in particular domains (e.g., http://datahub.io/dataset?q=biology+encyclopedia).
For topics that do not already have encyclopedia articles, the DML can flag that an article needs creation, and such an article can then be written in any of the available encyclopedia frameworks. Following the DML principle of not unnecessarily replicating data, and especially not unnecessarily replicating complex editorial structures, the DML would likely benefit from not providing its own encyclopedia publication infrastructure that would compete with and undermine the existing open encyclopedias.
MathSequences
MathSequences would be a collection of mathematical sequences found throughout the literature. This list is already well developed in the Online Encyclopedia of Integer Sequences (OEIS). The DML could offer to help develop systematic hyperlinking of the text of all references, all author names to MathPeople, and all journal names to MathJournals, and systematic conversion of the data to standard machine-readable formats that can be understood by bibliographic and computational services. The OEIS data set, augmented with such enhancements, would be an example of what the DML should strive for in its data structures for other kinds of mathematical objects. Systematically reconstructing the OEIS as computable linked open data does not appear to be a very difficult task. Moreover, the solutions to difficulties encountered in this process should inform the choice of data schema for other similar collections. The main issue for DML involvement in the OEIS appears to be one of negotiating cooperation between the organizations.
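As an illustration of what a machine-readable sequence record might look like, the following is a minimal sketch only. The field names, the "MathSequence" type label, and the mathpeople_id slot are hypothetical and are not drawn from the OEIS or from any DML specification; only the OEIS entry number A000045 (the Fibonacci numbers) and its URL pattern are real. The author name is a placeholder.

```python
import json


def sequence_record(oeis_id, name, terms, authors):
    """Build a JSON-serializable record for one integer sequence.

    The schema here is hypothetical; it only illustrates linking a
    sequence to a stable URL and to (not yet resolved) MathPeople entries.
    """
    return {
        "id": f"https://oeis.org/{oeis_id}",
        "type": "MathSequence",  # hypothetical type label
        "name": name,
        "terms": list(terms),
        # mathpeople_id is a hypothetical link target, left unresolved here
        "authors": [{"name": a, "mathpeople_id": None} for a in authors],
    }


record = sequence_record(
    "A000045",
    "Fibonacci numbers",
    [0, 1, 1, 2, 3, 5, 8, 13],
    ["Example Author"],  # placeholder, not the actual OEIS attribution
)
print(json.dumps(record, indent=2))
```

Keeping the record as plain nested dictionaries and lists means it can be serialized to JSON directly and extended with new fields without breaking existing consumers.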
MathFunctions
MathFunctions would be a collection of mathematical functions found throughout the literature. This collection can begin immediately, and the National Institute of Standards and Technology Digital Library of Mathematical Functions (DLMF)⁵ and the Wolfram Functions site⁶ could provide the basis for a well-structured collection of mathematical special functions. This collection could then be added to MathTopics and used to tag components of articles and papers that discuss or apply specific special functions. This collection would likely take considerable time to populate extensively beyond the DLMF and Wolfram capabilities but could provide a wealth of information once reasonably established.
5 National Institute of Standards and Technology, Digital Library of Mathematical Functions, Version 1.0.6, release date May 6, 2013, http://dlmf.nist.gov/.
6 Wolfram Research, Inc., The Wolfram Functions Site, http://functions.wolfram.com/, accessed on January 16, 2014.
MathTransforms
MathTransforms would be a searchable and browsable lookup table for classical transforms (e.g., Laplace, Fourier, Mellin) with links to computational resources. This could begin to be developed immediately in cooperation with DLMF and/or the Wolfram Functions site but would likely develop fully over a longer timeline. It is useful for mathematicians to be able to search or browse a table of transforms for various purposes: for inspiration, to see what is out there, and to see what might be adapted to a problem at hand. Moreover, such a table has the potential to be hyperlinked to the occurrences of its entries in the mathematical literature, which would be a step toward a more fine-grained indexing of the literature. Especially for rarely used functions and transforms, it is potentially rewarding for users to be able to find quickly where the same function or transform might have been used before. Special functions are often kept out of sight in higher mathematical constructions but have applications to other branches of mathematics. Making it easier for users to follow threads of their occurrences across the literature might easily lead to novel discoveries or unexpected relations between research in different branches of mathematics. Examples of such relations include the unexpected applications of Airy kernels and Painlevé transcendents (functions) in random matrix theory, statistical physics, and elsewhere (Tracy and Widom, 2011; Forrester and Witte, 2012).
MathIdentities
MathIdentities would be an organized table of classical combinatorial identities and methods of reduction and proof of such identities. There has been huge progress in recent years in computer methods for proving classical combinatorial identities, including closed-form summation formulas. This means that a great many simplifications of algebraic sums and proofs of algebraic identities can be done rapidly and with high reliability by machine.⁷ For the same reasons as identified above for tables of functions and transforms, it may be instructive for mathematicians to browse through tables of identities and to follow links to applications of identities in the literature. This collection could begin immediately and progress similarly to MathFunctions and MathTransforms.
7 See Gould (1972); Wolfram|Alpha (http://www.wolframalpha.com/); the work of Christian Krattenthaler (http://www.mat.univie.ac.at/~kratt/, accessed January 16, 2014), including Mathematica packages HYP and HYPQ for the manipulation and identification of binomial and hypergeometric series and identities (C. Krattenthaler, “HYP and HYPQ,” http://www.mat.univie.ac.at/~kratt/hyp_hypq/hyp.html, accessed January 16, 2014); and Gauthier (1999).
MathSymbols
MathSymbols would be a collection of mathematical symbols with commonly accepted special meanings, to be cross-linked as well as possible with MathTopics, and if possible with place of first usage. Within restricted domains, symbols often acquire stable conventional meanings, and sometimes this is true across all of mathematics. Some work has been done on developing a consensus of mathematical notations across cultures (Libbrecht, 2010), and this Notation Census⁸ is a meaningful precursor to what the committee envisions. The collection that the committee envisions for the DML is complex and may require multiple steps. As a first step before embarking on a complete index, the DML could partner with resources such as MathSciNet and zbMATH to create a collection of journal article titles that contain any mathematical symbols. This would provide a core set of symbols with authoritative links to the literature. The meaning of those symbols could be established by a small community-sourcing exercise. The symbols could be linked to MathTopics at the collection level, and then a MathNavigator tool could serve these links to MathTopics entries from a reference to any article anywhere in the mathematical literature that has the same symbol in its title. This might be considered a preliminary exploration before attempting a similar but more ambitious undertaking for formulas or equations.
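The first step described above, selecting article titles that contain mathematical symbols, could be prototyped with a simple pattern match. The sketch below is only a heuristic under stated assumptions: it treats a title as containing a symbol if it has TeX inline math, a TeX command, or a character in the Unicode Greek or Mathematical Operators blocks. The sample titles are invented for illustration and do not come from MathSciNet or zbMATH.

```python
import re

# Assumed heuristic: TeX inline math ($...$), a TeX command such as \zeta,
# or a character in the Greek (U+0370-03FF) or Mathematical Operators
# (U+2200-22FF) Unicode blocks counts as a mathematical symbol.
MATH_PATTERN = re.compile(r"\$[^$]+\$|\\[A-Za-z]+|[\u0370-\u03FF\u2200-\u22FF]")


def symbols_in_title(title):
    """Return the symbol-like fragments found in a title, in order."""
    return MATH_PATTERN.findall(title)


# Invented sample titles, for illustration only.
titles = [
    "On the zeros of $\\zeta(s)$ near the critical line",
    "A survey of random matrix theory",
    "Γ-convergence for beginners",
]

# Map each symbol-bearing title to the fragments it contains.
symbol_index = {t: symbols_in_title(t) for t in titles if symbols_in_title(t)}
for title, fragments in symbol_index.items():
    print(title, "->", fragments)
```

A real pipeline would still need the community-sourcing step to assign meanings, since a pattern match can find a symbol but cannot say what it denotes.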
MathFormulas
MathFormulas would be a collection of mathematical formulas and their variations, initially those of special interest and importance. This collection could assist in supervised machine learning processes for the creation and maintenance of a larger body of formulas and equations. This is a long-term collection goal and DLMF, Wolfram, and Springer would be desired partners, especially the data in Springer’s LaTeX Search. This is an ambitious list to attempt to collect because of the serious challenges posed by superficial variations in the way any given expression might be written (as discussed in Chapter 3). Still, initial progress has been made by several teams of researchers, and the DML could provide a nexus for further research, a forum for tracking advances in this field, and eventually some attempt to create and maintain an authoritative list of at least those formulas considered interesting or important enough to be recognized and assigned an HTTP URL. Further open efforts at both supervised learning relative to these exemplars and unsupervised learning similar to the Springer effort, with linking to the literature, should also be attempted, motivated by applications to formula search as indicated in Chapter 3.
8 “Notation Census Manifest,” last edited March 9, 2013, http://wiki.math-bridge.org/display/ntns/Notation-Census-Manifest.
MathMedia
MathMedia would be a collection of images, photos, videos, and presentations (or links to such) relating to mathematics. Video clips from conferences and presentations, visualizations of results and simulations, and portraits of mathematicians who contributed to the research field could all be included in the DML and systematically integrated with the mathematics literature. Widespread collection of media entities could begin immediately and would likely continue to evolve. Many mathematics conferences are already filming and posting speakers’ presentations, and it would be opportunistic for the DML to arrange for these data to be indexed and sorted based on known information such as the title of the presentation, author(s) and presenter(s), date, name of conference and/or section, etc. Other information on the contents of the presentation, which may be more difficult to automatically categorize, can be tagged by community sourcing. In terms of mathematician portraits, there are several image collections that may be of interest, such as the Oberwolfach Photo Collection,⁹ Portraits of Statisticians,¹⁰ Microsoft Academic Search Profiles,¹¹ Halmos (Beery and Mead, 2012), and Pólya (Alexanderson, 1987).
Bibliographic Entities
The following bibliographic data collection entities are a needed element of the DML, and their collection can begin quickly, and largely be completed, since much of the information is already available elsewhere through existing information resources. These entities can be viewed as part of the necessary infrastructure of the DML and are key areas for developing partnerships (as discussed in Chapter 3). However, the collection and development of these entities are not meaningful on their own and should only be pursued as part of a larger DML development.
9 Oberwolfach Photo Collection, http://owpdb.mfo.de/, accessed January 16, 2014.
10 Portraits and Articles from Biographical Dictionaries, revised July 10, 2013, http://www.york.ac.uk/depts/maths/histstat/people/.
11 Microsoft Academic Search, “Overview,” http://academic.research.microsoft.com/About/Help.htm, accessed January 16, 2014.
MathPeople
MathPeople would be a lean authority file for mathematical people with links to and selected data from homepages, Wikipedia, MacTutor, Math Genealogy, zbMATH Open Author Profiles, Celebratio.org, MathSciNet, and so on. There was an effort by the International Mathematical Union in 2005 to build a Federated World Directory of Mathematicians,¹² but it was abandoned due to copyright and privacy concerns and inadequate federated search technology. More recently, zbMATH Author Profiles and data in Microsoft Academic Search’s Top Authors in Mathematics offer machine access to approximate authority records for about half a million authors in mathematics and related fields, with no apparent legal restriction on further processing of the data. It would be a straightforward application of machine processing and community input to deduplicate these lists, sync them also with the Virtual International Authority Files of all mathematicians, and thereby obtain a combined DML index of all mathematicians, both living and deceased, who have ever published a book or article in mathematics. This data set would include additional information about the mathematicians’ fields, their collaborators, and their numbers of publications. This would then provide a graphical data set with about half a million nodes for authors and editors, and some fraction of that number of nodes for books they wrote and edited, and a few thousand subject nodes. This could be used very quickly as a test bed for application of modern graphical visualization methods to provide subject navigation, and otherwise as a major framework for organizing other facets of DML information.
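The deduplication step might begin with a crude matching key that groups candidate duplicates for later review. The following is a minimal sketch, not a description of how zbMATH, Microsoft Academic Search, or VIAF records are actually structured: the record layout, source names, and identifiers are hypothetical, and the key (accent-folded surname plus first initial) is an assumption chosen for brevity. It will both over-merge and under-merge, which is why community input remains part of the process.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Crude matching key: accent-folded surname plus first initial.

    Accepts "Surname, Given" or "Given Surname"; assumes a non-empty name.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c)).lower()
    if "," in folded:
        surname, given = [part.strip() for part in folded.split(",", 1)]
    else:
        parts = folded.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname, given[:1])


def cluster_authors(*sources):
    """Group records from several (source_name, records) pairs by name key."""
    clusters = defaultdict(list)
    for source_name, records in sources:
        for record in records:
            clusters[name_key(record["name"])].append((source_name, record["id"]))
    return dict(clusters)


# Hypothetical sample records from two hypothetical sources.
source_a = ("source_a", [{"name": "Pólya, George", "id": "a1"},
                         {"name": "Erdős, Paul", "id": "a2"}])
source_b = ("source_b", [{"name": "George Polya", "id": "b7"}])

clusters = cluster_authors(source_a, source_b)
for key, members in clusters.items():
    print(key, members)
```

Clusters with more than one member are candidates for merging, not confirmed matches; two different people who share a surname and initial would land in the same cluster.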
MathSciNet has a high-quality representation of the collaboration graph for mathematical articles, obtained through many years of manual curation of book and article metadata records, and MathSciNet offers a glimpse into this proprietary data set with its computation of minimal distance paths through the collaboration graph from one author to another. These collaborator connections are helpful and allow users to see if an author’s collaborators are working in relevant areas, but they do not provide links to other relevant data. If the DML has access to similar information in addition to the other data that it is proposing to collect (such as theorems, research areas, homepages), this information becomes significantly more integrated into a larger picture of the mathematics research community.
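A minimal-distance path through a collaboration graph can be found with breadth-first search. The sketch below shows the general technique on a toy graph; it is not MathSciNet's implementation, and the author labels and adjacency data are invented.

```python
from collections import deque


def collaboration_path(coauthors, start, goal):
    """Return one shortest path of authors from start to goal, or None.

    coauthors maps each author to the set of authors they have written with.
    """
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in coauthors.get(current, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = current
            if neighbor == goal:
                # Walk back through the predecessor links to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Toy, invented collaboration graph.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),
}

print(collaboration_path(graph, "A", "D"))
print(collaboration_path(graph, "A", "E"))
```

Breadth-first search visits each author and each collaboration edge at most once, so it remains practical on a graph with about half a million author nodes.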
With suitable graphical visualization, MathPeople could provide users with a sense of the “geography” of mathematics, how the subfields of mathematics are related to each other through the collaborations of authors, and how this structure has evolved over time.
12 International Mathematical Union, “Personal Homepages and the World Dictionary of Mathematicians,” http://www.mathunion.org/MPH-EWDM/, last updated December 13, 2012.
MathHomepages
MathHomepages would be a table linked to MathPeople, but with indications of depth of content (e.g., curriculum vitae, photo, bibliography). From a user perspective, this may appear to be a simple variation of MathPeople; however, a person can have more than one homepage, each of which may contain references and connections to subjects and collaborators. It would be beneficial for there to be separate tables for homepages and for people and for these to be cross-linked by a general, extensible data architecture, such that the cross-links are easily maintainable and correctible. This is not trivial, and it is illustrative of the maintenance problem for Web-based data. Much of this data could be mined from sources such as the Microsoft Academic Search API and some subject-specific collections on the Web (e.g., for number theory, probability), and could easily be completed and maintained by Web-spidering, community input, and self-registration. While people stay the same, their homepages and affiliations may change. The relation between people and their homepages could be treated as a simple case of a dynamically changing data set, and methods and interfaces could be developed to respect this aspect.
This information would be useful to mathematics researchers because it can help find people with common names and can be useful to the larger DML because it helps with interlinking other data.
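The separation of people, homepages, and the links between them can be sketched as three small record types. This is only one possible layout under stated assumptions: the class names, field names, identifiers, and example URLs are hypothetical, and the link record carries dates so that a changed homepage is retired rather than overwritten.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    person_id: str
    name: str


@dataclass
class Homepage:
    url: str
    # Indications of depth of content.
    has_cv: bool = False
    has_photo: bool = False
    has_bibliography: bool = False


@dataclass
class PersonHomepageLink:
    """Cross-link kept in its own table so it can be corrected independently."""
    person_id: str
    url: str
    first_seen: date
    last_verified: date
    active: bool = True


def current_homepages(links, person_id):
    """Return the URLs currently linked to a person."""
    return [link.url for link in links
            if link.person_id == person_id and link.active]


# Hypothetical example: one person, an old page and a current page.
alice = Person("p1", "Alice Example")
pages = [Homepage("https://example.org/~alice", has_cv=True, has_photo=True),
         Homepage("https://example.edu/alice", has_bibliography=True)]
links = [
    PersonHomepageLink("p1", "https://example.org/~alice",
                       date(2012, 1, 1), date(2014, 1, 16)),
    PersonHomepageLink("p1", "https://example.edu/alice",
                       date(2009, 1, 1), date(2011, 6, 1), active=False),
]

print(current_homepages(links, alice.person_id))
```

Because the link is a separate record, a spider or a self-registering user can mark it inactive or add a new one without touching the person or homepage entries.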
MathJournals
MathJournals would be a deduplicated and cleaned list of serials in mathematics, past and present, with indications of online availability and subject coverage. Most of this data currently exists and is maintained openly by a number of agents (such as Ulf Rehmann, MathSciNet, zbMATH, the Online Computer Library Center).¹³ There are several lists of math journals in various places, many of them accessible and reusable, but none of these lists provides easy access to the features that researchers would like, including the following:
• Links to journal homepages whenever they exist;
• Information about the number of articles published and subject areas covered;
• Copyright and rights information for authors; and
• Simple search over the list.
13 See also UlrichsWeb.com for a proprietary solution across all fields.
The entire math journals list is only a few thousand entries, but the number of readily available attributes of a journal is potentially large and, in principle, unlimited. Some desired capabilities for the journals list that will require some initial work and maintenance are the following:
• Graphical displays (e.g., nodes with size proportional to various journal metrics and locations reflective of their subject coverage, linking to MathTopics above) that could easily be derived;
• Display of journals by defined metrics (e.g., in cooperation with eigenfactor.org¹⁴), which uses recently developed methods of network analysis and information theory to evaluate the influence of scholarly journals and to map the structure of academic research; and
• Access to identities of all authors who ever published in a journal with links to MathPeople.
These are typical functionalities that the standard abstracting and indexing services could provide but currently do not offer. Aggregating and displaying this information would give users a quick overview of the whole field of mathematics from the point of view of its journal coverage, and graphical relations derived from such information could feed into tools for navigation of mathematical information. While no such navigation tools are currently available, they could easily be built over a MathJournals list, especially if cross-linked to MathPeople (e.g., “authors who published in this journal also published in these other journals”).
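The “authors who published in this journal also published in these other journals” relation can be derived from a plain list of (author, journal) pairs. The following is a minimal sketch with invented identifiers; the input layout is an assumption, not a format used by any existing service.

```python
from collections import Counter


def related_journals(publications, journal):
    """Rank other journals by how many authors they share with `journal`.

    publications is an iterable of (author_id, journal_id) pairs.
    """
    journals_by_author = {}
    for author, published_in in publications:
        journals_by_author.setdefault(author, set()).add(published_in)

    shared = Counter()
    for journals in journals_by_author.values():
        if journal in journals:
            # Count each other journal once per shared author.
            shared.update(journals - {journal})
    return shared.most_common()


# Invented sample data: author identifiers and journal identifiers.
publications = [
    ("a1", "J1"), ("a1", "J2"),
    ("a2", "J1"), ("a2", "J2"), ("a2", "J3"),
    ("a3", "J3"),
]

print(related_journals(publications, "J1"))
```

Counting shared authors rather than shared articles keeps a single prolific author from dominating the ranking; other weightings are equally possible.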
MathBooks
MathBooks would be a list of mathematical books at all levels, from elementary to advanced, with links to and selected data from publishers. Some of these data already exist through services such as MathSciNet, zbMATH, OCLC, OpenLibrary, and Ulf Rehmann, but this bibliographic entity is less developed than the previous three discussed in this section. A plethora of openly accessible metadata about books in all fields has been released in the past few years by academic libraries and library cooperatives.¹⁵ Considering just books in mathematics and related fields, the information in these data releases swamps what is currently available in MathSciNet and zbMATH both in quantity of titles and depth of information about each title.
14 Eigenfactor, http://www.eigenfactor.org/, accessed January 16, 2014.
15 Most notable are the British Library release of millions of catalog records in 2010 (British Library, 2010) and the OCLC recommendation to use Open Data Commons Attribution License (ODC-BY) for WorldCat data in August 2012 (OCLC, 2012).
A large number of elementary mathematics books in these releases are not indexed at all by MathSciNet and zbMATH, but they may be of value to students and teachers of mathematics. There is potential to index this collection in ways that would provide novel recommendation and discovery services over mathematics book data for students and teachers as well as researchers and those who apply mathematics in other fields. The process of indexing and cleaning these data, and providing enhanced discovery services over them, should be a fairly routine application of machine learning methods, which could be done as a standalone project and which would provide a first test of DML machine learning capabilities. The general methods involved would not be domain-specific, and they could be applied also to other non-math domain-specific collections. However, mathematics is special in that it already has a well-developed subject ontology for the field, the MSC2010. Cross-linking of the library books data with subject tags from either MathSciNet or zbMATH, and with author identities from MathPeople and the Virtual International Authority File,¹⁶ should aid readers in navigating the universe of mathematical concepts by reference to the statistics of its book data. The DML could also use these data to suggest key textbooks and research texts for specific subjects or theorems.
MathBibliographies
MathBibliographies would be a collection of bibliographies of various topics in mathematics, including personal and subject bibliographies. Initial sources for these data include Celebratio Mathematica,¹⁷ IMS Scientific Legacy,¹⁸ other subject bibliographies, and bibliographies from books contributed by participating publishers. This collection could be cross-linked to MathPeople and MathTopics. The structure of aggregated collections of such bibliographies could then inform search and navigation services, just as reference lists of articles do already. The key functionality for users is to make it easy for them to select, annotate, and tag bibliographic items. MathSciNet’s MRLookup tool already provides a useful open interface for acquisition of modest-sized bibliographies from data in MathSciNet. Similar data are readily available from Microsoft Academic or Google Scholar, but there is not yet any tool comparable to MRLookup for acquiring data from those sources, and neither is there any good tool for aggregation and deduplication of data from multiple sources, as would typically be necessary to develop the bibliography of any topic where mathematics reaches into other domains.
16 VIAF: The Virtual International Authority File, http://viaf.org/, accessed January 16, 2014.
17 Celebratio Mathematica, http://celebratio.org/, accessed January 16, 2014.
18 IMS Scientific Legacy is a collection of bibliographic information about recipients of awards by the Institute of Mathematical Statistics (http://imstat.org/) and is currently under development in collaboration between IMS and Mathematical Sciences Publishers (http://msp.org/).
MathArticles
MathArticles would be a collection of metadata of journal articles in various topics in mathematics. Some initial sources for these data include MathSciNet, arXiv, Web of Science, Google Scholar, and Microsoft Academic, among others. There would be considerable connections with the other bibliographic entities proposed in this section.
TECHNICAL CONSIDERATIONS
This section lists a number of technical considerations that the committee believes will influence the development of the DML and its information management structures. Some key issues discussed are managing diverse data formats, incorporating math-aware tools and services, appropriately dealing with authority control, and managing client-side versus server-side software. These discussions are not intended to be overly prescriptive but to raise issues that the committee feels are very important.
Data Formats
For annotation and sharing of data it is necessary to have a format that
fulfills certain requirements as follows:
1. Easy to use and ideally human readable;
2. Can be implemented into any recording, analysis, or management tool;
3. Open and freely available;
4. Inherently extensible and flexible, because science continually changes; and
5. More or less unrestricted; that is, it should not restrict the user or strictly require entries.
At some points, format conventions have to be introduced. This is the process of schema modeling and introduction, which is by now fairly well understood. It is essential to clearly separate format from content. Documentation about formats can be maintained along with the data model, and a place to record and maintain property definitions can be included. For any given list of objects, the expected internal structure of those objects and their expected relations with other objects define an ontology. There are many tools available for creating and maintaining ontologies (as discussed in Chapter 1).
Essentially the same metadata structure can be used for metadata of all kinds of objects, including documents, people, organizations, subjects, or mathematical concepts. The schema for the object is type dependent, with some sub-typing within major types like documents.¹⁹ To the greatest extent possible, existing or adapted schemas can be used. But for mathematical concepts in particular, development of adequate schemas will be a slower process, informed by the success of partners such as Wolfram and OEIS with experience in handling such data and the experience of numerous others in development of math-aware tools and services.
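To illustrate a format that is human readable, extensible, and does not strictly require entries, the following sketch converts BibTeX-style fields into a nested, JSON-serializable record in the spirit of BibJSON. It is a simplified illustration rather than a conforming BibJSON implementation: the function name is hypothetical, the handling of authors and journals is an assumption, and the sample entry is a placeholder rather than a real publication.

```python
import json


def bibtex_fields_to_record(entry_type, key, fields):
    """Convert BibTeX-style fields to a nested, JSON-serializable record.

    Unknown fields are passed through unchanged, so the format stays
    extensible and no field is strictly required.
    """
    record = {"type": entry_type, "id": key}
    for name, value in fields.items():
        if name == "author":
            # BibTeX separates multiple authors with " and ".
            record["author"] = [{"name": a.strip()} for a in value.split(" and ")]
        elif name == "journal":
            record["journal"] = {"name": value}
        else:
            record[name] = value
    return record


# Placeholder entry, not a real publication.
record = bibtex_fields_to_record(
    "article",
    "author2014example",
    {
        "author": "A. Author and B. Author",
        "title": "An Example Article",
        "journal": "Journal of Examples",
        "year": "2014",
        "msc": "00A99",  # an extra field, passed through untouched
    },
)
print(json.dumps(record, indent=2))
```

Representing authors and journals as nested objects leaves room to attach identifiers later, for example links into MathPeople or MathJournals, without changing the shape of existing records.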
Math-Aware Tools and Services
There currently exist math-aware tools and services that can competently manage mathematical syntax and formatting. Such tools and services are essential for tasks such as conversion between formats that are different mathematically and semantic parsing of mathematical documents. However, many current resources do not functionally handle mathematical notation and syntax, and this limits how the mathematical community can use these resources. Significant interest in better utilizing math-aware tools and services is apparent in the series of Conferences on Intelligent Computer Mathematics.²⁰ The following is from the announcement of their digital mathematics library conference track, chaired by Petr Sojka:
Track objective is to provide a forum for development of math-aware
technologies, standards, algorithms and formats towards fulfillment of
the dream of global digital mathematical library (DML21). Computer sci-
entists (D) and librarians of digital age (L) are especially welcome to join
mathematicians (M) and discuss many aspects of DML preparation. Track
topics are all topics of mathematical knowledge management and digital
libraries applicable in the context of DML building—processing of math
knowledge expressed in scientific papers in natural languages, namely:
• Math-aware text mining (math mining) and MSC classification;
• Math-aware representations of mathematical knowledge;
• Math-aware computational linguistics and corpora;
• Math-aware tools for [meta]data and full-text processing;
• Math-aware OCR and document analysis;
• Math-aware information retrieval;
• Math-aware indexing and search;
• Authoring languages and tools;
• MathML, OpenMath, TeX and other mathematical content standards;
• Web interfaces for DML content;
• Mathematics on the Web, math crawling and indexing;
• Math-aware document processing workflows;
• Archives of written mathematics;
• DML management, business models;
• DML rights handling, funding, sustainability; and
• DML content acquisition, validation and curation.
DML track is an opportunity to share experience and best practices between
projects in many areas (MKM, NLP, OCR, IR, DL, pattern recognition,
etc.) that could change the paradigm for searching, accessing, and
interacting with the mathematical corpus.22
19 The basic framework for most document types is already provided by the BibTeX
ontology, and is easily implementable in JSON as BibJSON or in XML according to some
variant of the NLM DTD (http://dtd.nlm.nih.gov/), which is currently used by the EuDML
for document records.
20 Conferences on Intelligent Computer Mathematics, last modified July 10, 2013,
http://www.cicm-conference.org/2013/cicm.php.
21 Please note that the DML discussed in this quotation is distinct from the DML vision
laid out in this report.
Integrating math-aware tools and services developed by diverse partners
may be challenging but would benefit the DML. One math-aware standard
of particular interest to DML developments proposed in this report is
that of OpenMath,23 which is an extensible standard for representing the
semantics of mathematical objects. The OpenMath website describes its
objective as follows:
OpenMath is an emerging standard for representing mathematical objects
with their semantics, allowing them to be exchanged between computer
programs, stored in databases, or published on the worldwide Web. While
the original designers were mainly developers of [...] Mathematical objects
encoded in OpenMath can be
• Displayed in a browser,
• Exchanged between software systems,
• Cut and pasted for use in different contexts,
• Verified as being mathematically sound (or not!), and
• Used to make interactive documents really interactive.
OpenMath is highly relevant for persons working with mathematics on
computers, for those working with large documents (e.g., databases, manuals)
containing mathematical expressions, and for technical and mathematical
publishing. The worldwide OpenMath activities are coordinated within
the OpenMath Society, based in Helsinki, Finland. It is coordinated by an
executive committee, elected by its members. It organizes regular workshops
and hosts a number of electronic discussion lists. The Society brings
together tool builders, software suppliers, publishers and authors.
22 Conferences on Intelligent Computer Mathematics, “Track B: DML,” last modified
March 4, 2013, http://www.cicm-conference.org/2013/cicm.php?event=dml.
23 OpenMath, http://www.openmath.org/, accessed January 16, 2014.
This standard and the community that has developed around it should
contribute to development of the DML.
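As a rough illustration of what "representing mathematical objects with their semantics" looks like in practice, the sketch below builds the expression sin(x) in the OpenMath XML encoding using only the Python standard library. The element names (`OMOBJ`, `OMA`, `OMS`, `OMV`), the namespace, and the `transc1` content dictionary come from the OpenMath standard; the helper functions `om_symbol`, `om_variable`, `om_apply`, and `om_object` are hypothetical names introduced here, and this is not an official OpenMath library.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenMath XML encoding.
OM_NS = "http://www.openmath.org/OpenMath"


def om_symbol(cd, name):
    """A symbol whose meaning is defined in the content dictionary `cd`."""
    return ET.Element("OMS", {"cd": cd, "name": name})


def om_variable(name):
    """A variable, identified only by its name."""
    return ET.Element("OMV", {"name": name})


def om_apply(head, *args):
    """Application of `head` (usually a symbol) to the given arguments."""
    node = ET.Element("OMA")
    node.append(head)
    node.extend(args)
    return node


def om_object(body):
    """Wrap an expression in the top-level OMOBJ element."""
    root = ET.Element("OMOBJ", {"xmlns": OM_NS})
    root.append(body)
    return root


# sin(x): the symbol "sin" from the "transc1" content dictionary applied to x.
expr = om_object(om_apply(om_symbol("transc1", "sin"), om_variable("x")))
xml_text = ET.tostring(expr, encoding="unicode")
print(xml_text)
```

Because the symbol carries a pointer to a content dictionary rather than just a printed name, a receiving program can tell that this is the sine function and not merely the letters "sin", which is the property that makes such objects exchangeable between software systems.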
Authority Control
The committee expects continuing advances in authority control24 over
entities and the provision of adequate human-computer interfaces for the
semi-automated curation of large digital collections. Some customization of
these tools will be necessary to apply them to mathematical objects. How-
ever, once the tools are built and the editorial workflows established, these
tools and workflows should be largely replicable across multiple distributed
nodes in the network of bibliographic data stores contributing to the DML.
The problems of identification and deduplication of conventional bib-
liographic records are by now largely solved. Solutions and workflows
developed by other organizations, such as OCLC and ORCID, should be
adopted to the extent that these organizations are willing to share them.
In mathematics, existing automated tools such as MRef and MRLookup25
return similar matches to queries from traditional bibliographic reference
data. This enables machine enhancement of reference lists by matching
into the MathSciNet database. However, the universe of mathematics in-
formation resources of interest to the DML is not limited to traditionally
published items alone. Neither ORCID nor MRef is comprehensive in
providing identifiers for all mathematicians.
24 In library science, authority control is a process that organizes bibliographic information
by using a single, distinct name for each topic.
25 American Mathematical Society, MRLookup, http://www.ams.org/mrlookup, accessed
January 16, 2014.
The identification and deduplication of the various kinds of mathematical
entities remains a research problem on which more effort will
need to be expended before the fullest potential of DML navigation can
be realized. Like searching for articles, exploring the citation graph in the
DML will need to deal with the “identity problem”: deciding that two
citations refer to the same article even though the names of authors can
differ (e.g., initial instead of full first name), the journal names can be
altered (e.g., abbreviations or misspellings), the ordering of terms can be
changed, and so on. Another aspect of this problem is determining to what
degree lightweight authorities (e.g., MathPeople, mentioned above) can be
relied on as supplements to more traditional authorities. It is interesting
to note that Google Scholar and Microsoft Academic Search deal with this
problem reasonably well by using a statistical modeling approach rather
than the more in-depth approach of writing down all possible
transformations and then unraveling those transformations.
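The identity problem can be made concrete with a small sketch. The code below is a toy version of the "write down the transformations" approach (normalize accents, case, punctuation, and author initials, then compare), combined with a simple string-similarity score to tolerate misspellings. It is not how Google Scholar, Microsoft Academic Search, or MRef work; the record layout, the function names, and the 0.85 threshold are all assumptions made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def author_key(name):
    """Reduce 'Surname, Given' to the surname plus the first initial."""
    surname, _, given = name.partition(",")
    return normalize(surname) + " " + normalize(given)[:1]


def same_article(c1, c2, threshold=0.85):
    """Guess whether two citation records refer to the same article."""
    if c1["year"] != c2["year"]:
        return False
    if author_key(c1["author"]) != author_key(c2["author"]):
        return False
    score = SequenceMatcher(
        None, normalize(c1["title"]), normalize(c2["title"])
    ).ratio()
    return score >= threshold


a = {"author": "Tracy, C.A.", "year": 2011,
     "title": "Painlevé functions in statistical physics"}
b = {"author": "Tracy, C", "year": 2011,
     "title": "Painleve Functions in Statistical Physics."}
c = {"author": "Tracy, C.A.", "year": 2011,
     "title": "Painleve functions in statistcal physics"}  # misspelled title
d = {"author": "Forrester, P.J.", "year": 2012,
     "title": "Painleve II in Random Matrix Theory and Related Fields"}

print(same_article(a, b))  # True
print(same_article(a, c))  # True
print(same_article(a, d))  # False
```

Hand-written rules like these are easy to read but brittle, since every new kind of variation needs a new rule; that brittleness is one reason a statistical approach can be attractive at scale.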
Client-Side Software
The DML would likely benefit from using a combination of client-side
software and Web services to provide its content to users. Client-side software
can be thought of as a computer application, such as a Web browser,
that runs on a user’s local workstation and connects to a server as necessary.
If part of the DML were run client-side, a user would download a
DML application that would carry out much of the data processing on the
user’s machine, thereby lessening the load on the DML servers. However, it
is not always clear what resources are available on the user’s machine, and
users may not like the DML application using their machine’s potentially
limited storage and processing ability. Another concern is DML security:
if too much of the DML data and processing is pushed client-side, it may
become an easy target for unintended manipulation. To balance the security
and processing load concerns, the DML may benefit from moving much of
the processing layer client-side while keeping the data layer server-side (or
accessible only as a Web service that cannot be easily manipulated).
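The split between a protected data layer and a client-side processing layer can be sketched in a few lines. The example below runs in a single process and only imitates the division of labor; in a real deployment the `DataLayer` class would sit behind a Web service. All class, function, and record names are hypothetical, and the three sample records are taken from this chapter's reference list.

```python
from types import MappingProxyType


class DataLayer:
    """Stands in for the server side: holds the records and allows reads only."""

    def __init__(self, records):
        # Copy each record so later changes by the caller do not leak in.
        self._records = {r["id"]: dict(r) for r in records}

    def list_ids(self):
        return sorted(self._records)

    def fetch(self, record_id):
        # A read-only view: callers cannot modify the stored record through it.
        return MappingProxyType(self._records[record_id])


def client_search(data_layer, keyword):
    """Stands in for the client side: filtering and sorting are done locally."""
    keyword = keyword.lower()
    hits = []
    for record_id in data_layer.list_ids():
        record = data_layer.fetch(record_id)
        if keyword in record["title"].lower():
            hits.append(record)
    return sorted(hits, key=lambda r: r["year"])


server = DataLayer([
    {"id": "gould1972", "year": 1972, "title": "Combinatorial Identities"},
    {"id": "tw2011", "year": 2011,
     "title": "Painlevé functions in statistical physics"},
    {"id": "fw2012", "year": 2012,
     "title": "Painleve II in Random Matrix Theory and Related Fields"},
])

results = client_search(server, "painlev")
print([r["id"] for r in results])  # ['tw2011', 'fw2012']
```

The point of the design is that the client can do as much searching and ranking as its resources allow, while any attempt to write through a fetched record fails, so the authoritative data stays under server-side control.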
There are a number of services that use a mix of client-side software
and Web services to provide enhanced document navigation capabilities,
some of which may serve as an example of how to set up the DML:
• BibSonomy26 (very open data and services, great scrapers for ac-
quiring bibliographic metadata from publisher sites),
• CiteULike27 (could easily go the way of Mendeley, which had a
partnership with Springer at one time but has since stalled),
• Connotea,28
• Delicious,29
• JabRef30 (desktop bibliography manager, syncs with BibSonomy),
• Mendeley,31
• MindMaps,32
• Papers,33
• Scholarometer,34 and
• Zotero35 (open source, but focused on the humanities).
26 BibSonomy, http://www.bibsonomy.org/, accessed January 16, 2014.
27 CiteULike, http://www.citeulike.org/, accessed January 16, 2014.
28 Nature Publishing Group, Connotea, http://www.connotea.org/, discontinued service on
March 12, 2013.
29 Delicious, https://delicious.com/, accessed January 16, 2014.
30 JabRef, last updated October 29, 2013, http://jabref.sourceforge.net/.
31 Mendeley, http://www.mendeley.com/, accessed January 16, 2014.
REFERENCES
Alexanderson, G.L., ed. 1987. The Pólya Picture Album: Encounters of a Mathematician.
Birkhäuser Basel. http://www.springer.com/birkhauser/history+of+science/book/978-1-4612-5376-1.
Beery, J., and C. Mead. 2012. Who’s that mathematician? Images from the Paul R. Halmos
Photograph Collection. Loci (January), doi:10.4169/loci003801.
British Library. 2010. British Library to Share Millions of Catalogue Records. Press release.
August 23. http://pressandpolicy.bl.uk/Press-Releases/British-Library-to-share-millions-
of-catalogue-records-43b.aspx.
Forrester, P.J., and N.S. Witte. 2012. Painleve II in Random Matrix Theory and Related Fields.
http://arxiv.org/abs/1210.3381.
Gauthier, B. 1999. HYPERG: A Maple package for manipulating hypergeometric series.
Séminaire Lotharingien de Combinatoire: 43:S43a.
Gould, H.W. 1972. Combinatorial Identities. University of West Virginia, Morgantown.
Libbrecht, P. 2010. Notations around the world: Census and exploitation. Intelligent Com-
puter Mathematics 6167:398-410.
OCLC (Online Computer Library Center). 2012. OCLC Recommends Open Data Commons
Attribution License (ODC-BY) for WorldCat Data. News release. August 6. http://www.
oclc.org/news/releases/2012/201248.en.html.
Tracy, C.A., and H. Widom. 2011. Painlevé functions in statistical physics. Publications for
the Research Institute of the Mathematical Sciences 47(1):361-374.
32 For general concept, see “Mind Map,” Wikipedia, last modified January 14, 2014, http://
en.wikipedia.org/wiki/Mind_map. For a specific software implementation, see Docear—The
Academic Literature Suite (http://www.docear.org/). Docear (pronounced “dog-ear”) is a free
and open academic literature suite that integrates tools to search, organize, and create aca-
demic literature into a single application. Docear works seamlessly with many existing tools
like Mendeley, Microsoft Word, and Foxit Reader.
33 Mekentosj B.V., Papers 3, http://www.papersapp.com, accessed January 16, 2014.
34 Indiana University, Scholarometer, http://scholarometer.indiana.edu/, accessed January 16,
2014.
35 Zotero, http://www.zotero.org/, accessed January 16, 2014.