The ability of the proposed Digital Mathematics Library (DML) to foster and nurture a wide range of partnerships will be key to engendering and supporting the kinds of functionality and services envisioned.
There are both social and technical challenges to establishing such partnerships. Historically, many of the most important ontologies, taxonomies, and other knowledge organization systems and services in mathematics started as the research project of an individual mathematician or a small handful of mathematicians working in close collaboration.
Over time, the resources required for the individual researcher to maintain and grow these projects online can be unsustainable. Or, researchers reaching the end of their career find that they need to transfer responsibility for the knowledge organization systems they have built to someone else. Whereas in past generations it was often considered sufficient in such situations to simply instantiate a current snapshot of a mathematics ontology or taxonomy in static print form, today there is a need to sustain such knowledge organization systems online so that they can interact with the literature as it continues to be produced and so that the knowledge organization systems can themselves continue to grow and be enhanced over time. These practical realities provide ready-made incentives for researchers to partner with a community-based entity like the DML.
Because the DML will be new compared with existing society and commercial publishers, it will have to demonstrate that it is a worthy and stable long-term home for such research output. Firmly situating the DML within the established research mathematics community, demonstrating its commitment to openness, and adhering to technological best practices and established standards could help make the DML a new, but natural, home for this material.
The DML must be open to a range of partnerships of various degrees with mathematics researchers. In some cases, the coupling might be quite loose, with a researcher continuing to maintain and develop their knowledge organization service while it is simply used and leveraged by the DML in pursuit of the DML’s broader mission. In other cases, the collaboration might be quite close, with the DML taking over (after a transition period) from the individual researcher the ongoing responsibility for an ontology or taxonomy. In the former case, adherence by both parties to a suitable application programming interface (API) will be essential.1 In the latter case, the ability to map and automatically transform the original serialization of an ontology or taxonomy into a community-standard knowledge management serialization or encoding will be crucial.
For example, today the DML might reasonably adopt as one of its standards the Simple Knowledge Organization System (SKOS)2 specification, an established, well-regarded standard maintained by the World Wide Web Consortium for representing thesauri, classification schemes, subject heading lists, and taxonomies within the framework of the Semantic Web. However, few mathematicians starting out on a project to create a taxonomy of mathematical information objects pertinent to their research interests will begin by serializing their taxonomies in SKOS. Additionally, there are viable alternatives to SKOS, for example, ISO 25964, the international standard for thesauri and interoperability with other vocabularies,3 or the metadata authority description schema (MADS4) promulgated by the Library of Congress.
1 Currently, the Representational State Transfer (or RESTful) model of Web services enjoys broad consensus for this kind of scenario (see Pautasso et al., 2008). However, the correct approach will vary with time as technology and standards evolve.
3 National Information Standards Organization, “ISO 25964: The international standard for thesauri and interoperability with other vocabularies,” http://www.niso.org/schemas/iso25964/, accessed January 16, 2014.
Cognizant of the variety of ways to serialize ontologies and taxonomies and of the realities of how idiosyncratic serialization schemes adopted by individual researchers can be, it will be incumbent on the DML to work with partners to develop mappings and automated tools for transforming ontologies from one standard to another and from an idiosyncratic serialization into a standard like SKOS that is more suitable for long-term sustainability and growth. The DML will also have the challenge of doing such transformations in ways that do not foreclose on the opportunity for the partners involved (and others) to continue to help develop and refine their ontology or taxonomy.
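As a minimal sketch of such an automated transformation, the Python fragment below walks a taxonomy held as a nested dictionary (a stand-in for one researcher's idiosyncratic format) and emits SKOS concepts in Turtle syntax. The namespace `http://example.org/taxonomy/`, the function names, and the sample labels are assumptions made for illustration; the sketch also assumes that labels contain no double quotes, and a production tool would more likely rely on an RDF library.

```python
BASE = "http://example.org/taxonomy/"  # hypothetical namespace (assumption)


def slug(label):
    """Make a simple local name from a human-readable label."""
    return "".join(c if c.isalnum() else "-" for c in label.lower()).strip("-")


def to_skos(tree, parent=None, lines=None):
    """Walk a {label: {child_label: {...}}} dict and emit one SKOS concept per node."""
    if lines is None:
        lines = [
            "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
            f"@prefix ex: <{BASE}> .",
            "",
        ]
    for label, children in tree.items():
        node = "ex:" + slug(label)
        lines.append(f"{node} a skos:Concept ;")
        if parent is not None:
            # skos:broader points from the narrower concept to its parent.
            lines.append(f"    skos:broader {parent} ;")
        lines.append(f'    skos:prefLabel "{label}"@en .')
        to_skos(children, node, lines)
    return lines


# Illustrative input only; not taken from any real taxonomy.
taxonomy = {"Special functions": {"Gamma function": {}, "Bessel functions": {}}}
print("\n".join(to_skos(taxonomy)))
```

Because the original dictionary is left untouched, the researcher can keep refining the source taxonomy and the export can simply be rerun.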
It is important that the DML engage members of the mathematics community from around the world. Many countries have made considerable investments in mathematical resources, and these investments should be captured, wherever possible, within the DML’s outreach. There are challenges in engaging researchers across languages, but these should be addressed to the best of the DML’s ability.
Partnerships with institutional entities (such as publishers and existing digital resources) are also crucial to the success of the DML. Here, the primary challenge is to be seen as complementary and enhancing, not competitive, while navigating constructive and effective partnerships with publishers, societies, Web services, and others, both specific to mathematics and those serving the much broader scholarly community. These entities control access to much of the mathematical literature under copyright. Only by establishing fruitful partnerships with such content providers and gatekeepers can the DML encompass and link into and out of copyrighted scholarly literature. It is vital that users perceive that the DML is well integrated with commercial services and commercially managed content. As described above, the committee envisions the resources, services, and tools offered by the DML as coexisting with, and often enhancing, the offerings from existing players in the mathematical information landscape.
The key to establishing such partnerships is perceived mutual benefit. Often such mutually beneficial agreements can be built around community-adopted standards and best practices. Patience may also be required, and it may be necessary to start with small agreements and collaborative undertakings. It is necessary to establish trust. Even small agreements can bear significant fruit. For example, the decision of the American Mathematical Society in 2002 to integrate OpenURL5 into the version 8 release of MathSciNet has proven beneficial to MathSciNet, publishers, and MathSciNet users on campuses supporting OpenURL-based link resolvers. The longer-term goal, of course, is richer partnerships between publishers and services like Wikipedia, MathSciNet, MathOverflow, and zbMATH that would facilitate large-scale analytics, linking, and annotation.
5 OpenURL is a standardized Web address format intended to enable Internet users to more easily find resources. OpenURL can be used with any kind of Internet resource but is most commonly used by libraries to connect users to subscription content.
Partnerships with academic institutions involved in the education of future research mathematicians will also be important. These should include both departments of mathematics and academic libraries that serve members of these departments. The long-term sustainability of the DML is dependent on how its value is perceived by future mathematicians. Because the DML will be a distributed entity, many of its elements will almost certainly reside on campuses in mathematics departments and libraries. Additionally, alliances and partnerships with mathematics departments and academic libraries can facilitate partnerships with individual researchers and with publishers and others whose business models depend on subscriptions and memberships from academic libraries and departments.
Finally, there are rich opportunities for collaborations and partnerships with other departments and faculty within higher education, and even commercial partners that share common interests in underlying technologies and processing challenges. This would include computer science departments, schools of information, search engine developers, and others.
As discussed throughout this report, it is essential that the DML engage the mathematics community as it works to cultivate and make sense of available mathematics knowledge. This report does not attempt to recommend how to do this but simply states that this is an important consideration for future DML planning.
Recommendation: Community engagement and the success of community-sourced efforts need to be continuously evaluated throughout DML development and operation to ensure that DML missions continue to align with community needs and that community engagement efforts are effective.
A number of mathematics resources already involve the mathematics community well. MathOverflow does a particularly good job of providing a platform for individuals to post and respond to mathematics research questions. It rewards active users by granting them status and giving them access to additional features.
There is considerable skepticism among the mathematical community that it would be possible to encode the whole mathematical literature “by hand.” It also remains an open question whether such functions can be automated, even in part. One approach to this may be developing a suitable community-sourcing algorithm, similar to the working of Duolingo6 (see Chapter 3) or reCAPTCHA7 (von Ahn et al., 2008). In both cases, users are asked to perform short identification or classification tasks in order to be given access to further material that they want to use or consult. Once a certain statistical consistency is achieved among several user responses for the same task, the identification or classification task is considered complete, and the result can be used (for translation of webpages in the case of Duolingo, or digital encoding of scanned documents in the case of reCAPTCHA). It might even be possible to design a “game with a purpose”8 in which mathematicians worldwide would pair up to play entertaining games, the intermediate results of which would help recognize or characterize formulas in scanned texts, or that graduate students could use to help enhance their understanding of subject matter through chance collaborations and competitions with fellow students around the world.
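The consistency test described above can be illustrated with a short Python sketch. The function name, the minimum of three responses, and the 75 percent agreement threshold are assumptions chosen for the example; they are not parameters of Duolingo or reCAPTCHA, whose actual rules are not described here.

```python
from collections import Counter


def consensus(responses, min_votes=3, min_agreement=0.75):
    """Return the agreed answer, or None if the task still needs more responses.

    min_votes and min_agreement are illustrative thresholds (assumptions).
    """
    if len(responses) < min_votes:
        return None
    answer, count = Counter(responses).most_common(1)[0]
    if count / len(responses) >= min_agreement:
        return answer
    return None


# Four hypothetical users transcribe the same scanned formula.
print(consensus(["x^2 + 1", "x^2 + 1", "x^2 + 1", "x2 + 1"]))
# Too few responses so far, so the task stays open.
print(consensus(["x^2 + 1", "x^2 + 1"]))
```

A task that returns `None` would simply be shown to further users until the threshold is met or an editor intervenes.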
If the DML is to be successful as a platform that enables mathematical users to access information and each other more easily in their pursuit of mathematical learning, then these users will be a huge resource to the DML. As in Wikipedia, individual items such as papers, theorems, formulas, comments, or open problems will be followed and maintained by volunteers. A large number of these volunteers will be students and researchers in mathematics or related fields. They could also play an important role in initiatives that mix community input and machine learning in order to provide useful tagging and links.
These and other models of community engagement should be assessed for the DML.
8 See von Ahn (2006) and Association of Computational Linguistics, “Games with a Purpose,” last modified May 22, 2013, http://aclweb.org/aclwiki/index.php?title=Games_with_a_Purpose.
From the perspective of modern data science, with data sets of petabyte scale, it is not a huge leap to move from dealing with millions of records of publications to hundreds of millions of locations of mathematical equations in the aggregated text of all mathematical documents with thousands of millions of occurrences of mathematical terms in that corpus. Based on MathSciNet by the Numbers,9 at the time of publication there have been approximately 3 million articles and more than 696,000 authors from 1941. Traditionally, data sets of this size have been handled with relational database technology, with searches offered to users through a Web interface. More recently, it has become possible for such data sets to be usefully and easily manipulated using common technology. This means there is a dramatic increase in the potential for distributed users to contribute to substantial analysis and curation of bibliographic data sets on the order of magnitude of all mathematical books and articles.
9 American Mathematical Society, MathSciNet, “MathSciNet by the Numbers,” http://www.ams.org/mathscinet/help/byTheNumbers.html, accessed January 17, 2014.
Given such technological progress, one option is to break mathematical papers down into their component parts, such as concepts, definitions, equations, theorems, and proofs, resulting in a much larger universe of mathematical artifacts, perhaps hundreds or thousands of millions of instances of these component parts. Appropriately deduplicated, these might amount to perhaps 10 million recognizable entities of mathematics, for each of which one could imagine creating a webpage with pointers to at least some of its occurrences in the literature and capabilities for advanced searches and information retrieval at the level of mathematical entities rather than mathematical books and articles. A key point is that once some literature has been mined to identify mathematical entities, for example, by a process of unsupervised machine learning, the resulting entity records can be largely machine-generated, with manual curation likely for at least the most interesting and important entities. Managing large data sets often requires special considerations (NRC, 2013), some of which are discussed in this section.
Dealing with Highly Distributed Data Sources
The base layer of mathematical publications is stored in a large number of widely distributed repositories owned and controlled by a variety of agents—commercial and academic publishers and various digital libraries (JSTOR, Project Euclid, arXiv, etc.)—as is the secondary indexing layer (zbMATH, MathSciNet, Google Scholar, Scirus, Microsoft Academic Search, CrossRef, etc.). Each of these sources has a distinct internal format. The European Digital Mathematics Library (EuDML) already has considerable experience in aggregation of both full text and metadata from diverse sources, and this experience should inform DML efforts.
Tracking Data Provenance—From Data Generation Through Data Preparation
There are several distinct issues to consider as one moves into a complex digital ecosystem such as that characterizing the DML operating environment. One problem is technical and has to do with sourcing information that is aggregated, extracted, computed upon, and the like by the DML (or perhaps other services layered upon the DML services). In this case, one needs, most vitally, to be able to track where information came from; secondarily, there is a need to manage synchronization (but not always automatically perform such synchronization). If information is changed in some source repository, the DML may want to note that the information it is providing depends on an out-of-date version of the source data, unless the DML updates (recomputes) the information to reflect changes. At times it may be necessary to understand dependence and sensitivity, for example, if a given result turns out to be incorrect, what are the implications?
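One way to picture this kind of source tracking is the Python sketch below, in which each derived item records the version of every source it was computed from. The class name, field names, and the repository identifiers are hypothetical; the sketch assumes that each source repository exposes some comparable version marker, which may not hold in practice.

```python
from dataclasses import dataclass, field


@dataclass
class Derived:
    """A DML-computed item plus the source versions it was computed from."""
    name: str
    sources: dict = field(default_factory=dict)  # source id -> version used


def stale_sources(item, current_versions):
    """Sources whose current version differs from the one the item used."""
    return sorted(
        src for src, used in item.sources.items()
        if current_versions.get(src) != used
    )


def affected_by(items, source_id):
    """Names of derived items that depend on a given source."""
    return [item.name for item in items if source_id in item.sources]


# Hypothetical identifiers, for illustration only.
links = Derived("citation-links", {"repoA:paper1": 2, "repoB:paper7": 1})
index = Derived("formula-index", {"repoB:paper7": 1})
current = {"repoA:paper1": 3, "repoB:paper7": 1}

print(stale_sources(links, current))
print(affected_by([links, index], "repoB:paper7"))
```

Here `stale_sources` flags out-of-date dependencies without forcing a recomputation, and `affected_by` answers the dependence question for a source that turns out to be incorrect.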
A second issue deals with permissions and legality and with scholarly norms of attribution. In the primary mathematical literature, citations to other papers, quotations from them, and sometimes reproduced figures, are legally covered in the United States under the doctrine of fair use, although occasionally also by explicit permission of the rights-holder of the cited material. As other types of digital uses become common, both scholarly norms of attribution and legal requirements must follow. Providing attribution is mechanical and largely covered under the source-tracing kind of provenance discussed earlier, although some details, such as the different levels of abstraction in cited objects, need to be sorted out. From a more legalistic perspective, case-by-case analysis is needed; the first step is trying to make sure that there is enough information available to carry out the analysis, algorithmically whenever possible, due to scale and cost factors. For example, if it can be determined that sources are in the public domain, not subject to copyright, or are covered by certain kinds of well-known licenses (such as the Creative Commons series licenses), then much of the work is already done. In some cases, particularly involving articles published before digitization was anticipated (meaning that the division of digitization rights between author and publisher is uncertain), various entities may have to explicitly give the DML permission to perform its content analysis and reuse computations. The DML will need to research and develop novel approaches to support these cases at scale.
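The first, algorithmic step of such an analysis might look like the following Python sketch, which sorts records into those that can proceed and those that need human review. The license tags, the `license` field, and the three outcome labels are assumptions made for illustration; real rights metadata is far less tidy, and which licenses actually permit a given use would have to be decided by the DML and its advisers.

```python
# Hypothetical license tags treated as open for this illustration (assumption).
OPEN_LICENSES = {"public-domain", "CC0", "CC-BY", "CC-BY-SA"}


def triage(record):
    """Classify a record as 'cleared', 'needs-review', or 'unknown'."""
    license_tag = record.get("license")
    if license_tag is None:
        # Not enough information to run the analysis at all.
        return "unknown"
    if license_tag in OPEN_LICENSES:
        return "cleared"
    return "needs-review"


records = [
    {"id": "a1", "license": "CC-BY"},
    {"id": "a2", "license": "all-rights-reserved"},
    {"id": "a3"},
]
for record in records:
    print(record["id"], triage(record))
```

The value of even so crude a filter is that scarce human attention goes only to the "needs-review" and "unknown" piles.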
For the primary published work of mathematics, this problem is largely solved by widely adopted conventions of academic publication (providing authority through publication in peer-reviewed journals), the acknowledgement of primary sources through citation, and, more recently in the digital environment, the use of digital object identifiers and http links to point to sources.
Born-digital enhancements, such as the creation of derivative works from the existing base layer of book and journal data, will necessarily require indications of provenance, but the committee believes that this can be accomplished through open licensing.10 For bibliographic data in the public domain, there is no legal requirement to acknowledge the source of a bibliographic item, although it is best academic practice to acknowledge its source, if only by a hyperlink.
Data validation is an important concern in any information management system but becomes especially important when aggregating multiple data sources together into a coherent knowledge base. The DML would face the challenge of addressing issues such as conflicting and incorrect data, incorrect tagging, and varying formatting syntaxes that can lead to confusion.
ChemSpider,11 a free chemical structure database providing fast text and structure search access to more than 29 million structures from hundreds of data sources, faced data validation problems with its large aggregation of existing databases, data from peer-reviewed journals, and data provided through crowdsourced efforts. For example, each chemical structure can be described using a unique simplified molecular-input line-entry system (SMILES) string. However, Williams (2013) found that when the data from various sources were combined, there were instances where a single SMILES string was being mapped incorrectly to multiple chemical structures. Although these inconsistencies had to be addressed through human intervention, the end result was a much more reliable database with fewer errors.
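The same kind of inconsistency can be detected mechanically in an aggregated bibliographic collection before it is handed to human editors. The Python sketch below is an illustration only: the record layout of (source, identifier, entity) triples and the sample values are assumptions, and it does not reproduce how ChemSpider performed its checks.

```python
from collections import defaultdict


def conflicting_keys(records):
    """Return identifiers that are attached to more than one distinct entity."""
    seen = defaultdict(set)
    for _source, key, entity in records:
        seen[key].add(entity)
    return {
        key: sorted(entities)
        for key, entities in seen.items()
        if len(entities) > 1
    }


# Hypothetical aggregated records: (source, identifier, entity).
records = [
    ("sourceA", "id:001", "Paper X"),
    ("sourceB", "id:001", "Paper X"),
    ("sourceC", "id:001", "Paper Y"),
    ("sourceA", "id:002", "Paper Z"),
]
print(conflicting_keys(records))
```

The output lists only the identifiers that need attention, so the manual effort scales with the number of conflicts rather than with the size of the collection.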
Two separate issues with validating bibliographic data need to be considered:
- The provision and maintenance of adequate schemas for the representation of mathematical bibliographic data records and the capability to check that the structure of a particular record is compliant with the schema; and
- The correctness or accuracy of particular data elements as they appear in a particular record.
Regarding the first issue, the committee expects multiple schemas for the representation of mathematical bibliographic records to coexist for a long time to come, due to the heterogeneity of potential data sources and because normalizing records from different sources to conform to a single schema would be an unnecessary cost. The challenges for the DML in utilizing metadata describing mathematical items will vary according to the type of resource being described. For instance, bibliographic metadata about library books are published in a relatively small number of schemas and are relatively consistent because of the large volume of standards and best practices published over the years by the Online Computer Library Center and most national libraries. Article-level bibliographic metadata for formally published mathematics are found in a greater variety of schemas. Augmentations to and annotations of book-level metadata, particularly in regard to digitized resources, may come from the Open Library, the EuDML, or other library or mathematics-specific community sources in some schema supported by that community.12 These metadata inputs are even more diverse and less interoperable. The main issue for bulk processing of metadata is ensuring that every record is compliant and minimally complete with respect to some schema and associated application profile and that every record clearly indicates the schema and profile to which it adheres. For nonbook resources especially, some effort may be required to normalize certain key properties (e.g., names). Once that is done, it is up to the processing service to achieve an acceptable level of interoperability across a modest number of schemas. The level of interoperability required varies according to service requirements. The committee anticipates that as the DML moves increasingly beyond formally published mathematics literature to also deal with nonbibliographic metadata, the resources needed for metadata remediation and higher levels of interoperability will grow. Again, community involvement in these processes will be critical.
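A bulk compliance check of the kind described above can be sketched in a few lines of Python. The profile names `article-v1` and `book-v1`, their required fields, and the `schema` key are hypothetical; real application profiles for MARC or Dublin Core records are considerably richer.

```python
# Hypothetical application profiles: schema name -> required fields (assumption).
PROFILES = {
    "article-v1": {"title", "authors", "year", "journal"},
    "book-v1": {"title", "authors", "year", "publisher"},
}


def check(record):
    """Return a list of problems; an empty list means minimally complete."""
    schema = record.get("schema")
    if schema not in PROFILES:
        # Every record must declare the schema and profile it adheres to.
        return [f"unknown or missing schema: {schema!r}"]
    missing = sorted(f for f in PROFILES[schema] if not record.get(f))
    return [f"missing field: {name}" for name in missing]


good = {"schema": "article-v1", "title": "T", "authors": ["A"],
        "year": 2014, "journal": "J"}
bad = {"schema": "book-v1", "title": "T"}
print(check(good))
print(check(bad))
```

Records that fail can be routed to remediation while compliant ones proceed, which keeps the check independent of how many schemas coexist.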
Regarding the second issue, it is important that correctness and accuracy of data elements be monitored closely as bibliographic data acquisition and processing are undertaken by the DML. The DML collection will likely contain errors, and there should be procedures in place for users to flag these errors to draw the attention of qualified editors.
Working with Different Data Formats and Structures
The different data formats and structures that the mathematical community finds useful for data representation will evolve over time.13 The committee does not expect the DML to be an innovator in the field of data formats and structures, but rather to be an accommodator of the formats and structures that are widely accepted by the mathematical community and a facilitator of services for translating, when necessary, between formats.
12 Examples include MARC records (see Library of Congress, “Understanding MARC,” September 9, 2013, http://www.loc.gov/marc/umb/) or Dublin Core (Dublin Core Metadata Initiative, http://dublincore.org/, accessed January 16, 2014).
13 This evolution may start with legacy formats such as Dublin Core, TeX, and BibTeX; progress through more advanced forms of XML, including MathML, and JSON for lightweight Web services; and incorporate formats from Mathematica and other mathematical programming languages to the extent possible.
Ensuring Data Security
The committee sees some potential value in providing some user services that require login and storage of private data, such as for private annotations and/or the collection and mining of usage data, which might provide enhanced search and navigation features over the corpus. The committee is open to the possibility of including copyrighted data and extended metadata in the DML, with the aim of providing better services and linking to restricted-access content. These services would require enhanced data security. This would, however, impose a considerable administrative and legal burden on the organization managing the DML. Solution of this problem may depend on how monolithic or distributed the eventual DML architecture turns out to be. Having a safe, secure node in the system operated for the DML as a whole by one of the parties involved might be more feasible than having the parent DML entity responsible for it all.
Developing Scalable and Incremental Algorithms
Literature-based data sets within mathematics are already large enough to provide some algorithmic challenges for tasks like clustering and deduplication. The problem of incremental processing is particularly important for a literature and knowledge base that continues to grow. Typically, some algorithm is applied to generate, say, a clustering or deduplication of a large data set. When new data come in, which might be recent publications or a newly digitized historical source, an update of the data processing is needed to incorporate the new data without reprocessing all of the data. This is particularly important if, subsequent to the original machine processing, there was some annotation or correction of data by human agents. Unless care is taken in managing workflows, there is the danger that these human contributions may be lost or overwritten in the reprocessing.
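The workflow concern in the preceding paragraph can be made concrete with a small Python sketch of incremental deduplication in which human corrections are stored separately and never overwritten by later machine passes. The class, its method names, and the crude title-based matching key are assumptions for illustration; a real system would use a far more careful similarity measure.

```python
def normalize(title):
    """Crude matching key: lowercase letters and digits only (assumption)."""
    return "".join(c for c in title.lower() if c.isalnum())


class IncrementalClusters:
    def __init__(self):
        self.by_key = {}     # machine key -> cluster id
        self.assigned = {}   # record id -> cluster id
        self.manual = set()  # record ids fixed by a human editor
        self.next_id = 0

    def add(self, record_id, title):
        """Place one record without reprocessing earlier ones."""
        if record_id in self.manual:
            # A human decision takes precedence over machine matching.
            return self.assigned[record_id]
        key = normalize(title)
        if key not in self.by_key:
            self.by_key[key] = self.next_id
            self.next_id += 1
        self.assigned[record_id] = self.by_key[key]
        return self.assigned[record_id]

    def correct(self, record_id, cluster_id):
        """Record a human correction that later passes must not override."""
        self.assigned[record_id] = cluster_id
        self.manual.add(record_id)


clusters = IncrementalClusters()
clusters.add("r1", "On Primes")
clusters.add("r2", "On primes.")
clusters.add("r3", "On Prime Ideals")
clusters.correct("r2", clusters.assigned["r3"])  # an editor moves r2
clusters.add("r2", "On primes.")                 # reprocessing keeps the fix
clusters.add("r4", "ON PRIMES")                  # new data, no full rerun
print(clusters.assigned)
```

Keeping the `manual` set apart from the machine-generated keys is the design choice that protects human contributions when new data arrive.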
Usage Tracking for Improvement and Diagnostics
Usage tracking refers to the process of capturing data on how a system is used and by whom. Such information is generally useful in identifying classes of users and their special needs, patterns of usage, beneficial workflows, underutilized areas of the system, and software bugs. Including technology to track such information would help to make the DML increasingly useful and would support diagnostics when users report errors. Types of usage tracking could include the number of times various sub-tools are used by a user during a session, the order of usage, and whether the system failed when a sub-tool was called. This usage data could then be aggregated to get system-wide usage by sub-tool and by pattern of activity. Usage data generally do not contain personally identifiable information; however, they may contain user class information, such as number of times accessed by novice, intermediate, or advanced users or related to classes of data providers, data searchers, and so on.
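The aggregation step can be illustrated with the Python sketch below, which reduces per-session events to system-wide call counts, failure counts, and the order in which sub-tools were used. The event layout of (session, sub-tool, succeeded) and the sub-tool names are assumptions; the session labels are opaque tokens rather than user identities.

```python
from collections import Counter


def summarize(events):
    """Aggregate (session, sub_tool, succeeded) events across the system."""
    calls, failures, sequences = Counter(), Counter(), {}
    for session, tool, succeeded in events:
        calls[tool] += 1
        if not succeeded:
            failures[tool] += 1
        # Preserve the order of usage within each session.
        sequences.setdefault(session, []).append(tool)
    return calls, failures, sequences


# Hypothetical events, for illustration only.
events = [
    ("s1", "search", True),
    ("s1", "formula-lookup", False),
    ("s2", "search", True),
]
calls, failures, sequences = summarize(events)
print(calls)
print(failures)
print(sequences)
```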
User Security and Privacy Control
Systems that require users to register to use that system collect some personally identifiable information. This may just be name and email, but it can include contact information, location, and even financial information. This information is valuable from a system administration perspective because it can be used in a number of ways, from determining billing to identifying special needs by locations. In systems where the users can submit data, personal identifiers are also useful to limit the access of those who abuse the system and to provide recognition for those who provide high levels of valuable content. Such user tracking is thus of particular value when any part of the system employs contributions or community input. Keeping these data segmented from other data and not selling them or giving out user lists can preserve user privacy.
Another reason to have users register is to provide automated links to various social media systems and other online search systems, making it possible for the user to maintain a consistent user profile across tools. User desire for privacy can, in part, be maintained by having an opt-in system.
Interoperability and Linkage to Social Networking Sites
Increasingly, scientists use social networking capabilities as a way to gather and vet data and ideas and as a way to identify and communicate with colleagues. Currently, a plethora of social networking sites are evolving independently. It may be prudent for the DML to let this functionality continue to evolve while supporting interoperability and linkage between various social networking sites to attain full functionality and to support broad usage styles even beyond what has been envisioned.
The mathematical community has a limited capability to create and maintain a new information resource in an environment where a number of organizations, both commercial and noncommercial, have strong interests in owning, controlling, and profiting from the information and knowledge that potentially can be mined from mathematical publications. Scientific publishing as a whole seems to be at a crossroads regarding copyright. The committee foresees that the broad movement toward openness, mostly focused on open-access publications, open-source software, and open data, will likely encourage changes to the current copyright models used by many major publishers, as well as to scholarly practice and scholarly communication more broadly. While this report does not take a position with respect to publishing copyrights, the committee believes that all content created by and for the DML should be open to encourage the most buy-in from mathematicians and from potentially collaborating organizations.
The proposed DML organization could, for instance, oversee the creation and maintenance of a set of open resources—an ontology and collections of links—many of which rely on identification and extraction of objects or structures within the mathematical literature,14 community input related to these objects, software used in mathematical research, and links to published literature. These object and link collections could be built up in large part by repeatedly computing over available collections of mathematical content. The initial DML creation and development will be challenging in terms of establishing the technological capabilities, engaging partners and the community, and planning for future growth. The insights about connections across the literature will be strengthened and become more useful. The process could begin with relatively open materials and willing partners; assuming that these services prove to add sufficient value, more holders of restricted-access materials may make arrangements to participate, and the net coverage of the mathematical literature would grow.
It is essential that the DML have access to and work well with all of the available mathematics literature, regardless of copyright status. While it might be tempting to build a system based on openly available material, such as mathematics heritage literature, the committee is convinced that the DML can be productive only if it has systematic input from and enthusiastic support by the mathematical community, which is unlikely to happen if the scope is restricted to open literature. In addition, it is envisioned that the DML computational services will be hospitable to new forms of mathematical scholarly communication (preprints, review papers, books, video material, etc.).
14 These mathematical objects and structures include a reference, keyword or phrase, theorem, proof, definition, equation, special function, conjecture, formula, transform, sequence, or symbol.
The committee is also cognizant of the current state of mathematics information resources and the systemic problems of compartmentalization, navigation, access, and maintenance. Briefly, compartmentalization (the partitioning of information and its maintenance by publisher or service provider) results in various agents having ownership and control of information and its maintenance, which can be sold to users as subscription services. Compartmentalization makes it difficult for users to navigate across boundaries, determine what information is accessible to them, and quickly access information. Unfortunately, the ability of services such as Google Search and Wikipedia to counter the compartmentalization problem is of only limited value in the highly structured discipline of mathematics, which requires structured information resources to provide better means of browsing and navigating the mathematical universe. Finding practical solutions to the challenges of compartmentalization, navigation, access, and maintenance, or at least compromises that allow progress, is the main challenge facing DML development.
Recommendation: The Digital Mathematics Library should be open and built to cooperate with both researchers and existing services. In particular, the content (knowledge structures) of the library, at least for vocabularies, tags, and links, should also be open, although the library will link to both open and copyright-restricted literature.
Many of the lists of mathematical objects described in Chapter 1 require expert and ongoing maintenance, and the DML needs to consider how to design its lists in such a way as to lessen their maintenance burden. With existing lists, it is often not clear how a user of the list can contribute new entries or edit existing ones. The problem is most obvious for lists published in copyrighted books, but it also exists for lists housed on other public sites. Some questions that need to be addressed are these: Who is responsible for maintaining this list? Is it a robot or a human? There is no established format or data schema for online publication of lists of mathematical objects, which complicates a machine’s ability to read and reuse them. Rather, online representations of traditional print copies are prevalent. Often, and especially for lists contained in books, there are copyright restrictions that inhibit the process of maintenance, enlargement, enhancement, and reuse of these lists. Many of these lists do not provide links to primary or even secondary online sources.15 Very few of these lists provide computable representations of the objects listed, such as code that can be passed to computing software.16 These capabilities are important to mathematical research because merely knowing the formula is often insufficient; researchers also want to know how it was proven, the history of the equation, and how it has evolved over time.
15 Wikipedia supplies links where they have been provided, and the Online Encyclopedia of Integer Sequences (OEIS) does provide plain text references, but they are not generally hyperlinked.
16 OEIS does provide both Mathematica and Maple code to generate most of its sequences and the Wolfram functions site offers Mathematica code for its basic functions, but not for any functional identities.
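To make the missing pieces concrete, the following is a minimal sketch of what a machine-readable entry in such a list might look like. It is an illustration only, not a proposed DML schema: the `ListEntry` class, its field names, the `dml:` identifier scheme, and the sample values are all hypothetical. The sketch records who maintains the entry, under what license it may be reused, where its sources are, and a computable representation, and it serializes the entry to JSON so that software can read and reuse it.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ListEntry:
    """One entry in a hypothetical list of mathematical objects."""
    identifier: str   # stable ID assigned by the list (hypothetical scheme)
    kind: str         # e.g. "sequence", "formula", "special function"
    statement: str    # human-readable statement, here written in LaTeX
    maintainer: str   # who is responsible: a person or an automated agent
    license: str      # reuse terms for the entry itself
    sources: list = field(default_factory=list)  # links to primary or secondary literature
    code: dict = field(default_factory=dict)     # computable representations, keyed by language


# Invented sample values, for illustration only.
entry = ListEntry(
    identifier="dml:example/0001",
    kind="sequence",
    statement=r"a_n = n^2, \quad n \ge 0",
    maintainer="human:example-editor",
    license="CC BY",
    sources=["https://example.org/placeholder-reference"],
    code={"python": "lambda n: n * n"},
)

# Serialize to JSON so that other services can read and reuse the entry,
# then rebuild the entry from the JSON text to show the round trip.
serialized = json.dumps(asdict(entry), indent=2, sort_keys=True)
restored = ListEntry(**json.loads(serialized))
```

Because the record is plain structured data, the questions raised above (who maintains the list, whether a source is linked, whether a computable form exists) become fields that a program can check rather than facts a reader has to infer from a scanned page.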
Experience to date with digital libraries and digital resources provides some insight and guidelines for how to approach the maintenance problems, specifically how to set up copyrights and licensing agreements, how to provide APIs, how to ensure that multiple copies of the information are always available, how to establish clear indications of provenance, and how to standardize and manage user contributions. These are fairly universal problems, and they should be amenable to fairly universal solutions with best practices provided by a central DML organization that is sensitive to the needs of the math community.
The maintenance strategy of the Online Encyclopedia of Integer Sequences (OEIS) seems particularly well suited for the DML. OEIS has developed a community of researchers in combinatorics who use it routinely in their research and who contribute to its maintenance. Essential here is the grass-roots nature of the effort. It was developed by one leading initiator, Neil Sloane, who had a vision of what could be done with a database of sequences and who gradually got people around him to contribute to it while enhancing the underlying software and functionality. The resource was developed in direct response to the interests and needs of a research community (and also with considerable interest from a larger community focused on recreational mathematics and pedagogy), and it was kept free and open, which engaged the community.
Another resource with similar communities of contributors/maintainers is Research Papers in Economics (RePEc). This is more of a traditional bibliographic resource than a database of entities, but the principles are very similar: find a way to make it easy for experts to contribute their domain knowledge and build up a knowledge base.
Community information projects often require both an inspired creator, often unrewarded at the start, and eventually a transition to a paid staff after the work grows beyond the capacity of an individual, even an individual assisted by a crowd-sourced effort. For example, arXiv was started by Paul Ginsparg alone at Los Alamos National Laboratory but is now run by the Cornell University Library. Ginsparg is still very active and involved in policy, but he cannot personally make every decision of the form, “Does this paper belong in cs.DL or cs.CY?” The Internet Archive similarly has a visionary, Brewster Kahle, founder and still in charge, but it also has a paid staff to keep operations going.
Once the resource gets large enough to be of substantial value to the community, it has to be legally constituted to avoid issues of ownership and control. The use of the Creative Commons license17 is an approach that the committee believes would work well for the DML. OEIS uses the
In order for the DML to successfully maintain a database resource, it has to deal with the technical and human components. On the technical side, the DML has to provide adequate version-control and editorial software (similar to Wikimedia) to manage the deposit, editing, and cross-linking of documents. It is essential that this software work well and be kept up to date and well adapted to the current information environment. Some centralization of this activity seems beneficial. On the human side, the DML has to motivate people to contribute to the parts of the effort that are not easily or fully automated. One way to do that is to provide well-designed software that handles the routine parts for them and allows them to focus on the parts where their expertise is really needed. Many database maintainers try to build and customize this sort of software for themselves, but then they get overwhelmed by software maintenance and spend more time dealing with that than they do contributing their domain expertise to the database. The DML could provide out-of-the-box software (or a Web service solution) for each math sub-community to curate its own material for the benefit of a larger audience. The DML software would include mathematical knowledge, so that it could display properly formatted theorems and recognize structural similarities, which is often not possible in the numerous existing collaborative software offerings.
If the DML can provide a good software solution for managing mathematical entities, and deal with the management of that software in a central way, it can provide something that a large number of different mathematical communities could adapt for their own purposes, hopefully maintaining some centrally supported capabilities (version control, linking, math display, search, etc.) without each sub-community having to solve these problems separately. At the very least, having some common standards for data exchange and interoperability, and some common reliable components for which there was some central support, would lessen the maintenance problem.
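As one illustration of the kind of centrally supported capability described above, the sketch below shows an append-only revision log for a single entry, in which every revision records its author and an edit summary and is chained to its predecessor by a hash. This is a simplified assumption about how version control and provenance might be combined, not a description of any existing DML or Wikimedia software; the `Revision` and `EntryHistory` names and the contributor identifiers are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    text: str     # full content of the entry at this revision
    author: str   # contributor identifier (hypothetical scheme)
    comment: str  # short edit summary
    parent: str   # digest of the previous revision, "" for the first one

    @property
    def digest(self) -> str:
        # Hash the content together with its metadata so that any change
        # to either produces a different revision ID.
        payload = "\x00".join([self.parent, self.author, self.comment, self.text])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EntryHistory:
    """Append-only revision log for a single entry."""

    def __init__(self):
        self.revisions = []

    def commit(self, text, author, comment):
        # Link the new revision to the current head of the log.
        parent = self.revisions[-1].digest if self.revisions else ""
        revision = Revision(text, author, comment, parent)
        self.revisions.append(revision)
        return revision.digest

    def verify(self):
        # Check that each revision points at the digest of its predecessor.
        expected = ""
        for revision in self.revisions:
            if revision.parent != expected:
                return False
            expected = revision.digest
        return True


history = EntryHistory()
history.commit(r"\sum_{k=1}^{n} k = n(n+1)/2", "alice", "initial statement")
history.commit(r"\sum_{k=1}^{n} k = \frac{n(n+1)}{2}", "bob", "typeset fraction")
```

The design choice worth noting is that provenance is stored with every revision rather than reconstructed afterward, so a sub-community that adopts the shared component gets an audit trail without building one itself.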
Some of the maintenance of the DML lists may be automated as well. The key is to find a balance between automated data mining of the literature and human annotation and curation. More work and experimentation are needed to develop editorial systems to assist this process. The main goal is to provide good tools to do largely successful cleaning and reduction of data before bringing portions of them to the attention of domain experts, whose time is limited, or possibly crowdsourcing less demanding tasks. Tools like Google Refine20 and flexible, faceted displays of bibliographic data like BibServer21 are very useful for this.
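The balance between automated cleaning and expert attention can be sketched in a few lines. The example below is an assumption-laden illustration, not the method used by Google Refine or BibServer: it groups bibliographic records by a crudely normalized title, merges groups automatically when their metadata agree, and queues everything else for human review. The function names, the record fields, and the sample records are all invented for the purpose.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Reduce a title to a crude comparison key: lowercase ASCII words only."""
    ascii_text = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return " ".join(re.findall(r"[a-z0-9]+", ascii_text.lower()))


def triage(records):
    """Split records into automatically merged entries and a review queue."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_title(record["title"])].append(record)

    auto_merged, needs_review = [], []
    for group in groups.values():
        years = {record.get("year") for record in group}
        if len(years) == 1 and None not in years:
            auto_merged.append(group[0])  # duplicates agree, so keep one copy
        else:
            needs_review.append(group)    # conflicting or missing metadata
    return auto_merged, needs_review


# Invented sample records, for illustration only.
records = [
    {"title": "On the Theory of Partitions", "year": 1951},
    {"title": "On the theory of partitions.", "year": 1951},
    {"title": "Über die Verteilung der Primzahlen", "year": 1903},
    {"title": "Uber die Verteilung der Primzahlen", "year": 1908},
    {"title": "A Note on Lattice Paths", "year": None},
]
auto_merged, needs_review = triage(records)
```

Here the two partition records differ only in capitalization and punctuation and are merged without human involvement, while the records with conflicting or missing years are set aside, so that an expert sees only the cases that actually need judgment.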
Both Google Scholar22 and Microsoft Academic Search23 do a huge amount of fully automated data processing of general academic bibliographic data. The methods behind these services could undoubtedly be brought to bear on more specialized data mining and data structuring tasks of the kind relevant to text mining the mathematical literature for formulas and the like. LaTeX Search24 (Springer’s free formula search) provides a step in this direction by allowing users to locate and view equations containing specific LaTeX code, equations similar to another LaTeX string, equations belonging to a specific digital object identifier, and equations belonging to an article or articles with a particular word or phrase in the title.
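To give a sense of what "equations similar to another LaTeX string" could mean computationally, the sketch below scores similarity as the overlap between the sets of tokens in two LaTeX strings (Jaccard similarity). This is a deliberately simple stand-in chosen for illustration; it is not a description of how LaTeX Search or any other service works, and the function names, the threshold, the placeholder DOIs, and the sample formulas are all assumptions.

```python
import re

# Split a LaTeX string into commands (such as \frac), alphanumeric runs,
# and single non-space symbols.
TOKEN = re.compile(r"\\[A-Za-z]+|[A-Za-z0-9]+|\S")


def tokens(latex):
    return set(TOKEN.findall(latex))


def similarity(a, b):
    """Jaccard similarity of the token sets of two LaTeX strings (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def search(query, corpus, threshold=0.5):
    """Return (score, doi, formula) tuples at or above the threshold, best first."""
    hits = [(similarity(query, formula), doi, formula) for doi, formula in corpus]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


# Placeholder DOIs and sample formulas, for illustration only.
corpus = [
    ("10.0000/example.1", r"\int_0^1 x^2 \, dx = \frac{1}{3}"),
    ("10.0000/example.2", r"e^{i\pi} + 1 = 0"),
    ("10.0000/example.3", r"\int_0^1 x^3 \, dx = \frac{1}{4}"),
]
results = search(r"\int_0^1 x^2 dx", corpus)
```

A token-set comparison ignores the order and nesting of symbols, so it cannot tell structurally different formulas apart when they share the same symbols; a production system would need a structure-aware representation of the formula.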
The DML will also have to develop in such a way as to learn from and complement the broader data conservation and data preservation movement, helping to organize and preserve the mathematical information it contains. It may be beneficial to cooperate with groups such as LOCKSS,25 Portico,26 or HathiTrust,27 which do digital preservation today, and coordinate with projects such as the Data Conservancy,28 DSpace,29 and the linked open data movement, which are laying the groundwork for more powerful preservation techniques in the future.