4

Strategic Plan

This chapter proposes a strategic plan for incremental and modular development of the Digital Mathematics Library (DML), with the aim of providing the mathematical community with at least some of the specific capabilities described in Chapter 3, as well as some further capabilities that should follow as corollaries of the basic development. The committee’s strategic plan contains the following elements:

  • Fundamental principles of the DML vision;
  • Constitution of a nonprofit organization committed to development of the DML collection and services, called the DML organization;
  • Initial development;
  • Priorities for collections and service development;
  • Technical considerations; and
  • Resources needed.

Each of these elements is discussed in detail in the following sections.

FUNDAMENTAL PRINCIPLES

The committee envisions the next step in advancing mathematics to go beyond traditional mathematical publications and take advantage of the mathematical information and knowledge stored in those publications to create a network of information that can be easily explored and manipulated. There is a compelling argument that through a combination of machine learning methods and editorial effort by both paid and volunteer editors, a significant portion of the information and knowledge in the global mathematical corpus could be made available to researchers as linked open data through the DML. The DML would help index and make discoverable collections of information created and maintained by distributed editors and specialized machine agents, much as Google now indexes and makes available information drawn from across the Web, but without the centralized processing and caching. But the DML would also need to engage substantial editorial input from the mathematical community. The DML would afford functionalities and services over the aggregated information, including capabilities for searching, browsing, navigating, linking, computing, and visualizing and analyzing, over both copyrighted and openly licensed content.

Some, but by no means all, of the proposed additional services and knowledge management utilities will rely on analysis of full content, done in a coordinated fashion. Other services will rely on analysis of metadata, which are often accessible with fewer or no restrictions. The committee feels that today (through reliance on a broad, distributed community, adherence to emerging standards and best practices, the use of new distributed collaboration and editing workflow models, and reliance on the affordances of emerging technologies such as linked open data and machine learning methods) these content and metadata analyses can be accomplished successfully in a distributed fashion, that is, without having to acquire, process, or store the entire universe of all mathematics publications centrally. While the approach outlined would require the central (or at least centrally coordinated) maintenance of key concept vocabularies and ontologies, large-scale, centralized processing and storage of mathematical publications would not be necessary.

The committee has identified a compelling opportunity for the following:

  • The DML as a large, open collection of mathematical bibliographic information and mathematical concepts (e.g., axioms, definitions, theorems, proofs, formulas, equations, numbers, sets, functions) and objects (e.g., groups, rings) aggregated from diverse sources;
  • Integrating and organizing the DML with existing repositories of publications and with indexing and computing services (as discussed in the Chapter 3 section on “Developing Partnerships”);
  • Encouraging, facilitating, and supporting the development and promulgation of novel Web and desktop services, including annotation, collection, and collaboration tools, and tools for search and literature-based discovery, that can be utilized within the DML;
  • Supporting experimental and production applications of machine learning methods for the extraction of various mathematical entities, including topics, formulas, equations, and theorems, by data
    mining and large-scale data analysis of suitable portions of the mathematical corpus; and
  • Supporting a combination of community input and traditional editorial workflows for validation of outputs of such machine processing and contribution of such outputs to the DML.

The committee believes that it is necessary for the people and organizations involved in the DML to adopt some basic principles to guide the DML to reach its full potential.

Adherence to Best Practices and Standards

The proposed DML would benefit from adhering to broad technical standards and built-in interoperability, both for encouraging partnerships and taking advantage of non-mathematics-specific Web technologies that become available (Aalbersberg and Kähler, 2011; Gill and Miller, 2002). The DML would benefit from being developed with a modular architecture, allowing various technical development efforts to proceed in parallel with minimal coordination. There should be some initial agreements in principle about the nature of inputs and outputs of various components and Web services. One illustration of the importance of technical standards in mathematics is the value of TeX (and LaTeX), which standardized mathematical typesetting and revolutionized research mathematics publications.

The DML architecture should adhere as much as possible to contemporary and evolving Web architecture standards for all its services, especially the standards of linked open data for publishing structured data on the Web so that it can be interlinked and become more useful. Linked open data allow a webpage to dynamically pull relevant information from related websites. For example, a website that displays local weather could pull information from an unrelated local traffic monitoring site to alert users to delays or road conditions, and it could pull from the local school district’s website to alert users about potential closures.
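
The same idea can be sketched for mathematical data. The following is a minimal illustration under stated assumptions, not a DML design: statements are modeled as plain (subject, predicate, object) tuples rather than with an RDF library, and every URI and predicate name (`example.org`, `states`, `about`, and so on) is hypothetical. Two sources publish statements independently, and a consumer follows shared identifiers from one source into the other.

```python
from collections import defaultdict

# Hypothetical statements published independently by two sources.
PUBLISHER_DATA = [
    ("http://example.org/paper/1", "title", "A Sample Paper"),
    ("http://example.org/paper/1", "states", "http://example.org/theorem/7"),
]
ENCYCLOPEDIA_DATA = [
    ("http://example.org/theorem/7", "label", "Sample Theorem"),
    ("http://example.org/theorem/7", "about", "http://example.org/topic/graphs"),
]


def build_index(*sources):
    """Merge any number of triple lists into one subject-keyed index."""
    index = defaultdict(list)
    for source in sources:
        for subject, predicate, obj in source:
            index[subject].append((predicate, obj))
    return index


def describe(index, subject, depth=1):
    """Collect statements about a subject, following links up to `depth` hops."""
    facts = []
    for predicate, obj in index.get(subject, []):
        facts.append((subject, predicate, obj))
        # An object that is itself a known subject is a link worth following.
        if depth > 0 and obj in index:
            facts.extend(describe(index, obj, depth - 1))
    return facts


index = build_index(PUBLISHER_DATA, ENCYCLOPEDIA_DATA)
facts = describe(index, "http://example.org/paper/1")
for fact in facts:
    print(fact)
```

Because both sources use the same identifier for the theorem, the description of the paper picks up the encyclopedia's statements without either source knowing about the other.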
Linked open data are particularly valuable within the proposed DML because much of the value of the information comes from its connections with outside existing data. If these connections can be strengthened, the network of mathematical information will solidify, providing a clearer picture of the realm of mathematical research.

The DML, as proposed, would not be collecting large amounts of copyrighted material; however, it would be amassing its own data collection of connections and understanding of mathematical information. These data (i.e., vocabularies, ontologies, annotations) and the DML-developed/supported software would benefit from being open source so that other researchers and developers could build upon them. The DML would need to
respect and recognize copyright limitations and work with publishers to make sure these can stay in place even while having minimal impact on the ability of users to discover and learn about resources.

The DML would also benefit from adhering to accepted norms for citations and evaluations. This may take the form of systematic application and support of the San Francisco Declaration on Research Assessment (American Society for Cell Biology, 2012) about emerging practices related to the evaluation of research articles.

Recommendation: The Digital Mathematics Library should serve as a nexus for the coordination of research and research outcomes, including community endorsements, and encourage best practices to facilitate knowledge management in research mathematics.

Competition and Cooperation with Other Organizations

To the greatest extent consistent with its goals and principles, the DML should seek to cooperate with and not to compete with existing information services and communication and desktop tools that are widely used by the mathematical community. Cooperation would include the following:

  • Agreements on the structure of suitable data schemas for representation of bibliographic and mathematical information, including standards for representation of mathematics on the Web (MathML, MathJax, etc.);
  • Agreements on systematic use of identifiers and openly accessible Web services supported by other organizations (e.g., DOIs, Handles,[1] ORCIDs, MR and ZMATH identifiers, OCLC identifiers) instead of replication of these identifiers and associated services by the DML;
  • Provision of agreements and conversion services, as needed, to ensure metadata interoperability and aggregation of data from various services; and
  • Support for interfaces between the DML and existing information resources listed in Appendix C (for example, bibliographical, encyclopedic, content, social environments).
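
Reusing existing identifiers, rather than minting new ones, mostly comes down to normalizing and checking them before records are linked. The sketch below is illustrative only and assumes two things that the text does not specify: that DOIs are compared case-insensitively after stripping common resolver prefixes (the pattern used is a heuristic, not a full DOI grammar), and that ORCID iDs are checked with their published ISO 7064 MOD 11-2 check digit. The function names are hypothetical.

```python
import re

# Common ways a DOI is written; the list is an assumption, not exhaustive.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Heuristic shape check only: "10.", a registrant code, a slash, a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Strip resolver prefixes and lowercase, so equal DOIs compare equal."""
    value = raw.strip()
    lowered = value.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    value = value.lower()
    if not DOI_RE.match(value):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return value


def orcid_check_digit(base15):
    """ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """True if a 16-character ORCID iD has a correct check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base = compact[:15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15]


print(normalize_doi("https://doi.org/10.1000/ABC123"))
print(is_valid_orcid("0000-0002-1825-0097"))
```

Two records whose DOIs normalize to the same string can then be treated as describing the same publication, with resolution of the identifier itself left to the organization that maintains it.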

This cooperation also applies to arXiv, Wikipedia, MathSciNet, zbMATH, Google, Microsoft, and the general abstracting and indexing services, as well as to various companies with proprietary interests in mathematical communication and computation whose products the DML should seek to enhance and make more openly accessible and reusable. This list of companies includes the following:

  • Springer (with large amounts of mathematical information in SpringerLink[2] and its proprietary LaTeX search);
  • Wolfram (with large amounts of mathematical information embedded in Mathematica and Wolfram|Alpha);
  • Elsevier; and
  • Maplesoft, a subsidiary of Cybernet Systems Co. Ltd. in Japan and a provider of software tools for engineering, science, and mathematics, especially Maple,[3] a powerful mathematical computation engine.

[1] Handle System, http://www.handle.net/, accessed January 16, 2014.

In areas where data standards are well established, such as for basic bibliographic data elements, such cooperation may be achieved by the DML organization with different data sources and services individually. For more complex data objects, especially those representing mathematical concepts, a community process, such as those commonly conducted by the World Wide Web Consortium,[4] should be involved in the selection and adoption of data standards by the DML. It is recognized that such data standards may typically start as ad hoc standards that eventually become codified and formalized through widespread use (e.g., Microformats Wiki[5]). The committee recognizes that some existing agents may be reluctant to cooperate with the DML in either development of data schemas, sharing of data, or both. In those cases, the DML should not allocate administrative effort to negotiating cooperation but rather find alternative agents who are willing to cooperate in providing the needed data or services in a manner consistent with DML principles.

Collection from Diverse Sources

The DML should commit to support curation and management of mathematical information from diverse sources and facilitate access to mathematical information even though the sources are stored in different organizations.
Similarly, CrossRef[6] currently tells users how to find items from different vendors, LOCKSS[7] manages shared storage across libraries, and ORCID[8] helps identify authors across publications. In particular, the DML should aim to acquire and process the following:

  • Previously unindexed or partially indexed information about mathematical publications (including traditional journal papers, books, and other electronic resources) and their contents, such as their reference lists, names of their sections or chapters (table-of-contents data), their formulas, equations, theorems, and conjectures;
  • Information relating to the relations of such data elements within various publications and the relations of these elements to various standardized lists of such elements; and
  • Information from mathematicians’ homepages, blogs, and discussion forums.

The DML should accept inputs of such data from all sources, commercial and noncommercial, subject only to copyright and licensing requirements indicated earlier, the judgment of DML-appointed editors that the material is suitable for inclusion in the DML, and the resources to process the data for ingestion into the DML. In particular, the DML should invite contributions of such content from both copyrighted and open-access sources. In all cases, the DML should commit to appropriate acknowledgement of the source and to inclusion of agreed indications of provenance in its data records.

[2] Springer Link, http://link.springer.com/, accessed January 16, 2014.
[3] Maplesoft, “Maple 17,” http://www.maplesoft.com/products/Maple/, accessed January 16, 2014.
[4] W3C, http://www.w3.org/, accessed January 16, 2014.
[5] Microformats, “The Microformats Process,” last modified April 28, 2013, http://microformats.org/wiki/process.
[6] Crossref, http://www.crossref.org/, accessed January 16, 2014.

Support for Multiple Formats, Conversion Tools, and Best Practices

In many instances the cost of negotiating cooperation in schema standards may greatly exceed the potential reward of doing so. In such cases, it will be best for the DML to move ahead with lowest-common-denominator standards that are good enough for most applications and to which it is possible to map data from multiple alternative formats.
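
A mapping of this kind can be sketched in a few lines. The example is a hedged illustration rather than a proposed DML schema: the two providers, their native field names, and the four-field common record are all invented for the purpose, and the output is rendered as a BibTeX-style `@article` entry simply because that syntax is widely understood.

```python
# The lowest-common-denominator record: a handful of plain-text fields.
COMMON_FIELDS = ("author", "title", "journal", "year")

# Hypothetical native field names used by two imaginary data providers.
FIELD_MAPS = {
    "provider_a": {
        "creator": "author",
        "article_title": "title",
        "source": "journal",
        "pub_year": "year",
    },
    "provider_b": {"by": "author", "name": "title", "in": "journal", "date": "year"},
}


def to_common(record, provider):
    """Map a provider-specific record onto the common fields.

    Fields with no mapping are dropped; values are converted to text
    and runs of whitespace are collapsed.
    """
    mapping = FIELD_MAPS[provider]
    common = {}
    for native_name, value in record.items():
        target = mapping.get(native_name)
        text = " ".join(str(value).split())
        if target is not None and text:
            common[target] = text
    return common


def to_bibtex(key, common):
    """Render a common record as a BibTeX-style @article entry."""
    lines = ["@article{%s," % key]
    for field_name in COMMON_FIELDS:
        if field_name in common:
            lines.append("  %s = {%s}," % (field_name, common[field_name]))
    lines.append("}")
    return "\n".join(lines)


record_a = to_common(
    {"creator": "A. Author", "article_title": "A  Sample Title",
     "pub_year": 2014, "internal_id": "x-17"},
    "provider_a",
)
record_b = to_common(
    {"by": "A. Author", "name": "A Sample Title", "date": "2014"},
    "provider_b",
)
print(to_bibtex("sample2014", record_a))
```

Both inputs reduce to the same common record, which is the point of settling for a modest target format: detail that one provider supplies and another does not is lost, but everything that arrives can be compared and merged.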
Current examples of such standards are BibTeX, or slight enhancements thereof like BibJSON and BibXML, to which it is possible to map almost any reference text string that can be recognized as such by a human. A somewhat higher standard is provided by the European Digital Mathematics Library (EuDML) metadata schema specification[9] for typical mathematical article metadata supplied by a cooperative publisher. The DML should research and support multiple tools and services for the acquisition of data in diverse native formats and its conversion to higher-quality bibliographic formats such as those mentioned above. It should also provide guidance for best practices in managing various data and metadata formats and support basic communication spaces, such as an email list or help desk for data managers encountering issues in cleaning and converting diverse data sets of interest to the DML community. Examples of conversion tools for bibliographic data that are already very useful, although relatively unknown, are pdftotext,[10] MREF,[11] EJP-ECP Reference List Formatter,[12] inSPIRE-HEP,[13] BibSonomy Scrapers,[14] Google Refine,[15] and Beautiful Soup.[16]

The creation of such data-conversion tools is typically a fairly straightforward programming task in which the difficulty depends on the complexity of the tool. However, such tools and their derivatives do impose a progressive maintenance burden to keep them compliant with changing data formats and expectations for both inputs and outputs, and with new versions of underlying software libraries and implementations. But the maintenance of such low-cost, high-reward data conversion and cleaning services, or links to the best maintained of these services and documentation of how to use them for DML purposes, is among the things the DML should commit to supporting.

[7] LOCKSS: Lots of Copies Keep Stuff Safe, http://www.lockss.org/, accessed January 16, 2014.
[8] ORCID, http://orcid.org/, accessed January 16, 2014.
[9] European Digital Mathematics Library, EuDML Metadata Schema Specification (v2.0-final), https://project.eudml.org/eudml-metadata-schema-specification-v20-final, accessed January 16, 2014.

Flexibility and Extensibility of Schemas and Services

Recognizing the systemic compartmentalization problems caused by traditional database schemas and implementations, all DML schemas should adhere to current and emerging best principles of flexibility and extensibility.
In particular, DML architecture should allow and encourage the following:

  • Inclusion of data in a virtual collection from an essentially unlimited number of disparate and distributed resources of greatly varying sizes. Examples would include data stored on individual webpages and marked up with information, as is done with COinS,[17] the emerging standards of schema.org or similar math-specific standards that might be developed by the DML community, or data available from various data providers via application programming interfaces (APIs) or periodic data dumps; and
  • Creation of new features, tools, and services over DML data by individual and organizational participants, such as those outlined in Chapter 3, or by yet unimagined services that will develop in the future.

[10] “Pdftotext,” Wikipedia, last modified July 11, 2013, http://en.wikipedia.org/wiki/Pdftotext.
[11] American Mathematical Society, MRef, http://www.ams.org/mref, accessed January 16, 2014.
[12] Electronic Journal of Probability, “Reference List Formatting,” http://ejp.ejpecp.org/pages/view/ref_list, accessed January 16, 2014.
[13] INSPIRE, “Generating Your Bibliography,” http://inspirehep.net/info/hep/tools/bibliography_generate?ln=en, accessed January 16, 2014.
[14] BibSonomy, “Scraper Info,” http://www.bibsonomy.org/scraperinfo, accessed January 16, 2014.
[15] Google-refine, https://code.google.com/p/google-refine/, accessed January 16, 2014.
[16] Freebase, http://www.freebase.com/, accessed January 16, 2014.

Relation of the DML to Computer Algebra Systems and Formalization of Mathematics

There is a community of mathematical knowledge management, built largely around the development of formal theorem provers and reasoners (Carette and Farmer, 2009).[18],[19] This community proposed an ambitious program of formalization of mathematics, following earlier efforts by Whitehead and Russell (1910, 1912, 1913), Hilbert’s program,[20] and others. Some notable successes of this school are computer-automated proofs of a number of important mathematical theorems, such as the famous four-color theorem. The committee anticipates further advances in this field, and perhaps some eventual synthesis of computer algebra systems (Mathematica, Maple, Sage, etc.) with the theorem provers. However, progress in this area has been slow, and there are deep cultural impediments, principally the fact that the dominant computer algebra systems are proprietary and likely to remain so for the foreseeable future.

Summary of Principles

Consistent application of the principles in this section to the representation of mathematical information and conceptual knowledge in the World Wide Web will enable the mathematical community to achieve the most effective instantiation of the DML as an openly navigable representation of the universe of mathematical concepts, formulas, and relations.
To achieve this, the DML would be just as accessible to human users as Wikipedia is today, with the same open license for text contributions and a public domain license for bibliographic and mathematical facts; it would be properly structured for machine access and reuse in discovery services; and it would be connected directly, through desktop software and Web services, to the mathematical research literature, current and future abstracting and indexing services, computational services such as Wolfram|Alpha, and desktop programs such as Mathematica, Maple, and Sage.

[17] OpenURL COinS: A Convention to Embed Bibliographic Metadata in HTML, Stable Version 1.0, http://ocoins.info/, accessed January 16, 2014.
[18] Mizar Home Page, last modified January 8, 2014, http://mizar.org/.
[19] Coq Proof Assistant, http://coq.inria.fr/, accessed January 16, 2014.
[20] “Hilbert’s Program,” Wikipedia, last modified January 3, 2014, http://en.wikipedia.org/wiki/Hilbert%27s_program.

CONSTITUTION OF THE DIGITAL MATHEMATICS LIBRARY ORGANIZATION

The first step in this process is creating an organization that can manage and encourage the creation of a knowledge-based library of mathematical concepts and advocate for the needs of the mathematical community. The committee believes the DML effort would benefit from being spearheaded by a small centralized agent to avoid the project failing because of competing time commitments of its founders, which has happened in several cases mentioned in Appendix C. It is hoped that the DML can reach beyond this initial startup hurdle and ultimately succeed because of its core of dedicated staff, collaborators, and funders, and ultimately create a strong, stable, and meaningful resource that is worthy of continued investment from the mathematics community.

Recommendation: A Digital Mathematics Library organization should be created to manage and encourage the creation of a knowledge-based library of mathematical concepts such as theorems and proofs.

Recommendation: The Digital Mathematics Library organization should be an advocate for the mathematics community and help develop plans for development and funding of open information systems of use to mathematicians.

The DML organization would benefit from being a small organization with minimal central agency and control.
It is also important that the DML be able to operate in an environment of much larger organizations with big budgets and capability for sustained legal actions to achieve their ends. To survive as a small operation in a big information universe, it is important that the DML be organizationally nimble, quick to initiate pilot projects, and generally quick to learn from the experiences of both successful and unsuccessful efforts, both its own and those of others aiming to develop domain-specific knowledge bases. The DML could be largely reliant on other organizations to provide hosting for such organizational essentials as

  • Basic computing and networking infrastructure, support, and services;
  • Archiving (to be achieved in collaboration with existing scientific data and library archiving organizations); and
  • Office space and administrative and support services of all kinds.

Management overhead can be minimized, for example, by making the executive director of the DML an employee of a supporting institution, most likely a major university library, whose time is funded either completely or in large part by a grant from the initial DML funder to that university. A modest number of initial staff positions could be funded similarly. This could be a good approach for the DML because many of the technical skills it will need are specialized and may be needed only on a part-time or fluctuating basis as various projects are taken on by the DML. The DML could at least initially avoid the management responsibility of having a large number of employees, but rather work on a contractual basis with staff employed by a variety of partner organizations with a commitment to various aspects of the DML effort.

The DML organization may also wish to consider other names before finalizing its constitution, both for itself as an organization, and for the collection and services it plans to create. One of the early administrative efforts of the DML organization would be to evaluate a number of legal and economic considerations involving branding and trademarks related to the choice of name. The committee envisions the DML organization as a coalition of member partners with commitment to the DML concept (the creation of a substantial digital representation of an open collection of mathematical information and knowledge) and to the DML development principles. The DML organization could be governed by the mathematical sciences community through an organization such as the International Mathematical Union (IMU) and, thence, through the member organizations of that union.

The DML constitution can support the general principles outlined above by including the following elements:

  • Acquire and maintain a collection of digital representations of mathematical objects (e.g., theorems, functions, sequences) in machine-processable formats;
  • Advance mathematics by provision of useful information services over the collection;
  • Maintain the DML data collection with stable URLs, an underlying Web-based open architecture, and APIs so new tools can be contributed, linked, and shared;
  • Support development of a large community of users who will also help curate and contribute to the collection and its services;
  • Support a community of developers of tools and services over the collection; and
  • Collaborate with publishers and information providers to provide superior mathematical and information services built over the collection.

Governance of the DML could be overseen by an organization such as the IMU, with invitations to representatives from partner organizations. Initial funding of the DML for a 10-year period would be beneficial, during which long-term models for sustainable operations could be examined. The DML may benefit from including as many of the relevant organizations as are willing to participate. Some examples include MathSciNet (American Mathematical Society), the Society for Industrial and Applied Mathematics, the International Council for Industrial and Applied Mathematics, the European Mathematical Society, the Cornell University Library, FIZ Karlsruhe/Springer, Wolfram, Microsoft, Google, Wikipedia, OEIS, EuDML, Elsevier, and Thomson Reuters. Publishers and volunteers will see the DML as more accurate and more tailored than other services and should recognize the gains possible from a coordinated approach to merging mathematical knowledge. As the DML grows, the community will accord respect to the volunteers who help build it. To protect itself from legal obligations regarding copyright infringement, the DML could consider a variety of approaches, including not claiming copyright on any DML material and requiring contributions to be licensed by the contributor, or using a Creative Commons license.

The first step to confirm feasibility of this DML concept is to announce a proposal to the community, confirm that enough parties are willing to participate in the DML by contribution of data, expertise, or services to make the project viable, and, if so, support a meeting to resolve a basic constitution for the organization to establish its legal status in a suitable location.

INITIAL DEVELOPMENT

Initial development of the DML would benefit from focusing on recruiting partners with potential data sources and resources, beginning a collection of mathematical entities to achieve some of the desired capabilities described in Chapter 3, and providing a foundational platform on which most of these capabilities might imaginably be achieved in a 10- or 20-year time frame.

The committee sees value in separate groups working on the technological infrastructure and on the administration of these projects, because they require different kinds of technical expertise, community input, and project management for their success.

Recruiting Partners

The DML cultivation of partnerships would benefit from being strategic more than opportunistic. As a first step, the DML will need to assess potential partnerships in terms of the potential of the partnership to help the DML meet its goals, the likely incentives on both sides for the partnership, the maturity and stability of any technical standards required to make the partnership work, and the likely obstacles to consummating the partnership. A diversity of partnerships will be important. The advice and help of existing elements of community infrastructure could be valuable in this; for example, the IMU (in particular the CEIC) and its member societies, FIZ Karlsruhe, European Mathematical Society, the Association of Research Libraries (ARL) and similar organizations outside North America, existing mathematics digital libraries (such as HathiTrust and the EuDML), and prominent and influential mathematicians who have expressed an interest in the mission of the DML. Finally, while keeping in mind long-range goals and objectives, it is important to identify and pursue high-likelihood, high-potential-benefit, low-risk, near-term partnerships and agreements, even if somewhat limited in scope, as long as such partnerships can help illustrate the longer-term potential of partnering with the DML. For example, a productive, beneficial partnership with arXiv might be achievable in relatively short order and at the same time be useful to illustrate some of the potential benefits of partnerships between the DML and content providers.

Entity Collection

Even in advance of construction of a central repository, work could proceed immediately on development of adequate object classes for description and discovery of mathematical content in ways that complement existing capabilities (for example, at finer granularity) and on the aggregation of the lists of object instances for inclusion in the DML.
The committee believes that the following mathematical objects and bibliographic entities are good targets for early DML development (each of which is discussed in more detail in Chapter 5):

  • Mathematical objects: subject topics, sequences, functions, transforms, identities, symbols, formulas, and assorted mathematical media; and
  • Bibliographic entities: people, homepages, journals, books, and bibliographies.

The committee recognizes that progress on aggregation, cleaning, and deduplication of these various lists will move at very different rates. Some of
these lists may be completed quickly, while others that require input from many sources will mature slowly over time, and some might never be regarded as truly complete. Those in data-rich areas may be ripe for initial developments. Still, the committee believes that the difficulty of completing some of these lists should not deter contributors from starting them or from converting what is already available into machine-readable formats, which can then feed various linking, navigation, search, and discovery services. Chapter 5 outlines which entity types should be targeted, at least initially, and gives some indication of the efforts required for each.

Planning for More Complex Entities

Planning should start for the development of more complex lists where possible. These lists are outlined in Chapter 5, and some may be difficult to create and maintain. Wolfram|Alpha has a significant start on this with its continued fractions project. The potential rewards in terms of discovery and cross-linking are greatest if these mathematical objects can be adequately formalized and managed, even on a modest scale. These lists may benefit from starting small and growing slowly, to reduce the maintenance challenges before they become too burdensome, and by development of machine learning techniques for extraction of these entities from the literature. The committee anticipates a fairly loose structure in cooperation with Wikipedia, with input from the Wolfram experience with continued fractions and others in managing problem lists.

Data Structures

Initial effort is best invested in choosing an adequately flexible and extensible data structure, which needs to be easily expressible and exportable to handle diverse types of objects. The experience of Wolfram|Alpha, EuDML, and others working with metadata standardization will be essential input for this process.
It is important to quickly codify the workflow for initiation of new lists of this kind and to gain a realistic assessment of the incremental cost of developing and maintaining new lists of various sizes and complexities. The intention is to lower the barriers to creation and maintenance of such lists to a point where there is substantial community enthusiasm for the activity. Simple user interfaces for the input of new entries and editing of existing ones consistent with schema restrictions are an essential requirement. The interface should be generic, much the same for all object classes, with customization as necessary for particular classes.21

21  Prototype interfaces are provided by BibSonomy (http://www.bibsonomy.org/, accessed January 16, 2014), Zotero (http://www.zotero.org/, accessed January 16, 2014), and various library catalog tools.
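As a rough illustration of a generic, schema-checked record that is "much the same for all object classes," the following Python sketch may help. It is not a design proposed by the committee: the `Entry` class, the `SCHEMAS` table, and all field names and identifiers are hypothetical, chosen only to show one record shape serving several object classes, with per-class schema restrictions and a machine-readable export.

```python
import json
from dataclasses import dataclass, field

# Hypothetical schemas: required field names and types for each object class.
SCHEMAS = {
    "sequence": {"name": str, "first_terms": list},
    "person": {"name": str, "homepage": str},
}


@dataclass
class Entry:
    """One record in a structured list; the same shape for every object class."""
    object_class: str
    entry_id: str
    fields: dict = field(default_factory=dict)

    def validate(self) -> list:
        """Return a list of schema violations (empty when the entry conforms)."""
        schema = SCHEMAS.get(self.object_class)
        if schema is None:
            return [f"unknown object class: {self.object_class}"]
        problems = []
        for key, expected in schema.items():
            if key not in self.fields:
                problems.append(f"missing field: {key}")
            elif not isinstance(self.fields[key], expected):
                problems.append(f"wrong type for {key}: expected {expected.__name__}")
        return problems

    def to_json(self) -> str:
        """Export the entry in a form that linking and search services could ingest."""
        return json.dumps(
            {"class": self.object_class, "id": self.entry_id, "fields": self.fields},
            sort_keys=True,
        )


# Illustrative entries only; the identifiers are invented for this sketch.
fib = Entry("sequence", "seq:fibonacci",
            {"name": "Fibonacci numbers", "first_terms": [0, 1, 1, 2, 3, 5, 8]})
incomplete = Entry("person", "person:0001", {"name": "A. N. Other"})

print(fib.validate())         # no violations for a conforming entry
print(incomplete.validate())  # reports the missing homepage field
print(fib.to_json())
```

Fields beyond those named in a schema are accepted without complaint, which is one simple way to keep the structure extensible while still enforcing a minimum for each class.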

Growth and Cross-linking

Some initial effort will need to be expended on planning for eventual cross-linking of a substantial number of entries in different lists through semantic relations, such as connections between lists of authors and journals or between mathematical symbols and equations. This initial step is not intended to build a complete ontology of mathematics, but obvious semantic links will need to be supported to the greatest extent possible. This would aid in the creation of a Web of mathematical information that supports further processing by modern methods of graphical data analysis and may yield unexpected visualizations and insights into the structure of the mathematical universe. The proposed development would likely benefit from starting small, demonstrating the successful ingestion of data and exposure of various facets incrementally, leveraging available ontologies and services, and building new ones as needed.

Workflow Support

A workflow is a sequence of connected steps where each step concludes immediately before the next step begins. Workflow management systems define and manage a series of tasks to produce a final outcome or outcomes. Once a task is complete, the workflow software ensures that the individuals responsible for the next task are notified and receive the data they need to execute their stage of the process. These systems can also automate redundant tasks and ensure that uncompleted tasks are followed up, as well as reflect the dependencies required for the completion of each task.

The DML could provide support for schema development and production software for editorial workflows involved in creation and maintenance of structured lists. It is to be expected that these workflows will evolve over time as different data sources and editorial agents become involved, and that somewhat different workflows may be required for different lists.
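The workflow behavior described above (tasks with dependencies, and notification of whoever is responsible for the next task) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names, the `DEPENDENCIES` and `OWNERS` tables, and the `run_workflow` function are hypothetical and do not come from the report. The ordering is delegated to `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical editorial steps for one structured list.
# Each step maps to the set of steps that must finish before it can start.
DEPENDENCIES = {
    "ingest": set(),
    "clean": {"ingest"},
    "deduplicate": {"clean"},
    "editor_review": {"deduplicate"},
    "cross_link": {"deduplicate"},
    "publish": {"editor_review", "cross_link"},
}

# Hypothetical assignment of steps to the people or agents responsible for them.
OWNERS = {
    "ingest": "data wrangler",
    "clean": "data wrangler",
    "deduplicate": "automated job",
    "editor_review": "volunteer editor",
    "cross_link": "automated job",
    "publish": "project manager",
}


def run_workflow(dependencies, owners):
    """Release steps in dependency order and record who is notified for each."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    notifications = []
    while sorter.is_active():
        for step in sorter.get_ready():
            # A real system would hand the step's input data to its owner here.
            notifications.append((step, owners[step]))
            sorter.done(step)  # marking a step done unlocks its dependents
    return notifications


log = run_workflow(DEPENDENCIES, OWNERS)
for step, owner in log:
    print(f"{step}: notify {owner}")
```

Because each list can carry its own dependency table, somewhat different workflows for different lists amount to different data rather than different software.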
RESOURCES NEEDED

As discussed throughout the report, the DML will require a small paid staff, technical infrastructure, a funded research portfolio to support relevant projects, and a governing board to ensure that the DML’s components continue to function and develop properly. This section describes what is needed in each of these areas.

Financial Resources

While it is difficult to accurately assess the necessary financial resources for the DML at this early stage, this section gives a general sense of the scale of the necessary human and technical resources. Some of these resources might be shared, too, depending on the particular arrangement developed for the DML. However, the amount of financial resources necessary is obviously an important component of evaluating the future development of the DML, and the committee provides the following recommendation for evaluating these resources before DML development.

Recommendation: The initial DML planning group should set up a task force of suitable experts to produce a realistic plan, timeline, and prioritization of components, using this report as a high-level blueprint, to present to potential funding agencies (both public and private).

The cost of development and upkeep for the DML will not be trivial but is currently too uncertain to be specified in this report. For some perspective on operating costs, arXiv may provide a reasonable example.
In calendar year 2012, arXiv spent nearly $800,000 in expenses relating to the following:22

  • Personnel costs (including benefits), totaling $492,061
      • User support (2.70 full-time equivalent and 0.36 student)
      • Programming and system maintenance (2.13 full-time equivalent)
      • Management (0.50 full-time equivalent)
  • Nonpersonnel costs, totaling $71,807
      • Servers (physical and virtual), hardware maintenance, storage and backup: $24,240
      • Network bandwidth and telephony: $10,867
      • Staff computers, software, and supplies: $2,700
      • Staff and arXiv Board travel: $34,000
  • Indirect and in-kind costs: $208,631
      • College and department administration, staff support (26 percent of direct costs): $146,606
      • Facilities (11 percent of direct costs): $62,025
      • arXiv moderation (130+ moderators, varying time commitments): volunteer efforts

22  Cornell University, “Arxiv Projected Budget—Calendar Year 2012,” August 29, 2012, https://confluence.cornell.edu/download/attachments/127116484/arXiv2012budget.pdf.
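The figures above can be cross-checked with a short calculation. The sketch below assumes that "direct costs" means personnel plus nonpersonnel costs, an interpretation that is not stated explicitly in the budget summary but that reproduces the listed indirect amounts. The variable names are illustrative.

```python
# Figures as listed in the arXiv 2012 budget summary above (U.S. dollars).
personnel = 492_061
nonpersonnel_items = {
    "servers, hardware maintenance, storage and backup": 24_240,
    "network bandwidth and telephony": 10_867,
    "staff computers, software, and supplies": 2_700,
    "staff and arXiv Board travel": 34_000,
}

nonpersonnel = sum(nonpersonnel_items.values())

# Assumption: "direct costs" = personnel + nonpersonnel.
direct = personnel + nonpersonnel

# Indirect costs are stated as percentages of direct costs.
administration = round(direct * 26 / 100)
facilities = round(direct * 11 / 100)
indirect = administration + facilities

total = direct + indirect

print(f"nonpersonnel: ${nonpersonnel:,}")
print(f"direct:       ${direct:,}")
print(f"indirect:     ${indirect:,}")
print(f"total:        ${total:,}")
```

Under that assumption, the itemized amounts sum to $772,499, which is consistent with the "nearly $800,000" quoted above; the volunteer moderation effort is not priced and so is not part of the total.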

To give some perspective on the potential costs of developing capabilities, the committee would like to draw attention to some of the resources recently devoted to developing and deploying the Wolfram|Alpha continued fractions work23 discussed in the Chapter 2 section “What Gaps Would the Digital Mathematics Library Fill?” Wolfram received a 1-year grant from the Alfred P. Sloan Foundation to prototype and build a technological infrastructure for collecting, tagging, storing, and searching a representative subset of mathematical knowledge (including definitions and theorems) and presenting it through a Wolfram|Alpha-like natural language interface. This work required some 3,000 hours of work from a team consisting of four professionals and one intern.

The subject of continued fractions was selected for this project because much of the relevant literature is older (and therefore more representative of the type of content that can be utilized in a future system such as the DML) and is distinct from Wolfram’s main computational expertise (so as to lessen the bias in the results). The individuals who worked on it had no detailed prior knowledge about continued fractions, which made the work go more slowly than it would have if performed by an expert in the field, but this example is likely representative of how the DML would be approached. However, three of the four team members have written multi-volume books about mathematics, as well as websites each having more than 10,000 pages, so they had some experience in covering a wide range of mathematics. There was not enough time in a 1-year project to cover the 100,000 pages of printed continued fraction literature, so the team tried to explore and cover various content and presentation aspects to see what might be possible in future efforts.
In most ways, this project succeeded in meeting its objectives, but in some parts, especially the fully computational representations of the content, the system still needs improvement.

In addition to having qualified people, two software infrastructure components were important in carrying out this project: Mathematica and Wolfram|Alpha. Mathematica allowed the team to check the mathematics and to generalize it, and Wolfram|Alpha allowed them to collect the information in such a way that one can access it through free-form language inputs and deliver the information in various formats, from Web to TeX. This project is a meaningful example of how various DML features can be developed within a larger infrastructure. The following sections give some specifics of the human and technical resources needed to make the rest of the DML possible.

23  M. Trott and E.W. Weisstein, “Computational Knowledge of Continued Fractions,” WolframAlpha Blog, May 16, 2013, http://blog.wolframalpha.com/2013/05/16/computational-knowledge-of-continued-fractions/.

Lastly, the committee would like to note the importance of a sustained investment and commitment from the DML’s potential funders. The committee believes that a ramped investment pattern, starting as a prototype and scaling up, may be more beneficial than a large initial investment. The DML will require a long and sustained effort to be successful.

Human Resources

A small paid staff will work to develop the DML vision, address issues that arise, pursue fruitful partnerships, and manage the day-to-day operations of the DML. The following is a list of staff functions that the committee sees as essential during the initial phases of the DML. These staffing needs will change as the DML grows and matures. The committee believes it is essential to include a distinguished mathematician in the senior management of the DML to provide credibility to the academic mathematics community and to gain startup funding and respectability in the nonprofit world.

  • Academic director. A well-respected leader in both the technical and social aspects of the DML who is able to make editorial decisions and can engage and appoint editors and curators for their domain knowledge and reputation. This could be a part-time position (e.g., half-time of a senior mathematician).
  • Executive director. A manager with knowledge of large-scale data methods and digital libraries. This person would be responsible for directing the project manager, budget allocations, promotion of the project, and negotiations with partners, and also for consulting with the academic director about priorities.
  • Project manager. This person would be in charge of the creation of the DML and would interface with programmers, contracting organizations, and technical partners.
  • System manager. This person would be responsible for setting up adequate server infrastructure for day-to-day DML operations and for expanding operations as needed.
  • Data wrangler. This person would work on an ongoing stream of specific data conversion projects and provide documentation of best practices. This person would also engage and oversee other volunteer, or possibly paid, staff and set up and experiment with crowdsourcing tasks and implementations.
  • Rights management and legal. This person would provide guidance on critical licensing and copyright choices for both data and software, and on possible negotiations of agreements with data and service providers. This may be a consultant position.

  • Research analyst. This person would be responsible for keeping abreast of emerging technologies, researching solutions for identified problems, assisting the executive director with technology choices, and preparing white papers to explain proposals and processes.
  • Community liaison. This person would be responsible for community building, advocacy of the project, intelligent responses to incoming emails, blog development, negotiations to engage and persuade partners to contribute data, and other such activities. This would likely be a full-time staff person or contractor.

Technical Resources

A mathematics digital library requires a technical infrastructure. This infrastructure needs to support storage, backup, search, and retrieval, and to provide at least some support for analysis and visualization. Storage is needed for some documents, the software component, and the management data on the system. In general, a different storage solution will be needed for each type of data due to differences in usage associated with size, security issues, speed required, and level of backup needed. Security, in particular, will need to be carefully planned and assessed throughout the DML development to ensure that the data it stores will be well protected. Storage can be handled in-house by purchasing a number of servers or outsourced to server farms or cloud storage. The key is to plan for growth and to consider, for the operations of interest, whether it is more cost-effective to store data in multiple formats to facilitate search or to minimize storage and do data conversion on the fly. At this point it is not clear which option will be more economical. Other resources required include machines for developing and testing software, backup facilities, and machines for monitoring and managing the system.
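The trade-off between storing data in multiple formats and converting on the fly can be framed as a simple break-even calculation. The Python sketch below is only a toy model under loud assumptions: the function names, the linear cost model, and every number in the example are hypothetical and are not estimates from the committee or from any provider. Real costs would also depend on latency requirements, caching, and backup policy.

```python
def annual_cost_store_all(size_gb, n_formats, storage_cost_per_gb):
    """Yearly cost of keeping every document in all n_formats (no conversions)."""
    return size_gb * n_formats * storage_cost_per_gb


def annual_cost_convert(size_gb, storage_cost_per_gb, requests, cost_per_conversion):
    """Yearly cost of keeping one format and converting on each request."""
    return size_gb * storage_cost_per_gb + requests * cost_per_conversion


def break_even_requests(size_gb, n_formats, storage_cost_per_gb, cost_per_conversion):
    """Number of yearly conversion requests at which the two options cost the same."""
    extra_storage = size_gb * (n_formats - 1) * storage_cost_per_gb
    return extra_storage / cost_per_conversion


# Purely illustrative inputs (hypothetical, in arbitrary currency units).
size_gb = 1000              # size of the collection in its base format
n_formats = 3               # e.g., a source format plus two derived formats
storage_cost_per_gb = 0.25  # per GB per year
cost_per_conversion = 0.001 # compute cost of one on-the-fly conversion

threshold = break_even_requests(size_gb, n_formats, storage_cost_per_gb,
                                cost_per_conversion)
print(f"store all formats: {annual_cost_store_all(size_gb, n_formats, storage_cost_per_gb):.2f}")
print(f"break-even at about {threshold:,.0f} conversions per year")
```

In this toy model, conversion on the fly is cheaper below the break-even request volume and storing all formats is cheaper above it, so the answer hinges on usage data that, as noted above, is not yet available.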
For development, the key resource needs include high-end desktop computers, access to storage devices, Internet connectivity, a backup facility, and a test bed environment for trying new features before they are launched. Managing and maintaining the ongoing system, and handling the business tasks and associated financial issues, will require basic desktop machines with Internet connectivity, printers and associated fax, and backup to machines off the Internet for security.

Necessary Research Areas

Many technological aspects of the proposed DML are not currently possible. To help accelerate needed technological developments, the committee believes that several research areas can be targeted by the DML organization.

Recommendation: The Digital Mathematics Library needs to build an ongoing relationship with the research communities spanning mathematics, computer science, information science, and related areas concerned with knowledge extraction and structuring in the context of mathematics and to help translate developments in these areas from research to large-scale application.

Some of the players in these communities include the following:

  • National Institute of Standards and Technology,
  • Cornell University Library (both Project Euclid and arXiv),
  • American Mathematical Society (MathSciNet),
  • Wolfram, and
  • European technical partners in EuDML, including FIZ Karlsruhe (zbMATH).24

These organizational partners would be the employers of some people engaged in DML work, supported by contracts approved by the DML central administration and funded through some arrangement with DML funding sources. It may be best for the DML to collaborate with its partners to complete such work, rather than directly employing large numbers of its own people. All of the above partners have existing capabilities and services of this kind, which should not be threatened, but rather enhanced, by DML developments.

24  FIZ Karlsruhe is a nonprofit corporation and the largest nonuniversity institution for information infrastructure in Germany (http://www.fiz-karlsruhe.de). FIZ (together with EMS and Springer) is one of the joint owners/controllers of zbMATH (http://zbmath.org/).

REFERENCES

Aalbersberg, I.J.J., and O. Kähler. 2011. Supporting science through the interoperability of data and articles. D-Lib Magazine 17(1/2), doi:10.1045/january2011-aalbersberg.

American Society for Cell Biology. 2012. The San Francisco Declaration on Research Assessment (DORA). http://am.ascb.org/dora/.

Carette, J., and W.M. Farmer. 2009. A review of mathematical knowledge management. Pp. 233-246 in Intelligent Computer Mathematics. Springer.

Gill, T., and P. Miller. 2002. Re-inventing the wheel? Standards, interoperability and digital cultural content. D-Lib Magazine 8(1). http://www.dlib.org/dlib/january02/gill/01gill.html.

Whitehead, A.N., and B. Russell. 1910-1973. Principia Mathematica, 3 volumes. Second edition, 1925 (Volume 1), 1927 (Volumes 2, 3). Abridged as Principia Mathematica to *56, 1962. Cambridge, U.K.: Cambridge University Press.