Improving Findability and Relevance of Transportation Information

Volume I: A Guide for State Transportation Agencies

Contents

Chapter 1  Finding Information When You Need It
  1.1 Business Drivers of Findability: Risks and Opportunities
  1.2 Guide Organization
  1.3 A Note on Terminology
Chapter 2  Understanding Findability
  2.1 Overview
  2.2 Elements of Findability in a DOT
  2.3 Typical Impediments to Findability
Chapter 3  Improving Findability
  3.1 Improving Information Management Discipline
  3.2 Improving Search and Navigation Capabilities
  3.3 Improving Metadata and Terminology Management
Chapter 4  Planning for Findability Improvements
  4.1 Understanding User Needs
  4.2 Surveying the Information Landscape
Chapter 5  Implementing Findability Improvements
  5.1 Establishing a Road Map for Improving Findability
  5.2 Putting Management Functions and Processes in Place
References
Abbreviations
Appendix A  Example Improvement Initiatives
  Improvement 1: Focus on Findability of Construction Project Information
  Improvement 2: Focus on Findability of Critical Corporate Documents
  Improvement 3: Focus on Findability of Information for Critical Job Functions
Appendix B  Glossary
Appendix C  Special Topics
  Topic 1: Search
  Topic 2: Metadata
  Topic 3: Text Analytics
  Topic 4: Terminology and Semantic Structures to Improve Search
  Topic 5: Integration Considerations for Enterprise Search
Appendix D  DOT Information Organization Resources
Appendix E  Examples of Commercially Available Enterprise Search and Text Analytics Products

Chapter 1. Finding Information When You Need It

1.1 Business Drivers of Findability: Risks and Opportunities

Over the past two decades, technologies for creating and sharing information in electronic form have become pervasive. Like most large organizations, transportation agencies have experienced challenges managing a growing collection of information including tabular data, spatial data files, Computer-Aided Design and Drafting (CADD) files, presentations, spreadsheets, manuals, meeting notes, emails, reports, video, and images. In addition, large data streams from sensor and video feeds and Light Detection and Ranging (LiDAR) point clouds have become part of the mix in recent years.

As information volume and diversity grow, an organization's employees and partners find it increasingly difficult to be aware of relevant information and how to access it. The term findability characterizes how easy it is to find relevant information. Peter Morville's book Ambient Findability (2005) provides three definitions:

1. The quality of being locatable or navigable.
2. The degree to which a particular object is easy to discover or locate.
3. The degree to which a system or environment supports navigation and retrieval.

Improving information findability is viewed as both an art and a science (AIIM 2008); it involves a variety of information organization, classification, tagging, and search techniques, as well as information governance and training.

Poor findability results in wasted resources due to excessive time spent searching for information, rework to investigate issues that have already been researched, duplication of data collection (due to lack of awareness of existing data), and decisions and actions that fail to take advantage of the full knowledge base of the organization. Poor findability also can put the agency at risk, inhibiting responses to legal actions, claims, or Freedom of Information Act (FOIA) and public records requests. Key risks associated with poor findability include:

• Inability to produce information in support of audits and public information requests.
• Inability to respond properly to e-discovery requests associated with claims and litigation, potentially leaving the agency open to multimillion-dollar fines.
• Inability to find authoritative versions of documents, resulting in inconsistent or improper implementation of agency policies and procedures, which in turn can delay project delivery and undermine consistent use of proven, effective design practices.
• Delayed business decisions.
• Inability to provide new employees and contractors with the information they need to get up to speed, resulting in unnecessary expenditures of staff time to support the ramp-up period.

In addition to the risks noted above, there are hidden costs to poor findability. When a DOT's information is not well organized, accessible, and easily searchable, employees spend a great deal of time looking for relevant, accurate information. This is time that could be more productively spent analyzing and using the information. Of perhaps greater concern, agencies are paying for plans, studies, data sets, and so forth, but these products may be under-utilized because they cannot be easily discovered. Thus, poor findability not only results in unproductive use of staff resources and increased support costs; it also limits the usefulness of agency investments in information.

Many DOTs are still transitioning from predominantly paper content (reports, files, maps) to predominantly digital content. The need to find physical copies of reports, maps, and files will likely continue for some time. Therefore, DOTs must plan for the coexistence of two separate types of processes for storing and retrieving content: one for paper content and one for electronic content.

In many respects, the shift to electronic content improves ease of access to information. However, some of the discipline required to manage paper content may be lost in this transition. For example, many agencies have procedures in place to send official copies of printed agency publications to their library. With electronic distribution, doing this may no longer be viewed as necessary. Disciplined processes for designating and storing authoritative versions of files may be disrupted, even though they are as important for electronic files as they are for paper documents.

Findability can be one of those "out of sight, out of mind" issues. Every day, a large number of employees (or contractors on the clock) could each be spending over an hour searching for information, but because these wasted hours are spread across the agency, the problem is not visible to agency leaders. Yet these hours can represent large losses in agency productivity. A 2001 study estimated that "an enterprise employing 1,000 knowledge workers wastes $48,000 per week, or nearly $2.5 million per year, due to an inability to locate and retrieve information" (IDC 2001). A 2015 paper reported that a review of several surveys spanning different business sectors found that "24% of a business professional's time is spent looking for information" and that "48% of organizations felt search was unsatisfactory" (Cleverley 2015).

Table I-1 illustrates potential savings over a 10-year period from findability improvements at a DOT, using varying assumptions about time saved per day and hourly wage rates. The 10-year savings in this analysis is in the $10–$40 million range.

Table I-1. Potential agency cost savings from findability improvements.

Average      Time Savings per          Annual Savings per 1,000      Present Value of Savings over
Hourly Rate  Employee per Day (min.)   Employees (230 work days/yr)  10 Years (3% discount rate)
$20          15                        $1,150,000                    $9,809,733
$30          15                        $1,725,000                    $14,714,600
$40          15                        $2,300,000                    $19,619,467
$20          30                        $2,300,000                    $19,619,467
$30          30                        $3,450,000                    $29,429,200
$40          30                        $4,600,000                    $39,238,933
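As a check on the table's arithmetic, the short Python sketch below (not part of the original guide) reproduces each row from the stated assumptions: 1,000 employees, 230 work days per year, and a 10-year stream of savings discounted at 3 percent, treated here as an ordinary annuity (savings received at year end).

```python
# Reproduces Table I-1: annual savings and 10-year present value of
# findability improvements, under the table's stated assumptions.

def findability_savings(hourly_rate, minutes_saved_per_day,
                        employees=1_000, work_days=230,
                        years=10, discount_rate=0.03):
    """Return (annual savings, 10-year present value) for one scenario."""
    annual = hourly_rate * (minutes_saved_per_day / 60) * work_days * employees
    present_value = sum(annual / (1 + discount_rate) ** t
                        for t in range(1, years + 1))
    return annual, present_value

for minutes in (15, 30):
    for rate in (20, 30, 40):
        annual, pv = findability_savings(rate, minutes)
        print(f"${rate}/hr, {minutes} min/day: "
              f"annual ${annual:,.0f}, 10-yr PV ${pv:,.0f}")
# First line: $20/hr, 15 min/day: annual $1,150,000, 10-yr PV $9,809,733
```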

For DOTs, here are some realistic scenarios that could be avoided through improved findability:

• The DOT receives a FOIA request asking for all information related to the design of a new interchange. The requestor wants information on the chronology of key design decisions, including meeting notes, plans, and emails. The request is assigned to an assistant in the Design Division, who spends 80 hours locating the various pieces of information needed for the response and another 40 hours figuring out which items are duplicates or outdated versions.

• An executive asks a pavement engineer who is new to the agency why expenditures for data collection are so high, and whether there are options for reducing the sample size or collection frequency. The pavement engineer requests that staff prepare a response that includes statistical analysis and a survey of other states' practices. Unfortunately, because the pavement engineer is new to the agency, she is unaware of a study that was done 3 years previously on this same topic. The staff search the agency's intranet site for "pavement," find nothing relevant in the first two pages of the search results, and conclude that nothing has been done on this topic. They proceed to re-study the issue.

• An audit is conducted of the agency's maintenance practices. The state maintenance engineer is reviewing the audit findings and recalls that there was an effort a year ago to implement maintenance planning and tracking improvements related to the audit findings. He searches the agency's collaboration site and the shared file drive for meeting minutes or presentation slides that provide the details, but to no avail. He asks his staff to call colleagues and dig up the documents. After 2 days of phone tag, they are finally successful.

• The agency programming office issues an advisory regarding changes to the process for assigning funding sources to candidate projects. The office posts this advisory on its internal web page and notifies district financial staff via email. However, several districts had posted previous versions of the guidance on their district web pages. An assistant in one district is not aware of the change and uses the outdated version of the guidance, which causes a 4-week delay in processing a major project.

DOTs can take a variety of steps to improve information findability, including implementing more disciplined information management methods, investing in content management systems, and expending the effort needed to implement information search and query capabilities that function well and meet employee needs. Making the effort to improve findability can reduce risks and provide opportunities to improve the effectiveness of core business functions from planning to maintenance.

In a DOT, multiple efforts may be undertaken within the agency to ensure that people can find information. For example:

• Work groups may set up standard folder structures on a shared file drive so that group members can find different types of documents.
• An engineering content management system may be deployed to make design plans and related files accessible to internal staff and external consultants.
• A DOT librarian may maintain a catalog of agency reports that can be searched by topic area, date, and author (a minimal sketch of such a catalog search follows this list).
• A DOT intranet site may have pages providing links to important agency-wide and departmental documents.
• A DOT data office may provide a searchable catalog of GIS data resources.
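To illustrate the librarian's report catalog mentioned above, the following sketch shows a metadata-driven search over catalog records. It is a simplified illustration, not part of the guide; the record fields and example entries are invented.

```python
# Minimal sketch of a metadata-based catalog search (topic, author, year).
# Record fields and entries are hypothetical examples.
from dataclasses import dataclass

@dataclass
class CatalogRecord:
    title: str
    author: str
    year: int
    topics: tuple

catalog = [
    CatalogRecord("Pavement Data Collection Options", "Smith", 2013,
                  ("pavement", "data collection")),
    CatalogRecord("Maintenance Planning Audit Response", "Lee", 2015,
                  ("maintenance", "audit")),
]

def find(records, topic=None, author=None, year=None):
    """Return records matching every metadata filter that was supplied."""
    hits = records
    if topic is not None:
        hits = [r for r in hits if topic in r.topics]
    if author is not None:
        hits = [r for r in hits if r.author == author]
    if year is not None:
        hits = [r for r in hits if r.year == year]
    return hits

print(find(catalog, topic="pavement"))
# [CatalogRecord(title='Pavement Data Collection Options', author='Smith',
#                year=2013, topics=('pavement', 'data collection'))]
```

A catalog search like this is how the new pavement engineer in the second scenario could have discovered the earlier data collection study.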
Each individual solution for improving findability can be helpful. However, without a coordinated approach and an overall agency findability strategy, problems are likely: gaps in coverage (content that is not discoverable), inability to search across different information repositories, duplication of effort to create single-purpose search capabilities that could have wider applicability, and inconsistency in both search interfaces and information management practices, resulting in confusion on the part of people searching for information.

An agency-wide approach to findability takes resources and coordinated action within the organization. Making progress will require that agency leaders and managers (1) perceive findability to be an issue that merits attention and priority, (2) understand available strategies and methods for improving findability, and (3) align themselves around a feasible and coordinated improvement plan. The remainder of this guide addresses how to identify and assess findability needs, and how to implement solutions that will address these needs.

1.2 Guide Organization

The organization of this guide is illustrated in Figure I-1. More detailed reference material is included in the appendices.

[Figure I-1. Guide organization. Understanding: What is findability? What are the impediments to findability? Improving: What strategies are available for improving findability? Planning: What factors do we need to consider when planning findability improvements? Implementing: How should we approach implementing findability improvements? What management functions are needed?]

• Chapter 2 (Understanding Findability) provides an overview of the different elements of findability and reviews typical impediments to findability.
• Chapter 3 (Improving Findability) reviews key strategies for improving findability.
• Chapter 4 (Planning for Findability Improvements) discusses the information gathering, assessment, and analysis activities required to design an effective improvement.
• Chapter 5 (Implementing Findability Improvements) discusses implementation of findability initiatives and the ongoing management functions needed to maintain and improve findability.
• Appendix A provides several example DOT findability initiatives illustrating the planning and implementation framework discussed in the body of the report.
• Appendix B is a glossary of terms related to information management and findability.
• Appendix C provides more detailed treatment of specific topics related to search, metadata, and terminology management.
• Appendix D provides examples of information organization schemes relevant to transportation agencies.
• Appendix E lists examples of available products for enterprise search and text analytics.

1.3 A Note on Terminology

This guide is concerned with findability of information in multiple formats (reports, presentations, data sets, web pages, emails, etc.). The term content is used to refer to this variety of information types; the terms content object and information resource are used to refer to specific items. Appendix B provides a glossary of these and other terms used in this document.

Chapter 2. Understanding Findability

This chapter introduces the different elements of findability. A holistic understanding of these elements is important for identifying appropriate improvements.

2.1 Overview

At a basic level, any search for information involves the following:

• A person with an information need.
• A target body of content (databases, manuals, spreadsheets, web pages, etc.) being searched.
• If the search is successful, retrieval of one or more items that meet the information need (see Figure I-2).

Figure I-2. Elements of a search. [The figure shows an information seeker with an information need conducting a search against a body of available content to obtain desired results.]

Many distinct types of information needs exist, however, and information searches can take place within very different contexts. Before an organization sets out to improve findability, it is necessary to recognize and clarify these different needs so that appropriate solutions can be developed. There are several key variables to consider:

• The scope of information retrieval. Providing access to a well-defined body of information (e.g., as-built plans) is more straightforward than addressing a more complex and varied set of information needs (e.g., all construction project-related information).
• The information need. Ensuring that people can find specific known documents or data sets requires a different strategy than does helping people explore available information about a particular topic area, project, or location.
• The target content. The extent, volume, format, and diversity of potentially relevant information resources to be searched will influence the design of an appropriate solution.

For example, consider the need to ensure that DOT employees can find the most recent copies of agency manuals. This need may be addressed by providing a single agency web page with links to all of the manuals and defining clear responsibilities for refreshing this content when updates are made. On the other end of the spectrum, consider the need to provide bridge engineers with access to information about standards, best practices, and relevant research. Addressing this need might involve a combination of strategies to provide access to a set of curated, authoritative information resources, along with a search interface that allows the bridge engineers to search and explore the content depending on their specific questions.

2.2 Elements of Findability in a DOT

Figure I-3 illustrates in more detail the different elements of findability within a DOT. A typical search scenario involves:

• Users (people seeking information), who type keywords or query conditions into a search interface (e.g., a search box on a web page) and receive back a list of relevant results.
• Search engines, which crawl available information repositories and create an index of available items (documents, data sets, etc.). The index can track every word in every document present in the information repository, and may also count the frequency of each word. The search engine is structured to quickly match a search query against the words in the index and, where possible, to use the frequency to assign a basic relevancy score. Thus, the search engine provides a fast and convenient way for users to look up the items containing each term. (A minimal sketch of this indexing approach appears at the end of this section.)
• The search engine looks in the index and returns a set of results that are relevant to the user's request. The index includes all of the words used (except what are called stop words: pronouns, articles, and such).
• If metadata (e.g., subject keywords, author, title, version, date, etc.) are available for the items that are stored in the information repositories, then this metadata can be used by search engines to provide more relevant results. The metadata also can be used within the search interface to help users formulate their information requests.
• Classification schemes and related terminology resources such as taxonomies, thesauri, and synonym lists can be used to assign keywords and categories within the metadata to facilitate browsing and improve the relevance of search results.

Many factors influence the overall success of information searches in a DOT:

• Users. Information seekers need to (1) be motivated to search in the first place, (2) know where to search, and (3) know how to search. They need to be reasonably well informed about the existence of relevant information sources in the agency and how to access them. Users may need to obtain the security credentials required for access. They also may need some education about how to use the available search tools to maximize their chances of finding what they seek.
• Search interfaces and search engines. Search tools must be available that are appropriately configured for ease of use and tuned on an ongoing basis to provide effective results. Integration of text analytics tools can boost the power of search tools to go beyond simple matching of search terms to search index entries.

• Information repositories. The information being sought must be stored in a location from which it can be retrieved. Repositories must be periodically purged of irrelevant, older content. The organization of information within the repository must be logical and intuitive for users, and files must be named in a consistent, informative manner.
• Classification and metadata. Policies, standards, and sustainable processes must exist for classifying and documenting or tagging different types of content to facilitate findability. Quality metadata will always improve search results, though the importance of metadata will vary depending on the situation. Metadata is essential for navigating and searching large and diverse collections of information and for describing the content of image files and data files. On the other hand, for relatively small collections of text documents, full-text searches may be sufficient.
• Information producers. Information producers must use appropriate file formats and naming conventions, and must store their content in a location where it can be discovered. They may also be asked to assign appropriate metadata (or ensure that it is assigned or created by others given this responsibility).

Each of these elements needs to be considered in planning findability improvements. A problem with any one of them can create a roadblock to a successful result. An expensive and full-featured information repository will not pay off unless people know how to use it and the content in that repository is relevant and authoritative. An information repository with terabytes of material is not of value unless it includes an appropriately configured search tool and the metadata to make search successful.

Figure I-3. Elements of findability.
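The indexing and scoring behavior described above can be illustrated with a few lines of code. The following is a minimal sketch, not a production search engine: the documents, stop-word list, and function names are invented for the example.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "it"}

def build_index(docs):
    """Map each term to {doc_id: frequency}, skipping stop words."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term][doc_id] += 1
    return index

def search(index, query):
    """Score each document by the summed frequency of matched query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return scores.most_common()

docs = {
    "D1": "bridge design manual for the bridge division",
    "D2": "pavement data collection plan",
    "D3": "bridge inspection report",
}
index = build_index(docs)
print(search(index, "bridge design"))  # D1 ranks first: "bridge" twice plus "design"
```

Real search engines layer stemming, phrase handling, and more elaborate relevancy models on top of this basic inverted-index structure, but the core lookup works as sketched.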

2.3 Typical Impediments to Findability

Fundamental impediments to findability include lack of disciplined information management practices, non-searchable information, lack of investment in common information repositories and search tools, and lack of a consistent approach to metadata and use of terminology for information classification.

Lack of Disciplined Information Management Practices

• Data and information resources are not handled as organizational assets with deliberate planning of acquisition, maintenance, retirement, and valuation for long-range planning.
• No explicit processes designate and manage authoritative versions of files for important agency content.
• No policies are in place that specify what types of information should be kept, how and when outdated content should be removed, where different types of content should be stored, and how content should be classified and organized.
• Even where policies have been developed, staff do not receive adequate training on how (and why) to follow them, little or no monitoring exists, and few or no consequences exist for lack of adherence to the policies.
• Where they exist, information management processes are implemented within organizational silos (e.g., district offices, project teams), which results in varying approaches to information classification, organization, and storage. Consequently, there is no easy way to search across different repositories.
• Limited resources are devoted to information management tasks. Business units are busy delivering required products and services, and they have difficulty making time available for tasks such as content cleanup and documentation. Agency information management functions (website management, library, data management) are short-staffed and are unable to take on responsibility for cleaning, organizing, and documenting content.

Non-Searchable Information

• Large bodies of content (books, reports, plans, maps, forms, etc.) have not been converted to electronic formats or are not included in searchable library catalogs.
• Image files are stored without the annotation or metadata that would allow them to be searched by any attribute other than standard file properties (name, date, etc.).
• Paper documents or plans have been scanned as images (pictures) and not converted to text-searchable files using optical character recognition (OCR) software.
• Employees save information on local drives that are not accessible to others in the organization.
• Employees also store official data on personal or office-sponsored cloud storage accounts, which can be invisible to central IT or, even if discovered, can be ungovernable.
• Information is stored in repositories that are outside the reach of available search tools due to the lack of specific connectors to these repositories and/or access restrictions.

Lack of Common Information Repositories and Search Tools

• Options are limited for common, shared information storage such as content management systems that enable management and retrieval of documents, plans, maps, data sets, etc.
• Available search tools have limited functionality, have not been configured to operate effectively, and/or are not actively monitored and tuned on an ongoing basis.

• Full-text search has inherent limitations with regard to going beyond literal matching of terms to searches based on meaning. (This situation is improving with advancements in text analytics and machine learning.)

Metadata and Terminology Management

• A lack of metadata makes it more difficult to overcome the inherent limitations of full-text search, including the inability to account for synonyms (e.g., "collision" versus "crash") and homonyms (e.g., "plan" as in a strategic plan versus "plan" as in a design plan).
• Standards and practices for metadata creation are not in place.
• Tools and automation to support metadata creation are not in place, resulting in incomplete and inconsistently applied metadata.
• A lack of consistent information categorization schemes presents a barrier to providing effective interfaces for browsing available content based on different attributes or facets (such as the interfaces offered by Amazon).

Chapter 3 covers specific strategies for improving findability.

Chapter 3. Improving Findability

Three overarching strategies for improving findability are:

• Improving information management discipline by cultivating disciplined practices for creating, naming, storing, versioning, and culling content.
• Improving search and navigation capabilities by implementing and configuring tools to explore available content and locate items of interest.
• Improving metadata and terminology management by developing and using consistent ways of cataloging, describing, and classifying content.

These strategies are mutually reinforcing: Without disciplined information management practices, search engines will return pages of duplicative and non-authoritative items. Standardized metadata, classification schemes, and terminology can make the difference between a rudimentary full-text search capability and one that helps a user navigate a large body of content and find the most relevant items for their need.

3.1 Improving Information Management Discipline

Why Is This Important?

Following a standardized and disciplined approach to creating, naming, storing, versioning, and periodically culling content can have a significant impact on how easy it is to find and retrieve that information. It is self-evident that a good, well-maintained filing system supports efficient information retrieval. However, it can be challenging in practice to establish and sustain the discipline needed to do this. It is common to see content repositories containing inconsistent folder structures, outdated files, poorly named files, and duplicate or near-duplicate files. When a significant percentage of content is redundant, outdated, or trivial (ROT), people find it difficult to locate authoritative versions of documents. When an employee leaves behind a file directory that has not been cleaned or organized, any valuable content may be as good as lost.

What Can Be Done?

Information management improvements can be implemented for targeted types of content or within particular business units, or more systemic improvements can be pursued. Ideally, cleanup and organizing efforts will be combined with implementation of ongoing processes for managing information that make it easier to maintain the newly established order.

A first step for establishing ongoing processes is development of guidelines that support consistency across the organization with respect to the following:

• Where content should be stored.
• How to name files.
• How to organize folder structures.

Developing, adopting, and using such guidelines requires strong champions and ongoing effort. Specific improvements are briefly outlined below.

Document and Content Management Systems

Establishing document and content management systems provides a foundation for a disciplined information management approach. Care must be taken to ensure user acceptance by making the document intake and update process as simple as possible. An established workflow, which can include use of electronic forms and digital signatures, can ensure that the products of key processes—such as construction contracting and performance reporting—will be produced, stored, and documented consistently. Integrating document storage and metadata assignment into the normal business workflow is an important strategy that can facilitate findability.

Content Storage Guidelines

Establishing guidelines for where different types of important content should be stored is a low-cost way to enhance findability. For example, many DOTs require official design plans to be stored in their engineering content management systems. At a minimum, content storage guidelines should ensure that important files are not maintained on employees' local hard drives or on individual directories that cannot be easily accessed by others in the agency.

Sample Content Types for Storage Guidelines
• Policy directives
• Manuals
• Guidance documents
• Standards and specifications
• Transportation plans
• Corridor studies
• Right-of-way plans
• Design/as-built plans
• Survey results
• Meeting minutes and presentation materials
• Invoices
• Contracts and agreements
• Construction claims
• Construction change orders
• Work plans and budgets
• Asset inspection reports
• Contractor correspondence
• Public notices
• Training materials
• Environmental commitments
• Agency/business unit performance reports
• Purchase orders

Eliminating ROT

Eliminating content that is no longer needed reduces the size of the search pool, making searches faster and increasing the chances that searches will yield relevant results.

Cleanup or pruning activities typically will not happen on their own. Managers of different information repositories must provide opportunities, tools, services, and motivations for eliminating ROT. Ideas to consider include:

• Establishing clear guidelines for deleting or archiving draft and superseded versions of documents and transient files that have no lasting value, and discussing these guidelines in employee onboarding and other training activities.
• Training agency employees on records-retention regulations and how to follow them.
• Utilizing available tools that identify duplicate files and provide convenient ways of sorting files by date and size on shared file drives.
• Making email cleanup tools available.
• Providing cleanup reminders when user files exceed a predefined disk space limit.
• Conducting intranet content audits (utilizing available commercial or open source tools).
• Setting up automated reminders for file review in content management systems when files reach a certain age.
• Encouraging supervisors to set aside "file cleanup time" analogous to an office cleanup.

File Naming Conventions

Meaningful file names and consistent file naming conventions are important for findability. Meaningful file names ensure that users can ascertain the contents of a file without having to open it. Consistent use of file naming conventions within a work group, project team, or agency helps ensure that users can access rich information about available files at a glance. Although files stored on shared drives may be organized within folders that have descriptive names, the file names should be designed to stand alone in case the files are moved or retrieved using a search tool. Within content management systems, file naming can be automated based on metadata provided at the time the file is added. Specialized tools also can be used to automate file naming, especially for standard document types. File naming guidelines are available from NIST and other sources (NIST 2016, Alberta Government 2016, Minnesota Historical Society n.d.).

Standardizing File Formats

Standardizing file formats for different content types ensures that users will have the appropriate software needed to open (and if necessary, modify) the content that they retrieve.

Sample File Naming Conventions
• Use meaningful and concise names (e.g., "File001" is not good practice).
• Use a consistent naming structure by content type (e.g., meeting notes, consultant studies, change orders, etc.).
• Use standard date formats (e.g., YYYYMMDD for meeting minutes and drafts; putting the year at the beginning facilitates searching) and include a document effective date (which may differ from the operating system file date) in metadata if possible.
• Use standard file extensions (e.g., X.DOC or X.PDF).
• Eliminate embedded spaces.
• Use title case to separate words (e.g., "BridgeDesign").
• Avoid special characters.
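Because the conventions in the box above are mechanical, they can be automated in tooling rather than left to memory. The following is a minimal sketch under the assumption that a script generates names at save time; the content-type and title values are invented for the example.

```python
import re
from datetime import date

def standard_file_name(content_type: str, title: str,
                       effective: date, ext: str = "pdf") -> str:
    """Build a file name per the sample conventions: YYYYMMDD date prefix,
    title-case words with no embedded spaces or special characters, and a
    standard extension."""
    # Keep only letters, digits, and spaces, then join the words in title case.
    words = re.sub(r"[^A-Za-z0-9 ]", "", f"{content_type} {title}").split()
    stem = "".join(w.capitalize() for w in words)
    return f"{effective:%Y%m%d}{stem}.{ext}"

# e.g., meeting minutes for a bridge design review held April 1, 2015
print(standard_file_name("meeting minutes", "bridge design review", date(2015, 4, 1)))
# -> 20150401MeetingMinutesBridgeDesignReview.pdf
```

A helper like this could be wired into a content management system's intake workflow so that names never need to be typed by hand.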

Standard File Scanning Protocols

Scanning protocols should ensure that the resulting content is text-searchable. Image files that have not been processed with OCR cannot be found through full-text search. If non-text-searchable files exist, OCR tools are available to convert image files to a text-readable format.

Email Practices

The practice of sharing files by emailing attachments results in duplication and increases the risk that a non-current or non-authoritative file will be used. These issues can be avoided by encouraging employees to post files they wish to share in an accessible and stable location (e.g., in a content management system) and email links to these files instead of attaching the files themselves.

Digital Signatures

Using digital signatures improves findability in that it enables agencies to transition from paper forms to searchable digital documents. Digital signatures employ mathematical schemes for demonstrating the authenticity of a digital message or document. A valid digital signature gives a recipient reason to believe that the message was created by a known sender, that the sender cannot deny having sent the message, and that the message was not altered in transit. Digital signatures are commonly used for software distribution, financial transactions, and other cases where it is important to detect forgery or tampering. Digital signatures:

• Authenticate the source of a message or document.
• Provide confidence that the message was not altered in transmission.
• Ensure that the signer cannot at a later time deny having signed it.
• Eliminate the need for inefficient paper processes.
• Maximize the amount of content in digital form that can be located by search engines.

The U.S. Government Publishing Office (GPO) publishes electronic versions of the federal budget, public and private laws, and congressional bills with digital signatures. Several universities, including Penn State, the University of Chicago, and Stanford University, publish electronic student transcripts with digital signatures.

Questions to Ask

The following questions provide a starting point for identifying opportunities to improve findability through improved information management practices:

• Has the agency clearly defined and documented the types of content that it wants to be findable by others beyond the individual who created them?
• Are employees familiar with the agency's policies about what types of content need to be findable?
• Are content or document management systems available to store and manage these important files?
• Are several different content management systems in place with duplicative functionality? Do they have inconsistent user interfaces? Should consolidation be considered?
• Do guidelines exist for what types of content should be stored in different repositories (e.g., shared drives, available content management systems)?
• Are workflows and responsibilities defined for management of important content types (e.g., plans, standards, and manuals)? Do these definitions ensure the availability of a single authoritative version at any given time? Where appropriate, do these workflow and responsibility definitions enable auditing of changes? Do they ensure adherence to records-retention requirements?

• Are responsibilities, incentives, and tools defined for identifying and removing duplicate files and deleting or archiving superseded versions of files?
• Are guidelines and training available on naming different types of files so that the files can be easily identified by others?
• Are guidelines available for file organization or folder structures, and are they consistently applied across work units with similar responsibilities (e.g., district or regional offices; project offices)?
• When paper files are scanned, is OCR software used to produce searchable text content?
• Are digital signatures utilized to enable transition from paper to digital content for documents requiring authoritative evidence of approval and assurance that the approved version has not been altered?

3.2 Improving Search and Navigation Capabilities

Why Is This Important?

An effective search capability enables people to quickly find the information they seek by typing keywords into a search box and/or using filters or "refiners" to navigate through a body of content. The availability of fast and responsive Internet search has led to the widespread expectation that similar capabilities should be possible for searching across information repositories within organizations. It is important to recognize, however, the wide disparity that currently exists between the effectiveness of Internet search tools and those available for use within an enterprise. This disparity is, in part, why search has been described both as "among the biggest, baddest, most disruptive innovations around" and "the source of endless frustration" (Morville and Callender 2010). The implication is that, within a DOT, implementing effective search is not simply a matter of acquiring a search tool and using it "out of the box."

A 2001 survey conducted by IDC, called "Quantifying Enterprise Search," found that searchers were successful in finding the information they sought 50% of the time or less (IDC 2002). In 2014, a survey conducted by the Association for Information and Image Management (AIIM) of 415 organizations (including 120 public sector organizations) found that enterprise search was still at a relatively early stage of adoption (AIIM 2014):

• Of these organizations, 52% reported having enterprise content management (ECM) systems; 15% relied exclusively on file shares and network drives; and the remaining 33% reported having a number of unconnected document, content, and scanned-file repositories.
• Of these organizations, 51% had implemented a capability to search across repositories from a single interface. The majority of these reported using their ECM tool for this; 12% reported using a stand-alone search tool or portal.
• Of those organizations that used their ECM tools to search other repositories (outside of the ECM), 52% had purchased standard connectors or custom connectors from their ECM vendor, and 45% developed their own connectors or used third-party developers.
• Less than 20% of the organizations had a dedicated and trained staff supporting enterprise search, or a designated owner for enterprise search.
The AIIM survey also found substantial levels of interest in utilizing enterprise search for multiple content types (office documents, emails, scanned files, maps, drawings, photos, videos, social network text, and audio files) across multiple repositories (email systems, file shares, enterprise and line-of-business databases, intranet sites, ECM systems, data warehouse/business intelligence reports, staff directories, cloud-based repositories, and internal social network systems).

The disparity between users' expectations for enterprise search and the ability of existing tools to meet these expectations also was observed in a recent paper (Stocker 2014). This paper cited results of a survey of 2,000 managers, finding that more than half of respondents (52%) reported that they cannot find the information they are seeking using their organizations' enterprise search facility within an acceptable amount of time. The Stocker paper also noted that while research on information retrieval in general has a long and rich history, the topic of searching for information behind enterprise firewalls has not yet been well studied.

What Can Be Done?

Providing effective search capabilities within a DOT requires (1) identifying and prioritizing the specific types of needs to be addressed, (2) acquiring the right tools, (3) spending the effort to configure them and build search interfaces tailored to the DOT's specific needs, and (4) devoting ongoing resources to monitoring and improving search results. Identifying needs is covered in Chapter 4; the other elements are described in the balance of this chapter.

Why Internet Search Performs Better than Enterprise Search

Google's PageRank algorithm requires billions of documents and billions of links to work well. Within an organization, however, the linking structures and the relatively small number of documents are not of sufficient scale to allow this kind of algorithm to work effectively. The result is that products available for "enterprise search" (searching websites and other information repositories within an organization) are less powerful and less effective than those available for Internet search. In addition to the major difference in scale between Internet and enterprise search, several other reasons explain why Internet search currently outperforms enterprise search:

• Publishers to the World Wide Web are eager to have their websites found, so they employ methods of search engine optimization (SEO) based on available information about the ranking algorithms of leading search engines. In contrast, agency or company staff who create content typically have priorities other than making their content findable; therefore, they do not put much effort or resources into ensuring that their content is discoverable.
• With Internet search, emphasis on making the home page and a few high-level pages of a website findable will typically suffice. With enterprise search, a more comprehensive approach is needed to ensure that each individual document can be found. In addition, enterprise content typically consists of documents and files in multiple formats without standard metadata, in contrast to the standard meta-tags used with HTML files on the web.
• Website creators dedicate more human resources (editors, taggers, and others) to a relatively small amount of content, whereas most organizations often don't have even one full-time person devoted to enhancing search results.
• Searchers on the web typically are satisfied to find any or some web pages on a topic, whereas searchers within an enterprise often want all of the documents on a topic or a very specific document or piece of information that they know exists.

Search Tools

Many DOTs have deployed content or document management systems and collaboration tools that have embedded search capabilities. For example, collaboration tools typically include a search box to help users navigate to content of interest. Engineering document management systems typically are configured to support searches by project location and project number. These embedded search capabilities have the advantage of being incorporated into software that people are used to dealing with as part of their jobs. Embedded search capabilities are rarely very advanced, however, and typically have much less functionality than commercially available search products.

Embedded search capabilities can work if a searcher knows the exact document(s) he or she is looking for—and if all the documents the searcher is interested in reside in that particular application. However, typically only a small percentage of searches (and of total time spent searching) involves a user trying to locate a specific known item. More time is spent on exploratory searches in which users seek to collect resources pertinent to a particular question or topic area. For exploratory browsing of related documents across multiple repositories, the usual embedded search capabilities may not suffice.

A specialized enterprise search tool can deliver improved search results. Several vendors offer enterprise search tools, and there are open source offerings as well as commercial products based on open source components. Enterprise search engines have a variety of advantages over the search capabilities that are embedded within individual applications. In addition to support for integrating content from multiple repositories, they can index large numbers of documents in a short period of time, which is essential when searching across large and multiple repositories. Another advantage is the ability to index multiple types of documents, including Word, PDF, and many others. A third advantage is their built-in capabilities for incorporating available rich metadata and taxonomies to improve search.

Best Bets

A simple approach that can work well both for Internet searches and within the enterprise is called Best Bets. Best Bets are manually created lists of key resources for common queries. Google employs thousands of editors to generate such lists to improve search performance. At a smaller scale, the technique requires only a modest level of effort.

Faceted Navigation

Faceted navigation is the most cited advance in search during the last 10 years. It allows users to filter a body of content based on multiple attributes, called facets. This is the approach used on Internet shopping sites to enable people to browse through a product catalog. On a shopping site, relevant facets include product type, size, cost, and features. In a DOT setting, relevant attributes might include content type, date range, source, district/region, topic, or project.

Tagging information resources with facets works because it gives users simple ways to filter search results and does not require any advances in calculating search relevance. The one drawback to faceted navigation is that it requires a great deal of consistently applied metadata. In some cases, this metadata can be automatically generated. Typically, however, considerable manual effort is required. (See the next section for further discussion of metadata.)
For faceted navigation to work, standard methods for classifying information resources need to be in place. Standardizing classification methods provides a way for users to retrieve most or all of the information relevant to a particular topic area with a single search, regardless of where the information is stored and what format it is in.

For example, a bridge design manual could be categorized using the following facets:

• Content type: Manual
• Mode: Roadway
• Asset type(s): Structures and Bridges
• Document owner: Bridge Division
• Business function: Design
• Sensitivity: Public
• Status: Current
• Issue date: April 1, 2015

If all of an agency's corporate documents were classified using these facets, it would be possible to build a one-stop shop for documents—one that would allow employees in any division to find all of the:

• Corporate documents related to roadway design.
• Manuals related to maintenance.
• Procedural documents owned by the Construction Division.
• Corporate documents that can be distributed to the public (versus kept for internal use only).
• Policy directives related to funding that were issued in 2015.

On the other hand, if the agency's Bridge Division classified its documents differently from the Construction Division and the Maintenance Division, this one-stop shopping would not be possible. (A minimal sketch of faceted filtering follows Figure I-4.)

Figure I-4 illustrates a sample list of facets that DOTs can consider for implementation within search interfaces. The lighter shaded elements are master data specific to an agency. The darker shaded elements are more generic categories that could, potentially, be commonly defined across DOTs. The facets shown in Figure I-4 represent a core set of elements; other facets, representing other DOT information categories, could be appropriate for specific classes of content.

Figure I-4. Possible DOT search facets. [The figure lists candidate facets—Content Type, File Format, DOT Business Function, Project, Transportation Mode, Transportation Asset, Sensitivity, Organizational Unit, Project Status, Route, District/Region, and Jurisdiction—shaded to distinguish agency-specific master data from a potential DOT community vocabulary.]
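The one-stop-shop idea above can be demonstrated with a small amount of code. The following is a minimal sketch of faceted filtering, with facet values echoing the bridge design manual example; the records and function names are invented for illustration.

```python
# Minimal faceted-navigation sketch: filter tagged content by facet values.
docs = [
    {"title": "Bridge Design Manual", "content_type": "Manual",
     "business_function": "Design", "document_owner": "Bridge Division",
     "sensitivity": "Public", "status": "Current"},
    {"title": "Maintenance Rating Manual", "content_type": "Manual",
     "business_function": "Maintenance", "document_owner": "Maintenance Division",
     "sensitivity": "Internal", "status": "Current"},
    {"title": "Construction Change Order Procedure", "content_type": "Procedure",
     "business_function": "Construction", "document_owner": "Construction Division",
     "sensitivity": "Internal", "status": "Current"},
]

def filter_by_facets(items, **facets):
    """Return items whose metadata matches every requested facet value."""
    return [d for d in items if all(d.get(k) == v for k, v in facets.items())]

# "All manuals related to maintenance"
print([d["title"] for d in filter_by_facets(
    docs, content_type="Manual", business_function="Maintenance")])
# "All corporate documents that can be distributed to the public"
print([d["title"] for d in filter_by_facets(docs, sensitivity="Public")])
```

The filtering itself is trivial; as the text emphasizes, the hard part is getting every business unit to apply the same facet values consistently.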

Implementing standard classifications in an agency can be challenging because it involves getting agreement across multiple business units (that may already have competing or inconsistent classification methods). Overcoming these challenges requires strong governance. A minimalist approach can be helpful: Rather than trying to be comprehensive in classifying information, the agency can select a few essential facets and give greater priority to those that can be automatically assigned and that involve minimal staff effort.

Text Analytics Tools

Text analytics tools are used to analyze unstructured text-based content (Word docs, PDFs, text files, tweets, blogs, etc.) in a variety of ways to extract new information. These tools can be used in conjunction with search tools to automatically or semi-automatically generate the types of metadata that make faceted navigation work. They also can support the process of developing taxonomies and controlled vocabularies. Because metadata is time-consuming and difficult to create in a consistent fashion, text analytics offers a solution to what has been a key limiting factor in the success of search solutions. Increasingly, text analytics functionality and search tools are being packaged together (see Appendix E).

Search-Based Applications

Enterprise search tools can be used as a platform for the development of search-based applications (SBAs) that can greatly extend the value of search and of all the metadata and taxonomies that go into an effective search. SBAs are software applications in which a search engine platform (rather than a database) makes up the core infrastructure for information access and reporting. These applications were first described in an IDC report by Feldman and Reynolds (2010). The basic idea is to build on a search engine's capability of dealing with unstructured text to enrich applications that previously could only utilize structured data. SBAs not only open up the use of the 80%–90% of business information that is in unstructured text, they also are significantly faster than databases and do not require users to learn structured query language (SQL), thus reducing training costs and leading to broader adoption.

SBAs are growing rapidly. Applications are being built for multiple purposes, including e-Discovery, business intelligence (BI), and the development of rich dashboards for everything from marketing to scientific research. Two trends are current: SBAs being developed to create new kinds of analytic tools, and SBAs being developed for use as off-the-shelf vendor solutions for older, established applications (e.g., e-Discovery and BI). Some benefits of SBAs are that they:

• Enable rapid access to information in multiple formats and from multiple sources.
• Are delivered as a unified work environment to support specific tasks or workflows, such as regulatory compliance, e-Discovery, sales prospecting, and customer support.
• Integrate all the tools commonly needed for a specific task or workflow, including multi-source information access, authoring, collaboration, reporting and analysis, and alerting.
• Integrate domain knowledge to support the particular task, including industry taxonomies and vocabularies.
• Eliminate the need for users to "pogo stick" (continually jump from one application to another).
• Are quick to deploy, easy to customize or extend, and economical to administer.
Search as a Service

Software as a Service (SaaS), which falls under the general category of "cloud computing," is a software distribution model in which applications are hosted by a vendor or service provider and made available to customers over a network (typically the Internet).

SaaS has become a common delivery model for many business applications, including office and messaging software, payroll processing software, database management system (DBMS) software, CADD software, accounting, collaboration, customer relationship management, content management, and service desk management. Search is an example of an application that can be delivered as a service. Benefits of the SaaS model include:

• Easier administration.
• Automatic updates (all users have the same version of the software).
• Easier collaboration (for the same reason).
• Reduction in software support costs.
• Global accessibility.

An approach that could be considered within a DOT environment is the development of a single search capability for deployment across multiple applications to provide consistent terminology and a consistent user interface. DOTs now have search capabilities that are built into different document management systems (e.g., for engineering drawings, team collaboration sites, records management, library repositories). SaaS could provide a way to develop a single search tool that could be called from these individual repositories. The tool could build in standard vocabulary/semantics (e.g., translating route IDs to street names and vice versa, returning results for cities within a queried county, returning corporate documents based on a set of standard search terms, etc.). The scope of search as a service would be limited to the individual calling application's repository.

Personalization and Subscriptions

Personalization involves tailoring search results based on a person's browsing history, geographic location, defined persona (e.g., role within the organization), or peer recommendations (from within a social networking framework).

Subscriptions involve pushing content to users instead of relying on users to seek out information. A push model provides an efficient way to ensure that people get the information they need (e.g., updates to policies and standards or notifications of training events). In a DOT, personalized searches or subscriptions might be defined for classes of users such as:

• District engineers
• Bridge engineers
• Project managers
• Construction inspectors

When a new document is added (or an existing document is modified) within a particular category of relevance to their responsibilities, these users would receive a notification and link.

Search Monitoring and Tuning

Once a search capability is up and running, ongoing monitoring and tuning are required to maintain effective and efficient performance. Monitoring and tuning involve periodically:

• Soliciting feedback from users on how search tools are working in order to identify areas of weakness to be addressed.
• Reviewing search logs to identify unsuccessful searches (those yielding no results) and common search terms. Commonly used search terms can be used to identify Best Bets, add synonyms to be used in search query processing, or refine relevancy ranking methods. (A minimal sketch of this kind of log review follows this list.)
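The log review described in the last bullet lends itself to simple automation. The following is a hypothetical sketch that assumes a log with one query per line, tab-separated from its result count; real search platforms expose richer analytics, and the log format here is invented for the example.

```python
from collections import Counter

def analyze_search_log(lines):
    """Each log line: '<query>\t<result_count>'. Return zero-result queries
    and the most common query terms (candidates for Best Bets or synonyms)."""
    zero_result, terms = [], Counter()
    for line in lines:
        query, _, count = line.rpartition("\t")
        if query and count.isdigit():
            if int(count) == 0:
                zero_result.append(query)
            terms.update(query.lower().split())
    return zero_result, terms.most_common(5)

log = ["guardrail standards\t0", "bridge manual\t12", "crash data\t7",
       "bridge inspection\t4", "barrier design\t0"]
print(analyze_search_log(log))
# -> (['guardrail standards', 'barrier design'], [('bridge', 2), ...])
```

Even this crude pass surfaces actionable findings: the zero-result queries for "guardrail" and "barrier" hint at a missing synonym mapping, and the frequency counts suggest where curated Best Bets would pay off.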

Questions to Ask

An assessment of agency search and navigation capabilities should consider the following questions:

• Is the range of search needs well understood within the agency? What types of needs are most important to address from an agency business perspective?
• Do agency managers understand that investment is required to improve and sustain search capabilities?
• What search capabilities are available within our various information repositories?
• Does the agency have enterprise search tools that search across repositories?
• How would users assess the performance of the currently available search tools? Are they able to find what they are looking for?
• Are search logs available that could reveal what searches are being conducted and which are successful or unsuccessful?
• Has the agency compared the capabilities of its existing search tools to current commercial and open source offerings? Can a business case be made for upgrading?
• Are current agency search tools still supported by their vendors?
• Who in the agency is responsible for monitoring and improving search performance?
• Does the agency have staff with the knowledge and experience required to improve its search capabilities?

3.3 Improving Metadata and Terminology Management

Why Is This Important?

Metadata has been described as the "unsung hero of the information age" and "the plumbing that makes the information age possible" (Coursera 2016). In a recent book, Pomerantz observed: "In the modern era of ubiquitous computing, metadata has become infrastructural. . . . Metadata, like the electrical grid and the highway system, fades into the background of everyday life, taken for granted as just part of what makes modern life run smoothly" (Pomerantz 2015).

Metadata provides a structured way to describe information resources so that they can be managed, searched, and evaluated. If one were to create a catalog of all reports, data sets, presentations, and other information sources in an agency, this catalog might consist of metadata elements (items such as title, date, author, topic, etc.). Each metadata element could be used for search and navigation. Metadata can be managed in database record fields, in meta-tagged text in the hidden header of web pages, or in the document properties fields of documents. Content management systems, document management systems, records management systems, and other such systems all have features for entering and storing metadata for each content item.

When metadata is available, search engines can use it to improve relevancy ranking, which is particularly important when a full-text search would yield a very large number of results. Metadata is necessary to enable searches of rich media and other non-textual formats. In addition, metadata enables advanced search and faceted navigation interfaces to be constructed (as discussed in the preceding section). Once an item has been retrieved, metadata helps users understand whether it is relevant and sufficiently authoritative to address their needs. Metadata also serves an important role in managing records-retention schedules and periodic culling of content within repositories.
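As a minimal illustration of how metadata can improve relevancy ranking, the following sketch weights query matches in metadata fields (title, subject keywords) more heavily than matches in body text. The weights and records are invented for the example; production engines use far more sophisticated ranking models.

```python
def score(item, query, title_weight=3.0, subject_weight=2.0):
    """Weight matches in metadata fields above matches in body text."""
    s = 0.0
    for term in query.lower().split():
        s += title_weight * item["title"].lower().split().count(term)
        s += subject_weight * sum(term == kw for kw in item["subjects"])
        s += item["body"].lower().split().count(term)
    return s

items = [
    {"title": "Pavement Data Collection Plan", "subjects": ["pavement", "data"],
     "body": "sampling frequency and cost of pavement condition surveys"},
    {"title": "Annual Report", "subjects": ["performance"],
     "body": "pavement spending rose this year"},
]
ranked = sorted(items, key=lambda d: score(d, "pavement data"), reverse=True)
print([d["title"] for d in ranked])  # metadata matches rank the plan first
```

The point is not the particular weights but the principle: a curated metadata field is a stronger relevance signal than an incidental body-text match, which is why quality metadata pays off most when result sets are large.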

As used in this context, terminology refers to a variety of controlled vocabulary and semantic resources, including glossaries, lists of synonyms, taxonomies, thesauri, and ontologies. These resources can be used to:

• Provide standard lists of values for certain metadata elements (e.g., a list of DOT organizational units, project phases, or infrastructure asset types).
• Expand user-entered search terms to also retrieve resources that use synonyms or related terms, thus retrieving a broader range of information resources (e.g., the search term "barrier" could be expanded to also retrieve documents that reference "guardrail"). (A minimal code sketch of this kind of query expansion appears later in this section.)
• Provide a structure for faceted navigation.

TRB maintains a defined list of modes and subject areas that are used to structure its main web page and can be used to navigate to relevant resources (see text box). These subject areas also can be used to query the TRB Research in Progress (RiP) database. This database consists of a set of metadata records for active research projects. A user can search by TRB subject areas or by index terms from the online Transportation Research Thesaurus (TRT). The TRT is a more comprehensive terminology resource that includes equivalent terms, broader terms, narrower terms, and definitions for selected terms (TRB n.d.).

With improvements to natural language processing, there has been some debate in the literature as to whether the effort required to develop and maintain controlled vocabularies adds sufficient value in terms of improved findability over full-text searches. However, several studies have documented the value of using controlled vocabulary subject headings within bibliographic records. For example, Garrett (2007) found that the addition of subject headings to a bibliographic database increased the rate of retrieval by 29%. A more recent study by Gross, Taylor, and Joudrey (2014) analyzed 194 search terms and found that of all documents retrieved based on a keyword search with these terms, 28% would not have been found if the subject heading information (based on controlled vocabulary) did not exist.

Transportation agencies may have metadata standards and practices in place for certain types of content. For example, agencies with libraries may maintain a catalog of agency publications and consultant studies. Agencies with engineering content management systems will typically require basic metadata for each item added to the system to identify the project, author/creator, description, and other characteristics of the item. Many agencies maintain standard metadata for their geospatial data sets, which allows them to leverage available tools that can harvest this metadata for data catalogs and support open data portals. Although DOTs can draw on resources like the TRT for tagging their content with subject terms, more specific terminology tailored for the DOT environment and for specific search needs within the DOT may be required.

What Can Be Done?

This section presents a set of activities that can be undertaken to improve the use of metadata for purposes of findability. Metadata serves multiple purposes and is implemented in different ways depending on the context, so a multi-pronged approach typically is needed. Because creating and maintaining metadata requires effort, improvements should be prioritized based on business value added.
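As a concrete illustration of the synonym-expansion idea discussed above, the following sketch expands a user's query with equivalent terms from a small controlled list before matching. The synonym table and documents are invented for the example.

```python
SYNONYMS = {  # equivalence classes drawn from a controlled vocabulary
    "barrier": {"guardrail"}, "guardrail": {"barrier"},
    "crash": {"collision"}, "collision": {"crash"},
}

def expand_query(terms):
    """Add equivalent terms from the synonym table to the user's query."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

docs = {"D1": "guardrail end treatment standards",
        "D2": "median barrier design",
        "D3": "collision data summary"}

def search(query):
    terms = expand_query(query.lower().split())
    return [d for d, text in docs.items() if terms & set(text.split())]

print(search("barrier"))  # -> ['D1', 'D2']: retrieves "guardrail" documents too
```

In practice the synonym table would come from a maintained resource such as the TRT rather than being hand-coded, but the query-time expansion step works as sketched.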
Adopt/Adapt Metadata Schemes

Adoption of standard metadata schemes within the agency will promote consistency across search interfaces for users and will enable and facilitate implementation of federated search and service-oriented models for discovery of information resources. Agencies can draw upon publicly available metadata standards for different content types and extend them as needed for internal use.

Table I-2 lists selected standards that may be relevant to DOTs. Although it may not be realistic to standardize metadata elements agency-wide, standardization can be implemented incrementally by:

• Focusing on specific content types (e.g., spatial data sets, image files).
• Requiring use of standards for new document and content management systems.
• Translating existing metadata elements to the standard based on crosswalks and transformation logic.

Build Terminology Resources

Identify available terminology resources to use for subject keywords or as the controlled source for other metadata elements to be included in a search capability. Publicly available resources are listed in Table I-3.

Transportation Research Board Modes and Subject Areas (Topics)

Modes: Aviation; Highway; Marine Transportation; Motor Carriers; Pedestrians and Bicyclists; Pipelines; Public Transportation; Railroads.

Topics:
• Design & Construction: Bridges & Other Structures; Construction; Design; Geo-technology; Hydraulics & Hydrology; Materials; Pavements.
• Planning & Environment: Economics; Energy; Environment; Planning & Forecasting; Society.
• Policy & Organization: Administration & Management; Data & Information Technology; Education & Training; Finance; History; Law; Policy; Research; Transportation (general).
• Safety, System Components and Users: Freight Transportation; Passenger Transportation; Safety & Human Factors; Terminals & Facilities; Vehicles & Equipment.
• Operations & Preservation: Maintenance & Preservation; Operations & Traffic Management; Security & Emergencies.
• Hot Topics: Climate Change; Transformational Technology.

Source: TRB home page available at: http://www.trb.org/main/home.aspx (accessed March 2017).

Table I-2. Metadata standards.

• General: Dublin Core (ISO 15836) – 15 standard elements: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type. Reference: http://www.dublincore.org/documents/dces/
• Geospatial Data: Content Standard for Digital Geospatial Metadata (CSDGM) – Elements cover Identification, Data Quality, Spatial Data Organization, Spatial Reference, Entity and Attribute (data dictionary), Distribution, Metadata Reference, Citation, Time Period, and Contact. Geospatial Metadata (ISO 19115) – Similar information to the CSDGM with a different structure and added flexibility; tools are available to convert CSDGM to ISO format. Reference: https://www.fgdc.gov/metadata
• Open Data Sets: Project Open Data Metadata Schema v1.1 – Metadata scheme to support sharing of open data sets based on W3C DCAT, a vocabulary designed to facilitate interoperability between data catalogs published on the Web. Reference: https://project-open-data.cio.gov/v1.1/schema/
• Survey Data Sets: Data Documentation Initiative (DDI) – Detailed metadata standard for describing social and behavioral science data (e.g., household travel surveys). Reference: http://www.ddialliance.org/
• Web Pages: schema.org – A structured data markup schema for web pages developed through a collaborative effort among major search engine providers; includes a shared markup structure and vocabulary for describing the scope and type of information included on the page. Reference: http://schema.org/docs/schemas.html
• Image Files: Data Dictionary – Technical Metadata for Digital Still Images (ANSI/NISO Z39.87-2006 [R 2011]) – Defines "a set of metadata elements for raster digital images"; the data dictionary has been designed to "support the long-term management of and continuing access to digital image collections." Reference: http://www.niso.org/apps/group_public/download.php/14698/z39_87_2006_r2011.pdf
• Multimedia Files: MPEG-7 (ISO/IEC 15938) – A multimedia content description interface. Reference: http://mpeg.chiariglione.org/standards/mpeg-7
• Library Resources: MARC 21 Format for Bibliographic Data: Field List – Standard used for library cataloging. Reference: https://www.loc.gov/marc/bibliographic/ecbdlist.html
• Digital File Embedded Metadata: Extensible Metadata Platform (XMP) (ISO 16684-1:2012) – A platform that can be used for embedding metadata within a variety of file types, including PDF, TIFF, JPEG, PNG, GIF, and MP3. Reference: https://wiki.creativecommons.org/wiki/XMP

Frequently, efforts to develop metadata schemes and ways of classifying information are performed in the context of particular initiatives (e.g., implementation of a new content management system). These efforts can be leveraged to build a standardized agency-wide approach. An agency-wide approach would enable searching across different repositories based on common metadata elements.

A comprehensive approach to classifying DOT information involves an architectural exercise to identify the major categories of objects or entities of concern, and the ways in which these entities are identified or distinguished. DOTs wishing to undertake such an exercise can use work already done to structure and organize their tabular databases. Dimensions useful for classifying unstructured content often will be the same as those found within DOT data warehouses or common lists of values within line-of-business applications. Similarly, metadata elements useful for retrieval and linkage of unstructured content will match the unique identifiers found within DOT databases. Internal agency databases and applications can be mined for additional master data and code lists that could be used for classification/tagging of information resources and for search. Types of classification terminology to compile include:

• Program categories
• Funding categories
• Performance categories or agency emphasis areas
• Project or work type classification
• Project status
• Asset categories or types
• Equipment types
• Material types
• Transportation modes
• Highway network or system subsets (where these differ from federal designations)
• Content types

Content types may vary by business area or function (e.g., construction staff may use terms like material tests, daily inspector reports, claims, and change orders, while maintenance staff use terms like work orders, winter maintenance route maps, permit applications, and equipment warranties).

Agency master data that would be valuable to identify for integration into search capabilities includes:

• Districts/regions
• Organizational units
• Business function categories (more stable than organizational unit names, which tend to shift fairly often due to reorganization)
• Maintenance areas

Agency master data that could be of value to identify for integrating into search capabilities includes:
• Districts/regions
• Organizational units
• Business function categories (more stable than organizational unit names, which tend to shift fairly often due to reorganization)
• Maintenance areas
• Maintenance activities
• Projects
• Contracts
• Vendors
• Employees
• Partner organizations
• Routes
• Road sections
• Bridges
• Financial accounts
• Funding sources
• Programs

Embedding Metadata Creation into Information Management Workflow

Where possible, creation of metadata should be integral to the workflows for creating and storing information resources. For example, saving a document within a document management system typically requires the user to fill in a set of metadata elements. Submitting a spatial data set to a GIS unit may require completion of Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata (FGDC CSDGM) or ISO 19115 metadata. (For information about FGDC CSDGM, see https://www.fgdc.gov/metadata/selecting-a-geospatial-metadata-standard.) Some document and content management systems will automatically populate certain metadata elements to minimize the burden on users.

Metadata Quality Improvement

Opportunities may exist to improve search capabilities by improving existing metadata. For example, an existing engineering content management system may have defined metadata elements, but the actual metadata it contains may be incomplete or inaccurate. System owners can review metadata quality and initiate an effort to complete and correct the metadata, either by enlisting assistance from content creators or independently. Automated methods can assist with auditing and reviewing existing metadata and with filling in gaps for certain types of elements. Options for assigning metadata are discussed in the next section.
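As one illustration of the automated assistance mentioned above, the following Python sketch audits a batch of metadata records for missing required elements. It assumes metadata has been exported as a list of records (e.g., from a content management system report); the required field names are hypothetical.

```python
from collections import Counter

# Hypothetical required elements for engineering documents.
REQUIRED = ["title", "author", "date", "content_type", "project_id"]

def audit(records: list[dict]) -> Counter:
    """Count how many records are missing (or have blank) required elements."""
    gaps = Counter()
    for rec in records:
        for field in REQUIRED:
            value = str(rec.get(field, "") or "").strip()
            if not value:
                gaps[field] += 1
    return gaps

records = [
    {"title": "Bridge 042 inspection", "author": "", "date": "2016-04-12"},
    {"title": "", "author": "J. Smith", "content_type": "report"},
]
for field, count in audit(records).most_common():
    print(f"{field}: missing in {count} of {len(records)} records")
```

A report like this helps size the cleanup effort before deciding whether to enlist content creators or tackle the gaps centrally.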

Metadata Assignment

Adding metadata to unstructured documents is the single most effective way to improve their findability. However, adding metadata requires both human and technical resources, and many organizations have found it difficult to make the business case for these investments. It is important to consider the level of effort required to assign (or improve) this metadata. A variety of methods can be used for tagging documents, including asking authors to do the tagging, having librarians or other information management professionals do the tagging, and/or using automated text analytics software to assign tags. Because each method may involve a significant cost, it is important to know how much effort is required, consider whether this effort is feasible, and evaluate whether the anticipated improvement to findability is worth the cost.

Five options for populating metadata can be considered. An appropriate option can be selected based on the amount, type, and diversity of the target content and the specific findability needs being addressed.

Option 1: Minimalist Approach. This option relies on automatically populated file properties (e.g., for an office document, the file name, file type extension, date, and author) as well as the full text of the document. A folder structure can be set up (within a file drive or within a document management system) that provides a single-dimensional filing system. This approach requires little effort to implement and may be appropriate for smaller collections and relatively small user populations. Search can be facilitated by consistent naming conventions and proper adherence to established folder structures. Folder structures have disadvantages, however: a given document may fit within more than one category; some folder structures involve multiple levels of hierarchy that are inconvenient to navigate; and, once established, folder structures can be difficult to modify to meet changing needs.

Option 2: Free-Form, User-Assigned Metadata. With this option, authors assign keywords or tags of their choosing to their own content at writing or on publication. Alternatively (or in addition, if supported by the content or document management system), content reviewers or readers can assign keywords or tags. These keywords or tags can be personal, or they can be exposed to other users for searches. The advantage of this approach is that keywords, by definition, capture the vocabulary of authors. In addition, because there is no controlled vocabulary to develop or maintain, and tagging is part of the authoring process, there is no perceived organizational overhead cost for metadata. Most content management systems support user-assigned tags, and even stand-alone office documents stored on file drives can include keyword properties or tags.

In practice, it is difficult to obtain consistent, high-quality metadata with this decentralized approach. Authors may not reliably assign meaningful keywords, given that doing so requires a conscientious effort and authors may not understand the value or purpose of the metadata. Moreover, even when authors always specify keywords, free-form tags are generally less effective for search than keywords selected from a professionally designed, controlled vocabulary. Tags that are meaningful to the author may not be meaningful to others. For example, expert authors tend to tag for other experts rather than for people without specialized knowledge of the content. Author-selected tags may include jargon and acronyms unfamiliar to other people. Also, authors from different parts of an organization may use distinct vocabularies for the same topic area, so a tag applied by an author in one part of an organization may not be helpful to an information seeker from another part of the same organization.

In sum, although metadata obtained in this manner generally is not sufficient to support faceted navigation or ensure a high degree of recall or precision for searches, this option provides a low-cost method for obtaining metadata to characterize the topic areas of information resources.

Option 3: User-Assigned Metadata from Controlled Vocabulary. Using this approach, authors select metadata tags from a controlled vocabulary, which can be an approved list of keyword terms or a more fully developed taxonomy. This approach results in better-quality keywords than Option 2, and it retains the cost advantages of the distributed model for metadata assignment. However, its success will depend on the design of the vocabulary and the level of training provided to authors.
If the vocabulary is long or complex, it can be difficult and time-consuming for authors to select the right set of terms, leading some authors to respond by simply selecting the first term on the list. On the other hand, if the list of terms is too short, it may lack sufficient granularity to provide meaningful search criteria. Tagging from a controlled list is a specialized skill that requires training, though few organizations provide this type of training.

It is possible to implement a hybrid of Options 2 and 3. In the hybrid approach, authors assign terms from a preselected, controlled vocabulary but can also add terms to cover situations for which the list has no suitable standard term. This hybrid approach offers some helpful flexibility but risks introducing potentially ambiguous search terms.

Option 4: Centrally and Manually Assigned Metadata. This is the traditional approach to metadata assignment used in libraries. Teams of taggers (most often professional indexers) assign metadata from a controlled vocabulary to information resources as they are published, or at set intervals. Librarians may also be responsible for developing the controlled vocabularies used to tag the resources. A best practice is to provide the tagging staff with feedback on the use of tags so that they can refine the controlled vocabulary.

The advantage of this approach is that it normally produces much higher quality metadata, given that the vocabulary development and tagging are done by people specifically trained for these functions. A higher level of consistency in tagging is achieved by centralizing the tagging function within a small team. Whereas content authors may perceive tagging as an add-on activity of marginal value, professional librarians view tagging as part of the job and will generally produce complete and appropriate metadata. Nevertheless, even among professionals, inconsistencies may arise in how information resources are tagged. A quality assurance process can provide greater consistency across members of the tagging team. Over time, mismatches also may arise between the terms in the controlled vocabulary and the terms in current use within the user/content author community. An active approach to obtaining user feedback on search results, along with frequent updates to the vocabulary, can address this issue.

The major disadvantage of this approach is the cost and time required to do the tagging. The organization must pay for the tagging team's time, and, because the tagging activity is centralized, delays may be introduced in making information resources available while a small tagging team works through a backlog. Given the cost of manual tagging, this approach is practical only for a subset of high-value information resources in an organization.

Option 5: Centrally and/or User-Assigned Metadata through Automated/Semi-Automated Methods. This final option involves the use of software (usually text analytics software) to automatically add or suggest metadata for information resources. The software analyzes each resource and tags it with one or more primary subjects based on one or more facets or categories. In addition, the software may extract descriptive metadata elements such as title, author, and date. Typically, the software can be used in batch mode to process large numbers of resources or in manual mode to support semi-automated metadata assignment. In manual mode, the software analyzes each document and generates suggested metadata. Suggested terms are then reviewed and accepted (or rejected) by human beings (the content authors and/or professional taggers). This type of semi-automated capability often is integrated into the publishing workflow within content management software.

The advantage of this approach is that large sets of information resources can be tagged in a short period of time with minimal human effort. Once a set of rules for tagging has been developed, and/or the software has been trained, metadata can be created far more consistently than would be possible with human taggers. Of course, the metadata assigned will only be as good as the rules that have been developed.
Semi-automated tagging is more time-consuming than fully automated tagging, but it can improve results. Semi-automated tagging by authors can minimize the time required of a central tagging team while overcoming some of the disadvantages of Option 3 (user-assigned metadata from a controlled vocabulary). The tagging task is much easier for authors when they are presented with suggested metadata values and all they have to do is accept or reject them (or, in some cases, suggest changes). Even if authors choose to do nothing, the information resources will still be tagged with the automatically generated metadata. Thus, semi-automated tagging with human review of the suggested metadata combines the advantages of automation (consistency, speed) with the advantages of human tagging (expert judgment). This option also can be used with a central team of librarians/indexers to tag large sets of documents, either through a publishing process that routes all documents through the central team or for the special and often critical task of tagging large sets of legacy documents.

The disadvantage of this approach is that the cost of commercial text analytics software is not insignificant, typically running in the six figures. A significant development effort also is required to establish rules for metadata assignment. This is an iterative process involving initial rule development, application of the rules to a test set of documents, rule refinement, re-testing, and so on. It may also be necessary to develop or refine controlled vocabularies. Given these costs, this approach is most suitable for large (e.g., tens of thousands to millions of items) and growing collections of information resources (e.g., email) and less appropriate for environments with a set of smaller, heterogeneous collections. Although low-cost and free open source text analytics software is available, these lower-cost options typically require significant implementation effort by an IT team.
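To make the rule-based approach concrete, the Python sketch below shows automated tag suggestion in its simplest form: each document's text is matched against trigger phrases associated with controlled-vocabulary terms, and matching terms are offered for human review. The vocabulary and trigger phrases are illustrative; commercial text analytics tools apply far richer rules (weighting, proximity, machine learning).

```python
import re

# Illustrative rules: controlled-vocabulary term -> trigger phrases.
RULES = {
    "Winter Maintenance": ["snow", "ice control", "plow"],
    "Pavement": ["pavement", "asphalt", "resurfacing"],
    "Bridges": ["bridge", "deck", "girder"],
}

def suggest_tags(text: str) -> list[str]:
    """Suggest controlled-vocabulary terms whose trigger phrases appear in text."""
    suggestions = []
    for term, triggers in RULES.items():
        if any(re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)
               for t in triggers):
            suggestions.append(term)
    return suggestions

doc = "Resurfacing schedule for the bridge deck, pending snow plow availability."
print(suggest_tags(doc))  # ['Winter Maintenance', 'Pavement', 'Bridges']
```

In a semi-automated workflow, the returned terms would be presented to the author or indexer as pre-checked suggestions rather than applied silently.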

Questions to Ask

An assessment of metadata and terminology management practices should consider the following questions:
• What metadata elements would be useful to support search or navigation but are not available?
• How complete is current metadata? How accurate and up to date is it?
• If metadata is being created manually, who is doing it, and how much time is required?
• What incentives or rules are in place for people to create metadata?
• Are automated or semi-automated processes in place to create metadata? Have options for automating metadata creation been explored?
• What terminology resources (e.g., pick lists, synonym lists, taxonomies, thesauri) are currently being used for subject keywords or other metadata elements? Do these resources come from external sources (e.g., the TRT) or were they developed internally?
• Are integration processes in place supporting the use of standard code lists and master data sources within metadata assignment and search applications?
• Within the agency, who is involved in defining, populating, and/or managing metadata?
• Does an individual, unit, or team have responsibility for establishing metadata standards in the agency? If not, what individual or group could potentially take this on?
• Do the staff responsible for metadata design and implementation have sufficient training and background in library and information science to operate effectively and draw on current practices and technologies?
• What metadata elements have been defined for different types of content in different repositories? Which of these elements are required and which are optional?
• What elements are consistent across repositories? What inconsistencies exist? Are there opportunities to standardize?
• If internal terminology resources are being used, how are they updated (e.g., who is responsible, what triggers the updates, what inputs are used, how are updates approved)?

Chapter 4. Planning for Findability Improvements

Any effort to improve findability should begin with an understanding of:
• The users (what they are seeking and why).
• The information landscape (the types of information resources that are available, the methods for storing and accessing them, and the methods by which information resources are created and managed throughout their life cycle).

The assessment should produce a list of the most important needs to be addressed and a diagnosis of the underlying issues. Examples of needs and issues include the following:
• The information being sought does not exist.
• Information has not been stored in a searchable location.
• Information is not in a form that supports discovery (e.g., it is stored as an image rather than as searchable text).
• Search tools are not available.
• Search tools do not provide acceptable results.
• Search tools do not meet user needs (e.g., they support a simple keyword search but users need to explore a body of content through multiple filters).
• Users are not aware of existing search resources or do not know how to use them effectively.

4.1 Understanding User Needs

Value of Understanding User Needs

Understanding what people are searching for, why, and how is fundamental to the selection and design of effective findability improvements. The first logical step for many agencies will be to get a better understanding of the amount of time staff spend looking for authoritative information, the amount of rework that occurs as a result, and the consequences of staff taking action without the benefit of available information. Key questions are:
• What types of information are needed? By whom?
• How do users currently look for information?
• What methods are most successful? Least successful?
• What kinds of questions are most difficult or time-consuming to answer?
• What kinds of searches do not turn up relevant results?
• What problems have resulted from the lack of findability? What risks have been created (e.g., lack of awareness of an internal study or analysis that was important to someone's work)?
• How much time do employees spend looking for information?

• How much time do employees spend responding to questions when the information is (or should be) readily available on the agency's intranet site?

The text box “Examples of DOT Search Needs” illustrates the diversity of needs that may exist within a DOT.

Examples of DOT Search Needs
• Known item search. Find available Professional Engineer's Exam preparation materials.
• Engineering standard search. Find the latest version of a particular ASTM standard.
• Subject search (internal). Find all meeting notes, presentations, internal memos, and intranet web pages related to bicycle and pedestrian accommodations.
• Subject search (internal and external). Find all of the public correspondence and local news articles related to roundabouts during the past 5 years.
• People/expert search (internal). Find all people in the agency who have project management certification.
• People/expert search (external). Find the individuals currently responsible for right-of-way management in each DOT in the Northeast.
• Procedure and policy search. Find the most recent national and state-specific policies related to design and installation of traffic devices.
• Location search. Find all of the pavement tests that have been conducted within a particular route section during the past 10 years.
• Project search. Find all plans, tests, contracts, and inspector logs related to a specific construction project.
• Legislation search. Find federal and state legislation and related administrative guidance related to tolling.
• Best practice/lessons learned search. Find lessons learned related to the use of recycled pavement materials.

Although it is not practical to obtain a comprehensive understanding of all search needs in an agency, it is worth investing the effort to understand search behaviors before embarking on an effort to improve findability. Investigation of search needs can help identify the areas likely to produce the greatest payoff from findability improvements, ensure that findability initiatives are targeted at the right problems, and provide the information necessary for designing an effective search solution.

More specifically, understanding the areas where employees either spend a lot of time searching for information or where searches for critical information are unsuccessful can help target improvements that will result in greater organizational efficiency and effectiveness. Understanding where employees experience the most difficulties also will help target improvements. For example, an effort might be started to ensure that employees have ready access to the most current engineering standards and guidelines. However, interviews with employees might reveal that most of them already know how to access the core corporate documents, and what they really need is help interpreting these documents and knowing what to do in special situations. For this type of need, it may be more effective to implement a robust people/expertise search capability than a policy/procedure search. There may also be a need to extend this type of search beyond the organization into peer agencies.

Understanding specific needs is essential for developing solution requirements.

Requirements would consider information sources, content formats, collection sizes, adequacy of full-text search approaches, need for content classification metadata, and so forth. Understanding the requirements also provides the basis for determining whether a search need would best be addressed via existing content management systems, improvements to general enterprise search capabilities, or development of a search-based application that targets a specific business process or user group. Mapping out the variety of search needs is a critical step in designing an effective faceted navigation capability.

Approaches for Understanding User Needs

User needs can be investigated with a variety of techniques, including surveys, logs, interviews, focus groups, direct observation of how users search, and search analytics that track actual usage. It is helpful to identify a group of end users who are willing to be interviewed about their search behaviors and needs, spend time testing new capabilities, and provide substantive feedback on an ongoing basis. Focus groups and interviews are useful techniques for getting an in-depth understanding of what types of information employees look for, where they look, and the extent to which their search needs are being met (and, if not, why). Because these techniques are time-intensive, they typically would be applied selectively to a set of target groups (e.g., district project managers, the central office, research or performance management units, employees with traffic engineering and operations responsibilities, or recent hires agency-wide). Agency librarians (where they exist) can be a valuable source of information on search needs in the agency; some may be able to provide logs of requests received.

Interview subjects and focus group participants can be asked to provide examples of searches that they have attempted. Information of interest on these searches includes:
• Search contexts:
– Impetus for search (external request, manager request, self-initiated)
– Related business process or task
– Level of urgency (need for an immediate answer versus a longer timeframe)
• Nature of search:
– Known item versus subject search
• Target content of search:
– Person/expertise
– Document
– Image
– Data
– Web page
– Combination
• Target repositories:
– Internal
– External
– Combination
• Search criteria:
– Single keyword, multiple keyword phrases, progressive iteration
– Advanced (multiple attribute query)
– Data query
– Map query
• Results:
– Level of success
– Time spent on the search
– Perceived ease of use of search tools

• Reasons for limited search success:
– Target material not available
– Lack of search skills (e.g., use of wildcards, use of quotation marks)
– Poorly configured search tools
– Poor performance of text search
– Poor metadata quality

An online survey is another approach that can be used to gain a broader understanding of search needs and behaviors. For example, a survey can ascertain the level of awareness and use of existing information repositories and perceptions of their adequacy or value. A survey also can ask employees to identify specific search needs that are not currently being met, although results from such open-ended inquiries are likely to be more anecdotal than comprehensive or representative.

Finally, automatically generated search logs can be another source of insight into the types of searches being conducted and the nature of the results generated. Although search logs are generated automatically, effort is typically required to process the raw log data into meaningful information. Search analytics capabilities are being continually improved by vendors.
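As a minimal illustration of turning raw search logs into meaningful information, the following Python sketch tallies the most frequent queries and the most frequent zero-result queries, two of the most actionable signals for findability improvements. The assumed log format (a CSV with query and result_count columns) is illustrative; actual formats vary by search engine.

```python
import csv
from collections import Counter

def analyze_log(path: str, top_n: int = 10) -> None:
    """Report the most frequent queries and the most frequent zero-result queries."""
    queries, zero_results = Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # expects columns: query, result_count
            q = row["query"].strip().lower()
            queries[q] += 1
            if int(row["result_count"]) == 0:
                zero_results[q] += 1
    print("Top queries:", queries.most_common(top_n))
    print("Top zero-result queries:", zero_results.most_common(top_n))

# analyze_log("search_log.csv")  # hypothetical exported log file
```

Frequent zero-result queries often point directly at missing synonyms, missing content, or indexing gaps, making them a natural starting point for improvement.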

4.2 Surveying the Information Landscape

Surveying the information landscape involves:
• Establishing the scope of the findability improvement.
• Identifying where relevant information is stored.
• Understanding the mix of content types and formats.
• Assessing how information is organized, classified, and maintained throughout its life cycle.

Establishing the Target Scope

The user needs assessment can provide a first cut at the target scope for the types of information to be included in the findability improvement effort. If the scope is large, it may be a good idea to limit it further to provide a realistic focus; additional information types can be added in later efforts. Although it may be tempting to include as many sources and types of information as possible, an incremental approach that initially emphasizes mission-critical or core content relevant to the identified user needs should be considered in order to manage expectations and deliver results within a reasonable time frame. Agencies may choose to cast a relatively broad net initially and then refine the scope based on the findings of the information landscape assessment.

Focus is important to keep the effort manageable, but it is a good idea to establish a comprehensive understanding of agency content up front, even if only parts of it are used initially. Later, the agency can add more information types or repositories. Taking this approach helps the agency avoid running into dead ends and needing to restructure its solutions.

Identifying Information Repositories (Where Is the Data Stored and What Type of Content Is There?)

The information landscape at a DOT includes multiple repositories spanning a wide range of topic areas, with information available in a wide variety of formats. Figure I-5 illustrates a collection of information repositories (places where data and content can be stored and accessed) that may exist in a typical DOT, with examples of the types of content one might find in each.

Repositories may include:
• An enterprise collaboration platform with sites for organizational units and project teams, used for active document sharing (e.g., SharePoint).
• A specialized engineering document system used to manage and share design plans and other content related to a construction project (e.g., ProjectWise).
• One or more document or records management systems used to archive documents of long-term value to the organization, or to maintain access to active documents.
• Shared drives on file servers used by individuals, teams, and organizational units to store and share a variety of content.
• Enterprise databases supporting a wide variety of business applications.
• Enterprise GIS (geodatabases or other repositories of spatial data layers).
• Data warehouses (integrated data extracted from source systems for reporting and analysis).
• Special-purpose repositories (e.g., email, multimedia).
• Internal and public-facing websites, which may utilize web content management systems (e.g., WordPress, Drupal, Plone).

In addition to these information repositories, catalogs may be in place containing metadata about internal or external information resources; library catalogs, data dictionaries, data catalogs, and enterprise metadata repositories are examples.

When identifying the relevant repositories, it is also important to ascertain whether plans are in place for upgrading or replacing them.

Figure I-5. DOT information landscape.

Understanding Content Types and Formats

Once the relevant repositories have been identified, the next step is to understand what content types and formats they include. For each repository, the goal is to obtain a breakdown of both the number of information resources (e.g., documents, data tables, images) and the amount of disk space these resources consume, by content format and by content type category (a simple inventory sketch appears below, after the list of formats).

Content formats include:
– Applications
– CADD files
– Calendar entries
– Databases
– Emails
– GIS files
– Image files
– PDFs
– Photographs
– Presentation files
– Spreadsheets
– Text documents (e.g., Microsoft Word)
– Videos
– Web pages

Each content format may require a specific set of strategies to ensure findability. For example, scanned text documents will require optical character recognition (OCR) to be searchable.
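The promised inventory sketch follows. This minimal Python example walks one repository (here, a hypothetical shared drive path) and tallies file counts and disk space by format; mapping extensions to the content types discussed next would be an agency-specific refinement.

```python
import os
from collections import Counter

def inventory(root: str) -> None:
    """Tally file counts and total bytes by file extension under root."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            try:
                sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # skip unreadable or moved files
    for ext, n in counts.most_common():
        print(f"{ext:10s} {n:8d} files {sizes[ext] / 1e9:10.2f} GB")

# inventory(r"\\fileserver\engineering")  # hypothetical shared drive
```

Even this crude breakdown is often enough to reveal, for example, that a repository is dominated by image-only PDFs that will need OCR before they can be indexed.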

Content type offers a second way to characterize content—based on its purpose. Examples of DOT content types include:
– Access permits
– Accident/incident reports
– Checklists
– Code lists
– Contracts
– Correspondence
– Crash reports/police accident reports
– Design/as-built plans
– Environmental permits
– Floor plans
– Forms
– Graphics
– Hauling permits
– Inspection reports
– Invoices
– Legislation and regulations
– Manuals/instructions
– Maps
– Meeting agendas/minutes
– Memoranda
– Organization charts
– Plans, reports, and studies
– Policies and standards
– Programs and budgets
– Purchase orders
– Project charters
– Public notices
– Software manuals
– Specifications
– Timesheets
– Training materials

Content type classifications can be useful for establishing the scope of a findability improvement effort based on agency priorities and user needs. This type of classification also can be used as a framework for defining information organization policies (what types of content should be stored where), and it can provide the basis for establishing metadata standards to support search and integrate information. For example, location metadata would be applicable to content types such as as-built plans, hauling permits, and project charters, but not to legislation, policies, or organization charts. Correspondence and memoranda might require source (from), audience (to), subject (re), and date metadata elements.

Mapping the Information Life Cycle

Strategies and processes for making information findable are implemented in the context of an information management life cycle (Figure I-6). Key activities in this life cycle are:
• Information is created or acquired (e.g., a new procedure document is drafted).
• Information is documented (i.e., the new procedure is assigned a file name and a content type such as “procedure document,” and a brief description of it is created).
• Information is stored or archived for active use or long-term preservation (e.g., the new procedure is posted on the agency website).
• Information is periodically reviewed to identify what should be saved and what is no longer needed, subject to applicable records-retention schedules (e.g., older versions of procedures are discarded to avoid confusion).
• Information is discovered and accessed (through search engines, web pages, or library catalogs).

Figure I-6. Information management life cycle.

Figure I-6 also shows connections between information management activities and knowledge management activities:
• Employees build knowledge through a learning process that is informed by content they discover and access and/or through collaboration or mentoring relationships with their peers.
• Employees use and apply their knowledge in the course of carrying out their jobs.
• Periodically, employee knowledge is captured and codified for use by others. This capture process may yield information in multiple forms (e.g., procedural documents, training materials, lessons-learned descriptions, blog posts, or taped interviews). The captured knowledge becomes an input to the information cycle.

The connections between employee learning and knowledge transfer are important, as this is where the business value from findability improvements is created.

Understanding the information life cycle for a particular type of content involves documenting how each of the following activities is accomplished:
• Create/acquire
– Is content being created in the organization or acquired from external sources?
– Is there a deliberate curation process to identify what content is valuable for particular business purposes?
– Are there multiple creators/sources or just one?
– Where in the organization are the creators of content located?
– How frequently is content added/acquired/modified?

• Document/classify
– Is any informal or formal process in place to assign metadata to the content (e.g., populating document properties for a spreadsheet, completing a metadata form for a spatial data file, updating data dictionaries)?
– What type of metadata is assigned? Is it limited to basic administrative items such as author and date created, or does it include information that classifies or describes what the content is about?
– Does a process exist to identify sensitive content, including content with personally identifiable information (PII)?
– Who has responsibility for creating and updating metadata?
– Where is the metadata recorded?
– What standards or guidelines are followed in creating metadata?
– What is the quality of the metadata?
– How consistent is metadata across different repositories?
• Store/archive
– Where is content initially stored?
– Are there multiple locations or a single location?
– How is the content organized (e.g., folder hierarchies)?
– Is an archiving process in place for older or inactive content?
– Who manages each repository, and what processes or standards are in place regarding its use?
• Discover/access
– What are the different ways the content can be discovered and accessed (e.g., via a search tool, via a link on an intranet or external web page, or retrieval from a library)?
– How is access to the repository controlled? Is it open for general access, or is access granted to selected users?
– What types of access privileges are defined?
– What type of authentication is in place?
– What type of content indexing is in place, if any?
– What search tool(s) are in use?
– Do existing search engines allow for integration with a third-party search engine?
• Cull/dispose
– Is a formal or informal process in place to periodically review a body of content and remove items that are no longer needed?
– How often does culling/disposal occur, and who does it?
– Do any policies or guidelines drive this process?
– To what extent do duplicate copies or obsolete versions of content exist?

Answering these questions will provide a solid understanding of current practice. Based on this understanding, opportunities and constraints can be identified for improving information management practices, search, and metadata.

Chapter 5. Implementing Findability Improvements

Chapter 3 of this guide reviewed strategies that can be pursued to improve findability, and Chapter 4 covered the information gathering and analysis needed for planning improvements. This chapter provides a management framework for implementation. The framework is flexible, recognizing variations across agencies with respect to current capabilities, needs, and available resources. The first section provides a road map for improving findability, presenting an iterative series of steps that can be taken to make incremental progress. The second section covers creating an ongoing management structure for continued maintenance and improvement.

5.1 Establishing a Road Map for Improving Findability

The goal of improving information findability in a DOT may appear overly general and perhaps somewhat daunting, akin to trying to “boil the ocean.” Improving findability is a multifaceted problem that is best tackled through an ongoing series of activities rather than through a single project. Figure I-7 presents a road map that outlines steps for improving findability in an incremental fashion. Each step of this road map is described briefly in the sections that follow; further information can be found in later sections of this guide.

Figure I-7. Findability road map.

The basic approach in the findability road map is to begin with an agency-wide architectural vision for findability, then incrementally build toward that vision through a series of targeted improvements. Each improvement should focus on a well-defined area to make progress and demonstrate success within a relatively short timeframe (e.g., 3 to 6 months).

Step 1: Create an Architectural Vision for Findability

Creating an overall architecture and approach helps define foundational components that will be developed and managed centrally, which allows for better consistency and efficient re-use as each specific improvement is implemented. These components are:
• Shared information repositories. A small number of shared repositories are identified where content will be stored and managed. There is no need to have all of the agency's information resources in a single place, but limiting the number of different repositories will reduce the complexity of findability solutions.
• Enterprise search. One or more search tools (including text analytics) are selected, and an approach to searching across information repositories is developed.
• Standard metadata elements. Initial agreement is established on a core set of standard metadata elements that will be used to describe content. Doing this provides the consistency needed

to search across different repositories. Agreeing on a core set of metadata elements does not necessarily mean that the agency intends to create metadata for every piece of content; rather, it means that where metadata is created, it will utilize the core elements (a sketch of such a core element set follows at the end of this discussion).
• Common terminology. Agreement is established on an approach to building and maintaining a core base of common terminology (e.g., standard keywords and synonyms) that can be used to enhance information search and discovery.
• Master data. Agreement is established on an approach to building and maintaining centralized, authoritative data on fundamental agency entities (e.g., projects, employees, assets). Doing this provides the consistency needed to link or search across disparate information sources.
• Reference data. A protocol is established for managing and sharing common code lists for geographic areas, organizational units, asset types, project phases, project statuses, and so forth.

Creating an architectural vision for these common components allows each incremental improvement to be designed with an agency-wide perspective. For example, a findability initiative focused on project scoping may develop a set of categories of information needed for scoping and a particular set of keywords for search. These categories and keywords can later be re-used or extended for a future initiative related to project management. An agency-wide terminology management system can be implemented to store and manage the terms created in the project scoping initiative so that they are available for use in future initiatives. In effect, this leverages the work accomplished within the specific findability initiative to strengthen the agency-wide foundation for findability. Other agency-wide foundations that can be set up and enhanced over time include search tools, content management systems, information management policies, and information governance processes. Over time, each individual effort builds toward a unified agency-wide approach to organizing and finding information.
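To make the standard-metadata-elements component more concrete, the following Python sketch shows one way a core element set might be recorded so that repositories and search tools share a single definition. The element names, requiredness, and vocabulary references are illustrative assumptions, not a recommended standard.

```python
# Illustrative core metadata element definitions, shared across repositories.
CORE_ELEMENTS = {
    "title":        {"required": True,  "type": "text"},
    "author":       {"required": True,  "type": "text"},
    "date_created": {"required": True,  "type": "date"},
    "content_type": {"required": True,  "type": "term",
                     "vocabulary": "agency_content_types"},
    "subject":      {"required": False, "type": "term",
                     "vocabulary": "agency_subject_taxonomy"},
    "project_id":   {"required": False, "type": "id",
                     "master_source": "project_master_list"},
    "district":     {"required": False, "type": "code",
                     "code_list": "district_codes"},
}

def validate(record: dict) -> list[str]:
    """Return the names of required core elements missing from a record."""
    return [name for name, spec in CORE_ELEMENTS.items()
            if spec["required"] and not str(record.get(name, "")).strip()]

print(validate({"title": "I-90 resurfacing plan", "author": "Design Unit"}))
# ['date_created', 'content_type']
```

Keeping a definition like this in one shared location lets each repository add its own specialized elements while still guaranteeing that cross-repository search can rely on the core set.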

An initial vision for information findability ideally considers the perspectives of multiple areas and levels within the agency, including the:
• Agency leadership team
• Division managers and technical staff from core business units
• Research division and library
• Website management team
• Records management unit
• IT division
• GIS unit
• Data management unit
• Risk management unit

Executives can provide their vision of what information needs to be easily located and accessed, and how, in order to ensure effective core business functions. Representative staff from core business units (e.g., planning, engineering/design, construction, maintenance, and operations) can be asked to provide specific examples of how and where they search for information, what types of searches consume the most time, and how often they give up without finding what they need or proceed without looking for relevant information. Staff in information management roles (library, records management, GIS, IT) can be asked to provide their perceptions of issues based on the types of requests for information they receive. They also can assess how well current solutions for information storage and search are working, and what improvements are needed. External stakeholders can provide their views on what types of information should ideally be accessible through a search of the agency's external web page.

IT managers should be recognized as key players for improving findability in an agency and need to be involved from the beginning. Although implementing findability improvements is not exclusively a technology endeavor, it includes a strong IT component. IT staff will play critical roles in acquiring and configuring search tools, creating connectors from search tools to various information repositories, addressing complex security and access constraints, and managing network performance in support of indexing functions. They will need to be bought in and involved in strategic, operational, and tactical-level planning. Identifying hardware, software, security, and staffing constraints and needs up front is an important activity that will shape how the agency proceeds.

The Open Data movement is an increasingly important driver for sharing some types of agency information with partner agencies and other stakeholders, so it may be appropriate to address external findability of information as part of the vision. This part of the vision can be informed by additional perspectives from the agency's public affairs and communications offices, regional and metropolitan planning agencies, and representatives of the state legislature and the governor's office.

Step 2: Identify a Focus Area

Once a basic vision is in place, findability improvements can be implemented incrementally. The first step is to decide which focus area to tackle first. For example, an agency could decide to focus on:
• A core business function that is distributed across multiple offices (e.g., construction project scoping or maintenance management);
• An individual business unit (e.g., one responsible for research, policy, internal agency process improvement, audit, or strategic planning, or another work group whose effective functioning depends on rapid access to relevant information); or
• A type of content (e.g., high-value content such as design manuals, agency policies and procedures, or design drawings).

When selecting a focus area, it is useful to consider current or planned information systems initiatives that might serve as a motivating force or focal point for findability improvements.
Initiatives that can provide excellent opportunities for improvements to information findability include:
• New document, content, or project management system implementation, or a technology upgrade to an existing system (e.g., migration to a new version);
• Internal or external website upgrade/redesign;
• Data warehouse implementation;
• An enterprise architecture or IT application consolidation/rationalization study; and
• New records management initiatives.

Findability should be an important consideration in the selection and implementation of any of these initiatives. Keeping track of these initiatives and proactively working with the sponsors and project teams to ensure that each initiative incorporates elements of the overall findability vision allows the agency to leverage its ongoing technology investments to improve agency-wide findability.

Step 3: Conduct an Assessment

Once a focus area has been selected, an assessment can be conducted to provide an understanding of needs and a diagnosis of the barriers to information findability. This assessment is essential for identifying appropriate candidate improvements. An assessment should consider both the demand for and the supply of information. On the demand side, it is important to identify the major information needs and the current ways that information is discovered. Once the needs are established, an investigation of where information is stored and how it is managed is helpful for diagnosing what currently limits information findability and identifying what types of improvements are needed. The material in Chapter 4 of this guide provides a framework for investigating user needs, the information landscape, and life cycle management of information.

Step 4: Identify Candidate Improvements

Based on the results of the assessment, candidate improvements should be considered from each category of strategies discussed in Chapter 3 of this guide. Specifically, the agency may consider improving:
• Information management discipline. This category includes actions to practice effective information hygiene and manage information throughout its life cycle, from creation to disposal or long-term archiving. Information management discipline may include converting content (e.g., scanning paper content to a text-readable electronic form), building electronic indexes for paper content where scanning is impractical or not cost-effective, standardizing content formats and storage locations, and implementing workflows for content creation, tagging, updating, archiving, and removal. It also may include investments in new or upgraded information management systems.
• Search capabilities. This category includes actions to acquire, develop, deploy, configure, integrate, monitor, and tune the performance of search tools. These actions may include developing faceted navigation capabilities (a minimal sketch of faceted navigation appears after this list), implementing search capabilities that allow users to search across different repositories, integrating semantic resources into search, training users on search methods, and tuning relevancy rankings. (See Appendix C, Topic 1 for further information.)
• Metadata and terminology. This category involves standardizing and applying methods for classifying and describing content to facilitate discovery. Actions include defining standard content types; developing metadata standards (descriptions of content such as author, date, standard, or policy) for different content types; implementing manual, automated, or semi-automated approaches to content classification; and building terminology resources such as taxonomies or synonym lists to support classification and search. These activities may require investments in metadata repositories that support access to distributed content, and in tools for master data management, metadata management, taxonomy management, and text analytics.
A robust findability solution typically involves implementing several of these components simultaneously to make sure that (1) the desired set of content is available and searchable, and (2) it can be conveniently and reliably discovered. It is important to remember that simply purchasing new software rarely solves fundamental findability issues; rather, new software gives an organization the opportunity to develop resources and approaches that can dramatically improve findability, such as new metadata standards, auto-categorization features, or other information management policies.
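Because several of these strategies converge on faceted navigation (as flagged in the list above), a brief Python sketch may help. Given documents tagged with metadata, here hypothetical content_type and district facets, faceted navigation amounts to counting facet values within the current result set and filtering as the user selects values.

```python
from collections import Counter

# Hypothetical tagged documents; each metadata field can serve as a facet.
DOCS = [
    {"title": "Bridge 042 inspection", "content_type": "inspection report", "district": "D1"},
    {"title": "I-90 resurfacing plan", "content_type": "design plan", "district": "D1"},
    {"title": "Winter ops manual", "content_type": "manual", "district": "D2"},
]

def facet_counts(docs: list[dict], facet: str) -> Counter:
    """Count the values of one facet across a result set."""
    return Counter(d[facet] for d in docs)

def filter_by(docs: list[dict], **selected) -> list[dict]:
    """Narrow the result set to documents matching every selected facet value."""
    return [d for d in docs
            if all(d.get(f) == v for f, v in selected.items())]

print(facet_counts(DOCS, "district"))  # Counter({'D1': 2, 'D2': 1})
results = filter_by(DOCS, district="D1")
print([d["title"] for d in results])   # ['Bridge 042 inspection', 'I-90 resurfacing plan']
```

The sketch also makes the dependency explicit: faceted navigation is only as good as the metadata behind it, which is why the strategies in this list reinforce one another.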

One effective combination is to implement content management software, content workflow automation, text analytics software, and faceted navigation. For example:
• Content management software provides a platform for content creation and storage.
• A standard workflow is defined for publishing new content (e.g., with steps for conversion, formatting, review, and approvals).
• Text analytics software is integrated into the workflow and configured to suggest metadata for each new document at the point of publication (while allowing for correction and feedback from the author).
• The assigned metadata is used to power search and navigation capabilities.

Step 5: Implement Quick Wins

Some actions may be possible to implement with minimal resources and effort. These actions can be pursued as “quick wins” while a more comprehensive solution is being designed. For example, the assessment might find that people are having trouble finding information because documents have been scanned from paper as images but are not OCR enabled. Changing the document conversion process to include OCR conversion—and bulk-converting existing documents—would be an example of a quick win. As another example, the assessment might reveal a lack of user awareness about what search tools are available or what search techniques are most effective. User awareness can be addressed through a training program, leveraging materials that are already available, such as Transportation Research Circular E-C194 (TRB March 2015).

Step 6: Implement a Pilot

It is best to begin with a relatively small-scale pilot implementation in order to test effectiveness and demonstrate results. The pilot effort can be planned and executed as a scaled-down version of a solution (e.g., involving a limited set of content types, a limited set of information repositories, and a limited group of target users and/or business needs). It should begin with a well-defined set of expectations for what is to be achieved and include an evaluation of results. The evaluation can provide the basis for deciding whether to expand to a full implementation or to pursue an alternative type of solution. Demonstration of pilot benefits, either quantitative or qualitative, can help build the support needed for continued or expanded investment in staff time, hardware, and software.

Step 7: Expand and Formalize the Pilot

If the pilot is successful, a full implementation can be pursued. This can be planned in phases based on the level of effort required, the availability of resources, and the need to coordinate with other agency activities. The evaluation approach used in the pilot should be repeated after each phase to ensure that the desired results are being achieved and to communicate the value of the effort within the agency. Successes in one focus area can be used to make the case for initiating new findability initiatives in other areas.

5.2 Putting Management Functions and Processes in Place

Strategic Direction and Leadership

Improving information findability requires (1) investments in tools and staff time, (2) coordination across business units to provide an integrated and consistent approach, and (3) changes in employee information management behaviors. None of these is likely to happen without some involvement from the agency leadership team.

Thus, the first step in putting management functions and processes in place for information governance that improves findability is to get findability on the radar screen of agency leadership. It may be unrealistic to expect agency leaders to make information findability a constant or central area of focus, but it is important that they understand and support some key ideas:
• Findability directly impacts organizational efficiency. The faster employees can access timely and relevant information, the more productive they will be. If they cannot find information, employees will act without it (and potentially spend time reinventing the wheel or failing to follow correct, current policy and standards).
• Providing findable information requires more than providing a search tool. Findability improvements draw on a range of strategies for making sure that employees can find what they need, and search is only one part of the solution. The presence of a search tool on an agency's intranet does not necessarily mean that the information on that intranet is findable. Improving findability requires an understanding of what people are looking for and how best to make sure they can find it.
• Well-designed findability improvements pay for themselves. Modest investments can result in employee time savings and more effective decision making.
• Improving information findability is not a one-shot deal. Improving findability is an ongoing effort that involves not only technology but also work on policies, processes, and change management.
• Improving information findability can be done incrementally. Limited efforts can focus on areas where they will have the greatest payoff.
• Someone needs to “own” information findability. Without clear accountability and allocation of staff time, making real progress will be difficult.
• Improving information findability requires consistency across the organization in how information is managed. Achieving this consistency requires involvement and steady support from the senior management team.
• Management oversight is important to ensure that investments in findability improvements have a clear business case and the buy-in necessary to be successful.

Specific actions can be taken to strengthen strategic direction and leadership for information findability:
• Establish management buy-in for findability improvements.
– Create and deliver a presentation to the agency leadership team to enhance their understanding of the findability problem and the need for action. Include specific examples or stories about how difficulty in finding information quickly is impacting the agency.
– Conduct an employee survey to better understand how much time people currently spend searching for information and what types of information they are unable to find. Communicate the results to senior management.
– Draft a set of agency data and information management principles that stress the importance of ensuring information findability, and present them to senior management for endorsement.
• Integrate consideration of findability improvements into agency planning and budgeting.
– Organize a meeting of representatives from the different business units that could play a role in improving information findability (IT, data management, library, records management, the intranet manager, the engineering document management system owner, the collaboration system owner, etc.) to discuss common challenges and brainstorm how to improve collaboration and coordination.
– Identify a lead person to be charged with working to integrate actions to improve findability into agency business planning, strategic planning, and budgeting activities.
– Define an annual or semi-annual reporting process on the state of information findability at the agency, the progress and results of improvement efforts, and priority future needs.

• Establish a “home base” for decisions about findability improvements.
– Create an agency information governance board with responsibility for improving information findability (as one element of its charter), or incorporate this responsibility into the charter of an existing management group. The agency-wide governance board (or equivalent) would:
◦ Review and approve proposed agency policies for information management based on an understanding of the level of effort likely required to implement them and the nature of the behavior changes involved.
◦ Provide management support and liaison to facilitate policy implementation.
◦ Define and carry out escalation processes for addressing situations where policies are not being followed.
◦ Evaluate and prioritize agency-wide technology investments that support findability.
◦ Ensure that appropriate coordination is carried out across organizational units (e.g., when a new content management system is proposed, make sure that it does not duplicate the functions of existing systems and that it builds on existing processes and standards for metadata and classification).
• Assign roles and responsibilities.
– Define roles that are important for ensuring information findability and assign these roles to specific individuals with appropriate levels of resources. Specific roles are described below.
– Update employee job descriptions and performance criteria to highlight the fact that these are not “extra,” non-essential roles, but integral to the positions.

Agency Functions Supporting Findability

Several operational functions are essential to improving findability:
• Policy and standards functions involve ongoing decision making about standardization of practice. Decisions will need to be continually reviewed and adjusted based on feedback from managers and employees, new technology implementations, and organizational changes. Documentation of approval processes can include process maps, decision criteria, and RACI matrices identifying the roles that are responsible, accountable, consulted, and informed for each type of decision.
• Training and change management functions are critical to the implementation of many of the findability strategies discussed in this guide.
• Monitoring and maintenance functions involve day-to-day operational support and management of search and related tools; assignment of metadata through manual, semi-automated, or automated means; and management of data integration and synchronization processes. The availability of staff resources to manage, monitor, and improve search over time is an important success factor for a findability solution.

Responsibilities for each function need to be defined to identify who is accountable and what approval processes will be followed. Each function will require designation of an owner, resourcing with sufficient staff having appropriate expertise, and use of standard operating procedures.

Policy and Standards Functions
• Policy and guideline development. Developing and updating policies and guidelines that standardize information management practices, such as:
– What types of information should be searchable.
– Where different types of information are to be stored.
– Management of updates (e.g., version control and audit trails).
– Limitations on creating local copies of documents.
– File formats.
– File naming conventions.
– Information organization or classification conventions.

– Adherence to established data and metadata standards.
– Frequency of review and required retention periods.
• Data standards development. Developing and updating data standards to enable linkages across systems. These would include definition of specific data element meaning, format, domain (i.e., possible values), and identification of authoritative sources.
• Metadata standards development. Definition of standard metadata elements to be used for different types of content.
• Reference data management. Managing standard code lists for information used across multiple agency systems.
• Controlled vocabulary management. Developing and updating standard keywords, synonyms, taxonomies, and other semantic resources. Governance processes and standard operating procedures would cover:
– Who can suggest new terms,
– Who evaluates them and against what criteria,
– Who decides to add them,
– Who actually adds them, and
– How changes are documented and communicated.
• Software selection. Adding, upgrading, and sunsetting software that supports findability, including web content or document management systems, search tools, text analytics tools, metadata repository tools, and taxonomy management tools.

Training and Change Management Functions

• Change management. Working with stakeholders throughout the agency to implement changes in processes or use of new tools such as content management systems. Change management involves a well-orchestrated set of activities including extensive communication, training, and support. It includes obtaining feedback on how things are working, and using this feedback to make further improvements.
• Search skills building. Training and assisting end users in how and where to search.
• Information hygiene. Training and assisting employees to implement good information management practices (e.g., using links rather than making duplicate copies, following adopted file naming conventions, deleting old drafts, and populating required metadata).

Monitoring and Maintenance Functions

• Search monitoring and refinement. Reviewing search logs and available analytics regularly and using this information to identify and implement findability improvements, including, for example, adding to a set of Best Bets that direct search users to a particular link when they enter a given search term.
• Search interface review. Providing design review of the various search interfaces within the agency to ensure an effective and consistent search experience.
• Content tagging. Assigning metadata to new content (via manual, automated, or semi-automated methods) or providing quality assurance of metadata assigned by content creators.
• Data synchronization. Ensuring that the various systems in use are updated to utilize the current, authoritative version of controlled vocabulary and reference data.

Further information on specific roles related to enterprise search is shown in the text box labeled “Search Team Roles.” This information is adapted from a comprehensive book on enterprise search by White (2015). White notes that each role need not be a full-time job; a single individual may be able to perform multiple functions (e.g., search analytics manager and search information specialist), and many of the functions may be carried out by existing IT, web team, or library staff.
In addition to these roles, White recommends the appointment of search liaison specialists within individual business units who can be well trained in the use of search applications and can serve as the “eyes and ears” of the search team, providing feedback on search performance and improvement needs.

Expertise for Findability

Specialized skills are required to perform many of the functions described in the preceding sections. Although some of these skills can be learned on the job, it is important to have a core team (at least two to three individuals) who have received formal training in library and information sciences and who participate in ongoing professional activities related to information organization and retrieval. Specific necessary areas of knowledge and expertise include:
• Fundamentals of information retrieval methods and architectures.
• Information architecture, including user-information interaction patterns, information-seeking behaviors, usability, design of information organization, and navigation systems.
• Understanding of the differences between web search and enterprise search.
• Search user interface design.
• Information classification and taxonomy development and uses.
• Types and uses of metadata.
• Search engine components and mechanics, including crawlers and connectors.
• Understanding of relevancy ranking, search tuning, and search engine optimization.
• Understanding of indexing methods and use of inverted indexes.
• Familiarity with text analytics and machine learning methods and tools.
• Awareness of commercial and open source search and taxonomy management products and features and the ability to evaluate their applicability for specific purposes.

Search Team Roles
• Search Manager
– Lead individual responsible for search improvements
– Understands business uses of information and business language
– Maintains close relationship with search vendor for expeditious issue resolution
– Typically part of Internet/intranet team; may or may not be on IT staff
• Search Technology Manager
– IT role with both technical and managerial skills
– Responsible for server/network performance, crawling, schedules, load balancing, backup and recovery, information security, user access controls, and Application Programming Interface (API) management and documentation
– Participates in and oversees search application development
• Search Analytics Manager
– Monitors search results and diagnoses reasons for unsuccessful searches
– Conducts user surveys to assess user satisfaction; analyzes help desk inquiries
• Search Information Specialist
– Expertise in library and information science, metadata, taxonomy
– Technical leadership and oversight for metadata standards and taxonomy development
– Develops Best Bets
• Search Support Manager
– Responsible for user training and usability testing
– User communication (e.g., blog posts, wiki with “tips and tricks”)
Source: Adapted from White (2015)

Tracking Progress and Measuring Performance

Defining performance measures for findability is critical to an effective outcome: What exactly does success look like? How do managers track progress? It can be argued that success is achieved when people across the agency (or across a major department or division) can quickly find the information they need for their jobs and the cost of making content findable is low. Design of a progress and performance measurement approach for findability should consider multiple perspectives.
• Business unit or process perspective. Do the employees in a given business unit (or the participants in a given business process) have easy access to the information they need to succeed? (A possible performance measure is the level of manager and staff satisfaction with current information availability.)
• Organizational efficiency perspective. Do employees spend a disproportionate amount of time looking for information rather than analyzing and using information? (A possible performance measure is the average amount of time spent searching for information per employee.)
• Risk management perspective. If someone is seeking information about official agency policy or procedures, what are the chances that the individual will find (and use) an out-of-date version? (A possible performance measure is the percentage of a sample group of employees who can, when asked, retrieve the most current policy within 10 minutes.)
• Content utilization perspective. When looking for content that is managed (and indexed) in one of the agency’s repositories, do employees know where to look, and can they retrieve it in a timely fashion? (A possible performance measure is the percentage of a sample group of employees who are able to retrieve a named document within 10 minutes.)
• Specific content availability perspective. If someone is looking for information about a construction project completed within the last 10 years (e.g., an as-built plan or a final budget), how long would it take them to find it? (A possible performance measure is the percentage of a sample group of employees who are able to retrieve project information within 1 hour.)
• Search quality perspective. Do current search tools provide satisfactory levels of recall and precision? (Possible performance measures are [1] the percentage of items returned on the first page of search results that are relevant to the user’s need, and [2] the percentage of all relevant resources that are included on the first page.)
• Organizational change perspective. To what extent have employees adopted best practices for information management, classification, and metadata assignment? (A possible performance measure is the percentage of employees who have participated in information management training, or who have provided compliant metadata for inclusion in an information repository.)

These examples relate to the benefit side of findability. It is also important to consider the cost side. For example, an initiative might be deemed successful if it reduced the amount of time to create metadata for documents by replacing a manual approach with an automated or semi-automated approach. Measuring the success of a findability initiative can be challenging given the range of search types and the number of variables that influence the end result.
Variables include:
• Content availability
• Adequacy of metadata to support search
• User awareness of where to search
• User knowledge and skills related to search
• Availability and features of search tools

To obtain a good understanding of whether an initiative is working, it is typically necessary to select multiple measures. Table I-4 provides a summary of sample measures for judging the success of a findability initiative. Methods for gathering the necessary information to support each measure are noted in the table. It is important to focus on a vital few measures to keep measurement costs to a reasonable level and provide clear indications of progress for management.

Table I-4. Performance measures for findability initiatives.

Content Availability
• Amount or percentage of content of a specific type that is in electronic form (based on content analysis)
• Amount or percentage of content of a specific type that is text-searchable (based on content analysis)
• Amount or percentage of indexed target content

Search Time
• Amount of time users spend searching for information (based on surveys, interviews, or user logs)
• Average or median time to retrieve a known document (based on test)
• Average or median time to find the answer to a specific question that may require retrieval of multiple information resources

Search Precision and Relevance
• Percentage of relevant resources in first 25 results returned (based on test)
• Percentage of time a target appears in the first 25 results returned (based on test)
• User ratings of search quality (based on surveys)
• Percentage of the first 50 results that are outdated or duplicative

Search Success Rate
• Percentage of users able to retrieve documents within a threshold amount of time (based on test for both a standard set and a user-selected set of documents)

Search Convenience and Ease of Use
• User ratings of search tools (based on surveys or interviews/user stories)

Content Processing Effort
• Total or per-document cost of creating indexed, searchable content (stratified by document type)
• Total or per-document cost of metadata creation

Metadata Quality
• Percentage of documents with complete metadata (based on database query)
• Percentage of sampled documents with quality metadata (based on independent validation of existing metadata elements)

Information Management Practice Adoption
• Percentage of employees aware of information management policies and standards adopted by the agency
• Percentage of employees adhering to the policies and standards adopted by the agency
• Percentage of redundant and outdated content (based on analysis of a sample set of content)
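Several of the test-based measures in Table I-4 (e.g., percentage of relevant resources in the first 25 results returned) can be computed directly once test queries and judged-relevant items are assembled. The Python sketch below is a minimal illustration; the document identifiers and relevance judgments are hypothetical.

```python
# Minimal sketch: computing precision and recall at a cutoff k for one
# test query. All document IDs and relevance judgments are hypothetical.

def precision_at_k(returned_ids, relevant_ids, k=25):
    """Fraction of the first k returned items that are judged relevant."""
    top_k = returned_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in relevant_ids) / len(top_k)

def recall_at_k(returned_ids, relevant_ids, k=25):
    """Fraction of all judged-relevant items appearing in the first k results."""
    if not relevant_ids:
        return 0.0
    return len(set(returned_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical results for the test query "pavement preservation policy".
returned = ["doc-17", "doc-03", "doc-44", "doc-09", "doc-21"]
relevant = {"doc-03", "doc-09", "doc-30"}

print(precision_at_k(returned, relevant, k=5))  # 0.4 (2 of first 5 relevant)
print(recall_at_k(returned, relevant, k=5))     # 0.666... (2 of 3 found)
```

Scores like these averaged over a standard query set give a repeatable baseline for tracking search quality over time.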

References

AIIM (2008). Findability: The Art and Science of Making Information Easier to Find. AIIM: Silver Spring, MD.
AIIM (2014). AIIM Industry Watch. Retrieved July 20, 2016, from http://www.aiim.org/Resources/Research/Industry-Watches/2014/2014_Sept_Search-and-Discovery.
Alberta Government (2016, June). Document Naming Conventions. Retrieved July 9, 2016, from http://www.im.gov.ab.ca/documents/publications/DocumentNamingConventions.pdf.
Cleverley, P. H. (2015). The Best of Both Worlds: Highlighting the Synergies of Combining Manual and Automatic Knowledge Organization Methods to Improve Information Search and Discovery. Knowledge Organization, 42(6), pp. 428–444.
Coursera (2016). Mooc List. Retrieved from https://www.mooc-list.com/course/metadata-organizing-and-discovering-information-coursera?static=true.
Feldman, S. and Reynolds, H. (2010). Worldwide Search and Discovery 2009 Vendor Shares: An Update on Market Trends. IDC.
Garrett, J. (2007). Subject Headings in Full-Text Environments: The ECCO Experiment. College & Research Libraries, 68, pp. 69–81.
Gross, T., Taylor, A. G., and Joudrey, D. N. (2014). Data Spreadsheet for Still a Lot to Lose: The Role of Controlled Vocabulary in Keyword Searching. Library Faculty Publications, St. Cloud State University, St. Cloud, MN. Retrieved from http://repository.stcloudstate.edu/lrs_facpubs/39.
IDC (2001). The High Cost of Not Finding Information: An IDC White Paper. Framingham, MA: IDC.
IDC (2002). Quantifying Enterprise Search.
Minnesota Historical Society (n.d.). Electronic Records Management Guidelines. Retrieved August 18, 2016, from http://www.mnhs.org/preserve/records/electronicrecords/erfnaming.php.
Morville, P. (2005). Ambient Findability: What We Find Changes Who We Become. O’Reilly Media, Inc.
Morville, P., and Callender, J. (2010). Search Patterns: Design for Discovery. O’Reilly Media, Inc.
NIST (2016, March). Electronic File Organization Tips. Retrieved July 8, 2016, from http://www.nist.gov/pml/wmd/labmetrology/upload/ElectronicFileOrganizationTips-2016-03.pdf.
Pomerantz, J. (2015). Metadata. Cambridge, MA: MIT Press.
Reamy, T. (2016). Deep Text: Using Text Analytics to Conquer Information Overload, Get Real Value from Social Media, and Add Bigger Text to Big Data. Information Today, Inc.
Stocker, A. M. (2014). Is Enterprise Search Useful at All? Lessons Learned from Studying User Behavior. Retrieved from http://doi.acm.org/10.1145/2637748.2638425.
TRB (n.d.). Transportation Research Thesaurus. Transportation Research Board of the National Academies of Sciences, Engineering, and Medicine. Retrieved July 25, 2016, from http://trt.trb.org/trt.asp.
TRB (2015, March). Literature Searches and Literature Reviews for Transportation Research Projects: How to Search, Where to Search, and How to Put It All Together: Current Practices. Transportation Research Circular E-C194. Transportation Research Board of the National Academies. Retrieved from http://www.trb.org/Main/Blurbs/172271.aspx.
U.S.DOT (2013). Highway Functional Classification Concepts, Criteria and Procedures. Available at https://www.fhwa.dot.gov/planning/processes/statewide/related/highway_functional_classifications/.
White, M. (2015). Enterprise Search: Enhancing Business Performance. O’Reilly Media, Inc.
Winkler, A. (2014). “Enterprise Information Taxonomy Project—Transportation Asset Project Outbrief.” Presentation prepared by Alexandra Winkler (advisor Dr. Denise Bedford), Kent State University, December 8, 2014.

Abbreviations

AIIM  Association for Information and Image Management
ANSI  American National Standards Institute
BI  Business Intelligence
CADD  Computer-Aided Design and Drafting
DBMS  Database Management System
DCMI  Dublin Core Metadata Initiative
DOT  Department of Transportation
ECM  Enterprise Content Management System
FGDC  Federal Geographic Data Committee
FOIA  Freedom of Information Act
GEC  General Engineering Consultant
GIS  Geographic Information System
GPO  Government Printing Office
GSA  Google Search Appliance
HOV  High Occupancy Vehicle
HQ  Headquarters
HR  Human Resources
HTML  Hypertext Markup Language
ISO  International Organization for Standardization
IT  Information Technology
ITS  Intelligent Transportation Systems
KDOT  Kansas Department of Transportation
LCSH  Library of Congress Subject Headings
LiDAR  Light Detection and Ranging
MeSH  Medical Subject Headings
MMIS/SES  Maintenance Management Information System/Single Entry Screen
NIEM  National Information Exchange Model
NLM  National Library of Medicine
NISO  National Information Standards Organization
OCLC  Online Computer Library Center
OCR  Optical Character Recognition
OR  Operation Region
PDF  Portable Document Format
PO  Purchase Order
PS&E  Plans, Specifications, and Estimates
RACI  Responsible, Accountable, Consulted, Informed
ROT  Redundant, Outdated, and Trivial
ROW  Right-of-Way

SaaS  Software as a Service
SBA  Search-Based Application
SEO  Search Engine Optimization
SQL  Structured Query Language
TRT  Transportation Research Thesaurus
UTP  Unified Transportation Plan

Appendix A
Example Improvement Initiatives

Improvement 1: Focus on Findability of Construction Project Information

The Situation
DOT X has implemented a content management solution for its design drawings, but other content related to construction projects is created and managed at the district level, and there is no single central repository for this information. Practices vary with respect to what content is captured electronically, what file formats are used, where content is stored, and how content is organized. Districts use a combination of local hard drives, shared file drives, collaboration team sites, and physical file cabinets to store their content.

The Problem
When there are damage claims or public information requests, it is time-consuming to dig up the necessary background information – and the required information may not be available. When there is staff turnover, newer staff have difficulty finding project files. The central Project Management Office would like to be able to review documentation maintained across projects statewide as part of an effort to streamline and improve requirements. However, the time needed to gather existing documentation limits their ability to do this.

The Goal
Ensure timely agency-wide access to information related to both active and completed construction projects.

The Improvement Initiative
Implement information management policies that ensure consistency across the state in management of construction project-related content. Implement search capabilities that facilitate access to project content. Utilize a text analytics tool to identify standard search terms and automate the process of metadata assignment to content.

Planning the Improvement
Survey the Information Landscape:
- Identify types of content to be included in the effort:
  • Design and as-built plans

  • Contracts
  • Change orders
  • Invoices
  • Inspection reports/notes/photographs
  • Materials test results
  • Emails
- Map out existing repositories with construction project-related information:
  • Collaboration software team sites
  • Engineering file repository – design plans
  • Capital program management application – status, funding, schedule
  • File cabinets/desk drawers
  • District websites

Understand Findability Needs of Different Users:
- Conduct focus groups with construction project managers in each district to identify their information retrieval needs and level of satisfaction.
- Conduct focus groups with maintenance personnel to identify their needs for construction project information.
- Review logs of damage claims and public information requests and follow up with interviews to document the search process used to respond, and the amount of time spent.
- Interview Project Management Office staff to understand their information retrieval needs.

Document the Information Management Life Cycle:
- Prepare a chart indicating life cycle steps for each content type (when is content created and by whom, when is metadata created and by whom, when and where is content stored and by whom, when is content updated and by whom, when is content archived or deleted and by whom).
- Identify existing standards and tools governing and supporting the information management life cycle.

Implementing the Improvement
Design and Implement the Findability Solution:
- Map out “use cases” for discovery of construction project content with:
  • Information users
  • Search engines and interfaces
  • Text analytics tools
  • Information repositories
  • Classification and metadata
  • Information producers
- Identify specific tasks for implementing the solution.

Solution Element – Implementation Activities

Information Governance
- Identify required and optional information to be stored and managed for construction projects.
- Develop standard file naming conventions for each content type.
- Create governance policies for management of construction project-related content.

Terminology Resources
- Develop taxonomy of construction project content types (e.g., inspector logs, change orders, as-built plans); review with users and finalize.

Metadata and Tagging
- Identify minimum required metadata elements to be associated with each content type, based on search needs (e.g., project number, district number, content format, content type, content creator, date created, date updated).
- Develop rules for auto-categorization of content types (e.g., based on a combination of file format, document name, and key words within the document).
- Develop rules for auto-population of other metadata elements.

Tool Evaluation and Acquisition
- Evaluate and procure text analytics software to support auto-categorization.
- Evaluate federated search options for an integrated project content search across team collaboration, engineering document repository, and the Program Management application database.

Tool Configuration/Integration
- Design and implement a standard template and workflow for management of selected content including metadata assignment.
- Integrate document management workflow with text analytics software; implement and test metadata population rules.
- Configure the federated search: develop crosswalks across metadata elements and build connectors to each repository. Test and roll out federated search.

Content Conversion
- Determine initial target for conversion of legacy content and complete conversion effort.
- Process legacy content through auto-categorization function to assign metadata and manually review/adjust results.

Documentation and Training
- Develop training materials and deliver training to content creators and users.

Improvement 2: Focus on Findability of Critical Corporate Documents

The Situation
DOT Y maintains policies and procedures at the agency-wide level and at the department levels. Design manuals are maintained by separate offices (e.g., bridge, traffic engineering, pavement). Agency-wide and departmental policies, procedures, and design manuals are issued and updated using separate processes. Policies, procedures, and design manuals are posted on the agency’s intranet site. Each department or office is responsible for posting its own policies, procedures, and manuals on its own intranet page.

The Problem
Different (older) versions of the policies, procedures, and manuals are in use – in both print and electronic forms. Some of these older versions have been downloaded and are stored on individual employee hard drives; some have been posted to regional office web pages. While some departments issue email notifications when new versions of documents are posted, others do not. Even when emails are sent, they are not necessarily forwarded to all the employees who need to know that an update exists.

The Goal
Ensure that all employees are using the most up-to-date versions of policies, procedures, and design manuals.

The Improvement Initiative
Implement a single location for retrieval of current corporate documents. Ensure that a search on the agency’s intranet site returns current versions of the appropriate document(s). Implement standard procedures for updates to corporate documents. Communicate changes to how corporate documents are managed.

Planning the Improvement
Survey the Information Landscape:
- Identify types of content to be included in the effort:
  • Operations and Engineering Policies and Manuals – Road Design, Bridge Design, Bridge Inspection, Traffic Engineering, Maintenance, Access Management, Utilities Coordination, Geotechnical, Hydraulic, Landscape, Project Development, Right-of-Way
  • Administrative Policies and Manuals – Business Manual, Risk Management Policy, Records-Retention Policy, Legal and Litigation Hold Policy
  • Human Resources Policies and Manuals – Sick Leave Policy, Workers’ Compensation Policy, Ethics Code, Family Medical Leave, Affirmative Action
  • Financial Management Policies and Manuals – Invoicing, Revenue and Accounts Receivable, Advance Construction, Debt Management, Cash Balance, Local Assistance
  • IT Policies and Manuals – System Access Management, Data Protection, System Procurement, System Development Life Cycle
- Map out existing repositories that store current policies, procedures, and manuals:
  • Intranet web pages – departmental, regional
  • External website pages
  • Collaboration software team sites
  • Shared network drives
  • File cabinets/desk drawers

Understand Findability Needs of Different Users:
- Conduct focus groups with a sample of employees in central and field offices to assess their understanding of how to access updated policies and manuals.
- Conduct interviews with a sample of managers to understand the extent to which out-of-date materials are being used, and associated risks; gather illustrative examples.

Document the Information Management Life Cycle:
- For key offices responsible for policy and manual updates, map out the current process for issuing a new policy/manual, updating an existing policy/manual, and discontinuing an existing policy/manual, including where the update is stored, where links are posted, and what communication occurs.

Implementing the Improvement
Design and Implement the Findability Solution:
- Map out “use cases” for discovery of current policies, procedures, and manuals with:
  • Information users

  • Search engines and interfaces
  • Information repositories
  • Classification and metadata
  • Information producers
- Identify specific tasks for implementing the solution.

Solution Element – Implementation Activities

Information Governance
- Identify and document which types of policy and procedural documents are to be governed centrally.
- Create governance policies including the process to be followed for updates, storage location for current versions, metadata requirements, and review procedures.
- Create standard procedures for linking to policy and procedure documents that ensure links will always refer to current versions.
- Identify accountability for maintenance and execution of the governance policies.

Terminology Resources
- Develop high-level classifications for policy and procedure documents based on subject (e.g., engineering, HR, IT), type (e.g., manual, policy, guideline), and status (e.g., active, pending, superseded).

Metadata and Tagging
- Define minimum required metadata elements to be associated with policy and procedure documents, based on search needs (e.g., category, type, status, content creator, date created, date updated).
- Document process and responsibilities for tagging of documents going forward.

Tool Evaluation and Acquisition
- No new tools to be used.

Tool Configuration/Integration
- Design policy and procedure search capability on the agency internal website (e.g., a main menu item called “Policies and Procedures,” a specialized search page with facets for subject, status, and type, and keyword search).

Content Conversion
- Identify responsibilities and resources for tagging of existing documents, and set a timeline for completion.
- Conduct periodic quality assurance on metadata.

Documentation and Training
- Develop training materials for content creators on update procedures.
- Communicate new process to employees through targeted emails, website announcements, and presentations.

Improvement 3: Focus on Findability of Information for Critical Job Functions

The Situation
The agency is experiencing substantial turnover in key positions. Each employee develops their own approach to storing and accessing information of interest.

The Problem
Employees transitioning into new positions require considerable time to learn what people, documents, and data sets are available as key resources.

The Goal
Standardize information organization and management for key employees so that new employees moving into their positions do not need to spend time discovering and organizing information resources necessary for their jobs.

The Improvement Initiative
Implement a series of standard “home pages” that provide access to people, documents, and data sets of interest for particular job functions.

Planning the Improvement
Understand Findability Needs for the selected job function(s):
- Interview people currently holding the positions (and their managers) to identify what information is important for their job, where it is stored, and how they access and use it.

Survey the Information Landscape:
- Identify types of content to be included in the effort:
  • Expertise directory (internal staff and potentially external resources)
  • Software tools and applications
  • Data sets
  • Policy and procedure documents
  • Business process documentation
  • Reference documents
  • Training materials

  • Meeting notes
  • Document templates (e.g., standard letters, forms)
- Map these content types to existing information repositories/sources:
  • Employee directory
  • Internal web content management system
  • External websites or databases, RSS feeds, bookmarks
  • Shared network drives
  • Document management system
  • Learning management system
  • IT servers

Document the Information Management Life Cycle:
- For internally produced content types of relevance, document the existing processes for updates and metadata assignment. A key objective of this activity is to determine how the findability solution can leverage available metadata and to understand requirements for keeping the job-specific “home page” links updated as new content is created or existing content is moved or updated.

Implementing the Improvement
Design and Implement the Findability Solution:
- Design the organization of a personalized “My Job” web page (e.g., major categories, subcategories, and options for search and navigation). For example, major categories might be people, documents, applications, and data. Subcategories for documents might include presentations, spreadsheets, policies, manuals, meeting notes, and reports. Search options might include date range, source, and keyword.
- Develop an agency expertise directory (if one does not already exist).
- Develop a strategy for linking appropriate content to the page. Some of this could be automated by mapping standard metadata for different content types to the My Job organization scheme. A manual process could also be built in so that people holding the position could add links and assign keywords (from a controlled vocabulary or free-form).
- Identify specific tasks for implementing the solution.

Solution Element – Implementation Activities

Information Governance
- Identify ownership and responsibilities for developing and maintaining the page, including keeping links up to date as content changes or moves location.
- Identify responsibilities and process for initiating and executing changes to the overall structure or update methods.

Terminology Resources
- Develop the categories and standard keywords to be used for information organization and search. Leverage existing taxonomies (if available).

Metadata and Tagging
- Develop criteria and process to add relevant content to the index for a particular job based on storage location, file type, and potentially subject category.
- For job functions occupied by multiple individuals (e.g., district maintenance engineer), consider integration of social metadata including rankings and comments.
- Consider need for changes to overall metadata standards and tagging processes for particular content types in order to facilitate automated updating of relevant content for different My Job pages. As appropriate, develop rules and processes for metadata assignment.

Tool Evaluation and Acquisition
- Define requirements for content indexing, metadata harvesting, content tagging (manual and automated), and search and navigation features.
- Identify which requirements can be met with off-the-shelf tools and which will need to be met via custom coding. Acquire new tools as needed.

Tool Configuration/Integration
- Configure tools and perform customization and integration to produce the My Job web page. Implement the features required to support search/navigation, manual and automated tagging, and metadata harvesting.

Content Conversion
- Determine needs for converting physical documents to digital form, and perform conversions as appropriate.
- Assign metadata to converted content through a manual or semi-automated process.

Documentation and Training
- Develop training materials for adding content, deleting content, updating metadata for content, and navigating/searching for content.
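A recurring element in all three initiatives above is rule-based auto-categorization under “Metadata and Tagging” (e.g., assigning a content type based on file format, document name, and key words). The Python sketch below illustrates the idea under simplified assumptions; the naming conventions, rules, and metadata fields are hypothetical, and an agency would typically implement this within a text analytics or content management tool rather than a standalone script.

```python
import re

# Hypothetical auto-categorization rules keyed on file naming conventions.
# Each rule maps a pattern in the file name to a content type tag.
RULES = [
    (re.compile(r"as[-_]?built", re.IGNORECASE), "as-built plan"),
    (re.compile(r"change[-_]?order", re.IGNORECASE), "change order"),
    (re.compile(r"inspection", re.IGNORECASE), "inspection report"),
]

def auto_tag(file_name):
    """Assign a content type, or flag the file for manual review."""
    for pattern, content_type in RULES:
        if pattern.search(file_name):
            return {"file_name": file_name, "content_type": content_type}
    return {"file_name": file_name, "content_type": None}  # manual review

for name in ["P1021_As-Built_2016.pdf", "P1021_change_order_07.docx", "notes.txt"]:
    print(auto_tag(name))
```

Manual review and adjustment of results, as called for in Improvement 1’s content conversion activities, remains important because rules of this kind inevitably misclassify some content.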

Appendix B
Glossary

This glossary draws on the following sources:
AIIM – Association for Information and Image Management Glossary: http://www.aiim.org/community/wiki/view/glossary
IRMT – International Records Management Trust (IRMT) Glossary of Terms: http://www.irmt.org/documents/educ_training/term%20modules/IRMT%20TERM%20Glossary%20of%20Terms.pdf
ANSI/NISO Z39.19 – Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies (2005), ISBN 1-880124-65-3, pp. 157–167: http://www.niso.org/apps/group_public/download.php/12591/z39-19-2005r2010.pdf
SAA – Society of American Archivists Glossary: http://www2.archivists.org/glossary
OMB Circular A-130: http://www.whitehouse.gov/omb/circulars_a130_a130trans4/
W3C – W3C Data Catalog Vocabulary: http://www.w3.org/TR/vocab-dcat/#class--dataset

Underlined text is used to facilitate cross referencing.

Analytics. Techniques for transforming data into information to provide insights into current conditions and/or likely implications of potential future actions.

Best Bets. Manually created lists of content objects to be returned in response to common search queries in order to improve search results.

Boolean search. A type of search based on a specification of a set of logical conditions connected with Boolean operators such as AND, NOT, and OR. Example: find all documents tagged as “manual” or “policy” that were published after 2014 and have a status of “approved.”

Catalog. An organized, searchable, annotated list of content objects in a collection. Example: the National Transportation Library Catalog.

Content. Information that has been packaged in a format suitable for retrieval, re-use, and publication. Content includes documents, data sets, web pages, image files, email, social media posts, video files, audio files, and other rich media assets. (Source: Adapted from AIIM)

Content management. The process of establishing policies, systems, and procedures in an organization in order to oversee the systematic creation, organization, access, and use of content. Content management is a subset of information management. (Source: Adapted from IRMT)

Content object. An individual unit of content that may be described for inclusion in an information retrieval system, website, or other information source. A content object can itself be made up of content objects (e.g., both a website and an individual web page; a journal and an article in the journal). A content object may also include metadata. (Source: Adapted from ANSI/NISO Z39.19)

Controlled vocabulary. A list of terms that have been enumerated explicitly. This list is controlled by and available from a controlled vocabulary registration authority. Example: Library of Congress Subject Headings. (Source: Adapted from ANSI/NISO Z39.19)

Data. Representation of observations, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or computers. Examples: a crash record; pavement roughness measurements. (Source: Adapted from AIIM)

Data management. A subset of information management that is concerned with management of structured data.

Data set. A collection of data made available for access or download in one or more formats. Examples: a state’s crash records for a single year; a database with roughness measures for pavement segments on the state highway system. (Source: Adapted from W3C)

Digital curation. Selection, preservation, maintenance, collection, and archiving of digital content objects.

Digital repository. An electronic information system in which digital content objects are stored, managed, and made available for retrieval.

Dimension. Used in the context of data warehouse design, a dimension is a way to categorize and describe facts and measures to support data exploration. For example, dimensions for a set of agency expenditures might include purpose, year, and organizational unit.

Discovery. A process of information exploration that is not initiated with or driven by a specific search objective or target.

Document. Recorded data or information, fixed in any media, which can be treated as a self-contained unit. May consist of one or more content objects. Examples: a strategic highway safety plan; a DOT transportation asset management plan. (Source: Adapted from AIIM and SAA)

Document management. Techniques that ensure that documents are properly distributed, used, stored, retrieved, protected, and preserved according to established policies and procedures. Document management systems typically include capabilities for storage, retrieval, check-in/check-out, version control, and maintenance of audit trails for changes made. Document management is a subset of content management, and is typically concerned with management of stand-alone documents (e.g., reports, presentations, and spreadsheets) rather than more atomic content objects such as images, social media posts, links, or web pages. (Source: Adapted from SAA)

Electronic discovery (e-discovery). A process in which electronic data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case.

Enterprise collaboration platform. A software application that supports workgroup collaboration, providing a mechanism for interactive review, discussion, and modification of shared content such as technical documents, links, event calendars, and meeting minutes. It may also provide automated workflow capabilities.

Enterprise search. The practice of identifying and enabling specific content across the enterprise to be indexed, searched, and displayed to authorized users. (Source: AIIM)

Faceted classification. A system for organizing content into categories based on a systematic combination of mutually exclusive and collectively exhaustive characteristics of the materials (facets) and displaying the characteristics in a manner that shows their relationships. (Source: Adapted from SAA)

Faceted navigation. Technique for accessing content based on a faceted classification system. Faceted navigation is commonly used for e-commerce websites.

Federated search. Simultaneous search of multiple online databases. (Source: AIIM)

Findability. The degree to which relevant information is easy to find when needed; findability is improved through application of metadata, taxonomies and other organizing tools, and search technologies. (Source: Adapted from AIIM)

Index. List of the contents of a file, document, or collection of content objects together with keys or references for locating the contents. (Source: Adapted from AIIM)

Indexing. A method by which terms or subject headings are selected by a human or computer to represent the concepts in or attributes of a content object. (Source: Adapted from ANSI/NISO Z39.19)

Information. Presentation of data to facilitate interpretation or understanding; may include textual, numerical, graphic, cartographic, narrative, or audiovisual forms. Examples: map of high crash locations; trend line showing changes in pavement roughness over time. (Source: Adapted from AIIM and OMB Circular A-130. Note: The term information is frequently used to refer generally to both raw data and processed or packaged data.)

Information governance. The accountability for the management of an organization’s information assets in order to achieve its business purposes and compliance with any relevant legislation, regulation, and business practice. Includes data governance, which focuses on governance of structured data. (Source: Adapted from AIIM)

Information life cycle. The stages through which information passes, typically characterized as creation or collection, processing, dissemination, use, storage, and disposition. (Source: OMB Circular A-130)

Information management. The means by which an organization (e.g., a DOT) efficiently plans, collects, creates, organizes, uses, controls, stores, disseminates, and disposes of information and ensures that the value of that information is understood and fully exploited. (Note: Information management encompasses content management, data management, and digital curation but is broader in scope.)

Information repository. See digital repository.

Information resource. See content object.

Information resource management. Principles and techniques to oversee and administer the creation, use, access, and preservation of information in an organization, founded on the belief that information is an asset comparable to financial, human, and physical resources. Similar in concept to information management; included here given use of this term in OMB Circular A-130. (Source: Adapted from SAA)

Keyword. One of a small set of words used to characterize the contents of a document for use in retrieval systems. May also be referred to as a “tag.” (Source: Adapted from SAA)

Knowledge. The basis for a person’s ability to take effective action or make an effective decision, built over time through education, work experience, and interactions. Examples: a safety professional’s understanding of what countermeasures would be appropriate in different situations; a pavement engineer’s understanding of the underlying causes for pavement deterioration in a given location.

Knowledge attrition profile. A workforce analysis output used to understand potential future knowledge gaps in an organization. The profile (1) identifies key employees in each business area (i.e., those with specialized or unique expertise), and (2) estimates the likelihood of these individuals retiring within the next 2, 5, and 10 years.

Knowledge management. An umbrella term for a variety of techniques for building, leveraging, and sustaining the know-how and experience of an organization’s employees.

Knowledge transfer. Techniques for disseminating tacit knowledge and explicit knowledge across individuals and/or work units.

Master data. Shared data about the core entities of an enterprise. In a private company, examples of core entities are customers, products, and vendors; in a DOT, examples of core entities are routes, projects, funding sources, and district offices.

Metadata. Data describing context, content, and structure of documents and records, and the management of such documents and records through time. Literally, data about data. (Source: Adapted from AIIM/ISO 15489)

Ontology. A type of controlled vocabulary that describes objects and the relations between them in a formal way and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. For example,

an ontology might define a relationship called “is a structural member of” to describe the structural elements of a bridge (e.g., trusses) and distinguish these from non-structural elements (e.g., railings). (Source: Adapted from AIIM)

Portal. An entry point, especially a web page, that provides access to information from a variety of sources and that offers a variety of services. (Source: SAA)

Precision. In the context of information retrieval, precision is a measure of how relevant the returned results are to the user’s query. It is calculated as the fraction of items returned from a search that are relevant to the user’s search query.

Recall. In the context of information retrieval, recall is a measure of a search engine’s ability to locate all of the relevant results that are available. It is calculated as the fraction of all relevant items that were returned from a search.

Record. Data or information in a fixed form that is created or received in the course of individual or institutional activity and set aside (preserved) as evidence of that activity for future reference. Records may include paper documents, digital documents, data sets, emails, and other content types. (Source: Adapted from SAA)

Records management. The systematic and administrative control of records throughout their life cycle to ensure efficiency and economy in their creation, use, handling, control, maintenance, and disposition. Similar to document management, but focused on documents that have been designated as official records, with an emphasis on legal, regulatory, and risk management concerns. (Source: Adapted from SAA)

Reference data. Data used to organize and categorize information, consisting of code tables and other shared lists of values.

Search-based application. A specialized application developed to support a specific business process or task that features search as a central component. These applications may bring together information from multiple information repositories.

Search engine. A coordinated set of programs for spidering, indexing, and querying content available on the World Wide Web. The spidering program “crawls” the web and creates a list of available pages using the hypertext links available on each page. The indexing program creates indices based on the words and phrases included in each content object. The query program accepts a search request and returns a set of matching results from an index, sorted using an algorithm that seeks to present the results that will be most relevant to the user based on factors including match with search term, currency, geographic location, source authority, etc.

Search interface. A user interface that provides a mechanism for users to specify their search query, refine their results set, and navigate to results of interest.

Semantic resources. Synonym rings, taxonomies, thesauri, ontologies, and other resources that can be used for classifying and tagging content.

Semi-structured data. Non-tabular data that include tags or other structural elements to represent relationships among elements but do not conform to a predictable model. Examples: XML file, social media post.

Spider. A computer program that scans the World Wide Web, following links on each page to identify new sites.

Structured data. Data that conform to a predefined data model, typically structured as a series of columns (fields) and rows (records), and stored in relational databases, spreadsheets, or flat files.

Synonym ring. A group of terms that are considered to be equivalent, typically in the form of a flat (non-hierarchical) list; synonym rings are used to improve search results.

Taxonomy. A type of controlled vocabulary, consisting of categories and subcategories, that is used for classifying information. (Source: Adapted from AIIM)

Text analytics. Techniques that utilize software and semantic resources to add structure to text-based content objects (text files, Word documents, websites, etc.). The main capabilities of text analytics include text mining, sentiment analysis, entity or noun phrase extraction, auto-summarization, and auto-categorization.

Thesaurus. A type of controlled vocabulary consisting of terms linked together by semantic, hierarchical (i.e., parent-child), associative (i.e., related), or equivalence (i.e., synonymous) relationships. Such a tool acts as a guide to allocating classification terms to individual records. (Source: Adapted from ISO/TR 15489-2:2001)

Unstructured data. Data that do not conform to any predefined organization, sequence, or type. Examples: text, video, sound, images.

Web content management. Processes and tools for creating, updating, and maintaining website content including text, images, links, and forms.
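To make a few of these definitions concrete, the short Python sketch below shows how a synonym ring can expand a user’s query term into its equivalents before a search is executed; the rings shown are illustrative examples only.

```python
# Hypothetical synonym rings: flat sets of terms treated as equivalent.
SYNONYM_RINGS = [
    {"crash", "collision", "traffic accident"},
    {"pavement", "roadway surface"},
]

def expand_term(term):
    """Return the term together with any equivalents from its ring."""
    for ring in SYNONYM_RINGS:
        if term in ring:
            return ring
    return {term}

print(expand_term("crash"))  # {'crash', 'collision', 'traffic accident'}
```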

Appendix C
Special Topics

This appendix provides more detailed coverage of several topic areas related to search, metadata and terminology, and text analytics, supplementing the main body of the guide.

Topic 1: Search

Search Engines: The Basics

Search engines are an essential technology supporting digital information retrieval. In order to understand how to improve search, it is helpful to review the basics of how search engines work and the differences between Internet search and search within an organization (a.k.a. “enterprise search”). Search engines include the following functions (a minimal code sketch of the indexing and query execution functions follows below):
• Gathering content – identifying available content through crawling web pages, scanning file servers, or obtaining updates from applications.
• Indexing content – creating large data structures that provide an index to the available content, identifying which information resources contain different terms, along with additional information such as the number of instances of these terms, and their prominence and location.
• Formulating queries – providing search interfaces for accepting a user’s search question; these may include features such as auto-complete or spelling correction, identification of variants (e.g., paving, pavement, paver), synonyms (e.g., crash, collision, traffic accident), user filtering options based on date range, location, or content type, and Boolean search options (and, or).
• Executing queries – carrying out a search to identify a results set that meets the query criteria.
• Ranking results – ranking the results by relevance. The relevance rankings are critical to achieving quality results, since each search can return millions of pages.
• Presenting results – displaying the results in one or more lists, with different options for filtering, sorting, or additional refinement.

Internet Search

Early Internet search algorithms ranked results based on factors including the number of matching search terms in the web page, the number of times the keyword was used, and its location on the page (title, subtitle, meta-tag). In 2000, Google revolutionized Internet search with its breakthrough PageRank algorithm. This algorithm relies on links to, from, and within particular documents to calculate relevance rankings. This allows the search engine to return the most popular set of results as indicated by the number of popular links.
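As referenced above, the indexing and query-execution functions can be illustrated with a toy inverted index, sketched below in Python. The documents are hypothetical, and a real engine adds stemming, synonym handling, relevance ranking, and much more.

```python
from collections import defaultdict

# Toy corpus: document ID -> text. A real engine gathers this content by
# crawling web pages, scanning file servers, or pulling from applications.
docs = {
    1: "pavement preservation policy manual",
    2: "bridge inspection manual",
    3: "pavement roughness data report",
}

# Indexing: map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Executing a query: a Boolean AND is an intersection of posting sets.
def search_all(*terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search_all("pavement", "manual"))  # {1}
```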

Since that initial breakthrough, Internet search has continued to improve. New features include refinement of relevance ranking based on user search history, and use of more advanced natural language processing techniques to better discern what the user is seeking.

Federated Search

Federated search is the capability to simultaneously search multiple repositories with a single search request and present results in a helpful format to the user. While the terms “enterprise search” and “federated search” are sometimes used interchangeably, federated search generally involves a brokerage approach in which each repository included in the search has its own index and search engine. The federated search tool sends a query to the various repositories and then aggregates the results for the user. Enterprise search tools may support federated search, but they also include a “standard search” capability to create and maintain their own index for content stored in multiple repositories. These concepts are illustrated in Figures I-C-1 and I-C-2.

Figure I-C-1. Standard search. (Diagram: a search interface sends a query to a single master index built from a file server, email server, and intranet site, and returns results.)

Figure I-C-2. Federated search. (Diagram: a search interface sends a query to a search broker, which queries the separate indexes of a document management system, web CMS, and enterprise search engine and merges the returned results.)

Federated search capabilities generally include features to combine and sort results (e.g., by source, date, type, or relevancy) and to remove duplicate results. Federated search can be tailored to target a selected set of authoritative sources, ensuring high quality results. It can also be configured to tap into information repositories that cannot be accessed using standard Internet search engines.
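The brokerage pattern of Figure I-C-2 reduces to a simple loop: forward the query to each repository’s own search service, then merge, de-duplicate, and sort the combined results. The sketch below uses hypothetical stand-in functions for the repository search services.

```python
# Hypothetical repository search services, each returning
# (document ID, title, relevancy score) tuples from its own index.
def search_document_mgmt(query):
    return [("dm-12", "Bridge Design Manual", 0.9)]

def search_web_cms(query):
    return [("web-7", "Bridge Design Manual", 0.8),
            ("web-9", "Load Rating FAQ", 0.6)]

def federated_search(query, sources):
    """Broker: query each source, drop duplicates, sort by score."""
    merged, seen_titles = [], set()
    for source in sources:
        for doc_id, title, score in source(query):
            if title not in seen_titles:  # crude duplicate removal by title
                seen_titles.add(title)
                merged.append((doc_id, title, score))
    return sorted(merged, key=lambda r: r[2], reverse=True)

print(federated_search("bridge design", [search_document_mgmt, search_web_cms]))
```

In practice, normalizing relevancy scores across different engines and removing duplicates reliably are the hard parts of federation; the loop above glosses over both.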

Federated search is becoming a standard for enterprise search across most industries and government agencies. Most enterprise search vendors offer the capability of setting up a federated search.

Faceted Search and Navigation

One significant advance applicable to both Internet and enterprise search is the development of faceted search, or faceted navigation. Tagging content with facets such as “Date,” “People,” “Organization,” and “Document Type” works because it gives users simple ways to filter search results and does not require any advances in calculating search relevance. The one drawback to faceted navigation is that it requires a great deal of metadata. In some cases, this metadata can be automatically generated. However, considerable manual effort is typically required. Faceted navigation is discussed further in the section labeled “Varieties of Search Interfaces.”

Varieties of Search Interfaces

A search interface is the mechanism that enables users to develop and input their search queries into the search engine. It can range from a simple single search text box to an advanced search interface that supports a variety of metadata input fields usable to create more complex search queries. Common varieties of search interfaces are briefly summarized below.

Search Box

The search box shown in Figure I-C-3 is the simplest user interface for search. It consists of a single form field in which a user enters a search query of one or more words.

Figure I-C-3. Standard text search box.

Advantages of this interface are that it is:
• Easy to implement
• Quick and easy to use
• Familiar to most users

Disadvantages of this interface include:
• Implementations are inconsistent in how multiple words are treated by default; for example, a search query “transportation planning” might be interpreted as “transportation” OR “planning” or as “transportation” AND “planning.”
• Users have minimal control over the search process.
• Users may not know why they retrieved none, too few, or too many results.
• Unsatisfactory results require the user to make a new search attempt.

Implementation Tips:
• Supplement the text box with an advanced search capability for advanced users.
• Support use of quotation marks around multi-word phrases for phrase searches.
• Support use of wildcard characters such as (?) for a single-character match and (*) for a multiple-character match.
• Support use of advanced search strings, including nested searches in parentheses and Boolean operators (AND) and (OR).
• Provide an explanation of whether multiple words default to “all” (AND) or to “any” (OR) combinations. (A brief query-parsing sketch illustrating phrase and default-operator handling follows the Advanced Search discussion below.)

Advanced Search

An advanced search interface, also called query form search or fielded search, provides multiple search text boxes in which the user enters words or phrases to be searched on in combination (see Figure I-C-4). Standard fields such as Date, Title, or Genre may be included, as well as application-specific fields such as Project, Route, or Funding Source.

Figure I-C-4. Advanced search interface.

Advantages of advanced search are that it:
• Gives expert users more control over their searches.
• Can leverage different kinds of available metadata and classification to improve search results.

Disadvantages of this type of search interface are that it:
• Usually requires a separate web page in the user interface design.
• Can become overly complex for many searchers, and training may be required.

These disadvantages are relatively minor and can be overcome by making the advanced search optional.
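As referenced in the search box implementation tips above, supporting quoted phrases and documenting the AND/OR default require only modest query-parsing logic. The Python sketch below defaults multiple words to AND; everything about it is illustrative.

```python
import shlex

def parse_query(raw):
    """Split a search-box string into phrase and term tokens.

    Quoted phrases are preserved intact; remaining words become separate
    terms combined with an implicit AND, which should be documented for
    users per the implementation tips above.
    """
    tokens = shlex.split(raw)  # honors "quoted phrases"
    return {
        "phrases": [t for t in tokens if " " in t],
        "terms": [t for t in tokens if " " not in t],
        "default_operator": "AND",
    }

print(parse_query('"transportation planning" manual 2016'))
# {'phrases': ['transportation planning'],
#  'terms': ['manual', '2016'], 'default_operator': 'AND'}
```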

Implementation Tips:
- Support advanced search queries (e.g., find presentations and spreadsheets that were created in the last 12 months within one of the district offices and tagged with the categories "Meetings" and "Budget").
- Within each field, also support wildcards, Boolean operators (AND, OR, NOT), and other advanced search expressions for use by expert searchers.

Faceted Navigation

A faceted navigation interface combines a search text box with options for refining the results set using multiple filters based on a faceted classification system (see Figure I-C-5). A faceted classification system is one that includes multiple ways of categorizing content based on different attributes. Typically, users can view the count of available search results for each value of a facet. For example, Figure I-C-5 shows that 167 books and 12 online resources are related to the search term "wordsworth." This type of interface allows for interactive exploration of a collection of information.

Figure I-C-5. Faceted navigation interface.
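As an illustration of how the counts in a faceted interface can be derived, the following sketch computes per-value counts and applies a filter over a small document set. The records and field names are invented for the example.

    from collections import Counter

    DOCS = [
        {"title": "Bridge Deck Report", "type": "Report", "mode": "Highway"},
        {"title": "Transit Budget Memo", "type": "Memo", "mode": "Transit"},
        {"title": "Light Rail Study", "type": "Report", "mode": "Transit"},
    ]

    def facet_counts(docs, facet):
        """Count available results per facet value, as displayed next to each filter."""
        return Counter(d[facet] for d in docs if facet in d)

    def apply_filter(docs, facet, value):
        """Narrow the results set to documents matching one facet value."""
        return [d for d in docs if d.get(facet) == value]

    print(facet_counts(DOCS, "type"))             # Counter({'Report': 2, 'Memo': 1})
    print(apply_filter(DOCS, "mode", "Transit"))  # the two transit documents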

For facets to work well, they must be designed properly. Key design considerations include:

- Independence. In general, facets should represent different types of characteristics (e.g., file type, organizational unit, and status). Selecting a value from any of these facets should not constrain the choices for the others. In practice, even where facets represent different characteristics, there are situations where selecting a particular facet value will constrain choices from other facets (e.g., there may be a file type that always has a status of "active").
- Meaning and relevance. Facets should represent fundamental, meaningful, and distinguishing features of an information resource within a particular domain that are relevant to the identified set of users who will be searching for information (e.g., the distinguishing features of fruit might be taste [sweet, tart], color [yellow, red, purple], and shape [round, oval, linear]). In a transportation context, some distinguishing features of bridges might include span (simple, continuous, cantilever), primary material (concrete, steel, timber), placement of travel surface (deck, pony, through), and form (beam, arch, truss, cable stay, etc.).
- Extensibility. This is the ease of adding more categories to a given facet.
- Balancing simplicity and completeness. There should not be too many facets or an unmanageable number of categories within each facet. The design should allow users to make multiple, quick, simple selections from the interface provided. On the other hand, having too few facets can also hurt findability: facets should cover the important characteristics of objects within the domain of interest. Finding the right balance usually requires significant understanding of the information needs and behaviors of users.

Implementation of faceted navigation depends on the availability of consistent metadata for the collection being searched.

Advantages of faceted navigation are that it:
- Combines the benefits of searching with the benefits of browsing.
- Gives users more control over their searches.
- Is relatively easy to use.
- Is familiar to users, given that faceted navigation is becoming more common, especially in e-commerce.

Potential disadvantages are:
- Faceted navigation is not supported by all search platforms and may require custom development.
- Good faceted navigation requires attention to design to ensure good usability.
- It can be too cumbersome to use if there are too many facets or if the facets do not match user requirements.
- Many out-of-the-box faceted navigation solutions allow selection of only a single term/value from a facet at a time, unless a more complex user interface (UI) design with check boxes next to the terms provides a multiple-select option. (Note: This may be changing for specialized and advanced applications, but in general, supporting multiple values within each facet has been shown in usability tests to be too advanced or confusing for most users.)

Implementation Tips:
- Keep the number of facets between three and seven. A two-level hierarchy of facets can be implemented (e.g., a Mode facet with the categories Highway, Transit, Marine, and Air, and a Transit sub-facet with categories Subway, Light Rail, and Bus).
- As an option, display the three or four most commonly used facets and provide a "more" hyperlink to show additional facets.
- Where individual facets have many values, show the first few only and provide an option (e.g., a "+" sign) to expand the list.
- Avoid facets with exceedingly long lists of values (e.g., more than 20 elements); these can be difficult to scroll through, especially with limited space.
- Facets are typically displayed in the left margin, with the results in the main screen area.
- It is possible to display a two-level hierarchy of topics within a facet, if limited in size.

Filters/Refiners

Closely related to faceted navigation is the use of filters or refiners that are not based on pre-assigned categories. For example, refiners can be based on file dates, keywords that appear in the text of documents, or other auto-detected information, such as file format type.

Auto-Suggest

Auto-suggest is an enhancement used within search boxes that displays matches of the user-entered search string against a predefined list or dictionary of terms (see Figure I-C-6). Typically, an excerpt of the term list is displayed just under the search box, and the user can then select the desired term from the short list. Type-ahead, also called auto-complete, is a variant of auto-suggest in which the system dynamically executes a match against the list/dictionary of terms based on the characters that the user has entered so far.

Figure I-C-6. Type-ahead/auto-suggest feature.
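A minimal sketch of type-ahead matching against a controlled vocabulary follows; the vocabulary terms are illustrative. Because the list is kept sorted, each keystroke requires only a binary search plus a short scan.

    import bisect

    VOCAB = sorted(["bridge decks", "bridge inspection", "bridge rail",
                    "budget", "bus rapid transit"])   # assumed controlled vocabulary

    def suggest(prefix, vocab=VOCAB, limit=5):
        """Return up to `limit` vocabulary terms starting with `prefix`, alphabetically."""
        i = bisect.bisect_left(vocab, prefix)
        matches = []
        while i < len(vocab) and vocab[i].startswith(prefix) and len(matches) < limit:
            matches.append(vocab[i])
            i += 1
        return matches

    print(suggest("bri"))   # ['bridge decks', 'bridge inspection', 'bridge rail']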

The dictionary of terms used in auto-suggest may be based on the history of previous searches by the individual user or by all users of the system, on a curated controlled vocabulary of terms, or on a combination of the two.

Advantages of using auto-suggest/type-ahead are:
- It combines the benefits of searching (quickly and easily looking for something) with the benefits of browsing (selecting from terms and not having to guess what is there).
- Type-ahead in particular saves the user the trouble of deciding whether and how to truncate a term.

Disadvantages are:
- Type-ahead requires data communication resources. If applied over a network, this can slow down the search.
- Use of past searches for auto-suggest may mislead users who assume that they are selecting from a controlled vocabulary designed to improve search results. Conversely, users may ignore an auto-suggest list that is actually based on a controlled vocabulary because they think it is merely a history of past searches and will not improve their search results.

Implementation Tips:
- Use a controlled vocabulary, rather than a history of past searches, to achieve more accurate search results.
- Use a controlled vocabulary with synonyms to further increase the likelihood of matches.
- Include an option for users to turn off the type-ahead feature if they find it distracting.
- Highlight a default term on the pick list so that the user only needs to hit Enter to select it.
- Arrange displayed lists of terms (typically 4 to 5 terms) alphabetically.

Improving Search Results with Best Bets

Best Bets is a technique for improving search results by selecting the document(s) or website(s) that best match a particular search query and putting them at the top of the search results list. A number of criteria can be used to select the "best," ranging from the most popular document/site to the official document. For example, a searcher who types in "Road Safety" could find an official Road Safety website at the top of the results list. Some search tools include the ability to specify Best Bets (e.g., the "keymatch" feature in the Google Search Appliance [GSA]). The selection of Best Bets is normally carried out by the search team and/or a library team. An initial set of Best Bets can be identified to include known documents of importance (e.g., design and maintenance manuals). Additional documents can be identified through research into how users search (via direct observation, review of search logs, or user surveys).
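A minimal sketch of the technique, assuming a hand-curated query-to-document table maintained by the search or library team (all titles hypothetical):

    BEST_BETS = {
        "road safety": ["Official Road Safety Website"],
        "design manual": ["Roadway Design Manual (current edition)"],
    }

    def search_with_best_bets(query, engine_results):
        """Prepend curated Best Bets to the engine's ranked results,
        dropping duplicates the engine also returned."""
        pinned = BEST_BETS.get(query.strip().lower(), [])
        rest = [r for r in engine_results if r not in pinned]
        return pinned + rest

    print(search_with_best_bets("Road Safety",
          ["Crash Statistics 2016", "Official Road Safety Website"]))
    # ['Official Road Safety Website', 'Crash Statistics 2016']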

Advantages of using the Best Bets approach are:
- It provides a way of delivering better search results than relevance rankings alone.
- It saves users the time and effort required to scan long search results lists to find a document of interest.
- It can be built up over time, so it does not require a major project to deliver positive results.

Disadvantages are:
- A comprehensive approach requires effort, both to identify which documents should be designated as Best Bets and to keep the set updated over time.
- Best Bets can only address a relatively small subset of all user searches.

Implementation Tips:
- Utilize Best Bets as one element of an overall search strategy. It should be viewed as a supplementary technique rather than a full solution.

Topic 2: Metadata

Purpose of Metadata

Metadata serves multiple purposes:
- It makes relevant content easier to find, especially when a user is looking for content relevant to a particular question or topic area and a full-text search would yield too many irrelevant results.
- It makes a body of content easier to navigate and explore (e.g., in a faceted navigation interface).
- It helps users understand whether an information resource is relevant and sufficiently authoritative to address their needs.
- It provides a basis for managing records-retention schedules and periodically culling content within repositories.

Metadata is particularly important for enabling searches of rich media and other non-textual formats. It improves search results for textual content relative to free-text searches. It also provides the basis for implementing interfaces that allow users to explore a body of content by setting different filters (e.g., source, date, topic), each of which is based on available metadata.

Improved Relevance of Search Results

Full-text search capabilities can find a search term in a text document without any metadata. However, inclusion of a keyword within a metadata field indicates that the term is important rather than mentioned incidentally. Use of subject terms in metadata can increase the chances that a given document will appear toward the top of the search results set. For example, instead of counting the number of times a search term (e.g., "transportation") shows up in a document, if the search term matches keyword metadata associated with the document, it is automatically counted as a correct hit and moved to the top of the results list.

The fact that a term was specifically included in the metadata is a more reliable indicator that the document is about "transportation" than the number of times the word "transportation" appears in the document.

Search engines can be configured to give extra weight in their ranking algorithms to the contents of metadata fields. Some search interfaces allow users to limit searches to metadata fields in order to ensure that only relevant results are retrieved. Metadata fields can be exposed to users as advanced search fields. This can improve accuracy by reducing the number of irrelevant documents that the search engine returns.

Support for Faceted Navigation and Search Refiners

Faceted navigation is a powerful method of guided end-user search that allows the user to filter or narrow search results by various criteria. For example, a typical faceted navigation interface might include filter options for subject, document type (purpose), source (organization), and publication date. For these filters to work, consistent metadata must be available for each document in the target collection.

Documentation for Users

Metadata provides users with a convenient summary of the pertinent characteristics of an information resource, including the title, author, source, description/abstract, publication or effective date, and version or status (e.g., draft versus final; active versus superseded). This information helps users quickly assess the relevance of a document without having to review it in depth.

Documentation for Information Life Cycle Management

Metadata is essential for information life cycle management, providing the information managers need about where a given information resource should be stored (per established policies), when to archive or delete it, and what types of access restrictions are needed. To perform these functions, information managers require metadata about who owns each content object, what its retention schedule is, when it was last reviewed, its level of sensitivity or confidentiality, and its status (e.g., draft, approved, final).

Types of Metadata

The National Information Standards Organization (NISO) defines the following categories of metadata (see http://www.niso.org/publications/press/UnderstandingMetadata.pdf):
- Descriptive metadata describes an information resource for purposes of identification and findability. Includes title, abstract, author, and subject terms.
- Administrative metadata provides information to help manage a resource, such as when and how it was created. Includes document type, intellectual property rights, and retention schedule date.
- Structural metadata indicates how physical (non-electronic) items are structured.
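The metadata weighting described above under "Improved Relevance of Search Results" can be illustrated with a simple scoring sketch. The document schema and the boost value are assumptions for the example, not any particular engine's behavior.

    def score(doc, term, keyword_boost=10.0):
        """Term frequency in the body, plus a fixed boost when the term
        was deliberately assigned as keyword metadata."""
        tf = doc["body"].lower().split().count(term.lower())
        in_keywords = term.lower() in (k.lower() for k in doc["keywords"])
        return tf + (keyword_boost if in_keywords else 0.0)

    doc = {"body": "transportation is mentioned once", "keywords": ["transportation"]}
    print(score(doc, "transportation"))   # 1 + 10.0 = 11.0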

One of the most well-known metadata standards is the Dublin Core Metadata Element Set. Work on this standard originated at a 1995 workshop held in Dublin, Ohio, sponsored by the Online Computer Library Center (OCLC) and the National Center for Supercomputing Applications. The Dublin Core specifications are currently managed by the Dublin Core Metadata Initiative (DCMI). Currently, 15 standard element types are defined, relating to the content, intellectual property, and instantiation of a content object:
- Content (7 element types): coverage, subject, description, type, relation, source, title
- Intellectual property (4 element types): contributor, creator, publisher, rights
- Instantiation (4 element types): date, format, identifier, language

Dublin Core is stronger on administrative metadata than on findability metadata.

Developing a Metadata Strategy

Because the creation and maintenance of complete, high-quality metadata requires substantial effort, it is important to strike the right balance between the level of effort required and the value added. Although the 15 Dublin Core elements provide a useful guideline, it may not be appropriate or cost-effective to maintain all of these elements for the types of information collections found within DOTs.

Key questions to consider in developing a metadata strategy include:
- Metadata elements. How many kinds of metadata need to be managed and recorded? More comprehensive metadata provides more opportunities to track and find content and enables faceted navigation; however, each additional element adds to the cost and management burden.
- Classification schemes. How many values/terms must be defined for each metadata element? More terms will support more specific classification and enable more precise retrieval, but too many terms makes it difficult and time-consuming to classify and index content and to ensure quality metadata.
- Variations. Should metadata elements vary for different content types or departments? Standardized metadata facilitates searching and finding content in a consistent manner across a large, heterogeneous content set; consistency also makes the metadata easier to implement, manage, and maintain, and makes user training more straightforward. On the other hand, variations support better findability within specific content sets, especially those of high use.
- System-enforced elements. Which metadata elements should be required and system-enforced? Metadata elements that are not required might not be used by content creators who want to save time. If many metadata elements are required, however, content creators might assign incorrect metadata values (such as the first option available) in order to get past system requirements and move on quickly.
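For reference, the 15 Dublin Core elements can be captured in a simple record structure. The values below are invented placeholders for a hypothetical DOT manual.

    dublin_core_record = {
        # Content (7 elements)
        "coverage": "Statewide",
        "subject": "Roadway design",
        "description": "Standards for geometric design of state highways.",
        "type": "Manual",
        "relation": "Supersedes 2010 edition",
        "source": "Design Division",
        "title": "Roadway Design Manual",
        # Intellectual property (4 elements)
        "contributor": "Design Standards Committee",
        "creator": "State DOT",
        "publisher": "State DOT",
        "rights": "Public record",
        # Instantiation (4 elements)
        "date": "2017-01-15",
        "format": "application/pdf",
        "identifier": "DOT-DES-001",
        "language": "en",
    }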

In conjunction with deciding what types of metadata to maintain, it is important to develop an approach to metadata assignment.

Topic 3: Text Analytics

Background

Text analytics software was first developed in the 1990s and commercialized in the 2000s. Text analytics has the potential to change the economics of metadata, making it feasible to implement metadata-driven search solutions that substantially improve upon "out-of-the-box" full-text searches. Text analytics technologies can be used to automatically or semi-automatically generate the types of metadata that make faceted navigation work. Text analytics' auto-categorization capability can also be used to improve relevance ranking of search results. The software analyzes unstructured text-based content (Word documents, PDFs, text files, tweets, blogs, etc.) in a variety of ways in order to extract new information, and is used in applications ranging from fraud detection to e-discovery.

Text analytics includes the following functionality:
- Auto-categorization. Determining what documents or document sections are about.
- Extraction. Identifying and extracting text elements such as people, places, organizations, and events. More sophisticated capabilities can determine relationships across entities (e.g., "Company A and Company B are planning on a merger.").
- Summarization. Automatically creating short document summaries. These can be created dynamically, based on a user search term, to include pointers to where the terms are used within the document.
- Sentiment analysis. Characterizing the sentiments expressed in various social media such as tweets and blogs (e.g., to determine what features people like or dislike, and to provide early warning on issues with products).

For search, the major use of text analytics is in categorizing documents and adding a variety of metadata; the two most important features are extraction and auto-categorization. In addition, summarization, particularly dynamic summarization based on the search query, can replace the snippets that most search engines return. The summaries tend to be more useful than snippets, but this is still a relatively minor use of text analytics for search. For a more in-depth discussion of text analytics, see Reamy (2016).

Extraction Methods

Entities can be extracted from documents in two basic ways. The first method is to use catalogs of entities to be extracted. This is also referred to as "known entity" extraction; the catalogs might include a list of geographic locations or other entities such as state and federal agencies or pieces of legislation. An example of entity extraction for countries is shown in Figure I-C-7.

Figure I-C-7. Text analytics application for known entity extraction.
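The following sketch illustrates both extraction methods: catalog ("known entity") lookup, and the rule-based approach for unknown entities described below. The gazetteer and patterns are deliberately simplistic.

    import re

    AGENCY_CATALOG = {"FHWA", "FTA", "NHTSA", "EPA"}   # assumed gazetteer

    def extract_known_entities(text, catalog=AGENCY_CATALOG):
        """Catalog lookup: return catalog entries found in the text."""
        tokens = {t.strip(".,;:()") for t in text.split()}
        return sorted(catalog & tokens)

    def extract_person_names(text):
        """Rule for unknown entities: a capitalized word preceded by a
        title is likely a person's name (imprecise by design)."""
        return re.findall(r"\b(?:Mr|Ms|Dr)\.\s+([A-Z][a-z]+)", text)

    print(extract_known_entities("Reviewed by FHWA and EPA."))       # ['EPA', 'FHWA']
    print(extract_person_names("Dr. Smith said the span is safe."))  # ['Smith']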

The second method for extracting metadata from documents is to use rules that look for unknown entities; that is, entities that are not in a catalog. Because these rules are not based on lookups from a catalog, they are less precise and very often require human interpretation. For example, one rule might be that if a word starts with a capital letter and is preceded by one of a number of titles (e.g., "Mr.", "Ms.", "Dr.", or formal academic titles), it is likely to be a person's name. Another rule might be that if a word starts with a capital letter and is followed by a word like "said," it is also likely to be a person's name.

Categorization Methods

Categorization is the most complex feature of text analytics, but from a search perspective it is also the most valuable. Categorization can be implemented in a variety of ways:
- Training sets. This approach uses statistics describing distinctive usages of words and patterns of words (called a "statistical signature") for a selected set of documents, and compares that statistical signature to new documents to determine whether they belong to the same category. This is a relatively straightforward method for categorization, but it is severely restricted in its accuracy and in lower levels of granularity.
- Set of terms (simple). This approach assigns categories based on the presence of specific terms. For example, if the term "light rail" appears in the document, assign the category "Transit" for the facet "Mode." The software can typically add synonyms and apply stemming (reducing words to a common root form to automatically include all variations of the word, including plurals, different tenses, and other variants).

For example, a stemming algorithm would reduce the words "digging" and "digs" to the stem "dig". These term-based rules can be augmented by additional simple rules (e.g., to give terms more weight if they appear in the title of a document).
- Set of terms (advanced). This approach assigns categories based on a sophisticated set of rules that are specified using the Boolean operators AND, OR, and NOT, as well as more advanced operators. These more advanced operators can, for example, test for the distance between two words (DIST) or test whether two words appear in the same sentence or the same paragraph.

Examples of text categorization rules are provided in Figures I-C-8 and I-C-9. Figure I-C-8 illustrates a simple rule in which a document is classified as related to "aviation" if it includes any of the terms shown on the right (e.g., aviation, air traffic, airplane).

Figure I-C-8. Text analytics application for simple auto-categorization.
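A minimal sketch of "set of terms" categorization with crude suffix stemming follows; the category terms are illustrative, and a production system would use a real stemmer such as Porter's.

    CATEGORY_TERMS = {
        ("Mode", "Transit"): {"light rail", "subway", "bus"},
        ("Mode", "Aviation"): {"airplane", "air traffic", "runway"},
    }

    def stem(word):
        """Toy stemmer: strip a few common suffixes."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def categorize(text):
        """Assign (facet, category) pairs when any category term appears."""
        stemmed = " ".join(stem(w) for w in text.lower().split())
        return [(facet, category)
                for (facet, category), terms in CATEGORY_TERMS.items()
                if any(term in stemmed for term in terms)]

    print(categorize("The light rail extension opened in 2016."))
    # [('Mode', 'Transit')]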

Figure I-C-9 illustrates the application of a more complex set of rules that look for agricultural terms that:
- Appear in the first 100–200 words of the document, AND
- Are not part of certain phrases (e.g., "Department of Agriculture", a formal phrase that shows up in many documents that are not really about agriculture).

Figure I-C-9. Text analytics application for complex auto-categorization.

More complex rules require more development and maintenance, but they can provide a greater degree of accuracy. It is possible to use templates to reduce the cost of custom development, such as the one shown in Figure I-C-10.
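The positional and exclusion logic of Figure I-C-9 might be sketched as follows; the window size, term list, and excluded phrases are assumptions for illustration.

    AGRICULTURE_TERMS = {"agriculture", "crops", "farmland"}
    EXCLUDED_PHRASES = {"department of agriculture"}

    def matches_agriculture(text, window=200):
        """True if an agricultural term appears in the first `window` words,
        outside the excluded formal phrases."""
        head = " ".join(text.lower().split()[:window])
        for phrase in EXCLUDED_PHRASES:       # mask formal phrases first
            head = head.replace(phrase, " ")
        return any(term in head for term in AGRICULTURE_TERMS)

    print(matches_agriculture("The Department of Agriculture issued a permit."))  # False
    print(matches_agriculture("Farmland preservation along the corridor."))       # True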

Figure I-C-10. Text analytics rules template for auto-categorization.

This rule looks for any term associated with arthritis that appears:
- In the document title.
- In both a keyword field and the abstract.
- At least twice in the abstract.
- At least twice in the first 500 words.

Once such rules are developed, they can be applied to any number of documents and used to generate metadata of all kinds, including subject metadata and various faceted metadata. Figure I-C-11 illustrates how a federated search capability can be augmented with a text analytics package that applies auto-categorization rules within the indexing process.

Figure I-C-11. Federated search with text analytics.

Predictive Coding

Predictive coding (also called technology-assisted review or computer-assisted review) is a term primarily used in the legal and e-discovery realm. In a general information and text analytics context, it is referred to as categorization by example or categorization by sample documents. It refers to the process of having subject matter experts (SMEs) select some number of documents that they consider representative of a particular concept or category. These documents are then processed by software to produce a statistically based signature. Once the signature has been created, it can be used to categorize new documents as belonging to that concept or category by comparing the category signature with the signature for each new document. In e-discovery, this technique is often used to reduce an extremely large set of documents to a size that humans can review more economically. The technique could also be used for a FOIA request. In the text analytics area, all major vendors offer this capability.

There are two major differences between using e-discovery software and text analytics software. First, e-discovery software contains specific features that make it easier to create an e-discovery application; using text analytics software for e-discovery purposes would mean having to develop comparable features. Second, e-discovery software is not well suited for developing other types of applications, such as enterprise search applications. E-discovery software would have to be supplemented to create an enterprise search application or any type of search application other than an e-discovery application.

For the kind of broad-based information applications explored for NCHRP Project 20-97, categorization by example is widely considered the weakest approach. This technique had been explored in prior projects conducted by the text analytics experts for NCHRP Project 20-97, but it had been rejected as not accurate enough and therefore was not used for this project. (For more information about the background research for NCHRP Project 20-97, see Volume II.)
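A minimal sketch of the signature-and-compare idea behind categorization by example, using raw term frequencies and cosine similarity; commercial products use far more sophisticated statistics.

    import math
    from collections import Counter

    def signature(texts):
        """Build a crude term-frequency signature from sample documents."""
        return Counter(w for t in texts for w in t.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    sig = signature(["bridge deck overlay repair", "bridge joint repair methods"])
    print(cosine(sig, signature(["deck repair on the bridge"])))    # relatively high
    print(cosine(sig, signature(["transit fare policy update"])))   # 0.0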

Categorization by example is most useful when there are a limited number of categories and no need for a taxonomy or ontology. It is often considered superior because it does not require humans to create categorization rules, a skill that SMEs typically do not have. However, it does involve a great deal of human effort in selecting good sample documents that are representative of specific concepts or categories and that distinguish between those concepts or categories. Typically, categorization by example is limited to about 70% accuracy, which is much lower than what the researchers achieved using rules in this project. One reason for this is that categorization by example produces a "black box" categorization, making it very difficult or impossible to understand why it works or fails. Categorization by rule, by contrast, produces rules that can be understood and explicitly refined. The only way to refine categorization by example is to select different sample documents, which is also an esoteric skill that few SMEs have.

Integrating Text Analytics Rules into Content Management Workflow

As described above, text analytics rules can be applied as batch processes to add metadata to large collections of documents at a time. This is typically done for documents that were created and stored before the text analytics software was introduced. In addition, text analytics software can be used as part of a document creation or publishing workflow. This workflow can include provisions for authors to suggest new terms for the controlled vocabulary when none of the automatically suggested terms is suitable.

Topic 4: Terminology and Semantic Structures to Improve Search

Controlled Vocabularies

Controlled vocabularies provide a consistent set of terms to assist in information search and retrieval. They are vocabularies because they reflect the agreed-upon words or phrases that represent concepts, and they are controlled because only pre-approved terms may be included, and any subsequently added terms must be approved through a governance process. Each term in a controlled vocabulary represents a unique, unambiguous concept. In contrast, user-assigned tags on social media posts (e.g., #connectedvehicles) are an example of uncontrolled vocabulary. Two commonly known controlled vocabularies are the Library of Congress Subject Headings (LCSH) and the National Library of Medicine's (NLM) Medical Subject Headings (MeSH). In transportation, TRB's Transportation Research Thesaurus (TRT) provides a controlled vocabulary used to index transportation research. Use of controlled vocabularies within an organization (e.g., a DOT) can help standardize the language used to tag documents so that searches for those documents yield more reliable results. Controlled vocabularies can include synonyms to handle situations where more than one term is in common usage for a given concept (e.g., "structure" and "bridge" or "collision" and "crash").
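As a preview of the synonym ring construct described in the next section, the sketch below expands a query term into its full set of equivalents; the rings shown are illustrative, not an official vocabulary.

    SYNONYM_RINGS = [
        {"bridge", "structure", "overpass", "viaduct"},
        {"collision", "crash", "accident"},
    ]

    def expand(term):
        """Return the full ring containing `term`, or the term alone."""
        for ring in SYNONYM_RINGS:
            if term.lower() in ring:
                return sorted(ring)
        return [term]

    print(expand("crash"))   # ['accident', 'collision', 'crash']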

In the TRT controlled vocabulary, for example, bridge decks are found in the following hierarchy:

Bridges and Culverts
  Bridges
    Bridge members
      Bridge superstructures
        Bridge decks

If one were searching for all research articles on "bridge decks", the controlled vocabulary in the TRT would take the searcher directly to what they are looking for. In addition, because the controlled vocabulary also addresses the use of synonyms, a search on the terms "bridge slab" or "bridge surface" would also return all articles on bridge decks.

With improvements in natural language processing, there has been some debate in the literature as to whether the effort required to develop and maintain controlled vocabularies adds sufficient value, in terms of improved findability, over full-text searches. However, several studies have documented the value of using controlled vocabulary subject headings within bibliographic records. For example, a study by Garrett (2007) found that addition of subject headings to a bibliographic database increased the rate of retrieval by 29%. A more recent study by Gross, Taylor, and Joudrey (2014) analyzed 194 search terms and found that, of all documents retrieved through keyword searches on those terms, 28% would not have been found without the subject heading information (based on controlled vocabulary).

There are several varieties of controlled vocabularies, with different levels of complexity. These are described below.

Pick Lists

A pick list is a simple list of values that provides a standard set of options for a metadata element or a search facet. For example, a search interface for road inventory data might include a pick list for selecting functional classifications.

Synonym Rings

A synonym ring provides a set of equivalent terms for a specific concept. For example, a synonym ring for the term "bridge" might include "bridge, structure, overpass, viaduct". A synonym ring can be used to extend a search: if a user enters one of the terms as a keyword, information resources containing any of the terms in the synonym ring can be retrieved.

Authority Files

An authority file is like a synonym ring, but designates an authoritative term to be used for each concept.

Taxonomies

A taxonomy is a kind of controlled vocabulary consisting of preferred terms, all of which are connected in a hierarchy. There may be a single top term or a limited number of top terms.

Taxonomies support classification, categorization, and concept organization when designing a search structure. They allow users to browse and navigate the taxonomy from the top down, from broader to narrower terms. For example, Figure I-C-12 illustrates a taxonomy of transportation modes: five highest-level modes are defined, and each mode can be further broken down into a more detailed set of categories.

Figure I-C-12. Example taxonomy of transportation modes. (Source: Adapted from the Transportation Research Thesaurus [TRB])

Faceted navigation systems may also include taxonomies. Synonyms (non-preferred terms or variants) are often included as optional elements in taxonomies.

Advantages of using taxonomies for representing controlled vocabularies are:
- Ease of use. Most users are familiar with tree structures.
- The ability to start with a broad topic and navigate to more specific topics. This suits users who want to explore a topic and are not sure what they are looking for.
- Provision of a "map of the territory," indicating to users what topics are in scope and have content available, so there is no guessing at search terms that may or may not yield results.
- As with other forms of controlled vocabulary, this approach saves the user the risk of errors when typing into a search box.

Potential disadvantages are:
- Relevant subjects might not all fit into a neat hierarchical structure.
- Comprehensive taxonomies can become too large to browse. As they grow, users can lose track of where they are in the structure.
- If taxonomies are large and more than three levels deep, they may become too unwieldy to use for faceted navigation.

Implementation Tips:
- If the taxonomy is intended for use in faceted navigation, terms at the same level should be viewable on a single page or scroll box. This makes it easier for users to navigate and iteratively narrow down the results set.
- Taxonomies should generally not exceed three levels of hierarchy for novice users, or four levels for users familiar with the subject. If there will be a mix of users, the design should aim at the most likely frequent users.
- If there are more than a dozen terms in a given topic, or if the terms are proper nouns, the terms should be arranged alphabetically.

Thesauri

A thesaurus is a kind of controlled vocabulary that has been arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators: broader term/narrower term, related term, and preferred term/non-preferred term. The purpose of a thesaurus is to promote consistency in the indexing of content items. An example thesaurus entry is shown in Figure I-C-13.

Advantages of using a thesaurus for representing controlled vocabularies:
- Because this approach does not require all terms to fit into a single hierarchy (or a limited number of hierarchies), it more readily accommodates all terms, including those that are slightly "out of scope".
- Because a thesaurus is not displayed as a hierarchy, hierarchical levels can be deeper (over four levels) without compromising browsing. (Users select a term and may branch out from it, but generally do not navigate from the top down.)
- Relationships are intuitive and easy for non-expert users to follow and understand.
- A thesaurus provides helpful guidance to manual indexers, helping them find the appropriate terms when indexing. It is especially useful for trained indexers.
- Terms from the thesaurus can be displayed or partially displayed to end users, allowing them to pick terms and saving them the effort of typing (with possible errors) into a search box.
- The standard feature of synonyms/non-preferred terms improves search results.

Disadvantages are:
- Thesauri generally require considerable time, effort, and expertise to create and maintain.

Implementation Tips:
- Follow the ANSI/NISO Z39.19-2005 (R2010) standard: Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies.

Figure I-C-13. Example of a thesaurus entry.

  Term: Nonmotorized transportation (Aex)
  Definition: Transportation that includes walking and bicycling, and variants such as small-wheeled transport (skates, skateboards, push scooters and hand carts) and wheelchair travel. (Source: TDM Encyclopedia [Victoria Transport Policy Institute]: www.vtpi.org/tdm/tdm25.htm)
  Broader Term: Transportation modes (Ae)
  Narrower Terms: Bicycling (Aexb); Walking (Aexw)
  Related Terms (Hierarchical): Air transportation (Aea); Ground transportation (Aeg); Water transportation (Aes); Public transit (Aet)
  Related Terms (Associative): Pedestrians (Mwx); Horse drawn vehicles (Qbdf); Human powered vehicles (Qbdh)
  Source: Transportation Research Thesaurus (TRB)

Semantic Networks

Semantic networks are complex controlled vocabularies whose relationships carry richer meanings than merely broader/narrower and related (hence the word semantic). These relationships can include those found within taxonomies and thesauri, but they can also include a variety of other relationship types. One well-known example of a semantic network is WordNet (see http://wordnet.princeton.edu). An extract from WordNet's entry for "automobile" includes the following:
- Synonyms (is equivalent to), such as car, auto, automobile, motorcar.
- Hyponyms (is a kind of), such as ambulance, bus, taxicab.
- Meronyms (is a part of), such as air bag, gas pedal, horn, fender.
- Domain term categories (related terms), such as showroom, road map, rental, hopped-up.
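WordNet can also be explored programmatically. The following sketch uses NLTK's interface, assuming NLTK is installed and the WordNet data has been downloaded (nltk.download("wordnet")).

    from nltk.corpus import wordnet as wn

    car = wn.synsets("automobile")[0]    # the car/auto/automobile noun synset
    print(car.lemma_names())             # synonyms, e.g., ['car', 'auto', 'automobile', ...]
    print([h.lemma_names()[0] for h in car.hyponyms()][:5])        # kinds of car
    print([m.lemma_names()[0] for m in car.part_meronyms()][:5])   # parts of a car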

Semantic networks are often used by text analytics software to enhance its ability to interpret and process language. They are also utilized within semantic search and ontology-based applications that can apply reasoning based on semantic relationships. For example, a search engine using the WordNet entry for "automobile" would be able to tell that a document including the phrase "hopped-up jalopy" is a potential match for the search term "car engine". Most efforts to leverage semantic networks for search are still in the developmental stage.

Topic 5: Integration Considerations for Enterprise Search

Although both standard and federated search features can be used out of the box, deliberate design and development work is typically required to ensure good results. This is a four-step process involving requirements analysis, search interface design, metadata integration design, and technical implementation.

The first step is identifying requirements for the enterprise search capability. Key questions to ask in this step are:
- Which repositories should be included in the search?
- Do we want to search both information repositories containing unstructured content and relational databases?
- Do we want to include just internal agency repositories, or also external sources?

Once the target repositories are identified, technical requirements can be assessed based on the type and location of these repositories, their security or access controls, their internal capabilities for indexing and search, and their available interfaces for accepting search requests.

The second step is designing a user interface that takes advantage of the addition of multiple new content sources (i.e., repositories) in a way that leads to better search results. A faceted navigation interface is normally the best approach, but it often calls for design work beyond simply adding content source as one facet.

The third step is designing a metadata framework that integrates the various local metadata schemas in use within the existing repositories. This is perhaps the most crucial, and usually the most resource-intensive, part of the implementation process. It involves developing an overall metadata design that supports all the local variations in metadata fields, including differences in terminology. For example, one information repository might have a "subject" metadata field with possible values that include "planning", "maintenance", and "construction". A second repository could have a "topic" metadata field with possible values including "transportation planning and programming", "maintenance and operations", and "project development". Two options are available to enable searching across these repositories: (1) standardizing the metadata to use the same field names and categories, or (2) mapping categories across the two repositories. In this case, the "subject" category "planning" would be mapped to the "topic" category "transportation planning and programming"; the "subject" category "maintenance" would be mapped to the "topic" category "maintenance and operations"; and so on. As another example, one repository may have a "district" field to indicate the geographic applicability of its content objects.

Another repository might have a "jurisdiction" field; mapping would involve matching up each district with a list of the jurisdictions it serves. Sometimes multiple metadata fields in one repository will correspond to a single field in another, requiring a "many to one" mapping. For example, one repository might include "creator name" and "creator organization", while another might include only "creator". In this case, it would be necessary to analyze the contents of the "creator" field to determine whether a crosswalk can be developed to or from one or both of the corresponding fields in the other repository.

When information repositories are already well established, mapping provides the simplest integration approach. However, when new information repositories are being created, integration can be facilitated through metadata standardization. There are three types of metadata standardization:
1. Metadata elements: standardization of the definition of metadata elements such as subject, creator, etc. This also includes standardization of the structure of metadata elements (e.g., where an "address" element is defined as a combination of "street", "city", "state", and "zip code" elements).
2. Data content: standardization of how the values of metadata elements are recorded, such as date or address formats.
3. Data value: standardization of the lists of values for metadata elements (e.g., use of controlled vocabularies for things like equipment type or district).

Transforming existing metadata into consistent formats (e.g., getting all creation dates into a "DD/MM/YYYY" format) is sometimes referred to as normalization.

The final step in developing an enterprise search capability is the actual technical implementation. This includes creation of the necessary data structures to store and manage metadata and metadata mappings; configuration or development of the search interface; and configuration of the crawler that creates the master index (in the case of standard search) or of the connectors or interfaces that allow simultaneous queries of multiple repositories (in the case of federated search). This last step requires specific attention to managing the different access control lists and credentialing requirements of the various repositories.

Although it is important to follow these steps to ensure that the enterprise search capability is implemented with careful attention to the full range of issues, the actual design and development of the system is not necessarily a major project. Agencies should anticipate an implementation timeline in the range of 3 to 6 months for an effort of moderate complexity.
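The category mapping and date normalization described in this topic can be sketched as a simple crosswalk. The "planning" and "maintenance" mappings follow the examples above; the "construction" mapping and field names are assumptions for illustration.

    from datetime import datetime

    SUBJECT_TO_TOPIC = {
        "planning": "transportation planning and programming",
        "maintenance": "maintenance and operations",
        "construction": "project development",   # assumed mapping
    }

    def unify(record):
        """Map repository A's 'subject' onto repository B's 'topic' and
        normalize creation dates to DD/MM/YYYY."""
        out = dict(record)
        if "subject" in out:
            out["topic"] = SUBJECT_TO_TOPIC.get(out.pop("subject").lower())
        if "created" in out:
            out["created"] = datetime.strptime(out["created"], "%Y-%m-%d").strftime("%d/%m/%Y")
        return out

    print(unify({"subject": "Planning", "created": "2016-04-09"}))
    # {'created': '09/04/2016', 'topic': 'transportation planning and programming'}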

Appendix D

DOT Information Organization Resources

Several examples of DOT classification schemes are provided in this appendix. They are intended to illustrate current practice rather than best practice.

Example 1 shows a set of high-level "information entities" identified as part of an Enterprise Architecture project at the Kansas DOT. This list could be useful to other DOTs creating standard subject keywords.

Example 2 shows high-level categories used by the Texas DOT for its document management system. This is an example of using organizational functions as an organizing scheme. Each of the categories shown can be mapped to one of the boxes included in Figure I-D-2.

Example 3 shows a hierarchical information organization scheme used in a project office at the Washington State DOT. This very specialized scheme includes a mixture of the content types and content formats listed in Chapter 2; organizational functions such as those included in Figure I-D-2; specific instances of entities discussed (e.g., Northbound Express Lane 3498); and other dimensions, including level of sensitivity (secure versus other), events (e.g., office moves), responsibilities (e.g., budget management, reporting and oversight, planning), and owners (e.g., secretary).

Example 4 shows a simple organization scheme for the Mississippi DOT's team collaboration site. It is a hierarchical system based on a mix of organizational function, organizational structure, and asset.

Example 5 shows a set of keywords from a Virginia DOT controlled vocabulary used to tag documents on the agency's collaboration site. These terms mostly correspond to content types (e.g., presentation, policy, permit), but they also include organizational functions (e.g., audit, legal, facilities).

Example 6 shows a high-level asset classification scheme developed for the Washington State DOT.

Example 7 is a diagram showing typical DOT organizational functions, developed from a review of several state DOT organization charts.

The mixture of different facet types in a single classification scheme is an implementation of a paper filing system in which every piece of paper could be in only a single category. This approach is commonly implemented on shared file drives through the establishment of folders. It is simple to implement and maintain, but it has the limitation that each information resource can be in only a single folder (unless resources are duplicated). Rather than mixing facet types in a single classification scheme, DOTs can consider classifying their information resources based on multiple facets. Each facet would have its own classification scheme and controlled vocabularies. This approach is more powerful and flexible, but it does require the effort to create and manage metadata for each information resource.

Example 1: Kansas DOT Enterprise Architecture – DOT Entities

An enterprise architecture initiative at the Kansas DOT identified the following hierarchical list of high-level entities. Each of these entities could be defined as a concept that a given piece of information might be "about":

Human Resources
  - Employees
  - Positions
  - Organization Structure
Project Information
  - Programs
  - Projects
  - Activities
  - Contracts
Location – Geospatial
  - Area-based
  - Point-based
  - Linear-based
  - Cities
  - Counties
  - Geographic areas
  - Geographic Maps
  - Digital Photos (ortho quads)
  - State Highway Network
Roadway Information
  - Bridge Data
  - Right-of-Way Tracts
  - Roadway
  - Non-Road Features
  - Signs
Recorded Events
  - Roadway Conditions
  - Accidents – Recorded
  - Traffic Counts
  - Weather
  - ITS
Business Partners
  - Contractors
  - Consultants
  - Vendors

  - Public Entities
  - Other Government Entities
Financial Data
  - Funds
  - Budgets
  - Financial Transactions
  - Grants
  - Accounts
  - Bonds
  - Loans
Assets
  - Buildings
  - Equipment
  - Land
  - Materials
  - Office Equipment
  - Public Transportation
  - Rest Areas
  - Shop Equipment
  - Storage Areas
  - Towers
  - Consumable Inventory
Information Systems and Technology

Example 2: Texas DOT Document Management

This example, used for document classification, is based on DOT function. Each document class is followed by its description.

1. Administrative: Documents related to the management and administration of Division, District, or Office operations, programs, and projects. Administrative documents relate both to higher-level management and to routine program or project administrative functions.
2. Construction: Construction documents not related to a contract.
3. Contracts: Documents related to agreements, leases, and contract development and administration across functional groups.
4. Environmental Operations: Documents related to environmental operations, including cemeteries, the maintenance program, project coordination and review, and public transportation reviews.

5. Equipment and Facilities: Documents related to building facilities and equipment, including building construction; maintenance, operations, and security; equipment operations and maintenance; and hazardous materials.
6. Finance: Documents related to division or district financial operations: accounting, budgeting, billing and payment, funds management, financial reporting, payroll, employee timekeeping, and travel expense.
7. Human Resources: Documents related to individual employee files and actions as described in Chapter 10 of the Human Resources Officers' Guide, and other human resources management areas, including civil rights and grievance, substance abuse programs, former employees, job placement, training, manpower, and classification.
8. Information Systems: Documents related to information resources operations, planning, project development, purchasing, quality assurance, security, and technical documentation.
9. Maintenance Operations: Documents related to district maintenance operations, including building security, ferry and tunnel operations, hazardous materials, petroleum storage tanks, MMIS/SES input, permits, maintenance projects, reports, section diaries, traffic signals, and waste activities.
10. Occupational Safety: Documents related to safety program management, including accident reports, emergency planning, hazardous materials training, materials safety data sheets, the substance abuse program, safety inspections, meetings and training, and workplace chemical lists.
11. Project Development and Design: Documents related to advance project planning for specific projects, including environmental and public involvement, estimates, meeting minutes and notes, programming assessments, project authorization and approval, monitoring and management, work authorization, public hearings/public involvement, ROW determination, and schematics and studies; and to the development and approval of plans, specifications, and estimates (PS&E) for specific projects.
12. Purchasing and Warehouse: Documents related to purchasing and warehouse operations, including inventories, materials requests and issues, POs, requests for information, offers, proposals, quotes, requisitions, specifications, and work orders.
13. Right-of-Way: Documents related to district right-of-way operations, projects to acquire right-of-way and dispose of surplus right-of-way, easements, and programs related to junkyards and sign regulation.

14. Traffic Operations: Documents related to district traffic operations, including sign requests and issues, traffic programs and projects, traffic safety grants, and traffic signal maintenance.
15. Transportation Planning: Documents related to district planning and programming operations, metropolitan and rural transportation planning, the annual unified planning work program, program scheduling, the UTP, projects, proposed project studies, roadway functional classification, and abandoned rail corridors.

Example 3: Washington State DOT Project Office

This example shows an information organization scheme for a project office responsible for High Occupancy Vehicle (HOV) programs. It combines DOT functions with content types.

HOV Program Management

1.0 Secure Program Oversight
  1.1 Correspondence
  1.2 Meetings & Briefings
  1.3 GEC Selection
  1.4 Secure <person name removed>
  1.5 HOV Budget Management
  1.6 Quarterly Project Reports
2.0 Secure GEC Management
3.0 Accounting & Purchasing
  3.1 Memos – Directives
  3.2 Meeting Authorization Requests
  3.3 Purchasing Card
  3.4 Secure Accounts Payable
  3.5 Capital Inventory
  3.6 Links
4.0 Secure Personnel
  4.1 Tables of Organization
  4.2 Position Descriptions
  4.3 Staff Development

  4.4 Recruitment
  4.5 Workforce Estimates
5.0 Payroll
  5.1 Memos – Directives
  5.2 Forms
  5.3 Schedules
  5.4 Leave Report
  5.5 Labor Reporting
  5.6 Timesheets
  5.7 Washington State DOT Employees Tracking Sheets
6.0 Secure Facilities
  6.1 Memos – Directives
  6.2 Policies
  6.3 Contracts
  6.4 Maintenance
  6.5 Floor plan
  6.6 HOV Office Start Up
  6.7 Custodial Inspection Sheet
  6.8 Office Moves
7.0 Safety
  7.1 Memos – Directives
  7.2 Secure Emergency Contacts
  7.3 Building Evacuation Plan
  7.4 Safety Forms
  7.5 Safety Meetings
  7.6 Safety Awards
  7.7 Accident Reports
8.0 Washington State DOT Administration
  8.1 Memos – Directives
  8.2 Forms
  8.3 Secretary

  8.4 Phones & IT
  8.5 Vehicles
  8.6 Washington State DOT Manuals
  8.7 Equipment Manuals
  8.8 Correspondence
  8.9 Parking and Keys
9.0 Secure Communications
  9.1 Memos – Directives
  9.2 Incoming Correspondence
  9.3 Outgoing Correspondence
  9.4 How To Instructions
  9.5 HQ & OR Communications
  9.6 Reporting & Oversight
  9.7 Planning
  9.8 Internal Communications
  9.9 External Communications
  9.10 Maps & Graphics
  9.11 Photos
  9.12 Meeting Agendas
  9.13 Special Projects
10.0 Secure HOV Management
  10.1 Memos – Directives
  10.2 Contracting
  10.3 Transition
  10.4 Invoice Review

Project Filing

1.0 Northbound XL3498
  1.1 CAD
    1.1.1 BaseFiles
    1.1.2 CADDoc
    1.1.3 FromDesign
    1.1.4 Rsc

    1.1.5 As-Builts
    1.1.6 ChangeOrders
    1.1.7 ContractPlans
    1.1.8 PlansforApproval
    1.1.9 RightofWayPlans
  1.2 EngDataConst
    1.2.1 ConstDoc
    1.2.2 Deliverables
    1.2.3 RWKs
    1.2.4 Standards
    1.2.5 Geometry
    1.2.6 Libraries
    1.2.7 Reports
    1.2.8 Surfaces
    1.2.9 Survey
  1.3 EngDataDesign
    1.3.1 Deliverables
    1.3.2 DesignEngDoc
    1.3.3 RWKs
    1.3.4 Standards
    1.3.5 Geometry
    1.3.6 Libraries
    1.3.7 Reports
    1.3.8 Surfaces
  1.4 Environmental
  1.5 Estimates
  1.6 Hydraulics_Report
  1.7 Permits
  1.8 Photogrammetry
  1.9 Photos
  1.10 Project_Documentation
  1.11 Quantities

  1.12 Scoping
  1.13 Survey
    1.13.1 Deliverables
    1.13.2 Requests
    1.13.3 SurveyDoc
    1.13.4 RawData
    1.13.5 WorkingData

Example 4: Mississippi DOT

Internal Services
  - Audit
  - Human Resources
  - Information Systems
  - Legal
  - Public Affairs
  - State Aid
  - Commission
Administrative Services
  - Asset Management
  - Facilities and Records Management
  - Financial Management
  - General Services
  - Procurement
  - Special Projects
  - Support Services
Projects
  - Bridge
  - Construction
  - Contract Administration
  - Consulting Services
  - Environmental
  - Enforcement
  - Maintenance
  - Materials
  - Planning
  - Programming
  - Rails
  - Research

  - Right-of-way
  - Roadway Design
  - Traffic Engineering
  - Transportation Information (GIS)
Districts
  - District 1
  - District 2
  - District 3
  - District 4
  - District 5
  - District 6
  - District 7
Internal Planning
  - Aeronautics
  - Freights
  - Ports
  - Waterways
  - Public Transit
Program-Specific Plans and Studies
  - Long-Range Transportation Plan
  - Modal Plans
  - Freight Plan
  - Corridor Plans
  - Asset Management Plans
  - Transportation Studies
  - Traffic Engineering Studies
  - Safety Studies
  - Cost Allocation Studies
  - Research Reports
  - Customer Surveys
  - Employee Surveys
Program Development
  - Grant Applications and Awards
  - Needs/Candidate Project Lists
  - Program Plans
  - Program Performance Reports
  - State Transportation Improvement Plan
Engineering
  - Design Standards and Specifications
  - Product Evaluations

  - Land Surveys
  - Structure Inspection Reports
  - Design Plans/Engineering Drawings
  - Value Engineering Studies
Project Development
  - Right-of-Way Maps
  - Utility Relocation Records
  - Property Acquisition Records
  - Property Deeds and Titles
  - Categorical Exclusions
  - Environmental Impact Statements
  - Environmental Assessments
  - Environmental Decisions
  - Engineering Drawings/Plans
  - Engineering Calculations
  - Project Cost Estimates
Construction Projects
  - Project Advertisement Reports
  - Bid Notices
  - Bid Proposals
  - Construction Agreements
  - Construction Contracts
  - Subcontracts
  - Change Orders
  - Daily Inspection Reports
  - Materials Test Reports
  - Claims
  - As-Built Plans
Maintenance and Operations
  - Maintenance and Operations Procedures
  - Customer Complaint Reports
  - Utility Permits
  - Access Permits
  - Outdoor Advertising Permits
  - Oversize/Overweight Permits
  - Maintenance Records
  - Signal Timing Records
  - Incident Logs/Reports
  - Crash Records

Administrative
  - Policy and Procedure Guidelines and Manuals
  - Calendars and Schedules
  - Administrative Contracts and Agreements
  - Correspondence
  - Business Plans
  - Press Releases
  - Public Notices
  - Meeting Notes
  - Public Records Requests
  - Contact Lists
  - Newsletters and Publications
  - Audit Reports
  - Federal Grantee Reports
  - Contractor Compliance Reports
Financial
  - Invoices
  - Budgets
  - Financial Reports
Plant and Facilities
  - Equipment Manuals
  - Equipment Inventory
  - Building Inventory
  - Property Disposition Records
  - Work Orders
  - Maintenance Reports
  - Inspection Reports

Example 5: Virginia DOT Document Descriptors

The following keywords (from a controlled vocabulary) are used to characterize document types within the Virginia DOT collaboration site:
  - Audit
  - Budget
  - Contract
  - Employee Benefits
  - Evaluation
  - Facilities
  - Form
  - Legal
  - Legislative
  - Lessons Learned
  - Manual

Memorandum
Org Chart
Performance
Permit
Plan Sheet
Policy
Presentation
Procedure
Project
Report
Security
Specification
Strategic Plan
Template
Training Material

Example 6: Washington State DOT Asset Types

This example shows an information classification scheme. The Washington State DOT conducted a research project with Kent State University to develop a taxonomy of asset types (see Winkler [2014]). Project results included a two-level asset classification structure (shown in Figure I-D-1) and a more detailed thesaurus that included terminology related to each of the second-level elements of the classification structure. For example, for the element “Barrier Systems” under the highest-level category “Transportation Systems,” thesaurus entries include “Guardrail,” “Impact Attenuator,” and “Bridge Rail.” These are all marked as “narrower terms” (i.e., types of “Barrier Systems”). For the term “Guardrail,” the thesaurus includes a synonym (“Guardrail Barrier”) and several narrower terms representing types of guardrail (e.g., “W-Beam Guardrail Barrier”). The Washington State DOT intends to use the asset thesaurus to tag data and content to enable searches across both structured and unstructured information sources based on asset type.

Figure I-D-1. Washington State DOT high-level asset classification scheme (proposed).

Transportation Infrastructure
    Roadway Infrastructure
    Roadside Infrastructure
    Railway Infrastructure
    Aviation and Airport Infrastructure
    Bicycle and Pedestrian Path and Trail Infrastructure
    Marine Infrastructure
    Multimodal Infrastructure
    Bridge Infrastructure
    Environmental Infrastructure and Features
    Materials, Pavement and Markings
    Parking Infrastructure

Transportation Systems
    Barrier Systems
    Aviation Systems
    Communication and Monitoring Systems
    Emergency Management Systems
    Intelligent Transportation Systems (ITS)
    Marine Control Systems
    Navigation Systems
    Rail Signaling and Control
    Revenue Collection Systems
    Detection and Identification Systems
    Traffic Safety, Signs and Lighting

Transportation Facilities
    Roadway Facilities
    Airport Facilities
    Marine Facilities
    Multimodal Facilities
    Parking Facilities
    Rail Facilities

Transport Vehicles
    Roadway Vehicles
    Aviation Vehicles
    Marine Vessels
    Rail Vehicles
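To make the thesaurus mechanics concrete, the following is a minimal Python sketch of how an entry like the one described above (a preferred term with synonyms and narrower terms) might be represented and used to expand a search query. It is illustrative only, using the terms from this example; it is not Washington State DOT's implementation, and the class and method names are the author's assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThesaurusTerm:
    """One thesaurus entry: a preferred label, optional synonyms,
    and links to narrower (more specific) terms."""
    label: str
    synonyms: List[str] = field(default_factory=list)
    narrower: List["ThesaurusTerm"] = field(default_factory=list)

    def expand(self) -> List[str]:
        """All labels a query for this term should match: the term
        itself, its synonyms, and everything below it in the hierarchy."""
        labels = [self.label, *self.synonyms]
        for child in self.narrower:
            labels.extend(child.expand())
        return labels

# Entries taken from the Washington State DOT example above.
w_beam = ThesaurusTerm("W-Beam Guardrail Barrier")
guardrail = ThesaurusTerm("Guardrail",
                          synonyms=["Guardrail Barrier"],
                          narrower=[w_beam])
barrier_systems = ThesaurusTerm("Barrier Systems",
                                narrower=[guardrail,
                                          ThesaurusTerm("Impact Attenuator"),
                                          ThesaurusTerm("Bridge Rail")])

# Expanding the broad term yields every tag the search should match, so
# content tagged "W-Beam Guardrail Barrier" is found by a search for
# "Barrier Systems".
print(barrier_systems.expand())
```

This hierarchy-plus-synonym expansion is what allows content tagged at any level of the classification to be retrieved by a search on a broader asset type.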

Example 7: “Typical” DOT Organizational Functions

Figure I-D-2 shows a hierarchical set of DOT organizational functions, developed from a review of several DOT organization charts. This resource could be used to develop a generic set of functional categories for DOT information classification, i.e., categories that are independent of specific business units, which can and do change over time.

Figure I-D-2. Typical DOT organizational functions.

Appendix E

Examples of Commercially Available Enterprise Search and Text Analytics Products

This appendix presents additional information on companies that provide enterprise search and text analytics products and software. The products mentioned are examples of what was commercially available at the time of this research; they are included for information only, without implying endorsement, and CRP does not endorse the use of any specific software product. The appendix summarizes information on products used by DOTs, as identified during the research project.

Table I-E-1 lists selected open source and commercial enterprise search tools; some of these have text analytics features and are also listed in Table I-E-2. Table I-E-2 presents a selected list of commercially available text analytics products, with detail on their features. A number of free resources are also available; these are described in Table I-E-3. The free resources can be a good way to explore text analytics and text mining software, and can serve as a foundation for developing software within the agency. In addition, some companies simply use the Python programming language to develop their own text analytics software.

Most vendors do not include pricing on their websites, although some provide an indication of their pricing model. Pricing models vary, but the following variables generally affect pricing: the features included; the number of documents processed; the number of configurations; the number of users; the license term (e.g., annual vs. perpetual); maintenance requirements (e.g., annual maintenance costs); and whether the services are on-premises or in the cloud.
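As context for the product listings that follow, the sketch below shows what indexing and querying documents looks like with one of the open source engines in Table I-E-1 (Elasticsearch, via its REST API). It assumes a recent Elasticsearch node running locally on the default port; the index name ("dot-documents") and the document fields are illustrative, not drawn from any agency's system.

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch node

# Index one document. ?refresh=true makes it searchable immediately
# (convenient for a demonstration; omit during bulk indexing).
doc = {
    "title": "District 3 As-Built Plans, SR 99 Overlay",
    "doc_type": "As-Built Plans",
    "body": "Final as-built roadway plans, including guardrail replacement.",
}
requests.put(f"{ES}/dot-documents/_doc/1?refresh=true", json=doc)

# Full-text query against the body field; hits come back relevance-ranked.
resp = requests.get(
    f"{ES}/dot-documents/_search",
    json={"query": {"match": {"body": "guardrail"}}},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```

Solr exposes a similar HTTP interface. In either case, most of the effort in a real deployment lies not in these calls but in connectors to content repositories, security trimming, and relevance tuning.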

Table I-E-1. Examples of commercially available enterprise search products.

BA Insight
    Description: Provides both stand-alone search and add-on tools for SharePoint. Tools include content analytics, connectors, visual refiners, and personalized views.
    General comments: See also the description in Table I-E-2.
    Sample customers or partners: Louisiana Department of Transportation and Development; U.S. Army; U.S. Department of Homeland Security

Coveo / Intelligent Search Platform
    Description: Enterprise search engine providing support for complex security requirements, auto-suggest features, and social tagging.
    Sample customers or partners: City of Denver; AECOM; Harris Corporation

Elasticsearch
    Description: Open source enterprise search platform; like Solr, built on the Lucene foundation. Original release in 2010. Distributed and maintained by Elastic.
    General comments: Newer than Solr; the choice between the two should involve a detailed analysis of features versus requirements.
    Sample customers or partners: Verizon; CERN; New York Times; Netflix

Google / Google Search Appliance
    Description: A rack-mounted device providing document indexing functionality. Features include wildcard search, metadata sort, entity recognition, spellcheck and synonyms, and auto-complete.
    General comments: Limited ability to support federated search. Google has announced that this product will be sunset over the next few years.

IBM / IBM Watson Explorer
    Description: Search platform supporting content analytics and big data; integrated with Watson analytics and Q&A functionality.
    Sample customers or partners: Brazil Ministry of Justice; Toyota Financial Services; Lehigh University

LucidWorks
    Description: Enterprise search application development platform based on Solr. Includes a crawler/indexer, multiple connectors, and reporting features. Offers integration with big data platforms. Federated search not supported.
    General comments: Search implementation company specializing in customized Solr installations.
    Sample customers or partners: General Services Administration; U.S. Army; Sacramento County; Statistics Canada

Microsoft / SharePoint Search
    Description: Allows developers to build machine learning applications for semantic discovery, knowledge collaboration, sentiment analysis, and classification. Note that FAST has been incorporated into this product.
    General comments: Primarily a toolkit to build solutions; includes some simple out-of-the-box (OOB) features.
    Sample customers or partners: Washington State DOT; Oregon DOT; Mississippi DOT; Virginia DOT

Solr
    Description: Solr is an open source enterprise search platform built on Lucene (an open source text search engine library written in Java). Original release in 2006. Solr is distributed by the Apache Foundation.
    General comments: Can be downloaded for free but requires expertise to install and configure.
    Sample customers or partners: AT&T; Disney; Netflix; Sears; Travelocity

Voyager
    Description: Enterprise search engine featuring a spatially enabled user interface (“Navigo”) combined with crawling, indexing, and search technologies from the open source Apache Lucene/Solr products. Indexes geospatial files, databases, and documents. Offers extensions for geotagging, federated search, ArcGIS integration, and document management system integration.
    General comments: The only commercial product identified that combines a spatial interface with a full-featured search engine.
    Sample customers or partners: California DOT; National Geospatial-Intelligence Agency; York Region (Canada)

Note: Not all companies listed customers or partners on the company website. For those that did, this table includes a sample, emphasizing public sector customers where applicable.

Table I-E-2. Examples of commercially available text analytics products.

Ai-One™
    Topic Mapper™: Allows developers to build machine learning applications for semantic discovery, knowledge collaboration, sentiment analysis, and classification.
    UltraMatch™: Analyzes patterns and information structures in images; matches images.
    Analyst Toolbox: BI tool that works with free text and unstructured data; automates ontology building.
    General comments: Primarily a toolkit to build solutions; includes some simple OOB features.
    Website: http://www.ai-one.com/tag/text-analytics/

AlchemyAPI
    AlchemyLanguage: Text analysis service including functions for entity extraction, sentiment analysis, emotion analysis, keyword extraction, concept tagging, relation extraction, taxonomy classification, author extraction, language detection, text extraction, microformats parsing, feed detection, and linked data support.
    AlchemyVision: Identifies and classifies subjects, objects, events, and settings; trains classifiers using examples.
    General comments: Bought by IBM, so it is now part of IBM Watson and not a separate product.
    Sample customers or partners: Pocket; Shutterstock; Hearst; Pulsar; Adtheorent
    Website: http://www.alchemyapi.com/

BA Insight
    Smart Analytics: SharePoint-based text analytics and search analytics; includes auto-classification capabilities.
    SharePoint Connectors: Integrates with over 50 systems to incorporate content outside of SharePoint.
    General comments: Multiple connectors to incorporate content from sources outside of SharePoint.
    Sample customers or partners: U.S. Department of Homeland Security; U.S. Army; Shell; Pfizer; CBS; Deloitte; PwC; Ropes & Gray; Ford; PayPal
    Website: http://www.bainsight.com

Basis Technology
    Rosette Text Analytics: Self-described “mature” text analytics functions include entity extraction, language identification, morphological analysis, tokenization, sentence tagging, name translation, and name matching; self-described “beta” functions include entity linking, relationship extraction, categorization, and sentiment analysis.
    General comments: Strength is multiple languages (55 supported); focus is on a toolkit and SDK for building your own solution.
    Sample customers or partners: Airbnb; Amazon; EMC2; Google; Oracle
    Website: http://www.basistech.com/

Cambridge Semantics
    Anzo Unstructured: Add-on to Anzo Enterprise that includes conceptual mapping of documents, entity linking, automatic ontology mapping, semantic search, document classification and annotation, and dashboard output capabilities.
    Anzo Enterprise: Includes data integration, search, visualization, and analysis capabilities; features semantic search, entity linking and management, and unstructured text mining and analytics.
    General comments: Primary focus is data analytics, not text analytics.
    Sample customers or partners: U.S. Air Force; Biogen Idec; Lockheed Martin; Johnson & Johnson; Merck; Novartis; Sanofi; Utah Department of Health
    Website: http://www.cambridgesemantics.com/

Clarabridge
    Clarabridge CX Suite: Focused on customer-centered insights, including through social media, surveys (using text analytics), and a BI platform.
    General comments: Primarily a cloud solution with a focus on sentiment and social analysis.
    Website: http://www.clarabridge.com/product/

Concept Searching
    ConceptClassifier: Includes semantic metadata generation, auto-classification, and taxonomy management tools; available on-premises, in the cloud, or as a hybrid.
    General comments: Strong SharePoint integration; focus is on usability.
    Sample customers or partners: National Transportation Safety Board; Transport for London; BP; NATO; Bain Capital; United Health Group
    Website: https://www.conceptsearching.com/

Data Harmony / Access Innovations
    Thesaurus Master®: Controlled vocabulary development; provides thesaurus and taxonomy management.
    M.A.I.™: Document indexing; its Statistics Collector submits suggestions for improvement.
    MAIstro®: Bundles Thesaurus Master and M.A.I.; includes additional features such as a metadata extractor.
    General comments: Primarily a taxonomy and vocabulary management company; has a full-featured text analytics capability.
    Website: http://www.dataharmony.com/products/

Expert System
    Cogito Discover: Data extraction, semantic tagging, structured information loading, standard or customized taxonomy development, and automatic categorization.
    Cogito Studio: Combines rules generation, ontology creation, taxonomy customization, extraction customization, and semantic network enrichment.
    Luxid Annotation Server: Extracts information using morpho-syntactic reasoning, statistics, thesaurus-/taxonomy-/ontology-based extraction, machine learning, and rules-based extraction.
    Luxid Webstudio: Web application for ontology management and semantic enrichment, used to maintain a shared ontology.
    General comments: Full-featured text analytics development platform; recently bought Temis, another leading text analytics company.
    Sample customers or partners: Google Cloud Platform; Accenture; Capgemini; Esri
    Website: http://www.expertsystem.com/products/

IBM
    SPSS® Text Analytics for Surveys: Categorizes survey responses to turn survey text into quantitative data.
    Watson: Commercial version of the Jeopardy-winning software. The initial module was for healthcare; additional modules are being developed for other industries.
    General comments: Full cognitive computing platform; the most advanced software available; requires a large start-up cost.
    Websites: http://www-03.ibm.com/software/products/en/spss-text-analytics-surveys; http://www.ibm.com/watson

kCura
    Content Analyst Analytical Technology (CAAT®): Features include geometry-based concept searches (requiring no word lists, taxonomies, or thesauri), concept-based categorization, clustering of related content, and email analytics.
    General comments: Focus is on automatic clustering and categorization; does not have a full-featured development platform.
    Website: http://contentanalyst.com/html/tech/caat.html

Lexalytics
    Semantria: Includes categorization and named entity extraction tools; integrates with Excel; provides output to use with BI tools.
    General comments: Platform for text analytics applications; primary focus is on sentiment analysis, but also has enterprise categorization and metadata capabilities.
    Website: https://www.lexalytics.com/semantria

Luminoso
    Luminoso API: Features include auto-tagging, classification, conceptual search, topic correlation, topic clustering, and predictive modeling.
    General comments: Focus is on enhancing automatic capabilities; very advanced technology; aim is to grow a “common sense” brain.
    Sample customers or partners: Intel; Autodesk; Sony; Target; NASA; CDC; Scotts Miracle-Gro; TNS
    Website: http://www.luminoso.com/products/api/

MeaningCloud
    Topics Extraction: Entity and concept extraction using complex natural language processing techniques; users can create a customized dictionary.
    Text Classification: Automated document classification through the use of taxonomies or user-defined categories, and a combination of statistical document classification and rule-based filtering.
    Corporate Reputation: Applies reputational dimensions to identify the polarity (positive, negative, neutral) of content.
    Text Clustering: Clusters similar documents and provides description and relevance values of the clusters.
    General comments: Primary focus is on social media and sentiment analysis; the basic model is a cloud-based service.
    Website: https://www.meaningcloud.com/products

Megaputer
    PolyAnalyst: Data analysis capabilities for loading, preparation, analysis, and reporting; text analysis components include document categorization, taxonomy generation, entity extraction, text OLAP, semantic search, document clustering, keyword extraction, and pattern detection.
    General comments: Strong point is integration of text and data; text processing is mostly automatic.
    Sample customers or partners: National Transportation Safety Board; FAA; Allstate; Chase; HP; 3M; Boeing; Siemens; Marriott
    Website: http://www.megaputer.com/site/polyanalyst.php

MultiTes
    MultiTes Pro: Desktop-based application to manage thesauri.
    MultiTes Online: Online thesaurus management that allows for multiple editors.
    MultiTes Site / MultiTes WDK / MultiTes EDK: Tools for publishing a thesaurus on the Internet, intranets, and organization servers.
    General comments: Primarily thesaurus and vocabulary management; not really text analytics, but often used alongside text analytics for taxonomy management.
    Website: http://www.multites.com/index.htm

NetOwl
    NetOwl Extractor: Entity and event extraction with over 100 types of entities out of the box.
    NetOwl TextMiner: Leverages NetOwl Extractor and provides semantic search.
    NetOwl NameMatcher: Machine learning approach to name matching for multiple entity types.
    NetOwl EntityMatcher: Matches entity records based on key attributes.
    NetOwl DocMatcher: Used to categorize and compare documents and to identify duplicates.
    General comments: Primarily text mining and extraction; does not include much for auto-categorization.
    Sample customers or partners: LexisNexis; IDT Payment Services; Gale; Blackhawk Engagement Solutions
    Website: https://www.netowl.com/

OdinText
    Next Generation Text Analytics™: Text analytics platform focused on combining structured and unstructured data.
    General comments: Primarily sentiment and social analysis; focus is on OOB automatic capabilities and integration with data presentation.
    Sample customers or partners: Campbell's; Shell; Disney; Coca-Cola; NBC
    Website: http://odintext.com/about-odintext/

Open Text
    OpenText Content Analytics: Features include management of controlled vocabularies, concept extraction, entity extraction, categorization, and summarization.
    General comments: Multiple products to handle text and incorporate it into applications; auto-classification is mostly statistical and is part of the content management offering.
    Website: http://www.opentext.com/what-we-do/products/discovery/information-access-platform/content-analytics

Pool Party
    Basic Server: Taxonomy and thesaurus management with additional optional features.
    Advanced Server: Taxonomy and thesaurus management, linked data management, and ontology management included; concept tagging and semantic search are optional.
    Enterprise Server: All features of Advanced Server, plus concept tagging, text mining and entity extraction, and a content recommender included; semantic search, data integration, and data analytics and visualization are optional.
    Semantic Integrator: All included and optional features of Enterprise Server are included.
    General comments: Primarily a taxonomy management company; has added limited text analytics, mostly entity extraction and auto-classification.
    Sample customers or partners: Boehringer Ingelheim; Credit Suisse; Council of the European Union; Pearson; RedBull Media House; The World Bank; The Pokémon Company International; Wolters Kluwer
    Website: https://www.poolparty.biz/product-overview/

Provalis Research
    QDA Miner: Software for qualitative data analysis to code, annotate, or search text or images; can extract information from text or images.
    WordStat: Text analysis tool including text mining and visualization tools for theme extraction; provides the ability to create taxonomies with words, patterns, and proximity rules; includes document classification tools; integrates with QDA Miner and requires an installed version of either QDA Miner or SimStat.
    ProSuite: Integrates QDA Miner, WordStat, and SimStat (a statistical analysis product); integrates structured and unstructured data.
    General comments: Primarily text as data, text mining, and entity extraction.
    Website: http://provalisresearch.com/products/

SAS
    Enterprise Miner: Features include text mining, content classification, auto-categorization, taxonomy development, and entity extraction.
    Text Miner: Full-text analytics platform, including both statistical and rule-based categorization; features noun phrase extraction (named entity and pattern based), sentiment analysis, automatic summarization, and regular expressions. Integrated with Enterprise Miner.
    Sentiment Analysis: Full-featured sentiment analysis with linguistic and statistical functionality. Integrated with Text Miner in the full version.
    Contextual Analysis: SAS platform that allows for taxonomy development and auto-categorization, as well as statistical and rule-based categorization.
    General comments: SAS has the most full-featured set of products, with dozens of offerings. These products can be integrated with Enterprise Miner or SAS 9.4, the company's foundation software. SAS's primary focus has been high-end data analysis, but it has added full-text processing capabilities.
    Website: http://www.sas.com/en_us/home.html

SAP
    HANA: Allows for linguistic, semantic, and “fuzzy” searches through entity extraction and normalization; contains a user interface for building search apps and browser-based search; also includes separate spatial, predictive, and BI capabilities.
    General comments: SAP also has the old Inxight text analytics platform; it is not clear how well supported it is.
    Website: https://hana.sap.com/capabilities/analytics.html#section_section_4

Smartlogic
    Classification Server: Provides rule-based classification and natural language processing to tag metadata; includes customizable entity and fact extraction.
    Ontology Editor: Users can model concepts, topics, structures, and relationships in this ontology management platform; a web-based interface allows for collaboration.
    Search Application Framework: Integrates the Ontology Editor model into a customizable search engine, with functionality including taxonomy- or entity-based facet navigators, topic maps, and filters.
    Text Miner: Highlights the most-used language to inform taxonomy and ontology development.
    General comments: Full development platform; focus is on ontology as structure.
    Website: http://www.smartlogic.com/what-we-do/products-overview

Verint Systems
    Verint® Text Analytics™: Separates employee and customer streams using conversational analytics; provides automated theme discovery; includes an interface for visualizations.
    General comments: Primarily social and customer analysis; adding more text analytics.
    Website: http://www.verint.com/solutions/customer-engagement-optimization/voice-of-the-customer-analytics/products/text-analytics/

Note: Not all companies listed customers or partners on the company website. For those that did, this table includes a sample and may not include all listed customers or partners.
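Several of the products in Table I-E-2 describe statistical document classification or auto-categorization. As a point of reference, the following minimal sketch shows the statistical half of that idea using the open source scikit-learn library (not any vendor's product). The four training documents and the category labels are illustrative only; a real deployment would train on a large labeled corpus and evaluate accuracy before use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: document text paired with a functional
# category like those shown in the examples earlier in this appendix.
docs = [
    "change order for guardrail quantities on construction contract",
    "daily inspection report, materials test results attached",
    "environmental assessment and categorical exclusion determination",
    "right-of-way map and property acquisition record for parcel 12",
]
labels = ["Construction", "Construction",
          "Project Development", "Project Development"]

# TF-IDF features plus naive Bayes: a common baseline for text categorization.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# Categorize a new document; with this training set the model would
# typically print ['Construction'].
print(model.predict(
    ["final change order and as-built plans for the construction contract"]))
```

Commercial products layer rule-based filtering, taxonomy management, and workflow on top of this kind of statistical core; the sketch shows only the underlying classification step.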

Table I-E-3. Examples of free text analytics and text mining software.

Apache Mahout
    Description: Tool for developers to create machine learning applications; converts text to vectors to allow for text clustering; text mining only, used for building prototypes.
    Ref: http://mahout.apache.org/

GATE
    Description: Historically the primary software for building a text analytics environment in-house.
    Ref: https://gate.ac.uk/

KNIME
    Description: KNIME Analytics Platform is an open platform tool with over 1,000 modules, including modules for text analysis.
    Ref: https://www.knime.org/knime-analytics-platform

LingPipe
    Description: Text processing tool capable of topic classification, named entity recognition, clustering, and sentiment analysis.
    Ref: http://alias-i.com/lingpipe/

Natural Language Toolkit
    Description: Platform for using Python with text that allows for entity identification, classification, semantic reasoning, tokenization, and tagging.
    Ref: http://www.nltk.org/

OpenNLP
    Description: Processes natural language text and supports common NLP tasks such as named entity extraction, parsing, and tokenization; based in machine learning.
    Ref: https://opennlp.apache.org/

RapidMiner
    Description: Provides predictive analytics and sentiment analysis.
    Ref: https://rapidminer.com/
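As an example of working with the free tools in Table I-E-3, the following sketch uses the Natural Language Toolkit (NLTK) for tokenization, part-of-speech tagging, and named entity extraction. The resource names passed to nltk.download() reflect NLTK's conventions at the time of writing and may differ in later releases.

```python
import nltk

# One-time downloads of the models this example needs.
for pkg in ("punkt", "averaged_perceptron_tagger",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = ("The Washington State DOT worked with Kent State University "
        "to develop a taxonomy of asset types.")

tokens = nltk.word_tokenize(text)   # split text into word tokens
tagged = nltk.pos_tag(tokens)       # label each token with a part of speech
tree = nltk.ne_chunk(tagged)        # group tokens into named entities

# Print each named entity found, with its type (ORGANIZATION, GPE, etc.).
for subtree in tree.subtrees():
    if subtree.label() != "S":
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```

Pipelines like this one, pointed at agency document repositories, are the kind of in-house capability the free tools above can support before an agency commits to a commercial product.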
