Volume II: Background Research
Contents

Chapter 1 Introduction, II-3
  1.1 Project Overview, II-3
  1.2 Document Overview, II-4
Chapter 2 Current Practices for Improving Findability, II-5
  2.1 Overview, II-5
  2.2 Literature Review, II-5
  2.3 Findability Practices: DOTs, II-11
  2.4 Findability Practices: Other Organization Types, II-18
Chapter 3 Framework for Improving Findability, II-22
  3.1 Framework Development Process, II-22
  3.2 Final Framework, II-22
Chapter 4 Pilot Demonstration, II-27
  4.1 Pilot Objectives, II-27
  4.2 Identification of Pilot Agencies, II-27
  4.3 Summary of Pilot Activities, II-29
  4.4 Level of Effort for the Pilot, II-30
  4.5 Transferability and Scalability of the Pilot, II-34
Chapter 5 Conclusions and Future Research Needs, II-36
  5.1 Conclusions, II-36
  5.2 Future Research Needs, II-38
References, II-41
Appendix: Pilot Findability Report, II-42
  A.1: Pilot Overview, II-43
  A.2: Assessment, II-44
  A.3: Content Collection, II-49
  A.4: Solution Development, II-54
  A.5: Test and Evaluation, II-68
Annex 1: Pilot Classification Rule Descriptions, II-75
Annex 2: Example Scenarios Using Faceted Search Design, II-84
Annex 3: Evaluation Metrics, II-90
Chapter 1. Introduction

1.1 Project Overview

Research Objectives

The objective of this research was to improve state department of transportation (DOT) information findability by (1) defining a management framework, including responsibilities of a transportation agency and its partners, for classification, search, and retrieval of transportation information; (2) describing successful practices for organizing and classifying information (e.g., ontologies or metadata schemas) that can be adapted to classification, search, and retrieval of the diversity of information a transportation agency creates and uses; (3) developing federated or enterprise search procedures that a DOT can use to make transportation information available to users, subject to concerns for security and confidentiality; and (4) undertaking an example implementation of the management framework, the organization and classification practices, and search procedures to demonstrate enhanced findability for a DOT's data.

Research Scope and Tasks

NCHRP Project 20-97 was structured in four phases:
• Phase 1 involved information gathering to document current practices for improving findability and relevance of information. It included a literature review, interviews with five state DOTs, and compilation of information about practices in non-DOT organizations based on the research team's prior experience.
• Phase 2 involved developing a framework for improving findability for use by DOTs, based on the information gathered in Phase 1.
• Phase 3 involved a pilot demonstration of techniques for improving findability at a state DOT. This pilot focused on findability of construction project information and the application of text analytics tools for automated classification of content and improving relevancy of search results.
• Phase 4 involved documenting the results of the research in this final report and development of a stand-alone guide to improving findability.
The resulting guide is presented as Volume I of this research report.
1.2 Document Overview

Volume II of NCHRP Research Report 846, Improving Findability and Relevance of Transportation Information, provides a high-level summary of the project methodology and deliverables. The balance of Volume II is organized as follows:
• Chapter 2 documents the information gathering activities and summarizes practices for improving findability.
• Chapter 3 documents the framework that was developed for improving findability.
• Chapter 4 provides an overview of the pilot demonstration.
• Chapter 5 contains conclusions from the research, including lessons learned and suggested ideas for future research.
• The Appendix and Annex materials contain a more detailed description of the pilot activities.
Chapter 2. Current Practices for Improving Findability

2.1 Overview

The initial substantive research task involved a review of "successful practices for ensuring findability of mission-critical information in public and private sector organizations," leading to the development of an initial framework for ensuring findability. To accomplish this objective, the research team conducted a practice review involving a literature search, telephone surveys with five state DOTs, and documentation of prior research team project experience with findability projects outside of the state DOT community. Each of these activities is discussed in this chapter.

2.2 Literature Review

A limited literature review was conducted, focusing on three areas: (1) general references on information architecture and search, (2) specific references on transportation information management and findability, and (3) state DOT enterprise architecture studies. References reviewed are listed in Table II-1. Findings of relevance to this project are briefly summarized below.

Information Architecture and Search

As noted in the working plan, a rich base of research, guidance, and practice examples from multiple domain areas exists related to information governance, architecture, and search. General lessons from information architecture practice include:
• Ensure that improvement initiatives will have clear benefits to the organization that can be measured. Identify the organization's critical information assets, and target findability issues that are adversely impacting efficiency or effectiveness.
• Tailor solutions to different search questions and associated patterns. Distinguish simple lookup needs from more open-ended discovery needs.
• Recognize that success of any findability initiative depends on actual usage. Providing convenient and satisfying search experiences for users is essential.
• Make sure that the solutions fit the organization's capabilities; do not pursue a resource-intensive approach if it cannot be realistically sustained. For example, approaches that require ongoing staff efforts to manually classify documents generally are not sustainable.
• Leverage new technologies. For example, automation of content discovery and indexing has advanced; semi-automated approaches can provide effective results and are more cost-effective than manual processes.

The Information Management Foundation published a set of 19 best practice articles on topics including taxonomy and content management integration, metadata creation and management, knowledge transfer, enterprise information management investments, web analytics,
and media creation management (see Table II-1). One of these articles, referencing a content management initiative at Motorola, contained sufficient information to be included as one of the research team's standard findability practice examples. Other articles were informative, but they were either not directly applicable to state DOTs or not provided in a case study format. The book's introduction offers several principles that are reinforced by the best practice articles, and that are useful to keep in mind in designing findability improvements:
• Information is communication. In designing findability improvements, it is necessary to consider what kind of communication is being supported or assisted.
• Information has value. Information management takes effort, and therefore candidate efforts need to be evaluated based on business value.
• Information has audiences. It is important to obtain an in-depth understanding of the needs and search behaviors of information users, who are the targets of any findability improvement. Otherwise, there is a real danger that resources will be invested without providing value.

Table II-1. Literature review: Findability of transportation information.

Information Architecture and Search
1. Information Management Best Practices, Volume 1 (The Information Management Foundation), 2010. https://books.google.com/books?id=mRhgHhvhiUsC
2. Search Patterns (Morville and Callender), 2010. http://searchpatterns.org/
3. Ambient Findability (Morville), 2005. http://www.amazon.com/Ambient-Findability-Peter-Morville/dp/0596007655/

Transportation Information Management
4. NCHRP Report 754: Improving Management of Transportation Information (Cambridge Systematics), 2013. http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_rpt_754.pdf
5. NCHRP Report 643: Implementing Transportation Knowledge Networks (Spy Pond Partners, LLC), 2009. http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_rpt_643.pdf

State DOT Enterprise Architecture
6. Development of a Strategic Enterprise Architecture Design for Ohio DOT (Cooney, Clement, and Shah), 2014. http://www.dot.state.oh.us/Divisions/Planning/SPR/Research/reportsandplans/Reports/2014/Administration/134756_FR.pdf
7. Kansas DOT Enterprise Architecture, 2005. http://www.mdt.mt.gov/other/webdata/external/research/DOCS/RESEARCH_PROJ/IT_ARCH/TASK_2.PDF (pp. 11-15)
• Information has a life cycle. A full life cycle view of information is needed to ensure findability, and successful content management efforts encompass the creation, tagging, conversion, publication, and retirement/culling processes.

Peter Morville's two books provide valuable background on the nature of search and strategies for designing useful search environments (see Table II-1). Some fundamental concepts of relevance to identifying successful practices for findability include:
• Findability is a challenge because information is stored in multiple repositories, with multiple inconsistent ways of labeling and categorization, and inherent ambiguities in language. Moreover, in most organizations, "findability falls through the cracks," meaning that nobody is responsible for the end result.
• Improving findability means considering the nature of searches. Search objectives range from looking for a specific item (e.g., find the AASHTO Asset Management Guide) to exploring available resources within a topic area (e.g., find out what guidance or experience exists for roundabout design). Also, searchers vary in terms of search skill level and familiarity with the content space.
• Where exploratory searches are common, search interfaces based on faceted navigation can be very helpful. This approach is used on many shopping websites (e.g., on www.Amazon.com). Such interfaces are powered by structured databases incorporating standardized metadata for each item.
• Findability can be approached not only through "pull" methods (in which a user actively searches for content), but also through "push" methods (in which the user receives content based on subscriptions or role-based targeting).
• Search success can be evaluated based on precision and recall.
Precision measures how well a system retrieves only the relevant documents (i.e., the percentage of results that are relevant); recall measures how well a system retrieves all of the relevant documents (i.e., the percentage of available relevant documents that were included in the search results). Because precision and recall often are inversely related, it is helpful to understand which metric is more important when designing search capabilities.
• Full-text search performance depends on the size of the search pool. As collections increase in size, both precision and recall decline.
• Removal of redundant, outdated, and trivial content from the search pool is helpful for shrinking the search space and improving search results. This can be tackled via content policies that define what should (and should not) be stored, and by regular weeding of the collection.
• Use of descriptive metadata for subject and content type is increasingly valuable for improving findability as collections grow larger. However, while metadata improves search performance, centralized, manual tagging is typically too expensive and time-consuming for most large-scale search applications.
• Different search objects require different findability strategies. An approach for a relatively small set of policy documents might rely on full-text search, whereas an approach for a large collection of photographs would require use of keywords or structured metadata.
• Effective search design patterns include use of faceted navigation, use of "Best Bets" for the most common queries, use of auto-complete and auto-suggest as the user is typing the search criterion, emphasis on presenting the "best results" first (through well-tuned relevance algorithms), options to sort by date, options to filter by format and content type, use of personalization information, and use of diversity algorithms to guard against redundant results.
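The precision and recall definitions above can be made concrete with a small worked example. The document IDs and counts below are invented purely for illustration:

```python
# Illustrative only: precision and recall for a single query, with made-up document IDs.
retrieved = {"doc1", "doc2", "doc3", "doc4"}          # results returned by the search engine
relevant  = {"doc2", "doc3", "doc5", "doc6", "doc7"}  # documents actually relevant to the query

true_positives = retrieved & relevant                 # relevant documents that were retrieved

precision = len(true_positives) / len(retrieved)      # share of returned results that are relevant
recall    = len(true_positives) / len(relevant)       # share of relevant documents that were found

print(f"precision = {precision:.2f}")  # 2 of 4 results are relevant -> 0.50
print(f"recall    = {recall:.2f}")     # 2 of 5 relevant docs found  -> 0.40
```

Tuning a system to return fewer, safer results tends to raise precision at the expense of recall, which is the inverse relationship noted above.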
• Federated search is helpful when searches across multiple sources are needed, but performance can be slow, and metadata-based queries are limited to the "lowest common denominator" across the different sources. Building a unified index of content across repositories is an alternative that can achieve the same objective.
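The unified-index alternative mentioned above can be sketched as follows. This is a simplified illustration, not a depiction of any specific product; the repository names and documents are hypothetical:

```python
from collections import defaultdict

# Hypothetical content drawn from two separate repositories.
repositories = {
    "sharepoint": {"doc-101": "pavement condition report 2015",
                   "doc-102": "bridge inspection manual"},
    "projectwise": {"plan-7": "bridge deck replacement plans",
                    "plan-8": "pavement resurfacing project plans"},
}

# Build one inverted index across all repositories, so a single query
# searches everything without a slow live call to each source.
index = defaultdict(set)
for repo, docs in repositories.items():
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add((repo, doc_id))

def search(term):
    """Return (repository, document id) pairs whose text contains the term."""
    return sorted(index.get(term, set()))

print(search("bridge"))
# -> [('projectwise', 'plan-7'), ('sharepoint', 'doc-102')]
```

Because the index is built ahead of time, queries are fast and can use any metadata the indexer captured, avoiding the lowest-common-denominator limitation of live federated queries (at the cost of keeping the index fresh).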
• A variety of approaches can be taken for the design of content classification and tagging methods. There is no single best way; the approach should be designed to fit the need:
  – Taxonomies can be helpful in situations for which findability can be enhanced via a hierarchical breakdown or tree structure of content (e.g., locating construction project information applicable to project phases, and tasks within phases).
  – Faceted classification can be helpful in situations for which users want to search based on different criteria, such as locating meeting records based on date of meeting, organizational unit running the meeting, or type of content produced (e.g., presentation slides, meeting minutes, meeting agenda, etc.).
  – Standard key words can be helpful to facilitate common searches. Use of thesauri can extend the value of key words by establishing preferred terms as well as equivalent and associated (broader and narrower) terms. For example, a user searching for "performance measures" could be directed to resources that were tagged with the terms "structurally deficient" or "pavement condition index."
  – The Resource Description Framework (RDF) can be used to document a formal, machine-readable representation of relationships across terms, providing a powerful semantic foundation for search-based applications, and the ability to link independently produced data resources. For an example, see the BBC's Wildlife Ontology (http://www.bbc.co.uk/ontologies/wo), used to power the organization's Wildlife Finder website (http://www.bbc.co.uk/nature/wildlife).
  – Free-form tagging (or folksonomies) can be helpful for social media posts and other content that is somewhat transient in nature.
• Google's search methods combine full-text, metadata, and popularity measures in which inbound links constructed by humans are, in effect, used as metadata.
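The faceted classification idea described above (locating meeting records by date, organizational unit, or content type) can be sketched with a few lines of code. The records and facet values here are invented for illustration:

```python
# Hypothetical meeting records, each tagged with independent facets.
records = [
    {"title": "Q1 design review", "date": "2016-03-10",
     "unit": "Design",     "content_type": "minutes"},
    {"title": "Safety briefing",  "date": "2016-03-22",
     "unit": "Operations", "content_type": "slides"},
    {"title": "Q2 design review", "date": "2016-06-09",
     "unit": "Design",     "content_type": "slides"},
]

def facet_filter(items, **facets):
    """Keep items that match every requested facet value, in any combination."""
    return [r for r in items
            if all(r.get(k) == v for k, v in facets.items())]

# Unlike a fixed hierarchy, users can combine facets freely.
print([r["title"] for r in facet_filter(records, unit="Design")])
# -> ['Q1 design review', 'Q2 design review']
print([r["title"] for r in facet_filter(records, unit="Design", content_type="slides")])
# -> ['Q2 design review']
```

The key design property is that each facet is independent, so any combination of criteria narrows the result set without requiring a predetermined browsing path.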
• Typical intranet searches do not perform as well as Google searches of the Internet, given the absence of structured metadata to power faceted navigation and insufficient scale to support full-text relevance-ranking algorithms.

Transportation Information Management

NCHRP Report 754: Improving Management of Transportation Information reviewed the state of the practice in transportation agency information management (see Table II-1). This report identified successful strategies in some areas, including organization and distribution of structured data (notably traffic data and geographic data), use of content and document management systems such as SharePoint and ProjectWise, sharing of internal research reports, and DOT library services including cataloging of printed documents. Challenges also were identified in several areas:
• The growing number of information sources, which makes it time-consuming and difficult for staff to determine what is most relevant and valuable to read and share.
• A need to make website and other content findable through improved information organization and use of key words.
• The siloed nature of information creation and management within the DOT, impeding findability and use of centralized information management strategies.
• A lack of user training on how to discover and retrieve data and information.
• A lack of executive policy direction on information management.
• Highly constrained staffing and financial resources for improving findability and providing reference support.

The report also presented a variety of information management strategies, organized around processes for capturing, administering, and retrieving information. Strategies related to improved findability included:
• Establish agency policies for information governance, archiving, and records retention.
• Provide content in electronic format; use digital preservation and allocate funds to address electronic file management.
• Leverage available technology for information storage and retrieval.
• Establish categorization schemes for data and information management.
• Use taxonomies, semantic schemes, and authoritative glossaries and vocabularies, and use taxonomy management tools.

NCHRP Report 643: Implementing Transportation Knowledge Networks was an earlier study that established a business plan for knowledge sharing within and across transportation agencies (see Table II-1). This study included focus groups and a web-based survey to better understand information needs. Reported information needs were wide-ranging; the list below provides a flavor of the diversity of information being sought and the nature of search requirements.

Searches for Specific Documents
• Search for a particular engineering standard.
• Search for an older plan or engineering document (especially difficult if the project name changed or if the project was split or combined).
• Search for online equipment maintenance manuals.
• Search for unpublished or "gray literature" (e.g., presentations from internal meetings or national conferences).
• Find current active contracts and agreements.

Searches for Information on a Specific Topic and of a Specific Content Type
• Search for research reports or information about best practices. For example, "Has a study been done about outdoor advertising practices?"
• Search for latest developments for a specific technique or technology application. For example, "What are the latest technologies for automated speed enforcement?"
• Search for activities at peer agencies. For example, "What are other agencies doing in the area of innovative finance?" "Who is using electronic signatures on plans?"
• Find current links to different websites with information collections.
For example, "Where can I look for information about pavement preservation methods?"
• Search for existing or pending/proposed local, state, and/or federal legislation related to a particular topic area.

People Searches
• Find specific contacts at a peer agency for different functional areas.

Data Set Searches
• Search for data relevant to a particular question (e.g., construction costs, vehicle registration trends, freight movements).

Project Searches
• Search for historical information or construction details related to a specific project.

Recorded Event Searches
• Search for historical information about events (e.g., details about a particular crash or incident).

Based on the identified information needs, the report presented a vision for a central information portal including federated search capabilities with the following components:
• Information search, giving access to various organized information sources including agency survey results, library catalogs, data sets, and legislation.
• Topic search, giving access to curated sets of information resources, maintained by designated national topic leaders, organized by resource type (e.g., research report, synthesis, data set, etc.).
• People search, by role (e.g., traffic engineer for City X) or by area of expertise.
• Calendar search for events by date range and topic.
• News search for articles by topic area, keyword, and source.
• Research search (e.g., giving access to TRB sources and other sources on active transportation research projects).

The business plan recommended the following performance measures related to findability:
• Changes in access time and in cost for a standard "basket" of information goods.
• Percentage of unique transportation library holdings that can be discovered via available search tools.
• Percentage of active and completed research projects that can be discovered.

State DOT Enterprise Architecture

The state DOT enterprise architecture studies for Ohio DOT and Kansas DOT did not explicitly address findability but do provide useful models of state DOT business processes and information systems that serve as a context for understanding search needs and behaviors (see Table II-1). Figures II-1 and II-2 show two products of the Kansas DOT enterprise architecture study. Figure II-1 is a value-chain view that distinguishes primary and supporting activities of the agency and illustrates the life cycle of core business process activities. The value chain provides a way of understanding the business context for information organization. The categories identified can provide a useful way to classify agency information resources and understand their creation and utilization patterns. Figure II-2 illustrates a high-level data model that provides another way of categorizing different information resources in an agency.

Source: Redrawn from figure in draft document from the Kansas State DOT.
Figure II-1. Kansas DOT enterprise architecture value-chain diagram.
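The idea of classifying agency information resources by business-process category, as suggested by the value-chain view, can be sketched with a simple keyword-rule classifier. The categories and rules below are invented for illustration and are not taken from the Kansas DOT architecture:

```python
# Hypothetical keyword rules mapping documents to business-process categories.
RULES = {
    "Project Delivery": ["construction", "design", "plan"],
    "Maintenance":      ["maintenance", "repair", "mowing"],
    "Operations":       ["traffic", "incident", "signal"],
}

def classify(title):
    """Return every category whose keywords appear in the document title."""
    lowered = title.lower()
    return [category for category, words in RULES.items()
            if any(word in lowered for word in words)]

print(classify("US-69 construction traffic control plan"))
# -> ['Project Delivery', 'Operations']
```

Rule-based tagging of this kind is one semi-automated way to populate a business-process facet without requiring staff to classify each document by hand; real deployments would need richer rules and review of misclassifications.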
Figure II-3 shows a simplified business process model developed as part of the Ohio DOT enterprise architecture study. Similar to the Kansas DOT value chain, it offers a way to associate the agency's information resources with key business activities. In effect, these architectural views provide ways of dividing up the state DOT information space into logical categories that can support findability of information. These categories may be used to develop one or more facets for DOT information search. (Note that Figure II-3 was adapted from the original and simplified to show its essential elements.)

2.3 Findability Practices: DOTs

Methodology

Five state DOTs were selected for interviews based on the research team's familiarity with ongoing initiatives related to improving findability:
• The Washington State DOT
• The Virginia DOT

Source: Redrawn from figure in draft document from the Kansas State DOT.
Figure II-2. Kansas DOT enterprise architecture high-level data model.
• The Mississippi DOT
• The Illinois DOT
• The Colorado DOT

The interview was divided into the following sections:
• Basic information about the agency (e.g., number of employees, system size).
• Description of successful practices for improving findability.
• Current practices for managing selected types of content.
• Approaches for managing special content types (images, data sets, social media).
• Findability improvement needs.

If an agency had adopted information organization schemes or classification methods, the research team requested copies.

Figure II-3. Ohio DOT business process view (simplified).

The interviews proved to be a very useful method for gaining a good understanding of the "information management landscape" at state DOTs. Several commonalities were identified across the five agencies with respect to content management tools, processes, and challenges. One observation from the exercise was that, because information management is not typically a highly centralized activity in state DOTs, one would need to interview many different individuals across multiple departments to obtain a complete picture of content storage, organization, and search practices. For the most part, the individuals interviewed had a reasonably broad and complete understanding of formalized information management
systems and practices. In some instances, they consulted with others. For the Washington State DOT, the research team conducted follow-up calls to fill in some of the details for specific content types.

Summary of Findings

The DOTs interviewed were facing several common challenges:
• Lack of consistency across business units as to what content is stored, where it is stored, and in what format.
• Lack of a coordinated approach for management of structured and unstructured information resources.
• Lack of ability to search across different information repositories in the organization.
• Limited formalized metadata standards; the metadata in use was primarily administrative rather than descriptive in nature.
• Lack of formalized information governance processes.

In general, the DOTs interviewed had implemented the following types of practices to provide information findability:
• Deployment of content management systems for construction project plans.
• Deployment of content management/collaboration software for sharing of corporate, business unit, and team content.
• Implementation of data warehouses and geographic information system (GIS) portals to provide centralized access to structured data resources.
• Digitizing paper documents for archiving and retrieval.

Successful Practices for Enhancing Findability

Successful practices for findability (as selected by the interview subjects) were as follows:
• Colorado DOT. The Colorado DOT identified two initiatives: their Online Transportation Information System (OTIS) website and their Document Retention Program (DRP).
  – OTIS provides a single point of access for roadway data using a GIS platform.
  – The DRP applies "lean" business process improvement methodologies for document retention to meet legal, regulatory, or audit requirements. The department is developing an implementation plan for improving consistency and efficiency of the program.
There are more than 40 document retention coordinators distributed across the different functional areas of the department. All of the content will be stored on ProjectWise (engineering documents), SAP Content Server (financial documents), or SharePoint (everything else). Each document will be assigned to a retention schedule, which will serve as a classifier. Retention schedules are being refined and streamlined. Metadata is being defined but emphasizes document management rather than search.
  – The Colorado DOT plans to use the Fast Search and Transfer (FAST) search engine to provide federated search capabilities across the three repositories. The DOT is exploring use of the Perceptive product for supporting document intake workflow. The DOT also has a governance committee for document management systems. They have developed suggested standards and user-friendly guides to available document storage options.
• Illinois DOT. The Illinois DOT identified two initiatives: a SharePoint implementation and their data warehouse program.
  – The Illinois DOT was an early adopter of SharePoint, and its use has become "part of the culture." Basic governance is in place for defining new content types, defining metadata
elements, and ensuring searchability of content across SharePoint sites. There are more than 5 million documents in the system. IDOT implemented SharePoint to cut down on duplicate copies, eliminate multiple potentially conflicting versions of documents, and provide a business platform for content management, team collaboration, and workflow. Built-in workflow is a key factor for success. IDOT uses an add-on tool for workflow design and a companion tool from KnowledgeLake for document capture. An initial effort with this product was conducted to capture content related to American Recovery and Reinvestment Act projects.
  – The Illinois DOT has established a data warehouse program that includes extract-transform-load processes for capture of information from a diverse set of legacy systems, and a business intelligence portal providing access to several subject area data marts including construction, financial information, payroll, safety, and human resources. The DOT does not use a taxonomy for SharePoint but is currently working to develop one as part of a records management system implementation. This is a collaborative effort involving the Bureau of Information Processing, the records coordinator, and the Illinois DOT's library.
• Mississippi DOT. The Mississippi DOT did not identify a single successful effort, but rather offered information on their overall approach to content management utilizing three systems: Microsoft SharePoint, Bentley ProjectWise, and EMC ApplicationXtender.
  – SharePoint was implemented at the Mississippi DOT in 2003 and is used for human resources content, transportation commission documents, standard operating procedures, e-forms, and business unit collaboration.
  – ProjectWise was implemented more recently and is used for management of construction project files (including CAD plans).
  – ApplicationXtender is an older product that is used for scanning and archiving documents for records management. Documents include permits, financial records, law enforcement records, and project-related files.
  – The Mississippi DOT uses the native SharePoint search engine for federated searches including SharePoint, ApplicationXtender (AX), and files stored on shared network drives. Goals for implementation of these systems included improved searchability and access, decreased paper file storage, and support for internal and external collaboration.
  – The Mississippi DOT has established metadata standards and has defined clear roles for content management. They have an enterprise content management (ECM) team that meets regularly to update governance and ensure standardization across the agency.
• Virginia DOT. The Virginia DOT was focusing on improving management of 1,400+ policy and procedure documents to ensure that DOT staff can find the most recent, authoritative versions of these essential corporate documents.
  – The Virginia DOT was implementing a new tool that will allow for simultaneous publication of updated documents in both SharePoint (used for the agency's intranet) and the external website. The Virginia DOT's Knowledge Management Office was taking the lead for this initiative, handling metadata development and assignment, including controlled vocabularies for document type and subject key words.
  – The Virginia DOT has also implemented DeepWeb, a federated search tool that allows simultaneous searches of their library catalog, several subscription databases, the Virginia DOT's Twitter feed and YouTube channels, the 50-state DOT Google search (which searches content within state DOT public-facing websites), TRID, and other national and international transportation information sources. This capability is working well, and the Virginia DOT is beginning to consider how it might be integrated with the DOT's SharePoint site.
• Washington State DOT.
The Washington State DOT identified several initiatives of note, including development of an agency data catalog and metadata repository ("DOTS"), a data
warehouse, a physical library staffed by professional librarians, use of SharePoint for document sharing and collaboration, use of ProjectWise for engineering document sharing, and deployment of a GIS tool providing access to widely used geospatial data. The DOT also noted that they are piloting a tool called Varonis for security management and file utilization tracking. Varonis has capabilities for automated content classification based on pattern and dictionary-based content matching. (The agency subsequently reported that it did not implement this tool, in part due to its cost.)
  – "DOTS" is a custom application that harvests metadata from the Washington State DOT's Information Technology (IT)-managed databases. DOTS maintains definitions for thousands of business terms, and maps these terms to data elements. Currently, DOTS has a limited text search function. Planned improvements will add the capability to search based on use of synonyms. A single full-time equivalent position is devoted to maintaining DOTS; other staff support is provided as needed. The data warehouse provides a single, authoritative source of integrated data to support core business functions, providing answers to questions that would have previously been prohibitively time-consuming to answer. Data are stored in SQL Server.
  – The Washington State DOT is currently replacing their existing query/reporting tool with IBM Cognos. The DOT is served by six professional librarians staffing four physical libraries (main, materials lab, terminal engineering, and vessels engineering). The first two libraries are affiliated with the Washington State Library; the latter two libraries support the Washington State Ferries division. Librarians assign keywords and Transportation Research Thesaurus (TRT) index terms to research reports.
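A synonym-aware lookup of the kind planned for DOTS can be sketched in a few lines. The business terms, synonym pairs, and data element names below are invented for illustration and do not reflect the actual DOTS implementation:

```python
# Hypothetical glossary: business terms mapped to the data elements they describe.
glossary = {
    "average daily traffic":    ["TRAFFIC.ADT_VALUE"],
    "pavement condition index": ["PAVEMENT.PCI_SCORE"],
}

# Hypothetical synonym table mapping user vocabulary to preferred glossary terms.
synonyms = {
    "adt":            "average daily traffic",
    "traffic volume": "average daily traffic",
    "pci":            "pavement condition index",
}

def lookup(term):
    """Resolve a query to its preferred term, then return matching data elements."""
    key = term.lower()
    preferred = synonyms.get(key, key)
    return glossary.get(preferred, [])

print(lookup("ADT"))
# -> ['TRAFFIC.ADT_VALUE']
```

The synonym table plays the same role as the preferred-term mappings in a thesaurus: users can search in their own vocabulary while the catalog stores each definition once under a single authoritative term.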
Information Management Practices and Goals

Additional findings from the state DOT interviews are summarized in the remainder of this section.

Where Do These DOTs Store and Manage Content?

Identified storage locations are:
• Database servers
• Dedicated video servers
• Shared network drives
• Local hard disks on employee desktops and laptops
• External websites
• Intranet sites
• Cloud storage locations
• Project websites (internal and external)
• Physical libraries

Identified information repositories and content management systems are:
• SharePoint (agency-wide, departmental, and team documents)
• ProjectWise (engineering documents)
• Falcon (design plans)
• Oracle content management system/Stellent
• OpenText/ECM LiveLink
• DocuWare (records management solution)
• EMC ApplicationXtender (AX)
• GIS portals
• SAP Content Server
• Data warehouse
• Web-based content management system (e.g., Plone)
• Custom applications (e.g., for right-of-way management, project management)
• Social media (cloud) (e.g., YouTube, Twitter, Facebook, Flickr)

What Other Tools Do DOTs Use for Information Management?

Search tools used are:
• Google Search
• SharePoint Search
• FAST search
• Other search tools (e.g., those built into document and content management systems)

Query and reporting tools consist of:
• IBM Cognos

Catalogs/metadata repositories used are:
• Library catalog software (EOS)
• An enterprise metadata repository (at the Virginia DOT)
• A data catalog (at the Washington State DOT)

Other tools used are:
• KnowledgeLake Capture for SharePoint
• NINTEX Workflow Designer (a SharePoint add-in)
• Informatica (for data transformation)

What Approaches Are Used for Metadata and Categorization of Content?

The approaches used are:
• Keywords for subject and document type, assigned by Knowledge Management or Library staff for policies and procedures (Virginia DOT).
• Use of standard link fields across systems (e.g., vendor ID, project ID) (Illinois DOT).
• Taxonomy for records management system (Illinois DOT, under development).
• Definition of standard content types and metadata elements for SharePoint.
• Federal Geographic Data Committee metadata for GIS data sets (descriptive metadata; limited value for search).
• Standard folder organization structures and metadata elements (project ID, location) for ProjectWise.
• Business concept definition management; association with data elements (Washington State DOT).

What Findability-Related Business Goals Do They Have?

• Enable improved discovery in response to litigation, audits, and Freedom of Information Act (FOIA) requests.
• Support core business needs (e.g., asset management).
• Ensure recovery of valuable documents in the event of a disaster or hardware failure.
• Reduce duplication by providing centrally accessible repositories, using links rather than copies, applying version control, and so forth.
• Avoid information loss due to lack of organized information management.
• Protect the investment in costly plans and studies and make sure they can be found.
• Facilitate getting new employees up to speed (e.g., the ability to find and review background documents relevant to a position).
• Promote data sharing by ensuring that people know what data are available and how to access them.

What Types of Needs Are Recognized?

Standards, policies, and governance-related needs are:
• Develop agency-wide standards for managing construction project-related content and reducing the time and cost of finding information when claims are filed.
• Provide clear guidance on where different types of content can and should be stored.
• Ensure electronic content is text-readable and searchable.
• Implement a common classification approach across content types and storage locations.
• Develop metadata solutions that recognize the wide variety of heterogeneous content types.
• Obtain stronger endorsement and management support for coordination of application and data architecture to promote data re-use and ensure integration across systems.
• Put in place stronger information governance to accomplish and sustain findability improvements.

Education and training needs are:
• Make the case for a more disciplined approach to content management, without which it is difficult to implement and enforce strict governance policies.
• Train users on how and where to search.
• Improve awareness of when full text search is sufficient and when a more structured approach to metadata is needed.
• Provide education to ensure that staff understand the importance of information management (including unstructured content) and the role of information owners.

Content management capability needs are:
• Provide reliable and persistent electronic storage for content.
• Automate workflow for life cycle management of content.
• Digitize archival paper records; reduce or eliminate paper generation.
• Implement content/document management systems to provide electronic access (the alternative being paper files in boxes or scanned files on CDs).
• Manage access to content based on roles (e.g., tied to Active Directory) across different repositories.

Search capability needs are:
• Support searches for specific documents as well as searches by topic area to provide a satisfying user experience (e.g., "Google-like" or "Amazon-like") and provide meaningful search refiners (e.g., date, document type, content classification).
• Build standard search/retrieval services into line-of-business applications.
• Reduce the number of places to search for information; provide a single search interface to look across multiple repositories (federated search), including across SharePoint, the library catalog, data repositories, and file servers.
• Provide spatial search capabilities across content types.
• Provide an approach to finding documents stored on external websites from internal search tools.
• Identify where full text search capability is sufficient and where the additional effort to invest in taxonomy development and tagging is worth the cost.
• Improve email and archived records findability.
• Improve web content organization and findability.

2.4 Findability Practices—Other Organization Types

Methodology

To complement the state DOT findability practice examples, the research team drew on its experience working with a range of organization types over the past decade to develop six findability improvement examples. A seventh example was drawn from the literature review. Selected follow-up calls to the organizations were made to obtain updated information. The seven organizations were:
• The U.S. Government Accountability Office (U.S. GAO).
• Battelle (a 5,000-employee engineering and science consulting organization).
• The Wyndham Hotel Group.
• Boehringer Ingelheim (a major international pharmaceutical company with more than 40,000 employees).
• A major industrial conglomerate with 32,000 employees.
• First Wind (a 200-employee renewable energy company).
• Motorola (a multinational telecommunications company with more than 20,000 employees).

These cases represent a variety of organization types, approaches, technologies, types of content, and applications for findability improvement. A standard template was developed to document successful practices, including the following types of information:
• Practice description.
• Business case.
• Scope of content included.
• Organizational units leading and supporting the effort.
• Technologies used for information storage, access, and search.
• Metadata and classification schemes.
• Responsibilities and resources for tagging/indexing, vocabulary management, and search monitoring.
• Information governance policies and processes.
• Reported benefits.

Summary of Findings

• Organizations.
The organizations ranged in size from 200 employees to more than 40,000 employees. Organization types included a government agency, an energy start-up, an industrial manufacturer, a biotech firm, and an engineering and science consulting company.
• Findability practices. Examples were split between a focus on enterprise search capabilities and a focus on content management functions.
• Scope of content included. Target content types included project information, scientific literature, and web content. Applications targeted both internal and external content.
• Organizational responsibilities. A variety of units had primary responsibility for both the content being searched and the overall project to improve search. One common theme was
the need for collaboration between central groups and distributed groups (e.g., field offices) and the need for a dedicated team working together to plan and implement improvements.
• Technologies. Content and document management platforms included SharePoint, Documentum, Adobe Experience Manager, HP Autonomy, and FatWire (acquired by Oracle). Search engines included those built into these content management platforms, Lucene/Solr, Verity, and AskMe. Taxonomy management software included Data Harmony, SchemaLogic, and Concept Searching. The use of text analytics software (Teragram [SAS], Concept Searching, Inxight, and others) was a major factor in improving the overall quality of document tagging as well as reducing the cost and time required for creating metadata.
• Metadata. Most of the efforts involved creating metadata standards and well-structured vocabularies (including keywords). Several of the examples illustrated the use of taxonomies, particularly faceted taxonomies.
• Information governance and policies. These examples illustrated the importance of having a well-defined policy and process for adding metadata to content. Applications of broader information management policies were not explored.
• Benefits. Certain shared benefits were exhibited by all or most of the projects, with some specific variations. The main benefit areas were:
– Improved search and improved quality of metadata.
– Reduced cost and time for search and for creating metadata.
– Reduced cost for business processes, including customer self-service, eProcurement, online training, and others.
– Reduced need to re-create documents and removal of duplicate documents from current repositories.
– Increased value from existing information resources.
– Value added from new applications and information retrieval capabilities built on top of search.
• Success factors and lessons.
Success factors and lessons learned for each example are summarized in Table II-2.

Implications for State DOT Findability

The example search and content management practices assembled generally were more advanced than those in place within the DOTs interviewed for this project. Although many of these organizations are larger (and, in some cases, less financially constrained) than the typical state DOT and have different types of needs, many of the practices described could potentially be implemented within a DOT environment. For example, many DOTs could develop or adapt existing taxonomies and use taxonomy terms to tag information resources either manually or semi-automatically. DOTs could also implement faceted search capabilities to improve users' ability to navigate through available content. DOTs could also devote additional resources to refinement of search tools based on user feedback.

The examples illustrate useful approaches and lessons that are applicable to the design of findability improvements in any organization. By looking at search and information initiatives in a variety of environments, it is possible to get a deeper understanding of search and what factors lead to success. A number of general lessons can be seen in these examples:
• First, improving search clearly takes much more than buying a new search engine. Meaningful improvement almost always requires taking a deeper look at how search is being used and at the entire information life cycle. It is important to take a comprehensive, strategic perspective that considers integration across all the parts of the organization involved with information access.
U.S. Government Accountability Office (U.S. GAO)
• Good quality metadata is essential for good search.
• Efficiency of metadata tagging can be improved with a hybrid of human and text analytics approaches.
• Developing good rules for auto-categorization is essential for success.
• Good auto-categorization rules require a combination of subject matter expertise and library science expertise.
• It is important to understand how information management software works in different environments.

Major Industrial Conglomerate
• Need for a comprehensive approach to findability aligned with business strategic objectives.
• Need to recognize that taxonomies require ongoing maintenance and refinement.

First Wind
• Essential to get a complete understanding of processes for content creation and retrieval.
• Need a good understanding of search technology functions and capabilities, especially within SharePoint.

Wyndham Hotel Group
• Importance of integrating different perspectives from multiple teams.
• Collaborative approach combining consultant expertise and in-house business understanding.

Battelle
• Important to match the level of detail in a taxonomy to the information needs (do not over-engineer).
• Development of an integrated search capability for documents, people, and external technical information provided business value.
• Making the link to impacts on critical business processes is essential to get support for findability improvements.
• A hybrid tagging approach is a powerful way to assign metadata for improved search.

Boehringer
• Traditional keyword search was inadequate.
• Use of faceted search to filter results improved findability.
• Text analytics works on highly scientific and technical literature as well as general semi-structured office documents.

Motorola
• Important to do content management system development and website redesign in parallel.
• Important to recognize the many ways a taxonomy can be leveraged.
• Important to consider three different aspects of taxonomy implementation (taxonomy for navigation and search, taxonomy management, content tagging).

Table II-2. Lessons and success factors from findability practice examples.
• Second, for many search applications, having high quality metadata that is based on well-designed taxonomies is essential for success. In most cases, the more metadata added, the better the search experience will be. Text analytics software applications have the potential to improve metadata quality and partially automate the process of assigning metadata to content.
• Third, implementation of faceted search based on well-defined facets is a successful practice.
• Fourth, search can be improved incrementally without undertaking a large and expensive information initiative. For example, simply assigning subject matter experts and/or librarians to tag documents as Best Bets can result in gradual improvements to the overall search experience.
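The Best Bets approach described above can be sketched in a few lines: curated results for known queries are shown ahead of the engine's organic results. This is a hypothetical illustration; the query strings and document titles are invented, not drawn from any agency's search configuration.

```python
# Sketch of a "Best Bets" layer: subject matter experts curate a small map of
# known queries to authoritative documents, which are surfaced first.
# All entries below are invented for illustration.
best_bets = {
    "overtime policy": ["HR Policy Manual, Chapter 7 (authoritative)"],
    "bridge design": ["Bridge Design Manual (current edition)"],
}

def search_with_best_bets(query, organic_results):
    """Prepend curated 'best bet' documents to the organic result list."""
    curated = best_bets.get(query.lower().strip(), [])
    # Drop organic duplicates of curated entries so nothing appears twice
    return curated + [r for r in organic_results if r not in curated]

search_with_best_bets("Overtime Policy", ["Old overtime memo", "Timesheet FAQ"])
```

Because the curated map is maintained by hand, this technique delivers incremental improvement with no changes to the underlying search engine, which is exactly the appeal noted in the fourth lesson.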
Chapter 3 Framework for Improving Findability

3.1 Framework Development Process

Based on the information gathering activities, the research team developed a preliminary framework for improving findability at DOTs. This framework recognized the complexity of improving findability in a DOT and the need for a multi-pronged approach involving:
• Understanding information seeking behaviors and needs.
• Mapping the information landscape (i.e., identifying where different types of information are stored).
• Understanding the information management life cycle to determine where and how to improve practices for metadata assignment, designation of authoritative documents, and cleanup of redundant and outdated content.
• Developing appropriate solutions integrating information management, search, and classification/metadata elements.

The framework was refined during the pilot activities, and again during the creation of the final guidance document. New elements were added to identify agency motivations (business drivers) for pursuing improvements to findability, and for the overall approach to implementation.

3.2 Final Framework

Major Elements of the Framework

An important insight drawn from the information gathering and the pilot, reflected in the final framework, is that improving findability of transportation information within a DOT is not something that can be done in a single project or initiative. Rather, a set of techniques, tools, and organizational functions can be implemented or strengthened over time and applied to meet a set of targeted business objectives. Each agency can approach findability improvement with different emphasis areas or implementation sequences. For example, some agencies may want to begin by focusing solely on information management improvements to clean up and improve the organization of content on file drives, email systems, and content management systems.
Some may want to focus on improving the performance of their existing intranet or content management system search tools by automating metadata assignment using text analytics techniques. Others may want to implement new enterprise search tools that index content across multiple repositories. Especially when paired with text analytics software that can automate the process of metadata creation and improve metadata quality, search tools have the potential to substantially improve findability in an agency.

All of these techniques for improving findability require resources and focused attention. Without a solid grounding in specific business needs, and a clear demonstration that there is
a solution that can make a noticeable difference, it is unlikely that resources for findability improvement will be allocated and sustained. A clear focus on meeting business needs and matching solutions to needs is critical to making progress.

The final framework for improving findability is illustrated in Figure II-4. The top-level element, business drivers, covers the major reasons why agencies would be motivated to implement findability improvements. The middle elements, planning and implementation, cover (1) how to establish requirements for improvements so that they fit with agency business needs and information resources, and (2) how to support continuous improvements to findability through a phased implementation approach and appropriate management functions and staff capabilities. The planning and implementation elements rest on three pillars representing key techniques and practices from which agencies can draw as they develop their implementation strategy.

Business Drivers

The top portion of the framework identifies four key motivations for pursuing findability improvements. These motivations were identified as part of the information gathering activities for the project, and were reinforced during the pilot.
• Reduced time spent searching for information. Employees spend significant amounts of time trying to find information. A paper by Cleverley (2015) reported that a review of several surveys spanning different business sectors found that "24% of a business professional's time is spent looking for information." Reducing the amount of time it takes to track down available information makes more time available for productive work.

Figure II-4. Framework for DOT information findability.
• More re-use of information, less re-work. Employees who cannot easily ascertain whether something already has been done that they could build upon may end up "reinventing the wheel." The resulting re-work diminishes the value of agency investments to develop reports, studies, data sets, etc.
• Ensure use of authoritative information. Lack of ability to find the most current, authorized versions of documents or data sets is a common issue at many organizations, including DOTs. Use of outdated information can create risks for the agency, including inconsistent or improper implementation of agency policies and procedures, which can impact timely project delivery and the consistent use of proven, effective design practices.
• Efficient response to FOIA requests and claims. DOTs face an increasing number of public information requests, which can consume substantial amounts of time to fulfill. Similarly, responding to construction claims may require compilation of detailed records, including emails. Reducing the time needed to compile this information is an important motivation for improving findability.

Planning for Findability Improvements

The second portion of the framework covers the steps needed to target and design improvements to match the needs and the information landscape of the organization.
• User needs. The starting point for any findability improvement is an understanding of information needs, current search behaviors, and pain points. This understanding can be obtained through online surveys, focus groups, interviews, and, to some extent, a review of existing search logs.
• Information landscape. Once information needs and search behaviors are understood, it is important to develop an information landscape, or "map of the territory," with respect to information repositories and their contents.
This information landscape provides the basis for identifying which repositories and which content types should be targeted. Once targets are established, it is useful to obtain a picture of how information is created, updated, culled, and archived, including who is involved, what the processes are, and so on. Based on this understanding, opportunities and constraints can be identified for improving information management practices, search, and metadata.

Implementation of Findability Improvements

The third portion of the framework covers a general approach to implementing findability improvements and identifies key implementation activities. The implementation element of the framework recognizes that it is not possible to address all of an agency's findability needs with a single solution or project. The range of needs across the agency will require multiple solutions. An incremental approach is recommended, grounded in the initial development of a vision that guides future activities.

The Road Map

A seven-step road map is suggested, involving:
1. Establishing an architectural vision for findability involving shared information repositories, common metadata elements, common terminology, master data management, and enterprise search capabilities.
2. Identifying a focus area for improvement.
3. Conducting an assessment.
4. Identifying candidate improvements.
5. Implementing "quick wins" (improvements that can be easily accomplished with existing resources).
6. Implementing a pilot improvement that is consistent with and supports the architectural vision.
7. Expanding and formalizing the pilot.

The architectural vision provides a big-picture view of how the agency will pursue findability improvements. It defines a set of guiding principles and cross-cutting resources (e.g., technology tools, metadata standards) that will be applied and refined over time. The vision also identifies priority needs. A deliberate process of developing a vision that involves key players in the organization is important to building an understanding of how different activities must fit together. This process can be integrated within a DOT's overall business planning or information management strategic planning efforts. With a vision and strategy in place, the agency can implement incremental improvements, each of which may focus on a particular business area or type of content. DOTs can build their organizational and technology capabilities with each initiative.

To make significant progress in improving findability, it is necessary early on to obtain management understanding of what is needed and why (i.e., how improved findability enhances management functions). Once this understanding is established, a collaborative approach to improvement can be pursued involving existing units that are concerned with improving access to information for decision making (e.g., IT, data management, library, records management, intranet manager, engineering document management system owner, collaboration system owner, etc.). Creating a structure for this collaboration, or identifying an existing team with the right membership, also is important to establishing a focal point for taking action in a coordinated way.
Several operational functions need to be considered for supporting findability, including:
• Establishing policies and standards. A root cause of difficulties with finding information is the lack of disciplined and consistent practices across organizational units for naming conventions, storage locations, metadata assignment, and so forth. Agencies should anticipate the need to establish and facilitate the implementation of clear policies and standards for expected information management behaviors.
• Putting in place training and change management functions. Training and change management are important both for the introduction of new standard practices and for the adoption of content management systems and other tools.
• Providing ongoing operational support. This includes management of search and related tools, assignment of metadata through manual, semi-automated, or automated means, and management of data integration and synchronization processes. Availability of staff resources to manage, monitor, and improve search over time is an important success factor for a findability solution. Agencies need to plan for and resource these functions, keeping in mind that specialized skill sets will be required.

Findability Techniques

A range of techniques can be used to improve the management of information, provide better search and navigation tools, and build the standard terminology and metadata needed to support findability. Any given findability improvement may involve a combination of three types of techniques, which are summarized briefly in this section. Volume I of this research report provides more detailed descriptions.
Information Management Techniques

Findability techniques related to information management include:
• Document management systems and content management systems. Use of these systems offers a more structured and contained environment for content than the "Wild West" of shared file drives and email attachments.
• Content storage and cleanup policies and practices. These policies and practices provide consistency with respect to where different types of content can be found, as well as assurance that outdated or obsolete documents are removed or archived.
• File naming conventions. Using naming conventions allows users and information management staff to understand file contents without needing to open the files.
• Scanning practices. Well-defined scanning practices ensure that files are text searchable.
• Security and access controls. Effective security and access controls provide adequate protection of sensitive information without imposing unnecessary barriers to search across repositories.

Search and Navigation Techniques

Findability techniques related to search and navigation include:
• Enterprise search. These tools support search both within individual information repositories and across different repositories.
• Faceted navigation. These interfaces allow a user to explore a body of information resources by selecting from a set of filters that restrict which resources appear on the result list.
• Auto-suggest. This search tool capability improves search performance by suggesting standard terms that match a user-entered search string.
• Search monitoring and tuning. This technique enhances search performance by targeting problem areas observed through review of search logs.
• Search-based applications. These are software applications in which a search engine platform (rather than a database) is used as the core infrastructure for information access and reporting.
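The faceted navigation technique listed above can be sketched as follows. This is an illustrative example only: the document records, facet names, and values are invented, and real implementations would sit on top of a search engine index rather than an in-memory list.

```python
# Minimal sketch of faceted navigation: each selected facet value restricts
# the result list, and per-value counts drive the refiner display
# (e.g., "district: Bristol (2)"). All data below is invented.
from collections import Counter

documents = [
    {"title": "Bridge inspection log", "doc_type": "log", "district": "Bristol"},
    {"title": "Paving invoice", "doc_type": "invoice", "district": "Richmond"},
    {"title": "Materials test results", "doc_type": "report", "district": "Bristol"},
]

def apply_facets(docs, selections):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs
            if all(d.get(facet) == value for facet, value in selections.items())]

def facet_counts(docs, facet):
    """Counts shown next to each refiner value in the interface."""
    return Counter(d[facet] for d in docs)

# Selecting the "Bristol" district refiner narrows the list to two documents.
results = apply_facets(documents, {"district": "Bristol"})
```

Because each facet is an independent filter, selections combine naturally: adding a document type refiner to the district refiner narrows the list further, which is the navigation behavior the technique describes.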
Metadata and Terminology Development Techniques

Findability techniques related to metadata and terminology development include:
• Standard agency metadata elements and content types. Adoption of standard metadata schemes provides consistency across search interfaces and facilitates implementation of federated search and service-oriented models for discovery of information resources.
• Standard classifications. Lists of values for common elements (e.g., organizational units, project phases, work types, material types, or infrastructure asset types) can be standardized.
• Terminology resources. This technique involves adapting or building taxonomies, synonym lists, and so forth, and integrating them into search tools to improve their effectiveness.
• Automated metadata creation. This technique involves the application of text analytics to automate or assist assignment of metadata elements.
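The automated metadata creation technique can be illustrated with a simple dictionary-based classifier of the kind text analytics tools build on. This is a hypothetical sketch: the categories and term lists are invented, not an actual agency taxonomy, and production tools add far more sophisticated rules.

```python
# Sketch of dictionary-based auto-classification: a document is assigned a
# category when its text contains enough of that category's terms.
# Categories and term lists are illustrative only.
import re

category_terms = {
    "Construction": ["contractor", "change order", "inspector"],
    "Materials": ["asphalt", "concrete", "test result"],
}

def auto_tag(text, min_hits=1):
    """Return categories whose terms appear at least min_hits times."""
    tags = []
    for category, terms in category_terms.items():
        hits = sum(len(re.findall(r"\b" + re.escape(t) + r"\b", text, re.I))
                   for t in terms)
        if hits >= min_hits:
            tags.append(category)
    return tags

auto_tag("The inspector approved the contractor's asphalt test result.")
```

Raising `min_hits` trades recall for precision, which mirrors the rule-tuning effort that the GAO example in Table II-2 identifies as essential for good auto-categorization.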
Chapter 4 Pilot Demonstration

4.1 Pilot Objectives

A pilot demonstration was undertaken to:
• Test and validate concepts and methods for improving findability and relevance of transportation information.
• Identify areas for refinement.
• Demonstrate the effectiveness of findability improvements.
• Provide a documented case study application that could be used to strengthen the value of the report.

4.2 Identification of Pilot Agencies

The following criteria were identified for selection of pilot agencies:
• Availability of an agency point person who supports the effort and can marshal the necessary resources to support it.
• Agency level of interest in enhanced search.
• Extent to which existing information repositories and search tools reflect "typical" DOT practice (to maximize the relevance of pilot results to other agencies).
• Ability to provide the necessary access to agency systems and search tools (direct or via agency staff) to enable the research team to implement a search improvement.
• Availability of target users to participate in interviews and the testing process.
• Availability of information management staff (library, data management, website management, etc.) to support the effort.
• Existence of a usable controlled vocabulary and/or subject matter taxonomy as a starting point (although a lower priority, the existence of such a vocabulary or taxonomy could give the effort a "leg up").

Based on these criteria, and on indications of potential interest from panel members, the research team contacted three agencies to explore their potential participation in the pilot for NCHRP Project 20-97: the Washington State DOT, the Virginia DOT, and the Wisconsin DOT. A project briefing document was provided to each agency describing the objectives and scope of NCHRP Project 20-97 in general and the pilot in particular. The Wisconsin DOT declined to participate due to internal resource constraints.
The Washington State DOT and the Virginia DOT expressed a strong interest in participating. Follow-up telephone conversations were held with staff in both agencies to discuss potential pilot scopes that would both be helpful for the agencies and meet the project objectives. The rest of this section summarizes content from these telephone conversations.
The Virginia DOT

Background: The Virginia DOT's Knowledge Management Office is responsible for ensuring findability of the agency's mission-critical content through enhancement of information management and classification methods. The DOT uses SharePoint 2010 as its corporate intranet platform and for document sharing and team collaboration. The Virginia DOT uses the FAST search tool that is embedded within the SharePoint environment. The agency has deployed a corporate document repository on SharePoint and continues to improve classification and management of these documents. The Virginia DOT also is in the process of developing a high level taxonomy for describing its content, and plans to build out different elements over time.

Needs: The Virginia DOT was interested in addressing several priority findability issues. Specifically, the DOT sought to:
• Improve management of active construction project documents, including construction inspector logs and notes, material test results, contractor invoices, certified payroll submittals, and so forth. Practices for managing this content varied across districts, with some districts utilizing SharePoint and others relying on folder structures on shared drives.
• Improve the likelihood that a search for a particular document would return a single authoritative source and distinguish authoritative documents as such.

The Washington State DOT

Background: In 2015, the Washington State DOT established an Enterprise Information Governance Group and adopted eight principles for data and information management. At the time of the interview, this group was discussing next steps toward improving enterprise content management (ECM) at the agency. Several distinct content management and collaboration systems were in use, including Oracle ECM, LiveLink, Bentley ProjectWise, and Microsoft SharePoint.
The Washington State DOT also had recently completed two projects with students from Kent State University: one focused on improving findability of information in support of agency responses to public disclosure requests (PDRs), and a second focused on developing an asset taxonomy. The PDR project recommended a core metadata structure and developed high-level specifications for each of the metadata elements. The core metadata structure included the following elements:
• Title
• Organizational Unit
• Region/Division
• Date Created
• File Type
• Content Type
• Abstract-Description
• Transportation Keywords
• Transportation Asset
• Project ID
• Project Phase
• Business Function/Records Class

Work also was done to test the build-out for two elements: Content Type and Transportation Keywords. The Content Type vocabulary involved integration of a broad set of categories for records classification at the statewide and DOT levels. The Transportation Keywords work provided a full build-out for the "Transportation Asset" element.
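As an illustration only (the report does not prescribe an implementation), the recommended core metadata structure could be expressed as a simple record type. The field types, optionality, and sample values below are assumptions for the sketch, not part of the PDR project's specifications.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CoreMetadata:
    """Core metadata elements recommended by the PDR project.
    Field types and defaults are illustrative assumptions."""
    title: str
    organizational_unit: str
    region_division: str
    date_created: str          # e.g., ISO 8601 date string
    file_type: str             # e.g., "pdf"
    content_type: str          # value from the Content Type vocabulary
    abstract_description: str
    transportation_keywords: List[str] = field(default_factory=list)
    transportation_asset: str = ""
    project_id: str = ""
    project_phase: str = ""
    business_function_records_class: str = ""

# Hypothetical example record (values are invented)
doc = CoreMetadata(
    title="Daily Work Report",
    organizational_unit="Construction Office",
    region_division="Northwest Region",
    date_created="2016-03-01",
    file_type="pdf",
    content_type="Daily Work Report",
    abstract_description="Inspector notes for paving operations",
    transportation_keywords=["pavement", "asphalt"],
    transportation_asset="Roadway",
    project_id="XL-1234",
    project_phase="Construction",
)
```

A consistent record type like this is what makes faceted search and migration across repositories tractable: every document, whatever its source system, carries the same searchable fields.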
Needs: The Washington State DOT had an interest in leveraging the work completed to date in order to move forward with implementation of a findability solution in support of PDR information searches and potentially a broader set of findability use cases. Specifically, the agency wanted to validate the content type and asset taxonomies that were developed with users, improve them as needed, investigate practical approaches to creation and assignment of metadata, and develop strategies to integrate use of metadata with current (and future) content management solutions and search tools. Agency staff noted that, currently, each content type had a different set of metadata elements with multiple taxonomies in use, so a strategy for migrating or mapping to a consistent set of metadata elements was needed.

Agency Selection

An important objective for the pilot was to demonstrate application of the findability framework and demonstrate benefits from findability improvements. Achievement of this objective required a fairly intensive effort at a single agency. The research team recommended focusing on a single agency but involving the second agency in a review of results and discussion of transferability. Both agencies met many of the defined selection criteria with respect to level of interest and willingness to provide access to systems and staff support. The Virginia DOT was selected to be the pilot agency for the following reasons:
• A focus on construction project document findability is more tractable in the context of a brief pilot effort than the broader and more complex issue of findability in support of PDRs.
• The technology environment for content management and search at the Washington State DOT was in flux, and less typical than that at the Virginia DOT (SharePoint and the FAST search engine). Products of a pilot at the Virginia DOT were therefore perceived as likely to have greater potential for re-use in other DOTs.
4.3 Summary of Pilot Activities

The pilot involved the following steps:
• An assessment of findability needs based on interviews with relevant stakeholders and identification of specific scenarios on which to focus.
• Assembly of a body of content to be searched based on the selected scenarios, and a review of relevant content types and storage locations.
• Identification of a standard set of content categorization elements (facets) that would allow users to search or navigate to content of interest.
• Development of a semantic model (a set of classification categories and associated terminology) for describing the content collection based on text mining and review of available agency data sources (e.g., project lists, standard pay item lists).
• Development and automation of rules for auto-classification of the content using a commercial text analytics package.
• Design of a search/navigation solution that allowed a user to enter search terms and refine the search based on the various facets.
• Evaluation of recall and precision for different types of searches, comparing use of the solution that was developed to a "plain vanilla" full text search of the same body of content.
• Evaluation of transferability of the auto-categorization rules to similar content obtained from the Washington State DOT.

The research team initially explored conducting the pilot in-house at the Virginia DOT, but was not able to do so because of hardware, software, and IT staffing constraints. The text analytics software vendor selected for this project (Smartlogic) agreed to host a cloud environment and
run the indexing and categorization processes using the specifications and rules developed by the research team. Because the solution created as part of this project is based on specific commercial platforms, it is intended as a demonstration of capabilities only; it is not a packaged software product intended for distribution. The rules developed for auto-classification are documented in detail in Annex 1 to this volume of NCHRP Research Report 846, and the ontology developed has been provided to NCHRP as one of the research products of NCHRP Project 20-97, to be made available on request. Table II-3 summarizes the specific activities of the pilot. The pilot project is described in greater detail in the Volume II Appendix of this research report.

4.4 Level of Effort for the Pilot

The pilot was implemented by a team with the following roles:
• Business lead. This individual had expertise in DOT business processes, structure, and data.
• Text analytics lead. This individual had expertise in information architecture, metadata, and text analytics.
• Text analytics/taxonomy specialists. These individuals focused on ontology and rule development.
• Business analysts. These individuals focused on content harvesting, content analysis, content conversion, and solution testing.

Table II-3. Pilot summary.

Step 1. Identify Needs

Appoint a Lead
• The Virginia DOT Knowledge Management Office Director was the lead for the initiative. She works closely with agency IT staff to improve the use of SharePoint as a platform for managing both corporate documents and work group documents.

Identify the Emphasis Area
• The Virginia DOT wanted to focus on findability of construction project information. Practices for information storage and organization were not consistent across districts. Response to public information requests and assembling information related to construction claims were pain points.
An engineering content management system was under development, but it was not going to incorporate pre-existing documents.

Identify Stakeholders
• Knowledge Management/Library staff who offer services to help staff find information.
• Central Construction Office staff involved in developing tools and standards and conducting statewide analysis of construction costs and performance.
• District Construction staff involved in day-to-day management of construction projects (e.g., construction manager, area construction engineer, contracts manager, project controls engineer, technology resource engineer).
• IT staff responsible for development and support for content management and collaboration software (in this case, SharePoint) and associated search capabilities.
Step 2. Define the Target Scope

Select Target Needs
• After analysis, four categories of needs were identified to be addressed in the pilot:
  - Find a specific known document for a project (e.g., an estimate) using a variety of search criteria.
  - Find/review all available documents for a project (e.g., for a FOIA request).
  - Search across projects to find projects with a specified item, material, or construction technique.
  - Research reasons for delays and changes.

Identify Target Content
• Stakeholders identified a variety of construction project content types and provided examples.
• For purposes of the pilot, four content types were selected based on likely business value and relevance to the selected four types of needs: daily work reports, project work orders (change orders), source of materials forms, and project profile forms.

Assemble the Content
• The pilot involved gathering content of the selected types from SharePoint sites and shared drives. Target content for approximately 250 projects was assembled.
• The body of content included emails with attachments that matched one of the selected content types. These attachments were extracted from the emails.
• Data for each active construction project was assembled, including universal project code (UPC), which is the agency's "cradle-to-grave" project identifier; contract ID; work type; cost; district; route; etc.

Step 3. Prepare the Content

Identify and Analyze Stakeholder Needs
• Several needs were identified based on the interviews, including:
  - Provide all records or certain types of records for a particular project (e.g., to respond to a public information request or claim).
  - Find projects that have installed a particular make and manufacture of an item.
  - Find projects that have used a particular construction technique (e.g., cold-in-place recycling).
  - Identify systemic issues that contribute to construction projects not meeting goals for on-time, on-budget, environmental, and quality scores.
• Some needs expressed would best be addressed through improvements to structured databases and query tools. These needs were not addressed, because the pilot was focused on improving search and exploration of unstructured content. Some needs expressed were general or complex (e.g., "What projects have involved use of innovative materials?"). Further specificity was required to understand whether improved search capabilities could meet the need.

Interview Stakeholders
• Stakeholder interviews focused on (1) understanding current information sources and management practices and (2) identifying findability needs and concerns.
Step 4. Develop the Solution

Identify the Search Facets
• Based on the search needs and the content analysis, the following search criteria were identified: content type, contract award amount (ranges), contractor name, district, type of equipment, jurisdiction name, manufacturer/supplier name, material type, pay item code, project ID/name, highway system category, route ID, type of work, work issue, and work order category.
• For each of these items, source(s) for lists of possible values were identified. In some cases, these were based on the categories included in the project data set (e.g., the list of Virginia DOT districts). In other cases, values were based on text mining of content (e.g., to produce a list of manufacturers and suppliers).

Populate the Semantic Model
• The text analytics tool selected for the pilot (Smartlogic) powers faceted navigation and search based on a semantic model (ontology). This model was populated based on the search facets and lists of values.
• Building the ontology involved specifying relationships across different terms (e.g., "Bristol District" is a "District"; "Bland County" is in "Bristol District"; etc.).
• The ontology also included synonym lists, including, for example, multiple types of project identifiers that referred to the same project, and pay item numbers and names.

Develop Auto-Categorization Rules
• Rules were developed to assign metadata or tags to projects for the different search facets. Some of these rules were fairly straightforward and driven by the ontology. Others required more in-depth analysis and iterative development (e.g., identifying work orders that were related to "drainage issues").
• Assignment of a project number was based on rules that looked for UPCs, project numbers, and contract numbers (using different formatting conventions).
Once the project number was found, it was used as the lookup value to tag the document with project attributes from the project data set (e.g., district, routes, contract award amount, type of work, and road system).

Clean up the Content
• Non-searchable PDFs were converted using optical character recognition (OCR) software, with varying results based on original scan quality.
• Irrelevant and obviously duplicative content was eliminated from the body of content to be searched.
• Excessively long files were removed.
• The project data set was cleaned to reduce inconsistencies, fill in missing data, and normalize identifiers and classifiers.

Analyze the Content
• The body of content was analyzed to (1) inform development of rules for auto-classifying documents and (2) identify issues that would impact the effectiveness of the search capability to be developed.
• Issues included inconsistent and non-informative file naming conventions, non-searchable files (scanned PDFs), variations in the work order and daily construction report formats over time and across project offices, presence of duplicate documents, and existence of very long documents consisting of compilations of individual daily work reports.
Build the Interface
• A faceted search interface was built in a SharePoint environment using the Microsoft FAST search engine; however, the text analytics software used for the pilot can be configured for use with other platforms and search engines.
• The interface allowed users to explore the content and refine search results by filtering through selected criteria.
• The interface included an auto-suggest feature that would provide users with matching terms from the ontology as they started typing.

Step 5. Evaluate the Solution

Set up a Test Environment
• The goal of the evaluation was to compare the solution that was developed to the search capability that could be provided by an "out-of-the-box" full text search capability without faceted navigation or auto-classification.
• To accomplish this goal, a "plain vanilla" search environment was set up, pointed to the same body of content.

Define Metrics
• The testing process focused on measurement of recall and precision. Recall is typically defined as the fraction of all relevant items that were returned from a search. Precision is the fraction of items returned from a search that are relevant to the user's search query.
• In practice, it is time-consuming to test for relevance because (1) each document must be manually reviewed and (2) clear protocols for determining relevance must be established, given that each person may have a different definition of what relevance means. The metrics and test cases used struck a balance between the need to obtain meaningful test results and the time required to conduct the tests.

Define Test Cases
• A set of specific test cases was defined to provide coverage of different content types and search criteria.
• For each test case, steps were defined for conducting a search in both test environments.

Perform the Tests
• For recall, tests measured the percentage of known relevant documents in the top 30 results.
⢠For precision, tests measured (1) the percentage of relevant documents in the top 20 results and (2) the number of documents needed to ï¬nd 10 relevant results. ⢠Tables of results were compiled for each test case. Summarize Findings ⢠Test results were analyzed to highlight test cases in which the developed solution provided a signiï¬cant advantage over the vanilla search. In some cases where results appeared to be similar, further enhancements to the rules were identiï¬ed for potential future implementation. ⢠Limitations of the test process were recognized (e.g., improvements to the convenience of the search experience provided by the faceted search and the auto- suggest were not tested and would require a more extensive process involving actual target users and subjective evaluation measures). Test and Reï¬ne the Solution ⢠The development process involved multiple cycles of testing and reï¬nement. Sample searches were conducted to identify situations in which search results varied from expectations. The samples were used to reï¬ne the auto-categorization rules.
The estimated total effort was about 700 hours across all team members. An approximate breakdown of this effort across the major pilot activities appears below. The experience of the pilot team provides a benchmark for planning similar future efforts. Note that the estimated hours only include design and development of the solution. Additional time was required for documentation, team coordination, and evaluation (assessment of precision and recall). Technical activities to establish the test environment and run the indexing processes to apply the auto-categorization rules were performed by the text analytics vendor and are not included in the estimate. Specific design and development activities included:
• Needs analysis (24–40 hours). This task included stakeholder identification, interviews, synthesis, and identification of information-seeking scenarios to address.
• Content assembly (160 hours). The pilot involved harvesting content from various locations to develop a stand-alone external test platform, and converting content to text-readable format. The content harvesting portion of this step would not be necessary in the more typical situation, in which the solution involves searching and indexing content in its native repositories.
• Semantic model development (120 hours). This task involved analysis and text mining of the content and incorporation of other sources.
• Rule development for auto-categorization (160 hours). This activity involved the initial specification of rules for auto-classifying content with terms in the semantic model. The majority of this time was spent on finding and refining the keywords to use in work issue categorization rules.
• Solution testing and refinement (200–240 hours). The research team also engaged in iterative testing and refinement of auto-categorization rules.
Following the pilot, the Virginia DOT expressed an interest in better understanding what the cost of software licensing would be for the text analytics product. The agency completed a questionnaire provided by the vendor in order to obtain a pricing estimate. The following pricing factors were included on the questionnaire, with the Virginia DOT's responses:
• Volume of content to be indexed (2,000 gigabytes [GB], 4 million documents).
• Anticipated growth in content over the next few years (500 GB per year).
• Type of categorization to be performed (taxonomy/ontology-based vocabulary; options not selected included entity extraction and personally identifiable information [PII]).
• Current crawl frequency and duration (incremental crawl every 15 minutes, 10-minute duration; full crawl once a week, 13-hour duration).
• Number of queries per day (6,000).
• Number of users accessing (7,200).
• Number of ontology editor users (1–3).
• Integrations (FAST search engine, SharePoint).
• Plan to tag content outside of SharePoint (Yes).
• Ontology widgets on search results page (Best Bets, related topics, search facets/filters/refiners).

The software cost estimate was between $200,000 and $300,000, which included a 20% maintenance fee for 12 × 5 (12 hours × 5 days) technical support for the initial year. The estimate did not include hardware, installation, configuration, or training costs.

4.5 Transferability and Scalability of the Pilot

Transferability

Based on obtaining samples of content from the Washington State DOT, the research team concluded that most of the ontology, auto-categorization rules, and faceted search design developed for the Virginia DOT pilot could be easily adapted for other DOTs. The content type categorization rules would require minor adjustments to reflect different titles, and the agency-specific data
for projects, districts, highway system categories, and so forth would need to be replaced with equivalent data. Interestingly, two of the three rule sets for identifying types of work issues (i.e., drainage and weather-related issues) worked well with no modifications on Washington State DOT content. The third rule set (utility issues) would need further refinement, possibly because of the need to use specific utility company names in this rule.

Another transferability consideration relates to software platforms for solution development. Based on a review of websites and conference presentations, the research team estimated that at least one-half of state DOTs use SharePoint; however, it should not be assumed that all DOTs would want to base their findability solutions within this platform. Further, a variety of enterprise search and text analytics tools are available. NCHRP Research Report 846, Volume I, Appendix E presents a partial list of commercially available tools, and these tools are evolving rapidly.

The pilot developed for this project was constructed for demonstration only. Agencies wishing to replicate the solution developed in the pilot would need to re-create the steps taken to fit within their own software environments. However, much of the effort for the pilot was spent on design and development of the semantic structures and the auto-categorization rules rather than on implementing them within the text analytics tool, and the effort involved to set up the faceted search in SharePoint was not extensive.

Scalability

The pilot implementation, which covered only a fraction of DOT content (specifically, construction daily work reports, change orders, source of materials forms, and project profile forms), required an intensive effort.
The question of scalability is therefore important to consider; namely, what would it take to implement this type of faceted search solution for a larger portion of the DOT's information resources? Although the research team did not estimate what percentage of the total content the pilot represented, the researchers would not expect a linear relationship to exist between the number of content types included and the number of hours necessary to implement a search capability. In fact, with marginal additional effort, the ontology and rule base developed for the pilot could be adapted for additional content types.

It would be particularly straightforward to extend the pilot framework for other types of text-based construction project-related content. Given that many of the facets relate to construction projects already, the main work that would be required would be to develop rules for auto-categorizing the new content types. In most cases, these rules would be fairly straightforward to develop, particularly when standard titles or text blocks identify the content types.

The effort required to extend the pilot beyond project-related content is highly dependent on the complexity of the required rules. As previously noted, substantial time was spent in the pilot to develop and refine the rules for classifying work issues. This involved time-consuming manual review of multiple documents to develop lists of keywords on which to base rules. Other rules were considerably simpler and quicker to develop, particularly those that leveraged available resources such as pay item code lists, and project data files allowing assignment of district, highway system, cost range, and so forth, based on a project identifier.
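The lookup-driven rules described above (find a project identifier in a document, then tag the document with attributes drawn from the project data set) can be sketched as follows. The identifier pattern and project attributes are invented for illustration and do not reflect the Virginia DOT's actual formatting conventions.

```python
import re

# Hypothetical identifier pattern; a real rule set would reflect the
# agency's actual UPC, project number, and contract number formats.
UPC_PATTERN = re.compile(r"\bUPC[\s#:]*(\d{5,6})\b", re.IGNORECASE)

# Hypothetical project data set keyed by UPC (invented values)
PROJECT_DATA = {
    "104522": {"district": "Bristol", "type_of_work": "Paving",
               "highway_system": "Interstate", "award_range": "$1M-$5M"},
}

def tag_document(text):
    """Find a project identifier in the document text, then use it as a
    lookup value to assign project attributes as metadata tags."""
    match = UPC_PATTERN.search(text)
    if not match:
        return {}
    upc = match.group(1)
    attrs = PROJECT_DATA.get(upc, {})
    return {"upc": upc, **attrs}

tags = tag_document("Daily work report for UPC 104522: paving on I-81 ...")
print(tags["district"])  # Bristol
```

The appeal of this pattern, as the text notes, is that one cheap rule (recognizing the identifier) unlocks many facets at once, because district, highway system, cost range, and so forth all follow from the project data lookup.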
Chapter 5 Conclusions and Future Research Needs

5.1 Conclusions

DOT Business Drivers for Findability

Based on the literature review, information gathering, and pilot activities conducted for NCHRP Project 20-97, a recognized need exists to improve findability of information within transportation agencies. The needs of greatest concern are efficient retrieval of information in response to FOIA requests, PDRs, and legal claims, and ensuring that employees can find current, authoritative versions of agency policies, manuals, guidance, and standards. Agencies also understand the potential benefits of making it quicker and easier to find relevant information, including reduced time spent searching and improved ability to re-use information that has already been created rather than duplicating efforts.

Because search capabilities in DOTs are typically basic and fragmented, employee expectations are low. Employees may not attempt to discover information that would be helpful for their tasks at hand, beyond use of specialized applications that serve their particular job functions. Those employees who do seek additional information rely on asking colleagues, visiting known intranet pages, and conducting external Internet searches. When an external information request comes in, staff may be asked to spend hours sifting through email files, file drives, and databases to develop a response. Considerable opportunity exists for agencies to realize employee time savings from findability improvements, though these time savings will be spread across the organization and, as a result, difficult to track.

Importance of Understanding Needs

The findability of information in a DOT is not a simple problem with a single solution. Multiple types of information needs exist, and a multi-faceted approach is needed to address most types of needs. The guide presented as Volume I of this research report emphasizes understanding user needs as a key step in any findability improvement.
Doing this avoids wasted effort making improvements that do not have any benefit, and provides the foundation for effective design of a solution. Some types of needs will be readily apparent (e.g., improving relevance of intranet searches for guidance information). Other types of needs may be latent in nature, because people may not think to ask for something that could be helpful but is not currently possible. It is essential to identify customers and involve them in the process of developing solutions.

Practices for Improving Findability

In most cases, a combination of information management discipline, effective deployment of enterprise search tools that index content within multiple agency repositories, and design and
implementation of a workable metadata strategy will be required to improve findability. Developing a workable metadata strategy means standardizing on an essential set of metadata elements that (1) are helpful for information search and discovery and (2) can be reliably populated using a combination of manual and automated methods. Development and ongoing improvement and management of terminology resources are integral to the metadata strategy, because terminology provides lists of values for metadata elements. Terminology resources also are needed for building effective search solutions with features such as auto-suggest and query expansion to include synonyms and related terms. The guidance developed for this project can be used by DOTs and other transportation agencies to assess and strengthen each of these elements.

While search within organizations (called "enterprise search") does not perform as well as Internet search, the private sector examples reviewed and the pilot capabilities demonstrated show that search within an organization can, in fact, be improved substantially. Improvements to search within DOTs would likely lead to identification of new applications and benefits. Following the demonstration of pilot capabilities, one DOT employee remarked:

If we could index our structured and unstructured data, it would solve most of our search and findability issues. It would help to structure our information landscape. It would get people thinking, "What else can we do?" . . . The faceted search allows you to group like items together and gives cohesiveness to content. Once all related items are findable, maybe people will take more care in making just the appropriate information available (versioning, duplicate info). I think the ripple effect of having this would be enormous. . . . It would be a game changer (Personal Communication 2016).
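Query expansion with synonyms, one of the search features mentioned above, can be sketched simply. The synonym entries below are illustrative; a production system would draw them from the managed terminology resources (ontology) the text describes.

```python
# Illustrative synonym rings; real entries would come from the agency's
# managed terminology resources (e.g., pay item names, project identifiers).
SYNONYMS = {
    "cold-in-place recycling": ["CIR", "cold in place recycling"],
    "change order": ["work order", "WO"],
}

def expand_query(query):
    """Expand a search query with synonyms and related terms so that
    documents using variant terminology are still retrieved."""
    terms = [query]
    for preferred, variants in SYNONYMS.items():
        if query.lower() == preferred or query.lower() in (v.lower() for v in variants):
            terms = [preferred] + variants
            break
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("CIR"))
# "cold-in-place recycling" OR "CIR" OR "cold in place recycling"
```

The expanded query string here uses a generic OR syntax; the same idea applies whether the underlying engine is FAST, Solr, or another search platform.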
A wide variety of open source and commercial text analytics and search products are available, and they are growing in sophistication. These technologies support development of faceted search capabilities, tuning of result relevancy ranking, and automated assignment of metadata, all of which are essential ingredients for improved search.

Need for an Integrated and Coordinated Management Approach

To develop, deploy, and maintain findability capabilities, agencies must put in place the right set of roles and responsibilities, and acquire or build the necessary types of expertise. At DOTs, putting in place the necessary organizational functions, skills, and disciplines presents more of a challenge than acquiring and implementing supporting tools and technologies for findability. To provide much of the needed expertise, however, DOTs can look to their existing staff resources, including librarians, content managers, documentation engineers, project controls specialists, data managers, records managers, website managers, and other IT professionals. A coordinated approach to improving findability is needed, leveraging available skills. Several DOTs have established data or information governance groups that can provide a focal point for coordinated implementation of improvements.

One aspect of the pilot demonstration implemented for this project involved leveraging available data resources (construction project data) to facilitate search of unstructured documents. Doing this provided the ability to build a search capability that could, for example, auto-complete a project number entered into a search box and find documents that referenced only a contract number that was related to the project number entered. Other key search facets were populated from lists of districts, lists of jurisdictions, and other reference data sets.
This example highlights the importance of an integrated approach to findability for both structured and unstructured data and information resources. The implication is that staff responsible for developing agency business applications, GIS, and business intelligence capabilities should work collaboratively with those involved in developing search capabilities focused on unstructured content. For example, master and reference data management practices are valuable not only to support structured reporting but also for search applications.
5.2 Future Research Needs

Future research would be beneficial in several areas:
• Additional DOT pilots to validate, extend, and facilitate adoption of the findability practices developed in the initial pilot.
• Investigation of machine learning techniques for auto-categorization of content.
• Investigation of techniques for auto-categorization of DOT image files.

Additional DOT Findability Pilots

Additional pilot implementations of the auto-categorization and faceted search techniques developed in the Virginia DOT pilot would provide several benefits. They would:
• Enable a new set of agencies to gain exposure to these techniques, understand their potential benefits, and have a basis for evaluating ongoing staffing and implementation options.
• Provide an opportunity for extension of the resource base (ontology and auto-categorization rules logic) that can be made available to the entire DOT community.
• Allow for development of expanded guidance for DOTs covering findability scenarios beyond those tested at the Virginia DOT (related to construction project information).
• Allow for additional validation of the transferability of the techniques and an improved understanding of the resources needed for extending capabilities beyond construction project information.

A sample set of tasks for DOT pilots is suggested below:
1. Prepare a prospectus detailing pilot objectives and time requirements, solicit agency interest, and select agencies to participate.
2. Meet with agency staff; identify target content and search needs.
3. Refine the ontology developed in NCHRP Project 20-97, working with agency-specific branches. The product would be an updated, expanded version of the ontology developed in the Virginia DOT pilot, reflecting work done in any additional pilots. It would include both generic and agency-specific elements (e.g., each agency's list of regions would be different).
4.
Develop and refine rules for auto-categorization based on each agencyâs content collection. The product would be documentation of an expanded set of rules for automatically tagging content with terms from the ontology (e.g., an expanded set of rules to identify projects). Each agency included in the pilot could use these rules to implement an auto-categorization function using the technology solution of their choice. The documented rule set could also be adapted for use by other agencies. 5. Develop an agency work plan for implementation. Each agency would be provided with a work plan detailing tasks for implementing software and processes for improving findability, applying the ontology and classification rules. This task would involve interviews with agency staff to develop an approach to identify technical implementation solutions compatible with their existing IT environments. The work plan would include special attention to opportuni- ties for implementation of spatial search interfaces and integration with business intelligence capabilities. 6. Produce a summary guidance document for general DOT use providing implementation guidance based on the pilot experience. Provide rules and ontology as separate reference files. Provide example work plans (from Task 5) as a resource for DOTs to use in developing their own work plans. Based on the work plans, general guidance could be included on how to integrate the products of NCHRP Project 20-97 into DOT geospatial and business intel- ligence applications. The tasks outlined above provide a practical approach to validating and refining the products of NCHRP Project 20-97 and providing additional exposure to the techniques within the DOT
community. They do not, however, involve actual implementation or development of the techniques. This approach is suggested given the costs and risks of developing a specific technology solution for each participating agency.

A logical follow-on activity would be to support development of faceted search and auto-categorization capabilities within an agency's production environment. An in-house implementation would provide a real-world example of what is involved in building a search index across repositories and working through access restrictions. This approach would require a greater level of involvement of agency IT staff than was possible in the first pilot. However, agencies might view this as an opportunity for staff to gain experience that could be applied in the future if the agency decided to move forward with production implementation. An actual implementation project might be considered following the initial pilots in order to produce a functioning example capability within a DOT that could be demonstrated to other interested agencies.

Machine Learning Investigation

Techniques involving machine learning for automated categorization of documents are currently used for a wide variety of applications, with an emphasis on legal e-discovery. Machine learning techniques involve manual classification of a set of "seed" documents, followed by use of the manually classified set to derive algorithms for assigning classifications to other documents. The potential advantage of these techniques is that they do not require extensive manual rule development. The disadvantages are that they require a subject matter expert to manually classify the seed set, and that the algorithms that are developed are not transparent. Given the interest in applying machine learning techniques, it would be valuable to test their application for some specific DOT use cases.
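The seed-document workflow described above can be illustrated with a deliberately simple sketch. Production e-discovery tools learn far more sophisticated models; here a nearest-neighbor comparison of word overlap stands in for the derived algorithm, and the seed documents and labels are invented for illustration.

```python
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    # Simple overlap count between two bag-of-words vectors.
    return sum((a & b).values())

def classify(doc, seeds):
    """Assign the label of the most similar manually classified seed document."""
    vec = vectorize(doc)
    best = max(seeds, key=lambda s: similarity(vec, vectorize(s[0])))
    return best[1]

# Hypothetical seed set labeled by a subject matter expert for one FOIA request.
seeds = [
    ("work order extending contract time due to utility relocation delays", "relevant"),
    ("daily work report noting drainage failure at the culvert", "relevant"),
    ("employee parking policy memorandum for the central office", "not relevant"),
]

print(classify("change order for utility conflict and time extension", seeds))  # → relevant
```

The example also shows the transparency disadvantage noted above in miniature: even this trivial model gives no human-readable rule explaining why a document was labeled relevant.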
Two areas of potential interest would be:

• Speeding response to FOIA requests.
• Assignment of construction work issue categories, extending the work conducted in the NCHRP Project 20-97 pilot.

For the FOIA application, potential tasks would be:

1. Assembling an advisory panel from several DOTs, composed of staff involved in preparing responses to FOIA (or state public disclosure) requests.
2. Collecting examples from the advisory panel members of common requests that are time-consuming to fulfill.
3. Selecting one or more example requests from advisory panel members who are willing to provide content associated with the request for the analysis.
4. For each request, assembling a body of content, including the items that were provided in response to the FOIA request along with other content that was not relevant to the request.
5. Utilizing a text analytics tool to analyze a selected set of seed content and develop algorithms for selecting new applicable content.
6. Testing the algorithms developed on a mixture of additional content (beyond the seed set), including both relevant and irrelevant items.
7. Comparing and analyzing results across the different FOIA requests.
8. Writing a summary report with conclusions about the level of effort required and the accuracy of results.

A second test of machine learning would build on the work conducted in the Virginia DOT pilot to auto-classify construction change orders and daily work reports for work issues. It would utilize a seed set of documents classified with one or more work issues (e.g., drainage, weather,
utilities) to train the analytics software. Then, the software would be used to auto-classify another set of documents, potentially from a different agency. The result could be used to compare the level of effort and outcomes for application of the rule-based and machine learning methods of classification.

Findability of Image Files

NCHRP Project 20-97 emphasized findability of text-based content. Given that DOTs have large (and growing) collections of image files (both photographs and video images), techniques are needed for improving findability of these images. A large body of research on multimedia information retrieval is available, covering a variety of techniques from facial recognition to machine learning techniques utilizing social tagging. Potential tasks in this research topic could include:

• A critical review of existing techniques and their applicability to DOTs.
• Documentation of case study examples of organizations that have implemented advanced techniques for auto-classification and search of images.
• Identification of available open source and commercial products that provide auto-tagging capabilities for images.
• Development of guidance for extending the findability framework established in NCHRP Project 20-97 for inclusion of image content types.
References

Boiko, B., and E. M. Hartman (Eds.) (2010). TIMAF Information Management Best Practices, Vol. 1. Utrecht, The Netherlands: Erik Hartman Communicatie.
Cleverley, P. H. (2015). The best of both worlds: Highlighting the synergies of combining manual and automatic knowledge organization methods to improve information search and discovery. Knowledge Organization, 42(6), pp. 428-444.
NCHRP Project 20-109. (n.d.). "Enhancement of the Transportation Research Thesaurus." Project description retrieved July 25, 2016, from: http://apps.trb.org/cmsfeed/TRBNetProjectDisplay.asp?ProjectID=4061.
Personal communication. Washington State Department of Transportation focus group participant (March 11, 2016).
Virginia Department of Transportation. (2016). Construction Dashboard Project. Project details retrieved January 4, 2016, from: http://dashboard.virginiadot.org/Pages/Projects/ConstructionOriginal.aspx.
Appendix: Pilot Findability Report

This appendix documents implementation of a findability pilot for the Virginia DOT and an analysis of the transferability of pilot results with the Washington State DOT. Three annexes providing additional details are included at the end of the appendix:

• Annex 1: Pilot Classification Rule Descriptions
• Annex 2: Example Scenarios Using Faceted Search Design
• Annex 3: Pilot Evaluation Metrics and Description
A.1: Pilot Overview

The research team conducted pilot activities at the Virginia DOT (VDOT) in order to demonstrate an application of the findability framework and potential benefits from findability improvements. The pilot demonstrated and validated the concepts and methods proposed in the guidance to improve findability, assessed the effort required in making these improvements, and evaluated the transferability to other DOTs. The pilot also provided an opportunity to document a case study for inclusion in the guidance, creating resources that other DOTs can build upon, such as a faceted search navigation design and a set of common terms to use in document classification rules.

As described in the pilot proposal, the research team first identified candidate agencies and potential pilot project scopes at each agency. After identifying these project scopes, the research team selected VDOT as the "primary" agency for the pilot, with a focus on construction project document findability. VDOT staff identified construction documents as a priority findability issue, and the topic was tractable in the context of a brief pilot effort.

Meeting the research objectives required a fairly intensive effort with the primary agency, VDOT, for development and testing of a solution to improve management of active construction project documents. A "secondary" agency, the Washington State DOT (WSDOT), provided a set of sample content resources to allow the research team to consider variations across the two agencies and the transferability of the pilot solutions.

This pilot project consisted of a number of activities, grouped into four areas: assessing findability needs, collecting content, developing a solution, and testing and evaluating that solution. The process used to structure pilot project activities is displayed in Figure II-A-1. Each of these activities is further detailed in the remainder of the document.
Pilot Objectives

• Demonstrate and validate concepts and methods to improve findability
• Assess effort required and transferability to other DOTs
• Document a case study for inclusion in the guidance
Source: Adapted from figure in internal draft document from Kansas City DOT (2005).
Figure II-A-1. Pilot activity process. [Figure shows four pilot activity areas with sub-steps: 1. Assessment (A. Stakeholder Identification; B. Interviews; C. Findability Needs); 2. Content Collection (A. Content Type Selection; B. Content Harvesting, Analysis, and Conversion; C. Project Data and Profiles); 3. Solution Development (A. Semantic Model Development; B. Rule Development and Refinement; C. Faceted Search Design); 4. Test and Evaluation (A. Rule-based vs. "Vanilla" FAST Search; B. Testing and Subjective Evaluation; C. Transferability Analysis).]

A.2: Assessment

Stakeholder Identification

The research team identified three main types of stakeholders: Knowledge Management and Library staff, Central Construction Office staff, and District Construction staff. Individuals in these groups have responsibility for improving construction information management and search capabilities, and routinely search for construction-related information.

Knowledge Management and Library employees within transportation agencies are stakeholders because this research relates to the findability and organizational structure of information within the agency. At VDOT, Knowledge Management and Library employees work heavily with the content management and collaboration portal platform used in the pilot analysis, and are knowledgeable both in its use and in its content. The Knowledge Management Office identified a pilot goal of improving the ability to find needed project information through a search of this portal.

The research team identified Central Construction Office staff as a second group of stakeholders interested in improving findability of project information, particularly related to project costs and schedule. For example, the Central Construction Office could use patterns in unstructured text (e.g., in construction daily work reports) to identify issues leading to project cost overruns or time extensions.

Finally, the research team identified District Construction staff as the third group of stakeholders for this pilot. This group includes the District Construction Manager, Area Construction Engineer, Contracts Manager, Project Controls Engineer, and Technology Resource Manager. While the Central
Construction Office staff wanted to find patterns across projects, District Construction staff primarily focused on locating information about specific projects.

Interviews

A 3-day site visit to VDOT was conducted May 4-6, 2015, during which members of the research team met with Knowledge Management staff, construction staff from three districts, representatives of the central office Construction Division, and Information Technology staff responsible for the agency's content management/collaboration platform implementation, including search capability configuration. Following these initial meetings, research team members followed up with VDOT Library staff and with individuals holding statewide responsibilities for construction scheduling and materials. The research team also worked with Information Technology staff to discuss the technical approach to the pilot.

In discussions with VDOT Information Technology staff, it became clear that VDOT would have difficulty hosting the actual pilot demonstration on agency servers. There were several reasons for this, including availability of suitable hardware, initiation of activities to migrate to an updated software version, security challenges associated with demonstrating a federated search approach, and logistical difficulties in providing external consultants with direct access to agency servers and databases within the timeframe of the pilot. As a result, the research team initiated discussions with a vendor of search and text analytics software, who agreed to host the pilot (in "the cloud") and provide access to the necessary software. This allowed the research team to demonstrate new kinds of search capabilities in an environment that it could easily control.

Also through the interviews, the research team identified document content types used at VDOT, as listed in Figure II-A-2.
The research team's efforts focused on three content types: daily work reports/inspector diaries, work orders, and Source of Materials forms. The research team also included project profiles in the pilot, which are publicly available from the VDOT website. Descriptions of each of these content types are included in the "Content Type Selection" section.

• Correspondence
• Meeting Minutes
• Contracts
• Work Orders (C-10)
• Daily Work Reports/Inspector Diaries (C-84)
• Material Documentation (C-85)
• Source of Materials Forms (C-25)
• Subletting Request (C-31)
• EEO Reports (C-64)
• Estimates (C-79)
• Starting and Completion (C-5)
• Vouchers
• Price Adjustments
• Blast Reports
• Environmental Compliance Reports
• Contractor Inspection Reports
• Certified Payroll
• Design Field Changes
• Job Mix Designs
• Materials Test Results
• Insurance Certificates
• Tracking Logs
• Notice of Intent
• Claims

Source: VDOT; list adapted from figure in internal draft document from Kansas City DOT (2005).
Figure II-A-2. Content types.

The interviewees also identified a number of content management methods at VDOT that would affect the pilot.
This included the following findings (more detail on several of these findings is included in the "Content Harvesting, Analysis, and Conversion" section):

• A number of documents include scanned images that are not text searchable.
• The documents demonstrate a lack of file naming conventions. Often, file names are context dependent (e.g., "estimate #1") or not meaningful (e.g., "SCAN001").
• "Standard" forms include a number of variations, reflecting changes over time, across districts, and across projects.
• Folder hierarchies that store the documents contain variations in folder structure.
• Document storage includes extensive use of email and email attachments.
• There is duplication of content across the SharePoint drive, including within project folders.
• Documents contain minimal metadata. Multiple VDOT interviewees expressed an interest in using some degree of automation to add metadata to documents. This idea is described in more detail in the "Semantic Model Development" section.

Interviews also provided the research team with a more complete understanding of current searches and pain points for each of the identified stakeholders. Interviewees noted that a major pain point is responding to Freedom of Information Act (FOIA) requests, audits, Notices of Intent (NOIs), and claims; these searches require significant effort to locate documents. Additionally, the common practices of making multiple copies of a document and distributing documents via email attachments create problems with locating the most recent or "authoritative" version of a document, and make the application of retention schedules difficult. Interviewees also noted the risk of document loss when an employee leaves a position and leaves content on personal drives not accessible to other staff.
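The duplicate-content finding can be checked mechanically. A minimal sketch follows; the folder layout, file names, and use of content hashing are illustrative assumptions, not what the pilot actually implemented.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group files under root by a hash of their bytes; return groups with more than one copy."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# Demonstration: two byte-identical copies of the same work order stored
# under different project folders with inconsistent names.
root = Path(tempfile.mkdtemp())
(root / "projA").mkdir()
(root / "projB").mkdir()
(root / "projA" / "Work Order 2.txt").write_text("same content")
(root / "projB" / "WO 2.txt").write_text("same content")
(root / "projA" / "estimate 1.txt").write_text("different content")

dups = find_duplicates(root)
print(len(dups))  # → 1
```

Byte hashing only catches exact copies; near-duplicates (e.g., a re-scanned PDF of the same form) would need fuzzier comparison.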
Interviewees noted that the new Construction Document Management System is intended to address many of these areas but will not address management of historical content.

Following the initial set of interviews, the research team developed several potential user scenarios on which to focus the pilot, and began investigating these scenarios. The information garnered from the interviews served as the foundation for the pilot design, both in defining the existing information infrastructure and methods, and in understanding the search needs that the pilot search tool should address.

Findability Needs

Through the interviews with VDOT staff, the research team identified business questions and associated information search needs. Table II-A-1 summarizes these questions and search needs. The Comments column of the table includes the research team's assessment of whether each business question was an appropriate candidate for inclusion in the pilot. The business needs that the research team chose to guide the development of specific search scenarios for the pilot are marked with asterisks.
Table II-A-1. Summary of VDOT Business Questions and Search Needs

1. Where was work actually done (for projects that do not have a single route-from-to location)?
Comments: Daily work report descriptions do not generally include location, and stationing information in the item block of the form is typically blank.

2.* Where have we used cold-in-place pavement recycling?
Comments: Can search for variants of "cold-in-place recycling," "CIP recycling," etc. within daily work reports and classify projects based on results.(1)

3. What were the success factors and lessons learned from projects of a particular type (e.g., design-build, accelerated bridge construction)?
Comments: These would be best addressed through interviews; search capability would not add substantial value. Design-build contracts can be identified via existing structured data (contract type), and use of accelerated bridge construction is sufficiently specialized that the state bridge engineer would be aware of these projects.

4. What projects have involved innovative use of materials, and what was done?
Comments: Would need more specificity (e.g., a list of specific techniques) to investigate potential.

5.* Provide all records (or certain types of records) for a particular project (in response to a FOIA request, audit request, NOI, claim investigation, or Construction Quality Inspection Program check).
Comments: Pilot demonstrates the ability to retrieve multiple document types based on one of several IDs, leveraging a crosswalk for project identifiers.

6. Locate a Right-of-Way agreement or a set of correspondence for a particular project.
Comments: Similar to above; rules could be used to partially compensate for inconsistent tagging of documents by project number, but the biggest barrier to findability relates to information management practice (i.e., storing documents in a searchable location).

7.* Find recent projects that installed a particular make or manufacturer of an item (e.g., Trinity guardrail GR-9), based on construction item and source of material.
Comments: Pilot allows a search to use a combination of the Material facet and Supplier facet to search the Source of Materials forms.

8.* Find construction documents based on one or a combination of: tax map parcel, project number/universal project code (UPC), project type (paving, bridge, etc.), fixed completion date (for active projects), construction document type, district, county, route, owner, contractor, subcontractor, cost range, types of material, item code/category, responsible charge engineer.
Comments: Pilot used this business need as input for design of a more comprehensive search tool for projects, and can demonstrate some of these search items.

9. Find projects that have used a particular type of asphalt binder within a specified date range.
Comments: This could be accomplished through a search of pay items in SiteManager.

10. Find all documents associated with a given project that reference "KARST" (a geological condition characterized by sinkholes) as part of an investigation of when this condition was discovered.
Comments: A simple text search would meet this need; no complex rules required. A more complete set of project records (including emails) would be required to meet this need. Not representative of a common search need.

11. Find out asset install dates by mining the daily work reports.
Comments: This could potentially be accomplished through a search of the structured data on pay items placed by date in SiteManager.

12.* Identify projects that used a particular pay item or category of pay items (in order to guide selection of pay items to include on a project being designed).
Comments: Pilot demonstrates use of a pay item category facet as a way to drill down to a collection of projects that used one of a related set of items.

13.* Identify systemic issues that contribute to construction projects not meeting goals for on-time, on-budget, environmental, and quality scores. Currently very time-consuming to research.
Comments: Pilot partially addresses this by providing a way to filter project documents by whether an issue with a particular type of reason/cause (e.g., utilities, drainage) was identified on a Daily Work Report or a Work Order.

14. Respond to inquiries from other DOTs on various topics (e.g., use of a particular technique and material for bridge deck overlays as an alternative to asphalt; how to respond to an issue related to longitudinal cracks on bridge decks).
Comments: Pilot partially addresses this through the Material facet, but this need is really a collection of unrelated topical investigations that would require further specificity to assess.

(1) Although the research team intended to incorporate this search need into the pilot, there was not sufficient content available to test or implement this scenario.
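Business question 2 illustrates the rule-based classification approach used in the pilot: tag a project with a term from the ontology when any of its daily work reports mention a known variant of that term. A minimal sketch follows; the variant patterns, project IDs, and report text are invented for illustration (the pilot's actual rules were built in a commercial text analytics tool), with the UPC "15786" reused from the identifier example later in this appendix.

```python
import re

# Hypothetical rule: match known variants of "cold-in-place recycling".
CIP_RULE = re.compile(
    r"cold[- ]in[- ]place(?: pavement)? recycling|\bCIP recycling\b",
    re.IGNORECASE,
)

def projects_using_cip(dwrs_by_project):
    """Return project IDs where any daily work report text matches the rule."""
    return sorted(pid for pid, texts in dwrs_by_project.items()
                  if any(CIP_RULE.search(t) for t in texts))

dwrs = {
    "UPC 15786": ["Crews continued cold-in-place recycling on the mainline."],
    "UPC 20001": ["Placed aggregate base; no paving activity."],
    "UPC 20417": ["CIP recycling train operated between stations."],
}

print(projects_using_cip(dwrs))  # → ['UPC 15786', 'UPC 20417']
```

Unlike the machine learning approach discussed in Chapter 5, a rule like this is fully transparent: the variant list can be reviewed and extended by a subject matter expert.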
The selected information access needs can be classified into four categories, which serve to organize the pilot evaluation test cases described in the "Testing and Subjective Evaluation" section:

1. Find a Single Known Document for a Project (e.g., an estimate) Using a Variety of Search Criteria
2. Find/Review All Documents for a Project (e.g., for a FOIA Request)
3. Search Across Projects: Find Projects with a Given Item, Material, or Construction Technique
4. Research Reasons for Delays and Changes

A.3: Content Collection

Content Type Selection

After analyzing the content types listed in Figure II-A-2, the research team selected three types of content: daily work reports, work orders, and Source of Materials forms. These were selected based on their likely business value and relevance to the business questions in Table II-A-1. An emphasis was placed on content with valuable unstructured (text) information that could not easily be discovered via existing database applications or query tools.

Daily work reports (VDOT Form C-84), also known as Inspector's Daily Reports, are forms completed by construction inspectors. They are used to record pay item quantities placed and equipment used, and provide a narrative of activities and conditions on the construction site. Daily work reports were selected because they contain substantial blocks of free-form text that could be mined to derive useful information about construction projects.

Work orders, also known as change orders (VDOT Form C-10), are used to authorize a change in contract scope, schedule, or budget. These contain text descriptions of the location and type of work included in the change, and the justification or reasons for requesting the change. Like daily work reports, work orders were selected because they contain substantial blocks of free-form text that could potentially be mined to derive useful information about construction projects.
Both daily work reports (DWRs) and work orders (WOs) can be created within VDOT's SiteManager application. However, SiteManager includes only a rudimentary search capability and is not accessible to most users. In addition, content analysis revealed many examples of DWRs and WOs that appeared to have been created outside of SiteManager, as well as many examples of PDFs and HTML reports created from SiteManager but stored independently on file drives and team collaboration sites.

Source of Materials forms (VDOT Form C-25) are completed by contractors, detailing the intended manufacturer or supplier for each type of material to be utilized for a construction project. Materials Division employees receive this form and complete the required method for testing each material. The form is used by the construction inspector to verify that the sources have been approved and that appropriate testing takes place. Source of Materials forms were selected to illustrate a search capability for a combination of material type and vendor: information that is fundamentally of a structured nature (i.e., pay items and vendor/supplier names) but not currently collected via a structured database. This capability would have been useful for the investigations related to the recent issues with the Trinity guardrail.
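The material-plus-vendor search just described is an instance of faceted filtering: each document carries facet tags (here assigned by classification rules), and a search combines facet selections to narrow results. A minimal sketch follows; the document records and facet values are invented, echoing the Trinity guardrail example.

```python
# Illustrative document index: each record carries facet tags.
docs = [
    {"id": "C25-001", "doctype": "Source of Materials", "material": "Guardrail GR-9", "supplier": "Trinity"},
    {"id": "C25-002", "doctype": "Source of Materials", "material": "Asphalt Binder", "supplier": "Acme Paving"},
    {"id": "C84-101", "doctype": "Daily Work Report", "material": None, "supplier": None},
]

def facet_search(docs, **facets):
    """Return documents matching every selected facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in facets.items())]

# Combine the Material facet and Supplier facet, as in business question 7.
hits = facet_search(docs, material="Guardrail GR-9", supplier="Trinity")
print([d["id"] for d in hits])  # → ['C25-001']
```

A production facet interface would also report counts per facet value so a user can drill down progressively, but the filtering logic is the same intersection of selections shown here.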
In addition to the three content types above, the research team elected to include a fourth type that would provide a user with general information about a construction project. Project profiles for each construction project were created utilizing information that was publicly available for download from the "Project Delivery" section of VDOT's online dashboard. Each profile contains project details, a project summary, contact information, and budget and schedule details. Project profiles were selected to provide a document that would allow users searching for a project to quickly access information about that project.

Content Harvesting, Analysis, and Conversion

In general, content collection was time-consuming because document naming and storage locations are not standardized across districts. To collect daily work reports, work orders, and Source of Materials forms, the research team conducted a series of searches for key words or identifiers that appeared within forms (e.g., "Form C-25"). Documents meeting the search criteria were downloaded, and were found in a variety of formats (including PDF, MHT, Word, Excel, RTF, and MSG). Some of these documents were produced from systems (e.g., AASHTOWare SiteManager), while others were stand-alone forms and related correspondence. The collection, analysis, and conversion process for each type of document is described in more detail below.

Roughly 3,000 daily work reports, 1,000 Source of Materials forms, 1,000 work orders, and 6,000 project profiles were collected. From this, the research team limited the content to approximately 2,000 daily work reports, 1,000 Source of Materials forms, 1,000 work orders, and 2,000 project profiles.
This accomplished two objectives: it improved the computing performance of the search function by limiting the total content volume, and it avoided skewing search results toward a particular document type (e.g., including all 6,000+ project profiles would have accounted for over half of the pilot content).

Document Harvesting

The research team obtained a collection of content through a combination of methods: direct provision of files by districts (from their shared drives and/or team sites), searches for particular document types on VDOT's content management/collaboration platform, and direct downloads from this platform of entire project folders. As noted during the interviews, the folder structure varied from project to project. Although content is often contained in subfolders multiple levels down in the project folder hierarchy (and not in a consistent location from project to project), once found, this structure allows for bulk download of all of the work orders and/or daily work reports for a project.

The research team downloaded and stored these documents in folders named with the project number, and did not alter the original file names (i.e., the research team gave each content type-project combination a folder and maintained the original filenames for all content within the folder). This improved the speed of document collection and increased the number of documents collected, as not all documents appeared in the initial searches. This approach was not used for Source of Materials forms because the initial searches resulted in a suitably high volume of content.

VDOT's Knowledge Management Office facilitated access to a body of content that was stored on a contractor content management/collaboration site for a design-build megaproject. This document collection process followed a similar search pattern as the initial collection process, although
with fewer issues related to naming conventions, which were more standardized given that the content was for a single project. There were some variations between the megaproject and the internal VDOT content. For example, Form C-10 was called a "Change Order" rather than a "Work Order."

Content Analysis and Conversion

To identify how many different projects the collected body of content represented, the research team created a spreadsheet of individual projects through an examination of individual daily work report and work order files. VDOT construction projects are identified by a state project number (e.g., "0023-101-102,C501"), a sequentially assigned UPC (a cradle-to-grave project identifier, e.g., "15786"), and a contract ID (for the main construction contract). Files collected typically contained at least one of these identifiers. The team downloaded information from VDOT's dashboard that includes each of the three identifiers (as described in the "Project Data and Profiles" section), and matched the project identifiers in the existing files to this master file. Based on this exercise, the assembled collection of content collectively represents approximately 250 different projects.

Analysis of the body of content led to the following observations. Each of these issues represents challenges likely to be faced in other organizations seeking to implement improved search capabilities across existing repositories.

Naming Conventions. As noted in the "Content Harvesting, Analysis, and Conversion" section, different projects maintain different naming conventions. At times, naming conventions also differ within projects (e.g., a single project could have filenames of "Work Order 2" and "WO 3," include a date at the end of some file names, and include the project number at varying levels of specificity within the file name).
File names are also often context dependent (e.g., the "Work Order 2" document noted above depends on the project folder in which it appears). These differences and inconsistencies could make it difficult for a user to find a specific work order within any content repository.

Text Recognition. Many of the downloaded documents are scanned PDF documents that do not contain searchable text. In this original state, the search tool would not be able to read or index these PDFs. To take advantage of these documents, the research team used optical character recognition (OCR) software to convert them to searchable documents. Due to the number of documents, this required extensive processing time. The resulting document quality was generally good, but varied based on the quality of the original scan. The process of converting PDF images to PDF text files made them readable. However, once converted, different pieces of software may read the OCRed text differently. Specifically, the classification software seems to have read some zeroes as Os, some Bs as 8s, and so on. This is common in OCRed text. The search tool, FAST, appears to have read some of those characters more accurately. These observations are based on testing classification in one tool (where it is possible to see exactly how the software read the characters) and search in another (which returned some searches that the classification software had read differently).

Email Documents. Many of the relevant documents were in email format, with embedded attachments containing work orders, daily work reports, and Source of Materials forms. Since the search tool utilized for the pilot was not configured to search text within attachments, the research
team opened each email and downloaded the relevant attachments. Although Smartlogic could be configured to search both email and attachment text, it would only be able to do so for attachments with searchable text. By downloading the relevant attachments, the research team was able to use OCR software to recognize text in the attachments and increase the content base. Emails presented a challenge within the search tool in that content could be repeated multiple times within an email chain (e.g., once in an original email, and again multiple times as part of the replies).

Historical Evolution of "Standard" Forms. Each of the content types included a variety of formats, presumably reflecting changes in practice over time, variations across districts, and variations in contractor-created forms. For example, most daily work reports were entered into VDOT's construction management software (SiteManager), and then either printed and scanned to PDFs or output and saved as HTML or Microsoft Word (.doc) files. Some were completed using a standard form (C-84) as .doc files. Others were completed in Microsoft Word using a custom format (i.e., not using the C-84 form). The collected content contains 14 varieties of daily work report documents (including documents with titles such as "Inspector's Daily Report," "PM Diary," and "Daily Report of Construction"). Similarly, the collected content contains nine varieties of work orders (including "change orders"). Notably, the work orders often have similar sections, but in different sequences. For example, the slight differences in two versions of the Form C-10 (from 2006 and 2007) include:

• The inclusion of a VDOT-defined "category" field (e.g., "ADD" for additional work not originally planned) in the 2007 version;
• Different language specifying the "Contract ID" field ("Job Des. Or Contract ID.
No." in the 2006 version, "Contract ID No." in the 2007 version);
• Different language specifying the explanation for the proposed work ("Engineer's Explanation of Necessity for Proposed Work" in the 2006 version, "Responsible Charge Engineer's Explanation of Necessity for Proposed Work" in the 2007 version);
• Different specification of the time effect of the work order (referencing "A Time Extension" of a specified number of additional calendar days with a specified new fixed completion date in the 2006 version, compared to the specified contract time limit prior to approval and upon approval of the work order in the 2007 version).

Other differences in the forms, such as the title (e.g., "Work Order" and "Change Order," or "Daily Work Report" and "Daily Report of Construction"), could have more of an impact on findability. As discussed in the "Solution Development" section, understanding these varietal differences is essential to the development of the search capability by enabling search logic to take advantage of patterns in the documents.

Related Documents. In addition to the official work order documents, several other related documents were collected from the same folders as the work orders: the FHWA Conceptual Approval Request letter, and various cover letters and transmittal slips with related information on prices, etc. Official work orders are entered and tracked in SiteManager.

Duplicate Documents. Because of the different approaches used to harvest documents, the content collection process resulted in some duplication.
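One lightweight way to flag the duplication noted above is content hashing, which groups byte-identical files regardless of which harvesting path produced them. This is a minimal sketch with hypothetical file paths; it would not catch near-duplicates such as two separate scans of the same paper form:

```python
import hashlib
from collections import defaultdict

def file_digest(data: bytes) -> str:
    """SHA-256 of the raw bytes; byte-identical files hash identically."""
    return hashlib.sha256(data).hexdigest()

def find_duplicates(files):
    """files: dict of path -> raw bytes.
    Returns groups of paths whose contents are identical."""
    groups = defaultdict(list)
    for path, data in files.items():
        groups[file_digest(data)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Running this over a harvested collection would surface the same work order pulled once from a district share and again from a bulk project-folder download.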
Document Length. Some documents were quite lengthy. For example, PM Diaries often contain a full month (or more) of daily work reports. In these cases, information relevant to a user search may be limited to one to two pages within a 100-page document. Similarly, some work orders contain significant additional information, in which the Form C-10 is only one to two pages out of hundreds of pages in the document (and may be located in the later portion of the document). This can limit the value of the search tool, which can find documents containing relevant information but cannot point the user to the portion of the document that is relevant. This capability could be developed via customization of search or text analytics capabilities but was beyond the scope of the pilot. For the purposes of the pilot, the research team attempted to limit the inclusion of these documents in the final content by selecting documents based on file size (mainly to improve processing speed, but also to prevent these documents from overwhelming search results by matching on many search terms; alternatively, fully built-out rules could similarly prevent these lengthy documents from consistently appearing as top results).

Content Storage Locations. Differences across districts in where files were stored (content management system/collaboration portal or file servers) made the content collection process complex and would similarly complicate development of an enterprise search tool.

Project Data and Profiles

Project List

As noted above, the research team compiled project information from publicly available sources. The research team downloaded a project list available from the VDOT Dashboard website's "Project Delivery" section.2 This website provides information for all construction projects dating back to FY 1999.
The downloaded project list includes the following fields for each project:

• District
• Route
• Road System (e.g., Primary, Secondary, Urban, Rural)
• UPC
• Description
• Contract ID
• Original Specified Completion Date
• Estimated Completion Date
• Current Contract Amount
• Award Amount
• Cost of Work to Date
• Final UA Cost
• Acceptance Date
• Contract Type
• Type of Work (e.g., Bridge Widening, Grade / Drain / Pave)
• Type of Work (Code)
• Type of Work (Group)
• On Time
• On Budget

2 http://dashboard.virginiadot.org/Pages/Projects/ConstructionOriginal.aspx

The description field contains a variety of information, often including the project number and location. The research team harvested this information to create separate fields for project number and location. For example, a description of "2009 PLANT MIX SCHEDULE (South Hill, Mecklenburg County) (PM4A-058-026,N501)" represents the PM4A-058-026,N501 project in South Hill, Mecklenburg County. In extracting the project number for each project, the research team created a project and contract number mapping. As described in the "Search Capability" section, the search tool includes a "smart" search capability based on this information by retrieving selected structured project information from the project list and generating results for documents containing the contract ID, project number, or UPC.

A number of the project list items required significant data cleaning and normalization. For example, the project team:

• Separated the project number and location from the description, as described above
• Normalized route information and added five synonyms for each (RT. 66 and I-66, for example)
• Separated route numbers for some projects that involved more than one route
• Mapped contract award amounts to ranges
• Separated multiple UPC codes into columns so that they could be imported individually and be mapped to each project
• Separated and added road system(s) to each project

Project Profiles

On the same public website as the project list, VDOT provides access to project profiles for each of the projects contained in the project list. Because of the volume of project profile documents available, the research team wrote a script to automate a PDF download of each of the project profiles.
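The description parsing and route-synonym steps above can be sketched as follows, assuming descriptions follow the "title (location) (project number)" shape of the quoted example; real dashboard rows may vary, so unparsed rows are left for manual cleaning:

```python
import re

# Assumed shape: "<title> (<location>) (<project number>)", as in the
# "2009 PLANT MIX SCHEDULE (South Hill, Mecklenburg County) (PM4A-058-026,N501)"
# example quoted in the text.
DESC_RE = re.compile(
    r"^(?P<title>.*?)\s*\((?P<location>[^)]*)\)\s*\((?P<project>[^)]*)\)\s*$")

def split_description(desc):
    """Split a project list description into title, location, and project number."""
    m = DESC_RE.match(desc)
    return m.groupdict() if m else None  # None = leave for manual cleaning

def route_synonyms(num):
    """Illustrative five-synonym expansion for an interstate route, using the
    variants listed elsewhere in this appendix (RTE 66, Route 66, RT 66, ...)."""
    return [f"RTE {num}", f"Route {num}", f"RT {num}",
            f"Interstate {num}", f"I-{num}"]
```

A row that does not match the assumed shape simply returns None rather than guessing, mirroring the manual-cleaning fallback the team would need.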
These project profiles provide search tool users the ability to quickly access information about projects.

A.4: Solution Development

Semantic Model Development

The research team explored the possible use of a variety of semantic resources and used some of these as input into the development efforts. One resource was a set of search logs from InsideVDOT, which the research team mined to include some terms in the semantic model. The research team examined the Transportation Research Thesaurus (TRT) but found it to be of limited value given the highly specific focus of the pilot. The TRT provides broad coverage of transportation concepts at a general level; the search terms for the pilot required more specific terms.
The research team then turned to content from the districts. The research team began by using a number of text analytics tools to mine this content for entities and noun phrases, and by manually analyzing content for concepts and complex relationships. Using the understanding gained from analyzing VDOT's content, the research team designed the architecture of the semantic ontology to respond to VDOT's specific search needs. This architecture governs the facets in the semantic model, how those facets are related to one another, and how they work together to classify content. It is designed to be intuitive to users and responsive to the information-seeking needs that were identified in the interviews. After designing the architecture of the model, the research team created facets, imported data, and created relationships among terms. The semantic model includes a number of project-specific facets available from the master project list, which could allow for additional analysis by VDOT staff. The top-level categories of the ontology are based on the metadata fields, or facets, that would be useful in a search application. Table II-A-2 documents the items included in each facet available for search. Users would be able to filter search results by selecting criteria from this second level of the semantic model (and in further detail by selecting criteria from the third level of the model and beyond, if desired).
Table II-A-2. Semantic Model Facet Included Values and Value Sources

Facet | Source of Values | Included Values
Content Type | Content Analysis | Values limited to: Work Order; Daily Work Report; Source of Materials; Project Profile; Related to Work Order
Contract Award Amount | Project List | Ranges of: Less than $500,000; $500,000-$1,000,000; $1,000,000-$5,000,000; $5,000,000 and above
Contractors | Text Mining of Content | Variety
District | Project List (VDOT Master Data) | Values limited to: Bristol District; Culpeper District; Fredericksburg District; Hampton Roads District; Lynchburg District; Northern Virginia District; Richmond District; Salem District; Staunton District
Equipment | Text Mining of Content | 20 top-level equipment categories, with additional subcategories
Jurisdiction | City and County List; Text Mining of Content | All possible values for Virginia cities and counties
Manufacturers and Suppliers | Text Mining of Content | Variety
Materials | List of Materials; Text Mining of Content | 30 top-level materials categories, with additional subcategories
Pay Items | VDOT Standard Item Code Table3 | Variety
Projects | Project List | Variety
Road System | Project List | Values limited to: Interstate; Primary; Primary (Arterial); Rural; Secondary; Urban; Various
Routes | Project List | Variety
Type of Work | Project List | Values limited to: Box Culvert; Bridge; Bridge Ordinary Maintenance; Bridge Painting; Bridge Repair (& Rehab); Bridge Widening; Demolition; Fence Repair / Replace; GR Replacement / Repair; Grade / Drain / Pave; Jacked Pipe / Pipe Rehab; Maint Replacement; New Roadway; Pavement Marking / Markers; Pavement Repair; Paving / Asphalt; Paving / Concrete; Planting; Sidewalk, Curb & Gutter; Signals; Signing / Sign Overlay; Surface (Overlay & Treatment); Utility; Widen Roadway; Wildflowers
Work Issue | Content Analysis; Adapted from Previous Work (Sun and Meng)4 | Values limited to:5 Drainage Issue; Utilities Issue; Weather Issue
Work Order Categories | Content Analysis; Work Order Category Lists in VDOT Content | Values limited to: ADD (Additional work not originally planned); CHAR (Changes per Section 104.2 (Character of Work)); CONT (Error or omission in contract document); LEG (Local, State or Federal government proposal); MISC (Does not fit into other categories); NBID (Items specified in contract with set unit price, not bid on by contractor); PLAN (Plan error or omission); POL (Changes in VDOT Policy); RENW (Renewing / Extending time limit on a renewable contract); UTIL (Delays caused by utility issues); VALU (Contractor Value Engineering Proposal); VDOT (Late NTP or VDOT caused delay)

3 Virginia DOT, "Standard Item Code Table," available at http://www.virginiadot.org/business/resources/const/itemcodestandard.pdf
4 Sun, Ming, and Xianhai Meng. "Taxonomy for change causes and effects in construction projects," International Journal of Project Management 27 (2009), pp. 560-572.
5 The research team also identified the following issues as candidates for inclusion in a full development: concrete issues, contractor issues, equipment issues, external delay issues, materials issues, paving issues, plan changes, safety work issues, traffic maintenance issues, and value propositions. These candidates are included in the ontology, although the research team did not focus on them for testing.

Figure II-A-3 displays the facets in the top level of the semantic model. Users have the ability to search and/or filter by these facets. Users can also drill down within a facet to view subcategories (e.g., selecting "District" in Figure II-A-3 would then allow the user to select a district from among the list of districts).
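The facet structure summarized in Table II-A-2, together with the named, reciprocal relationships cataloged in Table II-A-3, can be approximated in a few lines of code. This is an illustrative in-memory stand-in, not Smartlogic Ontology Manager; the facet and relationship names follow the tables, while the route value is a hypothetical example:

```python
# Minimal stand-in for the pilot's ontology structure: terms belong to a facet,
# may have hierarchical children, and carry named associative links that are
# created together with their reciprocals.
class Term:
    def __init__(self, label, facet):
        self.label = label
        self.facet = facet      # top-level facet, e.g. "District"
        self.narrower = []      # hierarchical (parent-child) children
        self.related = {}       # named associative links, e.g. "has a roadway"

    def add_narrower(self, child):
        self.narrower.append(child)

    def relate(self, name, other, reciprocal):
        """Create a named associative relationship and its reciprocal."""
        self.related.setdefault(name, []).append(other)
        other.related.setdefault(reciprocal, []).append(self)

bristol = Term("Bristol District", "District")
i81 = Term("I-81", "Routes")  # hypothetical route value for illustration
bristol.relate("has a roadway", i81, "is a roadway of")
```

Because the reciprocal is written at the same time as the forward link, a query can start from either end, which is what lets a user find the counties of a district without knowing any county name first.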
Figure II-A-3. Top-level semantic model categories.

The research team used Smartlogic's Ontology Manager to organize the data into a structure, which defines relationships between terms. Figure II-A-4 provides an example of how the model incorporates these relationships. In the example, the user has selected "Bristol District," which is an element of the "District" facet, displayed above the center circle. Bristol District "has" a number of associated "Routes," which are displayed in the circles to the right of the center circle. Similarly, "Bristol District" "has" a number of "Projects," displayed in the circles below the center circle. Finally, "Bristol District" "is District of" multiple "Jurisdictions," displayed in the circles to the left of the center circle. The search tool interface provides the user with this visual way to view relationships between facets.

Figure II-A-4. Example of semantic model.

The content analysis was also used to help model the relationships between the different facets. One of the advantages of an ontology over a simple taxonomy is this ability to model multiple relationships and types of relationships. Table II-A-3 provides a list of the relationships in the VDOT ontology. Relationships in the VDOT ontology fall into three groups: hierarchical, which creates a parent-child relationship; associative, which links two equal concepts in a non-hierarchical way; and equivalence, which includes synonyms and other terms to be treated as equivalent to the preferred term. Hierarchical relationships are limited to broader and narrower terms, as their name indicates. The research team defined the associative relationships to indicate the function of each term in the relationship. The relationships are reciprocal, just as many relationships in language are. For example, "is a parent" has a reciprocal relationship of "has a child." Similarly, "is a parent" could have a reciprocal
relationship of "is a child of," so that the relationships become:

Tim is a parent of Sarah / Tim has a child.
Tim is a parent of Sarah / Sarah is a child of Tim.

The defined (or named) relationship is a reference for the user and allowed the research team to use those relationships in different ways for classification. For example, in Table II-A-3, districts comprise counties and cities, and counties and cities are in districts. But districts also have relationships with Projects and Routes, and defining those relationships lets the user discover information through search. For example, a user can look for counties that are located in the Lynchburg District without having to know the name of the county first.

Table II-A-3. List of relationships in the VDOT pilot ontology.

Term class | Term subclass | Relationship | Class of terms relationship is with | Relationship type
Content Type | C-10 | | | hierarchical
Content Type | C-10 | has work order | Work Order Category | associative
Content Type | C-10 | is work issue | Work Issues | associative
Content Type | C-25 | | | hierarchical
Content Type | C-84 | | | hierarchical
Content Type | C-84 | has work order | Work Order Category | associative
Content Type | C-84 | is work issue | Work Issues | associative
Content Type | Project profile | | | hierarchical
Content Type | Related to C-10 | | | hierarchical
Content Type | Related to C-10 | related to C-10 | C-10 | associative
Contract Award Amount | | is contract award amount of | Project | associative
District | | has a roadway | Routes | associative
District | | has project | Projects | associative
District | | is District of | County | associative
District | | is District of | City | associative
Jurisdiction | City | | | hierarchical
Jurisdiction | City | is in County | County | associative
Jurisdiction | City | is in District | District | associative
Jurisdiction | County | | | hierarchical
Jurisdiction | County | is in County | County | associative
Jurisdiction | County | is in District | District | associative
Jurisdiction | County | has a roadway | Routes | associative
Pay Items | | has item ID | | equivalence
Projects | | has contract ID | | equivalence
Projects | | has project ID short | | equivalence
Projects | | has UPC | | equivalence
Projects | | involves route | Route | associative
Projects | | is project of | District | associative
Projects | | has road system type | Road System | associative
Projects | | has contract award amount | Contract Award Amount | associative
Projects | | has type of work | Type of Work | associative
Road System | | is road system | Projects | associative
Routes | | has road UF6 | | equivalence
Routes | | is a roadway of | District | associative
Routes | | is part of project | Projects | associative
Type of Work | | is type of work | Projects | associative
Work Issue | | has work issue | C-84 | associative
Work Issue | | has work issue | C-10 | associative
Work Order Category | | | | hierarchical
Work Order Category | | is work order category | C-10 | associative

Rule Development and Refinement

The Smartlogic Semaphore Classification Server tool includes text analytics features that allow for development and application of rules that improve search results over a simple full-text search capability. This proprietary, commercial software served as the basis for the pilot efforts to demonstrate improved findability. Text analytics software can be used to add structure to unstructured content, which can automate assignment of metadata to improve search within the enterprise. The basic elements of text analytics include auto-categorization and entity/fact extraction. Auto-categorization can characterize the subject of content, while entity extraction can pull out key concepts and information from a set of documents (e.g., materials and locations). Both processes start with content analysis, through manual sampling to understand patterns and through text mining software to extract noun phrases.
6 "Road UF" is a named synonym type that translates to "road use for." Giving synonym types a specific name allows the user to identify them more easily in the ontology. More importantly, this also allows the user to set up types of terms as a facet for search and to weight terms differently. For example, identifying contract IDs with a specific term type name lets the user count them as equal to project IDs, and naming route synonyms as road UF lets the user weight them at 0.25 or at any score that helps classification.

Entity extraction aims to collect all significant noun phrases, while categorization is built upon phrases
that are unique to each subject area. The research team used these processes to create rules for tagging content for each element in the semantic model.

The software the research team used for this project goes through these basic steps when classifying a document:

• It looks for vocabulary (terms, term variants, phrases, and patterns) in documents and in metadata related to documents.
• It applies weightings to any vocabulary it finds to build an overall score for each term in the ontology.
• It adjusts weightings based on term frequency, location of the terms, proximity to other terms, the combination of terms found, and the format and layout of the text containing terms.
• If the score exceeds a (configurable) threshold of 0.48, the document is tagged with the classification results.

The process of developing and refining rules to classify documents and automate metadata tagging is discussed in more detail below.

Unstructured and Semi-Structured Text-Based Rule Development

The pilot focused on building rules for "unstructured" text to classify documents. For example, a simple rule to categorize a document's content type as "Work Order" would determine whether the term "work order" appears at a particular location in the document. Such a rule would avoid false positives (i.e., documents that contain the phrase "work order" but are actually some other content type) that might be obtained via a simple search of the term "work order."

Another part of the rule development included the use of synonyms, which cover everything from simple terms that refer to the same item or concept to common misspellings and abbreviations. For example, project numbers, UPCs, and contract IDs each identify a project. The research team added UPCs and contract IDs as synonyms for project numbers so that users could find a project by looking for any of the three.
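The synonym linkage described above can be sketched as a canonicalization lookup. The project number and UPC below are the examples quoted earlier in this appendix; the contract ID value is a hypothetical placeholder, since the report does not quote one:

```python
# Map every known identifier (project number, UPC, contract ID) to one canonical
# project number, so a search on any of the three resolves to the same project.
SYNONYMS = {
    "0023-101-102,C501": "0023-101-102,C501",  # state project number (canonical)
    "15786": "0023-101-102,C501",              # UPC from the text
    "C15786-DB": "0023-101-102,C501",          # contract ID (hypothetical format)
}

def canonical_project(identifier):
    """Resolve any project identifier to its canonical project number."""
    return SYNONYMS.get(identifier.strip())
```

In the pilot this mapping was driven by the master project list rather than a hand-built dictionary, but the lookup behavior is the same.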
Similarly, the rules included pay item numbers as synonyms for pay item names, so searching for a number will return the name of the item and vice versa. In a third example, a user searching for a particular route will find documents containing a number of route identifiers (e.g., "RTE 66," "Route 66," "RT 66," "Interstate 66," and "I-66").

Because the four types of content used in the pilot have more structure than many documents, due to having fairly consistent fields, the research team used that structure to improve the categorization rules. For example, the top section of many work order documents included specific fields containing values for project identifiers and work order categories (the VDOT-specified reason for the work order), among other facets. The research team developed rules to tag these work order categories. In a more complete application, more advanced structure rules could be developed.

In addition to textual variation, rules were specified to distinguish between when a word appears as a category indicator and when it appears elsewhere in the text. For example, the word "Plan" appears throughout typical work orders, so a rule only counts "Plan" (and the other work order categories) if it appears in conjunction with the word "Category:" (which often occurs in the top section, as noted above). The rules were further refined to account for variations in spelling and the placement of designation words in the text. For example, seeing "Category:" followed by "Plan" worked in some
cases, but in others, a considerable (and unpredictable) number of words separated "Category:" and "Plan". Some of the intervening words existed because of poor scan quality and text recognition. To account for this, the rule searches for the word "Plan" only within two paragraphs of the term "Category:". This increased the categorization accuracy from unusable to highly accurate. As a general rule of thumb, 60%-70% accuracy is a minimum acceptable level and 90%+ is usually the goal, although this varies with the type of content and the type of application. Figure II-A-5 illustrates the rule language used for this example rule. In the actual development, the word "PLAN" is replaced by a variable that points to a list of terms, since in some work orders the word is replaced by the description.

Work Issue Classification

The research team also focused significant effort on building rules for auto-categorizing work orders and DWRs based on the types of work issues encountered, as described in free-text blocks within the documents. This required reading through a sample of the collected work orders to develop an understanding of the types of issues that appeared as work order justifications across agency construction projects. Based on this understanding, the research team reviewed previous work on categorizing work issues. The most relevant reference identified was by Sun and Meng, who developed a taxonomy of change causes in construction projects.7 The research team ultimately chose to focus the rule development on identifying weather, drainage, and utilities issues. The challenge in building rules for these issues is in distinguishing when they are problems or when they prompt a work order or plan change. The rules built for the pilot accomplish that in some cases, but in others capture an occurrence rather than a problem. Full rule development beyond a pilot could further refine these rules to the desired extent.
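A plain-Python analogue of this proximity rule (the pilot expressed it in Smartlogic's rule syntax, shown in Figure II-A-5) might look like the following sketch, which treats blank-line-separated blocks as paragraphs:

```python
# Count "Plan" as a work order category only if it appears within two paragraphs
# of the anchor "Category:", mirroring the skip-count behavior described above.
def category_match(text, category="Plan", anchor="Category:", max_skip=2):
    paragraphs = [p.strip() for p in text.split("\n\n")]
    for i, para in enumerate(paragraphs):
        if anchor in para:
            # window = anchor paragraph + up to max_skip skipped + the match
            window = paragraphs[i : i + max_skip + 2]
            if any(category in p for p in window):
                return True
    return False
```

Stray OCR noise between the anchor and the category value then no longer defeats the rule, while a "Plan" mentioned far from any "Category:" field is ignored.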
Figure II-A-5. Sample work order category rule:

<sequence sequencetype="paragraph">
  <paragraph>
    <text data="Category:" />
  </paragraph>
  <skip count="2"/>
  <paragraph>
    <text data="PLAN" />
  </paragraph>
</sequence>

7 Sun, Ming, and Xianhai Meng. "Taxonomy for change causes and effects in construction projects," International Journal of Project Management 27 (2009), pp. 560-572.

Upon selecting an initial set of work issues, the research team conducted a search across the compiled documents for key related terms. For example, to search for weather-related work issues, the research team used terms such as "storm", "rain", "heavy rain", "warmer weather", "winter", "shutdown", "muddy", "snow", "weather event", "hurricane", and "cold temperature". Similarly, to search for utility issues, the research team searched for documents containing related terms such as "obstruction", "utility", "conflict", "gas line", "sewer", "waterline", "electric", "power", and "signal". Based on these searches, the research team selected a subset of work orders that fit into various categories. For each of these documents, the research team also recorded the text that triggered the categorization. These patterns, including sentence structure, word combinations, and the location of key words, served as the foundation for the initial rule development of work issues. The use of text
analytics (including entity extraction) supplemented this analysis to find additional terms and phrases to use in the rules.

For the work issue rule development, two standard section headings (structural elements) signified a location within work orders to search for work issues: "Location and Description of Proposed Work" and "Responsible Charge Engineers Explanation of Necessity for Proposed Work." The research team developed rules to look for work issues only at these two sections, eliminating a great deal of noise and false positives.

The work issue rules use a two-step classification process. First, they classify for content type. Almost all C-10s (whether they are "work orders" or "change orders") have the two "signifying" phrases in them. If those two phrases are present, the software classifies the document as a work order and gives it a "score" of 100, indicating full certainty. Second, the software classifies for work issues. This is challenging because the explanations tend to be short and use non-distinct terms. For example, the word "utility" or "Verizon" may be the only term present that indicates the nature of a utility issue described in the work order. Simply looking for those terms will return many false positives. Meanwhile, searching for longer terms such as "move gas main" produces limited results (e.g., gas main line moves may be planned from the beginning of the project and not require a work order). Using these general terms is the only option available to classify the documents, so the software uses this strategy but gives these documents a low score, indicating a low confidence level in accuracy. In the pilot, the research team used simple lists of these terms to develop the work issue rules (for example, the list in Figure II-A-6 provides phrases used in the rule that auto-categorizes utility issues).
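This two-step logic can be sketched as follows. The score of 100 for a confident content-type match follows the text; the issue score of 20 is a hypothetical stand-in for "low confidence," and the term list is a small excerpt of the utility phrases in Figure II-A-6:

```python
# Step 1: classify content type from the two "signifying" section headings.
# Step 2: classify work issues from general term lists, at low confidence.
SIGNIFYING = (
    "Location and Description of Proposed Work",
    "Responsible Charge Engineers Explanation of Necessity for Proposed Work",
)
UTILITY_TERMS = ("gas line", "waterline relocation", "Verizon", "utilities conflict")

def classify(text):
    tags = {}
    if all(phrase in text for phrase in SIGNIFYING):
        tags["Work Order"] = 100          # both signifying phrases -> full certainty
        if any(term in text for term in UTILITY_TERMS):
            tags["Utilities Issue"] = 20  # hypothetical low-confidence score
    return tags
```

Downstream, the search tool can then treat high-score tags as authoritative metadata while surfacing low-score issue tags as suggestions rather than facts.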
In a full development, the rules would be generalized further based on analysis of word combination patterns. This process was applied to daily work reports using a similar approach, expanding work order issue categorizations to issues or topics addressed in daily work reports.

Figure II-A-6. Utility issue terms: abandoned gas line; adjustment due to utilities; conflict with existing utilities; Dominion Power; existing pipes replaced; existing utilities; existing Verizon; fire hydrant; gas line; gas line in conflict; gas line in the way; gas main; gas main in conflict; install new manholes; new manholes; old gas lines; power company; power lines; relocate gas lines; relocated utilities; relocating utilities; sewer line; sewer main; streetlight poles; streetlight relocation; telecommunication duct; telephone cable; telephone lines; utilities adjustment; utilities conflict; utility delay; utility relocation; utility situation; utility work; Verizon; water service lines; waterline alignment; waterline placement; waterline relocation; waterline system.

Rule Refinement

The development of these categorization rules requires a number of testing cycles to refine the rules for greater accuracy. Normally, the process started by finding terms that would correctly identify as many of the target documents as possible (recall). This was followed by testing and refining the rules to reduce the number of false positives (precision). For example, applying this approach to content type
classification provided an indication that overly broad rules were capturing casual references to work orders in addition to work orders themselves.

Another common technique starts with a small set of documents, achieves maximum recall and precision, and then tests the rules against new and larger sets of documents. This process can continue almost indefinitely, but since the payoff decreases each time, it is necessary to set an acceptable level of accuracy. For a pilot, this level is normally somewhat lower than for full development. The research team used this technique to develop rules for identifying work issues, using a subset of the total collection of documents to develop the rules. This led to the conclusion that the initial attempt to compensate for non-distinct terms was too restrictive, limiting the ability to identify issues in a larger document set. Full development beyond what was done for the pilot would entail generalizing the rules to facets other than work issue, applying the rules to more content (which might mean additional development), and aiming for "production level" accuracy. A full set of the rules applied in the pilot is included in Annex 1. This includes rules for each of the facets specified in Table II-A-2.

Incorporating Structured Data

The rules locating project identifiers allowed the search tool to incorporate structured project information from the project list. As described in the discussion of "synonyms," the research team linked the UPC and contract ID to project numbers through this list, so that a document containing any of the three project identifiers is tagged with a project number. The team also looked for a shortened version of each project number. Identified as a "project ID short" in the ontology, this identifier finds project numbers without the leading (FO) or (NFO) of the project.
It also drops the characters following the comma in project numbers, as these indicate different phases of the same project. This approach did find more projects, but the research team adds a caution to this approach: it would require additional testing before a full implementation to ensure that it is finding projects correctly. Once the project number is found using any of these identifiers, it is then used as the lookup value to tag the document with project attributes, such as District, Routes, Contract Award Amount, Type of Work, and Road System. This capability results in a more powerful search tool that combines use of structured data resources (e.g., master project data) with text search capabilities.
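The shortened identifier and attribute lookup described above can be sketched in Python. The project number below comes from the report's own example; the attribute values are hypothetical stand-ins, and the matching is simplified to substring checks:

```python
import re

# Hypothetical project list entry keyed by the full project number; in the
# pilot, the master project list supplied attributes such as District.
PROJECT_LIST = {
    "(FO)0015-030-117,C501": {"District": "Hampton Roads",          # illustrative
                              "Contract Award Amount": 2_450_000},  # illustrative
}

def project_id_short(project_number: str) -> str:
    """Drop the leading (FO)/(NFO) and the characters after the comma,
    which indicate different phases of the same project."""
    short = re.sub(r"^\((?:FO|NFO)\)", "", project_number)
    return short.split(",")[0]

# Index the structured data by the shortened identifier as well.
SHORT_INDEX = {project_id_short(k): v for k, v in PROJECT_LIST.items()}

def tag_document(text: str) -> dict:
    """Tag a document with structured project attributes when a shortened
    project number appears in its text (simplified substring match)."""
    for short, attrs in SHORT_INDEX.items():
        if short in text:
            return {"Project": short, **attrs}
    return {}
```

As the report cautions, matching on the shortened form would need further testing before full implementation, since shorter identifiers raise the chance of false matches.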
Faceted Search Design

The research team built a simple search interface, with facets on the left and search results in the middle of the screen. The faceted search design allows users to explore the content and refine the search results by filtering through selected criteria. Annex 2 includes example search scenarios with screenshots of the faceted search design. The challenge in building a faceted search capability is in generating enough metadata to support each of the facets; however, the rule development described above allowed the research team to do this by auto-categorizing documents with metadata. Figure II-A-7 demonstrates the search tool's use of facets in an example based on a search for documents in the Hampton Roads District. Upon search submittal, the list of facets appears to allow the users to further filter the results. For example, a user searching for daily work reports for projects in the Hampton Roads District would be able to select "Daily work report" under the "Content Type" facet, and the original set of 1,224 Hampton Roads documents would be further limited to the 197 Hampton Roads District daily work reports. The user can sequentially select facet criteria to allow for combinations of criteria across facets and further limit the results. The faceted search design allows users to select a combination of filters to more quickly find information in response to user business questions and search needs. It also allows users to find target documents by starting with what they know (e.g., that the project involved an Interstate road system or a particular type of work). In this way, a rich set of facets can support a variety of users who start with different knowledge. Users can begin a search in two ways: with a free text search or with a taxonomy-driven search. As users begin typing, the system automatically suggests any terms that match in the taxonomy.
This type-ahead feature displays terms that match the letters a user types, with the closest matches displaying first in the list. For example, if a user types "Ut" into the search box, the search tool offers suggestions for the "UTIL" work order category, the "Utilities Issue" work issue, and the "Mitchell Utilities LLC" manufacturer, among others. This option allows users to search for documents about a term in the taxonomy instead of documents that only mention a term. For example, a file (such as an email) might mention a daily work report but not contain one. A simple text search for "daily work report" would return this file as a result, while a taxonomy-driven search would return only files that actually contain daily work reports (because the categorization rules will only tag these documents as daily work reports).

Figure II-A-7. Facets within example search.
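A minimal sketch of this type-ahead behavior, assuming the taxonomy is a flat list of term strings and that a term matches when any of its words starts with the typed prefix (the ranking rule here is an assumption, not the pilot tool's actual algorithm):

```python
# Taxonomy terms drawn from the example above; a real ontology is far richer.
TAXONOMY = ["UTIL", "Utilities Issue", "Mitchell Utilities LLC",
            "Drainage Issue", "Weather Issue"]

def suggest(prefix: str, limit: int = 10) -> list[str]:
    """Suggest taxonomy terms for a typed prefix, closest matches first:
    terms that begin with the prefix rank ahead of word-level matches."""
    p = prefix.lower()
    hits = [t for t in TAXONOMY
            if any(word.lower().startswith(p) for word in t.split())]
    return sorted(hits, key=lambda t: (not t.lower().startswith(p), len(t)))[:limit]

print(suggest("Ut"))  # ['UTIL', 'Utilities Issue', 'Mitchell Utilities LLC']
```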
The pilot solution also includes a capability to browse facets at the outset instead of needing to type in a search term to start a search. This provides users with a starting point for the search if they are unsure of what search term to use. For example, a user may have an interest in searching by a particular type of work, but may want to examine the types of work listed under this facet before choosing what term to use. This discovery works in a similar way to the visual ontology, as discussed in the Semantic Model Development section. The faceted tool incorporates both structured and unstructured data. It allows users to search or filter by the structured information contained in the project list or the unstructured information contained in the documents. For example, a user could search for utility issues on projects greater than $1,000,000 using:
1) The rules built around unstructured information to develop the work issue facet and automatically identify work issues in free-form text;
2) The incorporation of structured information for contract amount based on the project identifier; and
3) The identification of a project identifier that bridges the unstructured and structured information.
The pilot version of the faceted search tool does not currently incorporate logic related to the location of information within a multi-faceted search. For example, a user intending to find documents containing a specified material supplied by a specified manufacturer could filter document results to those containing the specified material, then filter a second time to those containing the specified supplier. The ensuing results, however, will be limited to documents that include the specified material and the specified supplier.
Current rules do not guarantee a relationship between these two facets: it is possible that the manufacturer appears in the document in reference to an entirely different material, in a different location within the document. This type of logic could be a future extension for the search tool. The research team also considered adding a Best Bets option to select specific documents to automatically return at the top of the results list when a user types a particular term. The pilot does not include this feature because the research team did not have enough information from the interviews about specific documents that users might want to search for; however, this could also serve as a future extension for the search tool. Another future extension could be a user personalization feature, which would expose only certain sets of facets to particular users. This could greatly enhance the usefulness of search. For the pilot, all of the content was loaded into a single repository. However, current commercially available search tools support searches across repositories. These capabilities require custom configuration to account for different file formats, access methods, and security protocols. Additionally, although the pilot design used the Microsoft FAST search engine, other search engines could be used. For example, Google Search, Apache Solr, and others could incorporate similar concepts into a faceted search design. Basic features of the pilot search tool were recorded in a video file, which can be accessed at: http://sites.spypondpartners.com/nchrp2097/Solution%20Demonstration.mp4
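The combined structured/unstructured query described in this section (utility issues on projects greater than $1,000,000) can be sketched as follows. The document tags, the second project number, and the dollar amounts are hypothetical; only the overall pattern (facet tags plus a structured lookup through the project identifier) reflects the pilot design:

```python
# Hypothetical output of the classification rules: each document carries
# rule-derived work issue tags and a recognized project identifier.
DOCS = [
    {"id": 1, "work_issues": {"Utilities Issue"}, "project": "0015-030-117"},
    {"id": 2, "work_issues": {"Weather Issue"},  "project": "0100-122-V02"},  # made-up project
]

# Structured attributes looked up through the project identifier
# (amounts are illustrative).
PROJECTS = {
    "0015-030-117": {"contract_award_amount": 2_450_000},
    "0100-122-V02": {"contract_award_amount": 600_000},
}

def search(issue: str, min_amount: float) -> list[int]:
    """Filter documents by a rule-derived work issue facet, then by the
    structured contract amount linked via the project identifier."""
    out = []
    for d in DOCS:
        amount = PROJECTS[d["project"]]["contract_award_amount"]
        if issue in d["work_issues"] and amount > min_amount:
            out.append(d["id"])
    return out

print(search("Utilities Issue", 1_000_000))  # [1]
```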
A.5: Test and Evaluation

Rule-Based vs. Plain Vanilla FAST Search

One component of the pilot evaluation differentiates between a rule-based search and a plain vanilla FAST search. The rule-based search includes the rules and automation specified by the research team, while the vanilla search includes the same body of content but without any built-in rules, i.e., similar to the status-quo "out-of-the-box" search environment. The vanilla search was set up by porting the pilot content into a separate instance of the content management system without the ontology and developed facets. There are a limited number of facets available with the standard setup, including document type (Word, PDF, etc.), Author (who last touched the file), Date (file date), and Company (a generic set of company names). The evaluation compares the two environments.

Testing and Subjective Evaluation

The research team collected baseline metrics on success rate, using a selected set of search test cases that reflect the business questions and search needs collected through the VDOT interviews. The baseline metrics for these test cases were collected using the vanilla search, with post-improvement values collected through the rule-based search. Evaluation metrics include:
• Precision of results
• Recall of results
• Time spent to compile results
• Qualitative user feedback
Metrics specifying the precision of results were collected to identify if the rule-based search resulted in documents more or less relevant to the user than the vanilla search. These metrics include:
• The total number of results, which is used both to calculate the percentage of total results that were relevant and to identify test cases where a search appeared to have "too many" results, i.e., cases unlikely to have high precision.
• The position of the first relevant document signifies if the earliest results (sorted by relevancy) are relevant to the test case.
A user searching for an individual document would prefer to have that document appear earlier in the results; similarly, a user searching for a set of documents would prefer to find relevant documents immediately rather than after many results. Relevancy was defined individually for each test case.
• The number of relevant documents in the top 20 results assesses how many of the first set of documents were relevant to the user. In cases where the search returned fewer than 20 results, this metric considers all results (e.g., the number of relevant documents in the top 15 results in a search that only returns 15 results).
• The percentage of documents in the top 20 results that are relevant provides a key precision comparison metric. The percentage is calculated by dividing the number of relevant documents identified by 20 (or by the total number of results if fewer than 20).
• The number of documents needed to find 10 relevant results (or in some scenarios, a number less than 10) allows the research team to compare the vanilla and rule-based search capabilities to find a set of documents.
The research team also collected metrics related to the recall of results to assess which search method is more capable of returning a full set of documents desired by the user. These metrics include:
• The number of known relevant documents, which is used to calculate the recall percentage, and is evaluated based on the content collection structure. The majority of the content was downloaded by project, so individual project documents are readily identified outside of the search environment. The remaining documents were not downloaded by project, but were each examined individually to determine the associated project. Because of this structure, recall metrics focus on project identifiers. To evaluate recall of other facets (e.g., work issues), the research team would need to read and manually tag each of the documents in the content set, a time-consuming process.
• The number of relevant documents in the top 30 results assesses how many of the first set of results are relevant to the user. In cases where the search returns fewer than 30 results, this metric considers all results (e.g., the number of relevant documents in the top 15 results for a search that only returns 15 results). The choice to examine the top 30 results was based on time constraints, and on the concept that a well-performing search should find the relevant results early in the result set.
• The recall in the top 30 results divides the number of relevant documents in the top 30 results by the number of known relevant documents. A higher number represents higher recall, i.e., the search finds a greater percentage of the total known results.
The research team also applied other evaluation metrics in specific cases when appropriate.
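The metrics listed above reduce to simple calculations. This sketch assumes search results arrive as an ordered list of document IDs and that the set of relevant documents for a test case is known:

```python
def first_relevant_position(results, relevant):
    """1-based rank of the first relevant document, or None if absent."""
    for i, doc in enumerate(results, start=1):
        if doc in relevant:
            return i
    return None

def precision_at(results, relevant, k=20):
    """Share of the top-k results that are relevant; if fewer than k
    results were returned, the actual result count is the divisor."""
    top = results[:k]
    return sum(1 for d in top if d in relevant) / len(top)

def recall_at(results, relevant, k=30):
    """Relevant documents found in the top k, over all known relevant docs."""
    hits = sum(1 for d in results[:k] if d in relevant)
    return hits / len(relevant)

results = ["d3", "d1", "d9", "d2"]          # ordered search results
relevant = {"d1", "d2", "d5"}               # known relevant set
print(first_relevant_position(results, relevant))  # 2
print(precision_at(results, relevant))             # 0.5
print(round(recall_at(results, relevant), 2))      # 0.67
```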
For example, when searching for a specific document, the research team evaluated the result position of the document to demonstrate the relative ease with which a user could find the document (in a sense, as a proxy for the time that it would take a reader to find the document). The full set of evaluation test cases and metrics is provided in Annex 3. General findings from the pilot evaluation are as follows:
• The combination of structured (i.e., linked through the project identifier) and unstructured information allows the user to conduct rule-based searches that would not be possible in a vanilla search by using related information from outside of the documents.
• The rule-based search often results in higher precision of specified searches. In some cases, this comes at the cost of lower recall. This is particularly true for searches related to the project number (which included combinations of letters and numbers), as it appeared that the FAST software more easily distinguished "0"s from "O"s and "5"s from "e"s in documents with poor scan quality, possibly due to differences in the capabilities of reading OCRed text in each piece of software.
• Including rules to limit the scope of work issue identification within the document based on the standard section headings resulted in higher precision but lower recall. A more complete application could explore tweaking the rule to capture some of the variants present in the
section heading language, which could increase recall. Similarly, a more complete application could examine if including rules for finding work issues in additional parts of the document would allow for increased recall while maintaining a high level of precision.
• The use of synonyms in the rule-based search provides greater precision and recall in a number of searches. For example, a rule-based search is able to find documents that use any one of three project identifiers (project numbers, contract IDs, or UPCs) even when users search for another. For example, if a user searches for project number 867-4305 and a document uses the contract ID 1986 but not project number 867-4305, the search results will include the document. Synonym matching behind the scenes means that users need to know only one important piece of information to find the content they need.
• The vanilla search is adequate for searches of terms that do not have many synonyms in common use or many different contexts. For example, the vanilla search would have high precision and recall of results on a search for Source of Materials forms, which have fairly constant language. However, terms that apply to multiple contexts are better suited to a rule-based search. For example, a vanilla search for a specific route may return results using the number in a different context (e.g., as part of an address). The rule-based search is able to provide some structure to these searches to improve result precision.
• The rule-based search interface can accommodate misspellings by suggesting search terms from the ontology, resulting in greater search recall and precision. Similarly, the rule-based search interface can suggest terms to further filter within facets, allowing a user to add specificity to a search through name recognition.
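The synonym linkage among the three project identifiers can be sketched as a lookup over the project list. The project number and contract ID below come from the report's own example; the UPC value is a made-up placeholder:

```python
# Each group links the three identifiers for one project; in the pilot this
# mapping came from the master project list. The UPC here is a placeholder.
SYNONYM_GROUPS = [
    {"project_number": "867-4305", "contract_id": "1986", "upc": "52814"},
]

def expand_query(term: str) -> set[str]:
    """Expand a searched identifier to all equivalent identifiers so that
    a document using any one of them is found."""
    for group in SYNONYM_GROUPS:
        if term in group.values():
            return set(group.values())
    return {term}
```

With this expansion, a search for project number 867-4305 also matches documents that mention only contract ID 1986, mirroring the behavior described above.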
The test cases in Annex 3 are organized into the four information needs categories described in the "Findability Needs" section. A brief summary of results applicable to each of these categories is also provided below:
1. Find a Single Known Document for a Project (e.g., an estimate) Using a Variety of Search Criteria. The rule-based search can accomplish this effectively, but for some types of searches a vanilla search can also do so if specified at a similar level. This is particularly true for simple searches with a few search terms. The main advantages in using the rule-based search here derive from general findings discussed above.
2. Find/Review All Documents for a Project (e.g., for a FOIA Request). The rule-based search is able to find documents containing the project number, contract ID, or UPC, regardless of which is specified in the search. This flexibility provides an advantage over the vanilla search, which requires that the project identifier specified exist in the document. The degree to which this is helpful depends on the search. For example, searching across all project documents would favor the rule-based approach because of the variety of project identifiers used in different project documents; searching for daily work reports containing a specific contract ID could return similar results in both a rule-based and vanilla search because the contract ID is so prevalent in daily work reports.
3. Search Across Projects - Find Projects with Item, Material, Construction Technique. Again, a major advantage here of a rule-based search derives from the use of synonyms. For example, a rule-based search for a pay item by number would also find instances where the pay item name appears without a number. Meanwhile, the structure of the Source of Materials forms results in high levels of precision for both vanilla and rule-based searches related to materials,
with the main advantages in using the rule-based search derived from general findings discussed above.
4. Research Reasons for Delays and Changes. The ability of the rule-based search to define work issues allows a user to specify a singular work issue instead of multiple terms (e.g., the "utility work issue" is built on over 40 phrases). This categorization allows a user to more easily search for documents with a specified issue. The rule-based search limits the number of results that a user must review to research reasons for delays and changes, with high precision for these results. Additionally, the rule-based search provides the user with the ability to search for documents with a VDOT-specified work order category with higher precision than through the vanilla search (due to the frequent use of work order category terms such as "VDOT" and "ADD" in different contexts).

Virginia DOT User Evaluation

The research team demonstrated the pilot solution at a focus group at the Virginia DOT on February 9, 2016. Focus group participants included:
• Telecom Coordinator, Business Owner for InsideVDOT (VDOT's intranet)
• District Technical Resource Manager
• Knowledge Management Office Director
• Quality Specialist for Internet and Extranet
• Information and Knowledge Management Program Coordinator
• District Technology Resource Manager
• Project Manager for the Project Document Management System Project
• Construction Project Controls Lead
• District Construction Administrator
Feedback from focus group participants focused mainly on potential implementation of a text analytics solution. Users noted that implementation would require rules, governance, and buy-in across the agency. They would need to make a business case (e.g., staff time or money savings) in order to receive that buy-in and funding.
One participant offered a potential business case: this type of findability solution could decrease storage costs (due to reduction of duplicative information) and save millions of dollars. This type of argument would need to be further demonstrated and reinforced. Additionally, focus group participants discussed the agency roles that would need to accompany an implementation effort. For example, they suggested a "reviewer" role for checking metadata. This effort would also require agency staff to have the skills to build ontologies and rules. VDOT does have a professional librarian who has this skill set. Focus group participants were also interested in the technical implementation. They mentioned that any text analytics solution would need to be integrated with VDOT's current content management/collaboration solutions as part of the document intake process. Furthermore, implementation of a federated search tool would require work to identify and work through variations in permissions. The indexer could be granted the necessary access in order to do the auto-classification, but user access restrictions on the content itself would need to be enforced.
Transferability Analysis

The final component of the pilot evaluation considers the transferability of the findability solution and development process to other agencies. While each findability solution should be driven by an information architecture that is tailored to the agency's situation, the general ideas behind the pilot setup contained in this report are transferable, including the logic behind the rules in Annex 1. Using a similar approach in any software will likely produce similar results. WSDOT provided information and feedback for purposes of this transferability analysis.

Transferability Subjective Testing

To test the transferability of classification rules, WSDOT provided content similar to the VDOT content: 100 inspector's daily work reports, 134 change orders, and 112 requests for approval of material. This content contains similar language and form fields, with similar content quality issues (e.g., poor scans, variation of usage and forms). However, these documents were generated from applications, so they are more consistent from document to document. The research team evaluated the effort needed to convert rules developed for VDOT documents to apply to the WSDOT documents in order to provide an estimate of the transferability of the rules to other agencies. To do this, the team directly applied the ontology built for VDOT to the WSDOT content using Smartlogic's Classification Server. Table II-A-4 contains transferability results of the rules to classify content type. These metrics examine the recall of the classification, comparing the documents that were correctly classified as a given content type to the known total number of documents of that content type. The results demonstrate that the current classification rules for content type as work order or inspector's report could be directly applied to WSDOT without any changes.
The naming difference of the materials form would require a simple addition of the WSDOT term to the rule to correctly classify those documents at WSDOT.

Table II-A-4. Content type transferability results: recall.
• Work Orders/Change Orders: 134 total documents (including 5 unreadable); 125 correctly classified; 93% recall
• Daily Work Reports/Inspector's Daily Reports: 100 total documents (including 2 unreadable); 98 correctly classified; 98% recall
• Source of Materials Forms/Request for Approval of Material: 112 total documents; 0 correctly classified; 0% recall

Table II-A-5 presents transferability results of the rules to classify work issues. These metrics examine the precision of the classification, comparing the documents that were correctly classified to the total number of documents that were classified for each work issue. Based on these results, the current rules for a drainage issue and a weather issue could be applied directly to WSDOT; however, the rules
for a utility issue tend to capture some of the materials listed in the change order. Improving the precision of these rules would require narrowing the language and adjusting the scope of the rules to avoid that content.

Table II-A-5. Work issue transferability results: precision.
• Drainage Issue: 19 documents classified; 19 correctly classified; 100% precision
• Weather Issue: 13 documents classified; 11 correctly classified; 85% precision
• Utility Issue: 16 documents classified; 7 correctly classified; 44% precision

To customize the pilot content and process, WSDOT could follow the steps shown in Figure II-A-8. A similar approach also would apply to other agencies choosing to customize this approach. The successful classification of most of WSDOT's content suggests, however, that the information architecture developed for VDOT is usable by other agencies even though the ontology will need customization. The expected effort to customize other facets for WSDOT would vary. For example:
• Classifying manufacturers and suppliers, contractors, materials, equipment, district, and jurisdiction could be done within hours for each facet by extracting data for each and replacing the existing rules with the extracted data lists. Some of these items (e.g., materials) may also be able to build on the existing lists.
• Generating project information (e.g., type of work, award amount, etc.) and using this information to link to project identifiers within documents would take moderate effort (i.e., multiple days).

Figure II-A-8. Customization process:
• Pilot content: use data from current pilot system(s) (capture spelling, synonyms, model relationships); supplement with lists not in the database (text mining, existing resources)
• Customize with DOT-specific terms
• Create project profiles
• Develop text analytics rules (adjust indicator text)
• Model ontology relationships (project to district, route, etc.)
• Round of testing and refinement, yielding customized content

Washington State DOT User Evaluation

WSDOT staff also provided input through a focus group discussion on March 8, 2016. Participants provided feedback on how the features of the rule-
based search applied to the WSDOT context and uses. Participants represented the following functions:
• Knowledge Management
• Records Management
• Risk Management
• Communications (website)
• Construction
• Materials
• Research
• Data Management
• Information Technology Security
• Library Services
• Asset Management
The WSDOT focus group participants noted a number of existing information management "pain points" related to both technical and process challenges. These pain points included:
• Determining which document is authoritative.
• Fragmentation of information repositories, making documents difficult to find.
• Apparent redundancy of documents that actually provide a valuable historical record.
• Different formatting in documents received from subcontractors, including handwritten notes, which leads to difficulty in determining what material was used or what was installed on a project.
• Difficulty in finding information on assets that have been replaced.
• The use of email records that follow employees instead of remaining connected to the employee position.
• The use of multiple project identifiers.
• Internal and external dissatisfaction with search capabilities.
Focus group participants raised a number of questions and discussion points about the transferability of the pilot. As in the focus groups at VDOT, these discussions focused mainly on the potential implementation of the pilot search tool or a similar tool in the WSDOT setting. Participants were interested in the level of effort involved in this process, including the initial effort required to set up the environment, build the taxonomy, and tag documents. Beyond the initial effort, participants were interested in the effort required to maintain everything, including administration, validation, and updating of the ontology and search capabilities.
Participants noted that because taking the next steps with this process requires both time to lead the effort and additional financial resources, it would be useful to consider where the most payoff occurs for findability efforts. Much of the conversation also focused on the complexity of applying this tool to a complex DOT information landscape. Participants were interested in the search tool's ability to search both within databases and across repositories, and had questions about how to build the search tool across different kinds of servers, permissions, and access requirements. The focus group participants also discussed the capabilities of the text analytics tool used for the pilot, the availability of other, similar tools, and the mechanics of developing and maintaining a taxonomy over time. Finally, one participant noted that an ideal future tool would be able to integrate text analytics capabilities with a geospatial front end (i.e., allow users to conduct a faceted search for documents beginning with a map selection).
Annex 1
Pilot Classification Rule Descriptions

The following subsections provide descriptions of the rules used to classify documents. The actual rules used in the pilot were built using a programming language (as demonstrated in the example rule in Figure II-A-5). Each rule includes weightings of different factors that the research team identified while testing content for successful classification. Those factors are set to count as described below, but the algorithm in the software adjusts weights up and down.

Content Type

Daily Work Report
Find one of the following phrases:
1) "Daily report of construction"
2) "Inspector's daily report"
3) "project diary - daily work report"
4) "PM Diary"
Weight any of these phrases as 0.5.

Related to Work Order
Find the phrase "work order" combined with one of the following:
1) "approval"
2) "proposed"
3) Other related ideas (e.g., asking for a signature, or language in an email chain negotiating a work order or signifying that a work order is coming)
This series of phrases is meant to distinguish work orders from emails and documents that refer to work orders. Each of these phrases is weighted 0.5. If more than one of these phrases is present, the document is likely to have a higher relevancy score, as the weight of phrases is cumulative in Content Types.

Source of Materials Form
Find one of the following phrases:
1) "VIRGINIA DEPARTMENT OF TRANSPORTATION SOURCE OF MATERIALS" (must be all capital letters and an exact phrase)
2) "SOURCE OF MATERIALS" (must be all capital letters)
3) "C-25"
The first listed phrase is on all C-25 forms. This may not catch Source of Materials forms that are not on the official form, but it screens out referrals to Source of Materials in emails and daily work reports. If the second phrase is caught instead, a lesser weighting (of 0.25) is used to contribute to the classification. This helps account for references in other material where writers have used all capital letters for the entire document.
Exclude the following combination from contributing to the classification:
1) "source of materials" in conjunction with any tense of the verb "to be" has zero weight, since this combination generally referred to a Source of Materials instead of denoting a C-25.
The first phrase is set to "score if found," meaning that it is weighted at 1.0. The second and third phrases are weighted at 0.25.

Work Order
Find one of the following:
1) The phrases "location and description of proposed work" and "responsible charge engineer's explanation of necessity for proposed work"
2) The phrase "Change Order" (must be capitalized and an exact phrase).
Both of these phrases are weighted at 1.0.

Contract Award Amount
Find the following:
1) A project identifier, as defined in the "Project" rules
If the project identifier is found, the contract award amount is added based on the project information provided in the project list. Contract award amounts have no weightings, as their scores are inherited from the project rules.

Contractor
Find one of the following:
1) A single mention of the contractor
2) A single mention of the contractor with a variation of the name that drops a suffix.
In the second case, examples of the dropped suffix include but are not limited to: "Inc.", "LLC", "Corp.", "Corporation", "Company", and "Co." For example, the rules would find references to "Branscome Inc." or "Branscome". Each contractor is weighted at 1.0.
The list of values for contractors is defined in a separate list.
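The cumulative phrase weighting described in these rules can be roughly sketched as follows, using the Daily Work Report phrases and their 0.5 weights from above. The pilot implemented this in the text analytics tool's own rule language, and the 0.5 acceptance threshold here is an assumption for illustration:

```python
# Phrase weights taken from the Daily Work Report rule (each phrase 0.5);
# matching is simplified to case-insensitive substring checks.
DAILY_WORK_REPORT_PHRASES = {
    "daily report of construction": 0.5,
    "inspector's daily report": 0.5,
    "project diary - daily work report": 0.5,
    "pm diary": 0.5,
}

def score(text: str, phrases: dict[str, float]) -> float:
    """Sum the weights of all phrases found (weights are cumulative)."""
    t = text.lower()
    return sum(w for phrase, w in phrases.items() if phrase in t)

def is_daily_work_report(text: str, threshold: float = 0.5) -> bool:
    """Classify when the cumulative score reaches the assumed threshold."""
    return score(text, DAILY_WORK_REPORT_PHRASES) >= threshold
```

The same pattern extends to the other content types: more phrase hits accumulate more weight, which is why documents matching several indicator phrases score as more relevant.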
District
Find one of the following:
1) A project identifier, as defined in the "Project" rules
2) Several mentions of a district
If the project identifier is found, the district is added based on the project information provided in the project list. In most instances, districts have no weightings, as their scores are inherited from the project rules. Districts inherit their scores from project identifiers because documents often do not refer to the district by name, but it can be important for users to know the district associated with a project. However, when found in documents, districts are weighted as follows: the first mention receives a weight of 0.35. Subsequent weightings are adjusted algorithmically so that it takes four mentions of a specific district to score a base relevancy of 0.48. The list of values for districts is defined in a separate list.

Equipment
Find:
1) A single mention of the equipment
Each piece of equipment is weighted at 1.0. The list of values for equipment is defined in a separate list.

Jurisdiction
Find one of the following:
1) A single mention of a city
2) A single mention of a county
Each city and county phrase is weighted at 1.0. The lists of values for cities and counties are defined in separate lists.

Manufacturers and Suppliers
Find one of the following:
1) A single mention of the manufacturer/supplier
2) A single mention of the manufacturer/supplier with a variation of the name that drops a suffix.
In the second case, examples of the dropped suffix include but are not limited to: "Inc.", "LLC", "Corp.", "Corporation", "Company", and "Co." For example, the rules would find references to "BMG Metals, Inc." or "BMG Metals". Each manufacturer and supplier is weighted at 1.0. The list of values for manufacturers and suppliers is defined in a separate list.
Materials

Find:
1) A single mention of the material

Each material is weighted at 1.0. The list of values for materials is defined in a separate list.

Pay Items

Find one of the following:
1) A single mention of the pay item name
2) A single mention of the pay item number

Each pay item is weighted at 1.0. The list of values for pay items is defined in a separate list. Either of these two items will link to the document, so that users will find all documents for a pay item when searching with either pay item identifier, even if the specified identifier is not in the document.

Project

Find one of the following:
1) A project number with or without the leading "(FO)" or "(NFO)"
2) A project number without any number-letter combination that follows the final comma
3) A UPC number preceded by "UPC"
4) A contract ID

As an example of the project number criteria (1 and 2 above), the rules would search for "0015-030-117" and "(FO)0015-030-117" to find project number "(FO)0015-030-117,C501." Any of these four items will link to the document, so that users will find all documents for a project when searching with any project identifier, even if the specified identifier is not in the document. Each project number, contract ID, and UPC is weighted at 1.0.

Road System

Find the following:
1) A project identifier, as defined in the "Project" rules

If the project identifier is found, the road system is added based on the project information provided in the project list. Road systems have no weightings, as their scores are inherited from the project rules.

Route

Find one of the following:
1) A project identifier, as defined in the "Project" rules
2) Several mentions of a route preceded by "RTE"
3) Several mentions of a route preceded by "RT"
4) Several mentions of a route preceded by "ROUTE"
5) Several mentions of a route preceded by "U.S."
6) Several mentions of a route preceded by "SR"
7) Several mentions of a route preceded by "State Route"

If the project identifier is found, the route is added based on the project information provided in the project list. In most instances, routes have no weightings, as their scores are inherited from the project rules. Routes inherit their scores from project identifiers because documents often do not refer to the route by name, but it can be important for users to know the route associated with a project. However, when routes are found in documents, they are weighted as follows: the first mention receives a weight of 0.35, and subsequent weightings are adjusted algorithmically so that it takes four mentions of a specific route to score a base relevancy of 0.48.

Any of the route identifiers will link to the document, so that users will find all documents for a route when searching with any route identifier, even if the specified identifier is not in the document (e.g., a user searching for "RTE 66" will also find documents containing "RT 66", "ROUTE 66", etc.). The list of values for routes is defined in a separate list.

Type of Work

Find the following:
1) A project identifier, as defined in the "Project" rules

If the project identifier is found, the type of work is added based on the project information provided in the project list. Types of work have no weightings, as their scores are inherited from the project rules.

Work Issues

The work issue rules have a complex scoring system. First, a document is identified as one of the following content types:
1) Work Order
2) Daily Work Report
3) Related to Work Order

If the document classifies as one of these content types, it is given an initial score of 0.35 or 0.40. Otherwise, it is given an initial score of 0.
The document is then scanned for a list of phrases applicable to the specified work issue. The rules search for these phrases in three ways:
1) As an exact phrase (a "phrase score")
2) With the words within two words of one another (a "near score")
3) With the words within a sentence (a "sentence score")
For example, the phrase "abandoned gas line" could have the following matches:
1) A phrase score: "The design calls to remove the abandoned gas line"
2) A near score: "The gas line was abandoned"
3) A sentence score: "The gas utility abandoned several lines, necessitating additional work."

Based on these matches, the document score will increase. For example, a document that is one of the three specified content types and contains one work issue phrase would score a 0.48. If it contains two work issue phrases, it would score a 0.55. A document must score 0.48 or above to classify for a work issue.

If the rules find a work issue term outside of the three specified content types, the document receives a score of 0.11. If the rules find multiple work issue phrases, the score will increase with each instance (with the possibility that a document containing many references to a work issue will score 0.48 or above, and classify as a work issue, even if it does not meet the content type criteria).

This scoring process applies to each of the work issues. The following subsections provide further detail on the phrases included to classify each work issue.

Drainage Issue

Use the work issue criteria defined in the "Work Issues" section introduction, combined with the following phrases specific to drainage issues:
1) "add underdrain"
2) "adequate drainage"
3) "cleaning storm drain"
4) "different drainage"
5) "drainage abilities"
6) "drainage alteration"
7) "drainage analysis"
8) "drainage change"
9) "drainage error"
10) "drainage modifications"
11) "drainage problem"
12) "drainage revision"
13) "drainage structures"
14) "draining delay"
15) "erosion control"
16) "erosion problem"
17) "excessive erosion"
18) "modified underdrain"
19) "necessary drainage"
20) "new drainage"
21) "permanent diversion ditch"
22) "planned drop inlet"
23) "positive drainage"
24) "required new wingwalls"
25) "revise drainage"
26) "storm drain installation"
27) "storm drainage"
28) "storm sewer placement"
29) "underdrain installation"
30) "water ponding"

Utility Issue

Use the work issue criteria defined in the "Work Issues" section introduction, combined with the following phrases specific to utility issues:
1) "abandoned gas line"
2) "adjustment due to utilities"
3) "conflict with existing utilities"
4) "Dominion Power"
5) "existing pipes replaced"
6) "existing utilities"
7) "existing Verizon"
8) "fire hydrant"
9) "gas line"
10) "gas line in conflict"
11) "gas line in the way"
12) "gas main"
13) "gas main in conflict"
14) "install new manholes"
15) "new manholes"
16) "old gas lines"
17) "power company"
18) "power lines"
19) "relocate gas lines"
20) "relocated utilities"
21) "relocating utilities"
22) "remove gas lines"
23) "sewer line"
24) "sewer main"
25) "streetlight poles"
26) "streetlight relocation"
27) "telecommunication duct"
28) "telephone cable"
29) "telephone lines"
30) "utilities adjustment"
31) "utilities conflict"
32) "utility delay"
33) "utility relocation"
34) "utility situation"
35) "utility work"
36) "Verizon"
37) "Verizon in conflict"
38) "water service lines"
39) "waterline alignment"
40) "waterline placement"
41) "waterline relocation"
42) "waterline system"

These phrases do not include the term "utility issue," as this phrase is used to identify work order categories. Using it would result in misclassified documents.

Weather Issue

Use the work issue criteria defined in the "Work Issues" section introduction, combined with the following phrases specific to weather issues:
1) "anticipated hurricane"
2) "cold temperatures"
3) "due to rain"
4) "due to showers"
5) "due to weather"
6) "extreme heat"
7) "extreme rains"
8) "flood plain"
9) "heat"
10) "heavy rain"
11) "heavy rainfall"
12) "hot weather"
13) "ponding at road edge"
14) "prolonged curing"
15) "rain delay"
16) "rain events"
17) "rainfall inspection"
18) "severe weather"
19) "shutdown due to rain"
20) "shutdown due to weather"
21) "shutdown during months"
22) "warmer weather"
23) "weather conditions"
24) "weather delay"
25) "weather event"
26) "wet conditions"
27) "wet roadway conditions"
28) "wet weather"
29) "winter shut down"

Work Order Categories

Find all of the following:
1) Identification that the document is a work order, as described in the "Work Order" section
2) A work order category abbreviation within two paragraphs of "Category:"

Limiting the documents to work orders eliminates the possibility of finding a work order category in other document types. The list of values for work order categories is defined in a separate list.

Finding a work order category abbreviation includes a positional restriction (i.e., within two paragraphs) to find work order categories that OCR software does not read as being on the same line as "Category:" but that appear near "Category:". This restriction keeps the search from finding work order category information contained elsewhere in the document (e.g., in a list at the bottom), a variation that could be accounted for in a production environment (provided that the information is typed and not a handwritten "x" in a list box).
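The work issue scoring described in the "Work Issues" introduction can be sketched as follows. The content-type base scores, the 0.48 threshold, and the 0.11 outside-type score come from the report; the per-match increments (and the assignment of 0.40 to the Work Order type) are illustrative assumptions chosen to reproduce the reported one-match score, and the pilot's distinct phrase/near/sentence weights are not modeled.

```python
# Base scores by content type (report: 0.35 or 0.40; mapping 0.40 to
# Work Order is an assumption).
CONTENT_TYPE_BASE = {
    "Work Order": 0.40,
    "Daily Work Report": 0.35,
    "Related to Work Order": 0.35,
}
CLASSIFY_THRESHOLD = 0.48  # a document must reach this score to classify

def work_issue_score(content_type: str, n_phrase_matches: int) -> float:
    """Score a document for one work issue from its content type and phrase matches."""
    base = CONTENT_TYPE_BASE.get(content_type, 0.0)
    # Illustrative per-match weights: 0.20 reproduces the reported jump from a
    # 0.35 base to 0.48 with one match; 0.11 matches the reported score for a
    # single term found outside the three content types.
    per_match = 0.20 if content_type in CONTENT_TYPE_BASE else 0.11
    score = base
    for _ in range(n_phrase_matches):
        score += (1 - score) * per_match  # diminishing returns, capped below 1.0
    return round(score, 2)

def classifies(content_type: str, n_phrase_matches: int) -> bool:
    return work_issue_score(content_type, n_phrase_matches) >= CLASSIFY_THRESHOLD
```

As in the report, a document outside the three content types can still classify if it contains enough work issue phrases to accumulate a score of 0.48 or above.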
Annex 2: Example Scenarios Using Faceted Search Design

This Annex provides examples of search scenarios. These examples include screenshots of the faceted search design, in order to illustrate the pilot tool that the research team built. A video illustrating these and additional scenarios can be accessed at: http://sites.spypondpartners.com/nchrp2097/Solution%20Demonstration.mp4

Scenario 1: Finding Daily Work Reports for a Project

1. It is possible to search by entering the UPC "18944" into the search box. The corresponding project number appears, as it is linked through the project list.
2. It is also possible to search by entering the contract number ("R00018944C02") into the search box. The corresponding project number appears, as it is also linked through the project list. It is not necessary to type the entire number, due to the auto-suggest feature.

3. Finally, it is possible to search using the project number.
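The identifier linking behind these steps can be sketched as a lookup over the project list. The record fields are hypothetical stand-ins, and the sample record pairs identifiers from different examples in this report purely for illustration; it is not an actual VDOT project record.

```python
# Hypothetical project list records; the pilot's list also carried district,
# route, road system, type of work, and contract award amount.
PROJECT_LIST = [
    {"project_number": "(FO)0015-030-117,C501", "upc": "18944", "contract_id": "R00018944C02"},
]

def resolve_project(query: str):
    """Map any project identifier (UPC, contract ID, or project number) to its
    record, so a search on one identifier finds documents tagged with any of them."""
    q = query.strip().upper().removeprefix("UPC").strip()
    for record in PROJECT_LIST:
        if q in (record["upc"], record["contract_id"].upper()):
            return record
        if q and q in record["project_number"].upper():
            return record
    return None
```

The substring test on the project number mirrors rules 1 and 2 of the "Project" classification: "0015-030-117" resolves to "(FO)0015-030-117,C501" with or without the leading "(FO)" or the trailing contract segment.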
4. Once a project is selected, clicking on the magnifying glass begins a search and leads to a set of 14 results. A user can further filter these results by any number of facets. In the image below, the user may choose to filter by the "daily work report" content type by clicking on that facet. This further limits the set of results.
5. There are now 13 results, which can be filtered further. For example, a user searching for daily work reports within a specific time range could in theory filter using a date range (although the daily work reports in the sample did not have consistent or readable dates, so the pilot rules did not work with the date). To simulate this, the pilot does include the document update date, as demonstrated below. Selecting a group here may further limit the results (e.g., to two results for the earliest document-modified date range).
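Faceted filtering of a result set, as in steps 4 and 5 above, amounts to successive intersections over the facet values the classification rules assigned to each document. A minimal sketch, with hypothetical document records:

```python
# Hypothetical indexed documents: each carries facet values assigned by the
# classification rules.
DOCS = [
    {"id": 1, "content_type": "daily work report", "district": "Hampton Roads District"},
    {"id": 2, "content_type": "work order", "district": "Hampton Roads District"},
    {"id": 3, "content_type": "daily work report", "district": "Richmond District"},
]

def filter_by_facets(docs, **selected):
    """Keep only documents matching every selected facet value."""
    return [d for d in docs if all(d.get(f) == v for f, v in selected.items())]

# Each facet click narrows the previous result set.
results = filter_by_facets(DOCS, content_type="daily work report")
narrower = filter_by_facets(results, district="Hampton Roads District")
```

Because each click only intersects the current result set with one more facet value, the result count can only stay the same or shrink, matching the narrowing behavior shown in the scenario.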
Scenario 2: Finding Assets Supplied by a Specific Manufacturer

1. This example searches for content from a specific sign manufacturer, Korman Signs. Since suppliers and manufacturers are in the ontology, a user can search for the supplier directly by entering the name in the search box and selecting "Korman Signs… in Manufacturers and Suppliers."

2. This search results in 64 documents. The user can then filter further using the facet lists. For example, the user can filter the content by district: selecting "Hampton Roads District" would result in 16 documents.
3. These results can then be filtered by any number of additional criteria to further refine the search.
Annex 3: Evaluation Metrics

The following subsections present test case evaluations for the four different categories of information access needs identified in the VDOT interviews. Each test case description is accompanied by a definition of relevancy and a set of steps for both the vanilla and rule-based searches.

Find a Single Known Document for a Project (e.g., an Estimate) Using a Variety of Search Criteria

The first set of test cases compares how well the vanilla and rule-based searches could find a single document for a project using various search criteria. In the first example (Tables II-A-6a and II-A-6b), the research team searched for a daily work report with a specific date, route, and district, but an unknown project. This search assumes that the user would recognize the description when reading the document (i.e., the document file name is used in the relevancy criteria under the assumption that the user would recognize the text when reading this document).

In this test case, the rule-based search is able to take a complex set of criteria and narrow the results down to a manageable number of documents for the user to read through. Specifying the rule-based search takes more steps than the vanilla search, but this time is inconsequential compared to the time spent reading through the larger volume of content returned by the vanilla search.

Table II-A-6a. Relevancy Criteria and Search Steps: Daily Work Report for a Specific Date, Route, and District

Relevancy:
1) Document is a daily work report.
2) Daily work report is for the date of 9/7/2012.
3) Daily work report is for a project on Route 13.
4) Daily work report is for a project in Hampton Roads District.
5) Document is "0013-001-623/EST 06 DIARY REPORT.pdf".

Vanilla Search Steps:
1) Search: 9/7/2012 daily work report route 13

Rule-Based Search Steps:
1) Type 9/7/2012 without using auto-suggest, then search.
2) Select Content Type = Daily work report
3) Select District = Hampton Roads District
4) Select Route = US Route 13
5) Read each document until desired document found.

Table II-A-6b. Metrics: Daily Work Report for a Specific Date, Route, and District

Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 59 | 14
Result position of targeted document | 39 | 6
Number of steps to specify search | 1 | 4
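Metrics of this kind can be computed mechanically from a ranked result list once relevance judgments are available. A sketch of the three measures used throughout this Annex (the function names are ours, not the pilot's):

```python
def result_position(results, target_id):
    """1-based rank of the targeted document, or None if absent."""
    for rank, doc_id in enumerate(results, start=1):
        if doc_id == target_id:
            return rank
    return None

def precision_at_k(results, relevant, k=20):
    """Fraction of the top-k results (or all results, if fewer) that are relevant."""
    top = results[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def recall_at_k(results, relevant, k=30):
    """Fraction of all known relevant documents found in the top-k results."""
    found = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return found / len(relevant)
```

For the vanilla search in Table II-A-6b, `result_position` would return 39; for the rule-based search, 6.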
A second test case (Tables II-A-7a and II-A-7b) similarly anticipates that the user will recognize the document when reading it, and provides the criteria that the user is searching for an FHWA conceptual approval of work orders related to a drainage issue, in ".doc" format. In this example, the rule-based search is able to use the "Related to Work Order" facet to narrow the drainage issue content to a more manageable volume. Limiting the results to Microsoft Word documents narrows this even further, resulting in a maximum of only five documents to read through, compared to 36 for the vanilla search.

Although the total number of results is an improvement over the vanilla search, the result position of the targeted document is only a few documents earlier. In this case, the user is able to specify criteria in the vanilla search well enough to find the document as the fifth result. Although the rule-based search encourages a user to increase the detail of the search specification by providing additional options, a vanilla search does not. The level of specification is critical in understanding the effectiveness of the vanilla search.

Table II-A-7a. Relevancy Criteria and Search Steps: Microsoft Word Document for FHWA Conceptual Approval of Work Order Related to Drainage Issue

Relevancy:
1) Document is a Word document.
2) Document is a signed FHWA Conceptual Approval.
3) Conceptual approval is related to a drainage issue.
4) Document is "Concept_WO_10_Approval.doc".

Vanilla Search Steps:
1) Search: FHWA approval drainage
2) Select Result Type = Microsoft Word

Rule-Based Search Steps:
1) Type and select Drainage issue in auto-suggest, then search.
2) Select Content Type = Related to work order
3) Select Result Type = Microsoft Word
4) Read each document until desired document found.

Table II-A-7b.
Metrics: Microsoft Word Document for FHWA Conceptual Approval of Work Order Related to Drainage Issue

Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 36 | 5
Result position of targeted document | 5 | 2
Number of steps to specify search | 2 | 3

The third single-document test case (Tables II-A-8a and II-A-8b) searches for a specific work order for a bridge project in the Virginia Beach jurisdiction. In this test case, the rule-based search limits the documents to a manageable set to review, given some knowledge of the project and issue. Meanwhile, the vanilla search shows a considerable difference in the number of results depending on how much detail is specified in the search (with a greater level of detail specified, and better results, in Version 1 than in Version 2). Although it is not the case here, the position of the relevant result could also vary significantly based on the user's knowledge about the document. With substantial knowledge and a highly specific search, the rule-based search does not provide an improvement over the vanilla search. With less direct knowledge about the document, it is more likely to provide an advantage by offering suggestions on how to refine the search.
Table II-A-8a. Relevancy Criteria and Search Steps: Work Order for a Virginia Beach Bridge Project

Relevancy:
1) Document is a work order.
2) Document is for a project that has "Bridge" type of work.
3) Document is for a project in Virginia Beach.
4) Work order is about installing a horizontal directional drilled sanitary sewer force main.

Vanilla Search Steps:
Version 1
1) Search: Virginia Beach bridge work order horizontal drilled sewer main
Version 2
1) Search: Virginia Beach work order sewer

Rule-Based Search Steps:
1) Type and select Virginia Beach in auto-suggest, then search.
2) Select Type of Work = Bridge
3) Read each document until desired document found.

Table II-A-8b. Metrics: Work Order for a Virginia Beach Bridge Project

Metric | Vanilla Version 1 | Vanilla Version 2 | Rule-Based
Total number of results (excluding duplicates) | 3 | 74 | 5
Result position of targeted document | 1 | 2 | 3
Number of steps to specify search | 1 | 1 | 2

Find/Review All Documents for a Project (e.g., for a FOIA Request)

The research team also evaluated a number of test cases related to finding the set of documents for a specific project, based on the use of project identifiers. Because the rule-based search is able to link all project identifiers (contract ID, UPC, and project number), it is able to find project documents containing any of the three identifiers, regardless of which one the user enters in the search. This provides an advantage over the vanilla search wherever the identifier type entered in the search differs from the identifier type displayed in the document.

The first test case (Tables II-A-9a, II-A-9b, and II-A-9c) measures precision and recall of daily work reports using a specific contract ID to search. This search has the same perfect performance for both the vanilla and rule-based searches, demonstrating that both searches are capable of finding project daily work reports based on the contract ID.
As this identifier often appears in daily work reports, this result is unsurprising.

Table II-A-9a. Relevancy Criteria and Search Steps: Daily Work Reports for Contract ID V00014672C01

Relevancy:
1) Document contains a daily work report.
2) Daily work report is for Project 0337-122-F14, UPC 14672, or Contract ID V00014672C01 or C00014672C01.

Vanilla Search Steps:
1) Search: V00014672C01 daily work report

Rule-Based Search Steps:
1) Type V00014672C01 in auto-suggest, select the corresponding project number, and search.
2) Select Content Type = Daily work report.
3) Look for relevant documents.
Table II-A-9b. Precision Metrics: Daily Work Reports for Contract ID V00014672C01

Precision Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 8 | 8
Position of first relevant document | 1 | 1
Number of relevant documents in top 20 results (or in all results if fewer than 20) | 8 | 8
Percentage of documents in top 20 results (or in all results if fewer than 20) that are relevant | 100% | 100%
Documents needed to find 5 relevant results | 5 | 5

Table II-A-9c. Recall Metrics: Daily Work Reports for Contract ID V00014672C01

Number of known relevant documents: 8

Recall Metric | Vanilla | Rule-Based
Number of relevant documents in top 30 results | 8 | 8
Recall in top 30 results | 100% | 100%

The evaluation tells a different story when searching for documents using the UPC, another type of project identifier. The following test case (Tables II-A-10a and II-A-10b) searches for daily work reports for the same project, using the UPC instead of the contract ID. Although recall is high for this project when using the contract ID in the vanilla search, the vanilla search finds only 25% of the documents when using the UPC. Meanwhile, the results for the rule-based search remain unchanged. The rule-based search performs better than the vanilla search by linking UPC to project number and contract ID, enabling it to find documents that contain any of the project identifiers. The vanilla search, on the other hand, can only find documents that contain the UPC, an identifier less frequently used in VDOT daily work reports.

Table II-A-10a. Relevancy Criteria and Search Steps: Daily Work Reports for UPC 14672

Relevancy:
1) Document contains a daily work report.
2) Daily work report is for Project 0337-122-F14, UPC 14672, or Contract ID V00014672C01 or C00014672C01.

Vanilla Search Steps:
1) Search: daily work report 14672

Rule-Based Search Steps:
1) Type UPC 14672 in auto-suggest, select the corresponding project number, and search.
2) Select Content Type = Daily work report.
3) Look for relevant documents.

Table II-A-10b. Recall Metrics: Daily Work Reports for UPC 14672

Number of known relevant documents: 8

Recall Metric | Vanilla | Rule-Based
Number of relevant documents in top 30 results | 2 | 8
Recall in top 30 results | 25% | 100%

This pattern is similar when searching for work orders using the UPC, as illustrated in the following test case (Tables II-A-11a, II-A-11b, and II-A-11c). Again, the rule-based search outperforms the vanilla search for document recall, finding 92% of all relevant documents in the top 30 results (compared to 15% in the vanilla search). Notably, the vanilla search returns only 5 documents. The ability to find documents that include any project identifier enables the rule-based search to generate this higher recall value. Both the vanilla and rule-based searches have high precision (100% for the rule-based search in the top 20 results), demonstrating that the higher recall does not compromise precision.

Table II-A-11a. Relevancy Criteria and Search Steps: Work Orders for UPC 50057

Relevancy:
1) Document contains a work order.
2) Work order is for Project 0615-047-169, UPC 50057, or Contract ID U00050057C01 or C00050057C01.

Vanilla Search Steps:
1) Search: 50057 work order

Rule-Based Search Steps:
1) Type 50057 in auto-suggest and select (NFO)0615-047-169, then search.
2) Select Content Type = Work order.
3) Look for relevant documents.

Table II-A-11b. Precision Metrics: Work Orders for UPC 50057

Precision Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 5 | 31
Position of first relevant document | 2 | 1
Number of relevant documents in top 20 results (or in all results if fewer than 20) | 4 | 20
Percentage of documents in top 20 results (or in all results if fewer than 20) that are relevant | 80% | 100%
Documents needed to find 10 relevant results | Only 4 found | 10

Table II-A-11c. Recall Metrics: Work Orders for UPC 50057

Number of known relevant documents: 26

Recall Metric | Vanilla | Rule-Based
Number of relevant documents in top 30 results | 4 | 24
Recall in top 30 results | 15% | 92%

The vanilla search is highly effective in finding documents using the project number, however. The next test case (Tables II-A-12a and II-A-12b) considers recall of a search for work orders using the project number as the identifier. In this evaluation, the vanilla search performs better than the rule-based search.
The search rules are unable to find all documents with the specified project number. More fully built-out rules in a post-pilot scenario could improve on these results, as they would be able to account for issues such as poor scan quality causing 0's to be misread as o's (the FAST software appeared to be better able to read OCRed text in documents with poor scan quality, as discussed in the "Content Harvesting, Analysis, and Conversion" section).
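One way such built-out rules could compensate for scan-quality problems is to fold common OCR character confusions before comparing identifiers. This sketch is illustrative only; the confusion map and function names are assumptions, not part of the pilot.

```python
# Illustrative OCR confusion pairs: characters commonly swapped in scanned
# alphanumeric identifiers.
OCR_FOLD = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1"})

def normalize_identifier(text: str) -> str:
    """Fold OCR-confusable characters so 'o337-122-F14' and '0337-122-F14' compare equal."""
    return text.translate(OCR_FOLD).upper()

def same_project_number(a: str, b: str) -> bool:
    return normalize_identifier(a) == normalize_identifier(b)
```

Because both sides of the comparison are folded, a correctly scanned identifier and a degraded one map to the same normalized form.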
Table II-A-12a. Relevancy Criteria and Search Steps: Work Orders for Project Number 0337-122-F14

Relevancy:
1) Document contains a work order.
2) Work order is for Project 0337-122-F14, UPC 14672, or Contract ID V00014672C01 or C00014672C01.

Vanilla Search Steps:
1) Search: 0337-122-F14 work order

Rule-Based Search Steps:
1) Type 0337-122-F14 in auto-suggest, select the corresponding project number, and search.
2) Select Content Type = Work Order.
3) Look for relevant documents.

Table II-A-12b. Recall Metrics: Work Orders for Project Number 0337-122-F14

Number of known relevant documents: 34

Recall Metric | Vanilla | Rule-Based
Number of relevant documents in top 30 results | 27 | 18
Recall in top 30 results | 79% | 53%

While the first few test cases in this information need category considered daily work reports and work orders, the following test case (Tables II-A-13a and II-A-13b) searches for Source of Materials forms for a specific project. In this test case, the vanilla search is unable to prioritize Source of Materials forms: the first set of documents it finds consists of work orders instead. Following the work order results, most of the documents are Source of Materials forms; the precision would be higher if a greater number of results (beyond the top 20 documents) were included in the testing. Meanwhile, the rule-based search has extremely high precision, likely due to the consistency of the form structure for C-25s. It is able to identify the content type using the built rules with high precision.

Table II-A-13a. Relevancy Criteria and Search Steps: Source of Materials Forms for Project 0615-047-169

Relevancy:
1) Document contains a Source of Materials form.
2) Source of Materials form is for Project 0615-047-169, UPC 50057, or Contract ID U00050057C01 or C00050057C01.
Vanilla Search Steps:
1) Search: 0615-047-169 Form C-25

Rule-Based Search Steps:
1) Type 0615-047-169 in auto-suggest and select project, then search.
2) Select Content Type = Source of materials.
Table II-A-13b. Precision Metrics: Source of Materials Forms for Project 0615-047-169

Precision Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 42 | 17
Position of first relevant document | 16 | 1
Number of relevant documents in top 20 results (or in all results if fewer than 20) | 5 | 17
Percentage of documents in top 20 results (or in all results if fewer than 20) that are relevant | 25% | 100%
Documents needed to find 10 relevant results | 28 | 10

A final test case for this information need category demonstrates how the rule-based search can help identify documents from a project when the user enters an incorrect project identifier (Tables II-A-14a and II-A-14b). In this case, there is much higher recall in the rule-based search. The vanilla search has high precision, but finds a limited number of results because the contract ID identified in the work orders is "C00050057C01" instead of "U00050057C01." If the user typed "C00050057C01" instead of "U00050057C01" in the vanilla search, recall would be significantly higher in the top 30 results. If the user typed "C00050057C01" in the rule-based search, auto-suggest would not complete it, which would prompt the user to look up a different contract ID. The rule-based search is able to map U00050057C01 to the project number, which appears in the documents.

Table II-A-14a. Relevancy Criteria and Search Steps: Work Orders for Contract ID U00050057C01

Relevancy:
1) Document contains a work order.
2) Work order is for Project 0615-047-169, UPC 50057, or Contract ID U00050057C01 or C00050057C01.

Vanilla Search Steps:
1) Search: U00050057C01 work order

Rule-Based Search Steps:
1) Type U00050057C01 in auto-suggest and select (NFO)0615-047-169, then search.
2) Select Content Type = Work order.
3) Look for relevant documents.

Table II-A-14b.
Recall Metrics: Work Orders for Contract ID U00050057C01

Number of known relevant documents: 26

Recall Metric | Vanilla | Rule-Based
Number of relevant documents in top 30 results | 4 | 24
Recall in top 30 results | 15% | 92%

Search Across Projects: Find Projects with Item, Material, Construction Technique

The next set of test cases allows users to find documents or projects using a particular pay item, material, construction technique, or other identifier. The first test case in this set (Tables II-A-15a and II-A-15b) searches for daily work reports that include pay item 12600.
In this evaluation, the vanilla search is able to find a set of daily work reports that the rule-based search fails to classify. Both have 100% precision, but the vanilla search has higher recall in this case. A further built-out rule-based search would include rules to better identify these documents as daily work reports, and in turn would likely have recall similar to the vanilla search.

Table II-A-15a. Relevancy Criteria and Search Steps: Daily Work Reports Including Pay Item 12600

Relevancy:
1) Document is a daily work report.
2) Daily work report has a pay item for 12600, or "Std. Comb. Curb & Gutter CG-6."

Vanilla Search Steps:
1) Search: "12600" + "daily work report"

Rule-Based Search Steps:
1) Type 12600 in auto-suggest, select "STD. COMB. CURB & GUTTER CG-6," then search.
2) Select Content Type = Daily work report.

Table II-A-15b. Precision Metrics: Daily Work Reports Including Pay Item 12600

Precision Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 14 | 7
Position of first relevant document | 1 | 1
Number of relevant documents in top 20 results (or in all results if fewer than 20) | 14 | 7
Percentage of documents in top 20 results (or in all results if fewer than 20) that are relevant | 100% | 100%
Documents needed to find 5 relevant results | 5 | 5

While the previous test case searches for documents by pay item number, the following test case searches for documents by pay item name (Tables II-A-16a and II-A-16b). In this case, the vanilla search finds a number of documents with references to pay item Underdrain UD-4 (a more common pay item), but is unable to distinguish UD-4 from UD-2. The vanilla search also finds some documents that do not reference underdrains at all (even UD-4), but instead likely match the "UD" part of the search to other words in the documents, such as "include".
In this textual search for pay items, the rule-based search provides a more precise match to the desired terms.

Table II-A-16a. Relevancy Criteria and Search Steps: Documents Including Pay Item for Underdrain UD-2

Relevancy:
1) Document has a pay item for 00585, 00598, "Underdrain UD-2," or "Underdrain Modified UD-2."

Vanilla Search Steps:
1) Search: "underdrain" + "UD-2"

Rule-Based Search Steps:
1) Type UD-2 in auto-suggest, select "Underdrain UD-2," then search.
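The precision difference described above comes down to matching against a controlled list of pay item terms rather than loose keyword matching. A minimal sketch, with an abbreviated, illustrative pay item list:

```python
import re

# Abbreviated pay item list; the pilot linked numbers and names so that either
# identifier finds the document.
PAY_ITEMS = {
    "00585": "Underdrain UD-2",
    "00598": "Underdrain Modified UD-2",
}

def rule_based_match(text: str, pay_item_name: str) -> bool:
    """Match the exact pay item name as a whole phrase (case-insensitive)."""
    return re.search(re.escape(pay_item_name), text, re.IGNORECASE) is not None

doc = "Installed Underdrain UD-4 per plan; quantities included in estimate."
# An exact-phrase rule for "Underdrain UD-2" does not fire on UD-4 references,
# whereas a loose keyword search on "UD" can match unrelated words like "included".
```

This is why the rule-based search avoids both the UD-4 confusion and the false matches on words containing "ud".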
Table II-A-16b. Precision Metrics: Documents Including Pay Item for Underdrain UD-2

Precision Metric | Vanilla | Rule-Based
Total number of results (excluding duplicates) | 96 | 19
Position of first relevant document | 10 | 1
Number of relevant documents in top 20 results (or in all results if fewer than 20) | 3 | 19
Percentage of documents in top 20 results (or in all results if fewer than 20) that are relevant | 15% | 100%
Documents needed to find 10 relevant results | More than 30 | 10

The following test case attempts to identify Source of Materials forms referencing a specific supplier (Tables II-A-17a and II-A-17b). The main difference between the vanilla and rule-based searches in this example is that the vanilla search returns some letters that reference both Korman Signs and Source of Materials forms. This results in lower precision than in the rule-based search, where all 49 documents are relevant. The rule-based search is able to identify that these letters merely reference, but do not contain, Source of Materials forms, and so does not include them in the results. Meanwhile, the vanilla search finds the "Source of Materials" reference text and includes the documents.

Table II-A-17a. Relevancy Criteria and Search Steps: Source of Materials Forms for Korman Signs

Relevancy:
1) Document contains a Source of Materials form.
2) Source of Materials form contains an entry for Korman Signs.

Vanilla Search Steps:
1) Search: "Korman Signs" + "Source of Materials"

Rule-Based Search Steps:
1) Type and select Korman Signs in auto-suggest, then search.
2) Select Content Type = Source of materials.

Table II-A-17b.
Precision Metrics: Source of Materials Forms for Korman Signs

  Precision Metric                                                   Vanilla   Rule-Based
  Total number of results (excluding duplicates)                     65        49
  Position of first relevant document                                1         1
  Number of relevant documents in top 20 (or all, if fewer)          16        20
  Percentage of top 20 results (or all, if fewer) that are relevant  80%       100%
  Documents needed to find 10 relevant results                       14        10

The rule-based search is able to expand on this type of search and add additional criteria. The following test case (Tables II-A-18a and II-A-18b) similarly searches for Source of Materials forms containing a specific supplier, but adds the condition that the project be for a primary road. This structured information is available in the project list, so the rule-based search is able to search for a project identifier and link it to the road type.
Pilot Findability Report II-99

The vanilla search finds more documents, but is unable to classify them by road system. A high proportion of all projects are for primary roads, so the vanilla search performs fairly well here; it would not perform as well for other road types. The recall of the rule-based search is not as high as that of the vanilla search, because the rule-based search does not account for misspellings of the company name. This could be addressed in a fully built-out rule-based search function.

Table II-A-18a. Relevancy Criteria and Search Steps: Source of Materials Forms for Asphalt Emulsion Inc. on a Primary Road Project

  Relevancy criteria:
    1) Document contains a Source of Materials form.
    2) Source of Materials form contains an entry for Asphalt Emulsion Inc.
    3) Source of Materials form is for a project on a road system classified as "primary."
  Vanilla search steps:
    1) Search: "Asphalt Emulsion Inc. Source of Materials"
  Rule-based search steps:
    1) Type and select Asphalt Emulsion Inc in auto-suggest, then search.
    2) Select Road System = Primary.
    3) Select Content Type = Source of materials.

Table II-A-18b. Precision Metrics: Source of Materials Forms for Asphalt Emulsion Inc. on a Primary Road Project

  Precision Metric                                                   Vanilla   Rule-Based
  Total number of results (excluding duplicates)                     58        11
  Position of first relevant document                                4         1
  Number of relevant documents in top 20 (or all, if fewer)          13        11
  Percentage of top 20 results (or all, if fewer) that are relevant  65%       100%
  Documents needed to find 10 relevant results                       16        10

In a related test case (Tables II-A-19a and II-A-19b), the rule-based search allows the user to limit content to Source of Materials forms for a specific manufacturer, with the goal of identifying a particular contractor that the manufacturer supplied (relying on name recognition).
In this test case, the metrics are the total number of results in the specified search and the number of documents the user would need to read to identify the contractor (i.e., the position of the first document in which the contractor appears). The list of contractors under the "Contractors" facet in the rule-based search can quickly support name recognition, as was the case in finding "Branscome Inc." and "Slurry Pavers, Inc." in this test case. For these two instances, the user would not need to read any documents to identify the contractors, because both appear in the Contractors facet list. However, if a contractor is not among the most frequent contractors appearing in the results, it will not show up in the facet list. In those cases, finding the contractor in the rule-based search requires an approach similar to the vanilla search: opening each of the documents and reading the contractor name until it is recognized (as was the case for finding "Curtis Contracting, Inc."). Because the number of contractors shown in the facet list must remain small for usability, the advantage over the vanilla search is smaller in these cases.
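The Contractors facet list described above behaves like a frequency-ranked tag list: only the most common values among the tagged results are displayed. A sketch with illustrative contractor counts (the counts are not the pilot's actual figures):

```python
# Sketch of a faceted-search "Contractors" list: count the contractor
# tagged on each result and surface only the most frequent names, as a
# faceted UI typically does. Counts here are illustrative.
from collections import Counter

def facet_list(results, facet="contractor", top_n=5):
    counts = Counter(doc[facet] for doc in results if facet in doc)
    return counts.most_common(top_n)

results = (
    [{"contractor": "Branscome Inc."}] * 12
    + [{"contractor": "Slurry Pavers, Inc."}] * 7
    + [{"contractor": "Curtis Contracting, Inc."}] * 2
)
# Frequent contractors appear in the facet list without the user opening
# any documents; infrequent ones (here, Curtis Contracting) are cut off
# when the list is kept short for usability.
print(facet_list(results, top_n=2))
```

This illustrates the trade-off noted above: a short facet list gives instant name recognition for frequent contractors but no shortcut for rare ones.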
Table II-A-19a. Relevancy Criteria and Search Steps: Name Recognition of Contractor that Korman Signs Supplied

  Relevancy criteria:
    1) Document contains a Source of Materials form.
    2) Source of Materials form contains an entry for Korman Signs.
  Vanilla search steps:
    1) Search: "Korman Signs" + "Source of Materials"
  Rule-based search steps:
    1) Type and select Korman Signs in auto-suggest, then search.
    2) Select Content Type = Source of materials.

Table II-A-19b. Metrics: Name Recognition of Contractor that Korman Signs Supplied

  Metric                                              Vanilla   Rule-Based
  Total number of results (excluding duplicates)      69        62
  Documents read to find "Branscome Inc."             17        0 (Contractors list)
  Documents read to find "Slurry Pavers, Inc."        32        0 (Contractors list)
  Documents read to find "Curtis Contracting, Inc."   11        9

The rule-based search can also help a user identify when a search is incorrectly specified (e.g., misspelled). A vanilla search will take the misspelled entry and return zero relevant results (unless a document contains the same misspelling). Meanwhile, the auto-suggest feature of the rule-based search suggests terms from the built ontology. The following test case (Tables II-A-20a and II-A-20b) provides evaluation results for a search in which the user misspells the supplier name. In the vanilla search, the user searches with the misspelled name. In the rule-based search, the user begins to type the misspelled name and then selects the correctly spelled name using the auto-suggest feature. Alternatively, the user could search on the incorrect name and then select the correctly spelled name from the list of possible topics. Due to the misspelling, the vanilla search does not return any results. Because the rule-based search proposes a spelling correction as the supplier name is typed, it is able to match documents to the correct supplier.
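The auto-suggest behavior described above (prefix completion while typing, plus correction of a fully typed misspelling) can be approximated with fuzzy matching against the ontology's term list. `difflib` here is a stand-in for the pilot's actual suggestion engine, and the term list is illustrative:

```python
# Sketch of auto-suggest with spelling correction: match the user's
# input against known ontology terms so that typing "Korm" completes to
# "Korman Signs" and the misspelling "Kormen Signs" still resolves to
# it. The term list is an illustrative stand-in for the pilot ontology.
import difflib

ONTOLOGY_TERMS = ["Korman Signs", "Asphalt Emulsion Inc",
                  "Underdrain UD-2", "State Route 679"]

def suggest(typed, terms=ONTOLOGY_TERMS, n=3):
    t = typed.lower()
    lookup = {term.lower(): term for term in terms}
    # Prefix completions first, then fuzzy (misspelling) matches.
    prefix = [lookup[k] for k in lookup if k.startswith(t)]
    fuzzy = [lookup[c] for c in
             difflib.get_close_matches(t, list(lookup), n=n, cutoff=0.6)]
    out = []
    for name in prefix + fuzzy:
        if name not in out:
            out.append(name)
    return out[:n]

print(suggest("Kormen Signs"))  # corrects toward "Korman Signs"
```

The cutoff of 0.6 is an assumed tuning value; a production suggestion engine would also rank by term frequency and context.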
Table II-A-20a. Relevancy Criteria and Search Steps: Source of Materials Form with "Kormen" Signs Supplier Misspelling

  Relevancy criteria:
    1) Document contains a Source of Materials form.
    2) Source of Materials form contains an entry for Kormen Signs or Korman Signs.
  Vanilla search steps:
    1) Search: Kormen Signs Source of Materials
  Rule-based search steps:
    1) Type "Korm" in auto-suggest, select Korman Signs, then search. Alternatively, type "Kormen signs" and search, then select "Korman Signs" from the list of possible topics.
    2) Select Content Type = Source of materials.
Table II-A-20b. Precision Metrics: Source of Materials Form with "Kormen" Signs Supplier Misspelling

  Precision Metric                                                   Vanilla            Rule-Based
  Total number of results (excluding duplicates)                     0                  49
  Position of first relevant document                                N/A                1
  Number of relevant documents in top 20 (or all, if fewer)          0                  20
  Percentage of top 20 results (or all, if fewer) that are relevant  0%                 100%
  Documents needed to find 10 relevant results                       N/A (no results)   10

The rule-based search is also able to find documents based on project location. It does this in two ways: by linking the project identifier to the defined project information (e.g., district, route) using the project list, and by finding the location information directly in the document. The following test case (Tables II-A-21a and II-A-21b) evaluates this ability in a search for documents on a specific route, and finds considerably higher precision in the rule-based search.

Table II-A-21a. Relevancy Criteria and Search Steps: Documents for Projects on Route 679

  Relevancy criteria:
    1) Document is on a project for Route 679.
  Vanilla search steps:
    1) Search: "Route 679"
  Rule-based search steps:
    1) Type Route 679 in auto-suggest, select "State Route 679," then search.

Table II-A-21b. Precision Metrics: Documents for Projects on Route 679

  Precision Metric                                                   Vanilla        Rule-Based
  Total number of results (excluding duplicates)                     51             17
  Position of first relevant document                                1              1
  Number of relevant documents in top 20 (or all, if fewer)          2              17
  Percentage of top 20 results (or all, if fewer) that are relevant  10%            100%
  Documents needed to find 10 relevant results                       More than 30   10

The following test case (Tables II-A-22a and II-A-22b) further limits the content to work orders on a specific Interstate route within a specific district.
In this example, the vanilla search is unable to identify both the district and the route of the project. Meanwhile, the rule-based search is able to identify both pieces with precision.
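The linking of a project identifier to structured attributes (district, route, road system) described for these test cases can be sketched as a simple join against the project list. The identifiers, field names, and attribute values below are hypothetical placeholders, not VDOT's actual schema:

```python
# Sketch of enriching documents with project-list attributes and then
# filtering by facet, as the rule-based search does for the district,
# route, and road-system facets. All identifiers and values below are
# illustrative placeholders.

PROJECT_LIST = {
    "0101-001-A01": {"district": "Richmond", "route": "I-95",
                     "road_system": "Interstate"},
    "0202-002-B02": {"district": "Hampton Roads", "route": "Route 679",
                     "road_system": "Secondary"},
}

def enrich(doc):
    """Attach project attributes to a document via its project ID."""
    attrs = PROJECT_LIST.get(doc.get("project_id"), {})
    return {**doc, **attrs}

def facet_filter(docs, **facets):
    """Keep documents whose enriched metadata matches every facet."""
    return [d for d in (enrich(doc) for doc in docs)
            if all(d.get(k) == v for k, v in facets.items())]

docs = [{"doc_id": 1, "project_id": "0101-001-A01", "content_type": "Work order"},
        {"doc_id": 2, "project_id": "0202-002-B02", "content_type": "Work order"}]
print(facet_filter(docs, content_type="Work order", district="Richmond"))
```

The join is what lets a search constrain on information (road system, district) that never appears in the document text itself.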
Table II-A-22a. Relevancy Criteria and Search Steps: Work Orders Related to Route I-95 in Richmond District

  Relevancy criteria:
    1) Document contains a work order.
    2) Work order is about Route I-95.
    3) Work order is for a project in the Richmond District.
  Vanilla search steps:
    1) Search: I-95 work order Richmond District
  Rule-based search steps:
    1) Type and select I-95 in auto-suggest, then search.
    2) Select Content Type = Work order.
    3) Select District = Richmond District.

Table II-A-22b. Precision Metrics: Work Orders Related to Route I-95 in Richmond District

  Precision Metric                                                   Vanilla          Rule-Based
  Total number of results (excluding duplicates)                     106              6
  Position of first relevant document                                None in top 30   1
  Number of relevant documents in top 20 (or all, if fewer)          0                6
  Percentage of top 20 results (or all, if fewer) that are relevant  0%               100%
  Documents needed to find 5 relevant results                        None in top 30   5

Research Reasons for Delays and Changes

The final set of test cases examines the ability to research reasons for delays and changes in work. These reasons often appear initially in daily work reports, and are then included in work orders, which define material changes to a project. As noted in the "Rule Development and Refinement" section, two standard section headings (structural elements) signify a location within work orders to search for work issues. The following metrics include an evaluation of precision with and without this rule as part of the rule-based search. The evaluation tables that follow differentiate between "Rule-Based" (which does not include this rule) and "Rule-Based (Headings)" (which does). The search steps are the same for each rule-based evaluation, as the only differentiation occurs in the rules on the back end.
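The structural-element rule described above can be sketched as scoping phrase matching to named sections of a work order. The heading names and phrases below are hypothetical placeholders, not VDOT's actual standard headings:

```python
# Sketch of the section-heading ("structural element") rule: a work-
# issue phrase counts as a match only when it appears under one of the
# standard work-order headings. Heading names and phrases here are
# hypothetical placeholders.

SECTION_HEADINGS = ("Reason for Work Order", "Description of Change")

def issue_matches(sections, phrases, headings=SECTION_HEADINGS):
    """True if any issue phrase appears within a targeted section."""
    scoped = " ".join(sections.get(h, "") for h in headings).lower()
    return any(p.lower() in scoped for p in phrases)

work_order = {
    "Reason for Work Order": "Relocation delays caused by utility conflicts.",
    "Category Checklist": "UTIL - Delays caused by utility issues.",
}
# The checklist mention of "utility" is outside the targeted sections,
# so it cannot produce a match on its own.
print(issue_matches(work_order, ["utility conflict", "utility issue"]))
```

Restricting matching this way raises precision (boilerplate mentions are ignored) at the cost of recall when an issue is discussed outside the targeted sections, which is exactly the trade-off the evaluations below measure.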
The first test case (Tables II-A-23a and II-A-23b) evaluates the ability of the different search methods to find work orders related to utility issues. In this test case, the documents returned by the vanilla search often include a list of VDOT-specified work order categories at the end, with an accompanying box marked for each issue that applies. This list includes a VDOT-specified "UTIL" category with the description "Delays caused by utility issues." Because of the presence of this list, the vanilla search matches the search for "utility" to this description even when the utility list item does not apply to the document. Meanwhile, the rule-based search is able to identify work orders related to utility issues with 75% precision in the top 20 results. The rule-based search includes a rule that excludes this list of VDOT-specified work order categories, which, along with the other built-in rules, improves the precision of the results over the vanilla search. Including the section heading elements in the rule-based search further increases the precision to 100%; however, it may decrease recall relative to the other rule-based search (under the assumption that additional relevant results would have been found in the other rule-based search if documents beyond the top 20 were reviewed). As noted in the "Testing and Subjective Evaluation" section, the recall of the rule-based search that uses section headers could be increased by adding variations for a more complete application.

Table II-A-23a. Relevancy Criteria and Search Steps: Work Orders Related to Utility Issues

  Relevancy criteria:
    1) Document contains a work order.
    2) Work order is about or due to a utility issue.
  Vanilla search steps:
    1) Search: Utility work order
  Rule-based search steps:
    1) Type and select Utility issue in auto-suggest, then search.
    2) Select Content Type = Work order.

Table II-A-23b. Precision Metrics: Work Orders Related to Utility Issues

  Precision Metric                                                   Vanilla        Rule-Based   Rule-Based (Headings)
  Total number of results (excluding duplicates)                     974            222          16
  Position of first relevant document                                1              1            1
  Number of relevant documents in top 20 (or all, if fewer)          3              15           16
  Percentage of top 20 results (or all, if fewer) that are relevant  15%            75%          100%
  Documents needed to find 10 relevant results                       More than 30   13           10

The research team also built out rules for a number of other issues, focusing in particular on drainage, utilities, and weather issues. To develop estimates of effectiveness compared with a vanilla search, the following test case examines a set of 30 rule-based search results for each of these three work issues, specifying only the work issue and the content type (work order). It calculates the percentage of documents containing the work issue term within the document (e.g., the percentage of documents containing the term "drainage" for a drainage issue). Tables II-A-24a and II-A-24b provide the results of these searches for the rule-based search that does not include rules for the standard section headings (in order to provide a greater number of results for the evaluation). For the utilities issue search, daily reports of construction were incorrectly classified as work orders because of two sections with "utility" in the title.
Because the majority of the first 30 documents for utilities are of this type, they are not included in the count. Across the three work issues, the utilities issue has the lowest percentage of classified documents that contain the name of the issue within the document. The same pattern holds for the phrases used to build the classification rules: "utility" or "utilities" appears in 26% of the phrases used to build the utilities issue rules, "weather" appears in 31% of the phrases used to build the weather issue rules, and "drainage" appears in 53% of the phrases used to build the drainage issue rules (as specified in Annex 1). A high percentage of documents containing the search term decreases the advantage of using classification rules: a plain search on "drainage," for example, would be expected to capture a considerable number of the drainage issue documents. Meanwhile, a plain search for "utility" or "utilities" is less likely to be successful in comparison with the rule-based search. Further building out the rules to include additional phrases that do not contain the "title" word would increase this benefit of using a rule-based search.
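The phrase statistics above (26%, 31%, 53%) can be reproduced for any rule set by counting the share of classification phrases that contain the issue's title word. The phrase list below is an illustrative stand-in for the Annex 1 rules:

```python
# Sketch of the phrase statistic reported above: the share of an
# issue's classification-rule phrases that contain the issue's "title"
# word. The phrase list is illustrative, not the Annex 1 rule set.

def title_word_share(phrases, title_words):
    hits = sum(any(w in p.lower() for w in title_words) for p in phrases)
    return round(100 * hits / len(phrases))

drainage_phrases = ["erosion control", "drainage problem", "storm drain",
                    "drop inlet", "drainage structures", "water ponding"]
# The phrases without the title word ("erosion control", "drop inlet",
# "water ponding") are exactly the ones a plain keyword search misses.
print(title_word_share(drainage_phrases, ["drain"]))
```

The lower this share, the larger the recall advantage of the rule-based search over a plain keyword search on the title word.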
Table II-A-24a. Relevancy Criteria and Search Steps: Work Orders that Directly Reference the Work Issue

  Relevancy criteria:
    1) Document is a work order.
    2) Document is tagged with a specified work issue.
  Rule-based search steps:
    1) Type and select the desired issue in auto-suggest, then search.
    2) Select Content Type = Work order.
    3) Search the document to see whether the issue label appears within the document text: "drainage" for a drainage issue, "utility" for a utilities issue, and "weather" for a weather issue.

Table II-A-24b. Metrics: Work Orders that Directly Reference the Work Issue

  Metric                                                                 Drainage   Utilities   Weather
  Number of documents examined                                           30         30          30
  Number of documents containing text of specified issue within
  document (e.g., "drainage" is found within a document tagged
  with "drainage issue")                                                 28         22          27
  Percent of documents examined containing text of specified issue       93%        73%         90%

The following test cases look more closely at weather and drainage issues. Tables II-A-25a and II-A-25b consider work orders related specifically to weather issues. In this test case, the rule-based search that includes the section heading rules is the only search with high precision. Again, the difference in the total number of results is noticeable, as the rule-based search with heading rules returns less than 1% of the total number of results of the vanilla search. Reading through all the documents in the vanilla search would take significant effort, but might find more work orders related to weather issues than either version of the rule-based search. If the search rules were fully built out, this expected recall gap would decrease. Notably, without the section heading rule, the rule-based search catches phrases that would apply to weather (e.g., "weather conditions") but that appear in the document outside of the work order form.
Variation in how the agency provides these forms (e.g., supplemental information in the document sometimes relates specifically to the work order purpose and sometimes relates to other aspects of the document, such as materials instructions) requires complex rules to increase precision in these cases. Fully built-out rules would attempt to identify and address these subtleties in order to capture such cases while maintaining a high level of precision.

Table II-A-25a. Relevancy Criteria and Search Steps: Work Orders Related to Weather Issues

  Relevancy criteria:
    1) Document contains a work order.
    2) Work order is about or due to a weather issue.
  Vanilla search steps:
    1) Search: Weather work order
  Rule-based search steps:
    1) Type and select Weather issue in auto-suggest, then search.
    2) Select Content Type = Work order.
Table II-A-25b. Precision Metrics: Work Orders Related to Weather Issues

  Precision Metric                                                   Vanilla   Rule-Based   Rule-Based (Headings)
  Total number of results (excluding duplicates)                     1,810     48           12
  Position of first relevant document                                1         1            1
  Number of relevant documents in top 20 (or all, if fewer)          8         11           12
  Percentage of top 20 results (or all, if fewer) that are relevant  40%       55%          100%
  Documents needed to find 10 relevant results                       36        19           10

The next test case (Tables II-A-26a and II-A-26b) examines daily work reports about drainage issues. This test case includes two different vanilla search specifications; the second expands on the first with a number of additional terms beyond "drainage," in an attempt to construct a search that accounts for synonyms in a way similar to the rule-based search. Because the first vanilla search only searches for "drainage," a number of its result documents include a section on drainage containing the phrase "there was no activity on drainage items on this date." The rule-based search has a more limited number of results, but is highly accurate within them. The first vanilla search has a high number of results, but would require considerable effort to read through in order to find documents related to drainage. It is also likely that fully built-out search rules would increase the number of documents found by the rule-based search by including additional synonym phrases. Since the first vanilla search has low precision and a high number of results, the second version adds a number of drainage terms that were used to build the rule-based search rules.
Because the software used in the pilot limits the number of characters in a search (200 in an advanced search), this search mostly uses keywords instead of the full phrases used to improve precision in the rule-based search (e.g., the FAST search includes "erosion," while the rule-based search includes rules for phrases such as "erosion control," "erosion problem," and "excessive erosion"). This second test actually increases the total number of results to the full set of daily work reports, because "daily work report" is the only "required" phrase; the vanilla search has no way to specify that "daily work report" must appear in addition to one of the other phrases. Although the research team anticipated that this specification would increase the relevancy of the top results, precision actually decreases from the first to the second vanilla search.
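The intended semantics of the second vanilla query, which the pilot's engine evidently did not enforce (it required only the ALL phrase and treated the ANY phrases as optional), would be a conjunction of the ALL and ANY clauses. A sketch of that strict interpretation:

```python
# Sketch of strict ALL/ANY query semantics: every ALL phrase must
# appear, and at least one ANY phrase must appear. This approximates
# the query form used in the second vanilla search, not the pilot
# engine's actual (looser) behavior.

def matches(text, all_phrases=(), any_phrases=()):
    t = text.lower()
    has_all = all(p.lower() in t for p in all_phrases)
    has_any = (not any_phrases) or any(p.lower() in t for p in any_phrases)
    return has_all and has_any

doc = "Daily Work Report: crew regraded the diversion ditch at station 12."
print(matches(doc, all_phrases=["daily work report"],
              any_phrases=["drainage", "diversion ditch", "storm drain"]))
```

Under these semantics the second query would have returned only daily work reports that also mention a drainage term, rather than expanding to the full set of daily work reports.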
Table II-A-26a. Relevancy Criteria and Search Steps: Daily Work Reports Related to Drainage Issues

  Relevancy criteria:
    1) Document contains a daily work report.
    2) Daily work report discusses drainage issues.
  Vanilla search steps:
    Version 1
    1) Search: daily work report pm diary drainage
    Version 2
    1) Search: ALL("daily work report") ANY("drainage" "underdrain" "storm drain" "erosion" "diversion ditch" "drop inlet" "wingwalls" "storm sewer" "water ponding" "drainage problem" "drainage structures")
  Rule-based search steps:
    1) Type and select Drainage issue in auto-suggest, then search.
    2) Select Content Type = Daily work report.

Table II-A-26b. Precision Metrics: Daily Work Reports Related to Drainage Issues

  Precision Metric                                                   Vanilla Version 1   Vanilla Version 2   Rule-Based
  Total number of results (excluding duplicates)                     721                 2,436               13
  Position of first relevant document                                1                   1                   1
  Number of relevant documents in top 20 (or all, if fewer)          9                   6                   11
  Percentage of top 20 results (or all, if fewer) that are relevant  45%                 30%                 85%
  Documents needed to find 10 relevant results                       21                  25                  12

Because of the standard daily work report section on drainage, the following test case examines work orders related to drainage issues, and further limits them to a specific district (Tables II-A-27a and II-A-27b). Again, the vanilla search produces a high number of results. While it has high precision in the top 20 results, it only finds documents that specify "drainage" and would not find the same range of drainage issues as a rule-based search. The vanilla search precision is higher for work orders than for daily work reports because the "drainage" placeholder section does not exist in work orders, where drainage is discussed only when relevant.
Notably, confirming that a document is for a project in the Hampton Roads District requires an extra step in the vanilla search: looking up the project information. A fully built-out rule-based search with high precision would eliminate this step and reduce the effort required compared with the vanilla search.
As in the examples measured in Tables II-A-23b and II-A-25b, the rule-based search that uses the section headings again has 100% precision. However, the reduced recall is clear in this example, as the number of relevant documents found by this search is lower than in either the other rule-based search or the vanilla search. A fully built-out search containing the section header rule would look to capture more of the relevant documents found in the other two searches.

Table II-A-27a. Relevancy Criteria and Search Steps: Work Orders Related to Drainage Issues in Hampton Roads District

  Relevancy criteria:
    1) Document contains a work order.
    2) Work order is due to or about a drainage issue.
    3) Work order is for a project in the Hampton Roads District.
  Vanilla search steps:
    1) Search: drainage work order Hampton Roads
  Rule-based search steps:
    1) Type and select Hampton Roads District in auto-suggest, then search.
    2) Select Content Type = Work order.
    3) Select Work issue = Drainage issue.

Table II-A-27b. Precision Metrics: Work Orders Related to Drainage Issues in Hampton Roads District

  Precision Metric                                                   Vanilla   Rule-Based   Rule-Based (Headings)
  Total number of results (excluding duplicates)                     121       46           11
  Position of first relevant document                                2         1            1
  Number of relevant documents in top 20 (or all, if fewer)          15        14           11
  Percentage of top 20 results (or all, if fewer) that are relevant  75%       70%          100%
  Documents needed to find 10 relevant results                       14        15           10

The rule-based search is also able to identify VDOT-specified work order categories, such as the "UTIL" category discussed earlier. The next test case (Tables II-A-28a and II-A-28b) examines results from searching for the "CHAR" work order category, defined by VDOT as "Changes per Section 104.2 (Character of Work)." In this example, the rule-based search precisely determines where "CHAR" refers to the work order category, while the vanilla search cannot.
The rule-based search excels in this test case, where the rules can focus on a specific term ("Category:") to identify the presence of the work order category of interest.

Table II-A-28a. Relevancy Criteria and Search Steps: Work Orders with "CHAR" Work Order Category

  Relevancy criteria:
    1) Document contains a work order.
    2) The work order category is listed as CHAR.
  Vanilla search steps:
    1) Search: "Category: CHAR"
  Rule-based search steps:
    1) Type and select CHAR in auto-suggest, then search.
    2) Select Content Type = Work order.
Table II-A-28b. Precision Metrics: Work Orders with "CHAR" Work Order Category

  Precision Metric                                                   Vanilla        Rule-Based
  Total number of results (excluding duplicates)                     112            24
  Position of first relevant document                                1              1
  Number of relevant documents in top 20 (or all, if fewer)          2              20
  Percentage of top 20 results (or all, if fewer) that are relevant  10%            100%
  Documents needed to find 10 relevant results                       More than 30   10

The final test case (Tables II-A-29a and II-A-29b) extends the prior example to identify work orders on a specific project with a specified work order category. In this case, the "VDOT" work order category is used, with the idea that a search including the term "VDOT" may lower precision when searching across VDOT documents. In this test case, the vanilla search is unable to distinguish where "VDOT" appears in the document; if it finds "work order," "category," and "VDOT," it matches the document, resulting in low precision. The rule-based search has higher precision here, but it also has lower recall than the vanilla search (possibly due to fewer matches with the project number).

Table II-A-29a. Relevancy Criteria and Search Steps: Work Orders on Project Number 0337-122-F14 with "VDOT" Work Order Category

  Relevancy criteria:
    1) Document contains a work order.
    2) Work order is for project 0337-122-F14.
    3) The work order category is listed as VDOT.
  Vanilla search steps:
    1) Search: "0337-122-F14" + "Category: VDOT"
  Rule-based search steps:
    1) Type and select (FO)0337-122-F14 in auto-suggest, then search.
    2) Select Content Type = Work order.
    3) Select Work order category = VDOT.

Table II-A-29b.
Precision Metrics: Work Orders on Project Number 0337-122-F14 with "VDOT" Work Order Category

  Precision Metric                                                   Vanilla   Rule-Based
  Total number of results (excluding duplicates)                     32        6
  Position of first relevant document                                1         1
  Number of relevant documents in top 20 (or all, if fewer)          9         5
  Percentage of top 20 results (or all, if fewer) that are relevant  45%       83%
  Documents needed to find 5 relevant results                        10        6
Abbreviations and acronyms used without definitions in TRB publications:

A4A         Airlines for America
AAAE        American Association of Airport Executives
AASHO       American Association of State Highway Officials
AASHTO      American Association of State Highway and Transportation Officials
ACI-NA      Airports Council International-North America
ACRP        Airport Cooperative Research Program
ADA         Americans with Disabilities Act
APTA        American Public Transportation Association
ASCE        American Society of Civil Engineers
ASME        American Society of Mechanical Engineers
ASTM        American Society for Testing and Materials
ATA         American Trucking Associations
CTAA        Community Transportation Association of America
CTBSSP      Commercial Truck and Bus Safety Synthesis Program
DHS         Department of Homeland Security
DOE         Department of Energy
EPA         Environmental Protection Agency
FAA         Federal Aviation Administration
FAST        Fixing America's Surface Transportation Act (2015)
FHWA        Federal Highway Administration
FMCSA       Federal Motor Carrier Safety Administration
FRA         Federal Railroad Administration
FTA         Federal Transit Administration
HMCRP       Hazardous Materials Cooperative Research Program
IEEE        Institute of Electrical and Electronics Engineers
ISTEA       Intermodal Surface Transportation Efficiency Act of 1991
ITE         Institute of Transportation Engineers
MAP-21      Moving Ahead for Progress in the 21st Century Act (2012)
NASA        National Aeronautics and Space Administration
NASAO       National Association of State Aviation Officials
NCFRP       National Cooperative Freight Research Program
NCHRP       National Cooperative Highway Research Program
NHTSA       National Highway Traffic Safety Administration
NTSB        National Transportation Safety Board
PHMSA       Pipeline and Hazardous Materials Safety Administration
RITA        Research and Innovative Technology Administration
SAE         Society of Automotive Engineers
SAFETEA-LU  Safe, Accountable, Flexible, Efficient Transportation Equity Act: A Legacy for Users (2005)
TCRP        Transit Cooperative Research Program
TDC         Transit Development Corporation
TEA-21      Transportation Equity Act for the 21st Century (1998)
TRB         Transportation Research Board
TSA         Transportation Security Administration
U.S.DOT     United States Department of Transportation
Transportation Research Board, 500 Fifth Street, NW, Washington, DC 20001
Improving Findability and Relevance of Transportation Information, NCHRP Research Report 846, TRB
ISBN 978-0-309-44635-8