Currently Skimming:


Pages 66-107

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 66...
... NCHRP Web-Only Document 279: Information Findability Implementation Pilots at State Transportation Agencies (p. 59)
3.0 Utah DOT Findability Tests
3.1 Planning and Scoping
Round 1
Utah DOT was interested in the NCHRP 20-97 project because they were in the process of implementing a new information indexing and discovery tool called "Knowvation," provided by PTFS. This tool includes full text search, faceted search, and spatial (map-based)
From page 67...
... • Environmental Analysis: Find resource reports to inform the analysis – for example, any previously conducted cultural resource or wetlands report as part of a federal or state environmental document.
• Public Records Requests: Respond to requests for information from the public – information may include accident reports, as-builts, environmental reports, traffic control plans, signal timing, old concept reports, and project-related emails.
From page 68...
... • Location – the Knowvation product includes basic geocoding capabilities (e.g., to look up city and county names); we can consider supplementing these capabilities by adding the ability to pull out routes and street names.
From page 69...
... 3.2 Content Collection and Analysis
Round 1
Content Collection
UDOT staff downloaded content from the Region 2 file drive and ProjectWise. We requested that only text files (TXT, DOC, DOCX, PDF, RTF)
From page 70...
... Figure 11. Frequently occurring content types.
From page 71...
... Figure 13. Utility agreement types.
From page 72...
... registered IFilter for Microsoft Windows – and then reindex existing files (a time-consuming process)
From page 73...
... process of revision, so some of the equivalent WSDOT manuals were unavailable or incomplete. We "chunked" the UDOT manuals into chapters and ran a cluster analysis using the same technique (k-means analysis)
From page 74...
... In general, we found that UDOT's manuals had fewer topic overlaps across manuals than WSDOT's did. We also found that nine of the 14 clusters matched across the two states.
From page 75...
... scikit-learn modules within Python. NLTK is a Python module that provides a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
From page 76...
... CountVectorizer module, which creates a simple "bag of words" term frequency matrix, and a "term frequency-inverse document frequency" (tf-idf) matrix.
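The two matrices named in this excerpt can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's code and does not reproduce scikit-learn's exact weighting: it assumes the textbook formula idf = ln(N / df), and the function names (tokenize, bag_of_words, tf_idf) and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by the textbook idf = ln(N / df).

    Assumption: scikit-learn applies smoothing and normalization by default,
    so its numbers would differ from these.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Hypothetical sample texts, for illustration only.
docs = [
    "utility agreement for utility relocation",
    "drainage agreement",
    "meeting agenda and minutes",
]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

In this sketch a term that appears in many documents (such as "agreement") receives a lower weight than a term concentrated in one document (such as "utility"), which is the property that makes tf-idf useful as input to clustering.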
From page 77...
... • Determine the character block (number of characters in a representative document) that the script will need to examine.
From page 78...
... Figure 16. Maintenance Agreement Example 1.
From page 79...
... Figure 17. Maintenance Agreement Example 2.
From page 80...
...
Character String              Score   Character Block
subject                       -3      700
sincerely                     -3      500
request for proposals         -10     700
Easement                      -5      600
Checklist                     -5      600
Status Report                 -5      600
Inspector's Daily Report      -5      600
Mitigation Bank               -5      600
Standards and References      -5      600
Change Order                  -5      600
Environmental Study           -5      600
Manual of Instruction         -5      600
Quitclaim Deed                -5      600
Prospectus                    -5      600
Research Problem Statement    -5      600

Note that there are positive and negative scores. The positive scores indicate that there is positive evidence that the document examined is a maintenance agreement.
From page 81...
... Table 14. Utility Agreement Rules
Character String                  Score   Character Block
Utility Agreement                 10      500
Utility Reimbursement Agreement   10      500
Utility Relocation Agreement      10      500
Agreement                         5       500
utility work                      10      1000
utility relocations               10      1000
betterment                        -10     1000
made and entered into             9       700
whereas                           5       700
agenda                            -9      700
meeting                           -8      700
minutes                           -7      700
attendees                         -5      700
from:                             -5      700
to:                               -5      700
subject                           -3      700
sincerely                         -3      700
request for proposals             -10     700

Table 15.
From page 82...
...
Character String          Score   Character Block
minutes                   -7      700
attendees                 -5      700
from:                     -5      700
to:                       -5      700
subject                   -3      700
sincerely                 -3      700
request for proposals     -10     700

Table 16. Cooperative Agreement Rules
Character String          Score   Character Block
Cooperative Agreement     10      700
Cooperative               7       500
Agreement                 5       500
Drainage                  -7      1000
Drainage Agreement        -10     700
made and entered into     9       500
whereas                   4       500
agenda                    -9      700
meeting                   -8      700
minutes                   -7      700
attendees                 -5      700
from:                     -7      700
to:                       -7      700
subject                   -3      700
sincerely                 -3      700
request for proposals     -10     700
From page 83...
... Table 17. Drainage Agreement Rules
Character String          Score   Character Block
Cooperative Agreement     10      700
Cooperative               7       500
Agreement                 5       500
Drainage                  -7      1000
Drainage Agreement        -10     700
made and entered into     9       500
whereas                   4       500
agenda                    -9      700
meeting                   -8      700
minutes                   -7      700
attendees                 -5      700
from:                     -7      700
to:                       -7      700
subject                   -3      700
sincerely                 -3      700
request for proposals     -10     700

Rule-Based Classification: Description of the Python Code Logic
Rule-based classification was accomplished by applying weights, both positive and negative, to phrases found in prescribed character blocks in a document's text.
From page 84...
... In this rule, "Betterment Agreement" is the phrase, 10 is the assigned weight, and 700 represents the character block (i.e., the first 700 characters of the document text)
From page 85...
... extracts the subset of the text according to the character block setting. If the character block is assigned the value of 0, then the entire text is used.
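The scoring logic described in these excerpts (a phrase, a weight, and a character block, with a block of 0 meaning the entire text) might look roughly like the sketch below. This is an illustration, not the project's script. The rule values are a subset of the Utility Agreement rules in Table 14; the function name score_document, the case-insensitive matching, and the choice to count each phrase at most once are assumptions.

```python
# Each rule is (phrase, weight, character_block); values are a subset of
# the Utility Agreement rules listed in Table 14.
UTILITY_RULES = [
    ("Utility Agreement", 10, 500),
    ("made and entered into", 9, 700),
    ("whereas", 5, 700),
    ("agenda", -9, 700),
    ("meeting", -8, 700),
    ("request for proposals", -10, 700),
]


def score_document(text, rules):
    """Sum the weights of all rules whose phrase occurs in its character block.

    Assumptions: matching is case-insensitive and each phrase counts once,
    no matter how many times it occurs.
    """
    lowered = text.lower()
    total = 0
    for phrase, weight, block in rules:
        # A character block of 0 means the entire text is examined.
        window = lowered if block == 0 else lowered[:block]
        if phrase.lower() in window:
            total += weight
    return total


# Hypothetical sample texts, for illustration only.
agreement = (
    "UTILITY AGREEMENT\n"
    "This agreement, made and entered into by the parties, "
    "WHEREAS the work requires relocation of facilities"
)
memo = "Meeting agenda: discuss utility agreement status"

agreement_score = score_document(agreement, UTILITY_RULES)
memo_score = score_document(memo, UTILITY_RULES)
```

Under these assumptions the agreement text accumulates only positive weights, while the memo's "agenda" and "meeting" penalties outweigh its single positive match. How the total is turned into a final classification (for example, a cutoff score) is not stated in these excerpts, so the sketch stops at the score.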
From page 86...
... Table 18. Machine Learning vs.
From page 87...
... documents in the training set and increases the probability of covering all the patterns that drive a match. For the rule-based classification, improving the accuracy of results requires analyzing false positives.
From page 88...
... • Project identified as
Example PIN patterns observed were:
• PIN:
• PIN :
• PIN.:
• PIN.
From page 89...
... example illustrates that the date is frequently not included in the agreement. Also, Project ID is listed here as "Project No.", and PIN is listed simply as "PIN" without including a colon.
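The label variants noted in these excerpts ("PIN:", "PIN :", "PIN.:", "PIN.", and a bare "PIN") can be covered by a single regular expression. The sketch below is illustrative only: the excerpts do not show the format of the PIN value itself, so treating it as a run of digits is an assumption, and the names PIN_PATTERN and extract_pin are hypothetical.

```python
import re

# Label "PIN", then an optional period and an optional colon, with optional
# whitespace between the pieces. Assumption: the value is a run of digits.
PIN_PATTERN = re.compile(r"\bPIN\s*\.?\s*:?\s*(\d+)")


def extract_pin(text):
    """Return the first PIN value found in the text, or None if absent."""
    match = PIN_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical sample strings, one per label variant.
samples = ["PIN: 12345", "PIN : 12345", "PIN.: 12345", "PIN. 12345", "PIN 12345"]
extracted = [extract_pin(s) for s in samples]
```

The leading word boundary keeps the pattern from firing inside longer words that happen to end in "PIN". Real documents, especially scanned ones with OCR errors, would likely need further variants.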
From page 90...
... Figure 19. Different patterns of Project ID, PIN, Date, and Party.
From page 91...
... Table 19. Text Extraction Logic

Project ID
  Extraction logic: Values are extracted by pattern matching with the known formats of these values using standard regular expressions
  Maintenance 19%, Betterment 18%, Cooperative 23%, Drainage 5%, Utilities 74%

PIN
  Extraction logic: Values are extracted by pattern matching with the known formats of these values using standard regular expressions
  Maintenance 15%, Betterment 40%, Cooperative 45%, Drainage 11%, Utilities 72%

Party
  Extraction logic: Party with whom UDOT signed the agreement
  Maintenance 30%, Betterment 30%, Cooperative 20%, Drainage 74%, Utilities 26%

Date
  Extraction logic: The first date to appear in the document that is not part of a sentence is extracted as the date of the document
  Maintenance 4%, Betterment 21%, Cooperative 5%, Drainage 26%, Utilities 5%
From page 92...
...
Tax ID
  Extraction logic: Values from Drainage Agreements extracted by pattern matching with the known formats of these values using standard regular expressions
  Maintenance NA, Betterment NA, Cooperative NA, Drainage 30%, Utilities NA

Round 2
The round 2 analysis used the same tools as round 1. The NLTK (Natural Language Toolkit)
From page 93...
... Concept Report
Concept Reports were identified by examining each provided text file for positive matches to the Concept Report Rules. The sample Concept Report in Figure 20 shows that the phrase "Concept Report" occurs at the beginning of the document, within the first 700 characters.
From page 94...
... Figure 21. Design Exception example.
From page 95...
... Figure 23. Quit Claim Deed example.
From page 96...
... Table 21. Design Exception Rules
Phrase               Weight   Character Block
Design Exception     7        700
Design Exceptions    7        700
Design Waiver        5        700
Design Waivers       5        700

Table 22.
From page 97...
... Table 25. Exception Rule for Emails
Phrase      Weight   Character Block
From        0        100
To          0        250
Subject     0        400
utah.gov    0        150

Results
After testing the rules and iteratively improving them based on the test results, the rule sets were run against the UDOT document corpus of text files.
From page 98...
... packages that contained approximately 50,000 documents each. Processing time and cost figures are shown in Figure 25.
From page 99...
... date, grantor, and grantee. Because not every entity is present in every type of document, each document type was run through its own script designed to extract only the entities relevant to that document type.
From page 100...
... Design Exceptions are highly standardized documents, and as such are suited to this type of entity extraction. They also tend to be clean, "born digital" documents which are not subject to OCR errors.
From page 101...
... Figure 26. Extracted entities and ProjectWise metadata comparison.
From page 102...
... Table 28. UDOT Implementation Plan Overview
Task   Explanation
1.
From page 103...
...
Task                                            Explanation
7. Refine and test entity extraction results    Refine the rules to improve the accuracy of entity extraction.
From page 104...
... 2. Identify roles for continuing improvement of data and information findability at UDOT
At UDOT there are owners designated for individual content repositories, but there is no clear responsibility for an agency-wide, cross-silo approach to information findability.
From page 105...
... "data", "studies", "legal documents" and "plans". Sub-types under each category should be established based on reviewing existing document classification schemes or folder structures and analyzing samples of documents from the various repositories.
From page 106...
... 7. Refine and test entity extraction results
The entity extraction rules need more work to increase the program's ability to recognize and extract entities of interest (e.g.
From page 107...
... Roles and Responsibilities
The following list of roles can be used as a starting point for assigning implementation tasks to groups and individuals.
• Business Champion – UDOT manager responsible for leading initiatives to continue improving enterprise search capabilities, identifying resources for improvements, and ensuring value added for UDOT.
