4.0 Iowa DOT Findability Tests

Suggested Citation: "4.0 Iowa DOT Findability Tests." National Academies of Sciences, Engineering, and Medicine. 2020. Information Findability Implementation Pilots at State Transportation Agencies. Washington, DC: The National Academies Press. doi: 10.17226/25883.

4.1 Planning and Scoping

Iowa has an Electronic Records Management System (ERMS) that has long served as the agency's official system of record for construction project records. The agency is currently in the process of implementing a replacement system. While some metadata is available for documents in the ERMS, the metadata is basic and there is a large backlog of documents to be processed. Metadata assignment is time consuming.

IADOT has other content management systems in addition to the ERMS. ProjectWise (a Bentley product) was introduced about 10 years ago. It is used during the project design process to store design files and related correspondence; its advantage is its integration with design software (MicroStation). DocExpress (an Info Tech product) was introduced in 2007-08. It is designed to support eConstruction and to facilitate secure document exchange with external partners. DocExpress includes workflow and electronic signature capabilities and is used by IADOT's Materials & Construction and Local Assistance units.

All project plans go into the ERMS at the point of letting. However, subsequent changes in plans (including as-builts) and other project documents are not necessarily included. An effort is made to ingest authoritative copies of files from ProjectWise and DocExpress into the ERMS, but this requires considerable manual effort because of the lack of metadata and the need to determine whether documents are already in the ERMS. ProjectWise documents generally follow strict naming conventions, which facilitates this process, but DocExpress documents do not. Since data validation is not required as part of the file naming process for DocExpress, few metadata values are available for repurposing by the ERMS. This in turn requires manual intervention to review, classify, and store the documents in the ERMS. Enhancing automation methods to classify the files, recognize specific document types, and assign metadata within the framework used by DocExpress would reduce or eliminate the ERMS input and indexing time while improving document findability.

The DocExpress content is unstructured and consists primarily of multi-page PDFs. The hosting system is a file-based system with limited metadata. Most documents would require a keyword or data value to be used to identify, classify, and store the document with its corresponding metadata. An API is available to extract the content; however, there is no primary key field that allows for matching data between systems.

Discussions with IADOT staff led to agreement that the IADOT pilot would test methods for auto-classification and metadata extraction for plan and proposal documents from DocExpress. This would address IADOT's current pain points related to registering documents from DocExpress in the ERMS and ensuring that the ERMS has the most authoritative project information. Auto-classifying documents and extracting key metadata elements should help to streamline the process of checking whether the DocExpress contents have already been registered within the ERMS. IADOT was interested in extracting PINs, project numbers, and work types. While project locations were also important, IADOT maintains good data on the locations associated with each project.

If a project number is available, a data service can be used to identify the location. IADOT explained some of the nuances of project data:

• A proposal may include several projects, each with a different work type; in these cases, IADOT's practice is to make multiple copies of the proposal so that there is one copy for each of the included projects.
• A plan generally includes only one project and one work type.
• There are multiple attributes describing what work is being done in a project. In addition to the work type shown in the plan or proposal, IADOT's Project Scheduling System (PSS) includes the data elements WORK_GROUP (e.g., Pavement Rehabilitation) and WORK_DESCRIPTION (e.g., HMA Resurfacing/Cold-in-Place Recycling). It may be of interest to compare these items with the work type extracted from the plans and proposals.

4.2 Content Collection and Analysis

Content Collection

Iowa DOT's content was obtained during a site visit. A total of 149,915 files were obtained, including 120,900 PDFs and roughly 5,000 TIFFs. The remaining files were HTML and other miscellaneous file types. A substantial number of files (over 70,000) were identified as not text readable; they were image files to which OCR had never been applied. We discussed this finding with our IADOT contacts and agreed to apply OCR to only a sample of files in order to demonstrate the process, rather than converting all 70,000 files (a very time-intensive process).

In addition, IADOT provided a JSON export file from PSS. This file was used as a master source of PINs and project numbers for a lookup method of entity extraction, in which documents are searched for known PINs and project numbers.

Content Preparation

TIFF to text conversion

We first attempted a direct TIFF to text conversion in Python using the 'pytesseract' package. While occasionally successful, this package threw encoder errors on the vast majority of the files we attempted to convert. The workaround was to first convert the files to PDF using the 'img2pdf' package, then use a PDF to text converter.

"Fast" PDF to text conversion

PyPDF2 extracts text data from PDFs one page at a time. Text extraction works well, but formatting information tends to get lost. It is well suited to classifying documents, but not necessarily to extracting information from them. We found that this method performed well at extracting project numbers from the IADOT documents but was unable to reliably extract PINs from plans because of the way the converter handles white space.
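As a minimal sketch of the conversion pipeline just described (not the pilot's actual script), the TIFF-to-PDF workaround and the "fast" extraction might look like the following. It assumes the img2pdf and PyPDF2 packages are installed; note that PyPDF2's class and method names have changed across versions (recent releases use PdfReader and extract_text).

```python
import img2pdf
from PyPDF2 import PdfReader  # "fast" extractor; API names vary by PyPDF2 version

def tiff_to_pdf(tiff_path: str, pdf_path: str) -> None:
    # Workaround for pytesseract encoder errors: wrap the TIFF in a PDF first.
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(tiff_path))

def fast_pdf_to_text(pdf_path: str) -> str:
    # Extract text one page at a time; formatting information is largely lost.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

tiff_to_pdf("scan_0001.tif", "scan_0001.pdf")  # hypothetical file names
text = fast_pdf_to_text("scan_0001.pdf")
```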

"Slow" PDF to text conversion

Pdfminer.six also extracts text data from PDFs, obtaining the exact location of characters on the page. This makes it around 30 times slower than the "fast" method on average, but the greater fidelity allows more accurate text extraction: the faster method extracts only around 70% as many project numbers as the slower method. The slower method can also extract PINs. Since our objective was both to perform classification and to consistently and accurately extract project information from these files, including PINs, the slower method was used. Text conversion took four days of processing time (two computers running for two days each).

OCR Processing

As noted above, we applied OCR processing to a sample of 422 files from the original 70,000. In addition, during the course of the auto-classification analysis we identified a substantial number of plans (~390 files) that initially appeared to be text readable (based on their size following text conversion) but were in fact only partially text readable. Typically, the first page of the document (containing the title block) was not text readable, but subsequent pages were. This impacted the performance of our auto-classification rules, which depended on text typically found on the title page of the plan. We applied OCR to these files as well. The resulting larger corpus contained 45,634 files.

Because of spacing issues common to OCR-processed files, we applied some preprocessing to address OCR errors that affected the classification and entity extraction rules. The most common error was the introduction of an extra space, either between words or within words. The next most common error was substituting a capital O for a zero, which can make it difficult to extract the correct project ID, as "0" is a frequent character. Other common swaps are shown in Table 29.

Table 29. Common OCR Character Errors

Real Character   Character(s) as Read by OCR
0                O, D, Q, o
)                L, >, } [sometimes missed by OCR]
(                C, c, <, { [sometimes missed by OCR]
5                s
9                q, CJ

These swaps are based solely on examining project IDs extracted from plans. Many more swaps certainly exist that would cause problems for information extraction efforts.
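A rough sketch of the "slow" conversion and the OCR cleanup described above follows. The substitution table is abridged from Table 29, and the cleanup shown is deliberately naive: a real implementation would apply swaps only where a digit is expected, rather than to every character.

```python
import re
from pdfminer.high_level import extract_text  # the "slow", layout-aware extractor

# Abridged from Table 29: characters that OCR commonly substitutes for digits.
OCR_SWAPS = {"O": "0", "D": "0", "Q": "0", "o": "0", "s": "5", "q": "9"}

def slow_pdf_to_text(pdf_path: str) -> str:
    return extract_text(pdf_path)

def normalize_id(candidate: str) -> str:
    # Drop stray spaces introduced by OCR, then undo common character swaps.
    candidate = re.sub(r"\s+", "", candidate)
    return "".join(OCR_SWAPS.get(ch, ch) for ch in candidate)

print(normalize_id("NHSX-O3 0-5(1 2 O)"))  # -> "NHSX-030-5(120)" (made-up ID)
```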

4.3 Solution Development and Testing

Auto-classification

A rule-based approach was applied to classify the documents. Following the rule-based analysis, a machine learning model was created to provide a comparison of effort as well as accuracy. Figures 27 and 28 show examples of typical IADOT proposals and plans; red boxes highlight the distinguishing text used for auto-classification.

Figure 27. IADOT Proposal example.

Figure 28. IADOT Plan example.

Rule Development

Two rounds of analysis were conducted. In the first round, a 2-category classification was performed to identify plans and proposals. Preliminary results from this first round identified several types of documents being classified as plans that were not plans but were related to plans, i.e., documents that might commonly be included in a package with the plan. These included cross-section diagrams, plan revisions, and addenda. A second analysis was then conducted to perform a 3-category classification distinguishing plans, plan-related documents, and proposals.

For the 2-category classification, proposals were identified via the presence of two terms: "ESTIMATING PROPOSAL" and "Proposal ID No.". Plans were identified via the presence of a single term, "SHEET NUMBER", which is repeated on almost every page of a plan document.

For the 3-category classification, plans were identified as documents containing all three of the following terms: "SHEET NUMBER", "My license renewal date", and "PLANS OF PROPOSED IMPROVEMENT". Plan-related documents were identified by the presence of one or two (but not all three) of these terms. Criteria for identifying proposals were the same as for the 2-category classification.

The 2-category classification identified 849 plans and 734 proposals. The 3-category classification found 540 plans, 326 plan-related documents, and 734 proposals.
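The rules above reduce to simple term tests on the converted text. A minimal sketch follows; the pilot's actual scripts may be structured differently.

```python
PLAN_TERMS = ("SHEET NUMBER", "My license renewal date",
              "PLANS OF PROPOSED IMPROVEMENT")

def classify_2cat(text: str) -> str:
    if "ESTIMATING PROPOSAL" in text and "Proposal ID No." in text:
        return "proposal"
    if "SHEET NUMBER" in text:
        return "plan"
    return "other"

def classify_3cat(text: str) -> str:
    if "ESTIMATING PROPOSAL" in text and "Proposal ID No." in text:
        return "proposal"
    hits = sum(term in text for term in PLAN_TERMS)
    if hits == 3:
        return "plan"          # all three plan terms present
    if hits >= 1:
        return "plan-related"  # one or two of the three terms present
    return "other"
```

One consequence of this rule structure is that a genuine plan missing any one of the three 3-category terms (for example, because a page was not OCR-processed) falls into the plan-related bucket, which is exactly the failure mode reported in the testing below.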

Recall and Precision Testing

Results of the recall and precision testing are shown in Table 30. The methodology for these tests and an analysis of the results are provided below.

Table 30. Iowa DOT Recall and Precision Results for Rule-Based Approach

                             Plans            Proposals        Plan-related
2-Category Classification
  Recall                     89/90 (99%)      97/98 (99%)      --
  Precision                  89/100 (89%)     100/100 (100%)   --
3-Category Classification
  Recall                     70/90 (78%)      97/98 (99%)      15/30 (50%)
  Precision                  100/100 (100%)   100/100 (100%)   36/100 (36%)

Recall

We selected a random subsample of 100 documents likely to be plans and 100 documents likely to be proposals. These documents were chosen by the presence of a string in the filename: "Plans" likely indicated plans, and "Proposal" likely indicated proposals. Each of these source documents was inspected by eye to confirm that it was in fact the expected document type. After removing incorrect selections, we were left with 90 plans and 98 proposals. For the 3-category classification, we assembled a sample for the "plan-related" category by searching filenames for "Cross_Section", "Revision", and "Addendum". We then inspected these files to confirm that they were plan-related documents. After removing incorrect selections, we were left with 30 plan-related documents.

To test recall, we checked for the presence of the filename of each document in our random subsample in the 'filename' field of the output of the characterization algorithm.

For the 2-category classification, 89 of the 90 plan documents in the independent sample were correctly classified as plans; the one miss was due to OCR quality. Ninety-seven of the 98 proposal documents in the independent sample were correctly classified as proposals. The document that was not correctly classified had a sideways page orientation, and the OCR rendered the text backwards.

For the 3-category classification, 70 of the 90 plan documents in the independent sample were correctly classified as plans. The remaining 20 plans were all classified as plan-related: at least one of the three necessary terms was misread due to OCR issues. Fifteen of the 30 plan-related documents were correctly classified. All of the misclassifications were due to OCR issues; some relevant pages were not OCR-processed, or the OCR quality was not good enough, and as a result none of our identifying phrases were extracted. Recall results for proposals were the same as for the 2-category classification.
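The recall check described above amounts to set membership on filenames. A toy sketch, with made-up filenames standing in for the hand-verified sample:

```python
# Hand-verified sample of known plans (hypothetical filenames).
verified_plans = {"12345_Plans.pdf", "67890_Plans.pdf"}

# Output of the classification run: (filename, assigned category) per document.
classifier_output = [("12345_Plans.pdf", "plan"),
                     ("67890_Plans.pdf", "plan-related")]

classified_as_plan = {name for name, cat in classifier_output if cat == "plan"}
recall = len(verified_plans & classified_as_plan) / len(verified_plans)
print(f"Recall: {recall:.0%}")  # -> Recall: 50% for this toy data
```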

Precision

We chose a random subsample of 100 documents categorized as each document type (for both the 2-category and 3-category classification runs). We inspected each document to determine whether it was categorized correctly.

For the 2-category classification, all 100 of the proposals were correctly categorized. Eighty-nine of the 100 plans were correctly categorized. Of the 11 that were mis-categorized, three were plan cross-sections, four were plan revisions, four were updated plan sheets, and one was a plan addendum; all were plan-related.

For the 3-category classification, all 100 proposals were correctly classified, as were all 100 plans. Only 36 of the 100 plan-related documents were correctly classified. Sixty-one of the misclassified documents were in fact plans; these were not classified as plans because OCR issues rendered one or more of the three identifying phrases unreadable. The other three were maintenance work plans (e.g., bridge cleaning plans), which contain the phrase "My license renewal date".

Machine Learning Classification Test

Two machine learning approaches were tested to see whether they could meet or exceed the performance of the rule-based approach: Naïve Bayes and Cosine Similarity.

Naïve Bayes

Using the same methodology described above for the classification of maintenance agreements for Utah DOT (round 1 test), a Naïve Bayes approach was applied. Two distinct models were created: one that classifies plans, and a second that classifies proposals. The final plan classification model used a training set consisting of 100 known plans and 600 other documents (not plans). Similarly, the final proposal classification model used a training set of 100 proposals and 600 other documents. There were three rounds of model development and testing; misclassified documents from the second round were added to the training sets for the final round.

Cosine Similarity

A second machine learning algorithm, Cosine Similarity, was also tested. Mathematically, the Cosine Similarity algorithm measures the cosine of the angle between two vectors of word counts, one for each of the two documents being compared. One document represents a single known example of the document type to be classified (i.e., a plan or proposal), or a concatenated set of such examples. This exemplar is then compared to the other documents in the corpus. The algorithm produces a score between 0 and 1 for each document in the corpus, representing its degree of similarity to the known example (or compilation); the score can be compared to a threshold value for classification. For the IADOT classification, we used cutoff values of 0.777 for proposals and 0.5 for plans, selected based on spot checks of documents.
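A sketch of the Cosine Similarity classifier is shown below. Using scikit-learn's CountVectorizer to build the word-count vectors is our assumption; the pilot may have constructed them differently. The exemplar document and the cutoff values follow the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_scores(exemplar: str, corpus: list[str]) -> list[float]:
    # Word-count vectors for the known example and every corpus document,
    # scored by the cosine of the angle between each pair of vectors.
    X = CountVectorizer().fit_transform([exemplar] + corpus)
    return cosine_similarity(X[0], X[1:])[0].tolist()

def classify(exemplar: str, corpus: list[str], cutoff: float) -> list[bool]:
    # Pilot cutoffs: 0.777 for proposals, 0.5 for plans.
    return [score >= cutoff for score in cosine_scores(exemplar, corpus)]
```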

Machine Learning Results

The results of the 2-category rule-based classification were used to evaluate the models. Recall was computed as the ratio of the number of correctly classified documents to the number of documents identified by the rule-based approach. Precision was computed as the ratio of the number of correctly classified documents to the total number of classified documents (including false positives).

Results of the machine learning models are shown in Table 31. The Naïve Bayes algorithm had higher recall but lower precision than Cosine Similarity. Compared with the rule-based approach (see Table 30), the machine learning approaches had lower recall for both plans and proposals, better precision for plans, and comparable precision for proposals.

Table 31. Iowa DOT Recall and Precision Results for Machine Learning Algorithms

                      Plans           Proposals
Naïve Bayes
  Recall              783/849 (92%)   717/734 (98%)
  Precision           783/845 (93%)   717/744 (96%)
Cosine Similarity
  Recall              659/849 (78%)   664/734 (90%)
  Precision           659/678 (97%)   734/744 (99%)

Entity Extraction

We created scripts to extract PINs and project numbers from plans, and project numbers and type of work from proposals. For project numbers and PINs, we used two different methods: regular expressions and lookup. The regular expressions method searches the document for a key word (e.g., "PIN:") and then extracts the characters following that key word. The lookup method uses an independent list of project numbers and PINs (obtained from IADOT's PSS) and finds any instances of these known values in the document. We used regular expressions only for type of work.

The regular expressions method is faster and more forgiving of variations in formatting; however, it tends to yield duplicates with slight variations (e.g., a project number with and without a leading 0). The lookup method is slower but obtains cleaner results, because every match comes from a master list. After discussing the initial results with the IADOT contacts, we decided to use a combination of the two techniques, with post-processing logic to choose the most accurate results from the two methods.
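The two extraction methods might be sketched as follows. The regular expression and the sample master list are illustrative assumptions; the pilot's actual patterns for IADOT project numbers are not reproduced here.

```python
import re

def extract_pins_regex(text: str) -> list[str]:
    # Regular expressions method: find the key word "PIN:" and capture the
    # characters that follow it (the digit pattern is an assumption).
    return [m.strip() for m in re.findall(r"PIN:\s*([\d\- ]+)", text)]

def extract_known_ids(text: str, master_list: list[str]) -> list[str]:
    # Lookup method: search the document for every known project number or
    # PIN from the master source (here, IADOT's PSS export).
    return [pid for pid in master_list if pid in text]

master = ["NHSX-030-5(120)", "STP-061-2(45)"]   # hypothetical PSS numbers
doc = "PIN: 12-34-567 ... NHSX-030-5(120) ..."  # hypothetical document text
print(extract_pins_regex(doc))                  # -> ['12-34-567']
print(extract_known_ids(doc, master))           # -> ['NHSX-030-5(120)']
```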

Results of the entity extraction exercise are reported below for two different sets of documents: the original corpus prior to adding OCR-processed documents, and the final corpus including the OCR-processed documents.

Original corpus:

We were able to directly extract single project numbers from 531 plan and plan-related documents via the lookup method. We supplemented the lookup with a regular expression search to extract additional project numbers. In some cases, the extracted project numbers required post-processing to adhere to the format of a proper project number; most commonly, a "0" had been rendered as a capital "O", or a leading "0" had been dropped from a portion of the number. Some 11 project numbers had to be adjusted manually. After this significant post-processing effort, we were able to match a project number to 685 plans or plan-related documents.

We were also able to extract 316 PINs from these plan and plan-related files. We do not expect to find a PIN in every plan; we found, for instance, that 136 plans whose plan numbers began with "MP" or "MPIN" (maintenance projects) had no PIN. We extracted project numbers (via the lookup method) and work type (via regular expression matching) from all proposals we identified.

Expanded corpus:

We were able to directly extract single project numbers from 590 plan and plan-related documents via the lookup method. Attempting to reliably extract project numbers from the remaining files would be even more labor-intensive than in the original corpus because of the variety of errors OCR can introduce. We were also able to extract 448 PINs from these plan and plan-related files. We extracted project numbers from 693 of the 734 proposal files via the lookup method; these files are cleaner and easier for Acrobat to scan. We were able to extract the work type for 731 of these files via regular expression matching.

Results

Results of the entity extraction are summarized in Table 32. Because proposals were uniform, extraction results for them were excellent, although the addition of some OCR-processed proposals reduced the success rate. Results were not as good for plans, which are less uniform and include a higher percentage of OCR-processed documents than proposals (in both the initial and expanded corpus).

Table 32. IADOT Entity Extraction Results

                  Initial Corpus                         Expanded Corpus
Entity Extracted  Plan and Plan-Related  Proposal        Plan and Plan-Related  Proposal
PIN               316/714 (44%)          NA              448/866 (52%)          NA
Project Number    685/714 (96%)          588/588 (100%)  590/866 (68%)          693/734 (94%)
Work Type         NA                     588/588 (100%)  NA                     731/734 (100%)

4.4 Iowa DOT Implementation Plan

This section was designed to stand alone and therefore includes an introduction to the NCHRP 20-97 test at IADOT.

Introduction

IADOT participated as a test agency for NCHRP 20-97: Improving Findability and Relevance of Transportation Information. The IADOT test was designed to develop a process for auto-classifying project proposals and plans and for extracting key metadata elements from the documents, including project number, PIN, and work type.

The purpose of the implementation plan is to provide IADOT with a roadmap for future development and application of the techniques demonstrated in the NCHRP 20-97 test. It is structured into 10 activities, summarized in Table 33 and described below.

Table 33. IADOT Implementation Plan Overview

1. Strengthen information management governance: Establish a governing body with authority for designating authoritative sources and approving policies and standards related to metadata and master data sources.

2. Harmonize existing metadata across content repositories: Work toward a common, consistent set of metadata elements across different content repositories.

3. Establish controlled sources for the minimum metadata elements: Work toward consistency of metadata element values by using controlled, master sources to populate these elements.

4. Automate creation of standard metadata elements describing project location: Extend the current location assignment service to produce location-related metadata items (route, county, from MP, to MP) given a project number.

5. Establish document intake procedures for content management systems: Establish procedures that ensure documents are (1) text readable and (2) have minimum metadata populated before being added to content management systems.

6. Identify the value of establishing text analytics capabilities within IADOT: Review the results of the NCHRP 20-97 IADOT pilot and assess how future application of these techniques could add value.

7. Test application of the pilot techniques within IADOT: Test the pilot scripts from NCHRP 20-97 "as is" to establish baseline familiarity with the techniques, and then modify them for a new small-scale pilot application.

8. Scope future auto-classification and entity extraction applications: Create a scope for future application of auto-classification and entity extraction.

9. Evaluate commercial tools for auto-classification and entity extraction: Consider use of commercial text analytics packages to facilitate the process of rule development, as an alternative to the open source tools used in NCHRP 20-97.

10. Implement processes and tools for continued application of auto-classification and entity extraction techniques: Once a solution is selected (commercial or open source), deploy the solution, train users, and assign responsibilities for ongoing maintenance and refinement.

Implementation Activities

1. Strengthen information management governance. Many of the pain points identified for the pilot could be addressed through stronger information governance. The presence of multiple content management systems storing overlapping content types with inconsistent metadata has required greater levels of staff time "behind the scenes" to identify the most current, authoritative documents and to ensure that they are appropriately archived for future retrieval. A governing body with decision-making authority over agency information management could work toward greater alignment and coordination across systems in the short term, and possibly toward consolidation of content management systems in the longer term. Its goal would be to meet business needs for content sharing, access, and archiving in the most efficient manner possible from an agency-wide perspective. Keys to making a governing body effective are representation of the different agency business functions (in particular, the major users and managers of content management systems), a chair committed to collaboration and progress, and strong executive support (including willingness to back the group's decisions).

2. Harmonize existing metadata across content repositories. Analyze the metadata elements currently maintained within different content repositories. Identify a minimum set of common metadata elements that each repository should have. Develop a strategy for harmonizing lists of values or taxonomies across repositories in order to enable efficient cross-repository searches and document transfers. In the short term, explore ways to establish mappings or crosswalks across different metadata elements to minimize the need for major modifications to existing systems. Piggyback on system replacement or enhancement projects to take advantage of opportunities to improve metadata standardization and harmonization.

3. Establish controlled sources for the minimum metadata elements. Establish controlled vocabularies/taxonomies with master sources for metadata elements such as content types, subjects, project numbers, PINs, work types, and locations. Create a governance process for modifying these vocabularies/taxonomies and a workflow process for ensuring that updates are reflected within each content management system.

4. Automate creation of standard metadata elements describing project location. IADOT has a data service that can assign a location (route, county, from MP, to MP) based on a project number. This service should be leveraged to populate location-related metadata items within content management systems where they do not already exist, and to validate location-related metadata items where they do exist (a hypothetical sketch of such a lookup appears at the end of this section).

5. Establish document intake procedures for content management systems. Nearly half of the 170,000 files provided from DocExpress for the NCHRP 20-97 pilot were not text readable. DocExpress requires very little metadata, which makes ingestion of documents from this source into IADOT's official electronic records management system (ERMS) problematic. Establishing document intake procedures that include OCR conversion as well as metadata assignment would improve findability and reduce the need for time-consuming processing by ERMS staff.

6. Identify the value of establishing text analytics capabilities within IADOT. Review the processes used and the results produced by the classification and metadata extraction work in the NCHRP 20-97 IADOT pilot. Identify specific future applications at IADOT where these techniques would be helpful. For example, they might be applied to selected "high value" file types that are relatively common and that are important to identify and make accessible and findable. The techniques could be used solely for processing a large backlog of files, or as an interim method for performing document intake pending establishment of metadata standards and governance. Make a decision about whether to pursue implementation of these techniques at IADOT; if yes, continue on to step 7.

7. Test application of the pilot techniques within IADOT. A hands-on test of the pilot techniques used within NCHRP 20-97 would be the best way for IADOT to assess both their value and the level of effort required to scale them up beyond a pilot effort. The following approach is recommended, starting with installing and testing the scripts "as is" and then modifying them for a new pilot application.

a. Test the NCHRP 20-97 scripts "as is". Download and install the open source Python libraries used for the pilot. Assemble a test collection of documents and run the pilot scripts on these documents. Modify the scripts as needed to make them run properly in IADOT's environment. Review the results.

b. Identify a small-scale pilot to implement at IADOT. Select 1-2 additional priority content types to categorize, and 1-2 metadata elements to extract for each of the content types.

c. Create training sets for each new content type to be categorized. Create a representative training set of at least 30 documents for each new content type to be categorized.

d. Analyze the documents in the training sets to develop classification rules. Examine the files included within each category and identify key words or phrases that might uniquely distinguish the content type. Create one or more logical conditions to use for identifying the content type.

e. Modify and apply the auto-classification rules. Modify the auto-classification rules from the pilot to reflect the condition(s) formulated in the prior step. Apply the classification script and review the results. Identify documents that were misclassified, as well as documents that should have been classified but were not. Refine the rules to improve their performance and rerun. Anticipate 3-4 cycles of testing and refinement.

f. Compile master data sources for each metadata element to be extracted (where applicable). Make a list of all metadata elements to be extracted across all of the selected content types. Identify any existing master data sources for these metadata elements (e.g., lists of project numbers, PINs, work types, etc.).

g. Analyze the documents in the training sets to develop metadata extraction rules. Examine the files included within each category; identify where the different metadata elements to be extracted appear and what common words or phrases appear directly before or after them. Look for variations across the different documents in the training set. Create one or more logical conditions to use for identifying each metadata element for each content type.

h. Modify and apply the metadata extraction rules. Modify the metadata extraction rules from the pilot to reflect the condition(s) formulated in the prior step. Where applicable, make use of the master data sources to create a lookup method for metadata extraction that searches the document (or a portion thereof) for each possible value of the metadata element in the master list. Apply the metadata extraction script and review the results, identifying where no results were obtained or where results are erroneous. Refine the rules to improve their performance and rerun. Anticipate 3-4 cycles of testing and refinement.

8. Scope future auto-classification and entity extraction applications. Create a scope for future applications, including the types of documents to be classified, the metadata elements to be extracted, and the source of the documents to be included. Identify whether these techniques would be applied to create consistent metadata for a backlog of existing documents or used to partially automate the intake process for new documents.

9. Evaluate commercial tools for auto-classification and entity extraction. Once IADOT has determined that auto-classification is of value, it may be appropriate to explore a transition to a commercial text analytics tool. Tool capabilities include text mining, thesaurus or ontology development and management, auto-classification, entity extraction, and sentiment analysis. Appendix E of NCHRP Report 846 can be used as a starting point for identifying such tools.

10. Implement processes and tools for continued application of auto-classification and entity extraction techniques. The final step is to institutionalize the process identified in step 8 within IADOT. This would involve:

• identifying a manager who will be responsible for this function,
• identifying staff and/or external contract resources who will perform the work,
• deploying the selected solution (if applicable),
• conducting training necessary to get staff up to speed with the selected solution, and
• developing an initial work plan that defines activities and responsibilities for application of the techniques and for ongoing maintenance and refinement.
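As referenced in activity 4 above, the following is a sketch of how the location assignment service might be called to populate document metadata. The endpoint URL, parameter, and response field names are entirely hypothetical; IADOT's actual service interface would need to be substituted.

```python
import requests  # assumes the service is reachable over HTTP

LOCATION_SERVICE = "https://example.iowadot.gov/project-location"  # hypothetical

def location_metadata(project_number: str) -> dict:
    # Look up route, county, and milepost range for a project number so the
    # values can be written into a document's metadata record.
    resp = requests.get(LOCATION_SERVICE,
                        params={"project": project_number}, timeout=30)
    resp.raise_for_status()
    loc = resp.json()
    return {"route": loc["route"], "county": loc["county"],
            "from_mp": loc["from_mp"], "to_mp": loc["to_mp"]}
```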
