Chapter 5 - Findability Techniques

This chapter presents information about how the pilot projects were implemented and key lessons learned during the course of the pilots. This information can be used to replicate the techniques tested in the State DOT pilots.

A Note on Tools

The research team used a commercial text analytics product in the initial NCHRP Project 20-97 research. In the continuation project, open source software was used. The decision to use open source software was made to provide a way for DOTs to replicate the techniques used in the pilots without having to purchase specific commercial tools. However, it is important to understand that the open source route is a "do it yourself" endeavor, requiring custom scripting for content preparation tasks, rule or model application, and output formatting. It also requires research to identify and evaluate which specific available functions and libraries to use for each task.

Open source tools are an appropriate choice for agencies with access to technical staff capable of configuring open source software and scripting in Python or R. However, for agencies wishing to implement a regular text analytics function, use of commercial text analytics software may be a better choice. Commercial packages have built-in features that would need to be manually implemented in open source environments and provide user interfaces to simplify configuration and model/rule specification. While many of the established commercial products require a fairly significant investment, this is a rapidly developing area with new tools emerging at different price points.

5.1 Creating a Searchable Website for Agency Engineering Manuals

Procedure

The WSDOT pilot to create a searchable website for the agency's engineering manuals involved the following steps:

1. Assemble the most recent versions of the manuals. Use the native formats (e.g., MS Word) if available.
2. Create individual hypertext markup language (HTML) pages for each manual subsection. A custom script was written to perform this "chunking" process.
3. Load the HTML documents into a MySQL database, which serves as the back-end database for the Drupal Books module of the (open source) Drupal web content management system.

4. Identify and design a series of search filters or "facets" for the manuals website. The WSDOT pilot focused on stormwater-related content, so these facets included drainage system elements, stormwater best management practices (BMPs), and project deliverables related to stormwater.
5. Create an ontology defining terms and relationships for each of the selected facets. The WSDOT pilot used Protégé, an open source tool, for creating this ontology. The ontology was developed based on analysis of the content within the manuals, with input and validation from WSDOT subject matter experts. It included synonyms (equivalent terms) for some of the terms, identified during discussions with the subject matter experts.
6. Build an automated process for tagging the individual manual sections with terms from the ontology. The WSDOT pilot used Apache Jena to store the ontology as subject-predicate-object expressions called "triples". The ontology was exported as an OWL file from Protégé and loaded into Jena. The pilot used the Google natural language processing (NLP) service to extract meaningful terms from the manual text. A custom-built component was used to tag manual sections based on matches between results from Google NLP and the triples stored in Jena.
7. Run and test the tagging process. Identify cases of over-tagging (sections tagged that aren't relevant to the term) and under-tagging (sections relevant to a term but not tagged). Adjust the ontology and the tagging logic as needed and re-run. Repeat this process until satisfactory results are achieved.
8. Design and create the search and navigation user interface in Drupal. The WSDOT pilot user interface includes a standard search box for entering search terms. After a user executes a search, the facets and terms relevant to the search results appear on the left side of the screen. Users can select terms to further filter their results.
9. Index the content in the MySQL database. The WSDOT pilot used the Solr search engine for this.
10. Configure the Solr search engine to include synonyms (to be applied at either index time or search query time).
11. Conduct testing (including structured testing as well as free-form testing by target users), and modify the ontology, tagging process, and search configuration to address issues.

The architecture for the WSDOT pilot manuals solution is illustrated in Figure 6 (WSDOT's pilot manuals solution architecture).

Lessons

1. Select an Appropriate Tagging Methodology

Three different auto-classification approaches can be considered for tagging content:

1. An ontology-driven auto-classification process in which content is tagged with ontology terms based on direct matches to terms in the text or matches with synonyms and child terms.
2. A rule-based approach in which content is tagged based on a set of rules that can be designed to look for several different terms (in particular portions of the text) that indicate the content related to the designated tag.
3. A machine learning approach in which an algorithm learns to classify content in a "learn by example" approach.

The ontology-driven approach works well when the ontology is sufficiently built out to reference specific concepts in the content. When ontology terms are abstract or ambiguous, this approach is less accurate. For example, classifying sections with content related to "Culvert" or "Bioinfiltration pond" was successful using the ontology approach in the WSDOT manuals pilot. Attempts to classify content based on broad topic areas such as "Environmental Review" were not as successful. A rule-based or machine learning approach should be considered for tagging based on these more general terms.
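As an illustration of the first approach, the following is a minimal sketch of ontology-driven tagging. It is not the WSDOT pilot's custom component (which matched Google NLP output against triples stored in Apache Jena); it assumes the ontology terms and synonyms have already been exported into a simple Python dictionary, and the terms shown are illustrative examples only.

```python
# Minimal sketch of ontology-driven tagging: tag a manual section with any
# ontology term whose preferred label or a synonym appears in the section text.
# The ontology here is a plain dictionary; in the WSDOT pilot the terms and
# synonyms came from an OWL ontology stored in Apache Jena.
import re

ONTOLOGY = {  # hypothetical excerpt: preferred term -> list of synonyms
    "Culvert": ["culvert"],
    "Bioinfiltration pond": ["bioinfiltration pond", "bio-infiltration pond"],
    "Compost Amended Vegetated Filter Strip": ["CAVFS"],
}

def tag_section(section_text, ontology):
    """Return the ontology terms whose label or a synonym occurs in the text."""
    tags = set()
    for term, synonyms in ontology.items():
        for phrase in [term] + synonyms:
            # Require the entire multi-word phrase to match (see Lesson 2 below).
            if re.search(r"\b" + re.escape(phrase) + r"\b", section_text, re.IGNORECASE):
                tags.add(term)
                break
    return sorted(tags)

section = "Install a culvert upstream of the bioinfiltration pond and the CAVFS."
print(tag_section(section, ONTOLOGY))
# ['Bioinfiltration pond', 'Compost Amended Vegetated Filter Strip', 'Culvert']
```

In a fuller pipeline, the dictionary would be generated from the exported OWL file rather than hard-coded, and matching would typically be run against phrases extracted by an NLP service rather than raw section text.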

The rule-based approach offers more flexibility than the ontology-driven approach. It can look for multiple terms related to the tag being assigned. It can be configured to add more weight to terms found in the titles, or to account for proximity of different words to one another. However, rules can take considerable effort to develop and maintain.

A machine learning approach can be considered as an alternative to a rule-based approach. However, this approach depends on creating a manually tagged training set. The number of documents to be classified must be sufficiently large to justify the effort needed to create this training set. For the WSDOT manuals pilot, there wasn't a large enough pool of documents for the machine learning approach.

All three auto-classification methods require multiple cycles of testing and refinement to yield successful results. However, of the three, the ontology-driven approach is likely to be the most efficient, transferable, and maintainable. While rules involving regular expressions may be difficult to share and repurpose, ontologies can be saved and shared in a reusable file format such as OWL and used to drive automated tagging as well as to power other search and discovery features.

2. Plan for Iterative Testing and Refinement

Plan for an iterative process of testing and refinement to achieve good results from ontology-driven tagging. For example, in the WSDOT pilot, tradeoffs were required to avoid over-tagging (based on matching any individual word within a multi-word term) and under-tagging (based on requiring a match with the entire phrase). In the end, the WSDOT tagging process was configured to require that the entire multi-word phrase be matched to the content for the tag to be applied.

For some types of tags, the automated approach will likely need to be supplemented with manual tagging. For example, WSDOT's Master Deliverables List (MDL) contains specific references to engineering manual sections to be consulted for preparation of certain deliverables. These references, created by the authors of the MDL, provide a much better basis for tagging than matching of deliverable names to manual section text. As another example, a tag such as "storm drain" may be relevant to an entire manual chapter (e.g., Hydraulics Manual Chapter 6, Storm Drains), but there may not be specific vocabulary within each individual subsection of that chapter to match with the tag. In these situations, the automated vocabulary-driven tagging would not work well, but bulk-tagging processes could be developed to apply the tag to every subsection within the chapter.

The general lesson is that vocabulary-driven tagging can be extremely helpful, but it needs to be carefully validated and supplemented with other tagging methods. It is important to assign someone the responsibility for the overall quality of results who can adjust the ontology as needed and select appropriate tagging strategies. In addition, it is essential to build in time for user testing and feedback, and time to respond to the feedback received.

3. Integrate Synonyms with Care

Integrating synonyms into both tagging and search configuration extends the search power beyond what would be achieved with full-text search. Synonyms can include equivalent or near-equivalent terms (e.g., "luminaire" and "light" or "slope" and "grade") as well as common abbreviations (e.g., "CAVFS" is a commonly used acronym for "Compost Amended Vegetated Filter Strip" in the WSDOT manuals). Synonyms can be integrated into the tagging process by including them in the ontology. Synonyms for terms not included in the ontology can be specified for use by the search engine either at index time or query time. For the WSDOT pilot, the Solr search engine was configured to apply a set of synonyms at query time. This meant that when a user typed "grade" into the search box, their search results would include any sections that included the word "grade" or the word "slope."

Synonyms should be used with care, especially when the synonym selected has multiple meanings. For example, "PE" is a common acronym for "Preliminary Engineering." If "PE" were specified as a synonym for "Preliminary Engineering" and a user searched for "Preliminary Engineering," they might wonder why their search results included sections addressing when a Professional Engineer (PE) sign-off is required. Also, it is important to validate synonyms with multiple subject matter experts. Opinions may differ about whether certain terms are synonymous or closely related with slightly different meanings.

4. Design Facets for Future Extensibility

When designing the facets to be used in a particular search-based application, it is important to consider the nature of the content being searched and the needs of the users for the tool being developed. However, it is also important to step back and think about how the facets might fit with the needs of other existing (or future) applications. Where possible, facets should fit within an existing, globally defined, high-level structure. For example, one of the WSDOT pilot's facets was called "Project Development Topic." This facet was designed to fit within a broader agency topic facet that had not yet been fully developed.
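To make the query-time synonym behavior described in Lesson 3 concrete, the sketch below expands a user's search term using synonym groups before the search is run. It is a simplified stand-in for the Solr synonym filter used in the pilot, not its actual configuration; the synonym groups shown are examples drawn from the text above.

```python
# Sketch of query-time synonym expansion (a simplified stand-in for Solr's synonym filter).
SYNONYMS = [  # each group lists terms treated as equivalent at query time
    {"slope", "grade"},
    {"luminaire", "light"},
    {"cavfs", "compost amended vegetated filter strip"},
]

def expand_query(term):
    """Return the set of search terms after applying synonym expansion to one query term."""
    terms = {term.lower()}
    for group in SYNONYMS:
        if terms & group:      # the query matches one term in the group...
            terms |= group     # ...so search for every term in the group
    return terms

print(sorted(expand_query("grade")))   # ['grade', 'slope']
```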
5.2 Discovering Subject Areas and Terminology Within a Body of Content

The WSDOT and UDOT pilots used an unsupervised machine learning technique that allows a computer to find natural clusters of content with the same meaning within collections of engineering manuals. Each cluster is defined by a set of commonly appearing words. When the computer discovers sets of words that appear together in the same document, the proximity of these words suggests potentially meaningful relationships among the words. Clusters of terms, therefore, can indicate the meaning of a document and can be used for classification.

In addition to cluster analysis, the WSDOT pilot used text mining to identify common words for inclusion in the ontology. Text mining produces statistics on the frequency of different words or phrases in a document or collection of documents. These statistics can be used to identify terms that are meaningful for describing the content. For example, if the phrase "stormwater management" appears more frequently than any other phrase in a document, it is an indication that the document is about stormwater management. These frequently used terms can be harvested to use in the ontology. Text mining and cluster analysis techniques are outlined in the following sections.

Procedure

1. Prepare the content for analysis.
   a. First, eliminate boilerplate sections, such as tables of contents, glossaries, and other sections without substantive technical content.
   b. Then, convert the remaining content to plain text format. The pilots used the Python tools PDFMiner (for PDFs) and docx2txt (for Microsoft Word).
   c. Use NLP tools to remove stop words (such as "and," "or," and "for") and perform stemming to convert words to their root form. The pilots used tools in the Python Natural Language Toolkit (NLTK) for this purpose.
2. Identify common words in the text. The pilot used the CountVectorizer function in the Python SCIKIT-Learn module to extract features from the text files. Prepare a frequency distribution to identify the most common terms. Review the list to distinguish the terms that are candidates to be added to a controlled vocabulary.
3. Create a term frequency-inverse document frequency (tf-idf) matrix for the terms in the cleaned text files. The pilots used the Python SCIKIT-Learn module to do this.
4. Apply the K-Means algorithm to the tf-idf matrix to obtain a set of clusters. Set up model analysis parameters including the number of desired clusters, the maximum number of features, and cutoffs that determine which terms will be ignored because they appear in either too few manuals/chapters or too many manuals/chapters. Tune the model to determine the optimal number of clusters (using the Elbow method) and the cohesiveness of clusters (based on the Silhouette coefficient).
5. Output the top terms for each cluster. Assign a name for each cluster based on the defining terms.

Results of the WSDOT cluster analysis are shown in Table 1. Note that the terms defining the clusters are shown in their root form, without endings (e.g., "restor" is used to match with related words such as "restore" or "restoration").

Lessons

1. Using Results of Cluster Analysis

Cluster analysis involves a computational method that seeks to identify what content is about. It can provide the starting point for a classification scheme, by isolating and aggregating similar words and concepts to identify groups of like content. Because this sorting is done computationally, sometimes the clusters are nonsensical and will need to be dismissed. In other cases, however, the words and relationships accurately describe like content and are helpful for structuring vocabularies and retrieval systems.
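A minimal sketch of steps 2 through 5 of the procedure above is shown below, using the scikit-learn library that the pilots used. The directory name, parameter values, and number of clusters are placeholders; in the pilots these were tuned iteratively using the elbow method and silhouette coefficient.

```python
# Sketch of the tf-idf and K-Means clustering steps, assuming the manual chapters
# have already been cleaned and saved as plain-text files in cleaned_text/.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [p.read_text(encoding="utf-8") for p in Path("cleaned_text").glob("*.txt")]

# Step 3: build the tf-idf matrix, ignoring terms that are too common or too rare.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000,
                             max_df=0.7, min_df=0.05)
tfidf = vectorizer.fit_transform(docs)

# Step 4: cluster the documents (the number of clusters would be tuned, e.g., via the elbow method).
km = KMeans(n_clusters=18, random_state=0, n_init=10)
km.fit(tfidf)

# Step 5: print the top terms that define each cluster, as a basis for naming it.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(km.n_clusters):
    top_terms = [terms[j] for j in order[i, :15]]
    print(f"Cluster {i}: {', '.join(top_terms)}")
```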

2. Building the Model

Many algorithms are capable of clustering content. Make sure that you understand the clustering technique and the math that drives the clustering algorithm. Sometimes a seemingly simple algorithm can yield results that are counterintuitive. Rely on subject matter experts to validate results.

Plan on an iterative approach to select appropriate model parameters. Key parameters for the K-Means model used for WSDOT and UDOT are:

• Number of clusters;
• Maximum features to include in the model;
• Inclusion of single words only or multi-word phrases as features for the model;
• Words to exclude that don't contribute to understanding of meaning (e.g., months of the year); and
• Thresholds for removing rarely occurring and frequently occurring words from the model (e.g., ignore words that are in more than 70% or fewer than 5% of the documents).

An "elbow curve" can be created to help determine the optimal number of clusters for the data. Once the clusters are created, a "silhouette" metric can be calculated, providing a measure of how similar an object (a document) is to its assigned cluster versus other clusters.

Table 1. Sample cluster analysis results.

Cluster: Top 15 Terms
1. Roadside: roadside, plant, function, veget, visual, soil, restor, community, environment, polici, mainten, landscap, tree, enhance, nativ
2. Structures/Geotechnical: bridg, structure, wall, barrier, railroad, geotechn, hq, nois, contour, slope, sheet, tabul, berm, onlin, clearanc
3. Environmental/Project Review: environment, commit, permit, nepa, impact, june, feder, agenc, approv, mainten, fhwa, analysi, review, resource, polici
4. Estimating: cost, contract, overtime, consult, premium, fee, negoti, indirect, rate, weight, labor, profit, hour, audit, payment
5. Traffic Engineering: citi, pedestrian, roundabout, curb, illumin, park, traffic, signal, light, street, path, width, cross, mainten, access
6. Consultant Services: cso, consult, contract, agreement, firm, negoti, acl, profession, task, solicit, competit, septemb, select, procur, request
7. Construction Contracts: estim, cost, item, price, bid, april, scope, review, history, specialti, contractor, phase, costbas, quantity, analysi
8. Traffic and Safety: traffic, sign, safeti, control, zone, lane, vehicl, speed, roadway, oper, intersect, element, mainten, instal, barrier
9. Stormwater: water, stormwat, eros, sediment, discharg, bmps, runoff, soil, control, flow, infiltr, tesc, april, surfac, prevent
10. Utilities: util, franchis, accommod, instal, agreement, reloc, shall, cost, right, approv, facil, applic, zone, permit, control
11. Plans: na, cid, lt, yes, septemb, appendix, vacat, sheet, rt, checklist, chart, remark, lb, flow, addendum
12. Hydraulics: hydraul, culvert, pipe, flow, channel, stream, fish, inlet, woodi, river, march, passag, wash, veloc, structur
13. Access Management: access, connect, permit, shall, control, rcw, septemb, driveway, author, approach, appendix, applic, hear, wac, properti
14. Survey: survey, monument, control, point, data, accuraci, januari, adjust, datum, map, instrument, gps, station, observ, rod
15. Development Review: sepa, gma, counti, local, land, appeal, los, agenc, impact, propos, environment, mitig, review, cipp, rtpo
16. Right of Way: right, properti, apprais, acquisit, parcel, res, februari, real, titl, certif, easement, acquir, owner, reloc, estat
17. Traffic Impact Mitigation: mitig, impact, local, agreement, traffic, payment, improv, review, los, interloc, wsdotwagov, propos, permit, agenc, shall
18. Geometrics: lane, curv, ft, speed, width, hov, ramp, exhibit, vehicl, cross, geometr, traffic, shoulder, intersect, facil

5.3 Auto-Classifying Documents by Content Type

The UDOT and IADOT pilots both involved rule-based and machine learning-based auto-classification of selected content types. Steps for conducting rule-based and machine learning-based auto-classification are described here.

Procedure for Classifying Content Type Based on Rules

a. Select content types to classify. For best results, choose content types that are relatively standardized. A content type that has many "subtypes" and many variations in format will be more difficult to auto-classify than one with a uniform format.
b. Create training sets for each new content type to be categorized. Training sets should include at least 10 different examples for each content variation included in the content type.
c. Analyze the documents in the training sets to develop classification rules. Examine the files included in each content type category and identify key words or phrases that might be used to uniquely distinguish the content type. Identify the part of the document in which these signal phrases occur. If all of the examples of the content type in the training set share one or more distinguishing signal phrases, create a simple classification rule based on the presence of these phrases. If there are several different variations to be accounted for, create a script that assigns points to the document for each relevant signal phrase that is found.
d. Include negative evidence in your rules to avoid false positives (e.g., avoid classifying emails that reference maintenance agreements as maintenance agreements).
e. Implement the rules in your text analytics software, or use a scripting language (e.g., Python).
f. Apply the classification rules to a large test set of documents and review the results.
g. Evaluate the results by examining each document in the test set. Identify documents that were misclassified (false positives) as well as documents that should have been classified but were not (false negatives). Refine the rules to improve their performance and re-run.
h. Repeat the process of testing and refinement until you are satisfied with the results. Anticipate 3-4 cycles.
i. Apply the rules to the full set of documents to be classified.

Table 2 shows an example of rules created for classifying UDOT maintenance agreements. In this example, the negative numbers indicate negative evidence. For example, a "Quitclaim Deed" is not a "Maintenance Agreement," so the rule will exclude this term as it identifies a positive match with "Maintenance Agreement."

Table 2. Example of rules for classifying UDOT maintenance agreements.

Character String | Score | Character Block
Maintenance Agreement | 10 | 1000
Agreement | 5 | 500
maintenance | 9 | 1000
M A I N T E N A N C E   A G R E E M E N T | 10 | 1000
made and entered into | 8 | 1000
whereas | 4 | 700
agenda | -9 | 700
meeting | -7 | 600
minutes | -7 | 600
attendees | -5 | 600
from: | -5 | 700
to: | -5 | 700
subject | -3 | 700
sincerely | -3 | 500
request for proposals | -10 | 700
Easement | -5 | 600
Checklist | -5 | 600
Status Report | -5 | 600
Inspector's Daily Report | -5 | 600
Mitigation Bank | -5 | 600
Standards and References | -5 | 600
Change Order | -5 | 600
Environmental Study | -5 | 600
Manual of Instruction | -5 | 600
Quitclaim Deed | -5 | 600
Prospectus | -5 | 600
Research Problem Statement | -5 | 600
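The sketch below shows one way scored signal-phrase rules like those in Table 2 might be applied in Python. The interpretation of the "character block" as the span of text (from the start of the document) within which a phrase must occur, the classification threshold, and the abbreviated rule list are assumptions for illustration, not the scripts used in the pilots.

```python
# Sketch of applying scored signal-phrase rules (as in Table 2) to one document.
# Each rule is (phrase, score, character_block); the phrase is only counted if it
# occurs within the first character_block characters of the document text
# (an assumed interpretation of the "Character Block" column).
RULES = [
    ("maintenance agreement", 10, 1000),
    ("made and entered into", 8, 1000),
    ("whereas", 4, 700),
    ("agenda", -9, 700),           # negative evidence: meeting agendas
    ("quitclaim deed", -5, 600),   # negative evidence: deeds are not agreements
]

def score_document(text, rules, threshold=10):
    """Return (is_match, score) for one document against one content type's rules."""
    lowered = text.lower()
    score = 0
    for phrase, points, block in rules:
        if phrase in lowered[:block]:
            score += points
    return score >= threshold, score

sample = "MAINTENANCE AGREEMENT made and entered into this day ... WHEREAS the parties agree ..."
matched, score = score_document(sample, RULES)
print(matched, score)   # True 22
```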

Procedure for Classifying Content Using Supervised Machine Learning (Naïve Bayes)

a. Select content types to classify. For best results, choose content types that are relatively standardized. A content type that has many "subtypes" and many variations in format will be more difficult to auto-classify than one with a uniform format.
b. Create training sets for each new content type to be categorized. Aim to include at least 30 different examples (more if possible) for each content variation included in the content type.
c. Create another training set of at least 100-200 documents that are not one of the content types to be classified.
d. Download tools for content preparation and machine learning. The NLTK and SCIKIT-Learn modules of Python, available within Anaconda 3, an open source distribution of Python and its various packages, were used for the pilot.
e. Convert Word and PDF documents to text files.
f. Pre-process the text files to perform tokenization, stemming, and removal of punctuation and stop words.
g. Extract features from the text files. (Features are the terms to be used for processing by the algorithms.) The pilot applications used the Python CountVectorizer module, which creates a simple "bag of words" term frequency matrix. (Use of a tf-idf matrix was also tested but did not improve the models.)
h. Specify the classes to be predicted by the model. For example, if you have selected a single content type to classify, your model would have two classes: "Content Type X" and "Not Content Type X."
i. Create a test-training set containing the training sets for each content type and the training set for the "other" documents.
j. Use 80% of the documents in the test-training set to build the model, and set aside 20% of this test-training set for testing the initial model.
k. Build the initial model using the training set, and test it on the test set.
l. Evaluate the results by comparing the predicted classifications versus the actual classifications in the test set.
m. Adjust the model parameters to try to achieve a better result. Key parameters include thresholds specifying when to exclude certain terms from the analysis because they are either too common or too rare. For example, drop terms that are in 70% of the documents or only in 1% of the documents.
n. Once the model prediction rate is good (over 80% correct predictions), test the model on a larger set of files from the full set of documents to be classified. If the prediction rate drops significantly, augment the training set and re-fit the model.
o. Iterate until satisfactory results are achieved.
p. Apply the final model to the full set of documents to be classified.
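The following sketch illustrates steps g through l of the procedure above using scikit-learn's Multinomial Naïve Bayes classifier. The folder names, frequency cutoffs, and two-class setup are placeholders; the pilots iterated on these parameters and on the composition of the training sets.

```python
# Sketch of supervised classification with a "bag of words" matrix and Naive Bayes,
# assuming pre-processed text files sorted into folders by class
# ("content_type_x" and "other" are placeholder folder names).
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

texts, labels = [], []
for label in ("content_type_x", "other"):
    for path in Path("training", label).glob("*.txt"):
        texts.append(path.read_text(encoding="utf-8"))
        labels.append(label)

# Step g: extract features, dropping terms that are too common or too rare.
vectorizer = CountVectorizer(stop_words="english", max_df=0.7, min_df=0.01)
X = vectorizer.fit_transform(texts)

# Steps j-k: hold out 20% of the test-training set and fit the model on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels)
model = MultinomialNB().fit(X_train, y_train)

# Step l: compare predicted classes against the actual classes in the held-out set.
print(classification_report(y_test, model.predict(X_test)))
```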

Lessons

1. Appropriateness of Auto-Classification

Auto-classification has the potential to produce valuable metadata in a fraction of the time that would be needed to do this classification manually. In fact, when there is a very large volume of content, auto-classification may be the only solution to classify content at scale. However, the effort required to perform an auto-classification is substantial. Candidate auto-classification efforts should first be evaluated based on the value that they are likely to add in terms of improved findability and then be evaluated based on whether they will save time or effort over fully manual tagging processes. Before embarking on an auto-classification effort, the up-front investment to develop rules or models should be compared to the effort that would be required to manually categorize the corpus of documents over time.

The UDOT test provided a good understanding of the effort required to implement both rule-based and machine learning text analytics applications for auto-classification and entity extraction. See Appendix A for details on the level of effort required. Both rule-based and machine learning approaches require assembling a representative body of content that can be used to derive a set of rules (through manual examination) or to build a model (for machine learning). Rule and model development are iterative processes, requiring multiple cycles of development, application, testing, and refinement. Once the rule-based or machine learning techniques are implemented, additional effort is needed to validate and correct the results, unless agencies are willing to tolerate some percentage of miscategorized documents.

In general, text analytics applications for auto-classification will not be cost effective when:

• The number of documents to be categorized is relatively small (e.g., under 500);
• There are inherent challenges to compiling training sets representing the types of documents to be classified;
• There is a static collection of documents (as opposed to a constant stream of new documents to be categorized);
• The body of documents to be categorized is highly varied in use of vocabulary and format, such that the size of the training set would need to be fairly large in order to represent variations in the entire corpus;
• The body of documents to be categorized is not in machine readable format or is not amenable to successful OCR due to poor image quality; and
• There are already good quality metadata being manually assigned as part of established content management system or library intake processes.

Conversely, the most promising applications of text analytics for auto-categorization are those characterized by:

• A very large number of documents (e.g., over 5,000 and increasing);
• Documents that are relatively homogeneous or standardized;
• An available "pre-tagged" training set;
• Originally digital documents or documents that have been successfully converted to machine readable form; and
• No ongoing manual content tagging processes.

2. Importance of Content Analysis

Manual examination of the content to be classified or searched requires effort but is essential. Pilots at UDOT and IADOT found that some content types assumed to be relatively uniform had several variations. Identifying these variations and (if applicable) subclassifications of content types is necessary for establishing representative training sets to guide both machine learning and rule-based classification processes. Content analysis can also uncover special cases to be handled during processing, such as password-protected files or files that are not text readable.

3. Converting Documents to Text Format

There are different tools available to convert documents to text format, and each tool may produce different results. For example, Python's PDFMiner was found to be more accurate but also much slower than the alternative PyPDF2. Text files produced by UDOT's vendor did not match the output from either of these Python tools. None of the tools did a good job with tabular formatted text. It is important to test different tools to make sure that their performance meets your needs; sometimes it is worth it to trade slower processing speed for higher accuracy.

4. Impact of OCR Quality on Accuracy

Text created through OCR processing will likely have errors that will impact the success of the classification. The number and types of errors will vary based on the quality of the original document and the specific OCR tool used for conversion. You should anticipate the need for pre-processing to eliminate extraneous spaces that did not appear in the original document.

5. Recall and Precision Testing

A structured process of testing the results of the auto-classification should be conducted, including both recall and precision. This process provides the basis for determining whether the auto-classification process is sufficient or if further work is needed to improve the results. Recall measures the completeness of the classification. Precision measures the reliability of the classification. Recall is the percentage of the total number of documents of a given type that were correctly classified. Improving recall means reducing the number of false negatives, namely documents that were not classified but should have been. Precision is the percentage of the documents that were correctly classified. Improving precision means reducing the number of false positives, namely documents that were classified but should not have been.
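As a simple illustration of the two measures (using made-up counts, not pilot results):

```python
# Illustrative recall/precision arithmetic with hypothetical sample counts.
classified_sample, correctly_classified = 100, 90     # documents tagged by the classifier
precision = correctly_classified / classified_sample  # 0.90

known_sample, found_by_classifier = 50, 40             # known documents of the target type
recall = found_by_classifier / known_sample            # 0.80

print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```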

Measuring recall and precision is time-consuming. With a large corpus of documents, it is not practical to examine each document to determine whether or not it was correctly classified. A sampling approach can be used to obtain an indication of recall and precision.

To measure precision, take a random sample of the documents that were classified and check each one to see if it was correctly classified. The precision measure is the percent of documents in your sample that were correctly classified.

To measure recall, create an independent sample of documents of the type you are auto-classifying. Check to see how many of the documents in this sample were included in the set of documents that were classified. The recall measure is the percent of the documents in the sample that were classified.

6. Managing and Processing Large Data Sets

Auto-classification efforts can involve very large data sets. For example, the UDOT pilot involved managing hundreds of thousands of files, requiring several terabytes of disk space. It is important to plan efficient ways to download, transfer, store, process, and analyze data sets of this scale and to allocate sufficient time for processing.

In the pilot projects, the most efficient way to transfer large data sets was with physical disks; uploading files to the cloud or File Transfer Protocol sites was attempted, but DOT upload speeds were too slow to make this method practical. However, use of cloud computing resources for processing-intensive tasks was a low-cost strategy that saved considerable time. When using cloud computing, take advantage of price breaks for using excess computing capacity during times when demand for computing resources is low.

5.4 Extracting Metadata from Documents

The UDOT and IADOT pilots both involved extracting metadata from documents. Steps for performing metadata extraction are described in the following sections.

Procedure

a. Prepare the content for analysis by converting it to text format.
b. Make a list of all metadata elements to be extracted across all of the selected content types.
c. Identify any existing authoritative, master data sources for these metadata elements (e.g., lists of project numbers, PINs, and work types).
d. Compile master data sources for each metadata element to be extracted (where applicable).
e. Analyze the documents in the training sets to develop metadata extraction rules. Examine the files included within each category to identify where the different metadata elements to be extracted appear, and what common words or phrases appear directly before or after them. Look for variations across the different documents in the training set.
f. Based on the analysis, create one or more logical conditions to use for identifying each metadata element for each content type.
g. Create scripts to apply the metadata extraction rules. Where applicable, use the master data sources to create a lookup method for metadata extraction that searches the document (or a portion thereof) for each possible value of the metadata element in the master list.
h. Apply the metadata extraction script. Review the results to identify where no results were obtained or where results are erroneous.
i. Refine the rules to improve their performance and re-run. Anticipate 3-4 cycles of testing and refinement.
j. Modify and apply the metadata extraction rules.

Figure 7 shows a sample document (a maintenance agreement) with the location of metadata entities to be extracted.
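The sketch below illustrates the two extraction methods referenced in step g and discussed in the lessons that follow: a regular expression keyed on a signal phrase, and a lookup against an agency master list. The "PIN" element, the pattern, and the master values are placeholders for illustration, not the pilots' actual rules.

```python
# Sketch of two metadata extraction methods for a hypothetical "PIN" element:
# (1) a regular expression keyed on the phrase "PIN:" and (2) a lookup against
# a master list of known PINs from an agency master data source.
import re

MASTER_PINS = {"12345", "67890"}   # placeholder master data source

def extract_pin_regex(text):
    """Regular expressions method: take the characters that follow the key word 'PIN:'."""
    match = re.search(r"PIN:\s*([A-Za-z0-9-]+)", text)
    return match.group(1) if match else None

def extract_pin_lookup(text):
    """Lookup method: find any known PIN from the master list in the document text."""
    return sorted(pin for pin in MASTER_PINS if pin in text)

sample = "MAINTENANCE AGREEMENT ... PIN: 12345 ... made and entered into ..."
print(extract_pin_regex(sample))    # 12345
print(extract_pin_lookup(sample))   # ['12345']  (lookup requires an exact match)
```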

Lessons

The lessons included in Section 5.3 related to converting documents to text and OCR quality are equally applicable to entity extraction. Additional lessons for entity extraction are provided in the following sections.

1. Understand the Anticipated Results of the Metadata Extraction

As part of the planning for an entity extraction effort, it is important to work with subject matter experts to understand what results are expected. Key questions to ask are:

• Should the documents have only one of these entities or several?
• If a document has several of the entities, should all of them be extracted, or only the one that occurs in a particular location in the document?
• Are all of the documents expected to have the entity? If not, are there any patterns or clues about which documents will not have it?

2. Use Available Agency Master Data Sources to Improve Entity Extraction

In the UDOT and IADOT pilots, entities such as Project Numbers and PINs were extracted from documents using two different methods: "regular expressions" and "lookup". The regular expressions method involves searching the document for a key word (e.g., "PIN:") and then extracting the characters following this key word. The lookup method uses an independent list of Project Numbers and PINs (from an agency master data source) and finds any instances of these known Project Numbers and PINs in the document.

The regular expressions method proved to be faster and more forgiving of variations in entity formatting since an exact match with the master source data is not required. However, it relies on the presence of standard phrasing and document structure and works best when the document type is relatively uniform. The lookup method is slower than the regular expressions method and less forgiving of variations but obtains cleaner results (due to matching with a master list). Using a combination of the lookup and regular expression methods yielded the best results. Both methods were applied, and a post-processing script was created to use the lookup results where this method was successful; otherwise, the regular expression results were used.

Master data sources can also be used to associate other metadata elements to documents. For example, consider a master data source that includes project number, route, county, from milepost, and to milepost. Once the project number associated with a document is known, the related location items can also be attached as metadata to that document.

3. Craft "Fuzzy Matches" to Account for Format Variations and OCR Issues

As noted earlier, the lookup method is less forgiving of format variations. For project numbers, several different format variations were observed in the pilots, such as in the use of leading zeros, dashes, and parentheses. These variations meant that a project number in the document would not match with the one in the master data source. In addition, OCR processing can create errors such as insertion of extraneous spaces and substitution of capital "O"s for "0"s. These variations can be identified and addressed through development of scripts that provide a "fuzzy match" capability.

In addition, if you are creating an entity extraction script based on seeking particular terms in the text, it is best to use a single text conversion tool for the entire set of documents to be processed. This will reduce the number of variations that your script will need to consider.
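As a closing illustration, the sketch below shows the kind of normalization that can support a "fuzzy match" between document text and a master list. The specific substitutions address the issues noted above (extraneous spaces, capital "O" for zero, leading zeros, dashes, and parentheses) but are illustrative assumptions, not the pilots' scripts.

```python
# Sketch of normalizing project-number strings before matching against a master list,
# to tolerate the OCR errors and format variations noted above.
import re

def normalize_project_number(value):
    """Normalize formatting/OCR variations so near-matches compare equal."""
    value = value.upper().replace(" ", "")   # drop extraneous spaces introduced by OCR
    value = value.replace("O", "0")          # OCR often substitutes capital 'O' for '0'
    value = re.sub(r"[()\-]", "", value)     # ignore dashes and parentheses
    return value.lstrip("0")                 # ignore leading zeros

master = {normalize_project_number(p) for p in ["S-0123(45)", "F-067(89)"]}
candidate = "S-O123 (45)"                     # OCR'd variant of the first master number
print(normalize_project_number(candidate) in master)   # True
```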
