National Academies Press: OpenBook

Implementing Information Findability Improvements in State Transportation Agencies (2020)

Chapter: Appendix A - Auto-Classification and Entity Extraction Level of Effort

Suggested Citation:"Appendix A - Auto-Classification and Entity Extraction Level of Effort." National Academies of Sciences, Engineering, and Medicine. 2020. Implementing Information Findability Improvements in State Transportation Agencies. Washington, DC: The National Academies Press. doi: 10.17226/25884.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Appendix A: Auto-Classification and Entity Extraction Level of Effort

Table A-1 lists the activities required for performing auto-classification and entity extraction. Rough estimates of hours for each activity are provided based on the experience of the Utah DOT (UDOT) in round 1 (with a corpus of roughly 6,700 files). These estimates can vary widely based on the nature of the effort, the size of the corpus, and the experience level of the individuals performing the activities.

For UDOT, the total effort involved to do either machine learning or rule-based auto-classification for a single family of content types (Agreements) was roughly 250 hours. This could have been reduced to under 200 hours through a more streamlined process for content collection and provision of machine-readable files. Availability of a pre-tagged learning set could have further reduced the effort.

Table A-2 illustrates how the time savings achieved through auto-categorization increase with the size of the corpus to be categorized. This analysis assumes:

• Two minutes per document would be required for manual categorization.
• Thirty seconds per document would be required to validate the automated categorization (in addition to an up-front investment of 250 hours).

Under these assumptions, auto-categorization breaks even with manual categorization at a corpus of 10,000 documents and saves time for any larger corpus. In practice, an agency might choose to accept a 70% to 80% accuracy rate and forgo the process of validating the auto-categorized documents, which would further improve the value proposition.

It is important to note that while this hypothetical analysis illustrates potential time savings from text analytics applications, in many cases agencies would not be willing to make the investment needed to manually categorize a corpus of existing documents. They may, however, be willing to make a relatively modest investment (e.g., 200–250 hours) to implement an automated (albeit not 100% accurate) process.
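The arithmetic behind these assumptions can be sketched in a few lines of Python. This is only an illustration of the stated figures (250 hours up front, 30 seconds of validation per document, 2 minutes of manual categorization per document); the function and constant names are hypothetical and do not come from the report.

```python
# Assumptions taken from the text above; names are illustrative only.
UPFRONT_HOURS = 250.0          # one-time auto-categorization investment
VALIDATE_SEC_PER_DOC = 30.0    # validating one auto-categorized document
MANUAL_MIN_PER_DOC = 2.0       # categorizing one document by hand


def auto_effort_hours(n_docs):
    """Up-front investment plus per-document validation time, in hours."""
    return UPFRONT_HOURS + n_docs * VALIDATE_SEC_PER_DOC / 3600.0


def manual_effort_hours(n_docs):
    """Manual categorization time, in hours."""
    return n_docs * MANUAL_MIN_PER_DOC / 60.0


def savings_hours(n_docs):
    """Positive when auto-categorization is cheaper than manual work."""
    return manual_effort_hours(n_docs) - auto_effort_hours(n_docs)


def break_even_docs():
    """Corpus size at which both approaches cost the same number of hours."""
    saving_per_doc = MANUAL_MIN_PER_DOC / 60.0 - VALIDATE_SEC_PER_DOC / 3600.0
    return UPFRONT_HOURS / saving_per_doc


if __name__ == "__main__":
    for n in range(5000, 50001, 5000):
        print(n, round(auto_effort_hours(n)), round(manual_effort_hours(n)),
              round(savings_hours(n)))
```

Dropping the validation step, as the text suggests an agency might, corresponds to setting `VALIDATE_SEC_PER_DOC` to zero, which lowers the break-even corpus size.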

Table A-1. Level of effort for auto-categorization and entity extraction.

| Activity | Description and Range of Effort | UDOT Level of Effort |
| --- | --- | --- |
| Content Collection | Assembling a collection of content that represents the target corpus for the effort. At UDOT, this was a time-consuming process. The level of effort for this will vary based on the nature of the target corpus and current content management practices. | 80+ hours |
| Content Analysis and Processing | Reviewing a sample of documents from the collection to identify variations in content types. Applying OCR to documents (if required). Converting documents to pure text files. Applying processes to remove stop words, punctuation, and boilerplate text (if applicable). Applying processes to lemmatize words to obtain word stems (if applicable). Resolving file conversion issues. Filtering out unreadable documents (poor OCR, non-searchable due to password protection, etc.). File conversion processes are generally automated and not time-consuming to run. Much of this effort is for manual review of files and resolution of issues. | 60 hours |
| Training Set Creation | Assembling a training set that can be used for machine learning or for manual rule development. If an existing sample of documents has already been categorized, then very little effort beyond coordination will be required. At UDOT, documents for different training sets had to be identified based on manual review of each document. | 40 hours |
| Auto-Categorization Machine Learning Model Development | Developing and testing a supervised learning (naïve Bayes) model for auto-categorization, with several cycles of iteration to test different model formulations and parameters. | 50 hours |
| Auto-Categorization Rule Development and Testing | Creating rules, developing Python scripts to apply the rules, and testing the rules. At UDOT, open source software was used; scripting would not be necessary if a commercial tool were employed. | 60 hours |
| Evaluating Auto-Categorization Results | Checking precision and recall and scoring results. At UDOT, precision and recall metrics were created for the last two iterations of machine learning and auto-categorization results. | 20 hours |
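As a rough illustration of two activities in Table A-1, the following Python sketch trains a very small supervised naïve Bayes categorizer and then computes precision and recall for one category. It is not UDOT's implementation: the function names, the add-one smoothing choice, the two category labels, and the toy training documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per category.

    labeled_docs is a list of (tokens, category) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the category with the highest log-probability score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, n_docs in doc_counts.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids taking log(0).
            likelihood = (word_counts[category][token] + 1) / (
                total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[category] = score
    return max(scores, key=scores.get)


def precision_recall(predicted, actual, category):
    """Precision and recall for one category from parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == category and a == category)
    fp = sum(1 for p, a in pairs if p == category and a != category)
    fn = sum(1 for p, a in pairs if p != category and a == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical, hand-made training set (already tokenized and lowercased).
TRAINING = [
    (["agreement", "party", "signature"], "Agreement"),
    (["agreement", "contractor", "party"], "Agreement"),
    (["inspection", "bridge", "report"], "Other"),
    (["report", "pavement", "inspection"], "Other"),
]
MODEL = train_naive_bayes(TRAINING)
```

In practice the predicted labels would come from running `classify` over a held-out set of manually categorized documents, and the resulting precision and recall would guide the next iteration of the model or rules.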

Table A-1. (Continued).

| Activity | Description and Range of Effort | UDOT Level of Effort |
| --- | --- | --- |
| Entity Extraction | Analyzing content to identify patterns in the presentation/formatting of entities to be extracted. Creating a range of regular expression patterns to identify the entities to be extracted. Writing Python scripts to load each text file, identify data items, and extract the data. Running the Python scripts to extract data. Calculating the precision with which the data can be found and extracted. Iterating and improving the Python script performance. At UDOT, five entities were extracted: PIN, project no., date, party, and tax ID. Accuracy rates varied and could be improved with additional effort. Again, use of commercial software could reduce the level of effort. | 100 hours |

Table A-2. Time savings potential for auto-classification.

| Number of Documents in Corpus | Auto-Categorization Effort, hours (250 hours + 30 sec per document) | Manual Categorization Effort, hours (2 min per document) | Potential Savings, person-hours |
| --- | --- | --- | --- |
| 5,000 | 292 | 167 | -125 |
| 10,000 | 333 | 333 | 0 |
| 15,000 | 375 | 500 | 125 |
| 20,000 | 417 | 667 | 250 |
| 25,000 | 458 | 833 | 375 |
| 30,000 | 500 | 1,000 | 500 |
| 35,000 | 542 | 1,167 | 625 |
| 40,000 | 583 | 1,333 | 750 |
| 45,000 | 625 | 1,500 | 875 |
| 50,000 | 667 | 1,667 | 1,000 |
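The regular-expression approach described in the Entity Extraction row of Table A-1 might look something like the sketch below. The report does not give the actual patterns or document formats, so every pattern here is an assumption: the PIN and project number label formats are invented for illustration, the date pattern covers only M/D/YYYY, and the tax ID pattern assumes the common two-digit, hyphen, seven-digit layout. Party names are omitted because they depend heavily on document layout.

```python
import re

# Hypothetical patterns; real documents would need patterns tuned to
# their actual formatting, as the table's iteration step implies.
PATTERNS = {
    "pin": re.compile(r"\bPIN\s*[:#]?\s*(\d+)", re.IGNORECASE),
    "project_no": re.compile(
        r"\bProject\s+No\.?\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b"),
    "tax_id": re.compile(r"\b(\d{2}-\d{7})\b"),
}


def extract_entities(text):
    """Return every match for each entity type found in the text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


if __name__ == "__main__":
    # Made-up sample sentence, not taken from any UDOT document.
    sample = ("PIN: 12345 Project No. F-0089 executed on 3/15/2019. "
              "Federal Tax ID 87-1234567.")
    print(extract_entities(sample))
```

Comparing the extracted values against a manually verified sample would give the per-entity precision that the table lists as part of this activity.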


Transportation Research Board, 500 Fifth Street, NW, Washington, DC 20001. ISBN 978-0-309-48166-3

Implementing Information Findability Improvements in State Transportation Agencies

With a quick search online, you can discover the answers to all kinds of questions. Findability within large volumes of information and data has become almost as important as the answers themselves. Without being able to search various types of media ranging from print reports to video, efforts are duplicated and productivity and effectiveness suffer.

The TRB National Cooperative Highway Research Program's NCHRP Research Report 947: Implementing Information Findability Improvements in State Transportation Agencies identifies key opportunities and challenges that departments of transportation (DOTs) face with respect to information findability and provides practical guidance for agencies wishing to tackle this problem. It describes four specific techniques piloted within three State DOTs.

Additional resources with the document include NCHRP Web-Only Document 279: Information Findability Implementation Pilots at State Transportation Agencies and three videos on the Washington State DOT Manual Modernization Pilot.
