The National Academies Press

Pages 17-29

From page 17...

... This chapter presents information about how the pilot projects were implemented and key lessons learned during the course of the pilots. This information can be used to replicate the techniques tested in the State DOT pilots.

From page 18...

... Implementing Information Findability Improvements in State Transportation Agencies

4. Identify and design a series of search filters or "facets" for the manuals website.

From page 19...

... Findability Techniques

The rule-based approach offers more flexibility than the ontology-driven approach. It can look for multiple terms related to the tag being assigned.
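
The rule-based idea can be illustrated with a minimal sketch, assuming a hypothetical rule table in which each tag lists several related terms. The tag names, terms, and the `assign_tags` helper are illustrative assumptions and are not taken from the pilots.

```python
import re

# Hypothetical rules: each tag lists several related terms (illustrative only).
RULES = {
    "pavement": ["pavement", "asphalt", "overlay"],
    "drainage": ["culvert", "storm drain", "drainage"],
}

def assign_tags(text, rules=RULES):
    """Return the sorted tags whose rule matches at least one term in the text."""
    lowered = text.lower()
    tags = []
    for tag, terms in rules.items():
        # Word boundaries keep a term from matching inside a longer word.
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            tags.append(tag)
    return sorted(tags)
```

Because each rule holds several terms, a document that mentions only "culvert" can still receive the "drainage" tag.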

From page 20...

... references to engineering manual sections to be consulted for preparation of certain deliverables. These references, created by the authors of the MDL, provide a much better basis for tagging than matching deliverable names to manual section text.

From page 21...

... computer discovers sets of words that appear together in the same document, the proximity of these words suggests potentially meaningful relationships among the words. Clusters of terms, therefore, can indicate the meaning of a document and can be used for classification.

From page 22...

... 2. Building the Model

Many algorithms are capable of clustering content.

From page 23...

...
• Words to exclude that don't contribute to understanding of meaning (e.g., months of the year); and
• Thresholds for removing rarely occurring and frequently occurring words from the model (e.g., ignore words that are in more than 70% or fewer than 5% of the documents).
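
The two settings above can be sketched as a vocabulary filter. This is a minimal illustration using only the Python standard library, not the tooling used in the pilots; the `build_vocabulary` function, its parameter names, and the tokenizer are assumptions, while the 70% and 5% defaults follow the example in the text.

```python
import re
from collections import Counter

# Exclusion list following the "months of the year" example in the text.
EXCLUDE = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}

def build_vocabulary(documents, max_df=0.70, min_df=0.05, exclude=EXCLUDE):
    """Keep words whose document frequency lies between min_df and max_df."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document (document frequency, not term count).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    return {
        word
        for word, count in doc_freq.items()
        if word not in exclude and min_df <= count / n_docs <= max_df
    }
```

A word that appears in every document, such as an agency name, is dropped by the upper threshold, and a word that appears in almost none is dropped by the lower one.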

From page 24...

... b. Create training sets for each new content type to be categorized.

From page 25...

... h. Specify the classes to be predicted by the model.

From page 26...

...
• The body of documents to be categorized is highly varied in use of vocabulary and format, such that the size of the training set would need to be fairly large in order to represent variations in the entire corpus;
• The body of documents to be categorized is not in machine-readable format or is not amenable to successful OCR due to poor image quality; and
• There are already good-quality metadata being manually assigned as part of established content management system or library intake processes.

Conversely, the most promising applications of text analytics for auto-categorization are those characterized by:
• A very large number of documents (e.g., over 5,000 and increasing)

From page 27...

... Measuring recall and precision is time-consuming. With a large corpus of documents, it is not practical to examine each document to determine whether or not it was correctly classified.
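
One common way to limit the review effort is to check a random sample by hand and compute precision and recall on that sample only. The excerpt does not say which method the pilots used, so the sketch below is an assumption; the function names and the dictionary layout (document id mapped to label) are hypothetical.

```python
import random

def sample_for_review(doc_ids, sample_size, seed=0):
    """Pick a reproducible random subset of documents for manual review."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    return rng.sample(ids, min(sample_size, len(ids)))

def precision_recall(predicted, actual, label):
    """Compute precision and recall for one label over the reviewed documents.

    predicted and actual map document id -> label; actual holds the
    reviewer's judgment for each sampled document.
    """
    tp = sum(1 for d in actual if predicted.get(d) == label and actual[d] == label)
    fp = sum(1 for d in actual if predicted.get(d) == label and actual[d] != label)
    fn = sum(1 for d in actual if predicted.get(d) != label and actual[d] == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Figures computed this way are estimates for the whole corpus, and their reliability depends on the sample size.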

From page 28...

... Lessons

The lessons included in Section 5.3 related to converting documents to text and OCR quality are equally applicable to entity extraction. Additional lessons for entity extraction are provided in the following sections.

From page 29...

... on the presence of standard phrasing and document structure and works best when the document type is relatively uniform. The lookup method is slower than the regular expressions method and less forgiving of variations but obtains cleaner results (due to matching with a master list).
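
The contrast between the two extraction methods can be sketched as follows. The entity type (route names), the pattern, and the master list are hypothetical and chosen only for illustration; the excerpt does not say which entities or patterns the pilots used.

```python
import re

# Hypothetical pattern for route names such as "SR 520" or "I-90".
ROUTE_PATTERN = re.compile(r"\b(?:SR|US|I)[- ]\d{1,3}\b")

# Hypothetical master list of known, valid route names.
MASTER_LIST = ["SR 520", "I-90", "US 2"]

def extract_by_regex(text):
    """Return every string that fits the pattern, whether or not it is valid."""
    return ROUTE_PATTERN.findall(text)

def extract_by_lookup(text, master_list=MASTER_LIST):
    """Return only the master-list entries that appear verbatim in the text."""
    return [
        name
        for name in master_list
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    ]
```

For the text "Work on SR 520 and I-90 continues; SR 999 is a typo.", the regular expression also returns the mistyped "SR 999", while the lookup returns only "SR 520" and "I-90". The lookup in turn misses any variant spelling, such as "SR-520", that is not on the list.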

This material may be derived from roughly machine-read images, and so is provided only to facilitate research.