Implementing the Freight Transportation Data Architecture: Data Element Dictionary


Suggested Citation: "Chapter 5 - Classifying Data Elements Across Databases." National Academies of Sciences, Engineering, and Medicine. 2015. Implementing the Freight Transportation Data Architecture: Data Element Dictionary. Washington, DC: The National Academies Press. doi: 10.17226/21910.


Chapter 5 - Classifying Data Elements Across Databases

5.1 Introduction

Freight data sources tend to be heterogeneous in terms of structure, syntax, and semantics (Buccella et al. 2003). Structural, or schematic, heterogeneity deals with differences in how the data is stored in the various databases (e.g., table schemas, primary and foreign keys). Syntactic heterogeneity deals with differences in the representation of the data; in other words, data types and formats (e.g., numeric, text, alphanumeric, or categorical values). Semantic heterogeneity, which is the most challenging to resolve, deals with differences in interpretation of the meaning of the data (Merriam-Webster 2014). Cui and O'Brien (2014) classify semantic heterogeneity as follows:

• Semantically Equivalent Concepts: Different models use the same or synonymous terms to refer to the same concept; however, there may be differences in property types (e.g., the concept weight may be presented in tons in one model but in kilograms in another model).
• Semantically Unrelated Concepts: Different models use the same terms, but the terms have different meanings (e.g., the concept channel may mean ship channel in the U.S. Waterway database but mean traffic channelization device in the Federal Railroad Administration [FRA] Safety database).
• Semantically Related Concepts: Concepts may become generalized as they are classified across models; for example, the city "Austin, Texas," in the Air Carrier Statistics database is referenced in the Commodity Flow Survey (CFS) as "Austin-Round Rock, Texas."

Resolving freight data heterogeneity across multiple data sources is required to facilitate the integration of data elements, enable interoperability between multiple systems, and smooth the exchange of data and information. Heterogeneity resolution first involves identifying which elements are related and which are not. When dealing with more than 6,300 data elements, however, this process can be tedious and time-consuming.
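The weight example given for semantically equivalent concepts can be illustrated with a minimal sketch. It assumes that each model labels its weight unit explicitly; the unit labels, the `KG_PER_UNIT` table, and the two sample records are hypothetical and are not drawn from any of the freight databases discussed in this chapter. Because "tons" is itself ambiguous, the sketch keeps the U.S. short ton and the metric ton as separate units.

```python
# Conversion factors to kilograms. The unit labels are assumptions made for
# this illustration; real data dictionaries may name their units differently.
KG_PER_UNIT = {
    "kg": 1.0,
    "metric_ton": 1000.0,
    "short_ton": 907.18474,  # U.S. short ton (2,000 lb)
}


def weight_to_kg(value: float, unit: str) -> float:
    """Convert a weight to kilograms so values from different models are comparable."""
    try:
        return value * KG_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown weight unit: {unit!r}") from None


# Two hypothetical records that express the same concept ("weight")
# with different property types.
record_a = {"weight": 2.0, "unit": "short_ton"}
record_b = {"weight": 1814.36948, "unit": "kg"}

kg_a = weight_to_kg(record_a["weight"], record_a["unit"])
kg_b = weight_to_kg(record_b["weight"], record_b["unit"])

# After harmonization the two values can be compared directly.
same_weight = abs(kg_a - kg_b) < 1e-6
```

Raising an error on an unknown unit, rather than guessing, keeps an unrecognized property type from being silently merged with the others.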
To address this problem, a general freight data classification system was developed to categorize similar elements within each database, thus facilitating the identification of related data elements across multiple data sources. Once related data elements have been identified, the process of determining the differences in data element definitions and resolving those differences through harmonization or statistical bridges becomes much clearer and better defined.

5.2 Background

A literature review identified practitioners' attempts to classify freight data, although no formal classification system currently exists for freight data elements. Specific applications and examples of how the classification schemes could be utilized for data integration and heterogeneity resolution across multiple freight data sources were not found. The classification schemes found in the review of the literature are described in this section.

To define key attributes of freight-related shipments, the TRB Committee on Freight Transportation Data (2003) coined the mnemonic CODMRT:

• Commodity, which describes the type of freight being moved and contains information such as value, weight, and handling characteristics.
• Origin, which describes the geographic starting point of a freight trip.
• Destination, which describes the geographic ending point of a freight trip.
• Mode, which describes the vehicles and infrastructure used to transport goods.
• Route, which describes the sequence of specific individual facilities (e.g., sections of roads, railroad tracks) that are used to transport freight between the origin and destination on a specific mode.
• Time, which is defined as the time period for which the freight data was collected (i.e., the freight forecast time period).

Ambite et al. (2004) classified data elements for multiple sources by representing each data item as a measurement that has values along a set of dimensions (e.g., geographic area, type of flow, mode of transport, type of product, time interval, value, and unit of measurement). The classification schemes for both CODMRT (2003) and Ambite et al. (2004) were found, however, to be limited to the commodity flow domain, and they do not capture elements from other freight data sources such as accident data and industry information.

Tok et al. (2011) developed a conceptual data structure for California that identified the relevant data set for a standardized national freight transportation data architecture. The high-level data elements defined in the data structure schema were time periods, time resolutions, zones, facility networks, commodities, modes, socioeconomic data, and logistics.
Time-resolution data elements include items such as annual, quarterly, monthly, and daily time periods. Zones include items such as states, gateways, foreign, and trade regions. Facility networks contain information such as highway geography, rail geography, and waterways. The socioeconomic category considers elements such as employment and population, and the logistics category considers elements such as time, emissions, energy consumption, and safety (Tok et al. 2011).

NCFRP Report 14: Guidebook for Understanding Urban Goods Movement (Rhodes et al. 2012) classifies freight data sources into the following categories:

• Freight node data, which represents consolidated or individual endpoints that generate or receive freight flows and are the key points of production, consumption, or intermediate handling for goods.
• Freight network data, which defines major route patterns and critical infrastructure being used to convey freight shipments through the various modal systems.
• Freight flow data, which provides information on commodity flows and provides insight on the economic and trade environment of regions. Typical commodity flow records will contain information on the origin and destination (O-D) of shipments, type of commodity, weight and/or value of the commodity shipment, and mode of shipment.
• Neighborhood freight data, which provides information on safety, congestion, land use, and emissions.

Although both Tok et al. (2011) and Rhodes et al. (2012) addressed a broader range of freight data types than CODMRT and Ambite et al. (2004), they did not state specific applications or examples of how the data structures can be mapped across data sources and utilized to resolve data heterogeneity across multiple data sources.

An XML schema such as TransXML (which was developed to support the exchange, interoperability, and dissemination of transportation data) also is limited in scope. TransXML currently addresses four key business areas in transportation: (1) survey/roadway design, (2) transportation construction/materials, (3) highway bridge structures, and (4) transportation safety; however, it excludes other areas specific to freight movement. Multimodal freight movements (air, marine, rail, and pipeline), economic and census data, industry information, and commodity flow data schemas cannot be addressed with the current version of TransXML, and other standards, such as LandXML (LandXML.org 2000), Geography Markup Language (GML), Geographic Information Framework Data Standard (Federal Geographic Data Committee 2008), and International Organization for Standardization (ISO) 14825:2011 Geographic Data Files (ISO 2014), were not developed specifically to address freight data.

As found in the literature search, freight data classification is mostly restricted to the commodity flow domain (e.g., CODMRT), and currently no agreed-upon classification system applies to all data elements from the various freight databases. Current data standards such as TransXML are limited in scope in terms of their representation of freight data, as they were developed mainly to address the exchange of transportation data and facilitate communication across multiple transportation industry stakeholders and agencies. The existing standards are inadequate to serve as a formal representation of the various data elements contained in multiple freight databases. For example, data elements that describe the freight industry, events that may occur during the transport of goods, and the role of human activity are currently not captured in these standards. A generalized framework for classifying freight data elements across multiple data sources is therefore proposed.
The proposed schema, called the Role-Based Classification Schema (RBCS), organizes and classifies data elements within their respective parent databases and facilitates the comparison, unification, translation, and fusion of data from multiple databases. The RBCS does not replace the existing standards; rather, it facilitates the process of identifying related data elements across a multitude of existing freight data sources. Data elements captured in the proposed schema can inform the future development of existing standards (e.g., TransXML) so that they adequately capture all the existing freight data sources in their respective schemas.

5.3 Methodology

In developing the generalized classification system for representing freight data elements across multiple databases, an attempt was made to identify and group elements with similar "roles." For purposes of NCFRP Project 47, a role was defined as "the kind of information conveyed by an element or attribute" in its database. The researchers found that the roles of data elements in their respective databases could be used as a means for developing the RBCS.

Two levels of classification groupings were identified: a top-level, primary grouping and a second-level grouping. The top-level grouping was based on an enumeration of multiple freight databases and the literature on freight data classification. Examples of freight data classification schema from the literature that were utilized in developing the top-level primary groups included those of CODMRT (2003), Ambite et al. (2004), and Tok et al. (2011). Data elements from freight data dictionaries were examined and a final list of nine top-level, primary groups was identified.

The second level of classification seeks to differentiate data elements that identify objects from data elements that describe the features of an object.
This distinction became necessary as some elements were found to define entities that tend to be unique, whereas other elements were found to provide additional information about those identified entities. For example, a data element such as "origin ID" refers to or identifies a particular place, and the data element "population" describes the number of inhabitants living in that place. The distinction between these two types of elements is that only one "origin ID" can refer to a particular place in a database, but multiple places can each have "population" numbers, which are not necessarily unique.

RBCS first determines and assigns a role to each data element within its database based on the primary and secondary level classification of that data element. Grouping data elements within their respective databases simplifies the process of identifying similar data elements across multiple databases, as similar elements tend to fall within the same group. To validate the generality of RBCS, the classification schema was applied to all the freight data sources included in the master data dictionary.

5.4 The RBCS Primary and Secondary Level Classifications

Nine top-level, primary groups were identified from examining the databases and reviewing the literature: commodity, event, human, industry, link, mode, place, time, and unclassified. Figure 5-1 illustrates the inherent relationships that persist between the various data elements despite their classification into different roles. Commodities (C) generated by the industry (I) are moved by various transport modes (M) from one place (P) to another (P) along the transportation network (L) within a time period (T). During the transport process, a chain of possible events (E) may occur that involve various stakeholders or individuals (H). The last category, "unclassified," forms part of a larger "virtual boundary" that contains elements that do not fit under any of the aforementioned roles but need to be accounted for to preserve data integrity. The nine primary groupings capture many kinds of information that could potentially be retrieved from a freight database (the validation of which is explained by demonstrating classification efficiency in the next section of this chapter).
Considering the possibility of other researchers identifying new roles in the future, however, the outline of the virtual boundary is drawn in dashed lines.

The second level of classification applies to all the above roles except the time and unclassified roles. This secondary classification seeks to separate elements that identify a known object from elements that describe the features of the object. For this purpose, data elements that identify objects are defined as identifiers, and examples include "origin," "destination," "road name," and "transport mode." Data elements that describe the features of an object are defined as features, and examples include "population," "area," "length," "unit train," and "number of carloads."

Figure 5-1. Schematic representation of the RBCS.
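The two-level scheme described in this section can be expressed as a small data structure. The following is a minimal sketch, not part of the report: the names `PrimaryGroup`, `SecondaryLevel`, `Role`, and `all_roles` are hypothetical, and the only rules encoded are the ones stated above (nine primary groups, two secondary levels, and no secondary level for the time and unclassified groups).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PrimaryGroup(Enum):
    COMMODITY = "Commodity"
    EVENT = "Event"
    HUMAN = "Human"
    INDUSTRY = "Industry"
    LINK = "Link"
    MODE = "Mode"
    PLACE = "Place"
    TIME = "Time"
    UNCLASSIFIED = "Unclassified"


class SecondaryLevel(Enum):
    IDENTIFIER = "Identifier"
    FEATURE = "Feature"


# Per the text, the second level does not apply to these two groups.
NO_SECONDARY = {PrimaryGroup.TIME, PrimaryGroup.UNCLASSIFIED}


@dataclass(frozen=True)
class Role:
    primary: PrimaryGroup
    secondary: Optional[SecondaryLevel] = None

    def __post_init__(self) -> None:
        # Reject combinations that the two-level scheme does not allow.
        if self.primary in NO_SECONDARY and self.secondary is not None:
            raise ValueError(f"{self.primary.value} takes no secondary level")
        if self.primary not in NO_SECONDARY and self.secondary is None:
            raise ValueError(f"{self.primary.value} needs Identifier or Feature")

    @property
    def label(self) -> str:
        if self.secondary is None:
            return self.primary.value
        return f"{self.primary.value} {self.secondary.value}"


def all_roles() -> list[Role]:
    """Enumerate every role that the two-level scheme allows."""
    roles = []
    for group in PrimaryGroup:
        if group in NO_SECONDARY:
            roles.append(Role(group))
        else:
            roles.extend(Role(group, level) for level in SecondaryLevel)
    return roles
```

Under these rules the scheme yields 16 distinct roles (seven groups with two levels each, plus Time and Unclassified). Validating the combination in the constructor is one way to support the consistency that the classification process depends on.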

From the nine primary and two secondary classification groups, the following classification groups (or roles) were developed:

• Time elements, which provide information regarding either the exact time period (e.g., year, month, time, day) or duration (e.g., seasons, quarter, biannual) for which the data is being reported or the freight movement occurred.
• Place elements, which identify or describe the O-D of freight movement or the location of an event (e.g., an accident), or which may provide information relating to the characteristics of the place.
  - Place identifier (e.g., city name, state, origin county name, destination country name, accident location). For geospatial databases, this can either be points or polygons.
  - Place feature (e.g., area, population).
• Commodity elements, which identify or describe a commodity being moved.
  - Commodity identifier (e.g., Standard Transportation Commodity Codes [STCC], Standard Classification of Transported Goods [SCTG] commodity codes, Harmonized System codes, hazardous material).
  - Commodity feature (e.g., liquid, bulk, value, weight, trade type).
• Link elements, which identify or describe information about the roadways, waterways, routes, etc., on which freight is moving.
  - Link identifier (e.g., a roadway name, a waterway name).
  - Link feature (e.g., width, length, from, or to).
• Mode elements, which identify or describe the vehicles involved in the movement of freight.
  - Mode identifier (e.g., truck, rail, air, vessel).
  - Mode feature (e.g., unit train, vehicle class, number of trucks).
• Industry elements, which identify or describe fields that report on economic activities.
  - Industry identifier (e.g., North American Industry Classification System [NAICS] codes, Standard Industrial Classification [SIC] codes, company name).
  - Industry feature (e.g., number of employees, sales, annual payroll).
• Event elements, which identify or describe occurrences or actions that occur when freight is being moved.
  - Event identifier (e.g., an accident report number, a dredging operation, or a port call).
  - Event feature (e.g., number of fatalities as a result of an accident; depth of dredge; or number of port calls).
• Human elements, which identify or describe a person involved in a data record.
  - Human identifier (e.g., investigating officer, reporting agent, or contact person).
  - Human feature (e.g., drunk driver, driver age, or operator condition).
• Unclassified elements, which present a unique proposition in that some databases report additional information about the datasets themselves (e.g., expansion factors applied to the dataset, empty fields). By themselves, these fields do not necessarily describe freight movement, but they can provide information that is useful when performing data analysis. Examples of unclassified elements include record IDs, primary keys, comment fields, record modification dates, metadata, and administrative ID fields.

5.5 Validation

To validate the generality of RBCS, the schema was applied to all 6,322 data elements from the 28 public and commercial freight data dictionary sources. Table 5-1 illustrates the application of RBCS to data tables from five sources using all the possible roles. To quantify the ability of the proposed roles to classify data elements, this process was repeated for all the data sources. For each source, the number of elements that were successfully classified using the defined classification roles was counted, and the ratio (classification efficiency) of classified elements to the total number of elements in a data source was calculated, as follows:

\[
\text{Classification efficiency} = \frac{\text{Classified elements}}{\text{Classified elements} + \text{Unclassified elements}}
\]

Table 5-1. How RBCS groups data elements across databases.

| Element Role | FAF3 | Public Use Carload Waybill Sample | Air Carrier Statistics (all carriers) | HPMS | U.S. Waterway Foreign Cargo Inbound and Outbound Data |
|---|---|---|---|---|---|
| Time | Year | Waybill Date; Accounting Period; Waybill Reporting Period Length | Year; Quarter; Month | Year of Last Improvement; Year of Last Construction | Year |
| Place Identifier | Foreign Region Origin; Domestic Region Origin; Domestic State Origin; Domestic Region Destination; Domestic State Destination; Foreign Region Destination | Inter/intra State Code; Origin BEA Area; Origin Freight Rate Territory; Interchange State #1; Interchange State #2; Interchange State #3 | OriginAirportID; OriginCityName; OriginStateFips; OriginStateName | Urban Code; County Code; Climate Zone | U.S. Port Code; U.S. Port Name; U.S. State; Foreign Port Schedule K Code; Foreign Port Code; Foreign Port Name; U.S. Coastal District |
| Place Feature | - | - | - | - | Longitude of Foreign Port; Latitude of Foreign Port |
| Link Identifier | - | - | - | Functional System; Route Number; Alternate Route Name; ... | Waterway Code |
| Link Feature | - | Estimated Short Line Miles; Number of Interchanges | Distance Between Airports | Facility Type; Structure Type; Access Control; Ownership; Speed Limit; ... | - |
| Mode Identifier | Foreign Inbound Mode; Domestic Mode; Foreign Outbound Mode | - | - | - | - |
| Mode Feature | - | Number of Carloads; Car Ownership Category Code; AAR Equipment Type Code; AAR Mechanical Designation; STB Car Type; TOFC/COFC Service Code; ... | CarrierGroup; CarrierGroupNew; DistanceGroup; Class | - | - |
| Commodity Identifier | Commodity (SCTG) | Commodity Code (STCC) | - | - | Lock Performance Monitoring System Commodity Code |
| Commodity Feature | Type of Trade; Value; Weight; Ton-Miles | Billed Weight; Actual Weight; Freight Revenue ($); ... | - | - | Tonnage; Type Processing |
| Event Feature | - | - | - | - | - |
| Industry Identifier | - | - | UniqueCarrier; AirlineID; UniqueCarrierName; UniqCarrierEntity; ... | - | - |
| Unclassified | - | Subsample Code; Exact Expansion Factor; Theoretical Expansion Factor | DataSource | - | - |

AAR = Association of American Railroads; BEA = U.S. Bureau of Economic Analysis; FAF3 = Freight Analysis Framework, version 3.
Note: A dash (-) indicates "not applicable."

Table 5-2 shows the classification efficiency of all 28 data sources. In general, RBCS is found to yield high classification efficiencies. Of the 28 data sources, 12 had a classification efficiency of 100% and six had values ranging between 95% and 100%. Seven data sources had values ranging between 80% and 95%. Some of the lower classification efficiencies can be attributed to the small total number of elements in the respective databases. As an example, the National Corridors Analysis and Speed Tool (N-CAST) database had 17 classified elements out of a total of 21 elements, which resulted in a classification efficiency of 81%. Data sources such as the County Business Patterns, Service Annual Survey, and Vehicle Travel Information System were found to have a significant number of "noise flag," or metadata-related, data elements.

Table 5-2. Classification efficiency of freight data sources.

| Database Name | RBCS Classified | Unclassified | Classification Efficiency |
|---|---|---|---|
| Air Carrier Statistics | 500 | 4 | 99% |
| Air Carrier Financial Reports | 478 | 0 | 100% |
| Annual Survey of Manufacturers | 62 | 0 | 100% |
| Border Crossing/Entry | 5 | 0 | 100% |
| CTA Intermodal Terminals Database | 11 | 1 | 92% |
| Carload Waybill Sample | 252 | 0 | 100% |
| Commodity Flow Survey | 18 | 0 | 100% |
| County Business Patterns | 190 | 132 | 59% |
| Fatality Analysis Reporting System | 310 | 0 | 100% |
| Federal Railroad Administration Safety Database | 414 | 89 | 82% |
| Foreign Trade | 362 | 27 | 93% |
| Freight Analysis Framework | 68 | 2 | 97% |
| Highway Performance Monitoring System | 117 | 0 | 100% |
| IHS Transearch | 30 | 0 | 100% |
| Motor Carrier Management Information System | 314 | 44 | 88% |
| Motor Carrier Safety Measurement System | 28 | 4 | 88% |
| National Agricultural Statistics Service | 38 | 0 | 100% |
| National Corridors Analysis and Speed Tool Database | 17 | 4 | 81% |
| North American Transborder Freight Database | 60 | 6 | 91% |
| Pipeline and Hazardous Material Safety Administration | 32 | 1 | 97% |
| Service Annual Survey | 6 | 22 | 21% |
| Survey of Business Owners | 198 | 0 | 100% |
| Topologically Integrated Geographic Encoding and Referencing | 475 | 8 | 98% |
| U.S. Waterway Data | 263 | 3 | 99% |
| Vehicle Inventory and Use Survey | 241 | 1 | 100% |
| Vehicle Travel Information System | 154 | 53 | 74% |
| Woods and Poole Economics, Inc. | 1240 | 0 | 100% |

5.6 Model Limitation

A limitation of RBCS is the need for consistency during the classification process. When ambiguity exists, this is usually resolved by critically examining data element definitions to determine an element's role and ensure it has been classified consistently throughout the process. For example, if the data element "trade type" is assigned to the role "Commodity Feature" in one database, the same role should be applied to similar "trade type" data elements in subsequent databases. This is important because if data elements like "trade type" are assigned to one role ("Commodity Feature") in one database but to a different role (e.g., "Event Identifier") in another database, the data elements will not be uniformly grouped, which makes it difficult to find and analyze similar elements.

Once the data elements have been carefully grouped across multiple databases, any decision to change an element's role can be made easily. Being consistent in the initial classification process facilitates future changes.

Another limitation of RBCS may be the limited number of roles defined. The validation process in this study revealed that the nine primary groupings essentially capture the majority of the data elements from the databases tested. However, future enumerations of other databases may result in the identification of additional roles.
For example, the "Place Identifier" role could be further expanded to "Point of Origin" and "Point of Destination" roles, and the "Commodity Feature" role could be further expanded to differentiate between a "Commodity Unit of Measure" (e.g., tons, value) and a commodity feature such as "Trade Type" (e.g., import or export). Considering the possibility of additional primary and subordinate roles, the "virtual boundary" described in Figure 5-1 provides an opportunity for future iterations of this classification schema.

5.7 Application

In this study, RBCS was used to identify similar data elements and bridge differences in their definitions. For example, under the "Place Identifier" role in Table 5-1, data elements that identify places are defined in the FAF3 data dictionary as "Foreign region origin, Domestic region origin, etc." and in the Carload Waybill Sample dictionary as "Inter/Intra State Code, Origin BEA Area, Origin Freight Rate Territory, etc." This example demonstrates a case of semantic heterogeneity in "Place" between the FAF and Carload Waybill Sample databases. Examination of these elements in isolation helps researchers formalize the process of identifying and addressing semantic heterogeneity. Once similar data elements have been identified, the subsequent process of mediating elements from those databases becomes much clearer, especially when dealing with hundreds of data elements across diverse databases.
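The grouping step described in this section, together with the classification efficiency ratio from Section 5.5, can be sketched as follows. This is an illustration under stated assumptions rather than the tooling used in the study: the `(database, element, role)` triples are a small hand-picked subset of Table 5-1, and the function names `group_by_role` and `classification_efficiency` are hypothetical.

```python
from collections import defaultdict

# (database, element, role) triples; a small subset drawn from Table 5-1.
CLASSIFIED = [
    ("FAF3", "Domestic Region Origin", "Place Identifier"),
    ("FAF3", "Domestic Mode", "Mode Identifier"),
    ("Carload Waybill Sample", "Origin BEA Area", "Place Identifier"),
    ("Carload Waybill Sample", "Number of Carloads", "Mode Feature"),
    ("Carload Waybill Sample", "Subsample Code", "Unclassified"),
    ("Air Carrier Statistics", "OriginCityName", "Place Identifier"),
]


def group_by_role(rows):
    """Build a role -> database -> [element names] mapping."""
    grouped = defaultdict(lambda: defaultdict(list))
    for database, element, role in rows:
        grouped[role][database].append(element)
    return grouped


def classification_efficiency(classified: int, unclassified: int) -> float:
    """Ratio of classified elements to all elements in a data source."""
    total = classified + unclassified
    if total == 0:
        raise ValueError("data source has no elements")
    return classified / total


grouped = group_by_role(CLASSIFIED)

# Elements that share a role across databases are candidates for mediation.
for database, elements in sorted(grouped["Place Identifier"].items()):
    print(f"{database}: {', '.join(elements)}")

# N-CAST row of Table 5-2: 17 classified, 4 unclassified.
ncast_percent = round(classification_efficiency(17, 4) * 100)
```

Keying the mapping by role first makes the cross-database comparison a single lookup, which mirrors how Table 5-1 lines up the "Place Identifier" elements of FAF3 and the Carload Waybill Sample.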
