Appendix D
A Framework for the Development of National Freight Data
Dissenting Statement of Kenneth D. Boyer
This committee was charged with recommending “a framework for the development of national freight data … The framework was to be conceptual in nature and not a detailed data collection plan. Instead, it would articulate the types of freight data needed by the variety of users in transportation and the roles of different data providers.” The intellectually defensible way to come up with such a recommendation is to include a balanced discussion of the benefits of different ways of collecting freight data, along with a discussion of the limitations, constraints, problems, costs, and characteristics of each way of collecting these data. Once the analysis of the problem is laid out in this way, the recommendation should follow from the analysis. The recommended framework should offer the greatest benefits within the constraints identified by the analysis.
The majority of the committee in their draft of Chapter 3 did not do this, but rather focused on the benefits of improved data. The committee neglected to give equal weight to the discussion of the practical limitations of different data collection methods. By contrast, this appendix, representing a minority dissent from the discussion in Chapter 3, lays out an analysis of the practical realities of freight data collection, focusing on three issues: (a) the size of the database called for in Chapters 1 and 2, (b) confidentiality issues, and (c) the need for judgmental data fusion to create the database. This appendix then uses this analysis of the practical constraints implicit in freight data collection to propose a framework for freight data development. This framework is offered not as the only way to achieve the goals of Chapters 1 and 2; it may be discovered that there are other, better ways. It does, however, offer an intellectually coherent recommendation for a framework for freight data development that is missing in Chapter 3. Using the framework, this appendix shows that Chapter 3 errs in several key areas, among them the following:
- Failure to recommend a procedure for dealing with confidentiality issues,
- Confusion on how data series like Waterborne Commerce of the United States and the 1 percent Railroad Waybill Sample should be owned and managed in relation to the new proposed data collection efforts,
- Failure to clearly define the role of third-party data organizers, and
- An apparent recommendation to shift resources away from the current Commodity Flow Survey (CFS) in favor of surveying other participants in the supply chain.
THREE PROBLEMS INHERENT IN CODMRT DATA
Chapters 1 and 2 of this report give a rationale for pushing freight data collection in the direction of extremely fine descriptions of freight traffic flows. The fundamental reason for moving in this direction is to support infrastructure investments that either mitigate congestion or promote regional economic development. Currently data are not available at such a fine disaggregation, but accurate measurement of the benefits of infrastructure investment requires data describing CODMRT, that is, commodity, origin, destination, mode, route, and time of day. This report recommends that a national freight data collection program have as its goal the provision of these ideal data.
The difficulty of this program should not be underestimated. For at least three reasons, there is no example in the world where such a national database has been created. First, the sheer magnitude of the data is almost beyond comprehension. For example, if we assume a modest 1,000 commodities, 1,000 origins, 1,000 destinations, 1,000 routes, 5 modes, and 4 times of day, a database that described each of the elements would have 20 trillion entries, and even this level of detail is too coarse to support a decision on whether to replace a bridge on a particular highway over a particular river. A national program of data collection to support all possible infrastructure investments would be populated by quadrillions of data points—clearly far more than can be published in any form. Realistically, a CODMRT data collection program will consist of a combination of investigations of data for specific projects along with a publication of summaries at a much coarser level of aggregation.
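The 20 trillion figure can be checked with a few lines of arithmetic. The sketch below simply multiplies the illustrative dimension sizes given above; the dictionary keys and the function name `codmrt_cells` are hypothetical labels introduced here, not part of any existing data program.

```python
from math import prod

# Illustrative dimension sizes taken from the paragraph above.
dimensions = {
    "commodities": 1_000,
    "origins": 1_000,
    "destinations": 1_000,
    "routes": 1_000,
    "modes": 5,
    "times_of_day": 4,
}


def codmrt_cells(dims):
    """Number of cells in a fully crossed table with the given dimension sizes."""
    return prod(dims.values())


total_cells = codmrt_cells(dimensions)
print(f"{total_cells:,} cells")  # 20,000,000,000,000 cells, i.e., 20 trillion
```

Because the count is a product, refining any one dimension multiplies the whole table: moving from 1,000 to 10,000 routes alone would raise the total tenfold.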
Almost all of the entries in a CODMRT database will be zero, but such is the nature of transportation data—one should not expect to find coal shipped to Newcastle or wheat shipped from Manhattan by any mode or route, much less by rail to Fargo, North Dakota, via I-20 through Shreveport at 3 in the afternoon. But the thinness of the data creates a second problem—that of confidentiality. Even before routes and time of day are added, and even at origins and destinations defined as states, for commodities that are defined more finely than broad aggregates, most of the C-O-D (commodity-origin-destination) entries in the CFS are suppressed, since fewer than four shippers are represented by the data. Once we add mode, route, and time of day to the data description, we can confidently predict that confidentiality requirements will prevent publication of the result in the large majority of cases where CODMRT data are nonzero.
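The suppression rule described above can be sketched as a simple filter on aggregated cells. This is a minimal illustration, assuming that each record carries a shipper identifier and that a cell is publishable only when at least four distinct shippers contribute to it, as the text states; the record layout, the sample values, and the function name `publishable_cells` are hypothetical and do not describe the actual CFS disclosure procedure.

```python
from collections import defaultdict

# Threshold taken from the text: cells representing fewer than four shippers are suppressed.
MIN_SHIPPERS = 4


def publishable_cells(records, min_shippers=MIN_SHIPPERS):
    """Sum tons by (commodity, origin, destination); suppressed cells map to None."""
    tons = defaultdict(float)
    shippers = defaultdict(set)
    for r in records:
        cell = (r["commodity"], r["origin"], r["destination"])
        tons[cell] += r["tons"]
        shippers[cell].add(r["shipper"])
    return {
        cell: (tons[cell] if len(shippers[cell]) >= min_shippers else None)
        for cell in tons
    }


# Made-up records: four steel shippers and two machinery shippers on the same state pair.
records = [
    {"commodity": "steel", "origin": "IN", "destination": "MO", "shipper": s, "tons": 100.0}
    for s in ("A", "B", "C", "D")
] + [
    {"commodity": "machinery", "origin": "IN", "destination": "MO", "shipper": s, "tons": 50.0}
    for s in ("A", "B")
]

table = publishable_cells(records)
# The steel cell is published; the machinery cell is suppressed (None).
```

Adding mode, route, and time of day to the cell key can only split the same shippers across more cells, which is why suppression becomes more frequent as the description becomes finer.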
The third and most difficult problem with CODMRT data is that, as a general rule, there is no agent who could be surveyed who knows what is moving between an origin and a destination by a mode and route at any time of day. A shipper fills a container and passes it to a carrier with a contract to deliver it to a particular destination. The receiver can verify that what was received is what was contracted for but may not know the origin of the shipment. The carrier likely will not know what has been hauled beyond a very general description. Neither the shipper nor the receiver will know the route, and in the case of motor carriage, the route and time of day may be known only to the driver, who may be ignorant not only of the contents of the truck, but also of the origin and ultimate destination.
In the future, national security concerns may require that all freight be shipped on a freight bill specifying everything in the CODMRT sextuple, and it is worthwhile monitoring developments in this area to see whether database development motivated by national security concerns can be tapped as a source of transportation data. Until such time as all freight movement requires freight bills, a framework different from simple sampling will have to be used to get the desired data.
These three issues—the vastness of the implied database, the thinness of the individual flows (leading to confidentiality concerns), and the fact that in the general case there is no individual who could fill out a survey to populate single entries in the database—dictate the proposed framework for freight data collection. In fact, it is perhaps misleading to characterize the gathering of CODMRT data as data collection, since such data will require joint inference from records contained in more than one data set.
DATA FUSION
The name “data fusion” will be given to the inference of flow characteristics from data contained in more than one data set, no one of which contains all of the information of interest. For example, one data set may contain records of 20 shipments of coiled sheet steel from Gary, Indiana, to St. Louis. After applying expansion factors derived from the sampling rate of steel firms in Gary, we might infer from these 20 records that 60,000 tons of coiled sheet steel is shipped from Gary to St. Louis annually. We might also have 10 records of machinery from Gary to St. Louis and, by applying expansion factors, we might infer that these represent 60,000 tons annually. From a separate database, we might get the information that among the shortest routes between the two cities, I-55 through Springfield has twice as much traffic as the I-57–I-70 route through Effingham, Illinois. This does not, however, mean that 40,000 tons of coiled sheet were shipped from Gary to St. Louis via Springfield and 20,000 tons of coiled sheet were shipped via Effingham, that the proportion of machinery was the same on the two routes, or that the diurnal flow of steel or machinery between Gary and St. Louis will match that of traffic generally.
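The arithmetic behind this example can be sketched as follows. The per-record tonnage (30 tons) and the expansion factor (100) are made-up values chosen only so that 20 records expand to the 60,000 tons used in the text; the function names are likewise hypothetical. The proportional split shown is the unwarranted inference described above, included to make clear what assumption it rests on, not a recommended method.

```python
def expand(sample_tons, expansion_factor):
    """Scale sampled tonnage to an annual estimate using a single expansion factor."""
    return sum(sample_tons) * expansion_factor


def naive_route_split(total_tons, route_traffic):
    """Split a flow across routes in proportion to general traffic.

    This assumes the commodity behaves like traffic generally, which is
    the assumption the text warns cannot be taken for granted.
    """
    total_traffic = sum(route_traffic.values())
    return {route: total_tons * t / total_traffic for route, t in route_traffic.items()}


# Hypothetical sample: 20 records of 30 tons each, each standing for 100 similar shipments.
steel_sample_tons = [30] * 20
EXPANSION_FACTOR = 100
steel_annual_tons = expand(steel_sample_tons, EXPANSION_FACTOR)  # 60,000 tons

# Relative traffic from the text: I-55 carries twice the traffic of I-57/I-70.
route_traffic = {"I-55 via Springfield": 2, "I-57/I-70 via Effingham": 1}
naive = naive_route_split(steel_annual_tons, route_traffic)
# naive gives 40,000 tons via Springfield and 20,000 tons via Effingham,
# but only under the assumption that steel follows the general traffic pattern.
```

A different fuser could replace `naive_route_split` with an assignment based on truck type or time of day and obtain a different, equally defensible estimate from the same inputs.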
More accurate inferences will be made if one uses the original records or microdata. For example, if one has records of the individual enclosed vans and steel-hauling trucks between the two cities, it may be possible to assume that steel-hauling trucks will match the time and route pattern of steel coil shipped between the two cities, while machinery will have route and time patterns of enclosed vans. It is important to recognize that data fusion is not the same as record matching: it is extremely unlikely that a shipment of steel or machinery from Gary to St. Louis will be sampled as a shipment and again sampled as a movement on the truck.
Data fusion is not simply a matter of getting a consistent definition of commodities, origins, destinations, modes, routes, and times of day so that the “data silos” can be merged. Data fusion instead involves assumptions and judgment about matching records of steel shipments against records of truck movements, perhaps none of which will represent shipments of steel.
As noted by Southworth (1999), data fusion is not a mechanical process. All data fusion makes assumptions about how, for example, flows should be assigned to routes or how total shipments from an origin should be assigned to destinations. The accuracy of data fusion is then dependent on the assumptions made by the data fuser. Two different data fusers could make two different estimates of the CODMRT and both would be credible, depending on the different traffic assignment assumptions made by the two practitioners. The result of CODMRT estimations will then not be data of the kind found in the CFS, for example, with known accuracy dependent on the sampling rate in the sample frame. Instead, fused data should be considered conditional estimates, dependent on the appropriateness of the model and accuracy of the assumptions made by the data fuser. The results of a fusion of a movement database with a shipment database are not data in the sense that individual survey responses are data. They are estimates, interpolations, or forecasts and are only as good as the assumptions and judgment of the data fuser.
SEPARATING DATA COLLECTION AND DATA FUSION
Since CODMRT data require fusion of different data sets and since the method of fusion depends on assumptions made by the fuser, innovation and experimentation should be encouraged. The marketplace should then be offered different products made by different organizations employed to query the data archive. After some time, it is reasonable to assume that the market will find some fusion assumptions better than others, though improvements in methods may take years to develop. The encouragement of multiple ways of fusing data argues in favor of data fusion being primarily a private initiative rather than a government enterprise. However, the result of data fusion will be most accurate if the fuser works with individual records, in violation of confidentiality restrictions. Key to the framework allowing the assembly of CODMRT data will be the ability to solve the twin problems of maintaining confidentiality and encouraging imagination and innovation in data fusion.
The U.S. Department of Transportation’s Bureau of Transportation Statistics (BTS) has experience in maintaining a confidential database of transportation records. The flagship product of BTS has been the CFS. The CFS will be at the heart of any movement toward CODMRT data assembly. However, most freight data in the United States are collected outside of BTS. There are several modal data sources, for example the 1 percent waybill sample collected by the Association of American Railroads and the U.S. Army Corps of Engineers’ Statistics of Waterborne Commerce. The Reebie Transearch data are known to use proprietary truck data in conjunction with the CFS. Individual cities collect cordon count data, in which truck types and volumes are counted on individual highway segments. Import data are in the process of coming online, and intelligent transportation system (ITS) data are collected as trucks pass weigh stations in multiple states. A data fuser would wish to use the individual record microdata from all of these sources to estimate CODMRT data, something that jurisdictional boundaries and confidentiality problems now preclude.
To facilitate the creation of CODMRT data, BTS should become a depository or real-time archive of all forms of freight microdata. BTS should then allow data fusers to access these data if they follow confidentiality rules on publication. BTS should also be in a position to provide incentives to data gatherers to place their microdata in the BTS archive. One possible way to do this is to forbid data fusers who have access to confidential data from the BTS archive from fusing data from proprietary sources outside of the BTS archive. In this way, for example, an organization that continued to use proprietary data outside of the BTS archive would create products that were less accurate than other data sources since it would not have access to microdata in the BTS archive; organizations would then have an incentive to place their proprietary data in the BTS archive so that data fusers using the data could have access to other confidential data as well. Similarly, localities interested in understanding metropolitan freight flows would have an incentive to contribute local data in order to have access to the microdata in the BTS archives. If proprietary data are more valuable to the collector when they are fused with confidential data in the BTS archives, owners of proprietary data will have an incentive to contribute their data set, thus augmenting the whole.
FREIGHT DATA ADVISORY BOARD
The larger the BTS data archive, the more valuable it will be. If it reaches a critical size, it is reasonable to expect that data collectors, for example the Association of American Railroads and the U.S. Army Corps of Engineers, will voluntarily add their data so that data fusers can get access to the confidential records in it. Success will require an impartial hand overseeing relationships among the archive holder (presumably BTS); data fusers, who would have access to the confidential data in the archive; and the data fusers’ customers, who would not have access to confidential data. BTS is not appropriate as an overseer since it will also participate in the process as data archivist. This oversight should instead be given to the Freight Data Advisory Board, composed of representatives of data generators (modes, shippers, EZ-Pass and similar organizations, etc.), data users (including third-party data fusers and their customers), and government data organizers (BTS and state and metropolitan statistical agencies).
The Freight Data Advisory Board should define the division of tasks between the data archive maintainer (BTS) and the data fusers. BTS should not be precluded from publishing summaries of data in the data archives that meet confidentiality conditions, as it now does. BTS would also be expected to impute values of missing data from single data sets using generally accepted methods. However, imputing information about freight flows from combinations of data sets in the archives would be the primary task of independent third-party data fusers, certified as qualified to view confidential data but forbidden to disclose it to any parties outside of BTS. The data summaries provided by data fusers should also meet the criteria for confidentiality that BTS must abide by. In order to maintain confidentiality, the Freight Data Advisory Board should publish guidelines for systematic aggregation criteria to mask activities of individual shippers. In fact, deciding how to aggregate the very thin CODMRT flows to ensure maximum usefulness of route-level data while maintaining confidentiality will be one of the first and most important tasks for the advisory board.
The output of data fusion will in most cases be estimates of the commodities shipped on a particular transportation link at specific times of day, tagged by origin and destination. There are far too many CODMRT combinations for these data to be published on a national basis, but estimates can be expected to be made for specific local projects. As noted previously, the probability is very high that there will be fewer than four shippers of any specific commodity on a specific route at a specific time of day between a specific origin and destination, and thus confidentiality rules will prevent publication of the data. One way around this problem is for the Freight Data Advisory Board to develop rules to facilitate discussions between data fusers and shippers to waive confidentiality requirements where appropriate.
BRINGING NEW DATA COLLECTION PROGRAMS INTO THE DATA ARCHIVE
BTS, while not invited to be a data fuser, would be expected to be both a creator of data, as in the CFS, and an archivist of data collected by other organizations. The most promising external data sources to be included in the BTS archive are existing electronically collected passive data streams. Chief among these are ITS data that track trucks with appropriate transponders as they cross the country. Not all motor carriers or private truckers have ITS devices, so the data collected cannot be considered a random sample and cannot be used directly as CODMRT data even if commodity identity were collected by the transponders. However, these data should be a rich source of routing information, which could be fused with traditional CFS data covering commodities, origins, and destinations. The International Trade Data System is also a promising source of import data to fill the one significant coverage gap in the CFS. Passively collected data also have the promise of more timely data availability than has been possible in the past.
The Freight Data Advisory Board should also advise BTS on the desirability of starting new data collection efforts to augment the CFS and the data programs for which it acts as an archivist. One promising source of CODMRT data is roadside surveys like those conducted in Canada in which trucks are stopped randomly and the driver is asked to give information on routing, commodity, origin, and destination. If legal authority were found for collecting data in this way, a pilot study might be justified to determine the feasibility, acceptability, and cost-effectiveness of collecting CODMRT data through roadside surveys.
Less promising, but also worthy of consideration by the Freight Data Advisory Board for new data collection efforts to be undertaken by BTS, would be surveys of other participants in the logistics system— receivers, carriers, distributors, and so forth. Actively collected surveys like these are expensive and, like the CFS, do not provide timely information. Surveys of other supply chain participants should only be undertaken if they have information that is not available from other sources and if a sampling frame can be established that permits a random sample.
The CFS survey of shippers provides a long time line of COD data, thus permitting trend analysis. A shift of resources away from shipper surveys to paper surveys of other parties in the supply chain risks shrinking the CFS sample size, thus jeopardizing the reliability and usefulness of the CFS. If the Freight Data Advisory Board is attracted by the prospect of traditional survey instruments for parties other than shippers, pilot projects should be undertaken to ensure the feasibility, acceptability, and cost-effectiveness of these other survey types.
SUMMARY
Chapters 1 and 2 of this report recommend that freight data collection move in the direction of making credible estimates of freight flows specified by commodity, origin, destination, mode, route, and time of day. A framework that supports this goal must provide a mechanism for maintaining confidentiality of information while providing access to individual records of shipments and movements, the combination of which will be necessary if these finely defined transportation data are to be assembled. This appendix recommends the establishment of a Freight Data Advisory Board advising BTS. BTS would be charged with overseeing a freight data archive composed initially of existing databases augmented with passively collected electronic transportation data. The data archive will then be queried by a separate group of third-party data fusers, whose job will be to combine data sets in the archive by using their own assumptions about the data generation process and to create reports under contract to data users. While the data fusers would have access to confidential data in creating the reports, their output would be subject to the same confidentiality conditions as we now have. The relative roles of the data archive holder and the data fusers would be arbitrated by the Freight Data Advisory Board.
REFERENCE
Southworth, F. 1999. The National Intermodal Transportation Data Base: Personal and Goods Movement Components (draft). Oak Ridge National Laboratory, Oak Ridge, Tenn.