

Suggested Citation: "Appendix D Project Documentation Review Findings." National Academies of Sciences, Engineering, and Medicine. 2020. Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making. Washington, DC: The National Academies Press. doi: 10.17226/25965.



Appendix D: Project Documentation Review Findings

In several instances, information was not drawn from agency surveys or interviews; rather, detailed information on the emerging transportation technology project was available in materials produced as part of the project documentation. The team reviewed documentation from four such projects:

- City of Columbus, Smart Columbus
- New York City Department of Transportation Connected Vehicle Pilot Project
- Tampa Hillsborough Expressway Authority (THEA) CV Pilot Project
- Wyoming Department of Transportation Connected Vehicle Pilot Project

Summaries of the general findings from the documentation reviews for each project are provided herein.

D.1 City of Columbus

Since winning the USDOT's Smart City Challenge in 2016, the city of Columbus has developed a range of Smart City initiatives under the "Smart Columbus" project. These initiatives include smart mobility hubs, connected vehicles, connected autonomous vehicles, event parking management, and mobility assistance for people with cognitive disabilities. Smart Columbus has developed advanced capabilities in its cloud storage-based data management practice, both in its mature internal data lake and in its externally facing data delivery platform, "Smart Columbus OS."

At the time of the documentation review, Smart Columbus OS hosted 1,055 public datasets. Some of these datasets originated within the Smart Columbus projects themselves, while others were from external sources that applied to have their data added. All datasets went through a detailed vetting process in which a Smart Columbus data curator analyzed the data for quality issues and privacy concerns before allowing it to be hosted on the platform. Some hosted datasets can be marked as private and limited to a list of approved users, but most datasets are accessible freely and anonymously.
In addition to providing open data, Smart Columbus OS is also in the process of providing open analysis, where users may analyze the data directly from the website. This feature is powered by JupyterHub, a collaborative online version of the popular data science notebook coding environment Jupyter. While this tool is still in limited beta testing, the concept of providing such a public analysis offering is far ahead of the other projects assessed.

Third-party contributors to Smart Columbus OS are ultimately responsible for the quality of the data they provide, and such data will not be accepted unless they meet certain quality standards. These standards include identifying all metadata and implementing an automated means to identify obvious PII. Even after meeting these initial standards, each subsequent update of the data will be machine-assessed for completeness and accuracy. Plans are in place to also allow users to assess and rate the quality of datasets, with low-scoring datasets flagged for review. Should a dataset fail this review process, it may be frozen or deleted.

The city of Columbus hosts its own data in a cloud-based data lake. The cloud provider backs up these data frequently and assists with the data recovery process as needed. The data are stored in commonly used, open source formats. Columbus keeps full ownership of its data and is under no restrictive third-party contracts that would limit data access or use.
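The automated screening described above for third-party contributions (an automated check for obvious PII and a machine assessment of completeness on each update) could be sketched roughly as follows. This is a minimal illustration in Python, not the Smart Columbus implementation: the regular expressions, the field names, the function names, and the 0.95 completeness threshold are all assumptions, since the reviewed documentation does not specify them.

```python
import re

# Hypothetical patterns for "obvious" PII; the actual checks are not documented.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def completeness(rows, required_fields):
    """Return the fraction of required cells that are present and non-empty."""
    total = len(rows) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for row in rows
        for field in required_fields
        if row.get(field) not in (None, "")
    )
    return filled / total


def fields_with_obvious_pii(rows):
    """Return the sorted names of fields whose text values match a PII pattern."""
    flagged = set()
    for row in rows:
        for field, value in row.items():
            if isinstance(value, str) and (
                EMAIL_RE.search(value) or PHONE_RE.search(value)
            ):
                flagged.add(field)
    return sorted(flagged)


def assess_update(rows, required_fields, min_completeness=0.95):
    """Summarize whether a dataset update passes both automated checks."""
    score = completeness(rows, required_fields)
    pii_fields = fields_with_obvious_pii(rows)
    return {
        "completeness": score,
        "pii_fields": pii_fields,
        # Assumed rule: accept only complete-enough updates with no flagged PII.
        "accepted": score >= min_completeness and not pii_fields,
    }
```

In this sketch, an update that falls below the threshold or contains a flagged field would be returned to the contributor or passed to a data curator for manual review, mirroring the curator role described in the documentation.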

The Smart Columbus project takes several measures to protect sensitive personal information. First, anytime a new data source is identified, it is audited by a data curator to determine whether there is any PII in the data and whether it is necessary to have. If it is not needed, the data provider is asked to remove the PII before the data reach the Smart Columbus data stores. When this is not feasible, the data curator applies an anonymization process to the data, such as applying random offsets to endpoints so that exact destinations cannot be derived. Finally, if the PII is needed and cannot be anonymized without compromising research value, then the dataset is marked "private" and access is restricted on a user-by-user basis.

Smart Columbus utilizes its own in-house developers and has a well-documented development methodology. This methodology includes agile data development following a V-model as defined in the FHWA Systems Engineering for Intelligent Transportation Systems Guide. Developers meet often in a "scrum of scrums," which allows new changes to be rapidly developed and implemented while maintaining coordination between the teams.

All in all, Smart Columbus is an impressive project whose documented approaches closely align with recommendations drawn from the literature (James, et al., 2018). While the project continues to grow, the successes achieved in developing appropriate technology, procedures, and policies have already advanced Smart Columbus beyond the challenges that most other organizations currently face.

D.2 New York City Department of Transportation (DOT)

The New York City DOT is developing a connected vehicle pilot as part of the USDOT's Connected Vehicle Deployment Program. DOT receives assistance from researchers at New York University (NYU), who collect survey and interview data from participants to supplement the sensor data collected by DOT.
To protect itself against being subpoenaed, DOT takes extreme steps to ensure that it not only avoids handling PII, but also discards all raw data that could conceivably be connected to PII. All data gathered are sanitized, obfuscated, and binned before they are ever entered into the database, and any paper files at NYU are locked in a safe and will be burned when no longer needed. While these approaches help DOT achieve the objective of removing all identifiable data from its servers, they also create significant hurdles when it comes to data quality verification and data product development (Galgano, et al., 2016).

Like all participants in the USDOT Connected Vehicle Deployment Program, DOT has very detailed documentation. There is a well-maintained metadata catalog that follows industry standard formats. DOT also benefits from having only scrubbed data on its servers: it does not need to maintain separate servers to house PII, and it is able to freely disseminate all of its processed data to all interested parties (Van Duren, Rausch, & Benevelli, 2017). Whether these data are shared on its own data portal or through the USDOT's Research Data Exchange, DOT does not need to worry about identifying the shared data or restricting access to certain users, because there is nothing resembling PII to protect.

DOT has detailed and up-to-date documentation on its security architecture, which ensures that all data are encrypted at rest and in transit throughout the system of systems. There is not as much detail regarding the protection of PII and other sensitive data. The available documentation simply states that PII will be defined by signed agreements with each data provider, that the PII will be obscured at the time of collection, and that a manual audit will take place to ensure that the PII was properly obscured and cannot be aggregated from the data available. No mention was found of storing datasets containing PII on a separate server or database, implying that all PII will be obscured or aggregated away.

DOT enforces data quality checks at both the device level and the database level. The devices first check for data outside of normal ranges before sending them to the database, and the database repeats this check when it receives data, placing data outside the expected range into a "suspect" bin. DOT employees can then review these data points to better identify the root cause of the aberration and update their filters as necessary. User access to the preprocessed data is severely limited, and any such access is monitored and controlled. Only after the data have been fully processed and anonymized are they made freely available.

D.3 Tampa Hillsborough Expressway Authority (THEA)

The Tampa Hillsborough Expressway Authority (THEA) is another participant in the USDOT's Connected Vehicle Deployment Program. While THEA collects data in open source formats and maintains direct control over the data in its own local data infrastructure, THEA does contract with third-party services such as "Concert ATMS" and "NextConnect" for data analysis and use. These partnerships provide several advantages, such as detailed metadata catalogs, advanced data quality management, and easier data storage handling. There are also some challenges involved, particularly with advanced data analysis and data dissemination.

THEA collects and manages a variety of open source data types, including JSON, XML, and CSV. Detailed metadata definitions are documented for each data type so that researchers can easily find and merge data in their analyses. These data are stored locally on server hardware supporting a virtual machine environment via the "NextConnect" software.
Using virtual machine concepts makes the infrastructure easily scalable; however, there may be additional licensing fees or other restrictions involved compared to using open source alternatives like Hadoop.

THEA has detailed security documentation that provides specific details on cryptographic implementation and security at the hardware level. This documentation includes detailed threat assessments, component security requirements, and diagrams of secure operation. Datasets that contain PII are either stored separately in a highly secure federal database, or the sensitive data are obfuscated and anonymized.

THEA employs both automated and manual processes for verifying data quality. The first pass is a corrupt message check, in which the system detects whether a message received from a device in a well-known format is corrupt. A corrupt message is discarded because it cannot be verifiably cleaned of PII. After non-corrupt messages are collated into datasets, those datasets are filtered for extreme outliers and obvious errors. Finally, there are scheduled and unscheduled data audits in which the data quality is checked manually.

THEA does not directly share any data on an open data platform, nor does it currently have a documented open data plan. THEA does plan on sharing data that have been scrubbed of PII with the USDOT's Research Data Exchange; however, at the time of review, these datasets did not yet appear to be hosted on that site.

THEA backs up its data frequently to a second, high availability (HA) site designed for fast access to the data if the primary database goes offline. From a data quality management perspective, all data are verified using both automatic and manual methods at all steps of the data lifecycle. Upon ingest, all data are checked against the expected format structure for corruption. After loading into datasets, the data are checked again for known errors and out-of-bounds values. Finally, while the data are being used, they are subject to manual data audits, both scheduled and unscheduled (Johnson, et al., 2017).

For analysis of these data using "Concert ATMS," THEA does need to copy the data to separate cloud servers. Since this analytical software is controlled by a third party, any modifications or additions to the analysis need to involve that third party. Furthermore, THEA currently has no open data policy, and there are no plans to publicly share any information outside of uploading some anonymized datasets to the USDOT's Research Data Exchange system. Based on available documentation, it is unclear whether this decision is due to third-party contract restrictions or to other reasons; however, other departments with similar contracts commonly find that those contracts end up complicating data dissemination in some way (Kolleda, Garcia, & Poling, 2016).

D.4 Wyoming Department of Transportation (WYDOT)

As part of USDOT's Connected Vehicle Deployment Program, WYDOT has developed a well-designed data management plan with all the detailed documentation that is characteristic of this national program. Good practices are followed from the collection of data in the easy-to-use JSON format, to storing the encrypted data on access-restricted servers, to collaborating with other agencies in securely disseminating data on a Research Data Exchange (Kitchener, et al., 2017). Even with such a strong overall project, there are still some areas where additional development work could help avoid scalability, metadata management, and other challenges in the years to come.

WYDOT manages a unified data warehouse, maintaining full control over access to its data. Only employees with background checks and the need to know are permitted access to any servers that store PII.
Access for other users is provided through a managed operational data environment connected to the data warehouse. All data are encrypted both at rest and in transit, and source data stored on collecting devices are deleted after 10 minutes. Database backups and critical systems are stored in geographically distinct locations. Backup media are retained for at least 3 months and secured in a fireproof safe.

The data warehouse architecture is relatively complex but should perform well given sufficient resources to manage and maintain it. The core of the database is stored on Oracle servers, which are well suited for traditional relational database management systems. One weakness of using such servers is that, as the size of the data grows, it can be very expensive to purchase additional Oracle server hardware to support that expansion. There are several modern alternatives, such as the open source Hadoop Distributed File System (HDFS), that can scale upwards more easily with inexpensive commodity hardware.

Any access to servers that house PII must be requested in advance and is automatically logged and audited. Automatic data threshold/quality checks are performed on machine-ingested data prior to storing; however, no documented information is available on data quality practices for manually added data. Periodic data transfers of shareable data are made to a separate Research Data Exchange system in order to further contribute to open data efforts.

Within the reviewed data management plan, there is a great deal of information on the data architecture, security, and open data policy. There was less detailed information available on how the data quality is monitored and managed, how new data products will be developed, or what the metadata catalog will contain (Gopalakrishna, et al., 2016). Based on the experiences of other departments, developing thoughtful policy and procedures in these areas early in the project lifecycle will help avoid time-consuming issues and roadblocks in the future.
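Several of the reviewed pilots describe automated range or threshold checks on machine-ingested data, including the NYC DOT "suspect" bin in D.2 and the WYDOT threshold checks in D.4. A minimal Python sketch of that pattern follows. It is illustrative only: the field names, the expected ranges, and the function names are hypothetical, because none of the reviewed plans lists its actual thresholds.

```python
# Hypothetical expected ranges; the reviewed plans do not list actual thresholds.
EXPECTED_RANGES = {
    "speed_mph": (0.0, 120.0),
    "heading_deg": (0.0, 360.0),
}


def out_of_range_fields(record, ranges=EXPECTED_RANGES):
    """Return names of fields that are missing, non-numeric, or outside range."""
    bad = []
    for field, (low, high) in ranges.items():
        value = record.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # NaN fails the comparison below, so it is also treated as out of range.
        if not is_number or not (low <= value <= high):
            bad.append(field)
    return bad


def triage(records, ranges=EXPECTED_RANGES):
    """Split records into accepted rows and a 'suspect' bin for manual review."""
    accepted, suspect = [], []
    for record in records:
        bad = out_of_range_fields(record, ranges)
        if bad:
            # Keep the failing field names so reviewers can trace the root cause.
            suspect.append({"record": record, "failed_fields": bad})
        else:
            accepted.append(record)
    return accepted, suspect
```

Keeping suspect records, rather than dropping them, supports the manual review step described for NYC DOT, in which staff examine flagged data points and update their filters as necessary.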
