CHAPTER 5

Data Management

Human Subjects Protection

Federal regulations and good research practice call for protection of persons who participate in research studies ("human subjects"). The Office for Human Research Protections (OHRP) in the U.S. Department of Health and Human Services (HHS) provides leadership in the protection of the rights, welfare, and well-being of subjects involved in research. OHRP does this by providing clarification and guidance, developing educational programs and materials, maintaining regulatory oversight, and providing advice on ethical and regulatory issues in biomedical and behavioral research. These protective policies are enforced at the local level by an organization's institutional review board (IRB), an entity required by federal regulations.

Of paramount concern in the design of the SHRP 2 NDS was the need to maintain close coordination with nearly all of the other project tasks, as virtually all tasks have some impact on the safety or privacy of the participants or their data. Key issues include protection of participant confidentiality, protection of unconsented passengers (e.g., no continuous audio recording can be employed since it may capture the conversations of unconsented passengers), informed consent (and assent/parental consent for minor participants), protection of potentially identifying information (e.g., face video and geospatial identifying data), and the continued protection of participant confidentiality once the data are stored in a database for post hoc analyses.

Institutional Review Boards and Certificate of Confidentiality

Human subjects protection in the SHRP 2 NDS will be ensured by the review and approval of eight separate IRBs: those of the S06 contractor, the six S07 contractors, and the National Academy of Sciences (NAS).
To prepare as well as possible for the human subjects protection review expected in the full-scale field study, the protocols for the S05 pilot studies underwent the full board review process at Virginia Tech (VT). This allowed a wider range of reviewers to see the complete protocol and raise human-participant concerns and issues prior to running the NDS. The combination of full board review at VT and full board review at NAS resulted in a very robust protocol that serves as a good starting point for the NDS.

Additionally, a Certificate of Confidentiality (CC) was secured from the National Institutes of Health (NIH) for the S05 pilot study. A CC helps researchers protect the privacy of subjects in biomedical, behavioral, clinical, or other research projects against compulsory legal demands (e.g., court orders and subpoenas) that seek the names or other identifying characteristics of a research subject. The CC covers the collection of sensitive research information for a defined time period (the term of the project); however, personally identifiable information obtained about subjects enrolled while the CC is in effect is protected in perpetuity. A CC will also be requested for the full-scale NDS. On the basis of the approval of the S05 CC, a timely approval for the SHRP 2 NDS CC is anticipated.

Upon NDS inception, one of the first sets of tasks relates to securing the IRB approvals from the S06 IRB, the NAS IRB, and each S07 IRB before proceeding with any and all aspects of the research involving human participants. Similarly, all S06 and S07 project personnel who will interact with participants or their data must certify that they have passed an approved IRB course or a course on protecting human participants. Each individual site contractor (except any that have chosen to formally rely on the VT IRB) will have to receive approval from its own IRB on the basis of the research protocol and participant-consent documents approved by the VT IRB.
It is likely that modifications to the standard set of documents to meet local needs will be reviewed, but these are not expected to fundamentally change. IRB-related submission materials were shared with the various stakeholder IRB personnel early in the process, including during a meeting of these stakeholders in Washington, D.C., in the summer of 2009. IRB approval will be sought for the call center recruitment separately from that for the main study activities.

Collection Process from Vehicle to Server

The data collected during the NDS will include participant-identifying data and other sensitive personal information that must be protected. Consequently, every effort will be taken to protect all data from unauthorized access. The video data will be encrypted on the DAS and will remain encrypted until the data transfer process to the S06 server has been successfully completed. Once data quality processes have been applied, the video data will be reencrypted for storage. The data collection process is illustrated at a high level in Figure 5.1.

[Figure 5.1. Data collection process. Flowchart of the DAS life cycle, with steps grouped into predominantly oversight and integration contractor functions and predominantly site contractor functions.]

The hard drive on the DAS has a single copy of the data. As those data are transferred to the S07 server, they are replicated and stored on an array of HDs configured in a RAID (Redundant Array of Independent Disks). The RAID configurations on the S07 and S06 servers allow system administrators to completely restore a full copy of the data in the event of an HD failure on the server.
Furthermore, once the data have been successfully transferred to the S06 site, an additional copy of the data will be stored for archival purposes.

Data will be encrypted onboard the DAS by way of Advanced Encryption Standard (128- or 256-bit AES) symmetric encryption. The key used for the encryption will be randomly generated for each trip, and that key is then encrypted using the Rivest-Shamir-Adleman (RSA) public key of a public/private key pair. The encrypted key file is stored with the same naming convention alongside the encrypted data and video files. This scenario provides the security of having a private key that will not be onboard the system, while allowing the data and video to be encrypted with the much faster symmetric encryption.
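The per-trip key scheme described above is a form of hybrid (envelope) encryption. The following is a minimal Python sketch of that idea, not the DAS implementation. It assumes the third-party `cryptography` package, AES-256 in GCM mode, and RSA with OAEP padding; the source specifies only "128- or 256-bit AES" and "RSA", so the mode, padding, nonce layout, and the function names `encrypt_trip` and `decrypt_trip` are all illustrative assumptions.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_BYTES = 12  # 96-bit nonce, the usual size for AES-GCM (assumed mode)

# Assumed RSA padding for wrapping the trip key; the source does not name one.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_trip(plaintext: bytes, public_key) -> tuple[bytes, bytes]:
    """Encrypt one trip's data; return (encrypted_data, encrypted_key)."""
    trip_key = AESGCM.generate_key(bit_length=256)  # fresh random key per trip
    nonce = os.urandom(NONCE_BYTES)
    ciphertext = AESGCM(trip_key).encrypt(nonce, plaintext, None)
    # Wrap the symmetric key with the public key; only the holder of the
    # private key (kept off the vehicle) can recover it.
    encrypted_key = public_key.encrypt(trip_key, OAEP)
    return nonce + ciphertext, encrypted_key


def decrypt_trip(encrypted_data: bytes, encrypted_key: bytes, private_key) -> bytes:
    """Unwrap the trip key with the private key, then decrypt the data."""
    trip_key = private_key.decrypt(encrypted_key, OAEP)
    nonce, ciphertext = encrypted_data[:NONCE_BYTES], encrypted_data[NONCE_BYTES:]
    return AESGCM(trip_key).decrypt(nonce, ciphertext, None)


if __name__ == "__main__":
    # Hypothetical key pair; in the described design only the public half
    # would be present on the DAS.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    data, key_file = encrypt_trip(b"trip sensor bytes", private_key.public_key())
    print(decrypt_trip(data, key_file, private_key))
```

In this sketch the bulk data never passes through RSA, which is the speed benefit the text mentions: the slow asymmetric step is applied only to the 32-byte trip key.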
Data Processing

Data Upload to Data Storage Server

When uploading the data from the S07 staging server to the S06 server, checksum analyses will be performed to ensure the integrity of the uploaded file. After uploading the data to the S06 database and decrypting, multiple quality checks will be done for each trip. These will be similar to but more sophisticated than those done during the routine health checks. Specifically, due to the amount of data and its contiguous nature (i.e., each trip file should begin at or near the same GPS coordinates where the previous trip file ended), more sophisticated comparisons between variables can be made to isolate potential problems within a trip. Analyses will also be conducted to compare trips to ensure that data are not being lost. For example, is the GPS location at the beginning of a trip the same location as the end of the previous trip?

When a problem has been identified by the data-quality algorithms, any questionable data will be marked as such. At a minimum, the annotation will include a start sync, end sync, and metadata describing the test the variable failed. As resources are available, fixes may be applied to the data where such is possible (e.g., where it can be determined that a particular sensor was generating data that were off by a known constant value). S06 quality personnel will review the problems to try to determine the root cause (i.e., on the DAS or otherwise). The S06 contractor may need to work with the individual S07 contractor to isolate the problem and determine the best course of corrective action. Quality personnel will also conduct random spot checks by remotely requesting data snippets.

Note that any additional processing required to get the data into a format to answer specific research questions is outside the scope of the current S06 project.
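The upload checksum and the trip-to-trip GPS continuity check described in this section can be sketched as follows. This is an illustrative Python sketch only: the choice of SHA-256, the 200 m gap threshold, and the record fields (`trip_id`, `start`, `end`, `first_sync`) are assumptions, since the source names neither a checksum algorithm nor a data layout.

```python
import hashlib
import math


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large trip files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_intact(staged_path, uploaded_path):
    """True if the uploaded copy matches the staged copy byte for byte."""
    return sha256_of(staged_path) == sha256_of(uploaded_path)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS fixes."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def continuity_flags(trips, max_gap_m=200.0):
    """Flag trips that start far from where the previous trip ended.

    `trips` is a time-ordered list of dicts with hypothetical fields:
    'trip_id', 'start' and 'end' as (lat, lon), and 'first_sync'.
    """
    flags = []
    for prev, cur in zip(trips, trips[1:]):
        gap = haversine_m(*prev["end"], *cur["start"])
        if gap > max_gap_m:
            # Annotation carries a start sync, end sync, and test metadata.
            flags.append({
                "trip_id": cur["trip_id"],
                "start_sync": cur["first_sync"],
                "end_sync": cur["first_sync"],
                "test": "gps_continuity",
                "gap_m": round(gap),
            })
    return flags
```

A flagged gap does not by itself prove data loss (a vehicle can be towed or ferried), which is consistent with the text's approach of marking data as questionable and leaving root-cause review to quality personnel.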
Although processing for specific research questions is outside the S06 scope, providing researchers with early access to these data is believed to be paramount to the success of this project because it lets stakeholders at all levels begin to see results and the value of the project early on, without waiting for all data to be collected some 28 months later.

Data Acquisition System Data Processing

The purpose of the data processing is to get the highest-quality data feasible into the database, in a form usable by researchers. Several processes will be performed once the data arrive at the S06 server.

Backups

The data will be housed at the VT Data Center. RAID 6 protection is also employed at this facility to guard against loss of data due to server failure. Archival backups of the data will be stored at a different physical location.

Trip Summary Data

Summary data will be extracted for each trip. These data include mileage driven, duration, start time, average speed, maximum speed, number of stationary epochs, maximum deceleration, driver identification (where possible), etc. This summary will help with quality processes, and it will provide a useful first look at the data for researchers.

Data Standardization

Data will be standardized into common formats. Because data are being collected on different vehicle makes, models, and countries of origin, it is possible that the DAS may collect data from a single sensor with different units, scales, axes, sample rates, or coding. It will be important to transform the data into standard units to assist researchers when they attempt to analyze the data across vehicles. This is also important if any algorithms are to be applied across the entire fleet consistently. The raw data will also be stored in the event that any researcher ever wants to review or analyze them. Also, some of the vehicle models (e.g., those equipped with LDW systems) may generate higher-resolution data (i.e., in time or the measured dimension) than others.
Using steering information as an example, these higher-frequency data would be of great interest to researchers looking at steering reversals to investigate workload, drowsiness, steering entropy, or the performance of the onboard LDW.

Expected Data Magnitude

Transfer rates for data staged on S07 servers and then sent to the central S06 data server could often exceed 100 Mbps. This requires the use of high-performance research-caliber networks, such as Internet2 or National LambdaRail. With almost 2,000 DAS units simultaneously collecting video and other sensor data for 2 years each, as well as a projected data life span of up to 30 years, the magnitude of data storage and criticality of adequate infrastructure cannot be overstated.

Specifically, the NDS database will house information from several sources, including video and sensor trip data, crash data, health check (i.e., system) data, management information (i.e., inventory data and participant enrollment data), participant demographic and assessment data, vehicle inventory data, and analysis data (i.e., aggregated or reduced), as well as other external sources such as PARs, GIS, weather data, maps, and roadway information.

In total, it is anticipated that 2 years of data collection will create a volume of approximately 1 petabyte of data comprising approximately 60 to 80 million miles and approximately 1.5 to 1.7 million hours of driving data (i.e., a data volume that would require approximately the storage capacity of one million 1-gigabyte USB flash drives). To characterize this volume of data in a different context, it would take approximately 70 million copies of the King James Version of the Bible to fill 1 petabyte of storage capacity.
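The scale figures above can be checked with simple arithmetic. The Python sketch below is a back-of-envelope calculation under stated assumptions: decimal (SI) units, the chapter's own estimates of 1 petabyte and 1.5 to 1.7 million driving hours, and a single sustained 100 Mbps link with no protocol overhead. The function names are illustrative.

```python
PETABYTE = 10**15  # bytes, decimal (SI) definition assumed
GIGABYTE = 10**9   # bytes
SECONDS_PER_DAY = 86_400


def flash_drives_needed(total_bytes, drive_bytes=GIGABYTE):
    """Number of drives of the given size needed to hold total_bytes."""
    return total_bytes // drive_bytes


def megabytes_per_driving_hour(total_bytes, driving_hours):
    """Average stored volume per hour of driving, in megabytes."""
    return total_bytes / driving_hours / 10**6


def days_to_transfer(total_bytes, link_bits_per_second):
    """Days needed to move total_bytes over one fully used link."""
    return total_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    print(flash_drives_needed(PETABYTE))
    print(megabytes_per_driving_hour(PETABYTE, 1.7e6))
    print(megabytes_per_driving_hour(PETABYTE, 1.5e6))
    print(days_to_transfer(PETABYTE, 100e6))
```

Under these assumptions, 1 petabyte corresponds to one million 1-gigabyte drives, roughly 590 to 670 megabytes per driving hour, and about 926 days of continuous transfer over a single 100 Mbps link, which is comparable to the length of the collection period itself and illustrates why higher-capacity research networks are called for.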