A. Raw Data. All clinical trial data originate from patients and healthy volunteers who participate in studies that are carried out according to a detailed research protocol. This protocol is approved by an institutional review board and explained to participants through the informed consent process. Depending on the study under consideration, demographics, clinical outcomes data, and other appropriate raw source information are entered into case report forms. Some data (e.g., imaging studies) are interpreted by study investigators, and these interpretations are entered into the database, a process referred to as abstraction. The data are then coded to meet the study guidelines (e.g., men may be coded as “1” and women as “0”), and the coded data are entered into the case report forms. In addition, narrative data from the case reports are transcribed into the database. The data are then reviewed (i.e., cleaned) to be sure that entries make sense and are internally consistent. In short, the raw data are abstracted, coded, transcribed, and cleaned as appropriate.
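The coding and cleaning steps described above can be illustrated with a minimal Python sketch. Only the sex coding (men as 1, women as 0) comes from the text; the field names, the age range, and the two consistency rules are hypothetical and stand in for whatever a real study's guidelines would specify.

```python
# Hypothetical codebook: the text only gives men = 1, women = 0 as an example.
SEX_CODES = {"male": 1, "female": 0}


def code_record(raw):
    """Return a coded copy of one raw case report form record."""
    coded = dict(raw)
    coded["sex"] = SEX_CODES[raw["sex"].strip().lower()]
    return coded


def find_queries(record):
    """List data queries for entries that are implausible or inconsistent.

    Both rules below are illustrative assumptions, not rules from the source.
    """
    queries = []
    if not 0 <= record["age"] <= 120:
        queries.append("age out of plausible range")
    if record["sex"] == 1 and record.get("pregnant"):
        queries.append("pregnancy recorded for participant coded male")
    return queries


raw = {"participant_id": "P001", "sex": "Male ", "age": 47, "pregnant": True}
coded = code_record(raw)
print(coded["sex"], find_queries(coded))
# prints: 1 ['pregnancy recorded for participant coded male']
```

In this sketch a non-empty list of queries means the record goes back for resolution before the database can be treated as clean.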
B. Cleaned Analyzable Dataset. Once the database is cleaned and all queries are resolved, the database, which consists of both individual participant data and computed/summary-level data, is analyzable. It is called analyzable, rather than analyzed, because a very large percentage of the data is never used in an analysis. The next step is to lock the database so that no further changes may be made and the data may be unblinded. However, the cleaned analyzable dataset in its unlocked condition has the potential for subsequent use, because it could be re-analyzed at later time points with the addition of data (e.g., when 1-year, 3-year, and 10-year outcome measures are added).
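The relationship between resolving queries, locking, and unblinding can be sketched as a small Python class. This is an illustration under stated assumptions, not a description of any real clinical data management system: the class name `TrialDatabase`, its methods, and the `open_queries` parameter are all hypothetical.

```python
class TrialDatabase:
    """Toy in-memory database that refuses changes once it is locked."""

    def __init__(self):
        self._records = {}
        self._locked = False

    def upsert(self, participant_id, record):
        """Add or update a record; only allowed while the database is unlocked."""
        if self._locked:
            raise PermissionError("database is locked; no further changes allowed")
        self._records[participant_id] = dict(record)

    def get(self, participant_id):
        """Return a copy of a stored record."""
        return dict(self._records[participant_id])

    def lock(self, open_queries=0):
        """Lock the database; assumes all data queries must be resolved first."""
        if open_queries:
            raise ValueError("resolve all queries before locking")
        self._locked = True

    @property
    def can_unblind(self):
        """In this sketch, unblinding is permitted only after the lock."""
        return self._locked
```

An unlocked instance can keep accepting later follow-up data, such as 1-year or 3-year outcomes, whereas a locked one rejects every further `upsert`.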
C. Cleaned and Locked Analyzable Dataset. The final cleaned and locked analyzable dataset consists of different components (participant characteristics and primary outcome, prespecified secondary and tertiary outcomes, adverse event data, and exploratory data). A statistical analysis may involve a composite outcome using any of the various components. In addition, when data are missing, values may be imputed using this dataset. Results are derived from data in the cleaned and locked analyzable dataset, which have undergone statistical analysis. Analyses that were prespecified in the Statistical Analysis Plan form the basis for the Clinical Study Report (CSR), a detailed analysis of the study efficacy data and the complete adverse event data. The CSR and the supporting cleaned dataset are available to regulators (e.g., the Food and Drug Administration, the European Medicines Agency) and to other data users as appropriate (e.g., ministries of health). Journal articles generally represent slices of the data that make a coherent intellectual whole. For example, the “lead article” usually describes the data on the primary efficacy outcomes, key secondary outcomes, and the relevant adverse event data. Subsequent articles often focus on different aspects of the secondary, tertiary, or exploratory outcomes. Investigators can also use parts of the analyzable dataset to prepare analyses for presentations, for data exploration, and for hypothesis generation. A biostatistics best practice is to freeze a copy of whatever data were used in an analysis so that the results can later be reproduced if necessary. It would also be desirable to store the code used in the analysis (i.e., the computer program), especially for any derived data.
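One possible way to freeze the data and code behind an analysis is to keep an independent copy together with checksums, so that a later re-run can confirm it starts from the same inputs. The Python sketch below uses only the standard library; the function names `fingerprint` and `freeze`, the JSON serialization, and the sample rows are assumptions made for illustration, not a method prescribed by the text.

```python
import hashlib
import json


def fingerprint(rows):
    """SHA-256 of a canonical JSON serialization of the analysis data."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def freeze(rows, analysis_code):
    """Return a snapshot holding a deep copy of the data plus checksums."""
    return {
        "data": json.loads(json.dumps(rows)),  # copy independent of `rows`
        "data_sha256": fingerprint(rows),
        "code_sha256": hashlib.sha256(analysis_code.encode("utf-8")).hexdigest(),
    }


# Hypothetical analysis inputs.
rows = [{"participant_id": "P001", "arm": "A", "outcome": 1},
        {"participant_id": "P002", "arm": "B", "outcome": 0}]
code = "mean_outcome = sum(r['outcome'] for r in rows) / len(rows)"

snapshot = freeze(rows, code)
rows[0]["outcome"] = 0  # a later edit to the working data

# The frozen copy still matches its checksum; the edited data does not.
print(fingerprint(snapshot["data"]) == snapshot["data_sha256"])
print(fingerprint(rows) == snapshot["data_sha256"])
```

Storing the checksum of the analysis program next to the data checksum addresses the point about keeping the code, since derived data can only be regenerated reliably when both inputs are known to be unchanged.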