The type of data needed in chemistry is changing. The traditional requirement was for limited sets of data used to create correlations, provide estimates, and test theories. This was a “retail” version of data usage. In industry, government, and academia, this work was typically done by individuals with a strong background in the underlying physical principles embodied in the data. Errors in transcription were easy to spot, and bad data usually stood out because data were typically used in sets and plotted against other related data. The data correlations were often extended to domains where measurement was difficult or expensive. Predictions were made from the correlations, but again, the practitioners’ background was such that the fundamental physical principles and “reasonableness” of the data were uppermost in their minds. The errors were generally well appreciated, because the underlying science was closely coupled to the data analysis. The resulting data were used with confidence that they were correct, or at least that a firm understanding of the bounds of their uncertainty existed.
The use of modeling and simulation has placed new demands on data resources. These demands result in part from the different and often more complex systems being modeled, but also in part from new requirements for complete data sets. The need for completeness arises from the very nature of modern modeling programs, which take all aspects of the physics and chemistry into account—at least in principle. Since all physical and chemical processes are included, data are needed for the parameters describing the individual subprocesses of the model: diffusion coefficients, heat capacities, heats of formation, rates of reaction, and so on. Because the model requires that some value be supplied for every parameter, values must be provided even for parameters for which few or no experimental data exist. This has given rise to a host of estimation methods and a greater need to determine the role of uncertainty in the modeling process.
For many of the unknown parameters, it is possible to show that any physically reasonable value will be acceptable, since the underlying process does not determine the outcome of the model. Thus, if one needs a diffusion coefficient for a radical, one can take as limits the value for the H atom (for which experimental data exist) and that for some molecule with a molecular weight twice that of the radical. Barring very unusual effects of polarity, the actual diffusion coefficient will lie in that range. By examining the effect of the high and low values, it becomes possible to set limits on how much this high level of uncertainty will affect the final result. However, if the same calculation is to be applied to an ion, a completely different set of approximations must be used. The number and scope of the processes modeled in a modern simulation are so large that it is unlikely that any one person has the scientific background to ensure that all of the estimates are "reasonable." This is especially true since the definition of reasonable is a strong function of the problem: what is a small effect in one system may be large in another because of the difference in the process controlling the outcome of the model. For the most part we do not have modeling code that checks the "reasonableness" of its input values, nor do we have data resources that can provide physical limits for otherwise unknown data.
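The bracketing procedure described above can be sketched in a few lines. This is an illustrative example only: the toy model, the placeholder diffusion coefficient for the H atom, the hypothetical radical mass, and the crude D ∝ M^(-1/2) scaling are all assumptions introduced here, not values or methods from the text.

```python
# Sketch: bounding an unknown radical diffusion coefficient between the
# H-atom value and that of a molecule of twice the radical's mass, then
# checking how much the model output varies across that range.
# All numbers below are illustrative placeholders, not measured data.

def model_output(diff_coeff):
    """Hypothetical stand-in for a full simulation; here the output is
    only weakly sensitive to the diffusion coefficient."""
    return 1.0 + 0.01 * diff_coeff  # toy relation, not a physical model

# Upper limit: H-atom diffusion coefficient (experimentally accessible;
# the number here is a placeholder).
d_high = 8.0e-5  # cm^2/s, assumed

# Lower limit: a molecule with twice the radical's molecular weight,
# scaled from the H atom (M = 1) by the rough relation D ~ M**-0.5.
m_radical = 30.0            # g/mol, hypothetical radical
m_heavy = 2.0 * m_radical
d_low = d_high * m_heavy ** -0.5

low, high = model_output(d_low), model_output(d_high)
spread = abs(high - low) / low
print(f"output range: {low:.7f} .. {high:.7f} (relative spread {spread:.1e})")
```

If the relative spread is negligible, any physically reasonable value of the parameter is acceptable for this model; if it is large, the parameter warrants measurement or a better estimate.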
Satisfying the needs discussed above will require changes in the way that data resources are managed. Three broad categories of data resources are discussed below to illustrate the problems in meeting these needs.
1. Archive. The archive is a set of numeric data of specific properties for specific chemical compounds with full literature references. The data should be clearly identified as to property (heat of