will range from highly unstructured to highly structured data. These factors will require even more multidisciplinary collaboration among agency scientists.
Warehousing and Mining
As increasingly large amounts of data continue to be generated through designated systems (such as environmental monitoring, biomarker and other exposure surveillance, disease surveillance, and designed epidemiologic and experimental studies) or streamed from community crowdsourcing, EPA faces both the opportunity and the challenge of channeling and integrating those data into a massive “data warehouse.” Data warehousing is a well-developed concept and a common practice in business (Miller et al. 2009). In EPA, the adaptation of and transition to data warehousing will continue to evolve, building on good existing examples such as EPA’s Envirofacts Warehouse (Pang 2009; Egeghy et al. 2012) and the Aggregated Computational Toxicology Resource (Egeghy et al. 2012; Judson et al. 2012). In the future, data in EPA’s warehouse will come from diverse sources, from multiple media, and across geographic, physical, and institutional boundaries. Recent efforts to integrate the US Geological Survey’s National Water Information System with EPA’s Storage and Retrieval System are an example (Beran and Piasecki 2009). Integrating heterogeneous databases and mining the resulting massive datasets present new opportunities to harvest information relevant to EPA’s science and regulatory activities. A recent application involving the European Union’s Water Resource Management Information System is a case in point (Dzemydienė et al. 2008).
Data mining has become a standard approach in business for analyzing massive, multisource, heterogeneous data on consumer behavior (Ngai et al. 2009). EPA can and should adopt this data-analytic paradigm to support its knowledge-discovery process. The paradigm is increasingly important at a time when the discovery of new evidence or a new data model can be bolstered by dynamic mining of large amounts of data, including environmental indicators of air and water, satellite imagery of climate change from representative population databases, health indicators from disease surveillance systems and medical databases, social behavioral patterns, individual lifestyle data, and -omics data and disease pathways. Doing so will require EPA to invest resources in two directions: continuing to develop new analytic and computational methods for static datasets (for example, modeling of complex biologic systems and air and water models), and adapting and developing new data-mining techniques to process, visualize, link, and model the massive amounts of data that are streaming from multiple sources. EPA is making progress in that direction with its Aggregated Computational Toxicology Resource (Judson et al. 2012). Successful cases have also been reported for ecologic modeling (Stockwell 2006), air-pollution management (Li and Shue 2004), and toxicity screening (Helma et al. 2000; Martin et al. 2009), to name a few.