INFORMATICS AND INFORMATION TECHNOLOGY
IT and informatics are in rapid transition. The global capacity for computing and telecommunication has been growing exponentially (Hilbert and López 2011). The end of Moore’s law of exponential growth in computer hardware power (Robert 2000; see footnote 1) will require, for example, mastery of parallel programming to sustain the growth of computing performance and to meet the need for analyzing massive amounts of data until a postsilicon era is realized. The next 10 years will see a massive rebuilding of IT infrastructure everywhere.
The economics of IT are also changing profoundly, largely under the favorable pressure of consumer applications. Enormous increases in data bandwidth (especially wireless) have made possible a wide array of mobile endpoints for applications, and this trend will continue. Traditional relational databases cannot scale to handle the rapid growth in the unstructured, semiconsistent, real-time data on which commercial decisions often need to be based, and that limitation has led to the emergence of such tools as MapReduce, Hadoop, and other next-generation data environments (NoSQL 2012), which are discussed above. Virtualization is steadily eliminating the concept of a dedicated server in a fixed location, and cloud computing is transforming the economics of IT. Social networking, already a major consumer phenomenon, has now entered the scientific workplace and can be used for heightened collaboration, as discussed above.
All of the emerging changes will require a more responsive and flexible approach to the opportunities afforded by global informatics and lead to a systems perspective of data instead of a focus on one locale, one experiment, or one medium at a time. Those are the directions that IT and informatics are taking. The challenge will lie in understanding how to harness information for EPA’s science needs for the future and understanding the role of advanced computer science and informatics in EPA.
1 Moore’s law is a rule of thumb in the history of computing hardware whereby the number of transistors that can be placed inexpensively on an integrated circuit doubles about every 2 years (Moore 1965).
EPA’s National Computer Center in Research Triangle Park, North Carolina, houses many of the agency’s computing resources, including the supercomputing resources used by the Environmental Modeling and Visualization Laboratory and resources for such major applications as computational toxicology, exposure research, and risk assessment. Those resources are traditional high-performance computing machines, the products of a shrinking and struggling industry segment. The future of high-performance computing machines will look entirely different, and it is important that EPA adjust to the change to remain at the leading edge of the field.
Central processing units (CPUs) can no longer be made to run faster, so progress requires putting multiple CPUs, or “cores”, on each chip to operate concurrently. That, in turn, requires a decomposition of applications into independent components that can run in parallel. An important opportunity afforded by the effort to create highly parallel programs is that they can also be exported to external networks of underused processing for the few jobs that require massive resources. The existing tools for that style of programming are poor, and the skill is seldom taught. Fortunately, EPA has had experience in this regard in its supercomputing projects, but it will need to expand its overall skills inventory greatly to continue to take advantage of parallel and emerging techniques in computing as Moore’s law is repealed.
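The decomposition described above can be illustrated with a minimal Python sketch. It assumes a workload that splits into units with no shared state; the function `simulate_cell` and its arithmetic are hypothetical stand-ins for a real model component, not EPA code. The same independent units can be run serially or handed to a pool of workers.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def simulate_cell(cell_id: int) -> float:
    """Hypothetical independent unit of work; it reads no shared state."""
    total = 0.0
    for step in range(1, 10_001):
        total += (cell_id % 7 + 1) / step
    return total


def run_serial(cell_ids):
    """Process every unit one after another on a single core."""
    return [simulate_cell(c) for c in cell_ids]


def run_parallel(cell_ids, workers=4, executor_cls=ProcessPoolExecutor):
    """Process the same units concurrently; results keep the input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(simulate_cell, cell_ids))
```

Because the units are independent, `run_parallel(cells)` returns the same list as `run_serial(cells)`. With the default process pool, the call should be made under an `if __name__ == "__main__":` guard on platforms that start workers by spawning. Passing `executor_cls=ThreadPoolExecutor` keeps the same structure but, in CPython, does not speed up CPU-bound work.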
Cloud computing will redefine the economics of computation for the next 20 years. A cloud-computing server typically provides services to its clients in three ways: complete applications (software as a service, or SaaS); a platform for clients to build on (PaaS); or a raw infrastructure of processors, storage, and networks (IaaS). Clouds generally are classified as public (provided commercially), private (to one or more organizations), or hybrid (public with a secure connection to private). Services can be scaled up or down in capacity and performance instantly; the client is charged for the amount of time, storage, CPUs, and bandwidth, moment by moment. Even organizations with extreme needs for computation, storage, and bandwidth and high volatility of demand over the short term have been able to transition from their own data centers to the cloud with excellent results (Cockcroft 2011). EPA has recognized the opportunity presented by cloud computing and has begun to embark on a process of transition for many services to a private EPA cloud (Lee and Eason 2010).
Throughout EPA, and especially in the regions and the technical offices, applications and databases are the responsibility of regions and offices, but the Office of Technology Operations and Planning (in the Office of Environmental Information) provides the infrastructure, platform, and support from datacenters in Research Triangle Park, North Carolina; Arlington, Virginia; Chicago, Illinois; and Denver, Colorado. Thus, it is natural for EPA scientific computing to move to PaaS and IaaS cloud operation, and it has begun to do so. Done carefully, this will also permit some applications to be moved to the public cloud as economics requires. Given the trajectory of costs and budgets, that is inevitable, and it is important that EPA continue on this path, ensuring that new science applications are designed for private cloud implementation and for later portability to the public cloud.
Dramatic improvement in the performance of data transmission in both wide-area and local wireless networks is driving enormous growth in mobile devices and applications. With many government agencies upgrading infrastructure under pressure to use more effectively the underused radiofrequency spectrum over which they have control, that growth will continue for the foreseeable future. Combined with new-generation real-time sensors, the wireless network has a profound effect on collection of and access to environmental information, but it also changes expectations about the user experience. Furthermore, designing for mobile devices involves different constraints and freedoms from building Web applications for a desktop environment. These techniques will be important as EPA works to engage and gain support from the public. It will be important for EPA to master the skills of spectrum-sharing and efficient use of bandwidth.
With centralized data centers, strong data-quality standards, and highly organized exchanges, EPA is executing well in IT and has adapted to changing technology while continuing to support its original charter to protect the environment and human health. However, a persistent challenge in fields such as computational toxicology is the integration of available data from many sources. In particular, many investigators who generate large datasets may not have the knowledge and experience in informatics to integrate and interpret the data successfully. In the future, adopting a systems-thinking approach will result in a mixture of data from a variety of sources, including the atmosphere, soil, water, and foods; data will be related to genetics and health outcomes; and they will range from highly unstructured to highly structured data. These factors will require even more multidisciplinary collaboration among agency scientists.
Warehousing and Mining
As increasingly large amounts of data continue to be generated through designated systems—such as environmental monitoring, biomarker and other exposure surveillance data, disease surveillance, and designed epidemiologic and experimental studies—or streamed from community crowdsourcing, EPA is faced with both an opportunity and a challenge of channeling and integrating data into a massive “data warehouse”. Data warehousing is a well-developed concept and a common practice in business (Miller et al. 2009). In EPA, the adaptation of and transition to data warehousing will continue to evolve with good protocols, such as EPA’s Envirofacts Warehouse (Pang 2009; Egeghy et al. 2012) and the Aggregated Computational Toxicology Resource (Egeghy et al. 2012; Judson et al. 2012). In the future, data in EPA’s warehouse will come from diverse sources, from multiple media, and across geographic, physical, and institutional boundaries. Recent efforts to integrate the US Geological Survey’s National Water Information System with EPA’s Storage and Retrieval System are an example (Beran and Piasecki 2009). To harvest relevant information from massive datasets to support EPA’s science and regulatory activities, integration of heterogeneous databases and mining of these massive datasets present some new opportunities. A recent application involving the European Union’s Water Resource Management Information System is a case in point (Dzemydienė et al. 2008).
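As a rough illustration of what integrating heterogeneous databases involves, the sketch below maps records from two sources with different layouts and units into one common record type. The field names, source labels, and unit table are hypothetical and do not reflect the actual schemas of the National Water Information System or the Storage and Retrieval System.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Common warehouse record (hypothetical schema)."""
    site_id: str
    analyte: str
    value_mg_per_l: float
    source: str


# Divisors that convert a reported unit to mg/L.
UNITS_PER_MG_PER_L = {"mg/L": 1.0, "ug/L": 1000.0}


def from_source_a(rec: dict) -> Observation:
    """Map a record from hypothetical source A, which states its own unit."""
    return Observation(
        site_id=rec["site"],
        analyte=rec["param"].strip().lower(),
        value_mg_per_l=rec["val"] / UNITS_PER_MG_PER_L[rec["unit"]],
        source="A",
    )


def from_source_b(rec: dict) -> Observation:
    """Map a record from hypothetical source B, which always reports ug/L."""
    return Observation(
        site_id=rec["station_id"],
        analyte=rec["characteristic"].strip().lower(),
        value_mg_per_l=rec["result_ug_l"] / UNITS_PER_MG_PER_L["ug/L"],
        source="B",
    )


def integrate(a_records, b_records):
    """Return all observations in the common schema, ordered for querying."""
    merged = [from_source_a(r) for r in a_records]
    merged += [from_source_b(r) for r in b_records]
    return sorted(merged, key=lambda o: (o.site_id, o.analyte, o.source))
```

Keeping the `source` field on every record preserves provenance, so differences in quality standards between contributors remain visible after the data are merged.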
Data-mining has become a standard for analyzing massive, multisource, heterogeneous data on consumer behavior used in business (Ngai et al. 2009). EPA should and can adopt this data analytic paradigm to support its knowledge-discovery process. The paradigm is increasingly important at a time when the discovery of new evidence or a new data model can be bolstered by dynamic mining of large amounts of data, including environmental indicators of air and water, satellite imagery of climate change from representative population databases, health indicators from disease surveillance systems and medical databases, social behavioral patterns, individual lifestyle data, and -omics data and disease pathways. That will require EPA to invest its resources to continue the development of new analytic and computational methods to deal with static datasets (for example, modeling of complex biologic systems and air and water models) and to adapt and develop new data-mining techniques to process, visualize, link, and model the massive amounts of data that are streaming from multiple sources. EPA is making progress in that direction in its Aggregated Computational Toxicology Resource System (Judson et al. 2012). Successful cases have also been reported for ecologic modeling (Stockwell 2006), air-pollution management (Li and Shue 2004), and toxicity screening (Helma et al. 2000; Martin et al. 2009), to name a few.
Informatics, data warehousing, and data-mining afford EPA powerful tools for maximal use of the wealth of information that will continue to be gathered by it, other agencies, and the public on an unprecedented scale. Data analysis and modeling in many cases will be accomplished through informatics techniques, as is already the case in the analysis of -omics data (Ng et al. 2006; Baumgartner et al. 2011; Roy et al. 2011). As EPA moves forward with analyzing and modeling large sets of data, it should keep three points in mind:
• Information generation and information gathering are accelerating exponentially, and EPA will not be able to generate all the data needed to address complex environmental and health problems. It would benefit the agency to continue to develop its capacity to access, harvest, manage, and integrate data from diverse sources and different media and across geographic and disciplinary boundaries rapidly and systematically.
• Links between environmental change, exposure, human behavior, and human health are complex, and seamless integration and dynamic mining of diverse datasets will boost the chance of discovering such links. For example, to derive personal exposure estimates for particulate matter smaller than 2.5 μm in diameter (PM2.5), it is necessary to integrate environmental data, human behavioral data, and insight about how PM2.5 penetrates various indoor microenvironments. The exposure estimates are then linked to disease-mechanism data and health data. Such an approach is not difficult to appreciate in principle, but its practice hinges on how successfully an informatics approach can be adapted to mine the massive data from diverse systems. EPA has been a leader in air-quality research and associated health effects of exposure to air pollutants, as showcased through its contributions to the Six Cities Study (Dockery et al. 1993) and the National Morbidity, Mortality, and Air Pollution Study (Samet et al. 2000; Dominici et al. 2006), and it is in a strong position to retain its cutting-edge position by adapting informatics approaches to the analysis and modeling of diverse and massive datasets.
• As environmental challenges continue to emerge and evolve, EPA’s approach to problem-solving will need to be dynamic and adaptive. Having a cutting-edge capacity of data warehousing, data-mining, bioinformatics, environmental informatics, and health informatics will boost EPA’s ability to integrate massive external data in a timely fashion, to adopt new techniques, to borrow scientific and technical expertise from outside the agency, and to be more responsive and anticipatory.
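The PM2.5 example in the second point can be made concrete with a simplified time-weighted microenvironment calculation. The sketch below is an assumption for illustration, not a method stated in the text: it combines one ambient concentration (environmental data), time spent in each location (behavioral data), and an infiltration factor per location (penetration indoors), and it ignores indoor sources of particles. All names and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroenvironmentVisit:
    name: str
    hours: float
    # Fraction of ambient PM2.5 reaching the person there (1.0 outdoors).
    infiltration_factor: float


def personal_exposure(ambient_ug_m3: float, visits) -> float:
    """Time-weighted average exposure to ambient-origin PM2.5, in ug/m3."""
    total_hours = sum(v.hours for v in visits)
    if total_hours <= 0:
        raise ValueError("visits must cover a positive amount of time")
    weighted = sum(
        v.hours * v.infiltration_factor * ambient_ug_m3 for v in visits
    )
    return weighted / total_hours


# Hypothetical day: mostly at home, some time outdoors and in an office.
day = [
    MicroenvironmentVisit("home", 16.0, 0.5),
    MicroenvironmentVisit("outdoors", 2.0, 1.0),
    MicroenvironmentVisit("office", 6.0, 0.25),
]
estimate = personal_exposure(20.0, day)
```

Each input to this calculation typically lives in a different system (monitoring networks, activity surveys, building studies), which is why the estimate depends on the kind of data integration the text describes.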
As EPA continues to strengthen its informatics infrastructure, it will be important to pay attention to new analytic and statistical methods to address emerging modeling issues and to bridge methodologic gaps. Several outstanding issues warrant high priority. One challenge is to analyze large amounts of data from diverse sources without having a shared standard for the data collection (Hall et al. 2005). For example, screening and identifying complex chemical mixtures in the natural environment are difficult because there are so many possible mixtures and the mixtures change temporally and spatially (Casey et al. 2004). A second example involves conducting gene-screening analysis to differentiate among tens of thousands of genes or single-nucleotide polymorphisms along a hypothesized disease pathway with only a small number of subjects. Overzealous findings of a positive association are a consequence of this high-dimension problem (Rajaraman and Ullman 2011). Mining that type of data could pose serious challenges in validity and utility when the data are from across geographic and disciplinary boundaries and have heterogeneous quality standards. A special danger with huge datasets is the problem of multiple comparisons, which can lead to large numbers of false-positive results. Also with such data, bias sometimes dominates randomness: increasing the amount of data generally reduces variances, sometimes close to zero, but it does not reduce bias. In fact, it may even increase bias by diverting attention from the basic quality of the data. Another challenge involves the modeling of complex biologic systems (such as pathway models, physiologically based pharmacokinetic and pharmacodynamic models, and hospital admission data). Information from a small number of static datasets is insufficient to support a large number of unknown model parameters. Two approaches are widely used: fixing some parameters at values that have only weak support from external systems (Wang et al. 1997) and tightening the range of variation of the values of the parameters by imposing probabilistic distributions in a Bayesian approach, such as Markov chain Monte Carlo sampling (Bois 2000).
Those methods may give the user an unwarranted sense of truth when there are substantial uncertainties in the true model. As informatics and data-mining become standard, techniques for data analysis will be increasingly hybrid, combining mathematical, computational, graphical, and statistical tools and qualitative methods to conduct data exploration, machine learning, modeling, and decision-making. Developing its inhouse capability will help EPA to adopt and apply the new techniques.
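The multiple-comparisons danger noted above can be seen in a small simulation. The sketch assumes that every tested hypothesis is truly null, so each p-value is uniformly distributed between 0 and 1 and any "significant" result is a false positive. The Bonferroni correction shown for contrast is one standard remedy chosen here for illustration; it is not a method prescribed in the text.

```python
import random


def count_false_positives(n_tests: int, alpha: float = 0.05, seed: int = 0):
    """Return (naive, bonferroni) counts of rejected true-null hypotheses."""
    rng = random.Random(seed)
    # Under a true null hypothesis a p-value is uniform on [0, 1).
    p_values = [rng.random() for _ in range(n_tests)]
    # Naive rule: test each hypothesis at level alpha on its own.
    naive = sum(p < alpha for p in p_values)
    # Bonferroni rule: divide alpha by the number of tests performed.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return naive, bonferroni


naive_hits, corrected_hits = count_false_positives(10_000)
```

With 10,000 null tests, the naive rule is expected to flag about 5% of them (roughly 500) even though no real association exists, whereas the corrected threshold flags few or none. The correction costs statistical power, which is one reason the text calls for attention to new methods for high-dimensional data.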
Data Sharing and Distribution
EPA devotes substantial resources to the public sharing of data resources. It also provides support and encouragement to software and application (app) developers for the creation of both institutional and consumer applications for accessing, presenting, and analyzing available environmental data. One example is the Toxics Release Inventory. Others being developed are the EPA Saves Your Skin mobile telephone app, which provides ZIP code-based ultraviolet index information to help the public take action to protect their skin, and an air-quality index mobile app, which feeds air-quality information based on ZIP code. The agency has made strides in analytic and simulation activities, as shown in the leadership role that it has played in computational toxicology (see
As information trends move from long-term data to data that are gathered in nearly real time from dispersed geographic sites, there will not be time for a traditional cycle in which the desired information needs to be extracted from the original compilation, reformatted to a specific standard, and finally loaded into an analytic application. It will instead be necessary to literally “send the algorithm to the data” and receive and collect the results centrally. In other words, the complex formulas developed to analyze the data may be applied at the site and time of data collection, rather than the data being sent to a central data-processing site for analysis. That approach, first developed by Google in 2004, is named MapReduce and uses a functional programming model (Dean and Ghemawat 2004). Hadoop, a widely available implementation of MapReduce, is available in open-source form and from several major vendors. Not only can Hadoop programming parallelize the problem of accessing widely distributed data; it is especially useful for processing unstructured data or combining them with traditional structured data.
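The idea of sending the algorithm to the data can be sketched in plain Python. This is only an illustration of the map-and-reduce pattern, not the Hadoop API; the site names, readings, and function names are hypothetical. Each site reduces its own readings to a small summary, and only those summaries travel to the central step that combines them.

```python
from functools import reduce


def map_site(readings):
    """Runs where the data live: summarize readings as (sum, count)."""
    return (sum(readings), len(readings))


def combine(a, b):
    """Merge two partial summaries; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def global_mean(site_summaries):
    """Central reduce step over the small per-site summaries."""
    total, count = reduce(combine, site_summaries, (0.0, 0))
    if count == 0:
        raise ValueError("no readings were reported by any site")
    return total / count


# Hypothetical monitoring sites; one reported nothing in this period.
sites = {"site_a": [10.0, 12.0], "site_b": [8.0], "site_c": []}
summaries = [map_site(readings) for readings in sites.values()]
overall = global_mean(summaries)
```

Because `combine` is associative and commutative, the per-site summaries can be computed in parallel and merged in any order, which is what allows frameworks of this kind to spread the work across many machines.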
Baumgartner, C., M. Osl, M. Netzer, and D. Baumgartner. 2011. Bioinformatic-driven search for metabolic biomarkers in disease. J. Clin. Bioinform. 1:2, doi:10.1186/2043-9113-1-2.
Beran, B., and M. Piasecki. 2009. Engineering new paths to water data. Comput. Geosci. 35(4):753-760.
Bois, F.Y. 2000. Statistical analysis of Fisher et al. PBPK model of trichloroethylene kinetics. Environ. Health Perspect. 108(suppl. 2):275-282.
Casey, M., C. Gennings, W.H. Carter, V.C. Moser, and J.E. Simmons. 2004. Detecting interaction(s) and assessing the impact of component subsets in a chemical mixture using fixed-ratio mixture ray designs. J. Agr. Biol. Environ. Stat. 9(3):339-361.
Cockcroft, A. 2011. Net Cloud Architecture. Velocity Conference, June 14, 2011 [online]. Available: http://www.slideshare.net/adrianco/netflix-velocity-conference-2011 [accessed Apr. 10, 2012].
Dean, J., and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. Pp. 137-149 in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), December 5, 2004, San Francisco, CA [online]. Available: http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf [accessed Mar. 30, 2012].
Dockery, D.W., C.A. Pope, III, X. Xu, J.D. Spengler, J.H. Ware, M.E. Fay, B.G. Ferris, and F.A. Speizer. 1993. An association between air pollution and mortality in six US cities. N. Engl. J. Med. 329(24):1753-1759.
Dominici, F., R.D. Peng, M.L. Bell, L. Pham, A. McDermott, S.L. Zeger, and J.M. Samet. 2006. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. JAMA 295(10):1127-1134.
Dzemydienė, D., S. Maskeliūnas, and K. Jacobsen. 2008. Sustainable management of water resources based on web services and distributed data warehouses. Technol. Econ. Dev. Econ. 14(1):38-50.
Egeghy, P.P., R. Judson, S. Gangwal, S. Mosher, D. Smith, J. Vail, and E.A. Cohen Hubal. 2012. The exposure data landscape for manufactured chemicals. Sci. Total Environ. 414(1):159-166.
Hall, P., J.S. Marron, and A. Neeman. 2005. Geometric representation of high dimension, low sample size data. J. R. Statist. Soc. B 67(3):427-444.
Helma, C., E. Gottmann, and S. Kramer. 2000. Knowledge discovery and data mining in toxicology. Stat. Method. Med. Res. 9(4):329-358.
Hilbert, M., and P. López. 2011. The world’s technological capacity to store, communicate, and compute information. Science 332(6025):60-65.
Judson, R.S., M.T. Martin, P. Egeghy, S. Gangwal, D.M. Reif, P. Kothiya, M. Wolf, T. Cathey, T. Transue, D. Smith, J. Vail, A. Frame, S. Mosher, E.A. Cohen-Hubal, and A.M. Richard. 2012. Aggregating data for computational toxicology applications: The US Environmental Protection Agency (EPA) Aggregated Computational Toxicology Resource (ACToR) System. Int. J. Mol. Sci. 13(2):1805-1831.
Lee, M., and W. Eason. 2010. The Silver Lining of Cloud Computing. Presentation at Environmental Information Symposium 2010-Enabling Environmental Protection through Transparency and Open Government, May 13, 2010, Philadelphia, PA [online]. Available: http://www.epa.gov/oei/symposium/2010/lee.pdf [accessed Apr. 2, 2012].
Li, S.-T., and L.-Y. Shue. 2004. Data mining to aid policy making in air pollution management. Expert Syst. Appl. 27(3):331-340.
Martin, M.T., R.S. Judson, D.M. Reif, R.J. Kavlock, and D.J. Dix. 2009. Profiling chemicals based on chronic toxicity results from the US EPA ToxRef Database. Environ. Health Perspect. 117(3):392-399.
Miller, F.P., A.F. Vandome, and J. McBrewster, Jr., eds. 2009. Data Warehouse: Extract, Transform, Load, Metadata, Data Integration, Data Mining, Data Warehouse Appliance, Database Management System, Decision Support System. Orlando, FL: Alpha Press.
Moore, G. 1965. Cramming more components onto integrated circuits. Electronics 38(8) [online]. Available: http://www.cs.utexas.edu/~fussell/courses/cs352h/papers/moore.pdf [accessed Apr. 6, 2012].
Ng, A., B. Bursteinas, Q. Gao, E. Mollison, and M. Zvelebil. 2006. Resources for integrative systems biology: From data through databases to networks and dynamic system models. Brief. Bioinform. 7(4):318-330.
Ngai, E.W.T., L. Xiu, and D.C.K. Chau. 2009. Application of data mining techniques in customer relationship management: A literature review and classification. Expert Syst. Appl. 36(2):2592-2602.
NoSQL. 2012. NoSQL Website [online]. Available: http://nosql-database.org/ [accessed Apr. 30, 2012].
Pang, L. 2009. Best practices in data warehousing. Pp. 146-152 in Encyclopedia of Data Warehousing and Mining, 2nd Ed., J. Wang, ed. Hershey, PA: Information Science Reference.
Rajaraman, A., and J.D. Ullman. 2011. Mining of Massive Datasets. New York: Cambridge University Press.
Robert, L.G. 2000. Beyond Moore’s law: Internet growth trends. Computer 33(1):117-119.
Roy, P., C. Truntzer, D. Maucort-Boulch, T. Jouve, and N. Molinari. 2011. Protein mass spectra data analysis for clinical biomarker discovery: A global review. Brief. Bioinform. 12(2):176-186.
Samet, J.M., F. Dominici, F.C. Curriero, I. Coursac, and S.L. Zeger. 2000. Fine particulate air pollution and mortality in 20 US cities, 1987–1994. N. Engl. J. Med. 343(24):1742-1749.
Stockwell, D.R.B. 2006. Improving ecological niche models by data mining large environmental datasets for surrogate models. Ecol. Model. 192(1-2):188-196.
Wang, X., M.J. Santostefano, M.V. Evans, V.M. Richardson, J.J. Diliberto, and L.S. Birnbaum. 1997. Determination of parameters responsible for pharmacokinetic behavior of TCDD in female Sprague-Dawley rats. Toxicol. Appl. Pharmacol. 147(1):151-168.