Very-large-scale data sets introduce many data management challenges. Two notable ones are managing the complexity of the data and harnessing the computational power required to ingest and analyze them. Tools for analyzing massive data must be developed with an understanding of the evolving capabilities for managing massive data.
Increasing the number of data sets brought to bear on a given problem increases the ability to address the problem, at least in principle. Sets that add rows (e.g., adding more patient records in a health-care data set) to existing data tables can increase their statistical power. Sets that add columns (e.g., adding the patient’s smoking history to each patient record) can enable new applications of the data. In the health-care domain, cross-collection data mining can enable exciting advances in personalized medicine. Another example is in the finance domain, where an ability to evaluate loans across multiple banks (e.g., as mortgages are bought and sold) is more powerful than an analysis that is limited to the records of just one bank.
Unfortunately, scaling the number of data sets is very difficult in practice due to heterogeneity in data representations and semantics, data quality, and openness. These aspects are explored below.
Data Representation and Semantics
Data sets managed by different (sub)organizations tend to have disparate representations and semantics. These attributes are described by metadata, which are critical for ensuring that the data can be effectively interpreted. Metadata comprise both structural and discipline-specific components. Structural metadata describe the structure and organization of the data. Discipline-specific metadata describe the characteristics or uniqueness of the data for a particular discipline.
Many disciplines are working toward defining rich semantic models that can help in data searching and understanding. Ideally, data representation standards would permit improvisation; for example, a standard might stipulate a set of structured fields plus a free-form key/value map that accommodates unforeseen information. Such an approach can interpolate between the current extremes of restrictive up-front standardization and free-form chaos. Rich semantics allow tools to be developed that can effectively exploit relationships, enabling improved discovery and navigation, and several standards and technologies are emerging to support them. However, current capabilities still depend heavily on defining well-formed models and structures up front. It may be important to consider how to evolve data standards over time so that, as patterns are recognized in free-form entries, they can be gradually folded into the structured portion of the representation.
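The hybrid representation just described can be sketched in a few lines. The record layout, field names, and promotion rule below are all illustrative assumptions, not part of any actual standard: a few typed, standardized fields are paired with a free-form map, and keys that recur across many records become candidates for promotion into the structured portion.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical record layout: standardized, typed fields plus a free-form
# key/value map for information the standard did not foresee.
@dataclass
class PatientRecord:
    patient_id: str          # structured, required by the standard
    birth_year: int          # structured, required by the standard
    extras: dict = field(default_factory=dict)  # free-form overflow

records = [
    PatientRecord("p1", 1970, {"smoking_history": "former"}),
    PatientRecord("p2", 1985, {"smoking_history": "never"}),
    PatientRecord("p3", 1990, {}),
]

# As a pattern emerges in the free-form entries (here, smoking_history
# appears in a majority of records), a curator might promote it to a
# structured field in the next revision of the standard.
key_counts = Counter(k for r in records for k in r.extras)
promotable = [k for k, n in key_counts.items() if n > len(records) / 2]
print(promotable)  # ['smoking_history']
```

This is only a sketch of the idea; a real standard would also have to specify types and semantics for promoted fields.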
An alternative to requiring extensive metadata up-front is to aim for more of a “data co-existence” approach, sometimes referred to as a “dataspace.” A good description of that concept is captured in the following Wikipedia entry:
[Dataspaces provide] base functionality over all data sources, regardless of how integrated they are. For example, a [system] can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, then additional effort can be applied to more closely integrate those sources in an incremental [“pay-as-you-go”] fashion. Similarly, in terms of traditional database guarantees, initially a dataspace system can only provide weaker guarantees of consistency and durability. As stronger guarantees are desired, more effort can be put into making agreements among the various owners of data sources, and opening up certain interfaces (e.g., for commit protocols).1
In a sense, this approach postpones the labor-intensive aspects of data fusion until they are absolutely needed.
Data Quality and Provenance
Real-world data sets vary in quality, due to a range of factors, including imperfect data collection instruments, human data entry errors, data fusion mistakes, and incorrect inferences. When dealing with a large number of data sets from diverse sources, systematic recording and tracking of data-quality metadata are very important. Unfortunately, in full generality that goal appears to be extremely challenging.
A less daunting but still very ambitious goal is to track the provenance of data elements—that is, their origin, movement, and processing histories. Provenance is also useful for purposes other than reasoning about quality, such as propagating data updates efficiently and attributing data properly in scientific publications. Provenance capture, representation, and querying are active research topics in the communities dealing with scientific workflows, file systems, and databases. The three communities’ approaches emphasize different priorities among various trade-offs (e.g., the trade-off between capture overhead and query expressiveness). None of these approaches has reached significant levels of adoption in practice. Currently, provenance is managed with one-off approaches and standards—for example, in planetary science research, each observation is tagged with the time, location, and platform from which it originated—and there is little systematic support for propagating provenance metadata with data wherever they travel.
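The core idea of propagating provenance with data can be sketched as follows. This is a minimal illustration, not any standard provenance model; the class and operation names are invented for the example. Each value carries the ordered history of steps that produced it, and every derivation extends that history.

```python
# A minimal sketch of propagating provenance alongside data: each value
# carries the chain of processing steps that produced it.
class Provenanced:
    def __init__(self, value, history):
        self.value = value
        self.history = list(history)   # ordered processing history

    def derive(self, op_name, fn):
        # Apply a processing step and extend the recorded history.
        return Provenanced(fn(self.value), self.history + [op_name])

# An observation tagged, as in the planetary science example, with the
# time and platform from which it originated (labels are illustrative).
raw = Provenanced(42.7, ["observed@orbiter-1 2012-03-01"])
calibrated = raw.derive("calibrated", lambda v: v * 0.98)
print(calibrated.history)  # ['observed@orbiter-1 2012-03-01', 'calibrated']
```

The hard research problems noted above, such as keeping capture overhead low while supporting expressive queries over these histories, are not addressed by so simple a scheme.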
Representation and propagation of constrained forms of data-quality metadata, such as confidence scores and error bars, is also an active area of research, although to date most work in that area has concentrated on theoretical issues. There could be an opportunity to consider how more formal statistical notions of uncertainty might be incorporated. Overall, there has been little work on scalable systems for the management of uncertain data.
Data Openness
Non-public data sets require great care when being shared across organizations. To maximize the pool of data that can be shared openly, technologies are needed that fuse open data while protecting proprietary data and preserving anonymity requirements. Sharing data derived from private data (e.g., statistics created by aggregating private data points) is especially problematic because of data leakage, both accidental and via “harvesting” by malicious parties.
Massive data processing, storage, and analysis will require support from distributed and parallel processing systems. Because the processing speed of microelectronics is not increasing as rapidly as it used to, modern central processing units (CPUs) are instead becoming highly parallel. That is the only way to continue the performance improvements in large-scale processing that are demanded by applications.
In order to enable performance improvements in processing, input/output (I/O) and storage must also become parallelized and distributed. Amdahl’s Law—a rule of thumb that has been valid for nearly 50 years—states that a balanced system needs one bit of I/O per CPU cycle, and thus improvements in processing speed must be matched by improvements in I/O. High I/O performance in turn necessitates heavy use of local (on-chip) data storage, so that storage is as distributed and parallel as processing.
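A back-of-the-envelope calculation shows what the one-bit-per-cycle rule implies; the clock rate and core count below are illustrative figures, not drawn from the text.

```python
# Applying the balanced-system rule of thumb: one bit of I/O per CPU cycle.
cycles_per_second = 3.0e9                     # a 3 GHz core (illustrative)
cores = 16                                    # a modest multicore server
bits_per_second = cycles_per_second * cores   # required I/O bandwidth
gigabytes_per_second = bits_per_second / 8 / 1e9
print(round(gigabytes_per_second, 1))         # prints 6.0 (GB/s of I/O)
```

Sustaining several gigabytes per second of I/O for a single server is far beyond what one disk delivers, which is why balanced systems push toward parallel, distributed storage.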
The reason that single-threaded computation is still so common is that parallel and distributed systems are difficult to configure and maintain, and parallel and distributed software is difficult to write. The end of the ever-faster CPU era has led to once-exotic technologies becoming commonplace and to new parallel programming and data management systems that are easier to use. This section outlines recent trends in parallel and distributed computing and I/O.
Hardware Parallelism
“Hardware parallelism” is defined here as the presence in a system of many separate computing elements that operate simultaneously. In some cases the elements perform highly specialized tasks and, as a result, can do so very quickly with many elements operating in parallel. A long-standing example of hardware parallelism is integrated circuits for signal processing that can perform Fast Fourier Transforms as a hardware operation. More recent developments are motivated by problems in network management and by the hardware developed to accelerate computer graphics.
Specialized networking equipment, such as very-high-speed network monitoring and firewall gear, commonly makes use of specialized hardware known as field programmable gate arrays (FPGAs) and ternary content addressable memory (TCAM). FPGAs are customizable integrated circuits that can be configured for high performance on special-purpose tasks. A TCAM is akin to a cache memory, but TCAM chips allow the user to specify tie-breaking rules in the case of multiple matches. For example, TCAMs store subnetwork address ranges of interest and the FPGAs perform specialized tasks, such as regular expression matching on the packet content and routing (perhaps to a host for monitoring) based on the results of the TCAM
and other analyses. FPGAs have also found use in data warehouse analytics engines, performing filtering and other tasks on data streaming from disk.
Another recent development in hardware parallelism is motivated by the graphics processing units (GPUs) developed to accelerate computer graphics. GPUs are highly parallel processors, originally developed for high-end graphics applications and computer games. Various vendors (NVIDIA, ATI, IBM) have developed such platforms, which are increasingly used for applications requiring very high floating-point performance. (Many of the world’s top 500 computers are hybrid machines consisting of a large array of traditional CPUs and GPUs.) A typical GPU card can outperform a CPU by up to an order of magnitude, depending on the application, and the performance of a typical high-end graphics card exceeds a teraflop. Sorting performance is also spectacular, exceeding the rate of 1 billion records per second on some benchmarks. The performance per unit power dissipated is also significantly better with GPUs than with traditional processors, an issue that is becoming increasingly important for the total cost of ownership of high-end computing systems.
The main disadvantage in applying GPUs to large-scale data-intensive problems is the rather limited memory (typically 2-3 gigabytes (GB), up to 6 GB currently) attached to the cards. While data access for the on-board memory is very fast, over 100 GB/s, moving data in and out of the cards can be a bottleneck.
Also, GPU programming is still rather complicated, requiring special environments (e.g., CUDA, OpenCL). Because of the single-instruction/multiple-data nature of the hardware, special attention must be paid to laying out the data to match the configuration of the low-level hardware. The situation is getting better every year, as more and more algorithms are ported to the GPU environment and increasingly sophisticated debugging environments emerge. Much of the Linear Algebra PACKage library (LAPACK) has been ported, the Fastest Fourier Transform in the West (FFTW) library for performing discrete Fourier transforms is part of the basic CUDA library, and many graph algorithms have also been successfully ported to GPUs.
The hardware is quickly evolving. Upcoming GPU cards will support more generic memory access, better communication with the host, more flexible task switching, and preemptive multitasking, making them increasingly comparable in programmability to traditional multicore architectures. There are many current efforts to integrate GPUs with databases and stream processing systems.
Over the past decade, conventional server-class CPUs have gained internal parallelism in two ways: by growing the number of cores (independent execution engines) per chip or per package and by increasing the number of operations a core can execute per cycle, largely by means of parallel operations on short vectors of data. Both trends increase the challenge of making good use of the available computational resources in real applications. Tools for automatically parallelizing and vectorizing applications are not currently very effective.
To date, the growth in the number of cores per chip has been somewhat restrained by market forces, in particular the need to retain good performance on non-parallel code. But the industry is well aware that higher peak performance, and higher performance per watt, can be achieved by integrating a much larger number of cores running at lower speeds. Whether massively multicore CPUs will soon play a major role in data analysis is hard to predict, but their eventual arrival now seems all but inevitable (Asanovic et al., 2006).
The rapid proliferation of flash memory is another very relevant trend. As noted in a recent publication on the topic,
Traditionally, system designers have been forced to choose between performance and safety when building large-scale storage systems. Flash storage has the potential to dramatically alter this trade-off, providing persistence as well as high throughput and low latency. The advent of commodity flash drives creates new opportunities in the data center, enabling new designs that are impractical on disk or RAM infrastructure (Balakrishnan et al., 2012, p. 1).
To date, flash memory has been used as a fast alternative for disk storage, but it appears to be a promising technology for lowering power requirements while maintaining high reliability and speed for large-scale data systems.
Data Stream Management Systems
Data stream management systems (DSMS) have emerged as a significant research topic over the past decade, with many research systems (Stream, Niagara, Telegraph, Aurora, Cougar) and commercial systems (Streambase, Coral8, Aleri, InfoSphere Streams, Truviso) having been developed. A DSMS runs a collection of standing queries on one or more input streams. The source streams are generally real-time reports of live phenomena. Examples
include stock ticker and other financial streams, feeds from sensor networks (e.g., highway monitoring), Web click-streams, video streams, streams of data from scientific experiments (e.g., astronomical or high-energy physics observations), and communications network traffic monitoring. These feeds are processed, correlated, and summarized by the DSMS on a continual basis for immediate action or further analysis.
There are two main methods for writing a query set for a DSMS. The first method uses a highly structured query language, often a variant of SQL (Structured Query Language), such as the Continuous Query Language (CQL) developed for Stream. Another common class of stream query languages incorporates regular-expression-matching features to perform complex event detection. These stream languages differ from conventional SQL in that they generally require queries to use windowing constructs to limit the scope of the data used to compute any output record. For example, a stream query might ask, “for each five-minute period, report the number of distinct source Internet protocol addresses of packets flowing through this network interface.”
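The windowed query quoted above can be sketched as a toy single-process computation. Here the stream is a finite list of (timestamp, source IP) pairs, whereas a real DSMS would evaluate the query continuously over an unbounded stream; the tumbling-window bucketing is the essential idea.

```python
from collections import defaultdict

WINDOW = 300  # five minutes, in seconds

def distinct_sources_per_window(packets):
    # Bucket each packet into a tumbling five-minute window, then count
    # the distinct source addresses seen in each window.
    windows = defaultdict(set)
    for ts, src_ip in packets:
        windows[ts // WINDOW].add(src_ip)
    return {w: len(ips) for w, ips in sorted(windows.items())}

packets = [(10, "10.0.0.1"), (20, "10.0.0.2"), (40, "10.0.0.1"),
           (310, "10.0.0.3"), (330, "10.0.0.3")]
print(distinct_sources_per_window(packets))  # {0: 2, 1: 1}
```

The window construct is what makes the query computable at all: without it, a distinct count over an unbounded stream would require unbounded state.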
The second method for specifying a query set to a DSMS uses a graphical “boxes-and-arrows” approach. Boxes represent data-processing tasks (or data sources), and arrows represent data flow. The programmer selects and customizes the boxes, then connects them with arrows to develop a data processing specification—often through a graphical user interface (e.g., Streambase, Infosphere Streams). The motivation for the boxes-and-arrows method of programming is that many stream analyses are difficult to express in an SQL-like language (e.g., time-series analysis for financial applications, facial recognition in video streams). However, a DSMS query set expressed using a structured query language is generally easier to write and maintain, and it can be more readily optimized.
There is not a rigid boundary between language-based and “boxes and arrows” data stream systems, as one can generally incorporate special-purpose operators into a language-based DSMS, and structured language programming tools have been developed for “boxes-and-arrows” DSMSs (e.g., Streambase, InfoSphere Streams).
If the DSMS is programmed using a declarative query language, the query analyzer converts the textual queries into a collection of stream operators, so that in either case the query optimizer is presented with a collection of interconnected stream operators. A directed graph of stream operators offers special opportunities because the optimizer sees a large, long-running system. A stream query system thus presents many opportunities for multi-query optimization, ranging from scan sharing to identifying and merging common execution subtrees, that are not normally available in a database management system (DBMS). The well-structured and explicit nature of the data flow in a data stream system can enable highly effective optimizations for parallel and distributed stream systems.
For example, GS Tool from AT&T Labs Research will analyze its query set to determine an optimal hash partitioning of the packet stream from the network interfaces that it monitors. The output of a high-speed interface, such as 10 Gigabit Ethernet, is normally split into multiple substreams. Very-high-speed links (e.g., the optical transmission rate OC-768) are normally split into eight 10-Gigabit Ethernet streams by specialized networking equipment, at the direction of the query optimizer. The InfoSphere Streams system makes many optimizations to the query graph to optimize parallel and distributed processing: splitting streams to enable parallelism of expensive operators, coalescing query operators into processing elements to minimize data copying, and allocating processing element instances to cluster nodes to maximize parallelism while minimizing data copy and network transmission overhead.
Cluster Batch (Grid) Systems
A cluster of high-performance servers can offer powerful computing resources at a moderate price. One way to take advantage of the computational resources of a cluster of servers is to use grid software. Examples of grid systems are Sun Grid and Load Sharing Facility. A typical grid system allows users to submit collections of jobs for execution, and the jobs are queued by the grid job manager, which schedules them for execution on nodes in the cluster. The job manager performs load balancing among the cluster nodes and shares compute resources among the users. A grid system can use a storage area network to create a local file system for a user’s job, or it can use a cluster file system.
A cluster file system is a common solution to the challenge of accessing massive distributed data. Such a system provides location-transparent access to data files to the servers on the cluster. The discussion that follows distinguishes between two types of cluster file systems: those that are POSIX compliant (or nearly so) and those that are not.2
A POSIX-compliant cluster file system is attractive to programmers because it provides a traditional interface for data access while requiring minimal reworking of a code base to take advantage of a cluster’s resources. POSIX-compliant cluster file systems are often built on top of a storage area network, typically one or more racks of hard drives attached to a high-speed network such as Fibre Channel.
2 POSIX refers to the Portable Operating System Interface standards developed by the Institute of Electrical and Electronics Engineers and International Organization for Standardization to ensure compatibility between software and operating systems.
A high-performance cluster file system such as Sun Microsystems’ Quick File System (QFS) separates metadata processing from disk block access. A metadata server manages one or more file systems, maintaining directories and I-nodes (perhaps on a separate and specialized storage device, such as an array of solid-state drives) and serving file metadata requests from the compute clients on the cluster. The clients access the actual data by direct access to the disks of the storage area network. File reliability and availability are typically provided using redundant array of independent disks (RAID) technology.
Although a POSIX-compliant cluster file system is intended to be a transparent replacement for a local file system, the complexities of implementing a distributed file system generally result in some gaps in compliance. These gaps generally occur where complex synchronization would be involved, e.g., file locks and concurrent file access by processes on different servers.
The difficulties of providing POSIX-compliance in a very-large-scale cluster have motivated the development of non-POSIX-compliant file systems, for example the Google file system and the Hadoop distributed file system (HDFS). The discussion below is based on the Google file system (Ghemawat et al., 2003), but HDFS is similar.
The Google file system is designed to support distributed analysis tasks, which primarily make scans of very large files. The underlying assumptions of the Google file system are as follows:
- Files are very large and contain well-defined records.
- Files are usually updated by processes that append well-defined records. The sequential ordering of the appended records is not critical, and many processes may be appending records concurrently.
- Analysis processes generally make large sequential scans of the data files.
- Actual files are stored in the local file systems of servers configured into one or more racks in a data center.
As with QFS and similar systems, the Google file system separates the metadata server from the data servers. The metadata server keeps a hot spare in sync by logging all metadata operations to the hot spare before responding to client requests. A single metadata server might coordinate a file system distributed over thousands of nodes, so minimizing the number of metadata requests is critical. The file block size is therefore very large—64 megabytes, as compared with the 4 kilobytes typical of a local file system.
To minimize the overhead and complexity of metadata-server failure recovery, the Google file system makes only weak guarantees about the correctness of the data written into a file. Duplicate records might be written into the file, and the file might contain garbage areas. The clients that use the Google file system must make provisions for these problems: records should contain consistency information (e.g., checksums), and analysis clients must filter out duplicate records. File availability is ensured using replication; a file block is, by default, replicated to three storage hosts, although critical or frequently accessed files might have a higher degree of replication.
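The client-side provisions just described might look like the following sketch. The record format, identifiers, and helper names are illustrative, not the actual Google file system client protocol: each record carries an identifier and a checksum, and the reader drops corrupted regions and duplicate appends.

```python
import zlib

def make_record(record_id, payload):
    # Store a CRC-32 checksum alongside each appended record.
    data = f"{record_id}:{payload}".encode()
    return (data, zlib.crc32(data))

def read_valid(records):
    # Skip garbage areas (checksum mismatch) and duplicate appends
    # (record identifier already seen), yielding clean payloads.
    seen = set()
    for data, checksum in records:
        if zlib.crc32(data) != checksum:
            continue                            # garbage area: skip
        record_id, payload = data.decode().split(":", 1)
        if record_id in seen:
            continue                            # duplicate record: skip
        seen.add(record_id)
        yield payload

good = make_record("r1", "alpha")
dup = good                          # the same record appended twice
bad = (b"corrupted bytes", 0)       # a garbage area in the file
print(list(read_valid([good, dup, bad, make_record("r2", "beta")])))
```

This illustrates the trade the file system makes: weak write-time guarantees in exchange for simpler failure recovery, with consistency checking pushed to the readers.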
An interesting aspect of the Google file system is that a single server with a hot spare controls thousands of file server nodes. This type of control system can be highly reliable because any single server is unlikely to fail. Instead, provisions are made for recovering from failures among the thousands of file servers—a failure among thousands of nodes is far more likely. Synchronization is achieved using lightweight mechanisms such as logging and the use of file leases. Heavyweight synchronization mechanisms such as Paxos are reserved for the file lock mechanism.
MapReduce
MapReduce is a style of distributed data analysis that was popularized by Google for its internal operations (see Dean and Ghemawat, 2004). Hadoop is an open-source version of MapReduce. MapReduce takes its name from a pair of functional programming constructs, map and reduce. A map invocation applies a function to every element of a list, while a reduce invocation computes an aggregate value from a list.
As used in a MapReduce system, the map phase organizes a collection of compute nodes to divide up a data source (e.g., one or more files) and apply the mapped function to every record in the file(s). The result is the value of the function on the record, along with a hash value. In the reduce phase, the map results are reshuffled among the compute nodes, with the hash used to ensure that records with a common key are sent to the same node. The reduce node combines the records it receives into aggregate values.
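The map, shuffle, and reduce steps just described can be illustrated with the classic word-count example. This is a toy single-process simulation; the hash-based routing below stands in for the reshuffle of data across compute nodes in a real MapReduce system.

```python
from collections import defaultdict

def map_phase(lines):
    # The mapped function: emit (word, 1) for every word in every record.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, n_reducers=2):
    # Route each pair to a reducer by hashing its key, so that all pairs
    # with the same key land on the same (simulated) node.
    buckets = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        buckets[hash(key) % n_reducers][key].append(value)
    return buckets

def reduce_phase(buckets):
    # Each reducer combines the values for a key into an aggregate.
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            result[key] = sum(values)
    return result

lines = ["big data big compute", "big cluster"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(sorted(counts.items()))
# [('big', 3), ('cluster', 1), ('compute', 1), ('data', 1)]
```

What the toy version omits is precisely what makes MapReduce valuable in practice: distributing these steps over thousands of nodes while handling stragglers and failures, as the services listed below provide.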
As stated, MapReduce would not seem to provide a powerful programming construct. However, in the context of a large cluster, a MapReduce system provides a couple of critical services:
- The master server that organizes the MapReduce computation hands out portions of the computation to participating nodes and monitors their progress. If a node fails, or is slow to finish, the master will hand the unit of work to another node.
- The master server will ideally have a map of data locations (especially if the Google file system or the HDFS is used), node locations, and the network interconnecting them. The master server can attempt to assign processing close to the data (same server or same rack), distribute work evenly among servers, coordinate among concurrent jobs, and so on.
By providing reliability and basic optimizations, MapReduce (or Hadoop) greatly simplifies the task of writing a large-scale analysis on distributed data for many types of analyses.
The abstraction offered by a single MapReduce job is rather constrained and low-level, relative to the needs of applications that are moving to MapReduce-based platforms. Such applications range from Web data management to genomics to journalism. As a result, there are numerous efforts to layer more flexible and high-level abstractions on top of MapReduce. Examples include machine-learning libraries (e.g., Mahout), structured query languages (e.g., Jaql, Hive, Pig Latin), and workflow managers (e.g., Cascading, Oozie).
For some application scenarios, using a MapReduce job as a building block is not considered a good fit in terms of system performance considerations such as latency and throughput. Hence, several projects are creating variations on MapReduce. One group of projects offers a general directed-acyclic-graph (DAG) processing model (e.g., Dryad, Hyracks, Nephele). Another group caters to applications that require iterative processing of a data set, such as many machine learning algorithms (e.g., HaLoop, Spark). Lastly, there are projects such as Mesos, which aim to separate cluster management and scheduling concerns from the particulars of a given data-processing framework (e.g., MapReduce, general DAGs, iterative processing) and permit multiple such frameworks to coexist on the same cluster.
Cloud Computing
The computational demand of a particular user can be highly variable over time. Efforts to make more efficient use of resources include grid and cloud computing systems. These systems make a collection of resources available to a user and allow the resource allocation to grow or shrink as needed. Cloud computing can be attractive because the overhead of managing a large and complex system can be outsourced to specialists.
Amazon started offering cloud computing services in 2006. With its Elastic Compute Cloud, users can specify an operating system and application disk image to be loaded on virtual servers ranging from low-end to high-end. The Simple Storage Service (S3) provides rentable persistent storage. Large-scale distributed applications can run on the Amazon cloud. For example, Apache Hadoop is designed to use Elastic Compute Cloud servers accessing data stored in S3. The success of Amazon’s cloud service has encouraged the development of other cloud computing offerings. For
example, Microsoft offers the Azure cloud service, providing compute and persistent storage services similar to those provided by Amazon.
Parallel and Distributed Databases
A database is a system that provides facilities for reliably storing data and later retrieving it using a convenient program. For the purposes of this discussion, it is assumed that the database is relational—i.e., that its records consist of a particular set of fields each with a specific data type, and all records in a table have the same set of fields. It is further assumed that the database provides an SQL interface for accessing the data. Most commercial and open-source databases fit this description. These databases might also provide extensions for storing and querying semi-structured data (e.g., Extensible Markup Language, XML) and might support an extended query language; for example, one that supports recursive queries. However, these extensions are not necessary for this discussion.
Most commercial and open-source databases are parallelized in the sense that they can use multiple compute cores to evaluate a query when executed on a multicore server. One method of parallelization is to use multiple threads for performing expensive tasks, such as sorting or joining large data sets. Another method of parallelization takes advantage of the nature of the programs that a query in a language such as SQL will generate. An SQL query is converted into a collection of query operators connected into a rooted DAG (the query graph); the edges of the graph indicate data flow among the query operators. If the query operators operate in a pipelined fashion (continually accepting input and producing output), multiple query operators can execute in parallel, in a manner similar to the inter-operator parallelism exploited by data stream systems. Parallelizing database processing has been an active research topic for several decades.
Very-large-scale parallel database systems are generally spread over a collection of servers using a shared-nothing architecture—that is, there is no cluster file system to provide a shared state. The tables in a shared-nothing database are horizontally partitioned, and the partitions are distributed among the database servers. Each of the database servers can run a parallelized database in the sense of taking advantage of all available cores.
A table can be partitioned among the database servers in many ways. Two common choices are round-robin (new data are spread evenly among the servers) and hash (data are spread among the servers based on a hash of one or more fields of the table). Different tables can be partitioned using different techniques. Critical or frequently accessed tables can be stored two or more times using different partitioning for each copy. The servers in a shared-nothing database cooperate to evaluate a query. Recall that a query is transformed into a rooted DAG of query operators. The root of the DAG
produces the query result, while the leaves access database tables. Subtrees of the query plan that operate on single tables are sent to the database servers, which compute the partial result represented by the subtree executing on the table partition local to the database server. The partial results generally need to be combined to form a result, whether for aggregation (which is similar to the reduce phase of a MapReduce program), or to join the result of the subtree with data from another subtree. This data transfer is represented by an operator commonly called shuffle. A complex query can involve a large query graph with many shuffle operators.
A shared-nothing parallel database can take many steps to optimize query evaluation. Some of the available techniques are as follows:
- Modify the way that one or more tables are partitioned among the database servers. For example, if tables R and S are frequently joined on key k, then one possible optimization is to partition both R and S on k, using the same hash function, to avoid a data shuffle when processing the join.
- If the result of two subtrees is to be joined, shuffle the results of the subtree that produces less data to the matching locations of the data from the other subtree result.
- Pipeline the operators to avoid the need to store very large partial results.
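The first of these optimizations can be illustrated with a toy sketch: tables R and S are both hash-partitioned on the join key with the same hash function, so each (simulated) server joins its local partitions with no cross-server data movement. The table contents, key positions, and server count are illustrative.

```python
N_SERVERS = 3

def partition(table, key_index):
    # Hash-partition rows among the servers on the chosen key column.
    parts = [[] for _ in range(N_SERVERS)]
    for row in table:
        parts[hash(row[key_index]) % N_SERVERS].append(row)
    return parts

R = [(1, "r-a"), (2, "r-b"), (3, "r-c")]   # rows of (k, payload)
S = [(1, "s-x"), (3, "s-y")]

# Co-partition both tables on k with the same hash function.
R_parts, S_parts = partition(R, 0), partition(S, 0)

# Each server joins only its own partitions; no shuffle is needed.
joined = []
for server in range(N_SERVERS):
    s_by_key = {k: v for k, v in S_parts[server]}
    joined += [(k, rv, s_by_key[k]) for k, rv in R_parts[server] if k in s_by_key]

print(sorted(joined))  # [(1, 'r-a', 's-x'), (3, 'r-c', 's-y')]
```

Because matching keys necessarily hash to the same server, the union of the local joins equals the full join; had the tables been partitioned with different functions, a shuffle operator would be required first.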
A database generally collects extensive statistics about the data that it manages and the queries that it processes to guide these and other optimizations. Shared-nothing databases were first developed in the 1980s (the Gamma and Grace research prototypes and the Teradata commercial DBMS), and extensive research has been performed on parallel database optimization. DeWitt and Stonebraker (2008) found that for data analysis tasks for which a relational database is well-suited, a shared-nothing relational database significantly outperforms a MapReduce program implemented using Hadoop. However, they also found that tuning parallel databases is often a difficult task requiring specialized expertise, whereas MapReduce systems are more readily configured to give good performance. Modern shared-nothing parallel databases include Teradata, Netezza, Greenplum, and extensions to Oracle and DB2.
However, a relational DBMS is not suitable for many analysis tasks. One well-known problem is that relational DBMSs are not well structured for managing array data—which are critical for many analyses. While the ability of modern databases to optimize storage layout and query evaluation plans makes array management with a database an attractive idea, the query optimization for array data is difficult, and the relational model is based on sets, not ordered data. Several efforts to incorporate array data
into the relational model have appeared in the research literature, but without lasting effect. The open-source project SciDB is developing a parallel shared-nothing database system designed to support array data. This system supports parallelism by chunking large arrays and distributing them among the database servers.
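The chunking strategy used by such systems can be illustrated with a small sketch: split an array into rectangular chunks indexed by chunk coordinates, then place chunks on servers. This is a simplified model of the idea only; the function names and the round-robin placement policy are assumptions for illustration, not SciDB's actual API:

```python
def chunk_matrix(m, chunk_rows, chunk_cols):
    """Split a matrix (list of lists) into rectangular chunks keyed by
    chunk index, as a distributed array store might before placement."""
    chunks = {}
    for i in range(0, len(m), chunk_rows):
        for j in range(0, len(m[0]), chunk_cols):
            chunks[(i // chunk_rows, j // chunk_cols)] = [
                row[j:j + chunk_cols] for row in m[i:i + chunk_rows]]
    return chunks

def assign_to_servers(chunks, num_servers):
    """Round-robin placement of chunk keys among servers (illustrative)."""
    placement = {s: [] for s in range(num_servers)}
    for n, key in enumerate(sorted(chunks)):
        placement[n % num_servers].append(key)
    return placement

m = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
chunks = chunk_matrix(m, 2, 2)          # four 2x2 chunks
placement = assign_to_servers(chunks, 2)
```

Operations such as matrix multiplication then decompose into per-chunk work plus communication along chunk boundaries.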
A NoSQL database is loosely defined as a data store that provides fewer consistency guarantees than a conventional database and/or stores non-relational data, such as documents or graphs. NoSQL databases attempt to improve scaling by providing only weak or eventual consistency guarantees on the stored data, eliminating much of the complexity and overhead of the strong consistency provided by conventional databases, which is especially marked in a distributed setting. Examples of NoSQL databases include MongoDB (a document store), Neo4j (a graph database), and Bigtable and Cassandra (wide-column stores).
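Eventual consistency can be made concrete with a toy last-writer-wins replica: each replica accepts writes independently, and replicas lazily exchange state until they converge. The `Replica` class and its anti-entropy `merge` are a minimal sketch of one common scheme (timestamped last-writer-wins), not the protocol of any particular NoSQL system:

```python
class Replica:
    """A key-value replica using last-writer-wins timestamps; replicas
    exchange state lazily, giving eventual (not strong) consistency."""
    def __init__(self):
        self.store = {}            # key -> (timestamp, value)

    def put(self, key, value, ts):
        cur = self.store.get(key)
        if cur is None or ts > cur[0]:
            self.store[key] = (ts, value)

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Anti-entropy: adopt any newer versions held by the other replica.
        for key, (ts, value) in other.store.items():
            self.put(key, value, ts)

a, b = Replica(), Replica()
a.put("x", "v1", ts=1)         # write lands on replica a only
stale = b.get("x")             # None: b has not yet seen the write
b.merge(a)                     # lazy state exchange
fresh = b.get("x")             # now "v1": the replicas have converged
```

The window in which `b` returns a stale answer is exactly the weakness (and the scaling opportunity) that the prose describes.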
Parallel Programming Languages and Systems
Developing parallel and/or distributed programs is notoriously difficult, due to the problems of finding resources, distributing work, gathering results, recovering from failures, and understanding and avoiding race conditions. A variety of tools have been developed to reduce the burden of developing parallel and distributed programs, for example, the Message Passing Interface and Remote Method Invocation in Java. However, parallel programming with these intermediate-level tools is still difficult because the programmer is forced to specify many details of how the parallelism is managed. Simple access to parallel programming seems to require languages that are at least partly functional (e.g., MapReduce) or declarative (e.g., SQL).
One method for achieving simple user parallelism is to create languages whose primitives perform expensive operations on very-large data structures. Large matrix operations, such as multiplication or inversion, are expensive but readily parallelizable. Languages such as Matlab, S, S-Plus, and R, for which the basic data structure is a matrix, may therefore be promising aids to parallelism. R is an open-source implementation of the S language (of which S-Plus is a commercial version), and it has attracted the most development effort. Open-source efforts include Multicore-R and R/parallel, which add a parallelized Apply construct to the language. Revolution Analytics produces a commercial version of R with a variety of extensions to support large-scale data analysis, for example, external memory versions of commonly used statistical analyses, and parallelized versions of looping functions such as "foreach" and "apply," as well as multicore implementations of matrix operators. Ricardo interfaces R to Hadoop through Jaql and uses Jaql to run parallelized data analysis queries before loading the results into R for local analysis (Das et al., 2010).
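The parallelized "apply" idea carries over directly to other languages: the programmer supplies a per-row function and the runtime parallelizes the loop. A minimal Python sketch (the `parallel_apply` name is an assumption; a thread pool is used for portability, and CPU-bound pure-Python work would instead want a process pool because of the interpreter lock):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(fn, rows, max_workers=4):
    """A parallelized 'apply': evaluate fn over each row concurrently,
    preserving input order, in the spirit of R's parallel apply constructs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, rows))

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_sums = parallel_apply(sum, matrix)    # [6, 15, 24]
```

The appeal is that the user's code is identical to the serial version except for the name of the looping construct; all management of workers is hidden.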
Some new programming languages are designed to readily support parallel programming. For example, F# is a functional language derived from OCaml, but with simplified syntax. While F# supports an imperative programming style, its nature encourages a functional style.
The clear trend for large-scale data analysis is to make increasing use of multicore parallel and distributed systems. The method for achieving enhanced performance through parallelism will depend on the nature of the data and the application. The largest analyses will be performed in large data centers running specialized software such as Hadoop over HDFS to harness thousands of cores to process data distributed throughout the cluster. Other large and centrally maintained facilities might run streaming analysis systems that reduce massive quantities of real-time data into a more manageable high-level data product.
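Such a streaming reduction keeps only small running summaries rather than the raw stream. A toy sketch of the idea (the `StreamingAggregator` class and the sensor/value record shape are illustrative assumptions):

```python
from collections import defaultdict

class StreamingAggregator:
    """Reduce a stream of (sensor, value) readings to running per-sensor
    statistics, turning massive raw input into a small data product."""
    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def observe(self, sensor, value):
        # Constant work and constant state per reading, regardless of
        # how long the stream runs.
        self.count[sensor] += 1
        self.total[sensor] += value

    def summary(self):
        """The high-level data product: mean value per sensor."""
        return {s: self.total[s] / self.count[s] for s in self.count}

agg = StreamingAggregator()
for sensor, value in [("s1", 2.0), ("s1", 4.0), ("s2", 1.0)]:
    agg.observe(sensor, value)
means = agg.summary()          # {"s1": 3.0, "s2": 1.0}
```

Production streaming systems add windowing, fault tolerance, and distribution, but the shape of the computation is the same: state that is tiny compared with the input.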
However, these data sets are useful only if analysts can explore them. One option is a grid-style environment in which users submit batch jobs to a large cluster and sometime later retrieve a result. While this result might be highly reduced, e.g., a plot in a graphical user interface, it might also be a processed data set delivered to the analysts' workstation (or cluster). Even inexpensive personal computers currently provide four high-performance cores and access to a powerful GPU. The local workstation will provide parallelized tools for exploring and analyzing the local data set.
While large server farms provide immense computing power, managing them is expensive and requires specialized technical expertise. Therefore the trend of outsourcing large computing tasks to cloud services such as Amazon's EC2 is likely to continue. One roadblock to using cloud services for massive data analysis is the problem of transferring the large data sets. Maintaining a high-capacity and wide-scale communications network is very expensive and only marginally profitable.
Software systems tend to develop greater power and performance until the complexity of the system exceeds human (or organizational) ability to manage it. Parallel databases allow naive users to compose and execute complex programs over petabytes of data. Similarly, MapReduce removes enough of the complexity of writing very-large-scale distributed programs that a large user group can access the power of a large cluster. Hadoop overlays further reduce the complexity of large-scale data analysis.
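The programming model that MapReduce exposes to its users is small enough to model in a few lines: map each input to key-value pairs, group by key (the framework's shuffle), and reduce each group. The `map_reduce` function below is a single-process sketch of that model, not Hadoop's API:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(inputs, mapper, reducer):
    """A single-process model of MapReduce: map, group by key (the
    framework's shuffle), then reduce each key's list of values."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(x) for x in inputs):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example.
counts = map_reduce(
    ["big data", "big cluster"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: sum(ones))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

Everything else the real framework does, such as distributing mappers over thousands of machines, moving grouped data across the network, and re-running failed tasks, is hidden behind this two-function interface, which is precisely why a large user group can access the power of a large cluster.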
However, using and maintaining large parallel and distributed systems remains difficult for the following reasons:
- While parallel databases are readily queried by casual users, they are very hard to tune, and data loading remains a bottleneck.
- Although modern systems and languages have made parallel programming much easier than before, writing parallel programs remains significantly more difficult than writing serial ones.
- While modern systems and languages abstract away many of the difficulties of parallel and distributed programming, debugging remains difficult.
- Architecting, building, and maintaining a large cluster requires specialized expertise.
- Understanding the performance of parallel and distributed programs and systems can be extremely difficult. Small changes to program phasing, data layout, and system configuration can have a very large effect on performance. Very large systems are typically shared by a community of users, and the interaction of multiple concurrently executing parallel programs compounds the problem of understanding performance.
Achieving greater use of the power of parallel and distributed systems requires further innovations that simplify their use and maintenance. Many very large data systems need to store very large amounts of historical data but also provide real-time or near-real-time alerting and analytics. However, systems designed for real-time response tend to have a very different architecture than historical, batch-oriented large-data systems. A typical response to these needs is to build two separate and loosely coupled systems. For example, a streaming system might provide real-time alerting, while historical analyses are made on a batch-oriented system. Transparently bridging real-time systems with large-data systems remains a research issue.
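One common (if still unsatisfying) coupling of the two systems answers queries by merging a periodically rebuilt batch view with a small delta maintained by the real-time path. The `BridgedStore` class below is a toy sketch of that pattern under the assumption that the query is a simple count per key; it is not a solution to the research issue the text identifies:

```python
class BridgedStore:
    """Serve queries by merging a periodically rebuilt batch view with a
    small real-time delta covering events since the last batch run."""
    def __init__(self):
        self.batch_view = {}       # rebuilt by the batch-oriented system
        self.delta = {}            # maintained by the real-time system

    def ingest(self, key):
        """Real-time path: record an event immediately."""
        self.delta[key] = self.delta.get(key, 0) + 1

    def rebuild_batch(self, history):
        """Batch path: recompute the view over all historical data,
        then reset the delta that the new view now subsumes."""
        self.batch_view = {}
        for key in history:
            self.batch_view[key] = self.batch_view.get(key, 0) + 1
        self.delta = {}

    def query(self, key):
        """Answer = batch view + recent delta, hiding the split."""
        return self.batch_view.get(key, 0) + self.delta.get(key, 0)

store = BridgedStore()
store.rebuild_batch(["a", "a", "b"])   # e.g., a nightly batch job
store.ingest("a")                      # a real-time event since that job
```

The fragility of this pattern, such as keeping the two computations semantically identical and handling events that arrive while the batch view is rebuilding, is one reason transparent bridging remains open.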
Similarly, data integration and data-quality assurance are difficult problems, which, in spite of tools such as Clio (Haas et al., 2005) or IBM’s InfoSphere Information Analyzer, are generally bespoke, labor-intensive tasks. A significant direction of future research is the development of simple but powerful data-integration and data-quality tools that use machine learning techniques to automate these tasks.
Asanovic, K., R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick. 2006. The Landscape of Parallel Computing Research: A View from Berkeley. University of California, Berkeley, Technical Report No. UCB/EECS-2006-183. December 18. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.
Balakrishnan, M., D. Malkhi, V. Prabhakaran, T. Wobber, M. Wei, and J.D. Davis. 2012. CORFU: A shared log design for flash clusters. Pp. 1-14 in 9th USENIX Symposium on Networked Systems Design and Implementation. USENIX Association, Berkeley, Calif., April.
Das, S., Y. Sismanis, S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson. 2010. Ricardo: Integrating R and Hadoop. Pp. 987-998 in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, New York, N.Y.
Dean, J., and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. P. 10 in Proceedings of the 6th Symposium on Operating Systems Design and Implementation. USENIX Association, Berkeley, Calif.
DeWitt, D.J., and M. Stonebraker. 2008. “MapReduce: A Major Step Backwards.” Online discussion. Available at http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html.
Ghemawat, S., H. Gobioff, and S.-T. Leung. 2003. The Google file system. Pp. 29-43 in Proceedings of the Nineteenth ACM Symposium on Operating System Principles. Association for Computing Machinery, New York, N.Y.
Haas, L.M., M.A. Hernández, L. Popa, M. Roth, and H. Ho. 2005. Clio grows up: From research prototype to industrial tool. Pp. 805-810 in Proceedings of the 2005 ACM International Conference on Management of Data (SIGMOD). Association for Computing Machinery, New York, N.Y.