A Quintillion Live Pixels: The Challenge of Continuously Interpreting and Organizing the World’s Visual Information
Carnegie Mellon University
I estimate that by 2030 cameras across the world will have an aggregate sensing capacity exceeding 1 quintillion (10^18) pixels. These cameras—embedded in vehicles, worn on the body, and positioned throughout public and private everyday environments—will generate a worldwide visual data stream that is over eight orders of magnitude greater than YouTube’s current daily rate of video ingest. The vast majority of these images will never be observed by a human eye—doing so would require every human on the planet to spend their life watching the equivalent of 10 high-definition video feeds! Instead, future computer systems will be tasked to automatically observe, understand, and extract value from this dense sampling of life’s events.
Some applications of this emerging capability trigger clear privacy and oversight concerns, and will rightfully be subject to rigorous public debate. Many others, however, clearly have the potential for critical impact on central human challenges of the coming decades. Sophisticated image analysis, deployed at scale, will play a role in realizing efficient autonomous transportation, optimizing the daily operation of future megacities, enabling fine-scale environmental monitoring, and advancing how humans access information and interact with information technology.
The ability to develop new image understanding techniques (see Grauman in this volume), architect large-scale systems to efficiently execute these computations (the subject of my research), and deploy these systems transparently and responsibly to improve worldwide quality of life is a key engineering challenge of the coming decade.
To understand the potential impact of these quintillion pixels, let’s examine the role of image understanding in three contexts: via cameras on vehicles, on the human body, and in urban environments.
CONTINUOUS CAPTURE ON VEHICLES
It is estimated that there will be more than 2 billion cars in the world by 2030 (Sperling and Gordon 2010). Regardless of the extent to which autonomous capability is present in these vehicles, the vast majority of them will feature high-resolution image sensing. (High-resolution cameras, augmented with high-performance image processing systems, will be a low-cost, higher-information-content alternative to more expensive active sensing technologies such as Lidar.)
Image analysis systems can localize vehicles in their expected surroundings and interpret dynamic environments to predict and detect obstacles as they arise. They are thus critical to the development of vehicles that drive more safely and use roads more efficiently than human drivers. Researchers in academia and industry are racing to develop efficient image processing systems that can execute image understanding tasks simultaneously on multiple high-resolution video feeds and with low latency. Hundreds of tera-ops of processing capability—available only in top supercomputers just a decade ago—will soon be commonplace in vehicles, and computer vision algorithms are being rethought to meet the needs of these systems. High-performance analysis of vehicular video feeds will enable significant advances in transportation efficiency.
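To make that scale concrete, consider a rough back-of-envelope calculation. This is a minimal sketch, not a measurement: the analysis resolution, frame rate, and per-pixel inference cost below are all illustrative assumptions.

    # Back-of-envelope: compute required for in-vehicle image understanding.
    # All constants are illustrative assumptions, not measured figures.
    NUM_FEEDS = 8                 # camera views of the road per vehicle
    FRAME_RATE_HZ = 30            # frames analyzed per second, per feed
    PIXELS_PER_FRAME = 8_000_000  # assumed high-resolution (~8 MP) frames
    OPS_PER_PIXEL = 100_000       # assumed deep-network inference cost

    total_ops = NUM_FEEDS * FRAME_RATE_HZ * PIXELS_PER_FRAME * OPS_PER_PIXEL
    print(f"{total_ops / 1e12:.0f} tera-ops per second")  # ~192 tera-ops

Under these assumptions, analyzing every frame of every feed lands squarely in the hundreds-of-tera-ops regime described above.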
CONTINUOUS CAPTURE ON HUMANS
Although on-body cameras, such as Google Glass, have thus far failed to realize widespread social acceptance, there are compelling reasons for cameras to capture the world from the perspective of a human (“egocentric” video). For example, enabling mobile augmented reality (AR) requires systems to know precisely where a headset wearer is and what they are looking at. (Microsoft’s HoloLens headset is one example of promising recent advances in practical AR technology.) Achieving commodity, pervasive AR will demand continuous, low-energy egocentric video capture and analysis.
More ambitiously, for computers to take on a more expansive role in augmenting human capabilities (e.g., the ever-present life assistant), they must “understand” much more about individuals than their present location, the contents of their inbox, and their daily calendar. In this role, computers will be tasked to observe and interpret human social interactions in order to know what advice to give, and when and how to interject information.
For example, during a recent trip to Korea I found myself wishing to experience a meal at a local night market. But my inability to speak Korean and my unfamiliarity with the market’s social customs made for a challenging experience in the bustling atmosphere. Imagine the utility of a system that, given a view of the world similar to my own, could not only identify the foods in front of me but also suggest how to assimilate into the crowd in front of a vendor (be assertive? attempt to form a line?), tell me whether it was acceptable to sit in a rare open seat near a family occupying half a table (yes, it would be okay to join them), and detect and inform me of socially awkward actions I might be taking as a visitor (you are annoying the local patrons because you are violating this social norm!). These tasks illustrate how mobile computing devices will be tasked to constantly observe and interpret complex environments and complex human social interactions. Cameras located on the body, seeing the world continuously as the wearer does, are an attractive sensing modality for these tasks.
CONTINUOUS CAPTURE OF URBAN ENVIRONMENTS
It is clear that cameras will be increasingly pervasive in urban environments. An estimated 280 million security cameras exist in the world today, and cities such as London, Beijing, and Chicago feature thousands of cameras in public spaces (IHS 2015). While today’s deployments are largely motivated by security concerns, the ability to sense and understand the flow of urban life in both public and private spaces creates unique opportunities to address modern urban challenges: optimizing urban energy consumption, monitoring infrastructure and environmental health, and informing urban planning.
PUTTING IT ALL TOGETHER: ONE QUINTILLION PIXELS
In 2030 there will be 8.5 billion people in the world (UN 2015), 2 billion cars, and, extrapolating from recent trends (IHS 2015), at least 1.1 billion security/web cameras. Conservatively assigning one camera to each human and eight views of the road to each car, and assuming 8K stereo video streams per source (2 × 33 megapixels), there will be on the order of 1 quintillion pixels across the world continuously sensing visual information.
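The arithmetic behind this estimate is straightforward to reproduce. The minimal sketch below combines the figures just stated; the exact breakdown is one plausible reading of them.

    # Reconstructing the quintillion-pixel estimate from the stated figures.
    PEOPLE = 8.5e9          # people in 2030 (UN 2015), one camera each
    CARS = 2e9              # cars in 2030 (Sperling and Gordon 2010)
    VIEWS_PER_CAR = 8       # eight views of the road per car
    SECURITY_CAMS = 1.1e9   # security/web cameras (extrapolated from IHS 2015)

    cameras = PEOPLE + CARS * VIEWS_PER_CAR + SECURITY_CAMS  # ~25.6 billion
    PIXELS_PER_STREAM = 2 * 33e6   # 8K stereo: two ~33-megapixel views

    total_pixels = cameras * PIXELS_PER_STREAM
    print(f"{total_pixels:.2e} live pixels")  # ~1.69e+18, on the order of 10^18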
The engineering challenge of ingesting and interpreting this information stream is immense. For example, using today’s state-of-the-art machine learning methods to detect objects in this worldwide video stream would consume nearly 10^13 watts of computing power (Nvidia 2015), even if executed on today’s most efficient parallel processors. This is approximately the same amount of power used by humans around the world today (IEA 2015). Clearly, advances both in image analysis algorithms and the design of energy-efficient visual data processing platforms are needed to realize ubiquitous visual sensing.
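That power figure can be sanity-checked with one more back-of-envelope step. In the sketch below, the per-pixel inference energy is an assumption, chosen to be loosely consistent with the efficiency of the GPUs profiled in the Nvidia (2015) report rather than taken from it directly.

    # Sanity check on the ~10^13 W figure. JOULES_PER_PIXEL is an assumption.
    TOTAL_PIXELS = 1.7e18      # worldwide live pixels (see estimate above)
    FRAME_RATE_HZ = 30         # assumed analysis rate for every stream
    JOULES_PER_PIXEL = 2e-7    # ~0.2 microjoules of inference energy per pixel

    pixels_per_second = TOTAL_PIXELS * FRAME_RATE_HZ
    power_watts = pixels_per_second * JOULES_PER_PIXEL
    print(f"{power_watts:.1e} W")  # ~1.0e+13 W, comparable to worldwide power use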
Addressing this challenge will be a major focus of research spanning multiple areas of engineering and computer science in the coming years—machine vision, machine learning, artificial intelligence, compilation techniques, and computer architecture. Success in developing these fundamental computing technologies will provide new, valuable tools for tackling some of the world’s most important future challenges.
REFERENCES

Grauman K. 2017. First-person computational vision. In: Frontiers of Engineering: Reports on Leading-Edge Engineering from the 2016 Symposium. Washington: National Academies Press.

IEA [International Energy Agency]. 2015. 2015 Key World Energy Statistics. Paris: IEA.

IHS. 2015. Video Surveillance Camera Installed Base Report. Englewood, CO: IHS Technology.

Nvidia. 2015. GPU-Based Deep Learning Inference: A Performance and Power Analysis. Santa Clara, CA: Nvidia Corporation.

Sperling D, Gordon D. 2010. Two Billion Cars: Driving Toward Sustainability. New York: Oxford University Press.

UN [United Nations]. 2015. World Population Prospects: The 2015 Revision. New York: United Nations.