Large-Scale Activity-Recognition Systems
Intel Research Laboratory
Building computing systems that can observe, understand, and act on day-to-day physical human activity has long been a goal of computing research. Such systems could have profound conceptual and practical implications. Because the ability to reason and act based on activity is a central aspect of human intelligence, from a conceptual point of view, such a system could improve computational models of intelligence. More tangibly, machines that can reason about human activity could be useful in aspects of life that are currently considered outside the domain of machines.
Monitoring human activity is a basic aspect of reasoning about activity. In fact, monitoring is something we all do—parents monitor children, adults monitor elderly parents, managers monitor teams, nurses monitor patients, and trainers monitor trainees; people following medication regimens, diets, recipes, or directions monitor themselves.
Besides being ubiquitous, however, monitoring can also be tedious and expensive. In some situations, such as caregiver-caretaker and manager-worker relationships, only dedicated, trained human monitors can make detailed observations of behavior. However, such extensive observation causes fatigue in observers and resentment in those being observed. The constant involvement of humans also makes monitoring expensive.
Tasks that are ubiquitous, tedious, and expensive are usually perfect candidates for automation. Machines do not mind doing tedious work, and expensive problems motivate corporations to build machines. In fact, given the demographics of our society, systems that notify family members automatically when elderly relatives trigger simple alarms, such as falling, not turning off the stove, or not turning off hot water, are now commercially available. However, compared to a live-in family member who can monitor an elder’s competence in thousands of day-to-day activities, these systems barely scratch the surface. In this paper, I describe a concrete application for a monitoring system with broad activity-recognition capabilities, identify a crucial missing ingredient in existing activity “recognizers,” and describe how a new class of sensors, combined with emerging work in statistical reasoning, promises to advance the state of the art by providing this ingredient.
THE CAREGIVER’S ASSISTANT
Caring for the elderly, either as a professional caregiver or as a family member, is a common burden in most societies. Gerontologists have developed a detailed list of activities, called the activities of daily living (ADLs), and metrics for scoring performance of crucial day-to-day tasks, such as cooking, dressing, toileting, and socializing, which are central to a person’s well-being. An elder’s ADL score is accepted as an indicator of his or her cognitive health.
Professional caregivers in the United States are often required to fill in ADL forms each time they visit their patients. Unfortunately, although the data they collect are used as a basis for making resourcing decisions, such as Medicaid payments, the data are often inaccurate because (1) they are often based on interviews with elders who may have strong motives for misrepresenting the facts and (2) the data-collection window is narrow relative to the period being evaluated. Given increasing constraints on caregivers’ time, purely manual data collection seems unsustainable in the long run.
The Caregiver’s Assistant system is intended to fill out large parts of the ADL form automatically based on data collected from the elder’s home on a 24/7 basis. The system would not just improve the quality of data collected, but (because it provides constant monitoring) might also be able to provide proactive intervention and other assistance. Figure 1 shows a prototype form of the Caregiver’s Assistant. Actual forms include activities in 23 categories, such as “housework” and “hygiene,” which instantiate to tens of thousands of activities, such as “cleaning a bathtub” and “brushing teeth.”
Thus, an activity-recognition system that could track thousands of activities in non-laboratory conditions would remove a substantial burden from human monitors. Professional caregivers could, at any time, be provided with a version of this form with potentially troublesome areas highlighted. If a nurse were given this form before a visit, for instance, he or she could make better preparations for the visit and could focus on the most important issues during the visit. A study of roughly one hundred professional caregivers around the country has shown that such a system would be useful, at least for caregivers.
DISCRIMINATING AMONG ACTIVITIES
The process of recognizing mundane physical activities can be understood as mapping from raw data gathered by sensors to a label denoting an activity. Figure 2 shows how traditional mapping systems are structured. Feature selection modules typically work on high-dimensional, high-frequency data coming directly from sensors (such as cameras, microphones, and accelerometers) to identify relatively small numbers of semantically higher level features, such as objects in images, phonemes in audio streams, and motions in accelerometer data. Symbolic inference modules reason about the relationship between these features and activities in a variety of ways. The reasoning may include identifying ongoing activities, detecting anomalies in the execution of activities, and performing actions to help achieve the goal of the activities.
Both feature selection and inference techniques have been investigated extensively, and depending on the feature, researchers can draw on large bodies of work. In the computer vision community alone, extensive work has been done on objects, faces, automobiles, gestures, and edges and motion flows, each of which has a dedicated sub-community of researchers. Thus, once features for an activity-recognition system have been selected, a very large number of model representations and inference techniques are available. These techniques differ in several ways, such as whether they support statistical, higher order, or temporal reasoning; the degree to which they learn and the amount of human intervention they require to learn; and the efficiency with which they process various kinds of features, especially higher dimensional features. In Figure 2, the variety of feature selections and inference algorithms is indicated by stacks of boxes.
Despite the profusion of options, no activity inferencing system capable of recognizing large numbers of day-to-day activities in natural environments has been developed. A key underlying problem is that no existing combination of sensors and feature selectors has been shown to robustly detect the features necessary to distinguish among thousands of activities. For instance, objects used during activities have long been thought to be crucial discriminators. However, existing object-recognition and tracking systems tend not to work very well when applied to a large variety of objects in unstructured environments (Sanders et al., 2002). Activity-recognition systems based on tracking objects, therefore, tend to be customized for particular environments and objects, which limits their utility as general-purpose, day-to-day activity recognizers. Because producing each customized detector is itself a research task, it is not surprising that the goal of general-purpose recognition has not been reached.
A new class of small, wireless sensors seems likely to provide a practical means of detecting objects used in many day-to-day activities (Philipose et al., 2004; Tapia et al., 2004). Recent work has shown that, given a stream of detected objects, even simple symbolic inference techniques are sufficient for tracking the progress of these activities.
DETECTING OBJECT USE WITH RADIO FREQUENCY IDENTIFICATION TAG SENSORS
A passive radio frequency identification (RFID) tag (Figure 3a) is a postage-stamp-sized, wireless, battery-free transponder that, when interrogated (via radio) by an ambient reader, returns a unique identifier (Finkenzeller, 2003). Each tag consists of an antenna, some protocol logic, and optional nonvolatile memory. RFID tags use the energy of the interrogating signal to return a 64-bit to 128-bit identifier unique to each tag, and when applicable, data stored in on-tag memory. Short-range tags, which are inductively coupled, have a range of 2 to 30 cm; long-range backscatter-based tags have a range of 1 to 10 m. Tags are available off the shelf for less than 50 cents each. Short-range readers cost a few hundred dollars; long-range readers cost a few thousand dollars. If current trends continue, there will be a steep drop in the price of both tags and readers in the next few years.
If an RFID tag is attached to an object (Figure 3b) and the tag is detected in the vicinity of a reader, we can infer that the attached object is also present. Given their object-tracking abilities, RFID-based systems are currently being seriously considered for commercial applications, such as supply-chain management and asset tracking. Existing uses include livestock tracking, theft protection in the retail sector, and facilities management. The promise of a viable RFID system for tracking the presence of large numbers of objects suggests that it might be the basis of a system for tracking objects used by people whose activities we wish to monitor. Because a sensor can be attached to each object, we have, in principle at least, an “ultra-dense” deployment of sensors that could allow each tagged object to “report” when it is in use.
However, neither short-range nor long-range RFID systems, as conventionally designed, are quite up to the task of detecting object use in a way that would be useful for tracking activity. Short-range RFID readers are typically bulky hand-held units (similar to bar-code readers) that must be intentionally “swiped” on tags. Clearly, it is not practical to expect a person whose activities are being tracked (whether an elder or a medical student) to carry a scanner and swipe tagged objects in the middle of day-to-day tasks.
Long-range tags, however, do not require the explicit cooperation of those being monitored. Readers in the corner of a room can detect tags anywhere in that room. Unfortunately, because a conventional RFID tag simply reports the presence of tagged objects in the reader’s field, and not their use, long-range tags cannot tell us when objects are being used either. Long-range tags simply list all tagged objects in the room they are monitoring.
However, each of these modalities can be re-engineered to detect object use unobtrusively. Figure 4 shows how the short-range RFID reader can be adapted to become an unobtrusive sensor of object use (Figure 4a). Essentially, the RFID reader is a radio with built-in processor, nonvolatile memory, and a power supply integrated into a single bracelet called the iBracelet (Fishkin et al., forthcoming). The antenna of the RFID reader is built into the rim of the bracelet. When turned on, the bracelet scans for tags at 1 Hz at a range of 20–30 cm. Any object, such as the water pitcher in Figure 4b, that has a tag within 10 to 15 cm of its grasping surface, can therefore be identified as having been touched. The data can either be stored on board (for later offloading through a data port) or immediately radioed off board. The bracelet can currently read for 30 hours between charges when storing data locally, and roughly 10 hours when transmitting data.
Careful placement of tags on objects can reduce false negative rates (i.e., tags being missed). However, given the range of the bracelet, “accidental” swipes of objects are unavoidable. Therefore, the statistical framework that processes the data must be able to cope with these false “hits.” Early studies indicate that an iBracelet used with inexpensive inductively coupled tags is a practical means of detecting object touch, and therefore object use.
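To make the data flow concrete, the following is a minimal sketch of how a 1 Hz stream of tag reads might be grouped into object-touch episodes. It is not the iBracelet software: the tag identifiers, the TAG_TO_OBJECT table, and the max_gap and min_reads parameters are all hypothetical, and dropping single-read episodes is only a crude stand-in for the statistical handling of false hits described above.

```python
from dataclasses import dataclass

# Hypothetical mapping from tag identifiers to the objects they are attached to.
TAG_TO_OBJECT = {"tag-01": "kettle", "tag-02": "faucet", "tag-03": "pitcher"}


@dataclass
class Touch:
    obj: str
    start: int  # second of the first read in the episode
    end: int    # second of the last read in the episode


def touch_episodes(reads, max_gap=2, min_reads=2):
    """Group (second, tag_id) reads into per-object touch episodes.

    Reads of the same object at most max_gap seconds apart are merged.
    Episodes with fewer than min_reads reads are dropped as likely
    accidental swipes (an assumed heuristic, not the source's method).
    """
    open_eps = {}   # object -> [start, end, read_count] of its current episode
    finished = []   # (object, [start, end, read_count]) of closed episodes
    for second, tag in sorted(reads):
        obj = TAG_TO_OBJECT.get(tag)
        if obj is None:
            continue  # unknown tag: ignore
        ep = open_eps.get(obj)
        if ep is not None and second - ep[1] <= max_gap:
            ep[1] = second
            ep[2] += 1
        else:
            if ep is not None:
                finished.append((obj, ep))
            open_eps[obj] = [second, second, 1]
    finished.extend(open_eps.items())
    episodes = [Touch(o, s, e) for o, (s, e, n) in finished if n >= min_reads]
    return sorted(episodes, key=lambda ep: ep.start)


reads = [(0, "tag-01"), (1, "tag-01"), (2, "tag-01"),
         (5, "tag-02"),                      # a single read: treated as a swipe
         (10, "tag-01"), (11, "tag-01"), (12, "unknown")]
episodes = touch_episodes(reads)
```

In this example the kettle yields two episodes and the single faucet read is discarded; a deployed system would instead pass such ambiguous reads to the statistical model and let it weigh them.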
Some people may consider wearing a bracelet an unacceptable requirement, however. In these cases, wireless identification and sensing platforms (WISPs) may be a useful way of detecting object use (Philipose et al., 2005). WISPs, essentially long-range RFID tags with integrated sensors, use incident energy from distant readers not only to return a unique identifier, but also to power the onboard sensor and communicate the current value of the sensor to the reader. For activity-inferencing applications, so-called α-WISPs, which have integrated accelerometers and are about the size of a large Band-Aid™, are attached to objects being tracked (Figure 5). When a tagged object is used, more often than not the accelerometer is triggered and the ambient reader notified.
A single room, which may contain hundreds of tagged objects (most of them inactive at any given time), can be monitored by a single RFID reader. A complication with WISPs is that the explicit correspondence between the person using the object and the object being used is lost. Thus, higher-level inference software may be necessary to track the correspondence implicitly.
INFERRING ACTIVITIES
Given the sequence of objects detected by RFID-based sensors, the job of the inference system is to infer the type of activity. The inference system relies on a model that translates from observations (in this case, the objects seen) to the activity label. Recent work has shown that even very simple statistical models of activities are sufficient to distinguish between dozens of activities performed in a real home (Philipose et al., 2004).
Figure 6 shows a model for making tea, in which each activity is represented as a linear sequence of steps. Each step has a specified average duration, a set of objects likely to be seen in that step, and the probability that one of these objects will be seen in an observation window. In the figure, the first step (corresponding to boiling tea) takes five minutes on average; in each one-second window, there is a 40, 20, and 30 percent chance respectively of a kettle, stove, or faucet being used. Experiments in a real home with 14 subjects, each performing a randomly selected subset of 66 different activities selected from ADL forms, and using activity models constructed by hand to classify the resulting data automatically, have yielded higher than 70 percent (and often close to 90 percent) accuracy in activity detection.
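A model of this kind can be written down and scored in a few lines. The sketch below is illustrative only: the first “making tea” step uses the numbers quoted above, while the second step, the “brushing teeth” model, and the probability floor EPS are hypothetical. The scoring routine is a standard forward algorithm over a left-to-right chain of steps, with the chance of leaving a step in any one-second window assumed to be one over its mean duration; the source does not say that this is the inference procedure actually used.

```python
import math

# Each step is (mean duration in seconds, {object: probability that the object
# is seen in a one-second window}). Only the first "making tea" step is taken
# from the text; all other entries are hypothetical.
MODELS = {
    "making tea": [
        (300, {"kettle": 0.4, "stove": 0.2, "faucet": 0.3}),
        (120, {"teapot": 0.5, "tea bag": 0.3}),
    ],
    "brushing teeth": [
        (120, {"toothbrush": 0.6, "toothpaste": 0.2, "faucet": 0.1}),
    ],
}

EPS = 1e-3  # assumed floor for objects that a step does not list


def emission(objects, seen):
    """Probability of one window's observation (an object name or None)."""
    if seen is None:
        return max(1.0 - sum(objects.values()), EPS)
    return objects.get(seen, EPS)


def log_likelihood(steps, windows):
    """Forward algorithm over a left-to-right chain of steps."""
    n = len(steps)
    stay = [1.0 - 1.0 / duration for duration, _ in steps]
    belief = [1.0] + [0.0] * (n - 1)  # the activity starts in its first step
    total = 0.0
    for i, seen in enumerate(windows):
        if i > 0:
            moved = [belief[j] * stay[j] for j in range(n)]
            for j in range(1, n):
                moved[j] += belief[j - 1] * (1.0 - stay[j - 1])
            belief = moved
        weighted = [b * emission(objs, seen)
                    for b, (_, objs) in zip(belief, steps)]
        scale = sum(weighted)
        total += math.log(scale)
        belief = [w / scale for w in weighted]
    return total


def classify(windows):
    """Return the activity whose model best explains the observations."""
    return max(MODELS, key=lambda name: log_likelihood(MODELS[name], windows))


label = classify(["kettle", None, "faucet", "stove", "kettle"])
```

Because every unlisted object still receives probability EPS, a single accidental touch lowers a model's score without ruling it out.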
Although the models are simple, it is still impractical to model tens of thousands of activities by hand. However, because the features to be recognized are English words that represent objects and the label to be mapped to is an English phrase (such as “making tea”), the process of building a model is essentially translating probabilistically from English phrases to words. Recent work based on this observation has succeeded in extracting these translations fully automatically, using word co-occurrence statistics from text corpora, such as the Web (Wyatt et al., 2005). If one million Web pages mention “making tea” and 600,000 of them mention “faucet,” these systems accept 60 percent as the rough probability that a faucet is used when making tea. These crude “common-sense” models can be used as a basis for building customized models for each person by applying machine-learning techniques to data generated by that person. Experiments on the data set just described have shown that these completely automatically learned models can recognize activities correctly roughly 50 percent of the time. Analyses of these corpus-based techniques have also provided indirect evidence that object-based models should be sufficient to discriminate between thousands of activities.
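The co-occurrence estimate in the example above is simply a ratio of page counts: 600,000 / 1,000,000 = 0.6. The following sketch applies the same ratio to a toy in-memory corpus. The PAGES list and the object_probabilities function are hypothetical illustrations; a real system would query a Web-scale index and would need more careful text matching than the substring test used here.

```python
# A toy stand-in for a text corpus; every page here is made up.
PAGES = [
    "Making tea: fill the kettle at the faucet.",
    "When making tea, rinse the pot under the faucet first.",
    "Making tea needs a kettle and a stove.",
    "Tips for making tea with a faucet filter.",
    "Making tea in a microwave.",
    "Brushing teeth at the faucet.",
]


def object_probabilities(pages, activity, objects):
    """Estimate P(object is used | activity) as the fraction of pages that
    mention the activity phrase and also mention the object word."""
    matching = [p.lower() for p in pages if activity in p.lower()]
    if not matching:
        return {}
    return {obj: sum(obj in page for page in matching) / len(matching)
            for obj in objects}


probs = object_probabilities(PAGES, "making tea", ["faucet", "kettle", "stove"])
```

Here three of the five pages that mention “making tea” also mention “faucet,” so the estimate for the faucet is 0.6, mirroring the ratio in the text.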
CONCLUSIONS
Monitoring day-to-day physical activity is a tedious and expensive task now performed by human monitors. Automated monitoring has the potential of improving the lives of many people, both monitors and those being monitored. Traditional approaches to activity recognition have not been successful at monitoring large numbers of day-to-day activities in unstructured environments, partly because they could not reliably identify sufficiently discriminative high-level features. A new family of sensors, based on RFID, is able to identify most of the objects used in activities simply and accurately, and even simple statistical models can classify large numbers of activities with reasonable accuracy. In addition, these models are simple enough that they can be extracted automatically from massive text corpora, such as the Web, and can be customized using observed data.
ACKNOWLEDGMENTS
This paper describes work done by the author jointly with the SHARP group at Intel Research Seattle and with researchers at the University of Washington. Specifically, work on the iBracelet was done with Adam Rea and Ken Fishkin. The work on WISPs was done with Joshua Smith and the WISP team. The work on mining models was done with Mike Perkowitz, Danny Wyatt, and Tanzeem Choudhury. Inference techniques were developed jointly with Dieter Fox, Henry Kautz, and Don Patterson.
REFERENCES
Finkenzeller, K. 2003. RFID Handbook, 2nd ed. New York: John Wiley & Sons.
Fishkin, K., M. Philipose, and A. Rea. Forthcoming. Hands-On RFID: Wireless Wearables for Detecting Use of Objects. Proceedings of the International Symposium on Wearable Computing, Osaka, Japan, October 18–20, 2005. Washington, D.C.: IEEE Computer Society.
Philipose, M., J. Smith, B. Jiang, A. Mamishev, S. Roy, and K. Sundara-Rajan. 2005. Battery-free wireless identification and sensing. Pervasive Computing 4(1): 37–45.
Philipose, M., K.P. Fishkin, M. Perkowitz, D.J. Patterson, D. Fox, H. Kautz, and D. Hähnel. 2004. Inferring activities from interactions with objects. Pervasive Computing (October-December):50–57. Available online at: http://seattleweb.intel-research.net/people/matthai/pubs/pervasive_sharp_04.pdf.
Sanders, B.C.S., R.C. Nelson, and R. Sukthankar. 2002. The OD Theory of TOD: The Use and Limits of Temporal Information for Object Discovery. Pp. 777–784 in Proceedings of the Eighteenth National Conference on Artificial Intelligence. Menlo Park, Calif.: AAAI Press.
Tapia, E.M., S.S. Intille, and K. Larson. 2004. Activity Recognition in the Home Setting Using Simple and Ubiquitous Sensors. Pp. 158–175 in Proceedings of PERVASIVE 2004, vol. LNCS 3001, edited by A. Ferscha and F. Mattern. Berlin/Heidelberg: Springer-Verlag. Available online at: http://courses.media.mit.edu/2004fall/mas622j/04.projects/home/TapiaIntilleLarson04.pdf.
Wyatt, D., M. Philipose, and T. Choudhury. 2005. Unsupervised Activity Recognition Using Automatically Mined Common Sense. Pp. 21–27 in Proceedings of the Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference. Menlo Park, Calif.: AAAI Press.