The fourth workshop session focused on multi-object tracking, covering topics such as extracting species-specific characteristics, minimizing double counting, and species-specific parameterization. Several of the presentations addressed domain areas distinct from traditional areas of fisheries research—including image analysis of bats, human crowds, and bees—to draw similarities between fields and examine areas in which capabilities from other fields can be leveraged for fisheries research. Some useful references for this session’s topics, as suggested by the workshop program committee, include Schell and Linder, 2006; Schell et al., 2004; and Yilmaz et al., 2006. The session was chaired by Mubarak Shah (University of Central Florida), with presentations by Margrit Betke (Boston University), Mubarak Shah, Jules Jaffe (University of California, San Diego), and Ashok Veeraraghavan (Rice University).
Margrit Betke, Boston University
Margrit Betke explained that she works with computer vision applied to the three-dimensional (3D) multi-object tracking and analysis of bat and bird flight. This is a collaborative effort that includes experts from geography, engineering, computer science, and biology. Betke specifically studies the Brazilian free-tailed bat, whose population may have dropped from 150 million to 11 million in the past 50 years (Betke et al., 2008). She noted that the Brazilian free-tailed bat acts as
a pest control service, with each bat eating as many as 114 corn earworm moths in a single night, so there are economic as well as environmental reasons for desiring the bat to thrive.
Betke explained that she studies bats using thermal infrared cameras with a high spatial resolution at 131 frames per second. Using a three-camera system, she is able to develop 3D reconstructions of flight behavior. Her team has developed a protocol and calibration for multi-camera videography (Theriault et al., 2014). A calibration device with landmarks, which are easily identifiable because of their differing temperatures, is used to calibrate the space in which the bats are moving. After calibration, the two main elements to develop an object’s trajectory are detection and tracking. Tracking can be further divided into position estimation (which Betke indicated is fairly easy to accomplish) and data association/disambiguation (which Betke indicated is still quite difficult).
Betke explained that if bats are imaged at an observation distance of about 10 m, the resulting resolution is at least 10 pixels per bat. In the reported experiment, observation distances were chosen so that the reconstructed 3D positions had uncertainties less than 10 cm, the approximate length of a single bat; at a 10 m distance, the measured uncertainty in the 3D projection was 7.8 cm.
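The distance dependence Betke describes follows the standard stereo error model, in which depth uncertainty grows roughly with the square of the target distance. A minimal sketch of that relationship (the focal length, baseline, and disparity-error values below are illustrative, not parameters of Betke's rig):

```python
def depth_uncertainty(z_m, baseline_m, focal_px, disparity_err_px):
    """Approximate stereo depth uncertainty: dZ = Z^2 * dd / (f * B).

    z_m: target distance (m); baseline_m: camera separation (m);
    focal_px: focal length in pixels; disparity_err_px: expected
    disparity measurement error in pixels.
    """
    return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

# Illustrative numbers only: at 10 m with a 2 m baseline, a 1000-pixel
# focal length, and half-pixel disparity error, dZ is 2.5 cm.
print(depth_uncertainty(10.0, 2.0, 1000.0, 0.5))
```

Doubling the distance quadruples the predicted uncertainty, which is why observation distances in such experiments must be chosen to keep the error below the length of a single animal.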
Betke explained that objects are tracked in 3D space with two-dimensional (2D) measurements through two possible methods of path reconstruction:
- Reconstruction-tracking method. 3D positions are first reconstructed from multiple views, then a tracking approach is applied using feature-to-feature fusion. In other words, the correspondence of detected objects is found first across views, then across time.
- Tracking-reconstruction method. 2D tracking in each view is applied independently, followed by the reconstruction of 3D trajectories through track-to-track fusion. In other words, the correspondence of detected objects is found first across time, then across views.
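The two orderings can be illustrated with a toy pipeline. The sketch below implements reconstruction-tracking under strong simplifying assumptions (two idealized orthogonal views that share a z coordinate, greedy nearest-neighbor association); the tracking-reconstruction variant would instead apply `link_over_time` to each view first and then fuse the resulting 2D tracks:

```python
import math

def triangulate(det_a, det_b):
    # Toy "triangulation": view A sees (x, z), view B sees (y, z);
    # the shared z coordinate is averaged.
    (x, za), (y, zb) = det_a, det_b
    return (x, y, (za + zb) / 2.0)

def match_across_views(dets_a, dets_b):
    # Feature-to-feature fusion: greedily pair detections whose shared
    # z coordinate agrees best across the two views.
    pairs, used = [], set()
    for da in dets_a:
        best, best_err = None, float("inf")
        for j, db in enumerate(dets_b):
            if j not in used and abs(da[1] - db[1]) < best_err:
                best, best_err = j, abs(da[1] - db[1])
        if best is not None:
            used.add(best)
            pairs.append((da, dets_b[best]))
    return pairs

def link_over_time(frames_3d, gate=1.0):
    # Nearest-neighbor data association across frames: extend the
    # closest existing track within `gate`, else start a new one.
    tracks = []
    for points in frames_3d:
        extended = set()
        for p in points:
            best_i, best_d = None, gate
            for i, t in enumerate(tracks):
                d = math.dist(t[-1], p)
                if i not in extended and d < best_d:
                    best_i, best_d = i, d
            if best_i is None:
                tracks.append([p])
                extended.add(len(tracks) - 1)
            else:
                tracks[best_i].append(p)
                extended.add(best_i)
    return tracks

def reconstruction_tracking(frames_a, frames_b):
    # Correspondence across views first, then across time.
    frames_3d = [
        [triangulate(da, db) for da, db in match_across_views(fa, fb)]
        for fa, fb in zip(frames_a, frames_b)
    ]
    return link_over_time(frames_3d)
```

The greedy matching here stands in for the data-association step Betke flagged as the hard part; real systems must resolve ambiguous matches when many similar animals cross paths.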
Betke explained that her group’s tracking approach couples object detection with position estimation and data association (Wu et al., 2012). This method, essentially a multi-dimensional optimization problem, provides additional flexibility and scalability, is less sensitive to initialization, and does not constrain the pixel resolution of the target.
Betke then turned to the study of group behavior. Brazilian free-tailed bats, she explained, exhibit columning behavior when they emerge from their daytime roosts, which gives rise to behavioral questions about how bats fly together in the column (including questions such as distance between bats, relative positioning, flight speed variations, and avoidance). Her studies indicate that bats do not like to fly above one another, are adept at avoiding one another without needing to
change velocity, and tend to fly at roughly the same velocity regardless of their location within the column.
Betke also discussed applying trackers to pedestrian traffic, or appearance-based tracking (Wu et al., 2013; Bai et al., 2013). In appearance-based tracking, objects are divided into patches, and the motion of each patch is observed and monitored. This approach works for both stationary and moving cameras. Pedestrian data tend to contain a significant number of occlusions, making the task more difficult. Betke explained that she and her students also visited the New England Aquarium to record fish with their three-camera visible-light high-speed video-recording system; at this point they had merely conducted census studies, which were fairly accurate: the number of fish counted in each of the three camera views mostly matched the number of fish known to be in the tank.
Betke concluded by stressing the importance of computer vision in collecting and analyzing multi-camera video data sets. She stated that her software tool for planning field experiments (Theriault et al., 2014) might be helpful to the fisheries community. She emphasized the importance of estimating uncertainty in measurements for different target distances and angles between the optical axes of the cameras used, and indicated that the appearance-based techniques her group has developed for pedestrian tracking may be helpful with fisheries species classification.
Mubarak Shah, University of Central Florida
Mubarak Shah explained that his work focuses on high-density, crowded scenes, such as festivals, marathons, and marches. The work encompasses counting, localization, tracking, and characterizing crowd behavior. Shah noted that counting is particularly important, as there is often a desire to know how many people participate in rallies, attend concerts, or commute through a given area. Counting requires fusing information from multiple sources, and there is demand for improved automation. However, the reliability of current methods is limited by low-resolution imagery, occlusion, foreshortening, and variations in perspective.
Shah gave an overview of a framework for analyzing crowds (Idrees et al., 2013). Three steps—head detection, Fourier analysis, and interest-point-based counts—are used in combination to obtain patch counts. Head detection is used for counting because bodies are often obscured in crowd images. Shah explained that a human detection model is trained on an existing data set so it can recognize heads. Fourier analysis is used when the head size is too small; as crowds often contain
a periodic recurrence of heads, Fourier analysis can identify that periodicity, said Shah. Interest-point-based counting is primarily conducted using the Scale-Invariant Feature Transform (SIFT). The three methods are used in concert for enhanced accuracy.
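A highly simplified sketch of how per-patch counts from the three sources might be combined (Idrees et al. use a Markov random field over patches to enforce consistency between neighboring patch counts; the per-patch median below is only a stand-in for that fusion step):

```python
from statistics import median

def fuse_patch_counts(patch_estimates):
    # patch_estimates: one (head_count, fourier_count, sift_count)
    # triple per image patch. Take the median of the three sources in
    # each patch so a single failing method is outvoted, then sum the
    # patch counts to get a whole-image estimate.
    return sum(median(est) for est in patch_estimates)

# Two patches: the three sources roughly agree on the first patch,
# and the Fourier estimate is an outlier on the second.
print(fuse_patch_counts([(10, 12, 11), (3, 9, 4)]))  # 11 + 4 = 15
```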
Shah noted that, in general, overall crowd count is fairly accurate, but each person is not localized in this system. He noted that the system produces many false positives as well as some inconsistent scaling (in which people in the same region of the image are not the same approximate size). He explained his research group’s method of finding bounding boxes, which is useful when localization is needed. That method consists of the following steps: detect a human (using a combination-of-parts model), apply scale and confidence information (information about the approximate size of a person and the confidence that detection has been made, respectively), apply a Markov random field model, iterate, and then apply global occlusion reasoning to make a final detection decision. Global occlusion reasoning, Shah explained, sets rules to help find and collect visible parts of the same person. A chain constraint is used to avoid degenerate solutions (e.g., when a head and legs are selected for the same bounding box, but the matching shoulders and abdomen are not selected for that bounding box). Shah indicated that this method of human detection compares favorably to state-of-the-art systems.
Shah then applied his methods to tracking individuals in a dense crowd (Idrees et al., 2014). There are two key elements to successfully tracking individuals:
- Context. A neighborhood motion concurrence model is used to understand context around an individual.
- Prominence. Some crowds have prominent individuals who are easy to track, referred to as “queens,” who can be identified first.
Shah explained that crowd behaviors, such as bottlenecks, departures, lane formations, arches and rings, and blockings, can also be tracked and identified.
In a later discussion session, Shah noted that there appears to be a gap between what the fisheries community is doing in the areas of sensors, 3D, and metadata relative to what is traditionally done in computer vision, and, thus, there may be opportunities for future work and improvements in the approaches used for fisheries stock assessments.
Jules Jaffe, University of California, San Diego
Jules Jaffe began by stating that fish are highly maneuverable, exhibiting motions uncommon among typical tracking targets. For instance, they turn quickly in the water with rapid acceleration, exerting forces on their bodies many times the force of gravity. As a result, the required frame rates to track fish are quite high—on the order of hundreds of frames per second. While one might usually implement a Kalman filter to track sources such as these, Jaffe said, sudden movements of fish are beyond the capability of a Kalman filter, which is better suited to predictable, piecewise linear motion. A segmented track identifier may be better suited to more abrupt fish movements. Jaffe described a segmentation and fitting algorithm to develop a piecewise, least-squares fit of a track to a parametric motion model; he compared it conceptually to a spline (Schell et al., 2004; Schell and Linder, 2006). The two tracking methods were applied to data gathered via 3D sonar with several cameras.
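The segment-and-fit idea can be sketched crudely: grow a segment while a least-squares line still fits the observations within tolerance, then break and start a new segment at the turn. The actual Schell and Linder identifier fits a richer parametric motion model; this 1D linear version is only illustrative:

```python
def fit_segment(ts, xs):
    # Least-squares line x = a*t + b; returns (a, b, max_residual).
    n = len(ts)
    tm, xm = sum(ts) / n, sum(xs) / n
    denom = sum((t - tm) ** 2 for t in ts)
    a = sum((t - tm) * (x - xm) for t, x in zip(ts, xs)) / denom if denom else 0.0
    b = xm - a * tm
    r = max(abs(x - (a * t + b)) for t, x in zip(ts, xs))
    return a, b, r

def segment_track(ts, xs, tol=0.1):
    # Greedy piecewise-linear segmentation: extend the current segment
    # until the fit residual exceeds tol, then start a new segment
    # that shares the breakpoint (the "turn").
    segments, start = [], 0
    while start < len(ts) - 1:
        end = start + 2
        best = fit_segment(ts[start:end], xs[start:end])
        while end < len(ts):
            cand = fit_segment(ts[start:end + 1], xs[start:end + 1])
            if cand[2] > tol:
                break
            best, end = cand, end + 1
        segments.append((best[0], best[1], start, end))
        start = end - 1
    return segments
```

A fish track that runs straight and then turns sharply yields two segments with opposite slopes, where a single Kalman filter with a linear motion model would lag badly through the turn.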
Jaffe then described underwater vehicles under development in his laboratory. The mini-autonomous underwater explorer (mini-AUE) is a compact, self-balancing vehicle with a hydrophone and oceanographic sensors. It has a battery life of approximately 1 week and can adjust its depth in the water column. The methods of recovery (Global Positioning System [GPS], radio frequency beacon, and light-emitting diode strobe) are redundant, as are the mini-AUE’s buoyancy control mechanisms (piston, burn wire with drop weight, and timed weight release). Jaffe’s laboratory developed 20 mini-AUE devices for an underwater position tracking system. GPS-equipped reference buoys were stationed at the surface of the water; for each drifting mini-AUE, distance measurements were made by observing the acoustic time-of-flight from the mini-AUE to the beacons. A factor graph was used to make a state-space representation of the drifter moving in time, and it was solved using an iterative message-passing algorithm. A localization test was conducted in December 2011 using a floating hydrophone array and five reference buoys spread across a 2-km-diameter range. The localization test showed that the acoustic track matched the known GPS track with less than 2 m average error (0.1 percent of the range). A “mini-swarm” of mini-AUEs was tested in September 2013 (Mirza and Schurgers, 2009, 2012; Mirza et al., 2012, 2013; Yi et al., 2013).
Jaffe concluded by shifting to a brief discussion of his planned experiments with omni-directional stereo tracking. One hundred cameras will be inset into a 3D-printed scaffold set “cross-eyed” to obtain stereo information. He believes that this novel camera design may have useful underwater applications.
Ashok Veeraraghavan, Rice University
Ashok Veeraraghavan described three projects that are peripherally related to fisheries tracking: 2D tracking in cluttered environments, 3D tracking of multiple small targets, and small baseline tracking using light-field cameras. His research
focuses on bee tracking. Veeraraghavan explained that bees are a challenging topic of study: beehives are a highly cluttered environment, with many other bees in close proximity; bees are capable of rapid movements; and there are complex interactions between the bees. Conventional bee tracking has been conducted manually: the bees are hand-marked, which can be a labor-intensive process. Tracking insects automatically is accomplished by modeling behavior at three levels: structural/anatomical, behavioral, and interaction. An anatomical model divides the bee into its three basic body parts (head, thorax, and abdomen) and models each as an ellipse. An anatomical model has the following constraints:
- Body-part dimensions stay within consistent physical limits.
- Structural limitations on how the parts connect are incorporated.
- The orientations of the body parts are correlated.
- Insects move in the direction of their head.
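The constraints above can be sketched as a toy pose-consistency check over three part centers and orientations (the 30-degree thresholds and the collinearity test are illustrative choices, not values from the talk):

```python
import math

def plausible_pose(head, thorax, abdomen, max_angle_deg=30.0):
    """Toy consistency check for a three-part (head, thorax, abdomen)
    insect model. Each part is (x, y, orientation_deg). Orientations of
    adjacent parts must agree within max_angle_deg, and the thorax must
    lie roughly on the head-abdomen axis. Thresholds are illustrative.
    """
    def angle_diff(a, b):
        # Smallest absolute difference between two angles in degrees.
        return abs((a - b + 180.0) % 360.0 - 180.0)

    # Orientation correlation between adjacent body parts.
    if angle_diff(head[2], thorax[2]) > max_angle_deg:
        return False
    if angle_diff(thorax[2], abdomen[2]) > max_angle_deg:
        return False
    # Structural constraint: head->thorax direction should roughly
    # match the overall abdomen->head body axis.
    axis = math.degrees(math.atan2(head[1] - abdomen[1], head[0] - abdomen[0]))
    seg = math.degrees(math.atan2(head[1] - thorax[1], head[0] - thorax[0]))
    return angle_diff(axis, seg) <= max_angle_deg
```

In a tracker, a check like this prunes candidate poses before the behavioral and interaction models are consulted.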
A behavioral model takes into account specific behaviors that bees typically exhibit. The modeling is done via a hidden Markov model, explained Veeraraghavan, although the states are manually defined by the scientists. An interaction model takes into account how bees move and interact in the hive. Veeraraghavan said that he and his coworkers specifically tracked foraging bees upon their return to the hive, observing their dance communication behaviors.
Veeraraghavan then turned to a discussion of 3D tracking of small, dim targets. The objective of the work was to observe the response of bees to visual stimuli. Bees were placed in a highly scripted environment (i.e., a staged room), where a few hundred bees were observed at a time and two fixed cameras recorded their motion. Bees were typically 20 to 50 m from a camera, each image frame contained an average of 6 to 8 bees, and each bee was represented by 5 to 10 pixels. The bees are considered dim targets because of the relatively few pixels per bee. Veeraraghavan explained that the tracking algorithms used were very simple: background subtraction to remove variations in the background, connected component analysis to link pixels belonging to the same target, and probabilistic data association. The points at which the bees turn in flight (i.e., the points of maximal 2D curvature) are used to corroborate data across the two cameras. Veeraraghavan said that no camera calibration was needed because the bees’ unique trajectory curvature made it simple to correlate the cameras. Once correspondence is established, the data are triangulated to create a 3D flight path.
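The first two stages of that pipeline can be sketched in a toy form, assuming grayscale frames represented as 2D lists (the threshold value is an illustrative choice):

```python
def detect_blobs(frame, background, thresh=10):
    """Toy dim-target detector: subtract a background frame, threshold,
    then group foreground pixels into connected components with a
    4-connectivity flood fill. Returns one (row, col) centroid per
    component, in row-major discovery order.
    """
    h, w = len(frame), len(frame[0])
    fg = [[abs(frame[r][c] - background[r][c]) > thresh for c in range(w)]
          for r in range(h)]
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for r in range(h):
        for c in range(w):
            if fg[r][c] and not seen[r][c]:
                stack, pix = [(r, c)], []
                seen[r][c] = True
                while stack:  # flood fill one connected component
                    y, x = stack.pop()
                    pix.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and fg[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                cy = sum(p[0] for p in pix) / len(pix)
                cx = sum(p[1] for p in pix) / len(pix)
                blobs.append((cy, cx))
    return blobs
```

The centroids from each camera would then feed the probabilistic data-association and triangulation stages described above.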
Veeraraghavan then turned to a discussion of light-field cameras and their application. Light-field cameras capture single-shot 3D and multi-view information by including an array of microlenses within one camera. These cameras are compact, lightweight, and easy to use. The main drawback to a light-field camera, Veeraraghavan said, is that it provides only low-resolution images. These cameras
record information about depth that can be extracted and exploited to refocus the image at different depths. Veeraraghavan used a light-field camera to image and track fish in a small fish tank to determine the feasibility of applying light-field cameras in water. He was able to demonstrate that 3D trajectories of fish targets can be extracted, even when one fish occludes another.
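The refocusing Veeraraghavan describes can be sketched as shift-and-add synthetic refocusing: each sub-aperture view is shifted in proportion to its lens offset and the views are averaged, so targets at the depth matching the chosen shift slope align and sharpen while others blur (the toy data and integer-only shifts below are illustrative):

```python
def refocus(subviews, offsets, slope):
    """Shift-and-add synthetic refocusing.

    subviews: list of 2D lists (one image per microlens sub-aperture);
    offsets: matching (du, dv) lens offsets; slope: depth parameter.
    Each view is shifted by slope * (du, dv) and the views are averaged.
    Integer shifts only, for simplicity.
    """
    h, w = len(subviews[0]), len(subviews[0][0])
    out = [[0.0] * w for _ in range(h)]
    for img, (du, dv) in zip(subviews, offsets):
        sy, sx = int(round(slope * dv)), int(round(slope * du))
        for r in range(h):
            for c in range(w):
                rr, cc = r - sy, c - sx
                if 0 <= rr < h and 0 <= cc < w:
                    out[r][c] += img[rr][cc]
    n = len(subviews)
    return [[v / n for v in row] for row in out]

# A point target seen with one pixel of disparity between two views:
# refocusing at slope=1 brings both copies into alignment at column 2.
view_left = [[0, 0, 0, 10, 0]]
view_right = [[0, 10, 0, 0, 0]]
focused = refocus([view_left, view_right], [(-1, 0), (1, 0)], slope=1.0)
```

Sweeping `slope` over a range of values produces a focal stack, which is the mechanism that lets a light-field camera separate fish at different depths even when they overlap in a single view.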