The fifth session of the workshop focused on current approaches used to classify fish and identify key parameters such as size, shape, texture, color, and motion. This workshop session also included examples from a diverse array of disciplines. Some useful references for this session’s topics, as suggested by the workshop program committee, include Miller et al., 2013, 2014; Dryden and Mardia, 1998; Srivastava et al., 2011; Kurtek et al., 2012; Hsiao and Hebert, 2013; Gu et al., 2012; Zhu et al., 2008; Thayananthan et al., 2003; Belongie et al., 2002; Ochs et al., 2014; Ricco and Tomasi, 2012; and Sundaram et al., 2010. The session was chaired by Hui Cheng (SRI International) and included presentations by Elizabeth Clarke (NOAA Fisheries), Anuj Srivastava (Florida State University), Anthony Hoogs (Kitware), Hui Cheng, and Michael Miller (Johns Hopkins University).
Elizabeth Clarke, NOAA Fisheries
Elizabeth Clarke stated that while accurate fish size is vital for obtaining corresponding age and biomass estimates, measuring fish is a time-consuming step. She explained that the weight, age, and life stage of individual fish can be estimated from the measurement of total fish length. Other morphometric and meristic features1 can be measured later for taxonomists, but these measurements are not done in the field, as they are too time-consuming.
Clarke explained that, currently, measurements are typically made by hand, on the deck of a vessel or in a laboratory. Automated measuring boards exist to help with on-deck measurements of large catches. While lasers also have been used from submersibles to obtain in situ measurements, stereo imaging cameras are currently considered state-of-the-art for underwater applications.
In addition to measuring length, on-deck sampling observers can measure weight, extract otoliths,2 and obtain tissue and other biological samples. Clarke emphasized that some catch will always be needed to obtain these physical samples, no matter how well other measurements can be obtained via automated methods. In response to a later question, she indicated that real-time information would also be useful for on-deck applications, such as allowing fish catch to be quickly assessed on fishing vessels.
Clarke stated that fish come in a variety of shapes and sizes, with a corresponding variety in their visibility and ease of measurement in images. She explained that lasers had been the standard method of obtaining fish size in situ. In such a system, several lasers are mounted a known, fixed distance apart. However, it can be difficult to obtain accurate measurements via lasers: most fish are not resting on the seafloor and are therefore not perpendicular to the camera. In addition, Clarke noted that fish tend to react to certain lasers: flatfish have been known to “chase” green lasers, for instance. Stereo cameras are now the de facto standard, Clarke said, although they must be carefully calibrated to correct for (1) optical distortions of the lenses and (2) the epipolar geometry (i.e., the relative translation and rotation of the cameras). Commercial or “one-off” software is then used to examine images and obtain measurements. These measurements are done by hand, and Clarke emphasized the laborious nature of this task. She also explained that morphometric and meristic measures are critical for identifying fish species. Even with those observations, however, certain fish (vermilion and sunset rockfish, for example) can be distinguished only by their genetics and cannot be identified in hand. Other fish can be very difficult to identify via photos, she said, as certain distinguishing details are not visible in an image. In discussion, a participant indicated that in an automated context, it may be easier to obtain surface area than length, and that surface area may be a more robust measure to use for weight and age relationships.
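As a concrete illustration, the stereo length measurement Clarke described can be sketched for an ideal rectified stereo pair, where depth follows directly from disparity. This is a minimal sketch under simplified assumptions; all function names and camera numbers are hypothetical, and a real system must first apply the lens-distortion and epipolar calibration Clarke mentioned before this idealized model holds.

```python
import math

def triangulate(u_left, u_right, v, f, cx, cy, baseline):
    """Back-project a matched pixel pair from an ideal rectified
    stereo rig into camera coordinates (metres).
    f: focal length in pixels; (cx, cy): principal point;
    baseline: camera separation in metres."""
    disparity = u_left - u_right          # pixels
    z = f * baseline / disparity          # depth from disparity
    x = (u_left - cx) * z / f
    y = (v - cy) * z / f
    return (x, y, z)

def fish_length(snout, tail, f, cx, cy, baseline):
    """Straight-line length between two triangulated body points.
    snout and tail are (u_left, u_right, v) pixel triples picked
    by hand in the left and right images."""
    p1 = triangulate(*snout, f, cx, cy, baseline)
    p2 = triangulate(*tail, f, cx, cy, baseline)
    return math.dist(p1, p2)
```

Because the endpoints are still selected by hand, this sketch only automates the geometry, not the laborious annotation step Clarke emphasized.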
Clarke concluded by noting that improved biomass estimates could correspondingly improve assessment of fish stocks. She suggested that it would also be helpful to be able to distinguish geologic features for classifying habitat.
1 Morphometric and meristic features are quantitative measures of a fish, such as the number and size of fins, scales, jaw length, and eye-to-jaw ratio, that can be used by taxonomists to distinguish different species.
2 The otolith is a structure in the inner ear of vertebrates that helps with balance and movement. In fish, markings on the otolith can be used to determine the animal’s age.
Anuj Srivastava, Florida State University
Anuj Srivastava discussed his research in the detection of the types, locations, and quantity of underwater mines. He noted that there are few mines in the images he obtained; most of what is imaged is considered clutter, such as fish or underwater debris. The first step in the detection process is to find regions of interest using a variety of machine learning algorithms. The second step is to look at a “big picture” to detect spatial patterns in the regions of interest and to model the terrain. Finally, centers of attention are examined to determine features such as shape and appearance, and the object is classified via an analysis of shape contours.
Srivastava explained that shape analysis of contours is complex and consists of two challenges:
- Extraction of contours. Srivastava explained that extracting object boundaries during segmentation may require overcoming complexities such as low image contrast, occlusion and clutter, and multiple sources.
- Statistical analysis of contour shapes. Metrics for comparing shapes and models for capturing the statistical variability in the shapes are both needed, said Srivastava. A statistical model can enable the researcher to put a confidence value on any shape classification.
Srivastava explained that registration is the determination of correspondences across curves and the matching of points across curves. For every two shapes under comparison, the optimal registration must be found, and linear registration is usually suboptimal. In elastic shape analysis, registration and comparisons are performed jointly. Elastic shape analysis requires the computation of a metric to quantify the shape difference; it should be independent of the choice of comparison points and should not disturb points that are already well matched. Srivastava indicated that shape metrics are useful in clustering data, and a statistical model of the shapes can be developed with summary information on the population of shapes (information such as the mean, covariance, and principal components of the population). Srivastava said that statistical shape analysis has been successfully applied to the shape analysis of leaves. However, more tools are needed for statistical shape analysis, particularly the following:
- Partial shape analysis. The full shape may not always be available.
- Moving beyond similarity invariance. Objects observed from different viewing angles may have an altered or distorted perceived shape, Srivastava explained, and a similarity transformation would not be applicable. Other types of projections or transformations may be needed.
- Comparing populations. Currently, individual shapes are compared to one another; Srivastava posited that it may be useful to compare population distributions, rather than merely individual shapes.
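The elastic representation underlying this line of work can be sketched with the square-root velocity function (SRVF) of a polygonal curve. This is a bare-bones illustration: the full elastic metric Srivastava described also optimizes over rotation and reparameterization (the registration step), which this sketch omits.

```python
import math

def srvf(curve):
    """Square-root velocity function of a 2-D polygonal curve:
    each edge vector v_i is mapped to q_i = v_i / sqrt(|v_i|)."""
    q = []
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        vx, vy = x1 - x0, y1 - y0
        speed = math.hypot(vx, vy)
        s = math.sqrt(speed) if speed > 0 else 1.0
        q.append((vx / s, vy / s))
    return q

def shape_distance(curve_a, curve_b):
    """L2 distance between SRVFs of two curves with the same vertex
    count; a reduced stand-in for the elastic shape metric."""
    qa, qb = srvf(curve_a), srvf(curve_b)
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(qa, qb)))
```

A distance of this kind is what enables the clustering and the population statistics (means, covariances, principal components) that Srivastava described.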
Srivastava also explained that an active contour model, combined with statistical shape analysis, can be used to extract a shape. A priori information about the expected types of shapes can be supplied to the active contour model. He indicated that this method has proven useful when applied to objects of interest that are only several pixels in size.
Srivastava concluded by noting the importance of shape analysis to not only identify individual items, but also provide feedback to other elements in a larger detection system, such as the determination of regions of interest. In addition, several of these shape analysis methods have already been extended to include 3D objects and surfaces.
Anthony Hoogs, Kitware
Anthony Hoogs began by defining a paradigm in which a user can interact with a machine learning system to tell the system the type of information the user would like. He described a concept of vision algorithm generation by non-experts. The system contains a variety of potential algorithms, and it determines which algorithm(s) would best apply to a given input to obtain a desired output. The user supplies the content (i.e., an image or set of images), and the system provides the algorithm and parameters to obtain a result. Such a system would enable scientists, analysts, and other end users to extract useful content from imagery and video across heterogeneous scientific domains without requiring any additional expertise in analytic algorithms or visualization.
Hoogs explained that the normal process today calls for an expert to manually adjust between 5 and 25 algorithm parameters and choose the best values based on experience, intuition, and trial and error. However, even with a large investment of time, the optimal parameter and model configuration settings may never be found with manual tuning. In addition, the parameters do not then generalize, and each subsequent data stream requires this same, potentially lengthy, manual tuning procedure. Hoogs indicated that a better method would be to automatically
iterate over candidate parameters and models until performance is acceptable. His approach is to cluster videos into similarity sets based on a set of variables, such as scene conditions or camera metadata, and then learn the best configuration parameters for each set.
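The auto-tuning loop Hoogs outlined might be caricatured as an exhaustive per-cluster search. The function and parameter names below are invented for illustration, and the real system learns models as well as parameters rather than scoring a fixed grid.

```python
from itertools import product

def tune_per_cluster(clusters, param_grid, score):
    """For each similarity cluster of videos, score every parameter
    combination and keep the best one.
    clusters:   {cluster_id: [video, ...]}
    param_grid: {param_name: [candidate values]}
    score:      callable(video, params) -> higher is better."""
    names = sorted(param_grid)
    candidates = [dict(zip(names, vals))
                  for vals in product(*(param_grid[n] for n in names))]
    return {cid: max(candidates,
                     key=lambda p: sum(score(v, p) for v in videos))
            for cid, videos in clusters.items()}
```

Clustering first means the expensive search runs once per similarity set instead of once per data stream, which is the economy Hoogs argued for.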
Hoogs also discussed automatic video surveillance. In an exemplar-based query, the user provides a video clip (such as a recording of a person bending over or performing some other specific activity) and asks the system to provide other instances of that same activity from a lengthier recording. In this surveillance application, the user provides feedback to tune the results. This system has been in development for 5 to 6 years, explained Hoogs, and it is in the process of transition to defense applications. The system first detects movement and then determines the object type (i.e., whether it is a vehicle, person, or other object). Tracks are examined at the global level (to obtain information such as the kinematic properties, object type, and how the object is changing in time). Many descriptors are computed, including track interval descriptors, motion descriptors, and relational descriptors. In low-resolution video, humans and other objects of interest may be only a few pixels in size, which adds to the challenge. Hoogs explained that statistical shape analysis proved to be the best tool for identification and tracking, and he showed results identifying instances of a specific activity, including finding people wearing backpacks and people doing cartwheels. He suggested that similar classifiers could be trained, with user input, to find fish similar to a selected one.
Hoogs concluded by emphasizing the importance of using a meta-learning approach to generate vision algorithms by the domain scientist, rather than by experts in computer vision. He stated that auto-tuning, combined with user interaction and feedback, can improve the results. He suggested that the community develop a broader biological data set suitable for developing and testing tracking algorithms. Such a set could then be used to identify the best trackers to suit the data. In a later discussion, Hoogs suggested that the underwater observing community may suffer from too much specificity in its data sets; he has observed that as many as 30 to 40 percent of research papers introduce new data sets, leading to data set competition instead of collaboration.
Hui Cheng, SRI International
Hui Cheng began by describing a number of the challenges associated with shape, motion, and texture analysis in video for fisheries applications, such as the diverse life forms with variation in shape, size, movement, and texture as well as diverse environments with variation in lighting and background conditions. Cheng explained that one potential way to meet these challenges is through the integration
of multiple sensors and cues to cut across shape, motion, behavior, and texture. He stressed that SRI International has not investigated any underwater applications at this time.
Cheng explained that optical flow analysis, the traditional method of analyzing video, tends to blur boundaries when items are in motion. It also does not adequately address occlusion. He showed results using bilateral filtering (Xiao et al., 2006) that produces sharp, clear motion boundaries. He also said that with coarse-to-fine model-based image alignment, one high-resolution camera can be paired with lower-quality cameras without loss of quality in detection and tracking. This may contrast with the natural impulse to use the highest-resolution cameras possible.
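The edge-preserving behavior Cheng attributed to bilateral filtering can be seen in a one-dimensional sketch. The method of Xiao et al. (2006) operates on 2-D flow fields; this toy version only shows why averaging weighted by intensity similarity keeps boundaries sharp where plain smoothing would blur them.

```python
import math

def bilateral_filter_1d(signal, sigma_s, sigma_r, radius):
    """1-D bilateral filter: each sample is replaced by an average of
    its neighbours, weighted by both spatial closeness (sigma_s) and
    intensity similarity (sigma_r), so large jumps (like motion
    boundaries) are preserved instead of smeared."""
    out = []
    for i, v in enumerate(signal):
        num = den = 0.0
        for j in range(max(0, i - radius),
                       min(len(signal), i + radius + 1)):
            w = (math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
                 * math.exp(-((v - signal[j]) ** 2) / (2 * sigma_r ** 2)))
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out
```

Samples on opposite sides of a step contribute almost nothing to each other, so the step survives the smoothing.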
Cheng described work in large-area aerial surveys to detect and track different types of moving vehicles. The goal is to be able to distinguish different vehicle types from one another, as well as identify and track specific vehicles (Cheng referred to the latter as “fingerprinting” the vehicle). Images are first scanned, and any objects found are indexed. The items are then matched, and the matched sets are verified to create a unique match or a shorter list of possible matches. Finally, the object is “fingerprinted,” which is done in three dimensions, he explained, because there are so many possible two-dimensional (2D) renderings of each object, and 2D matching is too computationally expensive. Specific points on different vehicle models (such as corners and joints) are used as key feature locations. The key feature locations are then projected back onto the 2D image, and a matching matrix is obtained to describe the feature matching. Feature matching also provides information about what parts of the vehicle are occluded and what parts of the vehicle do not match well to the model.
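The back-projection and matching step of the vehicle "fingerprinting" pipeline might look roughly like the following. This is a toy pinhole projection; the tolerance, names, and data layout are assumptions for illustration, not SRI International's implementation.

```python
import math

def project(point3d, f, cx, cy):
    """Pinhole projection of a 3-D model keypoint onto the image
    plane (f in pixels, (cx, cy) the principal point)."""
    x, y, z = point3d
    return (f * x / z + cx, f * y / z + cy)

def match_matrix(model_pts, image_pts, f, cx, cy, tol):
    """Binary matrix M[i][j]: model keypoint i is matched to detected
    image point j when its projection lands within tol pixels.
    A row with no True entry suggests that part of the vehicle is
    occluded or fits the model poorly."""
    proj = [project(p, f, cx, cy) for p in model_pts]
    return [[math.dist(pp, ip) <= tol for ip in image_pts]
            for pp in proj]
```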
Cheng then discussed the challenges associated with finding and tracking people in video imagery (Zhu et al., 2014). In a large-area survey, a person is very small in the image, typically only 6 to 8 pixels wide. In addition, the images tend to be low contrast, contain sensor artifacts and incomplete information, and contain occlusions. To detect people in video, one looks for a spatial signature (size and contrast) as well as a motion signature (contrast difference, speed, and travel distance). In addition, one uses a multi-frame cue to detect changes from one frame to the next, along with measures of spatial and temporal saliency.
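The multi-frame cue can be sketched as a per-pixel count of large frame-to-frame differences. The threshold and normalization here are illustrative only; the actual detector combines this cue with the spatial and motion signatures above.

```python
def motion_saliency(frames, threshold):
    """Multi-frame change cue: for each pixel, the fraction of
    frame-to-frame transitions whose absolute intensity change
    exceeds threshold. frames is a list of 2-D intensity grids
    (lists of rows) of identical size."""
    h, w = len(frames[0]), len(frames[0][0])
    counts = [[0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(h):
            for x in range(w):
                if abs(cur[y][x] - prev[y][x]) > threshold:
                    counts[y][x] += 1
    n = len(frames) - 1
    return [[c / n for c in row] for row in counts]
```

Pixels that change persistently across frames score high, which helps separate a slowly moving 6-to-8-pixel person from static clutter.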
Cheng explained that these wide-area video analyses can also classify behaviors, or patterns of activity, using entity-centric event detection and recognition (Cheng et al., 2008). This has been applied to aerial and surveillance videos, and he noted that this type of tracking could also be applied to fish.
Michael Miller, Johns Hopkins University
Michael Miller described the emerging discipline of computational anatomy, which applies the vernacular and formalism of computational linguistics to the quantitative analysis of biological shape (Miller and Grenander, 1998). A set of metrics is used to index and cluster biological shapes, and machine learning is then performed on those shapes. Biological shape, Miller explained, is developed by using geodesic positioning to obtain coordinates; those coordinates then describe the shape. As with computational linguistics, parsing is a critical element. In computational linguistics, parsing means dividing the data into smaller segments—that is, building correspondences between a target (such as a word) and some ontology (such as a grammar). In computational anatomy, positioning is the equivalent of parsing, and diffeomorphisms3 are used for the transformation between the observed image and some template. Miller noted that he is particularly interested in applying computational anatomy to pediatric neurodevelopment.
Miller showed an example of the application of geodesic positioning and diffeomorphisms applied to temporal lobe dementia. Over 2 years, he imaged the temporal lobe of patients with dementia to observe the progression of the disease, and observed changes in the volumes of those lobes during that time (Miller et al., 2013, 2014).
He then explained that a geodesic positioning system provides information about both positioning and the geodesic coordinates. Geodesic coordinates describe the diffeomorphic shape; that is, the vector field at the origin determines the geodesic flow. With these coordinates, one can do machine learning and build classifiers to identify different diseases. Miller then described the application of such a system to high-throughput brain clouds. Johns Hopkins University has a database of more than 10,000 brain images, but there is no structured index associated with the images for searching them. Text-based searches are not terribly helpful, Miller pointed out, because high-quality, structured information about anatomical locations does not exist. Miller then explained that brain images can be parsed into an ontology: each brain image (consisting of 10 million or more pixels) can have its dimensional space reduced to 1,000 bits. Geodesic coordinates are then developed for each image, and machine learning is performed on the resulting coordinates. The vectors are then clustered to group images with similar characteristics. Miller showed some examples applying this technique to images of brains affected by cerebral palsy and noted that cerebral palsy has many different profiles.
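The indexing step Miller described, in which each image is reduced to a compact signature and then grouped, can be caricatured with Hamming-distance assignment of bit signatures. This is purely illustrative: the actual pipeline clusters geodesic coordinate vectors derived from diffeomorphic registration, not raw bit strings, and the centroid labels below are invented.

```python
def hamming(a, b):
    """Number of differing bits between two integer signatures."""
    return bin(a ^ b).count("1")

def assign_cluster(signature, centroids):
    """Assign a reduced bit-signature to the closest labelled
    centroid under Hamming distance.
    centroids: {label: signature}"""
    return min(centroids,
               key=lambda label: hamming(signature, centroids[label]))
```

Reducing a 10-million-pixel image to roughly 1,000 bits makes this kind of nearest-centroid lookup cheap enough to search a database of 10,000 brains.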
3 In mathematics, diffeomorphisms are smooth, differentiable, invertible maps that relate points on one manifold to those on another manifold, encoding all pointwise relationships.