2
Image Analysis and Computer Vision

Donald Geman, University of Massachusetts
and
Basilis Gidas, Brown University
2.1 Introduction to Image Analysis
The goal of computer vision is to build machines that reconstruct and interpret a three-dimensional environment based on measurements of radiant energy, thereby automating the corresponding processes performed by biological vision systems, and perhaps eventually extending some abilities of these systems. This goal is far from being attained, and indeed most of the fundamental problems remain largely unsolved. The field is generally immature by comparison with standard scientific disciplines, the present methodology being a hybrid of those from artificial intelligence, classical signal processing, pattern theory, and various branches of mathematics, including geometry and statistics. Still, important advances have been made that are beginning to substantially affect such areas as industrial automation, earth science, medical diagnosis, and digital astronomy.
Computer vision tasks are generally divided into "low-level" and "high-level" problems to differentiate between those that (apparently) are largely data-driven ("early vision") and those that (apparently) rely heavily on stored knowledge and symbolic reasoning. More specifically, low-level vision includes such problems as coding and compressing data for storage and transmission; synthesizing natural and man-made patterns; restoring images degraded by blur, noise, digitization, and other sensor effects; reconstructing images from sparse data or indirect measurements (e.g., computed tomography); computing optical flow from motion sequences; and reconstructing three-dimensional surfaces from shading, motion, or multiple views (e.g., binocular stereo) or multiple wavelengths (e.g., multichannel satellite data).

In contrast, high-level vision tends to be driven by more specific goals and generally involves object recognition and scene interpretation. These are perceived as the most difficult of the "natural" processes to replicate; in particular, one of the factors that inhibits the introduction of industrial and exploratory robots is their inability to "see," in particular, to infer enough information about objects to navigate in complex, cluttered environments. Indeed, invariant object recognition is one of the most challenging modern scientific problems, whose resolution may require new conceptual principles, computational procedures, and computer architectures.
Biological vision systems, in particular the human eye and brain, analyze scenes in an apparently effortless way. This ability appears miraculous (especially to workers in computer vision) and is attributed to several factors, not the least of which is the large proportion of the human brain devoted to vision. Still, we actually know very little about the principles of biological vision despite the information gathered by physiological, psychophysical, and neurophysiological studies. We do know that our visual system is able to integrate cues from many sources (e.g., binocular stereo, motion, and color) and that we exploit a priori expectations, specific scene knowledge, and contextual clues to reduce ambiguities and "correctly" perceive the physical world. It also appears that low-level analysis of retinal information and high-level cognition are done interactively, sometimes referred to as the integration of "bottom-up" and "top-down" processing. Finally, there is little doubt that our recognition and interpretation system is largely scale invariant and at least partially rotation invariant.
Computer vision systems are usually quite inferior to biological ones. This may be due in part to the lack of raw processing power or suitably parallel computation, but also, and perhaps more important, to the inability of synthetic systems to integrate sources of information and place appropriate global constraints. At the moment, automated visual systems rarely make "interpretation-guided" or "knowledge-driven" decisions, probably due to a lack of sufficiently invariant representations and of feedback mechanisms between these representations and the raw data. It appears inefficient, if not fruitless, to attempt to represent a given object in all its possible presentations. Instead, object representations and recognition algorithms should possess certain invariances with respect to scale, location, and rotation.
Despite these shortcomings, important successes have been achieved in actual applications. In some areas, the sheer amount of data leaves no choice except automated analysis. For instance, a single Landsat Multispectral Scanner image may contain 30 Mbytes of data. In fact, some of the earliest and most publicized successes of computer vision occurred during the 1960s and 1970s, when images received from orbiting satellites and space probes were substantially improved with linear signal processing techniques such as the Wiener filter. More recently, significant advances have been made in the classification of satellite data for weather and crop yield prediction, geologic mapping, and pollution assessment, to name but three other areas of remote sensing. Another domain in which automated systems are achieving success is the enhancement and interpretation of medical images obtained by computed tomography, nuclear magnetic resonance, and ultrasonics. Other applications include those to optical astronomy, electron microscopy, silicon wafer inspection, optical character recognition, robot navigation, and robot manipulation of machine parts and toxic material.
By and large, the algorithms used in machine vision systems are specifically dedicated to single applications and tend to be ad hoc and unstable. From a practical viewpoint, problems arise when algorithms are so critically "tuned" to the particulars of the environment that small perturbations in the output of the sensors, or ordinary day-to-day variations, will significantly reduce their level of performance. Obviously, there is a need for algorithms that are more robust and that are based on solid theoretical foundations. But these are ambitious goals; the problems are very difficult, ranging, in mathematical terms, from ill-conditioned (unstable) to ill-posed (underdetermined), the latter due to the loss of information in passing from the continuous physical world to sampled and quantized two-dimensional arrays. In order to reduce the ambiguity, it is necessary to reduce the set of plausible interpretations by incorporating a priori knowledge and integrating multiple cues. One then seeks a mathematical representation for structure and regularity.

Standard regularization theory, as applied, for example, to inverse problems in particle scattering, is deterministic and lacks flexibility. One major current trend is toward "stochastic regularization." This is not surprising in view of the fact that many natural regularities are in fact nondeterministic: they describe correlations and likelihoods. Spatial statistics, in general, and lattice-based random field models, in particular, provide a promising framework for capturing these regularities and quantifying generic and a priori knowledge.
Such models, properly conceived, impose severe but natural restrictions on the set of plausible interpretations. Thus spatial processes and Bayesian inference have provided a coherent theoretical basis for certain inverse problems in low-level vision. Moreover, this framework supports robust and feasible computational methods (e.g., Monte Carlo algorithms), measures of optimality and performance, and well-designed principles of inference. This methodology is described in more detail in §2.4. Section 2.2 contains a brief review of digital images, and §2.3 describes four specific image analysis tasks.
2.2 Digital Images
The data available to an automated vision system are one or more images acquired by one or more sensors. The most familiar sensors are optical systems, with ordinary cameras and lenses, that respond to visible light. Other sensors respond to electromagnetic radiation corresponding to other parts of the spectrum (e.g., infrared, X-rays, and microwaves), or to other forms of energy such as ionized high-energy particles (protons, electrons, alpha particles), ultrasound, and pressure (tactile sensors). Many applications employ multiple sensors; for example, navigation robots may be equipped with video camera(s), range, and tactile sensors, and the Landsat multispectral scanners collect data in bands of both visible light and infrared radiation. Sensor fusion is a current trend in many technologies and inferential procedures.
Regardless of the form of energy acquired and the specific processes of detecting, recording, and digitizing, the output of all sensors has a common structure: it is a finite collection of measurements, $y = \{y_t : t \in T\}$, indexed by a finite set $T$. With few exceptions (e.g., photon emission tomography), $y$ is a two-dimensional array, i.e., $T$ is a grid of points (pixels) in the two-dimensional image plane. Each $y_t$ is integer-valued or, as in the case of multispectral satellite and color images, a vector with integer-valued components. Except in photon counting devices, the values of $y$ may be regarded as "quantizations" of a continuum signal $g = \{g_t : t \in T\}$, which, in turn, is a discrete approximation or sampled version of a function $g(u)$, $u \in \mathbb{R}^2$, defined on the two-dimensional image plane (or some domain $S_0 \subset \mathbb{R}^2$). In addition to the errors introduced by digitization (= sampling + quantization), $g$ involves various types of degradation (e.g., blur and noise; see below). In the absence of these degradations and other artifacts, i.e., in an "ideal" system, we would observe an ideal energy pattern $f(u)$, $u \in \mathbb{R}^2$.

The data $y$ or $\{g_t : t \in T\}$ may be thought of as a representation of the physical scene being imaged. The task of computer vision is to estimate or infer properties ("attributes") of the scene on the basis of the data and a priori knowledge or expectations about the physical world. Attributes of interest may be the true pattern $f(u)$ itself (as in image restoration); geometric features (e.g., orientation, depth, curvature) of objects in the scene; or semantical labels for scene entities, such as in object recognition or remote sensing, in which regions are classified as "forest," "cropland," "water," and so on. The relation of specific attributes (e.g., shape) to the true pattern $f(u)$ is problem-specific, and some concrete cases will be treated in §§2.3 and 2.4.
More specifically, the true pattern $f(u)$ corresponds to the distribution of energy flux (radiance) emitted by objects, either because they are "illuminated" by an energy source, or because they are a primary source of energy themselves; it is often referred to as "scene intensity" or "brightness." The measured values $g$ correspond to the energy flux (or irradiance) intercepted, detected, and recorded by the sensor, and are usually referred to as "image intensities" or again simply as "brightness." Between emission and detection, various types of distortion and artifacts occur. These are usually lumped into three major categories (before digitization): blur, which may be introduced by scattering within the medium (e.g., atmosphere, human body), defocused camera lenses, or motion of the medium, objects, or cameras; noise, introduced by the random nature of photon propagation ("quantum noise") or by the sensing and recording devices (e.g., film grain noise or current fluctuations in electronic scanners); and nonlinear transformations of the signal (referred to as "radiometric distortion") introduced again by the sensing and recording devices.

These degradations amount to a transformation from $f(u)$ to $g(u)$. The most general transformation encountered is

$$g(u) = \psi\big(\phi[K(f)(u)],\ \eta(u)\big), \quad u \in S_0. \tag{2.1}$$

Here, $K$ accounts for blur and $K(f)(u) = \int K(u,v,f(v))\,dv$. In the linear case, $K(u,v,z) = K(u,v)z$, and $K(u,v)$ is called the point spread function (PSF); the function $\phi$ accounts for radiometric distortions, $\eta$ is a collection of noise processes, and $\psi$ defines the noise mechanism (e.g., additive, multiplicative). These parameters have been studied extensively for many sensors (e.g., Vidicon and CCD cameras) and media (e.g., the atmosphere and human tissues). We refer to [9] and references cited there for more details.
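To fix ideas, here is a minimal numerical sketch of the forward model (2.1); the Gaussian PSF, the power-law radiometric distortion $\phi$, and the additive Gaussian noise mechanism $\psi$ are illustrative assumptions of ours, not choices made in the chapter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(f, sigma_blur=2.0, gamma=0.8, sigma_noise=5.0, rng=None):
    """Discretized forward model g = psi(phi[K(f)], eta) of (2.1).
    Illustrative choices: K = Gaussian blur, phi = power-law
    radiometric distortion, psi = additive Gaussian noise eta."""
    rng = np.random.default_rng(rng)
    blurred = gaussian_filter(f, sigma=sigma_blur)              # K(f)
    distorted = 255.0 * (blurred / 255.0) ** gamma              # phi
    return distorted + rng.normal(0.0, sigma_noise, f.shape)    # psi(., eta)

# A synthetic 64 x 64 "scene": a bright square on a dark background.
f = np.zeros((64, 64)); f[20:40, 20:40] = 200.0
g = degrade(f, rng=0)
```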
2.3 Some Problems in Image Analysis
In this section we describe four specific problems, which are representative of those in low-level computer vision: (1) image restoration; (2) boundary detection; (3) tomographic reconstruction; and (4) three-dimensional shape reconstruction. These problems demonstrate the mathematical difficulties encountered in converting information which is implicit in the recorded digital image to explicit properties and descriptions of the physical world. By and large, these problems are naturally nonparametric and, as inverse problems, range from ill-conditioned to ill-posed. Consequently, as discussed in §2.1, one seeks a priori assumptions, or a priori information about the physical world (i.e., separate from the data and imaging equations), to constrain the set of possible, or at least plausible, solutions. An approach to some of these problems based on stochastic regularization is presented in §2.4. Other approaches are briefly indicated in §§2.3 and 2.4, and a few references are cited for more details. However, these problems have been studied extensively in the computer vision and engineering literature, and we refer the reader to [9] (and other surveys) for more complete references to related approaches based on stochastic regularization and Bayesian inference. Finally, the field of mathematical morphology [26], which is not considered here, offers a quite different methodology for some of these problems.
2.3.1 Image Restoration
The classical image restoration problem for intensity data is that of recovering a true, two-dimensional distribution of radiant energy, $f(u)$, $u \in \mathbb{R}^2$, from the actual recorded image values. In a "continuous-continuous" setup, the problem is posed by equation (2.1). However, since the number of recorded values is finite (in fact space-discretization is inherent in many sensors), the continuous-discrete formulation is more realistic. Ignoring, for the moment, quantization of the measurement values, we can assume the actual recorded data constitute a two-dimensional array $g = \{g_i : i \in S\}$ of positive real numbers on a regular grid $S = \{i = (i_1, i_2) : 1 \le i_1, i_2 \le N\}$, in fact an $N \times N$ integer lattice. Then (2.1) is replaced by

$$g_i = \psi\big(\phi[K(f)(i)],\ \eta_i\big), \quad i \in S. \tag{2.2}$$

Assuming the degradation mechanism to be known (or previously estimated), the problem of recovering $f(u)$ from $\{g_i\}$ is nonparametric and obviously ill-posed in general. For computational purposes, the domain of $f(u)$ is discretized into a larger integer lattice $S'$, concentric to $S$, of dimension $N' \times N'$, $N' > N$, and $f(u)$ is replaced by $f = \{f_i : i \in S'\}$. Then $K$ becomes a discrete representation of the point spread function, and for linear space-invariant systems we have

$$(Kf)_i = \sum_{j \in S'} K(i - j)\, f_j, \quad i \in S.$$

Assuming $K$ has bounded support (a reasonable assumption in most cases), $S'$ should be taken large enough so that the summation over $j \in S'$ includes all terms for which $K(i - j) > 0$. This is not exactly a convolution; one can imagine $f$ convolved with $K$ on the infinite lattice, but only observed on $S$ and reconstructed on $S'$.
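To make the lattice bookkeeping explicit, the following sketch (our own illustration) applies the discrete blur equation above: $f$ lives on the padded lattice $S'$, the PSF has bounded support, and the output is observed only on $S$.

```python
import numpy as np

def blur_on_S(f_prime, K):
    """(Kf)_i = sum_{j in S'} K(i-j) f_j for i in S, where f_prime is
    defined on the larger lattice S' (S padded by `pad` on every side)
    and K is a (2*pad+1) x (2*pad+1) PSF of bounded support."""
    pad = K.shape[0] // 2
    N = f_prime.shape[0] - 2 * pad            # size of the observed lattice S
    Kf = K[::-1, ::-1]                        # flip so the window sum matches K(i-j)
    g = np.zeros((N, N))
    for i1 in range(N):
        for i2 in range(N):
            window = f_prime[i1:i1 + 2*pad + 1, i2:i2 + 2*pad + 1]
            g[i1, i2] = np.sum(Kf * window)
    return g
```

For a symmetric PSF the flip is a no-op, and scipy.ndimage.correlate computes the same sums far more efficiently; the loop form is kept only to mirror the equation.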
The problem of estimating $\{f_i\}$ from $\{g_i\}$ is still ill-posed. To see this, consider the simple case of a linear model: $g = Kf + \eta$, with only blur and a single noise process. By relabeling the sites of $S$ and $S'$, we can regard $g$, $\eta$, and $f$ as vectors of dimension $N^2$, $N^2$, and $N'^2$, and $K$ as an $N^2 \times N'^2$ matrix. For $\eta = 0$ and $N < N'$, the problem is underdetermined and $K^{-1}$ is not well-defined. But even if $N = N'$ and $K$ were invertible (e.g., toroidal convolution), the matrix is usually nearly singular, so the existence of measurement and quantization errors then renders the problem unstable in the sense that the propagation of errors from the data to the solution is not controlled. Put differently, given $g$ and $K$, two images with blurred values very close to $g$ can be very different.
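The instability is easy to demonstrate numerically. The sketch below (a hypothetical one-dimensional toroidal blur with $N = N'$, our own illustration) shows that $K$ is invertible yet so badly conditioned that a tiny amount of noise destroys the naive solution.

```python
import numpy as np

n, sigma = 64, 2.0
d = np.minimum(np.arange(n), n - np.arange(n))       # circular distances
k = np.exp(-d**2 / (2 * sigma**2)); k /= k.sum()     # 1-D Gaussian PSF
K = np.array([np.roll(k, i) for i in range(n)])      # circulant (toroidal) blur matrix

f = np.zeros(n); f[20:30] = 1.0                      # true signal
g = K @ f + np.random.default_rng(0).normal(0, 1e-3, n)   # tiny noise

f_hat = np.linalg.solve(K, g)                        # naive inversion
print("cond(K) =", np.linalg.cond(K))                # enormous: K is nearly singular
print("rel. error =", np.linalg.norm(f_hat - f) / np.linalg.norm(f))  # error blows up
```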
Consequently, one seeks additional information to constrain or "regularize" the problem. Traditional approaches (see section 4.2 of [9] and the references there) may be divided into linear methods, such as the traditional Wiener filter and other constrained least-squares methods, and nonlinear methods, such as the maximum entropy technique. Linear methods are typically ill-conditioned, and the reconstructions are often compromised by large oscillatory errors. Maximum entropy has been extensively examined and is widely popular in certain areas, such as digital astronomy [16, 8, 19, 24]. The regularization underlying the standard constrained least-squares filter is equivalent to a Gaussian (improper) prior distribution, whereas that of maximum entropy is equivalent to a particular random field model whose variables interact only through their sum (= total energy). None of these methods addresses the crucial issue of discontinuities, which often convey much of the information in the image, and which are difficult to recover with standard methods. A general framework for nonlinear estimation based on spatial stochastic processes is outlined in §2.4.
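For comparison with the stochastic framework of §2.4, here is the constrained least-squares idea in its simplest (Tikhonov) form, continuing the one-dimensional sketch above (it reuses n, K, g, f); the first-difference smoothness penalty and the value of $\lambda$ are illustrative assumptions. It stabilizes the inversion, but, as noted above for linear methods generally, at the price of smoothing across discontinuities.

```python
# Constrained least squares: minimize ||K f - g||^2 + lam * ||D f||^2,
# where D is the circular first-difference operator (the smoothness constraint).
# (Continues the 1-D sketch above: uses n, K, g, f.)
D = np.eye(n) - np.roll(np.eye(n), 1, axis=1)
lam = 1e-3
f_reg = np.linalg.solve(K.T @ K + lam * D.T @ D, K.T @ g)
print("regularized rel. error =", np.linalg.norm(f_reg - f) / np.linalg.norm(f))
```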
2.3.2 Boundary Detection
The boundary detection problem is that of locating (discrete) contours in a digital image that correspond to sudden changes of physical properties of the underlying three-dimensional scene, such as surface shape, depth (occluding boundaries), surface composition (texture boundaries), and surface material. A common complication is that sharp physical boundaries may appear in an image as slow intensity transitions, or may not appear at all due to noise and other degradation effects. In addition, extraneous "edges" may appear from artifacts of the imaging system, such as sensor nonlinearities and digitization, or from "nonphysical" boundaries, such as shadows. Boundary classification refers to detecting and labeling boundaries according to their physical origins, but it is rarely attempted and appears essentially impossible, at least without additional information from multiple views or sensors, temporal sequences, or specific scene knowledge.

Segmentation is a closely related problem; one seeks to partition an image into disjoint regions (pixel classes) on the basis of local properties such as color, depth, texture, and surface orientation, or on the basis of more global (or even semantical) criteria, for instance involving dichotomies such as "object-background" or "benign-malignant." Clearly, each partition induces a unique boundary "map," whereas only boundary maps that are sufficiently organized yield useful segmentations.
These problems are studied extensively in computer vision, and there are many concrete applications. For example, texture is a dominant feature in remotely sensed images, and texture segmentation is important in the analysis of satellite data for resource classification, crop assessment, and geologic mapping. Other applications include automated navigation and industrial quality control; for example, in silicon wafer inspection, low-magnification views of memory arrays appear as highly structured textures. Nonetheless, most work on boundary detection and segmentation has been of a general nature, separated from specific problems, and regarded as a "preprocessing" step toward further analyses such as extracting three-dimensional shape attributes (see §2.3.4), object recognition, and full-scale scene interpretation. In the view of some researchers, including the authors, this modular approach to image analysis is highly suspect, and generic segmentation is overemphasized.

Algorithms abound for "edge detection," which refers to the problem of locating the individual changes in intensity independently of the overall scene geometry. Most of these algorithms are heuristic, partly because it is difficult to state these problems in precise mathematical terms. Traditional edge detectors [17, 20] are deterministic procedures based on (discrete) differential operators. Boundaries are then regarded as well-organized subsets of edges, and the construction of boundary maps is usually considered as a second phase in which the detected edges are "cleaned," "smoothed," and otherwise massaged into structures we associate with real boundaries. A variational method proposed in [22] for boundary detection has led to interesting mathematical problems, but its practical utility is uncertain. Statistical methods for segmentation in remote sensing are very prevalent [9, 25] and often successful. However, until recently most techniques employed nonspatial methods, such as linear discriminant analysis. Statistical methods that are truly spatial are currently gaining popularity, and an example involving texture segmentation is presented in §2.4.2.
2.3.3 Tomographic Reconstruction
Tomography is an imaging technology widely used in medical diagnosis (and also in industrial inspection and other areas). The two basic types are transmission and emission tomography. The most commonly used form of transmission tomography is the "CAT scan," whereby a radiation source rotates about the patient's body and bombards it with X-rays or other atomic particles. Those particles that pass through the body are counted, and an image (or series of images) is formed from the combined counts; fewer counts correspond to regions of higher attenuation, which may, for example, indicate the presence of tumors.

In emission tomography, a pharmaceutical product is combined with a radioactive isotope and directed to a location in the patient's body, usually by injection or inhalation. The pharmaceutical is selected so that its concentration at the target location is proportional to some organic function of interest, for example, metabolic activity or local blood flow. The objective is then to reconstruct the (internal) two-dimensional or three-dimensional isotope concentration based on counting the number of released photons that escape attenuation and are registered by arrays of detectors placed around the body. For instance, in positron emission tomography (PET), the isotope emits a positron, which, upon colliding with a nearby electron, produces two photons propagating in opposite directions.

From here on the focus is on single photon emission computed tomography (SPECT), in which the isotope releases a single photon each time a radioactive decay occurs. Arrays of collimators are placed around the area of interest and capture photons that escape attenuation and whose trajectories carry them down the collimator. The problem is then to reconstruct the isotope concentration from the photon counts and is similar to the inverse problems mentioned in §2.3.1.
The dominant physical effect is photon attenuation, by which photons are annihilated and their energy absorbed by body matter. Other significant effects are photon scattering and background radiation, as well as those effects induced by the sensors. Attenuation is accurately described by a single function $\mu$ whose values are known for bone, muscle, and so on, and for various photon energies. The incorporation of scattering and other effects is more subtle [13].
We formalize the SPECT reconstruction problem as follows [12, 27]. Let $x(u)$ be the isotope density defined on some domain $S_0$, which is usually taken as a two-dimensional region corresponding to a cross section of the body. The detectors $d_j$, $j = 1, \ldots, m$, are arranged on a linear array at equally spaced intervals, and the array is positioned at equally spaced angles $\theta_k$, $k = 1, \ldots, n$, for the same time duration at each angle. Let $T = \{(d_j, \theta_k) : j = 1, \ldots, m;\ k = 1, \ldots, n\}$. The total number of observed photons is an array $y = \{y_t : t \in T\}$. Assuming that the photons generated in the region $S_0$ are governed by a spatially nonhomogeneous Poisson process with mean $x(u)$ at the point $u \in S_0$, the observation $y$ is a realization from another nonhomogeneous Poisson process, $Y = \{Y_t : t \in T\}$, with mean

$$EY = Ax, \tag{2.3}$$

where the linear operator $A$ incorporates attenuation (via the attenuated Radon transform) and other physical effects [13]. Hence, the conditional probability distribution of $Y$ given $x$ is

$$P(Y = y \mid x) = \prod_{t \in T} \frac{(Ax)_t^{y_t}}{y_t!} \exp[-(Ax)_t]. \tag{2.4}$$

For computational purposes, the region $S_0$ is discretized into a grid $S$ of pixels. Then $x = \{x_i : i \in S\}$ represents a piecewise constant approximation of $x(u)$, and the operator $A$ becomes a matrix $A = (a_{t,i})$, $t \in T$, $i \in S$.

The oldest, and still current, method used for reconstructing $x$ from $y$ is backprojection, which essentially inverts (2.3), and is not very accurate. A more recent method [27] is that of maximum likelihood (ML), i.e., maximize $P(y \mid x)$ with respect to $x$, which is implemented with the expectation-maximization (EM) algorithm. However, it has been widely realized that the ML estimator is too "rough." Various procedures for smoothing this estimator have been suggested (see the references in [9]), including penalized ML and the method of sieves. Another alternative within the Bayesian framework is described in §2.4.2.
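For concreteness, here is a minimal sketch of the Shepp-Vardi ML-EM iteration for model (2.4); the small random matrix $A$ is a stand-in for the attenuated Radon transform, which in practice is assembled from the scanner geometry and the attenuation map.

```python
import numpy as np

def ml_em(A, y, n_iter=50):
    """ML-EM for the Poisson model (2.4):
    x_i <- (x_i / sum_t a_ti) * sum_t a_ti * y_t / (Ax)_t."""
    x = np.ones(A.shape[1])                    # strictly positive start
    col_sums = A.sum(axis=0)                   # sum_t a_ti (assumed > 0)
    for _ in range(n_iter):
        ratio = y / np.maximum(A @ x, 1e-12)   # y_t / (Ax)_t
        x *= (A.T @ ratio) / col_sums
    return x

# Toy problem: 120 detector bins, 64 pixels, Poisson counts with mean Ax.
rng = np.random.default_rng(1)
A = rng.random((120, 64))
x_true = 10.0 * rng.random(64)
y = rng.poisson(A @ x_true).astype(float)
x_hat = ml_em(A, y)
```

Each iterate stays nonnegative and increases the likelihood; run long enough, however, the iterates develop exactly the "roughness" described above, which motivates the smoothing priors of §2.4.2.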
2.3.4 Three-Dimensional Shape Reconstruction
The problem of estimating or reconstructing properties of three-dimensional surfaces, such as orientation, height, Gaussian curvature, and Euler characteristics, from digital images is also widely studied in computer vision. In a common paradigm, especially in robot vision, these features are regarded as indispensable steps toward object recognition and other goals; the extraction of geometric features is followed by the imposition of relational structures among the features (using "grammars," symbolic graphs, and so on), and in turn by matching these data structures against stored models of real objects and spatial relationships among them. Again, the paradigm consists of distinct steps, and the extraction of geometric features is itself preceded by "preprocessing," which encompasses noise removal and other aspects of restoration, edge and boundary detection, and perhaps segmentation.
The main sources of information ("cues") for three-dimensional shape reconstruction are the intensity changes themselves ("shape-from-shading"), pairs of images corresponding to multiple views ("stereopsis") or wavelengths, motion sequences ("shape-from-motion"), texture analysis ("shape-from-texture"), and range data. Stereopsis is thought to be the most important process by which human beings obtain depth information, but has played a lesser role in automated systems due to computational demands and lack of accuracy. In contrast, shading has played a larger role in machine vision than in human vision, and range data is becoming an important source of information for automated systems but is unavailable to humans.

The mathematical problems encountered in three-dimensional shape reconstruction are similar to those in §§2.3.1-2.3.3. We conclude §2.3 with two examples in more detail.
Stereopsis
Stereo vision uses two cameras (corresponding to our two eyes) and a single light source. The relative difference, or so-called disparity, in the positions of an object in the two images is useful information for extracting surface orientation and relative distances between objects. A physical point in the viewed

[... pages 20-25 of the chapter are not included in this excerpt; the text resumes mid-sentence in §2.4.2 ...]
pixel lattice by $S^P = \{i = (i_1, i_2) : 1 \le i_1, i_2 \le N\}$, the gray-level intensity process by $X^P = \{X_i^P : i \in S^P\}$, and the observation process by $Y = \{Y_i : i \in S^P\}$. Then

$$Y_i = \psi\big(\phi[(K X^P)_i],\ \eta_i\big), \quad i \in S^P. \tag{2.12}$$

We assume that $K$, $\phi$, $\psi$, and the law of $\eta$ are known.
The basic idea proposed in [10] is to use MRF models to account for the spatial coherence among the intensity levels of nearby pixels, as well as for the existence of discontinuities between intensity surfaces. To this end, in addition to the intensity process $X^P$, a second "edge" process $X^E$ is introduced. The process $X^E$ accounts for discontinuities, and is indexed by the dual lattice $S^E$ of $S^P$; $S^E$ consists of all nearest-neighbor pairs $\langle i, j \rangle$ from $S^P$, and an element ("site") $t \in S^E$ corresponds to a putative edge between the corresponding pixel sites. Thus, $X^E = \{X_t^E : t \in S^E\} = \{X_{\langle i,j \rangle}^E : i, j \in S^P\}$, $X_t^E \in \{0, 1\}$, where $X_t^E = 1$ (resp. 0) indicates presence (resp. absence) of an edge at $t \in S^E$. The process $X^E$ is neither part of the data nor the target of estimation; rather, it is an auxiliary process designed to bring exogenous information into the model, and it is coupled to $X^P$ in such a manner that in the likely states of the joint probability distribution of $X = (X^P, X^E)$, the intensity function is locally smooth with possibly sharp transitions, and the locations of the edges satisfy our a priori expectations about the behavior of boundaries.
The process $X = (X^P, X^E)$ is indexed by $S = S^P \cup S^E$, and is chosen to be a MRF with energy function of the form

$$U(x) = U(x^P, x^E) = U_1(x^P, x^E) + U_2(x^E), \tag{2.13}$$

where $U_1$ reflects our expectations about interactions between intensities and edges (where edges "belong"), while $U_2$ reflects boundary organization. Both terms are constructed from "local energies" corresponding to a neighborhood system $\mathcal{N} = \{\mathcal{N}_s : s \in S\}$. The simplest neighborhood system ("nearest neighbors") is shown in Figure 2.1, where dots denote pixels and crosses edge sites.
The energy $U_1(x^P, x^E)$ is defined so that the low-energy states will have $x_{\langle i,j \rangle}^E = 1$ (resp. 0) when $|x_i^P - x_j^P|$ is large (resp. small). More specifically,

$$U_1(x^P, x^E) = \theta_1 \sum_{\langle i,j \rangle} \phi(x_i^P - x_j^P)\,\big(1 - x_{\langle i,j \rangle}^E\big), \tag{2.14}$$

with $\theta_1 > 0$, $\phi(0) = -1$, and $\phi$ even and nondecreasing on $[0, \infty)$. Note that when $x_{\langle i,j \rangle}^E = 1$, the bond ("interaction") between pixels $i$ and $j$ is broken;
[Figure 2.1: the nearest-neighbor system on $S = S^P \cup S^E$; dots denote pixel sites and crosses denote edge sites, with $|\mathcal{N}_i| = 8$ for $i \in S^P$ and $|\mathcal{N}_t| = 8$ for $t \in S^E$.]
also, the properties of $\phi$ ensure that when $x_i^P = x_j^P$, $|i - j| = 1$, then $x_{\langle i,j \rangle}^E = 0$ is a lower energy state than $x_{\langle i,j \rangle}^E = 1$ and, in the absence of boundaries, there is cooperation between nearby pixel intensities. Note also that when $\phi(x_i^P - x_j^P) = 0$, we have no preference about the state of an edge at site $\langle i, j \rangle$. A specific $\phi(\cdot)$ that has been used in restoration problems [9] is

$$\phi(\xi) = 1 - \frac{2}{1 + (\xi/\delta)^2}, \tag{2.15}$$

where $\delta$ is a scaling constant.
The energy $U_2(x^E)$ reflects our prior expectations about boundaries: most pixels are not at boundary regions; boundaries are usually persistent (no isolated or abandoned segments); boundary intersections (junctions), sharp turns, and "small structures" are relatively unlikely. For specific choices of $U_2(x^E)$, we refer to [9] (and references cited there). Some of the above generic properties of boundaries may also be captured with penalty functions (see (2.7)). For problems, such as image restoration and shape-from-shading, where the edge process $X^E$ is auxiliary, using "soft constraints" such as $U_2(x^E)$ is satisfactory, whereas for problems in which the boundaries are of central interest (e.g., texture segmentation; see the next subsection), penalty functions are often more appropriate.
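As a concrete rendering of (2.13)-(2.15), the sketch below evaluates the coupled energy on an $N \times N$ image with horizontal and vertical edge variables; the simple per-edge penalty standing in for $U_2$ is our illustrative assumption (the boundary-organization terms of [9] are richer).

```python
import numpy as np

def phi(xi, delta=10.0):
    """The potential (2.15): even, phi(0) = -1, phi -> +1 for large |xi|,
    with zero crossing at |xi| = delta."""
    return 1.0 - 2.0 / (1.0 + (xi / delta) ** 2)

def energy(xP, xE_h, xE_v, theta1=1.0, theta2=0.5, delta=10.0):
    """U(x) = U1(xP, xE) + U2(xE) of (2.13)-(2.14).
    xP: N x N intensities; xE_h (N x N-1) and xE_v (N-1 x N) are 0-1
    edge variables on horizontal / vertical nearest-neighbor pairs.
    U2 simply charges theta2 per active edge (illustrative stand-in)."""
    dh = xP[:, 1:] - xP[:, :-1]                      # horizontal differences
    dv = xP[1:, :] - xP[:-1, :]                      # vertical differences
    U1 = theta1 * (np.sum(phi(dh, delta) * (1 - xE_h)) +
                   np.sum(phi(dv, delta) * (1 - xE_v)))
    U2 = theta2 * (xE_h.sum() + xE_v.sum())
    return U1 + U2
```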
The energy function (2.13) determines the prior $\pi(x)$. Assuming that the noise process $\{\eta_i\}$ is independent of $X = (X^P, X^E)$, the transformation (2.12) determines the degradation model $P(Y \mid X)$ in a straightforward manner. The posterior distribution $\pi(X \mid Y)$ is then computed from the joint distribution $P(Y \mid X)\,\pi(X)$. For example, suppose that $\eta_i$ is Gaussian with mean $\mu$ and variance $\sigma^2$, and that $b \mapsto \psi(a, b)$ is invertible, say, $\eta_i = F(y_i, \phi[(Kx^P)_i])$ for some $F(u, v)$. Then, assuming that $F(u, v)$ is strictly increasing and $F_1(u, v) = \frac{\partial}{\partial u} F(u, v)$ exists, the posterior energy (see (2.8)) is given by [9]

$$U(x \mid y) = U(x^P, x^E) + \frac{1}{2\sigma^2} \sum_{i \in S^P} \big[F(y_i, \phi[(Kx^P)_i]) - \mu\big]^2 - \sum_{i \in S^P} \log F_1\big(y_i, \phi[(Kx^P)_i]\big). \tag{2.16}$$
Boundary Detection
We summarize the procedure developed in [11] for detecting textural boundaries. The method also applies to the problem of locating boundaries representing sudden changes in depth and surface orientation.

The procedure involves two processes: the intensity process $X^P = \{X_i^P : i \in S^P\}$, as in the first subsection of §2.4.2, and the boundary process $X^B = \{X_t^B : t \in S^B\}$, which is indexed by a regular lattice $S^B$ with sites interspersed among the pixels of $S^P$. Again, $X_t^B \in \{0, 1\}$, with $X_t^B = 1$ (resp. 0) indicating presence (resp. absence) of a boundary at $t \in S^B$. The prior distribution for $X = (X^P, X^B)$ is chosen to be of the form (2.7) with $U(x) = U(x^P, x^B)$ and $V(x) = V(x^B)$.

The intensity-boundary term $U(x^P, x^B)$ is chosen to promote placement of boundaries between regions in the image that demonstrate distinct spatial patterns. In [11] it was chosen to be of the form

$$U(x^P, x^B) = \sum_{\langle t,s \rangle} \Phi_{t,s}(x^P)\,\big(1 - x_t^B x_s^B\big), \tag{2.17}$$

where $\langle t, s \rangle$ denotes nearest neighbors in $S^B$. The function $\Phi_{t,s}$ is critical; it is a measure of disparity between the gray-level values in two blocks of pixels adjacent to $\langle t, s \rangle$; large disparities ($\Phi \gg 0$) encourage the presence of an active boundary (i.e., $x_t^B = x_s^B = 1$), while small disparities ($\Phi \ll 0$) discourage the presence of a boundary (i.e., $x_t^B x_s^B = 0$). Specific choices of $\Phi_{t,s}$ in [11] are constructed in terms of the Kolmogorov-Smirnov distance applied to either the raw data ("first-order" statistics) or to transformed data corresponding to higher-order statistics (e.g., window means, range, variance, and "directional residuals"). As a function of $x^B$, (2.17) is similar to "spin glass" models in statistical mechanics.
The penalty function $V(x^B)$ is chosen to inhibit unwanted configurations such as blind endings of boundaries, redundant boundaries, sharp turns, and other forbidden patterns (see [11] for details).
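Here is a minimal sketch of a disparity measure $\Phi_{t,s}$ of the first-order kind: the Kolmogorov-Smirnov distance between the gray-level distributions of the two blocks, shifted by a threshold (our illustrative choice) so that similar blocks yield $\Phi \ll 0$ and dissimilar blocks $\Phi \gg 0$, as (2.17) requires.

```python
import numpy as np

def ks_distance(block_a, block_b):
    """Two-sample Kolmogorov-Smirnov distance between the empirical
    gray-level (first-order) distributions of two pixel blocks."""
    a, b = np.sort(block_a.ravel()), np.sort(block_b.ravel())
    values = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, values, side="right") / a.size
    cdf_b = np.searchsorted(b, values, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()

def disparity(block_a, block_b, threshold=0.35):
    """Illustrative Phi_{t,s}: positive for dissimilar blocks, negative
    for similar ones, so minimizing (2.17) activates boundaries only
    between genuinely different textures."""
    return ks_distance(block_a, block_b) - threshold
```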
Assuming that there are no degradations that preclude direct observation of $x^P$, the data $y \equiv x^P$. The degradation model $P(Y \mid X)$ is singular (the point mass at $y = x^P$), and the posterior energy is $U(x^B \mid y = x^P) = U(y = x^P, x^B)$. The MAP estimate is equivalent to global optimization with constraints.
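The optimization in [11] is carried out by stochastic relaxation with constraints; the following is a minimal unconstrained sketch of annealed site-by-site Gibbs sampling for a binary field with an arbitrary energy function (the constraint handling of [11] is omitted).

```python
import numpy as np

def anneal(U, x0, n_sweeps=60, T0=4.0, cooling=0.95, rng=None):
    """Annealed stochastic relaxation: visit each site, sample its 0/1
    value from the conditional distribution proportional to exp(-U/T),
    and lower the temperature T between sweeps."""
    rng = np.random.default_rng(rng)
    x, T = x0.copy(), T0
    for _ in range(n_sweeps):
        for t in np.ndindex(x.shape):          # raster-scan site visits
            u = np.empty(2)
            for v in (0, 1):                   # energy of each candidate state
                x[t] = v
                u[v] = U(x)
            # P(x_t = 1 | rest) = 1 / (1 + exp((U_1 - U_0) / T))
            x[t] = int(rng.random() < 1.0 / (1.0 + np.exp((u[1] - u[0]) / T)))
        T *= cooling                           # geometric cooling schedule
    return x
```

Recomputing the global energy at each site keeps the sketch short; a real implementation evaluates only the local energy change, since the field is Markov.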
Plates 2.1 and 2.2 (from [11]) show two experiments. Plate 2.1 is an airborne synthetic aperture radar (SAR) image of ice floes in the ocean: (a) original image, 512 x 512; (b) sixteen "snapshots" from sixty sweeps of stochastic relaxation with constraints. Plate 2.2 is a collage composed of nine Brodatz textures: leather, grass, and pigskin (top row), raffia, wool, and straw (middle row), and water, wood, and sand (bottom row). Two of the textures, leather and water, are repeated in the two circles: (a) original, 384 x 384; (b) detected boundaries obtained by deterministic (left) and stochastic (right) algorithms.
Single Photon Emission Tomography
The digitized isotope intensity (see §2.3.3) is thought to be a realization of a spatial process $X = \{X_i : i \in S\}$. The idea of [12, 13] is to use a Gibbs prior to reflect the common observation that neighboring locations of the isotope surface typically have similar concentration levels, whereas sharp changes in concentration may occur, for instance, across an arterial wall or across a boundary between two tissue types. In contrast to the procedure in restoration, sharp changes are not represented explicitly by an edge or boundary process; instead, the intensity model is designed to allow such changes. Specifically (cf. [12]),

$$U(x) = \beta \sum_{\langle i,j \rangle} \phi(x_i - x_j) + \frac{\beta}{\sqrt{2}} \sum_{[i,j]} \phi(x_i - x_j), \tag{2.18}$$

where $\langle i, j \rangle$ denotes a nearest-neighbor bond, $[i, j]$ represents a nearest-neighbor diagonal bond, and $\phi$ is given by (2.15).
The degradation model is given by (2.4), and the posterior energy is then

$$U(x \mid y) = U(x) + \sum_{t \in T} \big[(Ax)_t - y_t \log (Ax)_t\big].$$

Although $U(x)$ has a local structure, the graph for $U(x \mid y)$ is highly nonlocal due to $A$.
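Putting the pieces together, here is a sketch of the SPECT posterior energy assembled from (2.15), (2.18), and the Poisson likelihood; the system matrix $A$ follows the conventions of the ML-EM sketch in §2.3.3.

```python
import numpy as np

def phi(xi, delta=1.0):
    """The potential (2.15)."""
    return 1.0 - 2.0 / (1.0 + (xi / delta) ** 2)

def prior_energy(x, beta=2.7, delta=1.0):
    """U(x) of (2.18): nearest-neighbor bonds plus diagonal bonds
    weighted by 1/sqrt(2), on an N x N pixel grid x."""
    U = phi(x[:, 1:] - x[:, :-1], delta).sum()           # horizontal bonds
    U += phi(x[1:, :] - x[:-1, :], delta).sum()          # vertical bonds
    U += (phi(x[1:, 1:] - x[:-1, :-1], delta).sum() +
          phi(x[1:, :-1] - x[:-1, 1:], delta).sum()) / np.sqrt(2)  # diagonals
    return beta * U

def posterior_energy(x, A, y, beta=2.7, delta=1.0):
    """U(x|y) = U(x) + sum_t [(Ax)_t - y_t log (Ax)_t]."""
    Ax = np.maximum(A @ x.ravel(), 1e-12)                # guard against log 0
    return prior_energy(x, beta, delta) + np.sum(Ax - y * np.log(Ax))
```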
The choice of the "smoothing" parameter $\beta$ is critical. For $\beta = 0$, the MAP estimator is just the ML estimator and, hence, typically too rough. For large $\beta$, the MAP estimator is too faithful to the prior and, hence, typically too smooth. The value of $\beta$ is estimated in [12, 13] via the EM algorithm or the method of moments. The parameter $\delta$ in $\phi$ is also important, but its statistical estimation from the data appears to be difficult. Fortunately, reconstructions are not sensitive to moderate changes in $\delta$, and empirical values based on information about the range of the intensities work well [12].
Plates 2.3, 2.4, and 2.5 show three experiments from [13] with real (hospital) data, using the ICE algorithm. For comparison, the reconstruction with the filtered back projection (FBP) method is also shown. In all three cases $\beta$ is estimated by the ML method. Plate 2.3 shows a slice of a human skull across the eyes: (a) FBP, (b) ICE with $\beta = 2.7$. Note that in (b) one can distinguish details such as the nose bone, eyes, and brain region; also, the skull border is sharp. Plate 2.4 displays a SPECT reconstruction of a simulated phantom. The model used in this experiment was developed by the Nuclear Medicine Department of the University of Massachusetts Medical Center, in Worcester. This is a comprehensive model that captures the effects of photon scattering, photon attenuation, camera geometry, and quantum noise: (a) original phantom, (b) FBP reconstruction, (c) ICE reconstruction with $\beta = 1$. Plate 2.5 is a human liver/spleen scan: (a) FBP, (b) ICE with $\beta = 3$. The value $\beta = 3$ is the ML estimate; (c) and (d) are ICE reconstructions with $\beta = 0$ and $\beta = 20$, respectively; they demonstrate the significance of the parameter $\beta$.
Shape-From-Shading
We focus on the estimation of surface orientation. For simplicity, we assume that $S$ and $\rho$ are known, that the reflectance map is spatially homogeneous and known, and that $V$ is constant throughout the image (orthographic projection). However, the procedure presented below can be modified [28] to estimate $S$ and $\rho$ as well. There are three basic processes: the true (undegraded) intensity process $X^P = \{X_i^P : i \in S\}$, the shape process $N = \{N_i : i \in S\}$, where $N_i$ is the unit normal at the surface "point" corresponding to pixel $i$, and the observation process $Y = \{Y_i : i \in S\}$. Here, $S$ is a discretization of the domain of $z(u)$ (see §2.3.4). The shape process $N$ (the target of estimation) is related to $X^P$ via the discrete version of (2.5), and $X^P$ is related to $Y$ via (2.12). $X^P$ is an indeterminate process and will play no direct role here.
In the absence of degradation we have $Y_i = R(N_i)$, which is a deterministic constraint on $N$. The shape-from-shading problems usually refer (see [17]) to this case (i.e., observable $X^P$), and we refer to [17, 18] for various approaches to the problem, including deterministic regularization techniques.
The procedure below was developed in [28] and applies to both undegraded and degraded data. The basic idea is to use Gibbs distributions to articulate general properties of shapes: surfaces are locally smooth, and orientation may exhibit jumps because of changes in depth or the presence of surface discontinuities. As in restoration, the process $N$ is coupled to an edge process $X^E = \{X_t^E : t \in S^E\}$, where $X^E$ and $S^E$ are as in the first subsection of §2.4.2. The coupled process $X = (N, X^E)$ is a MRF with an energy function $U(x) = U_1(N, x^E) + U_2(x^E)$ (compare with (2.13)), where $U_2(x^E)$ is chosen as in the first subsection of §2.4.2, and

$$U_1(N, x^E) = \theta_1 \sum_{\langle i,j \rangle} \big(1 - x_{\langle i,j \rangle}^E\big)\,\phi(\|N_i - N_j\|) + \theta_2 \sum_{[i,j]} \big(1 - h_{[i,j]}\big)\,\phi(\|N_i - N_j\|), \tag{2.19}$$

with $\theta_1, \theta_2 > 0$, $\langle i,j \rangle$ and $[i,j]$ as in (2.18), and $h_{[i,j]} = 1$ if any of the products $x_{t_1}^E x_{t_2}^E$, $x_{t_2}^E x_{t_3}^E$, $x_{t_3}^E x_{t_4}^E$, $x_{t_4}^E x_{t_1}^E$ equals 1, and zero otherwise; here $i$, $j$, $t_1$, $t_2$, $t_3$, $t_4$ are as in Figure 2.2.
[Figure 2.2: a diagonal pixel pair $i$, $j$ with the four surrounding edge sites $t_1$, $t_2$, $t_3$, $t_4$.]
The function $\phi$ was chosen in [28] to be $\phi(\|N_i - N_j\|) = -1 + \tfrac{1}{2}\|N_i - N_j\|^2 = -N_i \cdot N_j$. Because of the constraint $\|N_i\| = 1$, the prior resulting from this choice of $\phi$ is non-Gaussian even if $\theta_2 = 0$ and $U_2 \equiv 0$. In fact, model (2.19) has worked well in some cases [28] even without the edge process (i.e., $X^E \equiv 0$).
In the absence of degradation, the distribution $P(Y \mid X)$ is degenerate and amounts to the constraint $V_1(N) = \sum_{i \in S} [y_i - R(N_i)]^2 = 0$. In the presence of degradation, $P(Y \mid X)$ is computed as in the first subsection of §2.4.2. In both cases, there is an extra deterministic constraint $V_2(N) = 0$ corresponding to the integrability condition. In the former case, the shape-from-shading problem reduces to minimizing $U(N, x^E)$ under the constraint $V_1(N) + V_2(N) = 0$; in the latter case, to minimizing the posterior energy $U(N, x^E)$ under the constraint $V_2(N) = 0$.
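Here is a minimal sketch of the degraded-data case with the edge process switched off ($X^E \equiv 0$), which the text notes can already work well. The smoothness term uses $\phi(\|N_i - N_j\|) = -N_i \cdot N_j$; the Lambertian reflectance map $R(N) = \rho \max(0, S \cdot N)$ is an illustrative assumption of ours, and the constraint $V_1(N) = 0$ is folded into a quadratic penalty for simplicity.

```python
import numpy as np

def R_lambertian(N, S_dir, rho=1.0):
    """Illustrative reflectance map R(N) = rho * max(0, S . N).
    N: (H, W, 3) array of unit surface normals; S_dir: unit light direction."""
    return rho * np.maximum(0.0, N @ S_dir)

def sfs_energy(N, y, S_dir, theta1=1.0, lam=100.0, rho=1.0):
    """Penalized objective U_1(N) + lam * V_1(N), with
    U_1(N) = theta1 * sum over nearest-neighbor pairs of -N_i . N_j
    and V_1(N) = sum_i (y_i - R(N_i))^2 (the data-fidelity constraint)."""
    U1 = -theta1 * (np.sum(N[:, 1:] * N[:, :-1]) +     # horizontal pairs
                    np.sum(N[1:, :] * N[:-1, :]))      # vertical pairs
    V1 = np.sum((y - R_lambertian(N, S_dir, rho)) ** 2)
    return U1 + lam * V1
```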
The procedure does not require boundary conditions. However, in the absence of degradation and with a single image, the quality of reconstruction is better [28] if one assumes correct boundary conditions along the image border. In the presence of noise, the results are satisfactory if one uses two images (photostereo) obtained with a single camera but two light sources of different origin.

Plate 2.6 shows an experiment from [28] with an egg imaged under uncontrolled illumination. The surface of the egg was assumed to be matte, and the algorithm estimated, in addition to $N$, the albedo $\rho$ and an effective light source direction $S$. The reconstruction used a combination of constrained annealing and ICM: (a) original image, 64 x 64; (b) reconstruction; (c) reconstructed egg illuminated from the x-direction; and (d) reconstructed egg illuminated from the y-direction.
Deformable Templates
In this subsection we briefly describe a powerful and elegant methodology introduced by Ulf Grenander for pattern synthesis and analysis of biological shapes. It provides a promising geometric/Bayesian paradigm for medical and industrial applications.

The procedure [6] is based on global shape models designed to incorporate prior (biological) knowledge about the variability and global properties of shapes, and quantitative information about the variability of the geometric object. The shape model has three basic components: (a) a "geometry" consisting of a space $G$ of generators, a connector graph $\sigma$, a "regularity" relation $R$, and a transformation group $S : G \to G$; (b) a template (or templates) describing the overall architecture of the shape; and (c) a group-valued stochastic (typically Markov) process, which articulates the statistical variations of the shape. The choices of the template, transformation group, and stochastic process control the desired global and local geometric properties of shapes. The choice of (a) is application-specific.

We refer to [6] for the general framework, experiments, and references. Here we outline the method for the special case of two-dimensional (planar) shapes (as in the HANDS experiment [6]). Assuming that all the relevant information is contained in the boundary, and approximating the boundary
by a polygon with, say, $n$ edges, the components of (a) are as follows: $g_j \in G$, $j = 0, 1, \ldots, n-1$, are polygonal edges; $\sigma$ is a cyclic graph consisting of the $n$ nodes; $R$ may, for example, be the condition that the polygon is closed and simply connected (other regularity conditions are often desirable); $S$ may be chosen to be the general linear group $GL(2)$, or the Euclidean group (i.e., rotations $O(2)$ and translations), or the group $US(2)$ of uniform dilations (scale), or $US(2) \times O(2)$.

The configuration space of interest is $C(R) = \{c = (g_0, \ldots, g_{n-1}) : g_j \in G,\ j = 0, \ldots, n-1,\ c \text{ satisfies } R\}$, i.e., the set of boundaries of closed, simply connected polygons. The interior of a polygon defines a pure image $I$. A template is a specific configuration $c^{(0)} = (g_0^{(0)}, \ldots, g_{n-1}^{(0)}) \in C(R)$.
[...]

Suppose we observe a degraded version $I^D$ of a true pure image $I$, which is our target of estimation. The degradation mechanism $I \to I^D$ defines the degradation model $P(I^D \mid I)$ or, equivalently, $P(I^D \mid c)$, as in the first subsection of §2.4.2.
Then the posterior distribution $P(c \mid I^D)$ is computed as before, and contains all the relevant information about the unknown image $I$ (equivalently, $c$). The computational burden of estimating $I$ is demanding, but feasible. Asymptotic arguments and other specific properties have been exploited in [6] to reduce the computations. One important aspect of the approach is that it can often be combined with dynamic programming to speed up the processing considerably.
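To convey the synthesis side of the formalism, here is a toy sketch under stated assumptions: the template is a regular $n$-gon given by its edge vectors, each generator is perturbed by an independent random dilation-rotation (an element of $US(2) \times O(2)$; [6] uses a Markov process along the cyclic graph instead), and the closure relation $R$ is restored by redistributing the residual over the edges.

```python
import numpy as np

def deform_template(edges, scale_sd=0.1, angle_sd=0.15, rng=None):
    """Perturb each polygonal edge vector g_j by a random uniform
    dilation and rotation, then re-impose the closure relation R
    (the edge vectors of a closed polygon must sum to zero)."""
    rng = np.random.default_rng(rng)
    n = len(edges)
    s = np.exp(rng.normal(0.0, scale_sd, n))     # random dilations in US(2)
    a = rng.normal(0.0, angle_sd, n)             # random rotation angles in O(2)
    c, si = np.cos(a), np.sin(a)
    out = np.empty_like(edges)
    out[:, 0] = s * (c * edges[:, 0] - si * edges[:, 1])
    out[:, 1] = s * (si * edges[:, 0] + c * edges[:, 1])
    return out - out.sum(axis=0) / n             # closure: edges sum to zero

# Template c(0): a regular 12-gon, represented by its n edge vectors.
n = 12
t = 2 * np.pi * np.arange(n) / n
template = np.stack([np.cos(t), np.sin(t)], axis=1)
vertices = np.cumsum(deform_template(template, rng=0), axis=0)  # sampled boundary
```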
Bibliography
[1] Aarts, E., and J. Korst, Simulated Annealing and Boltzmann Machines, John Wiley and Sons, 1989.

[2] Almeida, M., and B. Gidas, A variational method for estimating the parameters of MRF from complete or incomplete data, preprint, Brown University, 1989.

[3] Besag, J., Spatial interaction and the statistical analysis of lattice systems (with discussion), J. R. Stat. Soc. B 36 (1974), 192-236.

[4] Besag, J., On the statistical analysis of dirty pictures (with discussion), J. R. Stat. Soc. B 48 (1986), 259-302.

[5] Besag, J., Towards Bayesian image analysis, J. Appl. Stat. 16 (1989), 395-407.

[6] Chow, Y., U. Grenander, and D. Keenan, HANDS: A Pattern Theoretic Study of Biological Shapes, to be published by Springer-Verlag, 1990.
[7] Comets, F., and B. Gidas, Parameter estimation for Gibbs distributions from partially observed data, submitted to Ann. Appl. Prob. (1989).

[8] Frieden, B. R., Restoring with maximum likelihood and maximum entropy, J. Opt. Soc. Amer. 62 (1972), 511-518.

[9] Geman, D., Random Fields and Inverse Problems in Imaging, to appear in Lecture Notes in Mathematics, Springer-Verlag, 1990.

[10] Geman, S., and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (1984), 721-741.

[11] Geman, D., S. Geman, C. Graffigne, and P. Dong, Boundary detection by constrained optimization, IEEE Trans. Pattern Anal. Machine Intell. 12 (1990), 609-628.

[12] Geman, S., and D. McClure, Statistical methods for tomographic image reconstruction, in Proceedings of the 46th Session of the Int. Stat. Institute, Bulletin Int. Stat. Inst. 52 (1985).

[13] Geman, S., D. McClure, K. Manbeck, and J. Mertus, Comprehensive statistical model for single photon emission computed tomography, preprint, Brown University, 1990.
[14] Gidas, B., A renormalization group approach to image processing problems, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989), 164-180.

[15] Grenander, U., Tutorial in Pattern Theory, Technical Report, Brown University, 1983.

[16] Gull, S. F., and J. Skilling, Maximum entropy methods in image processing, IEE Proc. 131 (1984), 646-659.

[17] Horn, B. K. P., Robot Vision, MIT Press, 1986.

[18] Horn, B. K. P., Height and Gradient from Shading, MIT A.I. Memo No. 1105.

[19] Jaynes, E. T., On the rationale of maximum entropy methods, Proc. IEEE 70 (1982), 939-952.

[20] Marr, D., Vision, W. H. Freeman, 1982.
[21] Marroquin, J. L., S. Mitter, and T. Poggio, Probabilistic solution of ill-posed problems in computational vision, J. Am. Stat. Assoc. 82 (1987), 76-89.

[22] Mumford, D., and J. Shah, Boundary detection by minimizing functionals, talk presented at the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 1985.

[23] Poggio, T., V. Torre, and C. Koch, Computational vision and regularization theory, Nature 317 (1985), 314-319.

[24] Ripley, B. D., Statistical Inference for Spatial Processes, Cambridge University Press, 1988.

[25] Ripley, B. D., Statistics, images, and pattern recognition, Can. J. Stat. 14 (1986), 83-111.

[26] Serra, J., Image Analysis and Mathematical Morphology, Academic Press, New York, 1982.

[27] Shepp, L. A., and Y. Vardi, Maximum likelihood reconstruction in positron emission tomography, IEEE Trans. Medical Imaging 1 (1982), 113-122.

[28] Torreao, J., A Bayesian Approach to Three-Dimensional Shape Estimation for Robot Vision, thesis, Brown University, 1989.