Testing and Evaluation Examples
Large-scale biometric systems traditionally undergo a series of tests beyond technology and scenario testing. These large-scale system tests are typically at the system level, not just the biometric subsystem level, and occur multiple times in the life of a system in such forms as factory acceptance tests before shipment, site or system acceptance tests before initiating operations, and in-use tests to ensure that performance remains at acceptable levels and/or to reset thresholds or other technical parameters.
The following examples outline aspects of testing from three real-world, large-scale systems: the FBI’s Integrated Automated Fingerprint Identification System (IAFIS), Disney’s entrance control system, and the U.S. Army’s Biometrics Automated Toolkit (BAT). Numerous other systems offer lessons as well, including the DOD’s Common Access Card, US-VISIT, and California’s Statewide Fingerprint Imaging System (CA SFIS). The inclusion or exclusion of systems in this discussion is not meant to convey a judgment of any sort.
THE FBI’S IAFIS SYSTEM
IAFIS underwent numerous tests before and after deployment in 1998 and 1999. After deployment the FBI tracked performance for 5 years to determine the threshold for automatic hit decisions. After measuring and analyzing the performance data, the FBI was able to say with confidence that every candidate whose 10-fingerprint to 10-fingerprint search score exceeded a certain level had been declared a match (or hit) by the examiners. As a result of this analysis the FBI was able to let IAFIS automatically declare as matches about 25 percent of the hits that previously required human intervention. This test used routine operational data in an operational environment and was not orchestrated to include any controls or prescreening on the input data. The transactions were run through the system normally, and match decisions were made by human examiners working with candidates presented by the IAFIS automated matchers.
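The threshold analysis described above can be illustrated with a minimal sketch. It is not the FBI's actual procedure: the `Candidate` record, the score scale, and the sample data are all invented for illustration. The idea is to take logged candidates with their examiner decisions, find the lowest score above which every candidate was declared a hit, and then compute what share of all hits could be declared automatically at that threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    score: float        # matcher search score (hypothetical scale)
    examiner_hit: bool  # True if the human examiner declared a match


def auto_hit_threshold(history: List[Candidate]) -> Optional[float]:
    """Lowest observed hit score that lies above every observed non-hit score.

    Returns None when no hit scored above the highest-scoring non-hit.
    """
    non_hit_scores = [c.score for c in history if not c.examiner_hit]
    highest_non_hit = max(non_hit_scores, default=float("-inf"))
    safe_hits = [c.score for c in history
                 if c.examiner_hit and c.score > highest_non_hit]
    return min(safe_hits) if safe_hits else None


def automated_fraction(history: List[Candidate], threshold: float) -> float:
    """Share of examiner-confirmed hits scoring at or above the threshold."""
    hits = [c for c in history if c.examiner_hit]
    if not hits:
        return 0.0
    return sum(c.score >= threshold for c in hits) / len(hits)


# Invented sample data: eight examiner-confirmed hits and one non-hit.
history = [
    Candidate(9500, True), Candidate(9100, True), Candidate(8200, False),
    Candidate(8000, True), Candidate(7400, True), Candidate(6100, True),
    Candidate(5000, True), Candidate(4000, True), Candidate(3000, True),
]

threshold = auto_hit_threshold(history)
share = automated_fraction(history, threshold)
```

In practice such a threshold would presumably be set with a safety margin and revisited as more operational data accumulate, since a single high-scoring non-hit changes the result.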
DISNEY’S ENTRANCE CONTROL SYSTEM
Walt Disney World (WDW) has publicly reported1 on internal testing using several different biometric technologies over the years. (See Box 5.1 for more on Disney’s use of biometrics.) WDW tested various hand geometry and finger scanning technologies at several theme park locations to evaluate alternative technologies to the then-existing finger geometry used in its turnstile application. WDW also tested technologies for other applications to increase guest service and improve operating efficiency. Testing there is done in four stages: laboratory testing, technology testing, scenario testing, and operational evaluation. Since WDW has had existing biometric technology in place since 1996 and a substantial amount of experience with the biometric industry, its mind-set is that a threshold has been set for performance in both error rates and throughput and prospective vendors must exceed this level of performance to be considered for future enhancement projects.
In WDW lab testing, prospective biometric devices or technologies are examined for the underlying strengths of their technology/modality, usability, and accuracy. This testing is performed under optimal, controlled conditions for all of the relevant parameters that can affect performance. Parameters like technology construction and architecture, component mean time between failures, and theoretical throughput are extrapolated based on the results of laboratory tests. The goal of laboratory testing is to quickly determine whether a device or technology is worth investigating further. If a technology does not meet a performance level above the existing technology under optimal conditions, there is no point in investigating further.
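The screening rule in the last two sentences can be sketched as a simple comparison against the incumbent technology. This is an illustrative assumption, not WDW's actual criteria: the `Performance` fields (false reject rate, false accept rate, guests per minute) and all numbers are hypothetical stand-ins for "error rates and throughput."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Performance:
    false_reject_rate: float   # fraction of genuine attempts rejected (assumed metric)
    false_accept_rate: float   # fraction of impostor attempts accepted (assumed metric)
    guests_per_minute: float   # throughput at one turnstile (assumed metric)


def exceeds_baseline(candidate: Performance, baseline: Performance) -> bool:
    """True only if the candidate beats the baseline on every measure."""
    return (candidate.false_reject_rate < baseline.false_reject_rate
            and candidate.false_accept_rate < baseline.false_accept_rate
            and candidate.guests_per_minute > baseline.guests_per_minute)


# Hypothetical lab results measured under optimal conditions.
existing = Performance(false_reject_rate=0.02, false_accept_rate=0.001,
                       guests_per_minute=10.0)
device_a = Performance(false_reject_rate=0.01, false_accept_rate=0.0005,
                       guests_per_minute=12.0)
device_b = Performance(false_reject_rate=0.01, false_accept_rate=0.0005,
                       guests_per_minute=9.0)

worth_investigating = [name for name, perf in
                       [("device_a", device_a), ("device_b", device_b)]
                       if exceeds_baseline(perf, existing)]
```

Requiring a strict improvement on every measure is one possible reading of "exceed this level of performance"; a real evaluation might weigh the measures against each other instead.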
If a prospective biometric device or technology is determined to be promising in the WDW lab environment, then the next stage of testing, called “technology testing,” is conducted to examine the limitations of the technology: some of the parameters are controlled and others are allowed to vary to “extreme limits” to see how the technology reacts and where it fails. For example, increasing the amount of ambient lighting for a facial recognition system or increasing the amount of background noise for a voice system stresses the capabilities of those systems.
If the technology is still determined to be promising, scenario testing is performed by testing the technology in the live, operational environment with real-world users.2 Typically, all data are captured and subsequently analyzed to determine whether the system performed as expected, better, or worse, and whether some parameter was unexpectedly affecting performance. Video of the user interactions is often recorded to assist in the data analysis and is particularly useful if the testing shows unexplained anomalies or unexpected results. For example, video of the interactions may reveal users swapping fingers between enrollment and subsequent use in a fingerprint system.
Throughout the testing process, a cost/benefit analysis of the potential system enhancement is updated with the results of each round of testing. If the performance gain is determined to be worthwhile, a business decision may then be made to migrate to the new technology. Disney has followed these scenario tests with operational tests on deployed systems to estimate actual error and throughput rates.
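As a rough illustration of how error and throughput rates might be estimated from logged operational data, the sketch below computes a Wilson score interval for an observed error rate together with a simple throughput figure. The function names, the choice of a 95 percent interval, and the counts are assumptions for illustration; the source does not describe Disney's actual analysis method.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= errors <= trials:
        raise ValueError("errors must be between 0 and trials")
    p = errors / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials
                               + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


def throughput_per_minute(transactions: int, elapsed_seconds: float) -> float:
    """Average number of completed transactions per minute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return transactions * 60.0 / elapsed_seconds


# Invented log summary: 30 false rejections in 1,000 genuine attempts
# recorded over 80 minutes at one entrance.
low, high = wilson_interval(errors=30, trials=1000)
rate = throughput_per_minute(transactions=1000, elapsed_seconds=4800.0)
```

Reporting an interval rather than a single rate makes it easier to judge whether an observed difference between the existing and candidate technologies is larger than sampling noise.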
U.S. ARMY’S BIOMETRICS AUTOMATED TOOLKIT
BAT was developed in the late 1990s by the Battle Command Battle Lab of the Army Intelligence Center at Fort Huachuca, Arizona, as an advanced concepts technology demonstration (ACTD) to enable U.S. military forces to keep a “digital dossier” on a person of interest.3 Other features such as biometric collection and identification badge creation were also included. BAT uses a rugged laptop and biometric collection devices (facial images, fingerprints, iris images, and, in some cases, voice samples) to enroll persons encountered by the military in combat operations. Hundreds of devices were rushed into production to meet demand during Operation Iraqi Freedom and Operation Enduring Freedom in Afghanistan.
This military use of biometrics ensures that a person, once registered, can later be recognized even if his or her identity documents or facial characteristics change. This permits postmission analysis to identify persons of future interest and in-mission analysis to detain persons of interest from biometrically supported watch lists. These systems are considered by many to be a technical success today, and the data are shared, when appropriate, with the FBI, DHS, and the intelligence community. When first deployed they did not go through factory or system acceptance tests because of the rapid prototyping and the demand for devices. After operational use, it was determined that the fingerprints collected were not usable by the FBI because several factors had not been considered in the original tactical system design, which did not include sending output to the FBI’s strategic system, IAFIS. BAT was then formally tested operationally and the required changes identified and made. The operational retests before and after deployment showed that the current-generation BAT systems generally met all of the image quality and record format protocols specified by the FBI. These BAT devices, however, use proprietary reference representations to share information on watch lists, which makes them less interoperable with standards-based systems than with one another.
2 J. Ashbourn, “User Psychology and Biometric Systems Performance” (2000). Available at http://www.adept-associates.com/User%20Psychology.pdf.
3 Available at http://www.eis.army.mil/programs/dodbiometrics.htm.