Why agencies should conduct their own AFIS benchmarks rather than relying on others.

Reprinted from a LinkedIn article published by Michael K. French on June 25, 2021.

The first Automated Fingerprint Identification System (AFIS) procurement in the U.S. was intended to be a sole source acquisition, but it turned into a competitive bidding process decided by a benchmark. This happened in 1983, when the city of San Francisco was facing a wave of burglaries and turned to new technology to solve the problem, an effort that became known as the “San Francisco Experiment”.

The city originally considered awarding a sole source contract to Printrak, but decided to also compare systems from NEC Corporation and Logica in a benchmark test. NEC won the benchmark and was awarded the contract, which was a surprise, since Printrak was the only company known to have an operational AFIS ready for sale at the time. Oh, and by the way, the experiment worked: the resulting AFIS sent the burglary rate tumbling for years to come. After the San Francisco Experiment, many agencies, including the FBI and the Royal Canadian Mounted Police, went on to conduct benchmarks of their own as AFIS proliferated across the globe.

As of 1999 there were at least 500 AFIS sites around the world, and the FBI IAFIS (Integrated Automated Fingerprint Identification System) had also gone online, collecting and matching tenprint and latent print records from all 50 states. It’s difficult to know for sure, but today there are an estimated 1,000 AFIS installations around the world, and the U.S. National Institute of Standards and Technology (NIST) conducts periodic testing to measure the performance of algorithms submitted by different vendors. This testing seems to have displaced most, if not all, of the AFIS benchmarks undertaken at the state and local level. But is this a good thing?

We have learned invaluable information from these NIST evaluations, which are rich with data and interpretations. They show, for example, how vendors handle different types of data, and how accuracy relates to CPU intensity or biometric template size. However, it is also important to remember that NIST does not test operational systems, but rather technology submitted as software development kits (SDKs). Sometimes these submissions are labeled as research (or not labeled at all), and in reality it cannot be known whether these algorithms are included in the product that an agency will ultimately receive when it purchases a biometric system. And even if they are “the same”, core algorithms that were optimized for a NIST study could produce different results once they run inside an operational architecture.

Making the case for state and local agencies to conduct their own benchmarks implies the expectation that the system being tested (or something better) will be the one that is delivered to the customer six to twenty-four months down the road. By conducting a benchmark, the customer can measure performance against data from their current system, calculating accuracy and speed, and observing how a new system will handle data or conditions problematic for the current system. A test plan can be devised to assess qualitative measurements as well, which will not be found in a NIST evaluation.
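As a rough illustration of what "calculating accuracy and speed" can look like once search results have been recorded, here is a minimal Python sketch. The `SearchResult` record and the `summarize` function are hypothetical names invented for this example, not part of any vendor system or NIST tool, and the sketch assumes that each recorded search notes the rank at which the true mate appeared in the candidate list (or that it was missed) along with the elapsed time.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class SearchResult:
    """One recorded benchmark search (hypothetical record layout)."""
    probe_id: str              # identifier of the record that was searched
    mate_rank: Optional[int]   # rank of the true mate in the candidate list, None if missed
    seconds: float             # elapsed time for the search


def summarize(results, max_rank=1):
    """Return the hit rate within max_rank and the mean search time."""
    if not results:
        raise ValueError("no search results recorded")
    hits = sum(
        1 for r in results
        if r.mate_rank is not None and r.mate_rank <= max_rank
    )
    return {
        "searches": len(results),
        "hit_rate": hits / len(results),
        "mean_seconds": mean(r.seconds for r in results),
    }
```

Running the same summary over each vendor's results, and over the current system's results on the same probes, gives a like-for-like comparison. The choice of `max_rank` (for example 1 for tenprints, a longer candidate list for latents) is an agency decision rather than something this sketch prescribes.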

A benchmark can reveal insights previously unknown to the customer, thus refining requirements or buying decisions, or can expose defects unknown to the vendor, which can then be corrected before they ever reach the project. A benchmark can also reveal how vendors differ in their handling of particular data, such as palms and major case prints, or, in the case of face recognition, different poses, angles, glasses, and face coverings.

A typical ABIS benchmark (Automated Biometric Identification System is the modern term) lasts about a week and involves enrolling sets of tenprints, palm prints, major case prints, latent prints, faces, etc., then conducting searches and recording results. The data sets need to be large enough to be statistically meaningful, need to be representative of the agency’s own biometric data, and need to cover a variety of quality levels and conditions, e.g. tenprint sequence errors. Test data should be enrolled in a database containing a sufficient amount of background data to serve as “noise”, to make these tests more realistic. If a test is not properly set up, observed, and recorded, it may as well have never happened.
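To give a feel for what "large enough to be statistically meaningful" can mean, the following Python sketch computes a Wilson score confidence interval for an observed hit rate. This is one standard statistical technique chosen for illustration, not a method prescribed by the article or by NIST, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a hit rate; z=1.96 gives roughly 95% confidence."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed hit rate (90%), very different certainty:
small = wilson_interval(90, 100)     # 100 searches
large = wilson_interval(900, 1000)   # 1,000 searches
```

With 100 searches the interval spans roughly 0.83 to 0.94, so two systems that score a few points apart cannot be reliably ranked. With 1,000 searches the interval is much narrower, which is why the size of the test sets matters.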

The results of the benchmark will offer a great deal of visibility into the behavior of each system in comparison to the others. If the performance is properly recorded and the data preserved, the results can be carried forward for independent verification and validation (IV&V) or acceptance testing in the ABIS replacement project itself. And while many people think of a benchmark as only a comparison of matching performance, it can also compare the usability of the workstations and other front-end functions.

In some instances, it might be wise to record the time or number of mouse clicks needed to perform a specific task. In other cases, it will be useful to know how long it takes the system to load images for comparison, and whether it does things like accurately calculate mated minutiae on a fingerprint comparison. This way a clear set of requirements can be established and the agency will know what it is getting in the long run, removing the risk that it will be dissatisfied at the time of delivery with no contractual requirements to fall back on. Data can also be collected to estimate how much effort will be required to make a system interoperable with another.

The main argument against benchmarking is that an agency can cut costs by bypassing this form of testing, which may be considered redundant to acceptance testing, especially since the ABIS market has been mature for some time now and all fingerprint systems are perceived to offer very high accuracy. An agency may argue that it has already made its decision based on its perception of vendor matching or other capabilities, and simply wants to move forward with the replacement project.

I would argue that this view of benchmarking is narrow, since every ABIS project is unique, and the ABIS components, software platforms, and hardware evolve due to changes in technology. This means performance on one system could be significantly different from another, and performance on one agency’s set of data could differ from performance on another agency’s set of data. The only way to be sure is through proper testing, which in the long run can save money on the project and over the life cycle of the operational system. Defects or change orders can be costly and involve additional test cycles!

Looking forward, it is easy to imagine that artificial intelligence and machine learning will create better systems in the very near future, and some of these systems may come from previously unknown companies with unproven reputations. If this becomes the case, the only way to know for sure, and to avoid risk, is to conduct your own benchmark.

Michael K. French is the founder of APPLIED FORENSIC SERVICES LLC, specializing in ABIS requirements, project support, IV&V, and conformance to forensic and biometric standards.
