Facial Recognition Error Rates Vary by Demographic
Is facial recognition software fair? The U.S. National Institute of Standards and Technology (NIST) recently evaluated 189 software algorithms from 99 developers and found that most programs exhibit different levels of accuracy depending on demographics, including sex, age, and racial background.
The NIST report, Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects (NISTIR 8280), was released in late 2019 as part of an ongoing facial recognition study. Previous segments of the program have measured advancements in facial recognition accuracy and speed, face image quality assessments, and the ability to detect facial morphing or deep fake technology.
The demographic-focused study tested algorithms on two different tasks: confirming a photo matches a different photo of the same person in a database (known as “one-to-one” matching, most commonly used for verification such as unlocking a smartphone) and determining whether the person in the photo has any match in a database (“one-to-many” matching, which can be used to detect a person of interest).
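To make the two tasks concrete, here is a minimal sketch of how they differ in code. It assumes each face photo has already been reduced to a numeric feature vector and that similarity is scored with cosine similarity against a fixed threshold. These are common conventions, not details taken from the NIST report, and the function names, vectors, and threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, reference, threshold):
    """One-to-one: compare the probe against a single claimed identity."""
    return cosine_similarity(probe, reference) >= threshold

def identify(probe, gallery, threshold):
    """One-to-many: search every enrolled identity, best score first."""
    scored = [(cosine_similarity(probe, vec), name) for name, vec in gallery.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

# Toy, made-up vectors (hypothetical; real face templates are far longer).
gallery = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}
probe = [0.9, 0.1, 0.0]

is_alice = verify(probe, gallery["alice"], threshold=0.8)   # one comparison
candidates = identify(probe, gallery, threshold=0.8)        # one comparison per enrolled person
```

Because identification makes one comparison per enrolled person, every additional gallery entry is another chance for a false match, which is one reason one-to-many errors draw particular scrutiny.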
The NIST team also measured algorithms’ false positive and false negative rates. In a false positive, the algorithm said photos of two different people showed the same person; in a false negative, the algorithm failed to recognize that two photos showed the same person.
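The two rates can be illustrated with a short sketch. It assumes each trial is recorded as a pair of booleans (whether the photos truly show the same person, and whether the algorithm declared a match); the function name and the sample trials are invented for illustration and are not NIST data.

```python
def error_rates(comparisons):
    """Return (false_positive_rate, false_negative_rate).

    comparisons: iterable of (same_person, said_match) boolean pairs.
    """
    different_total = different_errors = 0
    same_total = same_errors = 0
    for same_person, said_match in comparisons:
        if same_person:
            same_total += 1
            if not said_match:          # missed a true match
                same_errors += 1
        else:
            different_total += 1
            if said_match:              # matched two different people
                different_errors += 1
    fpr = different_errors / different_total if different_total else float("nan")
    fnr = same_errors / same_total if same_total else float("nan")
    return fpr, fnr

# Made-up trials: 4 same-person pairs, then 5 different-person pairs.
comparisons = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False), (False, False),
]
false_positive_rate, false_negative_rate = error_rates(comparisons)
```

Note that the two rates use different denominators: false positives are counted only among different-person pairs, and false negatives only among same-person pairs.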
NIST has assessed facial recognition algorithm accuracy in the past, but one of the key differences in this report was the addition of the demographic factor, especially in testing one-to-many matching.
Four collections of photographs, containing 18.27 million images of 8.49 million people, were pulled from databases provided by the U.S. State Department, the Department of Homeland Security, and the FBI to test the algorithms. The photos carried metadata (such as the subject’s sex, age, and race or country of birth), which enabled the NIST team to determine error rates for each of these tags.
“The demographic report looks at differences in performance across demographic groups to see if false positive or negative rates changed,” says Craig Watson, image group manager at NIST. “The findings showed that various algorithms had different rates of error across different demographic groups.”
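A rough sketch of this kind of per-group breakdown follows. It assumes each different-person trial carries a single demographic tag and a match decision; the group labels, counts, and function name are hypothetical, and this is not the report's actual methodology.

```python
from collections import defaultdict

def false_positive_rate_by_group(impostor_trials):
    """impostor_trials: iterable of (group_tag, said_match) for different-person pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, said_match in impostor_trials:
        totals[group] += 1
        if said_match:                  # algorithm wrongly declared a match
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Made-up counts for two placeholder groups (not real data).
trials = (
    [("group_a", False)] * 98 + [("group_a", True)] * 2
    + [("group_b", False)] * 90 + [("group_b", True)] * 10
)
rates = false_positive_rate_by_group(trials)

# Express each group's rate relative to the lowest-rate group.
baseline = min(rates.values())
ratios = {group: rate / baseline for group, rate in rates.items()} if baseline > 0 else {}
```

Reporting a ratio against the lowest-rate group is one simple way to show a differential; a single aggregate error rate would hide it entirely.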
The study highlighted several broad findings across the algorithms. For one-to-one matching, Asian and African American faces had higher false positive rates than Caucasian faces. Among American-developed algorithms, false positive rates in one-to-one matching were similarly high for Asians, African Americans, and native groups, with the American Indian demographic showing the highest rates.
Algorithms developed in Asian countries, however, had no major difference in false positive rates between Asian and Caucasian faces.
For one-to-many matching, the NIST team found there were higher rates of false positives for African American females than for any other group.
“Differentials in false positives in one-to-many matching are particularly important because the consequences could include false accusations,” the report said. Overall, false positives were higher for women than for men, but the effect was smaller than the differences across racial groups.
The error rate in facial recognition algorithms carries different weight depending on the application, Watson says. In an access control situation, the ramifications of a false negative—failing to detect that the person standing at the door matches the credential—are often merely annoying or a waste of time: the person does not gain entry immediately and has to try again or use a different credential, such as a badge. In a law enforcement or investigation application, however, not identifying someone on a watch list could have immense ramifications.
Similarly, a false positive—erroneously alerting on an innocent person who may resemble someone on a watch list—could have long-term effects through a false accusation or potential false imprisonment.
When deciding whether to use a face recognition algorithm for a given use case, Watson says, end users should consider the cost of failure.
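One way to reason about the cost of failure is to weight each error type by how often it can occur and how much it costs. The sketch below is purely illustrative: the error rates, cost figures (in arbitrary units), share of non-matching subjects, and all names are hypothetical assumptions, not figures from NIST or Watson.

```python
def expected_cost(fp_rate, fn_rate, nonmatch_share, cost_fp, cost_fn):
    """Average cost per attempt, given error rates and per-error costs.

    nonmatch_share: fraction of attempts where the person truly is NOT a match.
    """
    match_share = 1.0 - nonmatch_share
    return nonmatch_share * fp_rate * cost_fp + match_share * fn_rate * cost_fn

# Two hypothetical threshold settings: (false positive rate, false negative rate).
operating_points = {
    "strict": (0.001, 0.05),
    "lenient": (0.01, 0.01),
}

def cheapest(nonmatch_share, cost_fp, cost_fn):
    """Name of the operating point with the lowest expected cost."""
    return min(
        operating_points,
        key=lambda name: expected_cost(
            *operating_points[name], nonmatch_share, cost_fp, cost_fn
        ),
    )

# Door access: few impostors, and a false negative only means trying again.
door_choice = cheapest(nonmatch_share=0.01, cost_fp=10, cost_fn=1)

# Watch list: almost everyone scanned is not on the list, and both errors are costly.
watchlist_choice = cheapest(nonmatch_share=0.999, cost_fp=1000, cost_fn=5000)
```

With these made-up numbers the lenient setting is cheaper for the door and the strict setting is cheaper for the watch list, which shows how the same algorithm can call for different settings in different applications.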
Something else to consider when weighing security technology options is context, says Desmond Patton, associate professor at the Columbia University School of Social Work and director of SAFELab, a research initiative that studies how youth of color navigate violence offline and online.
Patton has worked on projects that include image-based analysis. He says algorithms find it extremely difficult to detect context, such as how behavior changes in different places and what items of clothing or hand gestures might mean in various situations.
For example, he says, a picture of a youth flashing a gang sign does not necessarily mean that person is a gang member. Instead, depending on the context of the situation—the other people present, the geographical area—it might merely be a way for the subject to protect himself or signal belonging in the community.
“Oftentimes the goal is to have the most accurate system, but the most accurate system can also be weaponized against the community it is intended to help,” Patton says.
To deploy fairer technologies, security directors need diverse teams, including people from a range of ethnicities and educational backgrounds as well as individuals who might be affected by the tools’ use, to work through different scenarios and possibilities, Patton says.
“You want people who will be on the other side of these issues at the table to anticipate any challenges,” he explains. “But they will also alert you to the various ways these tools might be effective.”
For example, he adds, analytics that detect certain red flags—such as warning signs of potential violence—could be used as triggers to offer counseling or outreach to help the subject instead of merely mitigating risk.
“For some reason, we blindly trust these systems, and over and over again, when we apply them to the real world, we quickly realize that they have limits. We’re trusting these systems far beyond their capacities,” Patton adds.
In the NIST study, not all algorithms gave these high rates of false positives across demographics—the report emphasized that different algorithms perform differently. As Watson notes, for the best, reasoned application of facial recognition tools, it behooves end users to “know your use case, know your algorithm, know your data. All three of those matter in making these decisions.”
The NIST report authors echoed his sentiments, adding that “Given algorithm-specific variation, it is incumbent upon the system owner to know their algorithm. While publicly available test data from NIST and elsewhere can inform owners, it will usually be informative to specifically measure accuracy of the operational algorithm on the operational image data, perhaps employing a biometrics testing laboratory to assist.”