Comparison of Face Image Quality Metrics
Kevin O’Connor, Gregory Hales,
Jonathon Hight
Biometric Standards, Performance and
Assurance Laboratory
Purdue University, Department of
Technology, Leadership and Innovation
West Lafayette, IN
Shimon Modi, Ph.D.
Center for Development of Advanced
Computing,
Mumbai, India
Stephen Elliott, Ph.D.
Biometric Standards, Performance and
Assurance Laboratory
Purdue University, Department of
Technology, Leadership and Innovation
West Lafayette, IN
Abstract—Automated face recognition offers an effective method
for identifying individuals. Face images have been used in a
number of different applications, including driver’s licenses,
passports and identification cards. To provide some form of
standardization for photographs in these applications, ISO / IEC
JTC 1 SC 37 have developed standardized data interchange
formats to promote interoperability. There are many different
publically available face databases available to the research
community that are used to advance the field of face recognition
algorithms, amongst other uses. In this paper, we examine how
an existing database that has been used extensively in research
(FERET) compares with two operational data sets with respect to
some of the metrics outlined in the standard ISO/IEC 19794-5.
The goals of this research are to provide the community with a
comparison of a baseline data set and to compare this baseline to
a photographic data set that has been scanned in from mug-shot
photographs, as well as a data set of digitally captured
photographs. It is hoped that this information will provide Face Recognition System (FRS) developers with some guidance on the characteristics of operationally collected data sets versus a controlled-collection database.
Keywords-face recognition; image quality; law enforcement;
biometrics
I. INTRODUCTION
Face recognition has been used extensively to verify the
identity of an individual. However, the performance of such
systems is constrained by the quality of the images in the data
set. Publicly available face data sets have contributed to the
development of many face recognition algorithms. In fact, evaluations such as the Face Recognition Technology (FERET) program [1] and academic data sets such as the CMU Pose, Illumination, and Expression (PIE) database [2] have served this purpose [3]. Many
times, operational data sets are not available to the academic
community for a number of reasons, such as privacy rules and
existing regulations. This limitation is unfortunate because in a
recent report, it was noted that face images captured at U.S.
ports of entry “do not conform either to the national FR (face
recognition) standard adopted by DHS or to FACESTD, the
international standard specified in the Registry of USG
Recommended Biometric Standards” [4] (page 12). Moreover,
such operational environments introduce challenges such as
pose, head size, non-uniformity of lighting and general
illumination issues. The National Institute of Standards and
Technology (NIST) also conducted a study of face images in
2004 and compared operational data from these ports of entry
to the FERET data set. They concluded that the operational
images present challenges, which include faces that are not
centered, non-frontal head poses and poor illumination /
backgrounds [5]. Therefore, examining data sets against a
common list of standard metrics of image quality may provide
valuable information to facial recognition algorithm
developers. In 2006, L-1, a commercial identity solutions
provider, conducted a quality assessment of facial images. One
of the goals of their research was to contribute to the quality
assessment framework of the ISO/IEC 19794-5 standard. A
summary of their report can be found in [6]. The metrics
used in that particular study map to the same metrics used in
this paper.
It is not only important to compare one data set against
another data set by using the same metrics, but it is of interest
to the community that older images that have been collected
historically be included. For example, many mug-shot
photographs were collected using film and subsequently
printed out and stored. Over time, these photographs have aged.
Also, these photographs were captured when acquisition best
practices were not in place (or adopted by the organization).
Large amounts of legacy data of this type are stored by
organizations, and these data may be the only photographic
record available of a given individual. Understanding the quality of such legacy data, and how a face recognition system will process them, is therefore important.
II. RELATED WORK
The motivation behind this work stemmed from an initial
request to analyze mug-shot photographs collected in a
correctional facility, to identify problematic standardized image
quality metrics, and to optimize the capture process. The goal
was to improve the image quality of photographs taken in an
operational environment [7]. That study only examined images
captured by a digital camera. However, paper-based
photographs were also made available for analysis. To provide
more insight into the results, the comparison with the FERET
data set was used. The impact of poor-quality data has been
studied at length [8]. In that report, the authors make a number
of observations, including that the detection and measurement
of quality is necessary for dealing with such poor-quality data.
Therefore, a methodology for accurately quantifying quality is required if “prevention and mediation” are to occur ([8], page 4).
Prior to the publication of that report, NIST published a best-
practice document for capturing mug shots in the law
enforcement environment [5]. That document has been used as
a reference tool for law enforcement agencies worldwide.
III. METHODOLOGY
Three different data sets were used in the analysis. Two
data sets were collected in an operational environment. The
first data set, called Legacy in this paper, consisted of
photographs that were captured over a number of years. Some
of these photographs were of very poor visual quality and had
not been stored in optimal conditions, nor were they printed on
archival-quality paper. Some of the photographs had holes in
them where they had been filed in a ring binder, or they had
started to change color because of their age. A Kodak i1220 scanner was used to convert these images to digital format for subsequent processing in the image quality tool. The scanner settings were as follows: 300 dpi resolution, 24-bit color, a media type of card stock, a document type of photograph, and no compression. A total of 9,233 photographs were scanned and then cropped to remove any identifying artifacts, such as the subject’s corrections number. This was a requirement of the
University’s Institutional Review Board (IRB). The second data set, called the Electronic data set in this paper, consisted of 49,694 images, of which 48,786 were used for this study. All of the metadata for these images also had to be removed because of the IRB restrictions on this data set; however, these images would have been collected around 2009/2010. The third data
set was the FERET data set, which consisted of 4,063 images.
All three of these data sets were processed by a commercially
available image quality algorithm. There were several reasons
for choosing a commercial image quality algorithm, as opposed
to creating one for the purposes of this study. First, the
experiment is agnostic to the type of image quality tool that is
being used because the output of the tool is a standardized set
of metrics. Therefore, the motivation was to examine the
differences in the standardized metrics, as opposed to creating a
novel approach to extracting such information. The tool was
not the focus; rather, the image quality results were the focus.
Second, the tool was well understood within the biometric
community and had been used to perform reliable and
repeatable studies. The software provided similar metrics as
shown in [6]. The image quality metrics were clustered into five groups: format, digital, photographic, scene, and algorithmic. In total, 36 image quality metrics were evaluated. These five clusters are described in more detail in ISO/IEC 19794-5 [9]. Of these 36 image quality metrics, 28 provided a score in the range 0 to 10, which was used for the subsequent statistical analysis. These 28 image quality metrics were banded into three ranges: 0–3.9 indicates a poor score, 4–6.9 indicates a medium score, and 7–10 indicates a good score.
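As a minimal illustration, this banding rule can be expressed as a short function; the thresholds come directly from the ranges stated above, while the function name, types, and sample scores are our own and purely illustrative.

```python
def band(score: float) -> str:
    """Band a 0-10 image quality score into the paper's three ranges."""
    if score < 4.0:      # 0-3.9 indicates a poor score
        return "poor"
    if score < 7.0:      # 4-6.9 indicates a medium score
        return "medium"
    return "good"        # 7-10 indicates a good score

# Hypothetical metric scores, one per data set
for score in (3.5, 6.2, 7.3):
    print(score, "->", band(score))
```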
An analysis of variance (ANOVA) was used to determine whether the means of the different groups were equal. Parametric tests involve hypothesis testing under a strict set of assumptions that must be met [10]. The ANOVA results can be divided into two components: the variation that is explained by the model (1) and the variation that is not, which is called the error (2). Both are used to calculate the F-statistic (3) for testing the hypothesis H0: µ1 = µ2 = … = µI. The results are reported as p-values. When H0 is rejected, the variation explained by the model (SSM) tends to be large relative to the error (SSE), which corresponds to a larger F value. For I groups with ni observations each and N observations in total, this is represented by the equations below:

SSM = Σi ni(x̄i − x̄)² (1)
SSE = Σi Σj (xij − x̄i)² (2)
F = [SSM / (I − 1)] / [SSE / (N − I)] (3)
We have used this methodology in other experiments.
Following this initial analysis, Tukey’s test was conducted to
determine which means were significantly different from one
another. The test compares the means of every treatment with
every other treatment. This method is more suitable for
multiple comparisons [11].
The hypotheses for this experiment were as follows:
H0: µiqc,L = µiqc,E = µiqc,F (5)
Ha: not all of µiqc,L, µiqc,E, µiqc,F are equal (6)
where iqc is the individual image quality metric, L is the
Legacy data set, E is the Electronic data set, and F is FERET. The
alpha (α) for each of the tests was set at 0.05.
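A minimal sketch of this testing procedure follows, using standard SciPy and statsmodels routines on hypothetical per-image scores; in the study, the scores came from the commercial image quality tool, so the synthetic data below is an assumption for illustration only.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores for one image quality metric, one array per data set.
rng = np.random.default_rng(0)
legacy = rng.normal(7.1, 1.0, 200)
electronic = rng.normal(6.2, 1.0, 200)
feret = rng.normal(7.3, 1.0, 200)

# One-way ANOVA: H0 is that all three group means are equal (alpha = 0.05).
f_stat, p_value = stats.f_oneway(legacy, electronic, feret)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")  # reject H0 when p < 0.05

# Tukey's HSD then compares every pair of means to see which ones differ.
scores = np.concatenate([legacy, electronic, feret])
groups = ["Legacy"] * 200 + ["Electronic"] * 200 + ["FERET"] * 200
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```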
IV. RESULTS
The first set of analyses was to examine the image quality
scores of the three data sets. The ANOVA revealed that for the
28 variables that were used in the statistical analysis, the null
hypothesis was rejected at α = 0.05. The results are presented in
Tables I–VI, showing overall image quality as well as the five
clusters (format, digital, photographic, scene, and algorithmic)
from ISO/IEC 19794-5.
TABLE I. OVERALL
Quality Metric Legacy Electronic FERET
Overall 7.14 6.24 7.28
Table I shows that the overall quality metric revealed an unexpected result. We would expect the FERET data set to have the best image quality and the Legacy data set the worst. This was not the case: the more recent, digitally captured photographs (the Electronic data set) had the worst image quality of the three data sets. This result indicates that factors other than the acquisition technology might affect the image quality. A graphical representation of the overall quality score is shown in Figure 1.
Figure 1. Overall quality distribution of the three data sets
The distribution of the overall quality scores for the Legacy, Electronic, and FERET data sets showed that the Legacy and FERET data sets had similar distributions, both slightly skewed toward the “good-quality” band. However, the bulk of the Electronic photographs have a spike between 3.0 and 3.5, which falls in the “poor-quality” band. Although there are significantly more photographs in the Electronic data set, its distribution is clearly different from that of the other two data sets.
TABLE II. FORMAT
Quality Metric Legacy Electronic FERET
Compression Artifacts 6.74 6.79 5.08
As can be seen in Table II, the Compression Artifacts metric falls in the medium band for all three data sets.
TABLE III. DIGITAL
Quality Metric Legacy Electronic FERET
Contrast 6.39 6.90 6.48
Scanning Artifacts 6.68 7.12 7.79
Interlaced 9.77 7.28 8.78
Sensor Noise 7.35 6.80 5.64
Table III shows that no one data set provided a clear advantage over the others with respect to these image quality metrics.
TABLE IV. PHOTOGRAPHIC
Quality Metric Legacy Electronic FERET
Centered 5.89 5.28 6.60
Cropping 9.95 9.95 9.99
Focus 7.63 4.24 5.07
Motion Blur 8.01 7.96 8.41
Exposure 7.21 6.96 7.74
Unnatural Color 6.93 7.38 6.62
It should be noted that the Legacy data set was manually cropped to remove any identifiers; the Cropping score is therefore probably artificially high for this type of photograph. The other two databases were not cropped. It is interesting to note that the Electronic data set has the lowest scores for centering, focus, and motion blur. This result is consistent with the port-of-entry findings, which included non-centered faces and blurry or poorly illuminated faces [4]. These results are also similar to those in [7]. Clearly, the operational images do not conform to the standard geometric characteristics shown in Figure 2.
Figure 2. Standard geometric characteristics [9].
Tables V and VI show the scene and algorithmic clusters of
image quality variables.
TABLE V. SCENE
Quality Metric Legacy Electronic FERET
Eyes Clear 9.53 9.45 8.77
Glare Free 6.58 6.63 6.56
Sunglasses 6.81 5.28 6.01
Eyes Open 8.38 7.94 7.77
Shadows in the Eye Sockets 8.64 8.08 8.13
Uniform Lighting 5.39 4.82 4.61
Hot Spots 5.36 4.92 6.09
Facial Shadows 7.02 7.56 8.12
Background Uniformity 3.85 6.86 5.94
Background Brightness 4.00 3.38 5.31
Background Shadows 6.27 5.10 7.94
Frontal Pose 6.90 7.37 8.40
TABLE VI. ALGORITHMIC
Quality Metric Legacy Electronic FERET
Faceness 8.94 9.37 9.43
Texture 7.54 3.47 4.18
The second analysis was Tukey’s test, the results of which can be deduced from the results in Tables I–VI. Because of the IRB constraints, we were unable to process these data sets through a face recognition algorithm and retrieve performance scores. However, we did complete an analysis of the Failure to Extract (FTX) metric. A Failure to Extract occurs when a sample cannot be processed completely; it may happen for a number of reasons, such as a failure in feature segmentation, feature extraction, or quality control [11]. The FTX rates for each
database are given in Table VII.
TABLE VII. FAILURE TO EXTRACT
Database Legacy Electronic FERET
FTX 810 1791 8
Total Images 9232 49692 4063
FTX rate 8.77% 3.60% 0.19%
Overall quality score (Table I) 7.14 6.24 7.28
The results show that the FERET data set had the lowest
FTX at 0.19%, and the Legacy data set was the worst at
8.77%. This finding does not correlate with the overall quality score, repeated from Table I for clarity.
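For reference, the FTX rates in Table VII follow directly from the failure and total counts; the short check below uses only the values reported in Table VII.

```python
counts = {  # data set: (failures to extract, total images), from Table VII
    "Legacy":     (810, 9232),
    "Electronic": (1791, 49692),
    "FERET":      (8, 4063),
}
for name, (ftx, total) in counts.items():
    print(f"{name}: FTX rate = {100 * ftx / total:.2f}%")
# Prints 8.77%, 3.60%, and 0.19%, matching Table VII.
```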
V. CONCLUSIONS AND RECOMMENDATIONS
The results of this analysis indicate that there is still much
to be done to improve the image quality of operational data
sets. Additionally, this paper reports the image quality results
of operationally collected data sets and compares them to a
publicly available data set. The FERET data set, which has
been used to train face recognition algorithms, had a better
overall image quality and better results for many of the image
quality metrics. Those agencies that collect images in the field
need to be aware of image quality deficiencies and should
strive, where appropriate, to collect images that have better
image quality. However, because of operational constraints, such a goal might be unattainable. Therefore, algorithm developers should be able to work with operationally gathered data sets to realistically model such environments, and organizations should try to make such data sets available in
accordance with the appropriate rules and regulations. While
this research was unable to generate performance results for
the data sets, it is evident from the image quality scores and
extraction rates that developers may want to rethink the
training process to include operational data or data that are
captured under similar constraints if actual operational data are
not available.
VI. REFERENCES
[1] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss, "The FERET database and evaluation procedure for face-recognition algorithms," Image Vis. Comput., vol. 16, no. 5, pp. 295-306, 1998.
[2] T. Sim, S. Baker, and M. Bsat, "The CMU Pose, Illumination, and
Expression Database," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 25, no. 12, pp. 1615-1618, Dec. 2003.
[3] P. Mohanty, S. Sarkar, R. Kasturi, and P. J. Phillips, "Subspace
Approximation of Face Recognition Algorithms: An Empirical
Study," IEEE Transactions on Information Forensics and Security,
vol. 3, no. 4, pp. 734-748, Dec. 2008.
[4] DHS, "Facial Image Quality Improvement and Face Recognition
Study Final Report," United States Visitor and Immigrant Status
Indicator Technology (US-VISIT) Program, 2007.
[5] NIST, "Best Practices Recommendation for Capturing Mug-shots
and Facial Images, Version 2," National Institute of Standards and
Technology, 1997.
[6] R.-L. V. Hsu, J. Shah, and B. Martin, "Quality Assessment of Facial
Images," in Biometrics Consortium, Baltimore, MD, 2006, pp. 1-18.
[7] G. T. Hales, "Evaluation of the Indiana Department of Corrections
Mug Shot Capture Process," MS Thesis, Purdue University, West
Lafayette, IN, 2010.
[8] A. Hicklin and R. Khanna, "The Role of Data Quality in Biometric
Systems," Mitretek Systems, 2006.
[9] ISO/IEC 19794-5, Information Technology – Biometrics – Biometric data interchange formats – Part 5: Face image data, 2005.
[10] NIST/SEMATECH, e-Handbook of Statistical Methods, 2006.
[11] E. P. Kukula and S. J. Elliott, "Beyond current testing standards: A
framework for evaluating human-sensor interaction," in
International Biometric Performance Testing Conference,
Gaithersburg, MD, 2010.