
Computer Project for Digital Image Processing EE368 Spring 2001/2002

Face Detection Project Report


Ana Bertran, Huanzhou Yu, Paolo Sacchetto {nuska, hzhyu, paolos}@stanford.edu

Abstract
Human face detection by computer systems has become a major field of interest. Face detection algorithms are used in a wide range of applications, such as security control, video retrieval, biometric signal processing, human-computer interfaces, face recognition and image database management. However, it is difficult to develop a completely robust face detector because of varying lighting conditions, face sizes, face orientations, backgrounds and skin colors. In this report, we propose a face detection method for color images. Our method detects skin regions over the entire image and then generates face candidates based on a connected component analysis. Finally, the face candidates are classified as human faces or non-faces by an enhanced version of the template-matching method. Experimental results demonstrate successful face detection on the EE368 training images.

1. Introduction
There have been many attempts to solve the human face detection problem. Early approaches were aimed at gray-level images only, and image pyramid schemes were necessary to cope with unknown face sizes. View-based detectors are popular in this category, including Rowley's neural network classifier [1], Sung and Poggio's correlation template matching scheme based on image invariants [2] and Eigenface decomposition [3]. Model-based detection is another category of face detectors [4]. For color images, various works have shown that it is possible to separate human skin regions from a complex background based on either the YCbCr or the HSV color space [5, 6, 7, 8]. Face candidates can then be generated from the identified skin regions, and numerous approaches can be applied to classify faces and non-faces among them, such as wavelet packet analysis [6], template matching for faces, eyes and mouths [8, 9, 10], and feature extraction using watersheds and projections [5].

In this project, a new face detector for color images is developed. The objective is a very efficient algorithm, in terms of low computational complexity, with the maximum number of face detections and the minimum number of false alarms. To achieve these objectives, the image is first transformed to HSV color space, where the skin pixels are identified. The skin regions in HSV space are described by the intersection of several linear bounding inequalities, which are found using training data. The median luminance (Y) value of the image is also determined. For high-luminance images, the chance that non-skin pixels fall inside the skin bounds is higher, so an additional but simple classification in YCbCr space is performed to remove hair pixels. This yields a binary mask of the original image. The binary mask is then filtered with image morphology operations to break connections between faces and remove scattered noise, and a connected component analysis determines the face candidates.

The final step is to separate real faces from the face candidates using a multi-layer classification scheme. The application of this project justifies the assumption that the faces have approximately the same size, so correlation template matching is used for the face candidates that are close to the median size. For large boxes, convolution template matching is used instead, because it is more likely that only part of the candidate box contains the face. A finer level of template matching removes hand-like non-faces, and five additional face templates are tested to avoid missing a human face. Moreover, the standard deviation of the pixel gray levels of each face candidate is used to remove non-faces caused by uniform skin-color-like regions, such as floors, buildings and clothes.

In the following sections, we present the detailed algorithm of our face detector. We show that the detector gives 100% accuracy on six out of seven project training images; the only missed face, on one of the images, is due to very dark glasses, and no false alarms are raised on any of the seven images. Finally, conclusions and future work are discussed.

2. Color Segmentation
The first step is color segmentation of the image. Several color spaces are available, but the Hue-Saturation-Value (HSV) color space is the most adequate for separating the skin regions from the rest of the photo contents. A set of bounding equations that keeps the maximum number of skin pixels while minimizing the number of background pixels can be found from plots of skin versus non-skin pixels in H vs. S, S vs. V and H vs. V. These bounding equations are used to generate the first binary image. However, some face candidate boxes contained two people, because their black hair was connected and included in the skin region. To balance removing the hair against losing some of the face skin pixels, the luminance-chrominance (YCbCr) color space is also used, for high-luminance images, to separate black hair pixels from skin pixels. Figures 2-1 through 2-3 show the skin vs. non-skin regions in the HSV, YCbCr and RGB color spaces for training image no. 1.

Figure 2-1: Skin data (blue) vs. background data (red) in HSV color space.


Figure 2-2: Skin data (blue) vs. background data (red) in YCbCr color space. A lot of the background pixels (red) occupy the same space as the skin pixels (blue).

Figure 2-3: Skin data (blue) vs. background data (red) in RGB color space. A lot of the background pixels occupy the same space as the skin pixels.

As can be seen from the previous plots, the HSV space has fewer non-skin pixels overlapping with skin pixels than the YCbCr and RGB color spaces. Moreover, the RGB color space does not separate the luminance information from the color information. It was observed that the color of some background contents, such as floors, buildings and clothes, is similar to skin color. In HSV space it is very convenient to reject these skin-color-like pixels, because they mostly fall between H values of 0.1 and 0.2, while most skin pixels have H less than 0.1.

Figure 2-4 compares skin pixels with close-to-skin (wall) pixels in HSV and YCbCr.

Plot of wall pixels vs. skin pixels in HSV space. By setting the threshold at H < 0.1, the wall pixels are rejected.

Plot of wall pixels vs. skin pixels in YCbCr space. The wall pixels fall right on the skin area.

Figure 2-4: Wall pixels vs. skin pixels in HSV and YCbCr.

The H parameter contains the color information, and as can be seen in the plots, the majority of the skin pixels in the training images fall below 0.1 or above 0.8 in H. The wall presented the greatest problem, but the wall pixels fall between 0.1 and 0.2. The possibility of delimiting the skin vs. non-skin regions with linear equations is another advantage of the HSV color space: there is a linear trend in S vs. V, where the majority of the skin pixels fall within two bounding lines. These linear equations are simple, so the complexity of our algorithm is reduced, which leads to a short processing time.
Figure 2-5: S vs. V scatter plots for training images 3 and 4, which have different light conditions.

As can be seen in Figure 2-5, there is a vertical V offset from one training image to another. To eliminate this offset, V is normalized by subtracting the whole image's mean V value from the data points. The S and H parameters don't have to be normalized, since they show no significant offset from training image to training image. Figure 2-6 shows the skin data points from all the training images. The S vs. V trends are separated into two populations: one for data points with H < 0.1 and one for data points with H > 0.8. The population separation provides a more precise segmentation, so the number of face candidates is reduced. The overall computation time is reduced as well, because fewer template-matching correlation and convolution operations are needed later, at the cost of a small amount of extra computation in this first step, which consists only of logical operations.

Figure 2-6: Skin pixel samples from all training images with the bounding equations used.

The bounding equations are a trade-off between removing non-skin pixels and keeping skin pixels. For example, instead of H > 0.9 we chose H > 0.8, because the tighter bound left some faces undetected. On the other hand, the lower H limit is set at 0.1 instead of 0.2 in order to reject most of the wall pixels, at the cost of losing some unimportant face pixels, i.e., pixels on faces that are already well segmented.

The final sets of bounding equations are:

Population 1:  H < 0.1,  S ≤ 0.8,  V < -1.33·S + 0.986,  V > -0.603·S - 0.039
Population 2:  H > 0.8,  S < 0.7,  V < -1.51·S + 0.853,  V > -0.671·S - 0.062

The two population equation sets are combined with an OR operation. Figure 2-7 shows the first iteration of binary images, without hair removal.
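As an illustration, a minimal Matlab sketch of this segmentation step is given below. It assumes an RGB input image rgbImg and that the bounding equations apply to the mean-normalized value channel; the variable names are illustrative, not taken from our code.

    % Sketch of the two-population skin segmentation (names are illustrative).
    hsvImg = rgb2hsv(im2double(rgbImg));
    H = hsvImg(:,:,1);  S = hsvImg(:,:,2);  V = hsvImg(:,:,3);
    Vn = V - mean(V(:));                 % remove the per-image V offset (Figure 2-5)
    pop1 = (H < 0.1) & (S <= 0.8) & ...
           (Vn < -1.33*S + 0.986) & (Vn > -0.603*S - 0.039);
    pop2 = (H > 0.8) & (S < 0.7) & ...
           (Vn < -1.51*S + 0.853) & (Vn > -0.671*S - 0.062);
    skinMask = pop1 | pop2;              % combine the two populations with OR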

Figure 2-7: First iteration of binary images without hair removal.

The first set of masks gave an unsatisfying result, because some face candidate boxes contained two people. These people could have been separated if the hair between them were removed from the mask. Plots of hair vs. skin samples indicated that it is difficult to differentiate them in HSV space. In the YCbCr space, instead, there is a clear differentiating line between the hair and skin pixels, as can be seen in Figure 2-8. A Cb value that gives the best trade-off between removing enough hair pixels and keeping sufficient skin pixels was chosen experimentally. It was determined that no single trade-off works for all luminance conditions, so the training images are divided into two sets, high luminance and low luminance, based on a threshold. For the images whose luminance is higher than the threshold, the hair pixels are removed; for the other group the hair is kept, since otherwise too many skin pixels would be lost, and that group has no problems with face separation.
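A sketch of the luminance-adaptive hair removal follows; lumThresh and cbThresh stand in for the experimentally chosen values, which are not reproduced here, and the direction of the Cb inequality depends on the training data.

    % Hair removal sketch: applied only to high-luminance images.
    ycc = rgb2ycbcr(im2double(rgbImg));
    Y  = ycc(:,:,1);  Cb = ycc(:,:,2);
    if median(Y(:)) > lumThresh                 % high-luminance image
        skinMask = skinMask & (Cb < cbThresh);  % keep pixels on the skin side of the Cb line
    end                                         % low-luminance images keep the hair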

Figure 2-8: Skin (blue) vs. black hair (red) pixels.

The hair removal procedure gives satisfying results; one of the resulting binary images can be seen in Figure 3-1. The color segmentation scheme has several limitations. One limitation is that the facial features (mouth and eyes) in our training images fall in the same color space area as the skin color in HSV, YCbCr and RGB, as shown in Figures 2-9 and 2-10. It is therefore not possible to make these facial features more prominent in order to increase the correlation with the template. A similar limitation is observed with certain hair colors, such as light brown.

Figure 2-9: Plot of skin (blue) vs. eyes (green) vs. mouth (black) vs. fair hair (magenta) samples in YCbCr space.

Figure 2-10: Plot of skin (blue) vs. eyes (green) vs. mouth (yellow) vs. fair hair (cyan) vs. background (red) samples in HSV space.

Edge filtering was investigated as an alternative to removing the hair, since it highlights the face edges. Figure 2-11 shows that this methodology introduces too many black pixels within the face candidates (because edges within the face areas are highlighted too), which caused some faces to be divided in half. Among the three possibilities of edge filtering (horizontal, vertical or both), horizontal filtering gives the best results, because it introduces the fewest face divisions. However, none of the edge filtering options is good enough, so this technique was abandoned in favor of the hair removal technique.

Figure 2-11: All edge filtering, vertical filtering only and horizontal filtering only.

Overall, we are satisfied with the segmentation results, since they provide a good enough compromise across the different training images to successfully carry out the next steps.

3. Connected Component Analysis


The color segmentation generates a binary mask with the same size as the original image. Figure 3-1 shows an example of the binary mask generated from training image no. 5.

Figure 3-1: Binary Mask

The mask in Figure 3-1 includes most of the skin regions, such as faces, hands and arms. However, some regions with colors similar to skin also appear white: pseudo-skin pixels from clothes, floors and buildings. The goal of the connected component algorithm is to analyze the connectivity of the skin regions and identify the face candidates, which are described by rectangular boxes.

Ideally, each face is a connected region separated from the others. In some circumstances, however, two or even three faces are connected by ears or high-luminance hair. In addition, the pseudo-skin pixels are scattered and generate hundreds of connected components, which cost unnecessary computation if they are identified as face candidates. Therefore, pre-processing of the binary mask before the connected component analysis is necessary.

Figure 3-2 shows two faces that are connected. The connection is thin compared to the inside regions of the faces, so it can be broken by image morphology operations. In particular, row-direction and column-direction erosion operations are applied such that more pixels are eroded in the column direction, based on the observation that faces are usually connected horizontally. Within a face, the connection between the parts above and below the eyes is fragile, and it is desirable not to erode it. At the same time, the erosion operations act similarly to a median filter and remove pseudo-skin pixels, thanks to their scattered and weakly connected nature.

The light condition of the image plays an important role in the quality of the binary masks. Under strong light there tend to be more pseudo-skin pixels, which requires more erosion; but too much erosion makes the faces fall apart in weak-light images. Therefore, we perform the erosion adaptively, depending on the light condition of the image, which is determined from the median luminance value of all the pixels. For strong-light images, two additional column erosions are included. Between the first- and second-level erosions, holes are filled so that later erosions only act at the edges of the connected components and do not cause regions inside faces to fall apart.

The pre-processing that breaks the connection between faces in strong light conditions is shown in Figures 3-2 through 3-5. In particular, Figure 3-2 is the binary image from skin segmentation with two connected faces; Figure 3-3 is the result of the first-level column and row erosion; Figure 3-4 is the result of hole filling and the second-level column erosion; and Figure 3-5 is the result of the third-level erosion, with two separate connected components for the two faces.
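The pipeline can be sketched in Matlab as follows. The structuring-element sizes are illustrative, with the horizontal element wider so that side-by-side faces are disconnected first; the directions follow our reading of the report.

    % Adaptive erosion sketch (element sizes are illustrative).
    mask = imerode(skinMask, ones(1,7));   % first-level erosion, column direction
    mask = imerode(mask, ones(3,1));       % milder erosion in the row direction
    mask = imfill(mask, 'holes');          % fill holes before eroding further
    if strongLight                         % e.g. median(Y(:)) > lumThresh
        mask = imerode(mask, ones(1,5));   % second-level column erosion
        mask = imerode(mask, ones(1,5));   % third-level column erosion
    end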


Figure 3-2: Connected Faces

Figure 3-3: First Level Erosion

Figure 3-4: Hole Filling and Second Level Erosion

Figure 3-5: Third Level Erosion

The connected component analysis labels the pre-processed mask by looking at the connectivity of neighboring pixels. Each connected component is considered a potential face candidate, and a rectangular boundary box is computed for it. An adaptive scheme then filters out some non-face boxes based on size information: assuming that no face is larger or smaller than the median face size beyond some extent, it is possible to remove boxes with unreasonable area, width, height or width-to-height ratio. The remaining boundary boxes are considered face candidates and are passed to the template matching step. Figure 3-6 shows the face candidates obtained by applying the connected component analysis to the binary mask of Figure 3-1.
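A sketch of this filtering step is given below; the tolerance factors and aspect-ratio limits are illustrative placeholders for the tuned values.

    % Candidate extraction and size filtering sketch.
    labels = bwlabel(mask, 8);                     % label connected components
    stats  = regionprops(labels, 'BoundingBox');
    boxes  = cat(1, stats.BoundingBox);            % one [x y width height] row per box
    w = boxes(:,3);  h = boxes(:,4);
    keep = w > 0.4*median(w) & w < 2.5*median(w) & ...
           h > 0.4*median(h) & h < 2.5*median(h) & ...
           w./h > 0.4 & w./h < 2.0;                % plausible face proportions
    candidates = boxes(keep, :);                   % passed on to template matching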


Figure 3-6: Face Candidates

The above algorithm identifies all faces but one across the seven training images, and no candidate box contains two or more faces. The missed face is due to the dark glasses the person wears.

4. Template Matching
Template matching compares the face candidate image with a face template, measures the level of similarity and decides whether the candidate is a human face or a non-face. Several enhancements have been made to optimize the template-matching algorithm for the training images given by the EE368 instructors, and a multi-layer classification scheme has been implemented to avoid missing faces or accepting non-faces. Gray-level images are used for the template matching, because they experimentally gave the best results. The template matching algorithm loads the face and non-face template images and computes either the 2-dimensional (2-D) cross-correlation or the 2-D convolution. The face template is an image made by averaging all faces in the training images; Figure 4-1 shows it.

Figure 4-1: Face Template image

A few human faces are not detected if only one face template is used, because of the very different skin colors and face profiles found across subjects. Additional face templates are therefore used to detect the missing faces; Figure 4-2 shows the additional face template images.

Figure 4-2: Additional Face Template images

A few non-faces, such as hands or clothes with colors similar to skin, are detected as human faces if only face templates are used. To avoid this, hand templates have been created to remove hand-like non-faces.

Figure 4-3: Non-face Template images

After loading all template images, the median box size of the face candidates in the image under test is determined, and each face candidate is then analyzed one by one. No rotation of the face template or of the test candidate is performed, because it is not required by any face in the training images. If the face candidate box is similar to or smaller than the median face size, the candidate is resized to the face template size and the 2-D cross-correlation is applied. If the cross-correlation with a face template is greater than a predetermined threshold, the candidate is concluded to be a human face; otherwise it is a non-face. If the cross-correlation with a non-face template is greater than a predetermined threshold, the candidate is concluded to be a non-face. Moreover, the standard deviation of the gray pixel values is computed to remove non-faces with uniform skin-like color, such as clothes: if the standard deviation is less than a predetermined threshold, the candidate is concluded to be a non-face.
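This branch of the classifier can be sketched as follows; faceThresh, handThresh and stdThresh are placeholders for the predetermined thresholds, and the variable names are illustrative.

    % Small/median-box test sketch (thresholds are illustrative).
    cand  = imresize(candidateGray, size(faceTemplate));    % resize to template size
    score = corr2(double(cand), double(faceTemplate));      % 2-D correlation coefficient
    handScore = corr2(double(cand), double(handTemplate));  % non-face (hand) template
    isFace = score > faceThresh ...                         % similar enough to a face
          && handScore < handThresh ...                     % not a hand-like region
          && std(double(cand(:))) > stdThresh;              % reject uniform regions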

On the other hand, if the face candidate box is larger than the median face size, convolution template matching is used, because it is more likely that only part of the candidate box contains the face. There can still be a face inside a large box, because the candidate might consist of a face with long shining hair, or of two faces that are very close or even superposed. Applying the cross-correlation function does not work for large boxes, because the face would be resized to a very small size. To avoid missing a face inside a big box, the 2-D convolution is applied instead: the convolution between the inverse of the face template and the face candidate is carried out, and the peak of the convolution is computed and normalized by the face template weight. Figure 4-4 shows an example of a large box that contains a human face.

Figure 4-4: Large Size Image that contains a Human Face

If the peak value is greater than a predetermined threshold, it is concluded that the box contains one or more faces; otherwise the candidate is a non-face. The Matlab code could be improved by detecting the number of peaks above the threshold in order to find multiple faces inside a big box; this optimization was not made, because no case of multiple faces inside one box occurs in the training images.
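A sketch of the large-box test is given below. We read "convolution with the inverse of the face template" as conv2 with the spatially reversed template, which is equivalent to a sliding correlation; convThresh is an illustrative placeholder.

    % Large-box convolution test sketch.
    tpl  = double(faceTemplate) - mean(double(faceTemplate(:)));  % zero-mean template
    resp = conv2(double(candidateGray), rot90(tpl, 2), 'same');   % reversed-template convolution
    peak = max(resp(:)) / sum(abs(tpl(:)));      % normalize by the template weight
    isFace = peak > convThresh;                  % one or more faces in the box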

The following results were obtained by running our Matlab procedure and the instructors' evaluate.m script on Training Image no. 4. Figure 4-5 shows the resulting image: the green boxes represent detected human faces and the red boxes non-faces. Case numbers have been added to Figure 4-5 to guide the review of the result:

Case 1: a human face detected by cross-correlation with the face template of Figure 4-1.
Case 2: a human face detected by cross-correlation with an additional face template of Figure 4-2.
Case 3: a non-face (a hand) removed by cross-correlation with a non-face template of Figure 4-3.
Case 4: a non-face (a pair of pants) removed by the standard deviation requirement.
Case 5: a human face detected by convolution with the face template of Figure 4-1.
Case 6: a non-face removed by convolution with the face template of Figure 4-1.

This enhanced template-matching algorithm detects all human faces with no false alarms. Half faces are classified as non-faces; to detect them, one could determine whether the candidate box lies at the edge of the image and then use a half template for the cross-correlation.


Figure 4-5: Resulting Image. Green boxes are faces, red boxes are non-faces.


5. Conclusion
We have presented a face detection algorithm for color images that uses color segmentation, connected component analysis and multi-layer template matching. Our method uses the color information in HSV space, compensates for the luminance condition of the image, and overcomes the difficulty of separating connected faces using image morphology processing. Finally, an enhanced version of the template-matching algorithm detects the human faces and rejects non-faces such as hands and clothes. Experimental results show that our approach detects 164 out of 165 faces in the seven project training images (half faces are classified as non-faces); the one missed face is due to very dark glasses. No false alarms are raised on any of the seven images. The average run time on an ISE lab workstation is ~12 seconds. Future work will focus on verifying the algorithm's performance on general images and studying the modifications required to make it robust to any image.

6. References
[1] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 203-207, San Francisco, CA, 1996.
[2] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. A.I. Memo 1521, CBCL Paper 112, MIT, December 1994.
[3] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. In S. K. Nayar and T. Poggio, editors, Early Visual Learning, pages 99-130. Oxford Univ. Press, 1996.
[4] O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face detection using the Hausdorff distance. In Proc. Third International Conference on Audio- and Video-based Biometric Person Authentication, Halmstad, Sweden, 2001.
[5] K. Sobottka and I. Pitas. Looking for faces and facial features in color images. Pattern Recognition and Image Analysis: Advances in Mathematical Theory and Applications, Russian Academy of Sciences, 1996.
[6] C. Garcia, G. Zikos, and G. Tziritas. Face detection in color images using wavelet packet analysis. In Proceedings of the 6th IEEE International Conference on Multimedia Computing and Systems (ICMCS'99), June 7-11, 1999, Florence, pages 703-708.
[7] J.-C. Terrillon, M. David, and S. Akamatsu. Automatic detection of human faces in natural scene images by use of a skin color model and of invariant moments. In Proc. of the Third International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pages 112-117.
[8] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain. Face detection in color images. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pages 696-706, May 2002.
[9] C.-L. Chang, E. Li, and Z. Wen. Rendering novel views of faces using disparity estimation. Stanford EE368 Spring 2000/2001 Final Project.
[10] X. Mu, M. Artiklar, M. Artiklar, M. Hassoun, and P. Watta. Training algorithms for robust face recognition using a template-matching approach. In Proceedings of the IJCNN'01, Washington DC, July 15-19, 2001.


Appendix I: Detection results of our algorithm on seven EE368 training images

Training_1 Score: 23/23


Training_2 Score: 23/23

Training_3 Score: 24/24


Training_4 Score: 21/22

Training_5 Score: 26/26


Training_6 Score: 25/25

Training_7 Score: 22/22


Appendix II (Work breakdown)


The project was broken down into three parts, and each of us was the main person responsible for one part (Color Segmentation: Ana Bertran; From Binary Image to Face Candidates: Huanzhou Yu; Face vs. Non-Face Decision: Paolo Sacchetto). However, we each helped with the parts that needed extra work, so we all put the same amount of time and effort into the project. Each of us wrote one part of this report, but we all revised the final draft and modified each other's parts. The same applies to the slides.
