
4   GENDER CLASSIFICATION

4.4   RESULTS

The classification accuracies for the first experiments are shown in Table 4.3. The classification accuracy is the percentage of the faces that are correctly labeled as either male or female. The neural network and LUT Adaboost produced the best classification rates on average, but there were no great differences between the methods. Furthermore, the classification rate was slightly better on average for the “with hair” images, although there were two exceptions at the method level and there was no statistically significant difference between the “with hair” and “without hair” conditions (Wilcoxon signed-rank test: z6 = 0.734, p = 0.463).
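As an aside, the signed-rank statistic behind such a test can be computed directly from the six paired accuracies of Table 4.3. The following is a minimal pure-Python sketch, not the analysis code used in this work; it assumes no zero differences and no tied absolute differences, which holds for these values.

```python
def signed_rank_W(a, b):
    """Wilcoxon signed-rank statistic W: the smaller of the positive
    and negative rank sums of the paired differences b - a.
    Assumes no zero differences and no tied absolute differences."""
    d = [y - x for x, y in zip(a, b)]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0] * len(d)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_pos = sum(r for r, di in zip(ranks, d) if di > 0)
    w_neg = sum(r for r, di in zip(ranks, d) if di < 0)
    return min(w_pos, w_neg)

# Paired "without hair" / "with hair" accuracies from Table 4.3
without_hair = [92.22, 88.89, 86.67, 88.89, 88.33, 80.56]
with_hair = [90.00, 82.00, 90.00, 93.33, 90.00, 92.00]
print(signed_rank_W(without_hair, with_hair))  # W = 7
```

Converting W to the z-score and p-value reported above uses the normal approximation, which is only a rough fit at n = 6 pairs.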

ROC curves were briefly introduced in Subsection 2.6.1 in the context of face detection, but a ROC curve can be drawn for a gender classifier, too. The ROC curves are shown for the “without hair” images in Figure 4.3 and for the “with hair” images in Figure 4.4. The curve is drawn for a method by varying the threshold value that determines the classification. For example, with the neural network the possible output values were between -0.5 and 0.5. Moving the threshold little by little from -0.5 to 0.5 therefore changes the fractions of faces classified as male and female. The closer the threshold is to -0.5, the more female faces are classified as female, but at the same time more male faces are also classified as female. The fraction of males classified correctly is presented on the y-axis and the fraction of females classified incorrectly on the x-axis. For example, a curve point at the coordinates (x, y) = (0.21, 0.84) means that 21% of the females are classified incorrectly when 84% of the males are classified correctly. The greater the area under the ROC curve, the better the method is at classifying gender.

The perfect curve would go from the lower left corner to the upper left corner and from there to the upper right corner.
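The threshold-sweeping procedure described above can be sketched in a few lines of code. This is an illustrative sketch with made-up scores, not the thesis implementation; the threshold range matches the neural network outputs mentioned above, and a face is assumed to be classified as male when its score reaches the threshold.

```python
def roc_points(scores, labels, thresholds):
    """One ROC point per threshold: x = fraction of females classified
    incorrectly (as male), y = fraction of males classified correctly.
    A face is classified as male when its score reaches the threshold."""
    males = [s for s, l in zip(scores, labels) if l == "male"]
    females = [s for s, l in zip(scores, labels) if l == "female"]
    points = []
    for t in thresholds:
        y = sum(s >= t for s in males) / len(males)
        x = sum(s >= t for s in females) / len(females)
        points.append((x, y))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Made-up classifier outputs in [-0.5, 0.5] for illustration
scores = [0.4, 0.3, -0.2, 0.1, -0.4, -0.3]
labels = ["male", "male", "male", "female", "female", "female"]
pts = roc_points(scores, labels, [-0.6, 0.0, 0.6])
print(pts, auc(pts))
```

Sweeping many closely spaced thresholds instead of three would trace out the full curve, and the area returned by `auc` is the quantity used above to compare methods.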


Table 4.3. Classification accuracies for the classifiers with the face images with and without hair in the first experiment.

                                    Classification accuracy %
Method                    without hair (24*24)   with hair (32*40)   Average
Neural Network            92.22%                 90.00%              91.11%
SVM                       88.89%                 82.00%              85.45%
Threshold Adaboost        86.67%                 90.00%              88.34%
LUT Adaboost              88.89%                 93.33%              91.11%
Mean Adaboost             88.33%                 90.00%              89.17%
LBP + SVM                 80.56%                 92.00%              86.28%
Average                   87.59%                 89.56%              88.57%

The ROC curves show that the Adaboost variants in particular gave very similar performance. However, the SVM with pixel based input and the multi-layer perceptron with pixel based input also performed fairly similarly for both the with hair and without hair images. The SVM with LBP features performed clearly worse than the other methods when the without hair images were used, but slightly better than the other methods when the with hair images were used. The findings indicate that gender classification performance may depend more on the features used than on the actual classifier. In addition, since LUT Adaboost produced rates similar to the other Adaboost classifiers, it seems that gender classification of frontal, manually aligned faces is a linearly separable problem, at least when Haar-like features are used, and the threshold Adaboost and mean Adaboost classifiers can be used for the task.

Figure 4.3. ROC curves for images without hair (24*24 size images). (a) ROC curves for the SVM with pixel based input, for the SVM with LBP features, and for the multi-layer perceptron. The top left part of the curve is zoomed on the right. (b) ROC curves for the mean Adaboost, for the threshold Adaboost, and for the LUT Adaboost. The top left part of the curve is zoomed on the right.


Figure 4.4. ROC curves for images with hair (32*40 size images). (a) ROC curves for the SVM with pixel based input, for the SVM with LBP features, and for the multi-layer perceptron. The top left part of the curve is zoomed on the right. (b) ROC curves for the mean Adaboost, for the threshold Adaboost, and for the LUT Adaboost. The top left part of the curve is zoomed on the right.

The results of the second part of the experiments are presented next. While the first part of the experiments concentrated on the effects of excluding and including hair in the face images, the second part explored the effect of variations in face image quality that may be present, for example, when automatic face detection precedes gender classification.


The effects of rotation and scale averaged over all three face image sizes (24*24, 36*36, and 48*48) for the four classifiers are shown in Figure 4.5 and in Figure 4.6.


Figure 4.5. Effect of rotation on the gender classification rates when rates have been averaged over all image sizes.


Figure 4.6. Effect of scale on the gender classification rates when rates have been averaged over all image sizes.

Considering the effect of rotation, Adaboost with Haar-like features seems to be the method most resistant to rotation variations. Although the best classification rates were achieved with the SVM when image pixels were used as input, its classification accuracy fell quickly when the face orientation was changed.

The results for scaling are also interesting. The most striking and most surprising result is the high classification accuracy for Adaboost and for the SVM with pixel based input with a scaling factor close to 3. Also interesting are the below-chance classification accuracies for all the methods when a scaling factor between 0.3 and 0.6 or close to 0.2 was used. There is no obvious reason for these peaks and pitfalls in the performances.

The effect of rotation and scaling averaged over the four methods is shown in Figure 4.7 and in Figure 4.8. As can be seen in the two figures, there were no large differences between the image sizes when orientation and scale were varied.


Figure 4.7. Effect of rotation on the gender classification rates when rates have been averaged over all classification methods.


Figure 4.8. Effect of scale on gender classification rates when rates have been averaged over all classification methods.

The effect of the translation on the classification accuracy with different image sizes is shown in Figure 4.9. The accuracies are averages calculated from the accuracies of all the methods.

Figure 4.9. Effect of translation on classification accuracy with different image sizes. (a) 24*24 size images. (b) 36*36 size images. (c) 48*48 size images. (d) Average over all image sizes (and over all classifiers).

The accuracy decreased slightly faster for the 24*24 size face images than for the other two face image sizes, as could be expected, because the same translation offset is relatively larger for the smaller image size.
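The size dependence is simple arithmetic, illustrated below with a hypothetical 4-pixel shift (not a value taken from the experiments):

```python
# A hypothetical 4-pixel shift expressed as a fraction of image width:
# the same absolute offset is a relatively larger translation for the
# smallest image size, which explains the faster accuracy drop.
shift = 4
for size in (24, 36, 48):
    print(f"{size}x{size}: {shift / size:.1%}")  # 16.7%, 11.1%, 8.3%
```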

The effects of rotation, scale, and translation on the specific methods with specific image sizes are shown in the figures of Appendix 2. On scrutinizing those figures, one notices that the classification performance curves zigzag a little, although the overall trend is almost always what one would expect. The zigzag itself is to be expected because, for example, a small change in rotation has only a little effect on the face, and with such a small change a classifier may classify a few more faces correctly simply by chance.


4.5 DISCUSSION

Perhaps the most interesting finding was that the features may affect gender classification performance more than the machine learning method. This was indicated by the fact that the differences in classification rates between methods were smallest when the methods used the same features. In many earlier studies (Shakhnarovich et al., 2002; Sun et al., 2002b; Wu et al., 2003a) the differences between some of the methods tested have also been relatively small, which further supports the importance of selecting proper features for the classification.

The threshold Adaboost with Haar-like features was clearly more resistant to rotation variation than the other classifiers. The reason may be the Adaboost classifier itself, the Haar-like features, or both may contribute. The study by Baluja and Rowley (2007) suggests that at least the Adaboost classifier is important: they compared the performance of an SVM classifier and an Adaboost classifier, and the Adaboost classifier proved more resistant to rotation variation. The features they used with Adaboost were Boolean comparison values between image pixels, while the SVM had pixel based input.

Moghaddam and Yang (2000) showed that face image size is not an important factor for SVM classification performance. The results of these experiments support their finding; in fact, it seems that size is not an important factor for any of the classifiers. The effect of the translation offset was indeed larger for the smaller face image size, but alignment errors usually occur before the face image is resized.

Including hair in the face images improved classification performance, but not by much. Abdi et al. (1995) reported that, for example, face shape affects the classification in addition to hair. In light of these facts it seems that some improvement in performance can be achieved when the face outline and hair are included in the face image.

There were also some indications that gender classification with well aligned frontal faces may be a linearly separable classification problem.

The Adaboost variant did not affect classification performance. In addition, when the optimal number of hidden nodes for the neural networks was searched for, networks with one hidden node produced results nearly equal to networks with two hidden nodes.

An interesting issue is also the effect of face image scaling on classifier performance. The peaks in performance for the SVM with pixel based input and for Adaboost with a scaling factor of around 3 can hardly be explained by chance alone. There were also scaling factors that produced less than 50% accuracy for all classifiers, which is poor for a two-class classification problem. Furthermore, Baluja and Rowley (2007) reported similar, although not as strong, peaks and pitfalls in classifier performances when the scaling factor was varied. The issue requires further study.

4.6 SUMMARY

In this chapter gender classification was studied experimentally from different perspectives. Gender classification was studied with manually aligned face images and with various gender classifiers. The gender classification accuracy improved slightly when hair was included in the face images. In addition, the features used with the classifier seemed to be a more important factor for classification accuracy than the type of classifier. Furthermore, the results indicate that gender classification may be a linearly separable problem when Haar-like features are used with the classifier. The sensitivity of the classification to image rotation, scale, and translation was also studied. The most interesting finding was that Adaboost was more resistant than the other methods when a face was rotated. Another interesting issue was that the classification accuracy was in some cases unexpectedly high or low when the scale of the face was changed.

The results reported here help to put the experiments described in the next chapter in a wider context. The topic of the next chapter is how to use automatic face detection with gender classification. Such topics as classification reliability with low quality images and classification speed are considered. These issues are important, for example, from the viewpoint of perceptual user interfaces.

5 Combining Face Detection and