

3. TECHNICAL BACKGROUND

3.4 Other Methods

3.4.1 Convolutional Neural Networks to Classify Facial Palsy

Convolutional neural networks (CNN) have become a central solution method, as mentioned earlier, in landmark detection and tracking [53], but also in image and video recognition in general [93]. This success is due, inter alia, to increased computational performance, especially through the use of graphics processing units (GPU), large open-access image databases such as ImageNet, and an annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [93, 94].

ML models [95–97] and CNNs [67, 98] have been used to rate facial palsy. However, Sajid et al. [67] referred to [95–98] and reported that, to the best of their knowledge, their study was the first to grade facial paralysis on a large dataset. Sajid et al. used 2000 facial images and augmented them further to obtain 7000 facial pictures. The augmentation method produced different levels of paralysis from a single image. The substantial number of images containing varying levels of palsy addresses a common issue in facial palsy grading: repeatability. Also, in the case of CNNs, adequate training data helps to prevent overfitting. [67]

Additionally, Sajid et al. [67] emphasized their automatic feature selection and its accuracy-improving effect on the underlying classification task. In more detail, they juxtaposed in their discussion the automatic feature selection performed by CNNs with human-based, or hand-crafted, feature picking (representative studies of that category referred to by [67] include [95–97]). Their conclusion was that hand-crafted methods may easily choose the wrong features for the task, here facial palsy evaluation, and thus result in poorer accuracy. The rest of this section concentrates on the method developed and presented by Sajid et al. in [67]. This choice is based on the discussion above; their recent method, published in 2018, uses a large amount of adequate images and CNN-based feature selection.

Figure 3.10 summarizes the five facial palsy classification steps of Sajid et al. [67].

Figure 3.10 An illustration of the facial palsy classification procedure's major steps. Image is taken from [67].

It is worthy of comment that Sajid et al. [67] used the range 0 to V, whereas House and Brackmann intended their system to range from I to VI. This systematic shift should be kept in mind when reading further.

The first step marked in Figure 3.10 is image acquisition. Sajid et al. [67] acquired 2000 annotated and labelled images from several sources. The labelling utilized the widespread House-Brackmann scale, and the formed dataset had images of each House-Brackmann palsy level excluding normal. The second step shown in Figure 3.10 is preprocessing. In more detail, Sajid et al. rotated all the images into an upright position, converted them into grayscale, performed histogram equalization, cropped them to certain dimensions, and applied a filter. Data augmentation was the third step. The authors' decision to keep the whole pipeline automatic, and thus augment the dataset blindly, limited the choice of augmentation method. A novel generative adversarial network (GAN) method, proposed in [99], was selected for image synthetization. [67] The GAN is based on an adversarial procedure: the generator tries to fool the other model, the discriminator, and to maximize the discriminator's errors. The discriminator, on the other hand, tries to differentiate whether the data came from the original sample or from the generator. [67, 99] The GAN procedure generated every House-Brackmann scale level of facial palsy based on the original picture [67]. The fourth step shown in Figure 3.10 is further illustrated in Figure 3.11.
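The grayscale conversion and histogram equalization mentioned in the preprocessing step can be illustrated with a short sketch. The snippet below is a minimal NumPy example, not the authors' actual pipeline; the image array, its dimensions, and the luminosity weights are illustrative assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (luminosity weighting)."""
    return (rgb @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

def equalize_histogram(gray):
    """Spread the grayscale intensities over the full 0-255 range
    by mapping each pixel through the cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_masked = np.ma.masked_equal(cdf, 0)  # ignore empty bins
    span = cdf_masked.max() - cdf_masked.min()
    cdf_scaled = (cdf_masked - cdf_masked.min()) * 255 / span
    lut = np.ma.filled(cdf_scaled, 0).astype(np.uint8)
    return lut[gray]

# Hypothetical low-contrast 64 x 64 RGB image.
rng = np.random.default_rng(0)
rgb = rng.integers(100, 140, size=(64, 64, 3), dtype=np.uint8)
gray = to_grayscale(rgb)
equalized = equalize_histogram(gray)
print(gray.min(), gray.max(), equalized.min(), equalized.max())
```

After equalization the narrow intensity band is stretched to cover the full 0 to 255 range, which is what makes a low-contrast facial image easier for later stages to process.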
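As a toy illustration of the adversarial procedure (not the face-synthesizing GAN of [99]), the sketch below trains a one-dimensional GAN with NumPy: a linear generator tries to match a Gaussian target distribution while a logistic discriminator tries to tell real samples from generated ones. All parameter names, learning rates, and the target mean of 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Real data: samples from N(3, 1). Generator: G(z) = t0 + t1 * z, z ~ N(0, 1).
# Discriminator: D(x) = sigmoid(w * x + b).
t0, t1 = 0.0, 1.0      # generator parameters
w, b = 0.0, 0.0        # discriminator parameters
lr, batch = 0.05, 64

initial_gap = abs(t0 - 3.0)
for _ in range(3000):
    z = rng.standard_normal(batch)
    real = 3.0 + rng.standard_normal(batch)
    fake = t0 + t1 * z

    # Discriminator: gradient ascent on log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += lr * np.mean((1 - d_real) * real - d_fake * fake)
    b += lr * np.mean((1 - d_real) - d_fake)

    # Generator: gradient ascent on the non-saturating objective log D(fake).
    d_fake = sigmoid(w * fake + b)
    grad_fake = (1 - d_fake) * w
    t0 += lr * np.mean(grad_fake)
    t1 += lr * np.mean(grad_fake * z)

final_gap = abs(t0 - 3.0)
print(round(t0, 2), round(final_gap, 2))
```

The generator's offset drifts toward the real mean because fooling the discriminator requires producing samples that look real; in Sajid et al.'s pipeline the analogous competition, operating on images instead of scalars, drives the generator to synthesize faces with realistic palsy levels [67, 99].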

Figure 3.11 The CNN structures responsible for feature learning (C1 and C2) and palsy grading (C3) illustrated. Image is taken from [67].

Figure 3.11 lays out the details of the fourth step, feature learning. The key point is that there were three neural networks: the first two, labelled C1 and C2 in the figure, each consisted of two convolutional and two max pooling layers and were responsible for feature encoding. [67] According to Sajid et al. [67], the C1 and C2 structures were described by Brachmann and Redies in [100]. Brachmann and Redies proposed a CNN method to measure image symmetry, and Sajid et al. modified the method to suit automatic facial paralysis feature encoding by utilizing two CNNs, C1 and C2, instead of a single CNN. A facial picture and its mirror image were then used as inputs to the two CNNs by Sajid et al. in [67].
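The two-stream idea can be sketched in a few lines. The toy encoder below is a stand-in for the convolutional and max pooling stacks of C1 and C2: for simplicity it uses only fixed 2 x 2 max pooling instead of learned convolutions, and the image size is hypothetical. The point is the structure, namely that the face and its horizontal mirror pass through parallel encoders whose outputs are joined.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling on an H x W array."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def encode(img):
    """Toy feature encoder: two pooling stages stand in for the
    conv + max pool pairs of C1/C2; the result is flattened."""
    return max_pool_2x2(max_pool_2x2(img)).ravel()

rng = np.random.default_rng(1)
face = rng.random((32, 32))   # hypothetical grayscale face image
mirror = np.fliplr(face)      # its horizontal mirror image

# C1 receives the face, C2 its mirror; their outputs are concatenated
# before being fed to the grading network (C3 in Figure 3.11).
features = np.concatenate([encode(face), encode(mirror)])
print(features.shape)
```

In the actual method the encoders contain learned convolution filters, so the two streams can capture asymmetries between the facial halves, which is exactly the signal that distinguishes palsy grades.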

After the second max pooling, the C1 and C2 outputs were connected to the third CNN denoted by C3 in Figure 3.11. The purpose of this third neural network was to serve as a deep architecture framework for the paralysis classification. [67]

The third CNN was a pre-trained network called VGG-16, originally trained with ImageNet pictures [67, 101]. Sajid et al. [67] chose the VGG-16 as their classification framework due to its excellent performance reported in [93]. The VGG-16 architecture has 16 weight layers, 13 of which are convolutional, and 5 pooling layers [101].

The fifth step of Figure 3.10 is classification. The softmax layer of the VGG-16 marked in Figure 3.11 was modified by Sajid et al. [67] to contain five neurons: one for each used House-Brackmann palsy grade. Thus, each of the five neurons produced a posterior probability describing the likelihood of the image belonging to the corresponding House-Brackmann grade. During training, the error at this softmax layer drove the adjustment of the neuron weights by backpropagation. [67]
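A five-neuron softmax output is simple to compute directly. The sketch below uses illustrative raw scores, not values from [67], and turns the five neuron outputs into posterior probabilities over the palsy grades:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities; shifting by the maximum
    keeps the exponentials numerically stable."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw outputs of the five grade neurons for one image.
logits = np.array([1.2, 0.3, 2.5, -0.8, 0.1])
posterior = softmax(logits)
predicted_grade = int(np.argmax(posterior))  # index of the most likely grade
print(np.round(posterior, 3), predicted_grade)
```

Because the outputs sum to one, they can be read as posterior probabilities, and the predicted grade is simply the neuron with the largest value.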

To be able to produce trustworthy results, Sajid et al. [67] divided their collected dataset into three subject-exclusive groups at the very beginning: training, validation, and testing subsets. From the 2000 original images, 1000 were extracted for training. After augmentation, the training dataset contained 6000 images with varying palsy levels. The remaining 1000 original images were then split equally for validation (500) and testing (500). The validation and testing data were not augmented. [67]
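Subject exclusivity means that all images of one person land in exactly one subset, so the model is never tested on a face it has seen during training. A minimal sketch of such a split follows; the subject identifiers, file names, and counts are hypothetical, not the actual split of [67].

```python
import random

# Hypothetical mapping: subject id -> that subject's image file names.
images_by_subject = {
    f"subject_{i:03d}": [f"subject_{i:03d}_{j}.png" for j in range(4)]
    for i in range(100)
}

subjects = sorted(images_by_subject)
random.Random(0).shuffle(subjects)

# Split by SUBJECT, not by image, to keep the groups subject-exclusive.
train_subj = subjects[:50]
val_subj = subjects[50:75]
test_subj = subjects[75:]

train = [img for s in train_subj for img in images_by_subject[s]]
val = [img for s in val_subj for img in images_by_subject[s]]
test = [img for s in test_subj for img in images_by_subject[s]]
print(len(train), len(val), len(test))
```

Splitting by image instead of by subject would leak near-duplicate faces between training and testing and inflate the reported accuracy, which is why the split is done at the subject level first.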

The results were then expressed in terms of average accuracy percentage, the F1 score, which also describes accuracy, precision, and sensitivity. The performance of the CNN system was reported to be consistent and superior to existing methods. Thus, Sajid et al. [67] concluded with a suggestion to apply their method at wide scale, for example in health care. To summarize, the developed CNN-based system took advantage of novel methods developed by others by modifying them: a GAN to automate augmentation, pre-trained CNNs to extract features automatically, and the VGG-16 to grade them. Sajid et al. acquired a large dataset, avoided overfitting, addressed repeatability, and developed a scalable narrow artificial intelligence (AI) to House-Brackmann grade facial palsy from 2D images. [67]
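For reference, the reported metrics relate to the per-class prediction counts as follows. The sketch below computes precision, sensitivity (recall), and the F1 score for one class from hypothetical counts; these numbers are illustrative, not results from [67].

```python
# Hypothetical counts for one House-Brackmann grade treated as the
# positive class: true positives, false positives, false negatives.
tp, fp, fn = 80, 10, 20

precision = tp / (tp + fp)       # share of predicted positives that are correct
sensitivity = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(precision, 3), round(sensitivity, 3), round(f1, 3))
```

The F1 score is the harmonic mean of precision and sensitivity, which is why it is described above as reflecting accuracy: it penalizes a classifier that trades one of the two away for the other.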