
3. TECHNICAL BACKGROUND

3.4 Other Methods

3.4.2 Facial Action Coding System

The Facial Action Coding System (FACS) was developed in the 1970s by Ekman and Friesen and revised by Ekman et al. in 2002 [102, 103], and has been described as "a taxonomy of human facial expressions" [103]. FACS divides the face into its muscle components, or action units (AUs), and describes every facial expression as a combination of these numbered AUs [102]. For example, happiness is expressed with AUs 6 (cheek raiser), 12 (lip corner puller) and 25 (lips part), while fear may include AUs 1 (inner brow raiser), 2 (outer brow raiser), 4 (brow lowerer), 5 (upper lid raiser), 20 (lip stretcher), 25, 26 (jaw drop), and 27 (mouth stretch) [102, 103]. As mentioned, these number-coded AUs and their FACS names all have a listed muscular basis; for example, AU 1 involves the muscles frontalis and pars medialis, and AU 12 the zygomaticus major [102]. An advantage of FACS is that, as its components are sets of facial muscles, any facial expression can be communicated with FACS simply as a combination of these elemental units, the AUs [102].
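The AU-based coding described above can be illustrated as a simple data structure. The following sketch represents expressions as sets of numbered AUs; the AU numbers and names are taken from the examples in the text, while the dictionary layout and the `describe` helper are illustrative only:

```python
# AU numbers and names from the examples in the text.
AU_NAMES = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
    5: "upper lid raiser",
    6: "cheek raiser",
    12: "lip corner puller",
    20: "lip stretcher",
    25: "lips part",
    26: "jaw drop",
    27: "mouth stretch",
}

# A prototypical expression is a combination (here: a set) of AUs.
EXPRESSIONS = {
    "happiness": {6, 12, 25},
    "fear": {1, 2, 4, 5, 20, 25, 26, 27},
}

def describe(expression: str) -> list:
    """Return the human-readable AU descriptions of an expression."""
    return [f"AU {au} ({AU_NAMES[au]})" for au in sorted(EXPRESSIONS[expression])]
```

With this layout, `describe("happiness")` yields the three AU descriptions listed in the text, in AU-number order.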

FACS also enables the measurement of intensity, dynamics and symmetry [45, 103], which makes FACS interesting for facial palsy measurement. Intensity is measured on a five-point ordinal scale from A to E; A corresponds to a trace of movement and E to maximum action [102]. Dynamics refer to the temporal aspects of facial motion: whether the expression is beginning or disappearing, the durations of the different phases of the expression, the symmetry of individual AUs, and the interaction between different AUs [103]. Martinez et al. [103] discussed studies on the differences between voluntary and involuntary expressions (in other words, faked and spontaneous expressions) and summarized that dynamics reveal the difference. Thus, FACS is comprehensive [102, 103] and has been used in a variety of psychology and neuroscience studies [103]. However, FACS is very laborious to learn: 100 hours of practice have been reported as necessary to reach a satisfactory scoring skill level [102], and even after mastering FACS it may take hours to manually score a one-minute video [45].
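The five-point intensity scale can be sketched as a small ordinal mapping. In the sketch below, the `parse_score` helper and the combined AU-number-plus-letter notation (e.g. `"12C"`) are written for illustration only:

```python
# The FACS five-point ordinal intensity scale: A (trace) ... E (maximum action).
INTENSITY = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def parse_score(code: str) -> tuple:
    """Split a combined score such as '12C' into (AU number, intensity level).

    The AU-number-plus-letter notation used here is illustrative.
    """
    return int(code[:-1]), INTENSITY[code[-1]]
```

For example, `parse_score("12C")` reads as "lip corner puller at the middle intensity level".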

Thus, automating the FACS scoring process would speed up research and make FACS available without the labour-intensive training [103]. It has been suggested that an automated scoring method could also improve the reliability, precision, reproducibility and dynamic measures of FACS [103, 104]. Martinez et al. [103] reviewed current solutions for machine analysis of FACS. Their survey is summarized in Figure 3.12, which shows the main steps of a generic automated analysis pipeline (pre-processing, feature extraction, and analysis) and their substeps and/or approach types. The steps are not elaborated further here, as the review considered 2D methodology.

Figure 3.12 The main steps and substeps of the 2D automated FACS solutions reviewed in [103]. It is worth mentioning that most feature extraction methods listed in this figure are so-called hand-crafted feature extraction methods, in contrast to CNN-based feature extraction (see Subsection 3.4.1). Figure is taken from [103].
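The three main steps of the generic pipeline (pre-processing, feature extraction, analysis) can be sketched as a chain of functions. Every function body below is a hypothetical placeholder standing in for the substeps surveyed in [103], not an implementation of any particular method:

```python
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Placeholder for face detection, registration and illumination
    normalisation; here simply zero-mean, unit-variance normalisation."""
    return (frame - frame.mean()) / (frame.std() + 1e-8)

def extract_features(face: np.ndarray) -> np.ndarray:
    """Placeholder for a hand-crafted appearance descriptor;
    here a 16-bin intensity histogram."""
    return np.histogram(face, bins=16, range=(-3.0, 3.0))[0].astype(float)

def analyse(features: np.ndarray) -> dict:
    """Placeholder AU detection step: one boolean decision per modelled AU."""
    threshold = features.mean()
    return {au: bool(features[i] > threshold)
            for i, au in enumerate([1, 2, 4, 6, 12, 25])}

# The generic pipeline: pre-processing -> feature extraction -> analysis.
frame = np.random.default_rng(0).normal(size=(64, 64))
detections = analyse(extract_features(preprocess(frame)))
```

The point of the sketch is the dataflow, not the placeholder computations: each stage could be swapped for any of the concrete approaches listed in Figure 3.12.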

Martinez et al. [103] concluded that there have been major advances towards automatic (and real-time) FACS scoring based on 2D data, but that the detection of AU segments and their intensities remains an open problem. The central open issues they listed include, inter alia, occlusion handling, non-frontal head poses, co-occurring speech, varying illumination conditions, and the detection and grading of low-intensity AUs. Additionally, Martinez et al. [103] highlighted the need for open-source 2D/3D databases; a realistic and natural in-the-wild database is required to avoid algorithmic local maxima.

Sandbach et al. [84] reviewed 3D methods and similarly concluded that the development of databases is central to moving 3D facial analysis beyond its then-infant stage. More specifically, Sandbach et al. mentioned the need for spontaneous, natural and dynamic data that would contain complex states.

They also suggested that progress in 3D data acquisition should increase the quality and the number of such 3D databases. [84] This urgent requirement has recently been addressed by Liu et al. [91], who proposed an AU-level synthesis framework based on a geometric model and an adversarial environment. They acknowledged the need for a large-scale database in order to employ deep-learning approaches to rate, for example, AU intensities, and thus suggested an augmentation method.

Another recent study, conducted by Romero et al. [92], diverged from the current state-of-the-art approach to automating facial expression analysis (FACS scoring, to be more precise). Romero et al. proposed a novel CNN-based solution that detects AUs from multi-view videos by directly analysing the whole human face; thus, the standard first step of FACS automation, landmark localisation, is skipped. The system by Romero et al. is called AUNets and consists of several modules: optical flow computation, an ensemble of AU detectors, and a viewpoint classifier. The optical flow field of the input video captures the shared and individual appearances of the AUs, the viewpoint classifier selects suitable AU detectors from the ensemble based on the viewing angle of the input video, and finally the CNN-based detectors predict the presence of the AUs. The method was concluded to be flexible due to its modular design, efficient, and to advance the solution of the problem. However, the listed limitations of the approach were the large number of parameters and the long time required to train them. [92] An example result is visualized in Figure 3.13.
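The modular dataflow of AUNets described above can be sketched schematically. All function names and the trivial computations below are hypothetical stand-ins for the actual modules (optical flow computation, viewpoint classification, view-specific CNN detectors), not the authors' implementation:

```python
# Schematic AUNets-style dataflow; each video is a list of scalar "frames"
# purely for illustration.

def optical_flow(frames: list) -> list:
    """Stand-in for optical flow: here just frame-to-frame differences."""
    return [b - a for a, b in zip(frames, frames[1:])]

def classify_viewpoint(frames: list) -> int:
    """Stand-in viewpoint classifier: returns one of three view indices."""
    return int(abs(sum(frames))) % 3

# One detector per (viewpoint, AU) pair; a real detector would be a CNN
# that maps the flow field to a presence probability.
DETECTORS = {
    view: {au: (lambda flow, au=au, view=view:
                min(1.0, abs(sum(flow)) / (au + view + 1)))
           for au in (1, 2, 4, 6, 12, 25)}
    for view in range(3)
}

def aunets(frames: list) -> dict:
    """Route the optical flow of the input to the view-specific detectors."""
    flow = optical_flow(frames)
    view = classify_viewpoint(frames)
    return {au: detect(flow) for au, detect in DETECTORS[view].items()}

probabilities = aunets([0.0, 0.5, 1.0, 0.5])
```

The design choice the sketch illustrates is the routing: the viewpoint classifier selects which member of the detector ensemble sees the flow field, so each detector can specialize to one head pose.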

Figure 3.13 An example result of AUNets, showing the name and code of the detected AU, and also the probability of its presence. AUNets was trained to recognize 12 different AUs. Figure is taken from [92].

The fact that FACS has a physiological basis [92, 102] eases the development of computer vision solutions; it makes it possible to concentrate on the core of the task [92].

Solving the automated facial expression analysis task would be a step towards high-level human-computer interaction [92], and would have a major effect on the games, security and health industries [103].