3.2.2 Automatic Landmark Detection

The automatic detection of facial landmarks can be divided into two categories according to how the knowledge about the landmarks is presented. The knowledge may be presented as empirical rules that have been observed to provide acceptable results, or by building a machine learning (ML) model. [54] This subsection concentrates on the machine learning approach.

Numerous algorithms have been developed to automatically detect facial landmarks from images and videos. These methods can be divided into three main categories that differ in how they utilize the available information: holistic methods, constrained local model (CLM) methods, and regression-based methods, which include, among others, deep learning techniques. The training data may be gathered in controlled conditions or in the wild, where facial expressions, head poses, facial occlusion, and environmental conditions such as illumination are not controlled. [53] Wu et al. [53] reported in 2018 that current facial landmark research concentrates on landmark detection in the wild, although a robust method is still to be developed. Recently, regression-based deep learning methods, namely Convolutional Neural Network (CNN) models that follow direct regression, have shown the highest performance in landmark detection and tracking. [53]

However, the mentioned categories do not include any 3D methods, although the importance of having 3D data was discussed in Section 3.1. Two factors hinder the development of 3D landmark detection. Firstly, there is a lack of 3D face scan databases. This is partially because 3D landmarking is new compared to 2D landmark detection, but 3D face scans are also harder to obtain than 2D images or video. Secondly, 3D landmark labelling is typically more difficult than 2D labelling.

Nevertheless, several 3D landmark detection methods have been reported, and they can be classified based on the dimensionality of the input data. There are methods that utilize 2D images and videos, and methods that operate on 3D face scans; both types yield three-dimensional landmark coordinates. [53] In order to obtain 3D landmark coordinates from two-dimensional videos and images, the current methods either utilize the limited 3D training data available [53, 68] or use a pre-trained facial shape model [53, 69, 70], and combine these with the ML approach. Table 3.1 lists the methods of obtaining 3D coordinates from 2D input data, and those methods are explained further after the table.

Table 3.1 A summary of the methods of detecting 3D coordinates from 2D facial data such as images or video.

Method Source

Tulyakov and Sebe [68] estimated 3D facial shape from two-dimensional images by using limited 3D training data to predict the necessary depth information. The novelty of the method was its single-step nature: the 3D information was used in the learning pipeline to produce 3D feature indexing. Thus, as the depth was taken into consideration at every level of the cascade, there was no need for further phases to obtain the depth. Another contribution worth mentioning is that Tulyakov and Sebe estimated all landmarks under consideration even in cases of self-occlusion, instead of ignoring the occluded landmark or taking the nearest visible point as a replacement. [68]

Another approach to obtaining 3D facial landmark coordinates from two-dimensional images or videos is to use the limited 3D data to build a 3D shape model, detect 2D landmarks, and then fit those estimated landmarks to the 3D shape model to obtain the 3D facial landmarks [53]. Examples of this approach are provided by Gou et al. [69] and Jeni et al. [70]. In 2016, Gou et al. suggested a novel 3D face alignment method that used a 3D deformable shape model. Their method had two steps: first, the 2D landmarks were detected with regression, and then the correct 3D shape was estimated based on the initial deformable model and the obtained landmarks [53, 69]. Jeni et al., on the other hand, used a dense model, and thus their step after obtaining the 2D landmarks was to iterate the 2D coordinates to fit the pre-trained rigid model [53, 70].
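The two-step idea of detecting 2D landmarks first and fitting them to a 3D shape model afterwards can be illustrated with a minimal sketch. It is not the implementation of Gou et al. [69] or Jeni et al. [70]. It assumes a linear shape model (a mean shape plus a few deformation modes), an orthographic projection along the depth axis, and a frontal pose with unit scale, so that only the shape coefficients have to be estimated. The function name `fit_shape_to_2d`, the `ridge` parameter, and the synthetic data are hypothetical.

```python
import numpy as np

def fit_shape_to_2d(landmarks_2d, mean_shape, basis, ridge=1e-6):
    """Estimate 3D landmarks from 2D landmarks with a linear shape model.

    landmarks_2d: (N, 2) detected 2D landmarks
    mean_shape:   (N, 3) mean 3D shape of the model
    basis:        (K, N, 3) deformation modes of the model
    Assumption: orthographic projection along z, frontal pose, unit scale.
    """
    n_modes = basis.shape[0]
    # Only the x and y components of each mode are observable in the image.
    A = basis[:, :, :2].reshape(n_modes, -1).T        # (2N, K)
    b = (landmarks_2d - mean_shape[:, :2]).ravel()    # (2N,)
    # Ridge-regularised least squares for the shape coefficients.
    coeffs = np.linalg.solve(A.T @ A + ridge * np.eye(n_modes), A.T @ b)
    # The fitted model supplies the depth (z) that the image lacks.
    shape_3d = mean_shape + np.tensordot(coeffs, basis, axes=1)
    return shape_3d, coeffs

# Synthetic example: a random model and a shape generated from it.
rng = np.random.default_rng(0)
mean_shape = rng.normal(size=(10, 3))
basis = rng.normal(size=(3, 10, 3))
true_coeffs = np.array([0.5, -1.0, 0.25])
true_shape = mean_shape + np.tensordot(true_coeffs, basis, axes=1)

# Pretend the 2D detector returned the exact x and y coordinates.
estimated, coeffs = fit_shape_to_2d(true_shape[:, :2], mean_shape, basis)
```

A complete method would also have to estimate the head pose and scale, which this sketch leaves out.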

There are also methods that combine both of the aforementioned approaches [53]. In other words, the pre-trained shape model and the use of 3D data in the training step have been combined, as reported by Jourabloo et al. [53, 71]. Jourabloo et al. [71] developed a novel method to estimate 3D landmarks and their visibilities in a pose-invariant manner. The training of the cascaded regressor-based model included building a 3D point distribution model from an available set of 3D scans (the shape model approach), and using a 2D image with manually labelled 2D landmarks combined with the 3D ground truth [71].
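A point distribution model is commonly built with principal component analysis (PCA) of landmark sets. The following sketch illustrates that idea only and is not the procedure of Jourabloo et al. [71]. It assumes that the landmark sets from the 3D scans have already been aligned to a common coordinate frame; the function name `build_point_distribution_model` and the synthetic scans are hypothetical.

```python
import numpy as np

def build_point_distribution_model(shapes, n_modes):
    """Build a PCA-based point distribution model.

    shapes: (M, N, 3) landmark sets from M scans, assumed pre-aligned.
    Returns the mean shape (N, 3), the modes (n_modes, N, 3),
    and the variance explained by each mode.
    """
    n_scans = shapes.shape[0]
    flat = shapes.reshape(n_scans, -1)                # (M, 3N)
    mean = flat.mean(axis=0)
    # SVD of the centred data yields the principal modes of variation.
    _, sing, vt = np.linalg.svd(flat - mean, full_matrices=False)
    modes = vt[:n_modes]
    variances = sing[:n_modes] ** 2 / (n_scans - 1)
    return mean.reshape(-1, 3), modes.reshape(n_modes, -1, 3), variances

# Synthetic scans that vary along two hidden directions.
rng = np.random.default_rng(2)
base = rng.normal(size=(5, 3))
directions = rng.normal(size=(2, 5, 3))
weights = rng.normal(size=(20, 2))
scans = base + np.tensordot(weights, directions, axes=1)   # (20, 5, 3)

mean_shape, modes, variances = build_point_distribution_model(scans, n_modes=2)

# Project the centred scans onto the modes and reconstruct them.
centred = (scans - mean_shape).reshape(20, -1)
flat_modes = modes.reshape(2, -1)
reconstructed = centred @ flat_modes.T @ flat_modes
```

Because the synthetic scans vary along two directions only, two modes are enough to reconstruct them; real scan data would need more modes and a prior alignment step.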

Finally, methods that localize 3D coordinates for facial landmarks directly from 3D scans have also been reported [53, 72, 73]. Papazov et al. [72] developed a method that first detects location candidates for every landmark under investigation from facial 3D scans. To be more precise, Papazov et al. used triangular surface patch (TSP) descriptors, which capture the 3D shape of a triangular facial area, to obtain the landmark candidates. The final estimates for both landmarks and pose were then obtained by fitting the candidates to a pre-trained shape model. [53, 72] Liang et al. [73] presented an approach that detects dominant landmarks on a 3D facial scan by using the landmarks’ particular geometric properties. Their next step was likewise to match a deformable 3D model to obtain the remaining supporting landmarks. [53, 73] Papazov et al. [72] used synthetic 3D images as training data due to availability. They concluded that their method is viewpoint invariant, works in real time, and produces improved pose and location estimates compared to earlier published work.
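The step of fitting landmark candidates to a shape model in order to obtain a pose estimate can be illustrated in a much simplified form. The sketch below is not the method of Papazov et al. [72]: it swaps in the Kabsch algorithm and assumes a rigid model (for example, a mean shape) and one already selected candidate per landmark. The function name `rigid_align` and the synthetic data are hypothetical.

```python
import numpy as np

def rigid_align(model_points, candidate_points):
    """Least-squares rigid fit (Kabsch algorithm).

    Finds rotation R and translation t such that
    candidate_points ≈ model_points @ R.T + t.
    Both inputs are (N, 3) arrays with corresponding rows.
    """
    mu_m = model_points.mean(axis=0)
    mu_c = candidate_points.mean(axis=0)
    # 3x3 cross-covariance between the centred point sets.
    H = (model_points - mu_m).T @ (candidate_points - mu_c)
    U, _, Vt = np.linalg.svd(H)
    # Correct the sign so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_m
    return R, t

# Synthetic example: rotate the model by 30 degrees about z and shift it.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([5.0, -2.0, 1.0])

rng = np.random.default_rng(1)
model = rng.normal(size=(8, 3))
candidates = model @ R_true.T + t_true

R_est, t_est = rigid_align(model, candidates)
```

The recovered rotation and translation together describe the head pose relative to the model. A deformable model would additionally require the shape parameters to be estimated.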

Wu et al. [53] discuss the current problems of automatic ML-based landmark detection and tracking. They report that, as face detection in general precedes landmark detection, the landmarking results depend heavily on the accuracy and performance of the face detection, which is a limitation. Additionally, Wu et al. point out that some landmark detection and tracking algorithms are still computationally expensive. Finally, they note that the focus has been on simply detecting and tracking the landmarks; the dynamic information has not been employed to its full potential. [53]

Regarding facial palsy and facial symmetry analysis, automatic facial landmark detection and tracking are merely an intermediate step; landmarking is a task to be completed prior to further facial analysis [71].