
Master’s Programme in Computational Engineering and Technical Physics
Intelligent Computing Major

Master’s Thesis

Vadim Lyubanenko

MULTI-CAMERA FINGER TRACKING FOR 3D TOUCH SCREEN USABILITY EXPERIMENTS

Examiners: Professor Heikki Kälviäinen
Docent Yana Demyanenko

Supervisors: M.Sc. Toni Kuronen
Docent Tuomas Eerola
Professor Lasse Lensu
Professor Heikki Kälviäinen


Intelligent Computing Major
Vadim Lyubanenko

Multi-camera finger tracking for 3D touch screen usability experiments
Master’s Thesis

2017

66 pages, 27 figures, 10 tables.

Examiners: Professor Heikki Kälviäinen
Docent Yana Demyanenko

Keywords: human-computer interaction, object tracking, finger tracking, multi-view tracking, trajectory processing, epipolar geometry, 3D reconstruction, image processing

Three-dimensional Human-Computer Interaction (HCI) has the potential to form the next generation of user interfaces and to replace the current 2D touch displays. Recent progress in computer vision and object tracking has made it possible to get rid of data gloves and other sensors that constrain the user experience, and to accurately measure hand movements using a camera-based system. The objective of this thesis was to develop a framework for recording human-computer interaction in 3D, which is expected to form the basis for future studies of touch and gesture based user interfaces. The research included building a multi-camera measurement setup, evaluating eleven state-of-the-art object trackers for the task of robust finger tracking, selecting a post-processing method, combining data from high-speed and normal frame rate cameras, and obtaining a 3D trajectory. The developed framework was successfully evaluated in an application where 3D touch screen usability is studied with 3D stimuli. The most consistent performance was achieved by the SCT visual tracker complemented with LOESS smoothing, and thus this combination formed the core of the proposed framework.


PREFACE

I wish to thank my supervisors, Professors Heikki Kälviäinen and Lasse Lensu, Docent Tuomas Eerola, and M.Sc. Toni Kuronen, for their valuable support and engagement throughout the entire learning process. Their unbiased comments targeting various points of the research allowed me to get into and polish each tiny piece of the work. They consistently allowed this thesis to be my own work, but steered me in the right direction whenever I needed it.

I would also like to acknowledge the staff of Southern Federal University, especially Docent Yana Demyanenko and Senior Lecturer Vitaly Bragilevsky, for their sincere and valuable guidance during my studies abroad.

Finally, I must express my very profound gratitude to my parents, friends and to my girlfriend for their patience, understanding and unfailing support throughout the process of writing this thesis and my life in general.

Lappeenranta, May 22, 2017

Vadim Lyubanenko


2 OBJECT TRACKING
2.1 Background
2.2 Benchmarking trackers
2.3 Selected tracking methods
2.3.1 Sum of Template And Pixel-wise LEarners
2.3.2 An improved Staple tracker with multiple feature integration
2.3.3 Distractor Aware Tracker
2.3.4 Scale Adaptive Mean Shift
2.3.5 Kernelized Correlation Filter
2.3.6 Structuralist Cognitive model for visual Tracking
2.3.7 Scalable Kernel Correlation Filter with Sparse Feature Integration
2.3.8 Structured Output Tracking with Kernels
2.3.9 Incremental Learning for Robust Visual Tracking
2.3.10 Spatio-Temporal Context tracker
2.4 Performance evaluation
3 PROCESSING OF TRAJECTORY DATA
3.1 Problems with tracking data
3.2 Tracking failure detection
3.3 Trajectory filtering and smoothing
3.3.1 Local Regression Smoothing
4 OBTAINING 3D TRAJECTORIES
4.1 Single view geometry
4.1.1 Pinhole camera model
4.1.2 Imaging geometry
4.1.3 Camera calibration
4.2 3D reconstruction
4.2.1 Epipolar geometry
4.2.2 Scene reconstruction
5 EXPERIMENTS
5.1 Test environment
5.2 Dataset
5.2.1 Challenging factors
5.3 Tracker evaluation
5.3.1 Performance measures
5.3.2 Parameter selection
5.3.3 Tracking results
5.4 Trajectory smoothing
5.5 Camera calibration
5.6 Reconstructing 3D trajectory
6 DISCUSSION
6.1 Current results
6.2 Future work
7 CONCLUSION
REFERENCES


DAT Distractor Aware Tracker
EKF Extended Kalman Filter
HCI Human-Computer Interaction
HOG Histogram of Oriented Gradients
HS High-Speed camera
IVT Incremental Learning for Robust Visual Tracking
KF Kalman Filter
KCF Kernelized Correlation Filter
LOESS Local Regression
LOWESS Locally Weighted Scatterplot Smoothing
NS Normal-Speed camera
OTB Online Tracking Benchmark
RANSAC RANdom SAmple Consensus
SCT Structuralist Cognitive model for visual Tracking
sKCF Scalable Kernel Correlation Filter with Sparse Feature Integration
SSD Sum-of-Square Differences
Staple Sum of Template and Pixel-wise Learners
Staple+ An improved Staple tracker with multiple feature integration
STC Spatio-Temporal Context tracker
STRUCK Structured Output Tracking with Kernels
SVM Support Vector Machine
VOT Visual Object Tracking challenge
UKF Unscented Kalman Filter


1 INTRODUCTION

1.1 Background

Human-computer interaction (HCI) is an area of applied cognitive science and engineering design. It focuses on (1) understanding how people make use of devices and systems, and (2) designing new tools that enhance human performance and experience [1]. Recent progress in the domain of HCI has made it possible to form the next generation of user interfaces, combining touch input with three-dimensional (3D) content visualization. Stereoscopically rendered views provide the user with additional depth cues, which usually decrease the overall cognitive load for understanding complex scenes [2].

Although touch input has already proved its utility and indispensability for various HCI applications, interacting with stereoscopically rendered content is still a challenging task.

Usually the touch recognition surface lies in a different plane than the displayed content, which, being stereoscopically rendered, “floats” freely in front of or behind the monitor. It has been shown that touching an intangible surface (i.e., touching the void) leads to confusion and a significant number of overshooting errors [3].

This thesis is a part of the research carried out within the Computational Psychology of Experience in human-computer interaction (COPEX) project [4], which is a collaboration between the Machine Vision and Pattern Recognition Laboratory (MVPR) of Lappeenranta University of Technology (LUT) and the Visual Cognition Research Group of the University of Helsinki. The primary objective of the COPEX project is to study touch and gesture techniques of human-computer interaction by utilizing novel methodologies for measuring user experience. This thesis continues the previous research studying how different people use their hands while interacting with 3D content on a computer screen.

It focuses on processing and analysis of a dataset comprising video recordings of the interaction process, gathered during practical experiments involving twenty volunteers. Various methods for extracting motion features from the multi-view visual data (Figs. 1 and 2), subsequent trajectory filtering, and 3D modeling are explored.

1.2 Objectives and delimitations

The research focuses on the evaluation of existing tracking algorithms, namely, the analysis of the applicability of general object trackers to the finger tracking task. Moreover, the thesis includes an overview of the available trajectory filtering and smoothing methods, which are expected to increase the robustness and accuracy of the collected data.

Figure 1. The sample video frames of volunteer interaction with the 3D touch screen display captured with the high-speed camera.

Figure 2. The sample video frames of volunteer interaction with the 3D touch screen display captured with the normal frame rate camera.

It is not the objective of the thesis to develop new methods for solving these problems but to study and analyze the approaches that are currently available. The limiting factors in the selection of methods for tracking and filtering are the operation environment and the availability of the source codes of the algorithms. The most recent and efficient solutions are preferred.

The next step of the research is to perform the necessary calibration and alignment procedures to combine the 2D trajectories extracted from the multi-view visual recordings into a single 3D trajectory in real-world coordinates. The finger movement trajectories obtained from high-speed videos during the previous research [5] are complemented by tracked data from standard frame rate videos captured from a different viewing angle.

The described processing flow is summarized in Fig. 3.

Figure 3. The flow of the proposed approach.

1.3 Structure of the thesis

This thesis is organized as follows. Chapter 2 focuses on the problem of hand tracking and object tracking in general. It contains descriptions of the tracking methods, an overview of the most comprehensive visual tracker benchmarks, publicly available datasets for performance evaluation, and different comparison metrics.

In Chapter 3 topics related to trajectory data processing are studied. Possible problems and constraints in tracking data as well as various filtering and smoothing approaches for handling the issues are presented. Moreover, this chapter takes a look at existing methods for tracker failure detection for an automatic elimination of incorrect measurements from the subsequent analysis.

Chapter 4 contains an introduction to the principal ideas behind 3D reconstruction. It concentrates on the theoretical background and the practical tools required for processing the trajectory data gathered from multiple-view cameras and thus estimating an object’s three-dimensional movement in real-world coordinates.


2 OBJECT TRACKING

2.1 Background

Object tracking constitutes a crucial component in a wide range of applications besides human-computer interaction, such as medical imaging, vehicle navigation, surveillance, sports analytics, augmented reality and robotics. Given the initial state of an object, generally defined via a rectangular bounding box in the first frame of a video, the tracker should estimate the position of the target in the subsequent frames. The huge variety of possible applications generates a multiplicity of factors affecting object tracker performance, e.g., noise, illumination, occlusion, pose change, motion blurring, background distortions and object deformations (Fig. 4). Varying circumstances need to be reconciled in a single tracking algorithm, which presents a challenging problem. The difficulty and importance of the visual tracking problem have resulted in considerable popularity of this research area, with a significant number of tracking algorithms presented and evaluated annually in journals and at high-profile conferences.

Figure 4. Visual object tracking challenges [6].

The core component of many visual trackers is an appearance model holding a representation of the object built from the previous video frames. Based on the model, tracking algorithms can be categorized as either discriminative [7–9] or generative [10–12]. Discriminative methods rely on a binary classifier which is learned online to distinguish the target from the background, while generative algorithms model the representation of the target and seek the region with the highest matching score.


trackers [18].

2.2 Benchmarking trackers

Multiple research groups have attempted to establish varied datasets covering an abundant number of possible circumstances and to conduct comprehensive performance evaluations of the available trackers, notwithstanding that the annual appearance of a vast number of new trackers causes a rapid obsolescence of any survey. Smeulders et al. [19] conducted a pioneering extensive study with the analysis of 19 object trackers on 315 video sequences.

Wu et al. [20] assembled a comprehensive dataset of 50 fully annotated video sequences and investigated the performance of 29 online tracking methods. The term online in the context of that research means that only the information from the few previous frames may be utilized for tracking at any time point.

The Visual Object Tracking (VOT) challenge [21–24] introduced a compound database of video sequences and novel performance evaluation techniques for measuring the robustness and accuracy of tracking algorithms. The challenge has been held since 2013, and the most recent report from 2016 [24] is considered to be the largest benchmark on short-term tracking. It assessed 70 single-camera, single-target, model-free, causal, short-term trackers.

The model-free property defines a class of visual trackers for which no pre-trained model of the object appearance is provided and the only training example is defined by the bounding box in the initial frame. Causality in this context means that neither preceding nor subsequent frames are used for estimating the current object pose. The short-term attribute means that no redetection is performed once the target is lost. The VOT2016 top 10 object trackers are presented in Table 1.

Table 1. VOT2016 top 10 object trackers [24].

 #  Method                                                              Abbrev.  Year of publication
 1  Continuous Convolution Operator Tracker [25]                        C-COT    2016
 2  Tree-structured Convolutional Neural Network Tracker [26]           TCNN     2016
 3  Scale-and-State Aware Tracker [16]                                  SSAT     2015
 4  Multi-Level Deep Feature Tracker [24]                               MLDF     2015
 5  Sum of Template And Pixel-wise LEarners [27]                        Staple   2016
 6  Discriminative Deep Correlation Tracking [24]                       DDC      2016
 7  Edge Box Tracker [28]                                               EBT      2015
 8  Salient Region Based Tracker [24]                                   SRBT     2016
 9  An improved Staple tracker with multiple feature integration [24]   Staple+  2016
10  Dual Deep Network Tracker [29]                                      DNT      2016

Despite the significant recent effort put into the area of visual tracking, most state-of-the-art algorithms limit themselves to the grayscale realm, i.e., they utilize only the grayscale information of a video sequence. There are several possible reasons for this: (1) color analysis may cause extra computational cost and decrease performance; (2) the grayscale realm is generally sufficient to achieve a reasonably good result; and (3) color data can be distorted by environmental factors such as illumination fluctuations. Liang and Blasch [30] conducted an extensive systematic study of the possibility of integrating color processing into modern object trackers by encoding 10 chromatic models into 16 state-of-the-art visual trackers and evaluating their performance on a set of 128 color videos with ground truth. The results showed that an appropriate color model systematically improves the performance of existing grayscale trackers at the expense of their computational efficiency.

2.3 Selected tracking methods

In this thesis the selection of general tracking methods for the task of finger tracking is based on the findings presented in the VOT2016 challenge, since this work is considered to be the most up-to-date and comprehensive comparison of object trackers. The objective was to select algorithms which work in real time (at least 25 frames per second) and have publicly available source code. VOT2014 introduced the tracking performance measure called equivalent filter operations (EFO), used in the subsequent VOT benchmarks. The authors proposed a threshold of 20 EFO units: all the trackers with a higher score are expected to run in real time. The threshold utilized in the current research for algorithm selection was slightly decreased and set to 10 EFO in accordance with the performance tests run on the local workstation (refer to Chapter 5 for further details). All the trackers meeting the described criteria were selected for the experiments. ColorKCF and DSST were afterwards excluded from the evaluation due to runtime issues. The final set of selected algorithms is presented in Table 2.

Table 2. The selected object trackers and their overall rank in the VOT2016 challenge [24].

 #  Method                                                              Abbrev.  Year of publication
 5  Sum of Template And Pixel-wise LEarners [27]                        Staple   2016
 9  An improved Staple tracker with multiple feature integration [24]   Staple+  2016
31  Distractor Aware Tracker [9]                                        DAT      2015
32  Scale Adaptive Mean Shift [31]                                      ASMS     2014
37  Kernelized Correlation Filter tracker [13]                          KCF      2014
39  Structuralist Cognitive model for visual Tracking [18]              SCT      2016
52  Scalable Kernel Correlation Filter with Sparse Feature Integration [14]  sKCF  2015
58  Structured Output Tracking with Kernels [17]                        STRUCK   2016
63  Incremental Learning for Robust Visual Tracking [10]                IVT      2008
65  Spatio-temporal context tracker [32]                                STC      2014

2.3.1 Sum of Template And Pixel-wise LEarners

Correlation filters have shown excellent performance for visual object tracking [13]. However, they have a significant drawback in being highly sensitive to target appearance deformation, since they are inherently limited to the use of a rigid template. In order to utilize the advantages of correlation filters and simultaneously achieve the desired accuracy in tracking deforming objects, correlation filter-based trackers may be complemented with a target representation that is insensitive to shape variation.

Bertinetto et al. [27] suggested using color histograms as an additional representation robust to deformation, since they do not depend on the spatial structure within the image patch. Thus, the Sum of Template And Pixel-wise LEarners (Staple) tracker relies on the strengths of both the color-based and the template models (Fig. 5). The template model is based on Histogram of Oriented Gradients (HOG) [33] image features. The tracker output for a particular video frame patch represents a weighted linear combination of the two model scores, and the desired object position is estimated as the result of a translation and scale search in a region around the previous location.
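As an illustration of the fusion step described above, the following minimal NumPy sketch combines a template (correlation) response map and a colour-histogram response map with a fixed weight and picks the peak of the fused map. The response maps, the weight value and the function name are illustrative and are not taken from the Staple implementation.

```python
import numpy as np

def fuse_responses(template_response, histogram_response, alpha=0.3):
    """Weighted linear combination of the two per-pixel model scores,
    in the spirit of Staple; alpha is the histogram weight."""
    return (1.0 - alpha) * template_response + alpha * histogram_response

# Toy usage: two response maps over a search region, pick the best location.
tmpl = np.random.rand(64, 64)   # placeholder correlation-filter (HOG) response
hist = np.random.rand(64, 64)   # placeholder colour-histogram response
fused = fuse_responses(tmpl, hist)
dy, dx = np.unravel_index(np.argmax(fused), fused.shape)
print("estimated target offset within the search window:", dx, dy)
```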

2.3.2 An improved Staple tracker with multiple feature integration

Xu et al. [24] proposed an improved version of the Staple tracker, denoted as Staple+, for evaluation in the VOT2016 benchmark. While the original algorithm extracts HOG features from a grayscale image, Staple+ relies on HOG features retrieved from a color probability map, which are expected to better represent the color information of the image patch.

Figure 5. The Staple algorithm flow [27]. Staple relies on the strengths of both the color-based and the template (HOG-based) models. The tracker output represents a weighted linear combination of the two model scores.

2.3.3 Distractor Aware Tracker

Possegger et al. [9] presented an appearance-based tracking-by-detection algorithm called the Distractor Aware Tracker (DAT). The method relies on a discriminative object-surrounding model employing a color histogram to differentiate the tracked object from the background. A histogram-based Bayes rule is applied to each pixel of a candidate area, producing object likelihood maps as the result. Map thresholding with an adaptive border, complemented with subsequent segmentation and connected component analysis, allows an enclosing bounding box corresponding to the object location region to be selected.

Color-based approaches are usually prone to drifting to neighboring regions with similar appearance. To suppress the risk of drifting, DAT introduces an additional distractor-aware model which allows possible distracting objects to be robustly detected whenever they appear within the field of view (Fig. 6).


Figure 6. Distractor Aware Tracker (DAT) likelihood maps: (a) the object-surrounding model; (b) the distractor-aware model; (c) the combined map [9].
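A minimal sketch of the histogram-based Bayes rule behind the object-surrounding model is given below: colour histograms are accumulated over the object region and its surroundings, and each pixel receives the posterior probability of belonging to the object. The quantization, function name and mask inputs are assumptions made for illustration, not DAT's actual implementation.

```python
import numpy as np

def color_likelihood_map(frame, obj_mask, surr_mask, bins=16):
    """Per-pixel object likelihood from colour histograms via Bayes rule.
    frame: HxWx3 uint8 image; obj_mask, surr_mask: boolean HxW masks."""
    idx = (frame // (256 // bins)).astype(np.int32)                 # quantise colours
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    hist_obj = np.bincount(flat[obj_mask], minlength=bins ** 3).astype(float)
    hist_sur = np.bincount(flat[surr_mask], minlength=bins ** 3).astype(float)
    # P(object | colour) ~ H_obj(c) / (H_obj(c) + H_surr(c))
    posterior = hist_obj / np.maximum(hist_obj + hist_sur, 1e-9)
    return posterior[flat]                                          # HxW likelihood map
```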

2.3.4 Scale Adaptive Mean Shift

The Scale Adaptive Mean Shift (ASMS) [31] tracker enhances the mean-shift tracking algorithm [34] by targeting the problem of a fixed-size window. The original method minimizes the distance between the histograms of the object and an object candidate image patch, treated as two probability density functions (pdfs). Since histogram similarity does not take into account the position of every pixel, algorithms utilizing this measure are expected to be robust to articulation and deformation. The main drawback of mean-shift tracking is the use of a fixed-size bounding box, which leads to low performance on sequences with changing object size.

To address the issue of a fixed size bounding box, ASMS encompasses a novel scale estimation mechanism based on the mean-shift procedure for the Hellinger distance [35].

Moreover, the authors present a technique to validate the estimated output, called the backward scale consistency check. It uses reverse tracking from step t to t−1 to verify that the object size has not changed erroneously. In the case of a detected scale inconsistency, the correct object size is recovered as a linear combination of (i) the newly estimated size, (ii) the previous frame size and (iii) the size of the initial bounding box. Another improvement is the introduction of a novel histogram color weighting method, called background ratio weighting (BRW), which uses the target background color information computed over its neighborhood in the first frame.

2.3.5 Kernelized Correlation Filter

Most contemporary tracking algorithms rely on a discriminative model, which aims to identify the target against the background. The model is usually trained with a large number of synthetic samples produced by translating and scaling the initial object bounding box. Processing these data requires a huge number of iterations and consumes a lot of computation power [8]. Henriques et al. [36] showed that the training dataset can be represented as a circulant matrix, and moreover, all circulant matrices are made diagonal by the Discrete Fourier Transform (DFT), which can be computed in an efficient way. This observation makes it possible to reduce the storage and computation requirements for training the appearance model by several orders of magnitude.

The Kernelized Correlation Filter (KCF) [13] represents a kernel ridge regression solution trained with thousands of samples at different relative translations, without explicit iteration over them. The KCF tracker may operate on robust HOG features as well as on raw pixels, reaching excellent performance in both configurations.

Vojir [37] extended the original algorithm with scale estimation (7 different scale steps) and color-names features [38]. This method is denoted as KCF2.
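The circulant-matrix insight can be illustrated with a stripped-down, single-channel linear correlation filter trained and applied entirely in the Fourier domain. The sketch below omits the kernel trick, HOG features, cosine windowing and model updating that the full KCF tracker uses; it only demonstrates the ridge-regression solution made cheap by the DFT.

```python
import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-2):
    """Learn a linear correlation filter in the Fourier domain: ridge
    regression over all cyclic shifts of the patch (a circulant data
    matrix), solved element-wise thanks to the DFT diagonalization."""
    h, w = patch.shape
    ys, xs = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    y = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))   # desired Gaussian response
    X = np.fft.fft2(patch)
    Y = np.fft.fft2(np.fft.ifftshift(y))
    return np.conj(X) * Y / (np.conj(X) * X + lam)           # filter in the Fourier domain

def detect(filter_fft, patch):
    """Correlate a new search patch with the learned filter; the peak of
    the response gives the displacement (wrapping around the patch)."""
    response = np.real(np.fft.ifft2(filter_fft * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return dx, dy
```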

2.3.6 Structuralist Cognitive model for visual Tracking

The Structuralist Cognitive model for visual Tracking (SCT) [18] decomposes tracking into two stages: disintegration and integration. In the first stage, multiple cognitive structural units, so-called attentional feature-based correlation filters (AtCFs), are generated. Each unit consists of an attentional weight estimator and a Kernelized Correlation Filter (KCF) [13]. In order to target various properties of the tracked window and effectively distinguish the foreground from the surrounding background, each AtCF utilizes a unique pair of a feature (color, HOG, etc.) and a kernel (linear, Gaussian, etc.) type.

In the integration stage, the object appearance is expressed as a representative combination of AtCFs, which is memorized for future use (Fig. 7).

2.3.7 Scalable Kernel Correlation Filter with Sparse Feature Integration

The original version of the Kernelized Correlation Filter [13] produces accurate and robust tracking results. Nevertheless, its efficiency is limited by the use of a fixed-size template, and hence the algorithm cannot handle target scaling in an efficient way. In order to overcome this constraint, Montero et al. [14] presented an improved version of the KCF tracker, called the Scalable Kernel Correlation Filter with Sparse Feature Integration (sKCF).


Figure 7. Cooperation between the two stages of the SCT tracker model: disintegration and integration [18].

sKCF replaces the cosine window with an adjustable Gaussian windowing function to support target size changes and, hence, produce better background and foreground separation (Fig. 8). The new appearance window size is estimated with a forward-backward optical flow strategy: relevant keypoints of the target area are extracted in successive frames and the scale change is then estimated by analyzing their pairwise differences.

Figure 8. Gaussian (sKCF) and cosine (KCF) windows filtering raw pixel values. First column: object appearance at three different scales; second and third columns: adjustable Gaussian and fixed-size cosine windows respectively; fourth and fifth columns: image samples filtered with these filters [9].

Finally, the sKCF tracker boosts the computational efficiency of the original KCF tracker. The authors integrated the fast HOG descriptors by Felzenszwalb et al. [39] and Intel’s Complex Conjugate Symmetric (CCS) packed format into the DFT/IDFT calls.


2.3.8 Structured Output Tracking with Kernels

Structured Output Tracking with Kernels (STRUCK) [17] is an adaptive visual object tracking algorithm based on structured output prediction, which aims to directly predict the change in object configuration between frames instead of searching for the position with the maximal classification score, as is typical for binary classifiers. It utilizes the idea of the structured output Support Vector Machine (SVM) framework and customizes it for the case of online learning.

The basic issue in online learning with kernels is the so-called curse of kernelization, whereby the number of support vectors grows with each provided training sample. Since the computational complexity of a kernelized SVM is linearly dependent on the number of support vectors, this growth inevitably decreases the performance. In order to limit the expansion, a budgeting mechanism is introduced at the end of the algorithm adaptation stage.

The STRUCK tracker supports three different types of kernels, namely linear, Gaussian and intersection; moreover, it can be utilized for both single-scale and multi-kernel tracking. Image feature types such as raw pixels, Haar features, and intensity histograms are supported.

2.3.9 Incremental Learning for Robust Visual Tracking

Ross et al. [10] presented the Incremental Learning for Robust Visual Tracking (IVT) method, which incrementally learns a low-dimensional subspace representation of the target, efficiently adapting online to object changes. In contrast to prior works in the domain of visual tracking utilizing fixed representative models, the proposed method efficiently reflects the varying object appearance.

The IVT tracker utilizes a particle-filtering method for estimating the possible target locations in a new frame, which are then compared to the stored representation of the object. The most likely location is selected as the current target position. The update of the stored observation model is conducted with a novel incremental PCA algorithm, which recomputes the eigenbasis as well as the mean once an additional portion of training data arrives. Further, it incorporates a forgetting factor in the incremental model update, as proposed by Levy and Lindenbaum [40], which helps sustain computational efficiency during the evaluation.


2.3.10 Spatio-Temporal Context tracker

equipped with an i7 processor). The STC algorithm extensively employs the relationship between the target appearance and its dense spatial context, motivated by the human visual system and formulated in terms of a Bayesian framework. This technique allows heavy occlusion and similar-distractor ambiguity to be handled, since the surrounding context usually uniquely determines the object. The learned spatial context model h_t^sc is used to update a spatio-temporal context model H_t^stc. The subsequent construction of an object location likelihood confidence map integrating the spatio-temporal context data allows the target position to be estimated as the result of a maximization operation (Fig. 9).

Figure 9. The STC algorithm flow [32]. STC utilizes two local context models, namely the spatial h_t^sc and the spatio-temporal H_t^stc context models, to learn the relationship between the object (indicated by the yellow rectangles) and its local context region (inside the red rectangles).

2.4 Performance evaluation

An objective visual tracker performance comparison requires two basic components: (i) a dataset containing various video sequences covering the different possible applications of the evaluated trackers, and (ii) a quantitative performance measure.


Tracker performance evaluation in the pioneering works on the topic was based on various automatic measures not utilizing ground truth, since manual annotation of the target presence and position in a video sequence required a considerable amount of resources and was thus not available in sufficient amounts. Erdem et al. [41] proposed a metric based on an evaluation of shape and color differences of the results, and moreover suggested comparing the initial and the final positions of the object. Wu et al. [42] investigated the potential of using backward analysis of the physical motion.

The popularity of visual tracking in the last decade has resulted in a considerable amount of annotated data: many video datasets have been provided for various vision problems, such as the FERET dataset for face recognition [43], the optical flow dataset [44] and CAVIAR for surveillance [45]. Comprehensive datasets for generic object tracking mostly originate from the major benchmark papers, such as the “Amsterdam Library of Ordinary Videos” (ALOV) [19], the “Online Tracking Benchmark” (OTB) [20], “TColor-128” [30] and the annual “Visual Object Tracking” (VOT) challenge publications [21–24]. With the availability of manually annotated data, the goal of performance measures is to estimate the extent to which the position predicted by a tracker agrees with the ground truth.

The task of object tracking suffers from the absence of a commonly accepted convention about which measures should be used in experiments. Various research groups utilize different methods and frameworks for establishing the efficiency of the proposed algorithms, which obstructs cross-paper tracker comparison. Cehovin et al. [46] picked a number of the popular performance measures and conducted their theoretical and experimental analysis. The first group of studied measures comprises the average center prediction error (CE) (e.g., in [20]), which calculates the mean difference between the ground truth and the predicted target center over the sequence, and its variations, namely, the average normalized center error (NCE) and the root-mean-square error (RMSE). The second group contains the variations of the region average overlap (AO) measure (e.g., in [24]), which is defined as

$$\bar{\varphi} = \frac{1}{N}\sum_{t}\varphi_t, \qquad \varphi_t = \frac{R_t^G \cap R_t^T}{R_t^G \cup R_t^T}, \tag{1}$$

where N is the number of frames in the sequence, and R_t^T and R_t^G denote the region of the object at time t according to the tracker output (T) and the ground truth (G), respectively. AO measures the overlap between the ground truth and the predicted target regions. The tracking length measure [47] counts the number of successfully tracked frames from the moment of tracker initialization to its failure. The failure may be detected either manually or automatically.


An intuitive way of evaluating visual tracking is to estimate the accuracy of the tracker (i.e., how precisely the object position is determined) and its robustness (i.e., how many times it loses the target) [46]. Cehovin et al. [46] suggested narrowing the set of potential measures to only two complementary metrics, namely, the average overlap measure (accuracy) and the failure rate (robustness), which sufficiently describe the above-mentioned properties. Nevertheless, the authors warn that various measures reflect different aspects of tracking performance, so it is impossible to choose the best multipurpose measure. Moreover, many of the widely utilized metrics correlate strongly with each other, and thus using them together provides no additional information.

Wu et al. [20] suggested an alternative mechanism for robustness evaluation. The authors pointed out that many object trackers are highly sensitive to the initialization, and even a small adjustment of the start conditions may affect the performance. Therefore two methods to measure robustness to initialization were proposed, namely temporal (TRE) and spatial robustness evaluation (SRE). The first launches the tracker from different starting frames, while the second tries different initial bounding boxes, including center and corner shifts, and scale variations. The final score is computed as the average of all evaluations.


3 PROCESSING OF TRAJECTORY DATA

3.1 Problems with tracking data

The trajectory data retrieved as the result of tracking usually forms an ordered list of object location coordinates in the image plane. These measurements may contain movement noise or completely incorrect position estimates (once the tracker has lost the target), since none of the currently available visual trackers achieves impeccable accuracy. Moreover, most visual trackers estimate the object location only with pixel precision, and therefore the obtained trajectory forms a broken line instead of the desired smooth curve; this problem is especially pronounced in low-resolution videos (Fig. 10). As noted in [5], the rough-edged transitions between the trajectory points noticeably affect the precision of succeeding calculations. These negative effects can be reduced by introducing tracking failure detection and trajectory smoothing methods into the processing flow.

Figure 10. The raw rough-edged (cyan) and filtered (yellow) trajectory data. The ground truth is shown in red color [5].


3.2 Tracking failure detection

data. Nevertheless, even state-of-the-art tracking algorithms may occasionally fail in complex scenarios. Typically, this happens when the observed target appearance does not fit the generalization capabilities of the tracking method’s model. In this context, it is highly important to detect tracking failures in time and then, for example, exclude these samples from further processing.

In situations where ground truth is available, tracking performance can be measured with the methods discussed in Section 2.4. The absence of ground truth, which is common for massive datasets or newly recorded videos, requires the application of specific tracking failure detection methods. It is important to notice that some of these methods are based on the same principles as the tracker performance evaluation metrics that require no ground truth.

A commonly used approach to detecting failures is to calculate the sum-of-square differences (SSD) [50] or other similar measures [41] between consecutive object area patches. This measure allows various occlusions or rapid target leaps to be detected, but it does not recognize gradual trajectory drift. Comparison between the current object appearance and affine warps of the initial patch can be utilized for drift detection [51], but this method is only applicable to rigid objects and a uniform environment.

Wu et al. [42] and Kalal et al. [52] proposed the strategy of forward and backward in time analysis of the video sequence. Backtracking is used to estimate the reverse trajectory from the current timestamp back to an earlier moment, and the divergence between the two trajectories is then measured (Fig. 11).
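A hedged sketch of the forward-backward check using OpenCV's pyramidal Lucas-Kanade optical flow is shown below: a point is tracked from the previous frame to the current one and then back, and a large distance between the starting point and the back-tracked point signals a likely tracking failure. The grayscale inputs, the function name and the idea of thresholding the returned error are assumptions made for illustration.

```python
import numpy as np
import cv2

def forward_backward_error(prev_gray, next_gray, point):
    """Track one point forward (prev -> next) and backward (next -> prev)
    with pyramidal Lucas-Kanade flow and return the distance between the
    original point and the back-tracked one (the forward-backward error)."""
    p0 = np.array([[point]], dtype=np.float32)            # shape (1, 1, 2)
    p1, st1, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    p2, st2, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, p1, None)
    if not (st1[0, 0] and st2[0, 0]):
        return float("inf")                               # tracking failed outright
    return float(np.linalg.norm(p0 - p2))

# Usage idea: mark a frame as a tracking failure when the returned error
# exceeds an empirically chosen threshold of a few pixels.
```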

3.3 Trajectory filtering and smoothing

In signal processing, noise is a general term for unwanted (and, in general, unknown) modifications that a signal may suffer during capture, storage, transmission, processing, or conversion [53]. The goal of filtering is to reduce the noise level or to eliminate it completely without weakening the underlying signal.


Figure 11. The Forward-Backward error penalizes inconsistent trajectories [52].

In [54] the performance of eight filtering methods for processing finger trajectory data, retrieved as the result of running the KCF [13] visual tracker on a high-speed video dataset, was investigated. The following methods in multiple configurations were studied: Moving Average (MA) [55], Kalman Filter (KF) [56], Extended KF (EKF) [57], Unscented KF (UKF) [58], Local Regression (LOESS) [59], Locally Weighted Scatterplot Smoothing (LOWESS) [59], Savitzky-Golay (SG) [60], and Total Variation Denoising (TVD) [61]. The comparison of the smoothed values against the ground truth showed that the LOESS filtering algorithm produced the most consistent results in all the tests, while the other methods also performed well, showing only slightly lower accuracy. Moreover, according to the findings, smoothing yields better estimates of the derivatives of the target position, which may be used for velocity and acceleration estimation, than the raw data does.

Based on the above, the LOESS filtering method utilizing a linear polynomial was selected for smoothing the trajectory data obtained in the current research.

3.3.1 Local Regression Smoothing

LOWESS and LOESS [59] are both non-parametric locally weighted linear regression methods, used in statistics for fitting a smooth curve or surface to data described by predictor variables. LOESS is a later generalization of the original LOWESS method.


$$w_i = \left(1 - \left|\frac{x - x_i}{d(x)}\right|^{3}\right)^{3}, \tag{2}$$

where x denotes the examined point, x_i are the points within a local neighborhood, and d(x) is the distance to the most remote point in the span. The tricube function is utilized for computing the regression weights, giving more weight to predictor values near the point whose response is being estimated and less weight to distant points. The dataset points outside the span are not considered.
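A compact sketch of first-degree LOESS using the tricube weights of Eq. (2) is given below; it smooths one coordinate of a trajectory at a time. The span value, the nearest-neighbour window and the function name are illustrative choices and not the exact configuration used in the thesis.

```python
import numpy as np

def loess_1d(t, y, span=0.25):
    """Locally weighted linear regression (LOESS, degree 1) with the
    tricube weights of Eq. (2). t: sample times, y: one trajectory
    coordinate, span: fraction of points in each local window."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(t)
    k = max(2, int(np.ceil(span * n)))
    smoothed = np.empty(n)
    for i in range(n):
        d = np.abs(t - t[i])
        idx = np.argsort(d)[:k]                    # k nearest neighbours in time
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3
        # Weighted least-squares fit of a line a + b * t over the window.
        A = np.vstack([np.ones(k), t[idx]]).T * np.sqrt(w)[:, None]
        b = y[idx] * np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        smoothed[i] = coef[0] + coef[1] * t[i]
    return smoothed
```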


4 OBTAINING 3D TRAJECTORIES

Imaging is the formation of a two-dimensional representation of a three-dimensional world. This transformation comes along with the loss of depth information about the observed scene. Nevertheless, the lost spatial layout and information about the camera setup can be recovered from 2D images. Single monocular images allow building rough depth maps based on the analysis of the global context of the image, since local features alone are insufficient to estimate depth at a point due to geometrical constraints [62].

A precise restoration of the 3D structure is possible with a multi-view, or so-called stereo, camera setup. The required theoretical background and available methods are discussed below. Section 4.1 introduces the basics of single view geometry, which forms the foundation for the discussion of three-dimensional reconstruction in Section 4.2. The chapter concludes by establishing the relationship between 3D reconstruction and the task of computing a 3D trajectory.

The content of this chapter mostly relies on the material presented in the book by Hartley and Zisserman [63], which provides the reader with a more comprehensive background on the topic.

4.1 Single view geometry

4.1.1 Pinhole camera model

The imaging process can be modeled with the pinhole camera model. It is the simplest camera model, and while it does not describe all the existing features of photography, such as geometric distortions, it nevertheless clearly establishes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane.

The geometric mapping from 3D to 2D described by a pinhole camera model is called a central projection. Light rays outgoing from the imaged object pass through a fixed point, the center of projection or camera center. These rays intersect a specific plane, the so-called image plane, and thus project the appearance of the object. The line perpendicular to the image plane passing through the camera center is known as the principal axis; it intersects the image plane at the principal point. Fig. 12 visualizes the described model.


Figure 12. Pinhole camera geometry [63].

4.1.2 Imaging geometry

Let us consider that the imaging device is located at the origin of a Euclidean coordinate system with the principal axis pointing straight down the Z-axis and the image plane at Z = f, where f is the camera focal length; then the central projection from camera frame points to image plane points can be expressed as a linear mapping between their homogeneous coordinates. The geometrical relation between the camera frame point denoted as X_cam = (X, Y, Z, 1)^T and the image point x = (u, v, 1)^T may be written in terms of matrix multiplication as

$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f & \gamma & p_x & 0 \\ 0 & f & p_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \tag{3}$$

where (p_x, p_y)^T are the coordinates of the principal point, γ is the skew coefficient, which is non-zero only if the image axes are not perpendicular, and f denotes the camera focal length. Now, writing

$$K = \begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}, \tag{4}$$

then (3) has the concise form

$$x = K\,[\,I \mid 0\,]\, X_{cam}. \tag{5}$$

Usually, world points are expressed in terms of a different Euclidean coordinate frame than the camera frame, called the world coordinate frame. The two frames are related via a rotation R and a translation t (Fig. 13). In this case, the relation may be written as

$$x = K\,[\,R \mid t\,]\, X_{world}. \tag{6}$$

The joint rotation-translation matrix [R | t] is called the matrix of extrinsic (external) camera parameters, while K contains the intrinsic (internal) parameters. The result of the matrix multiplication P = K [R | t] is the homogeneous camera projection matrix.

Figure 13. The Euclidean transformation between the world and camera coordinate frames [63].
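To make the mapping of Eq. (6) concrete, the following small NumPy example projects a single world point into pixel coordinates with P = K [R | t]; all numeric values are made up purely for illustration.

```python
import numpy as np

# Illustrative intrinsics: focal length f (in pixels), principal point (px, py).
f, px, py = 800.0, 640.0, 360.0
K = np.array([[f, 0, px],
              [0, f, py],
              [0, 0, 1.0]])

# Illustrative extrinsics: identity rotation, camera translated along Z.
R = np.eye(3)
t = np.array([[0.0], [0.0], [2.0]])

P = K @ np.hstack([R, t])                    # 3x4 projection matrix P = K [R | t]

X_world = np.array([0.1, -0.05, 1.0, 1.0])   # homogeneous world point
x = P @ X_world
u, v = x[0] / x[2], x[1] / x[2]              # dehomogenise to pixel coordinates
print(u, v)
```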

4.1.3 Camera calibration

Camera calibration estimates the parameters of the lens and image sensor, namely the intrinsic and extrinsic matrices and the lens distortion. These parameters can be used for distortion correction of the captured images, for determining the camera and object locations in the scene, or for relating the observed object image dimensions to world units.


distortion and slight tangential distortion, which can be eliminated algorithmically based on the camera parameters estimated during calibration.

The camera calibration process is based on the basic geometrical transformations described in [63]. Ready-to-use calibration tools are available for different development platforms and environments, such as the Computer Vision System Toolbox for MATLAB [64].
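The thesis relies on the MATLAB toolbox for calibration; as an alternative illustration of the same procedure, the hedged OpenCV sketch below estimates the intrinsic matrix, lens distortion and per-view extrinsics from checkerboard images. The board dimensions, square size and image location are assumptions made for the example.

```python
import glob
import numpy as np
import cv2

pattern = (9, 6)      # assumed number of inner checkerboard corners
square = 0.025        # assumed square size in metres

# 3D corner coordinates of the board in its own planar coordinate frame.
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calibration/*.png"):           # assumed image location
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate intrinsics K, lens distortion, and per-view extrinsics [R|t].
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error:", rms)
```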

Basically, three-dimensional reconstruction (i.e., retrieving the spatial data about the scene) is not possible from a single view image without prior information about the observed environment, since the projection from a 3D scene onto a 2D plane removes depth information. It is impossible to determine which world point on the projection ray emanating from the camera center corresponds to a specific image point. If at least two perspective views are available, then the position of a 3D point can be found as the intersection of the two projection rays. This method requires additional inter-camera calibration and relies on the multiple-view geometry background discussed in Section 4.2.

4.2 3D reconstruction

Multiple perspective views may be captured sequentially by a single moving camera or acquired simultaneously by a stereo camera setup. These cases do not require different methods since they are geometrically equivalent.

Three-dimensional scene reconstruction from two perspective views requires the estimation of image points x ↔ x′ corresponding to the same 3-space point X. The relation between these points is described by the epipolar geometry and represented by the fundamental and the essential matrices, denoted as F and E respectively. These matrices allow the projection matrices of both cameras to be reconstructed, and then the desired scene structure.

The following paragraphs describe only the basic steps of this process, while a more detailed discussion is beyond the scope of this work; an inquisitive reader is referred to [63] for further details.

4.2.1 Epipolar geometry

When two cameras capture a picture of a single scene point, the acquired image points have a strict geometrical relation between each other. These constraints may be described in terms of the epipolar geometry or literally in terms of the geometry of stereo vision.

Since the cameras image the scene from distinct viewpoints, each camera center may be projected onto the other camera’s image plane. These points are called epipoles and are denoted as e and e′. The definition implies that the epipoles lie on a single line connecting both camera centres, which is called the baseline. The epipolar plane π, which contains the epipoles, is determined by the baseline and a world point X.

Suppose the first camera’s image point x corresponds to the world point X. Evidently, the second camera’s image point x′ lies in π; hence it lies on the line, known as the epipolar line and denoted as l′, where π intersects the second image plane (see Fig. 14).

Figure 14. Epipolar geometry [63].

The fundamental matrix F is the algebraic representation of the described epipolar geometry. It is the unique 3×3 matrix of rank 2 which satisfies

$$x'^{T} F\, x = 0. \tag{7}$$

Given the camera calibration matrix K, normalized image coordinates may be obtained as x̂ = K⁻¹x, defining the equation for the essential matrix

$$\hat{x}'^{T} E\, \hat{x} = 0. \tag{8}$$

Both the fundamental and the essential matrix allow the projection matrices of the cameras, and thereby the observed scene, to be reconstructed. The advantage of utilizing the essential matrix over the fundamental one is that it reduces the reconstruction ambiguity from projective to scale (Fig. 15). In addition, the terms metric and similarity reconstruction are also used in the literature to denote reconstruction with scale ambiguity. If additional information about the real-world scene dimensions is available, a true Euclidean reconstruction, which includes determining the overall scale, can be obtained.

Figure 15. Reconstruction ambiguity [63].

4.2.2 Scene reconstruction

Hartley and Zisserman [63] proposed the method for three-dimensional scene reconstruction shown in Algorithm 1, which is based on the theoretical notes described above.


Algorithm 1. The algorithm for three-dimensional scene reconstruction [63]

1. Find point correspondences in multiple-view images.

2. Compute the essential matrix from point correspondences.

3. Compute camera projection matrices from the essential matrix.

4. For each point correspondence x_i ↔ x′_i, compute the position of the scene point X_i.

Computation of the essential matrix. The essential matrix may be computed directly from (8) using normalized image coordinates, or via its relation to the fundamental matrix, which can be derived from (7) and (8) as

$$E = K'^{T} F\, K. \tag{9}$$

Given a sufficient number of point correspondences (at least 8), it is possible to solve the system of linear equations defined by (8) either directly or by finding its least-squares solution (which requires more than 8 points), and thus obtain the essential matrix. The normalized 8-point algorithm [63] introduces data normalization before finding a linear solution. Unfortunately, these linear methods are not stable for noisy data with a large number of outliers, which is common in real-world applications. More robust methods may be used instead, such as MLESAC [65], LMedS [66] or RANSAC [67].
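Steps 1–2 of Algorithm 1 can be sketched with OpenCV as follows: given temporally aligned point pairs from the two views and the calibrated intrinsic matrices, the essential matrix is estimated robustly with RANSAC, either via the fundamental matrix and Eq. (9) or directly. The variable names are assumptions, and the direct findEssentialMat call shown here assumes both views share the same intrinsics, which is a simplification.

```python
import numpy as np
import cv2

def estimate_essential(pts1, pts2, K1, K2):
    """pts1, pts2: Nx2 float arrays of corresponding image points from the
    two views (e.g. aligned samples of the two 2D fingertip trajectories);
    K1, K2: intrinsic matrices obtained from camera calibration."""
    # Option A: fundamental matrix from pixel coordinates, then Eq. (9).
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    E_from_F = K2.T @ F @ K1

    # Option B: essential matrix directly (OpenCV normalises the points
    # internally; this call assumes both cameras share the intrinsics K1).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K1, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    return E, E_from_F, inliers
```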

Computation of projection matrices. A pair of camera projection matrices P and P′ corresponding to the essential matrix E may be computed using Theorem 1 (refer to [63] for a proof).

Theorem 1. For a given essential matrix E = U diag(1, 1, 0) V^T and the first camera matrix P = [I | 0], there are four possible choices for the second camera matrix P′, namely

$$P' = [\,UWV^{T} \mid \pm u_3\,] \quad \text{or} \quad P' = [\,UW^{T}V^{T} \mid \pm u_3\,]. \tag{10}$$

The four-fold ambiguity of P′ in (10) may be eliminated by testing the obtained camera configuration with a single world point.


Usually, image point coordinates are corrupted by noise, which means that naive triangulation does not work since the emanating rays do not intersect in 3D space (Fig. 16). In this case, the best estimate of the world point can be obtained with the linear triangulation method [63].

Figure 16. Rays back-projected from imperfectly measured image points x and x′ do not intersect in 3D space, which prevents the use of naive triangulation for estimating the 3D point coordinates [63].

The task of computing a 3D trajectory from multiple 2D trajectories is essentially equivalent to the described process of 3D scene reconstruction and can be solved directly with Algorithm 1. The first step of the algorithm, the estimation of image point correspondences, can be interpreted as the problem of pairwise trajectory point alignment: for each two-dimensional trajectory point x, the matching point x′ of the complementary trajectory which corresponds to the same world point X has to be found.
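Steps 3–4 of Algorithm 1 can likewise be sketched with OpenCV: cv2.recoverPose applies the cheirality test that resolves the four-fold ambiguity of Theorem 1, after which every aligned trajectory point pair is triangulated linearly. The result is defined only up to an overall scale unless real-world scene dimensions are supplied separately; the names and the shared-intrinsics assumption in the recoverPose call are illustrative.

```python
import numpy as np
import cv2

def reconstruct_trajectory(E, pts1, pts2, K1, K2):
    """Camera matrices from the essential matrix, then linear triangulation
    of every aligned 2D trajectory point pair; returns an Nx3 array of 3D
    points (up to scale)."""
    # Cheirality test: selects the one of the four P' candidates of
    # Theorem 1 that places the points in front of both cameras.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K1)

    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])     # P  = K [I | 0]
    P2 = K2 @ np.hstack([R, t])                            # P' = K'[R | t]

    pts4d = cv2.triangulatePoints(P1, P2,
                                  pts1.T.astype(np.float64),
                                  pts2.T.astype(np.float64))
    return (pts4d[:3] / pts4d[3]).T                        # dehomogenise to Nx3
```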


5 EXPERIMENTS

This chapter contains a comprehensive description of the practical part of the conducted research and presents the obtained results. As noted in Chapter 1, this thesis covers only a single intermediate step of the extensive research project called Computational Psychology of Experience in HCI (COPEX). The starting point of the COPEX project was a dataset retrieval stage conducted by the Visual Cognition Research Group of the University of Helsinki. The goal of the experiment was to record the process of human interaction with a 3D touch screen. Volunteers were instructed to touch virtual objects sequentially emerging on a stereo display with various parallax. The process was simultaneously recorded with two cameras from different viewing angles. The analysis of the data retrieved from the high-speed camera was presented in [5]. The current research focuses on processing the video dataset captured with the subsidiary normal-speed camera and ends with the construction of 3D real-world trajectories from the multiple-view two-dimensional trajectory data.

This chapter is organized as follows: Section 5.1 describes the test environment and devices utilized for recording the human-computer interaction; Section 5.2 focuses on the dataset essential for meeting the goals of the current research, while the successive sections report the conducted experiments and their outcomes.

5.1 Test environment

During the laboratory trials, 20 subjects were asked to perform a clear pointing action towards the observed 3D stimuli. Stereoscopic vision was provided by an NVIDIA 3D Vision kit. The LG T1710 4:3 17” touch screen was placed at a distance of 0.65 meters in front of the person. The trigger box, whose button press marked the beginning of a single pointing action, was set up 0.25 m away from the screen. The process was recorded with two cameras: a Mega Speed MS50K high-speed camera equipped with a Nikon 50mm F1.4D objective and a C-mount adaptor, and a normal-speed Sony HDR-SR12 camera. The high-speed camera was installed on the right side of the touch screen with an approximately 1.25 m gap in between, while the normal-speed camera was mounted on top (Fig. 17).

A single test unit involving one subject was divided into 9 blocks, each containing 40 trials. Within a block, the high-speed camera captured only the first ten pointing actions supplemented with every third subsequent action, producing 20 videos per block in total. The normal-speed camera recorded the entire experiment without interruptions.

Figure 17. Experiment setup [5].

5.2 Dataset

The dataset captured with the normal-speed camera comprises twenty videos of approximately 90 minutes each, every video presenting a single test unit involving one person. The videos were recorded at 25 frames per second with a resolution of 1440x1080 (4:3) and were stored with interlaced coding.

An interlaced video frame consists of two images, also known as fields in television, captured at different moments in time: the content of one field is used on the odd-numbered frame lines and the other is displayed on the even lines. In theory, building a non-interlaced image is a simple combination of two successive fields. However, when combined into a single frame, differences between the two fields, slightly displaced in time, may become noticeable. These artifacts, known as interlacing effects, are peculiar to fast motion scenes (Fig. 18). Deinterlacing algorithms try to minimize the visual defects; nevertheless, some artifacts, such as motion blur, are not eliminated completely.

Yet another deinterlacing filter (yadif), implemented by the FFmpeg project [68], was used in the current research for pre-processing the dataset. Deinterlacing with this method may be conducted in two modes: (i) frame-to-frame and (ii) field-to-frame conversion. The second mode doubles the frame rate of the output video compared to the source. The impact of an increased frame rate on tracker performance is discussed in Section 5.3.
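As an illustration of the two modes, the calls below run FFmpeg's yadif filter from Python; yadif=0 keeps the original 25 fps rate (one frame per frame pair), while yadif=1 emits one frame per field and thus doubles the rate to 50 fps. The file names and the absence of further encoding options are illustrative.

```python
import subprocess

# Mode (i): frame-to-frame deinterlacing, output keeps the 25 fps rate.
subprocess.run(["ffmpeg", "-i", "input.m2ts", "-vf", "yadif=0",
                "deinterlaced_25fps.mp4"], check=True)

# Mode (ii): field-to-frame deinterlacing, output frame rate doubled to 50 fps.
subprocess.run(["ffmpeg", "-i", "input.m2ts", "-vf", "yadif=1",
                "deinterlaced_50fps.mp4"], check=True)
```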

Figure 18. Sample video frames demonstrating the interlacing effect.

For the purpose of tracker performance evaluation, a versatile set of samples was selected. Videos containing touches of distant screen areas and various manners of hand movement and acceleration were preferred. To avoid subjectivity, the tracker comparison task requires the establishment of reference data against which the results are assessed. Ground truth, literally the desired output, may be generated manually or by the use of a gold-standard method [69]. The first approach was used since it allows better accuracy to be achieved. The time resources required for manual annotation considerably limited the amount of available labeled data.

The evaluation dataset for the tracker comparison task included 17 video fragments, each containing a single pointing action; only movements towards the screen were considered. For each video frame, the fingertip position was manually specified with a bounding box whose sides are parallel to the coordinate axes. The bounding boxes were chosen so that their sides matched the finger appearance as accurately as possible. Example frames are provided in Fig. 19.


Figure 19. Sample video frames from the evaluation dataset. Blue rectangles denote the manually labeled ground truth fingertip positions.

5.2.1 Challenging factors

Various circumstances may significantly decrease tracker performance. Possible challenges peculiar to the utilized dataset are discussed as follows:

Motion blur Motion blur is an apparent streaking of rapidly moving objects. This defect is partially caused by the applied deinterlacing. Although the fingertip is a rigid object, its imaged appearance is noticeably distorted by motion blur and thus not persistent.


Light reflection A human nail presents a glossy surface, especially when covered with nail polish. Light reflection drastically changes the appearance of the fingertip and thus may lead to tracker failure.

Scale variation The camera is mounted so that it captures the approaching fingertip movements. Since the object is moving towards the camera, the imaged finger size increases dramatically. For trackers that do not support tracking window size variation, the object scale change may lead to inaccurate results.

Fast movement Many test subjects performed the pointing action in a rapid manner, which, combined with the close camera placement and the low frame rate, led to large gaps in the image plane between the fingertip positions in successive frames.

Dissolution against the background Another artifact caused by the application of deinterlacing methods is object dissolution against a background of a similar color.

The examples of the described challenges are presented in Fig. 20.

Figure 20.The examples of challenging factors, which may lead to tracker failure, namely, motion blur (top left), light reflection (top right), scale variation (bottom left), and dissolution against the background (bottom right).


5.3 Tracker evaluation

of performance. The huge diversity of evaluation environments and the multiplicity of challenging factors present an insuperable obstacle for tracking algorithms. It is important to assess tracker performance for each particular task before choosing a single operation method, since a tracker working well in general may unexpectedly fail in a particular case.

This section presents the results of tracker performance evaluation for the task of finger tracking in videos captured with the normal-speed camera. The performance of the selected methods was assessed against the ground truth. The utilized measures, tested configurations, and obtained results are reported next.

5.3.1 Performance measures

Selecting a tracker performance measure is a sophisticated process and no straightforward solution exists. A large number of methods targeting various aspects of tracker performance are available; thus, the selection of the final comparison method should reflect the features of the particular task.

In this research, accuracy was measured with a variation of the center location error (CLE): the finger end position was taken into account, since this definition explicitly corresponds to the goal of tracking the fingertip position. Simple averaging of the retrieved values over the sequence may not fairly assess the performance, because when an algorithm loses the object, its output is usually random and thus provides incorrect results.

Instead, the percentage of frames where the distance between the ground truth and the estimated position was below a fixed threshold (τ) was utilized as the comparative score, as suggested in [7]. The threshold value for tracker failure detection was estimated empirically and set to 16 pixels. The performance visualization was done via precision plots [7] (Figs. 21, 22), which show the percentage of frames (y-axis), denoted as the tracker precision, within a particular location error threshold (x-axis).
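As a minimal sketch (Python/NumPy; the function names and the point-based input format are illustrative assumptions, not the actual evaluation scripts), the precision score and the data behind a precision plot can be computed as follows:

```python
import numpy as np

def location_errors(gt_points, est_points):
    """Per-frame Euclidean distance between the ground-truth and estimated
    fingertip positions, both given as (N, 2) arrays of image coordinates."""
    return np.linalg.norm(np.asarray(gt_points) - np.asarray(est_points), axis=1)

def precision(errors, tau=16.0):
    """Percentage of frames whose location error is below the threshold tau (pixels)."""
    return 100.0 * np.mean(np.asarray(errors) < tau)

def precision_curve(errors, max_tau=16):
    """Precision values over a range of thresholds, as drawn in a precision plot."""
    taus = np.arange(0, max_tau + 1)
    return taus, np.array([precision(errors, t) for t in taus])
```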

Cehovin et al. [46] proposed to use the failure rate as a supplementary measure targeting tracking robustness. As noted in Section 2.4, it treats tracking as a supervised process and reinitializes the tracker once it fails; the number of manual interventions is taken into account. This method is suitable for long-term sequences, while for the current dataset, with fewer than 25 frames per video on average, it provides meaningless results. Instead, spatial and temporal robustness [20] were studied. Six scale variations of the initial bounding box, combined with three different start time positions, were tested; thus, each tracker was launched 18 times. These tests are denoted as spatial robustness evaluation (SRE) and temporal robustness evaluation (TRE), respectively.
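The 18 robustness runs per tracker can be enumerated as in the sketch below (Python); the concrete scale factors and start-frame offsets are illustrative assumptions, since their exact values are not listed here.

```python
import itertools

# Assumed perturbations: 6 scale factors for the initial box (SRE) and
# 3 start-frame offsets (TRE); the values actually used may differ.
SCALE_FACTORS = [0.8, 0.9, 0.95, 1.05, 1.1, 1.2]
START_OFFSETS = [0, 2, 4]

def robustness_runs(init_box, num_frames):
    """Yield the 6 x 3 = 18 (scaled initial box, start frame) combinations."""
    x, y, w, h = init_box
    cx, cy = x + w / 2.0, y + h / 2.0
    for scale, offset in itertools.product(SCALE_FACTORS, START_OFFSETS):
        sw, sh = w * scale, h * scale
        scaled_box = (cx - sw / 2.0, cy - sh / 2.0, sw, sh)
        if offset < num_frames:          # skip offsets beyond the sequence length
            yield scaled_box, offset
```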

5.3.2 Parameter selection

During the experiments, the default tracker parameters provided in the source code were utilized. Although adjusting the configuration might improve tracker performance, it was difficult to study and test various configurations for all the evaluated methods. Thereby, the experiment results may be treated as a lower boundary of performance. The utilized key tracker parameters are listed in Table 3.

Table 3. The key parameters of the evaluated tracking algorithms.

Method    Parameters                                Values
Staple    scale adaptation                          true
Staple+   scale adaptation                          true
DAT       color space                               rgb
ASMS      -                                         -
KCF       feature type, kernel type                 gray, gaussian
KCF2      -                                         -
SCT       -                                         -
sKCF      feature type, kernel type                 hog, gaussian
STRUCK    feature type, kernel type, budget size    haar, gaussian, unlimited
IVT       batch size, max basis                     5, 15
STC       -                                         -


port of GCC compiler, and referenced OpenCV 3.2 computer vision library (if required).

Deinterlacing allows reconstructing normal and double frame rate image sequences. Real-time processing of videos with a higher frame rate requires higher computational efficiency from the tracking algorithm. Object trackers that can be run in real time on a 25 fps video stream may not operate in real time at a considerably higher frame rate.

Thus, both frame rate cases were analyzed in order to estimate the necessity of using the double frame rate videos; at 25 fps a real-time tracker has a budget of 40 ms per frame, whereas at 50 fps the budget halves to 20 ms.

The preliminary benchmarking results showed that the first frames of the dataset videos were too complicated for some trackers and caused them to fail right at the beginning, while these trackers were able to cope well with the rest of the sequence. This problem was partially addressed by the introduction of temporal robustness evaluation (Section 5.3.1); moreover, the applicability of tracking from the end of the video was considered. Although reverse tracking is meaningless in terms of the real-time tracking task, it presents an interesting case for offline evaluation and thus was chosen for detailed analysis; a sketch of such a reverse evaluation pass is given after the list below.

Thereby, the following tracker evaluation cases were considered:

• NFR/FT: normal frame rate (25 fps), forward tracking

• DFR/FT: double frame rate (50 fps), forward tracking

• NFR/BT: normal frame rate (25 fps), backward tracking

• DFR/BT: double frame rate (50 fps), backward tracking
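For the BT cases, a tracker is initialized in the last frame and run over the sequence in reverse order. The following sketch illustrates this using OpenCV's generic tracker interface (requires the opencv-contrib package); the actual experiments used the original author implementations of each tracker, so this is illustrative only.

```python
import cv2

def track_backwards(frames, init_box_last_frame, make_tracker=cv2.TrackerKCF_create):
    """Run a tracker over a frame list in reverse temporal order (the BT case).

    `init_box_last_frame` is the fingertip bounding box (x, y, w, h) in the last
    frame. Returns one box per frame in the original temporal order; None marks
    frames where the tracker reported a failure.
    """
    reversed_frames = frames[::-1]
    tracker = make_tracker()
    tracker.init(reversed_frames[0], init_box_last_frame)
    boxes = [init_box_last_frame]
    for frame in reversed_frames[1:]:
        ok, box = tracker.update(frame)
        boxes.append(box if ok else None)
    return boxes[::-1]
```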

Tables 4 and 5 present the tracker forward evaluation outcomes for both frame rates.

The best result was achieved by the SCT tracker, which was able to precisely track the target through 57% and 77% of the frames in the normal and double frame rate cases, respectively. The rest of the ranking varies depending on the case. In NFR/FT, the second place belongs to KCF2 (49%), while the third is shared between sKCF and STRUCK (46%). In DFR/FT, STRUCK is second (62%), followed by KCF2 with a close result of 61% correctly tracked frames.


Generally, none of the trackers coped with all 17 sequences in the dataset. In DFR/FT, SCT tracked the target to the end in only 11 sequences and KCF2 in 7, while the others managed no more than 3 sequences. In NFR/FT, the SCT tracker was the best with 4 sequences in total, DAT successfully processed 2 videos, while KCF2 managed only one.

In video #3, the target appearance is altered due to light reflection, and only STRUCK successfully adapted to these changes. Five videos, namely #8, #10, #14, #15, and #17, appeared excessively challenging for the evaluated algorithms, since none of them was completely tracked in either frame rate case. Videos #8 and #10 are significantly distorted by motion blur, while in the other videos the volunteers bend their forefingers during the experiment, making them partially occluded in the camera view.

The precision plots for the NFR/FT and the DFR/FT cases are shown in Fig. 21.


Figure 21. Forward tracking precision plots. Normal (left) and double (right) frame rate cases.


[Tables 4 and 5: per-sequence mean center location error (px) and precision (%) for each evaluated tracker in the NFR/FT and DFR/FT cases; the full numeric content could not be reliably reconstructed from the extracted text.]
