Machine learning methods in interaction inference from gaze

uef.fi

PUBLICATIONS OF THE UNIVERSITY OF EASTERN FINLAND
Dissertations in Forestry and Natural Sciences, No 340

ISBN 978-952-61-3044-6
ISSN 1798-5668

HANA VRZAKOVA

MACHINE LEARNING METHODS IN INTERACTION INFERENCE FROM GAZE

Eye movements have long been associated with various cognitive processes. In this thesis I investigate spontaneous eye movements during human-computer interaction and intentions to act, and I propose a machine-learning-based framework for intelligent interaction inference. Pervasive gaze-informed inference is a fundamental part of future user-aware symbiotic interfaces.

HANA VRZAKOVA

PUBLICATIONS OF THE UNIVERSITY OF EASTERN FINLAND DISSERTATIONS IN FORESTRY AND NATURAL SCIENCES

No. 340

Hana Vrzakova

MACHINE LEARNING METHODS IN INTERACTION INFERENCE FROM GAZE

ACADEMIC DISSERTATION

To be presented, by permission of the Faculty of Science and Forestry, for public examination in the Louhela Hall, Science Park, Joensuu, on April 24th, 2019, at noon.

University of Eastern Finland School of Computing

Joensuu 2019

Grano Oy Jyväskylä, 2019

Editors: Pertti Pasanen, Matti Tedre, Jukka Tuomela, and Matti Vornanen

Distribution:

University of Eastern Finland Library / Sales of publications julkaisumyynti@uef.fi

http://www.uef.fi/kirjasto

ISBN: 978-952-61-3044-6 (print)
ISSNL: 1798-5668
ISSN: 1798-5668

ISBN: 978-952-61-3045-3 (pdf)
ISSNL: 1798-5668
ISSN: 1798-5676

Author's address:
    University of Eastern Finland
    School of Computing
    Länsikatu 15
    80110 JOENSUU, FINLAND
    email: hana.vrzakova@uef.fi

Supervisors:
    Associate Professor Roman Bednarik
    University of Eastern Finland
    School of Computing
    Länsikatu 15
    80110 JOENSUU, FINLAND
    email: roman.bednarik@uef.fi

    Professor Markku Tukiainen
    University of Eastern Finland
    School of Computing
    Länsikatu 15
    80110 JOENSUU, FINLAND
    email: markku.tukiainen@uef.fi

Reviewers:
    Senior Researcher Dr. Peter Kiefer
    ETH Zurich
    Institute of Cartography and Geoinformation
    Stefano-Franscini-Platz 5
    8093 ZURICH, SWITZERLAND
    email: pekiefer@ethz.ch

    Professor Kai Kunze
    Keio University
    Graduate School of Media Design
    4-1-1 Hiyoshi, Kohoku-ku
    223-8526 YOKOHAMA, JAPAN
    email: kai@kmd.keio.ac.jp

Opponent:
    Professor Hans Gellersen
    Lancaster University
    InfoLab21
    LA1 4WA LANCASTER, UNITED KINGDOM
    email: h.gellersen@lancaster.ac.uk

Hana Vrzakova

Machine learning methods in interaction inference from gaze
Joensuu: University of Eastern Finland, 2019

Publications of the University of Eastern Finland Dissertations in Forestry and Natural Sciences

ABSTRACT

Eye gaze has long been claimed to reflect an individual's cognitive states and to provide a link between eye movements and reasoning. Consequently, a line of prior and current research has been developing gaze-aware applications that infer interaction intentions from eye gaze patterns. Applying eye movements in the design of pervasive gaze-aware interfaces outside the lab, however, has not been straightforward.

As with other behavioral signals, high between-subject variance and sensor- and task-induced bias and noise inevitably influence eye-tracking outcomes.

In this thesis, I investigate spontaneous, non-manipulated eye movements underlying human-computer interaction and aim to recognize various aspects of interaction, such as an intention to act and underlying affect, using intelligent data processing and machine learning. The novel framework comprises three unimodal and multimodal eye-tracking datasets (8Puzzles, Code review, and Daily clicking) and a set of tools for automatic interaction inference (PandasEye). The accuracy of the action predictions reported in this work reaches 80%. The main findings summarized in this work serve as recommendations and a baseline for systematic comparisons.

Universal Decimal Classification: 004.5, 004.85, 159.94, 612.846

Library of Congress Subject Headings: Eye – Movements; Eye tracking; Gaze; Artificial intelligence; Machine learning; Human-computer interaction; Human behavior; Intention; Affect (Psychology); Inference

Yleinen suomalainen asiasanasto: silmänliikkeet; katseenseuranta; tekoäly; koneoppiminen; ihmisen ja tietokoneen vuorovaikutus; vuorovaikutus; toiminta; päättely; ennusteet

ACKNOWLEDGEMENTS

I would like to express my gratitude to my advisory committee, Prof. Markku Tukiainen and Doc. Roman Bednarik, for their support in forming new hypotheses, their patience in endless proof-reading of my drafts, and their mentoring. My appreciation also belongs to my additional research collaborators and mentors with whom I had the incredible opportunity to work on joint projects: Prof. Yukiko Nakano, Seikei University, who kindly hosted me during my stay in Japan; Dr. Andrew Begel, MSR, who introduced me to the domain of affective computing and software engineering; Prof. Lauri Mehtätalo, UEF, who meticulously re-examined all the steps in our statistical models; and Prof. Juha E. Jääskeläinen and Prof. Mikael von zu Fraunberg, KYS, whose excitement for innovations in microsurgery was contagious. I would like to extend my thanks to Dr. Peter Kiefer and Prof. Kai Kunze, who dedicated their considerable time and expertise to serving as reviewers of this thesis, and to Prof. Hans Gellersen, who served as the honorable opponent of this thesis.

Even in the Middle of Knowhere (a UEF slogan), my colleagues and friends were a great source of inspiration and support. Their optimism, hard work, and stubborn insistence that we would not give up yet were essential to our shared times. My thanks (alphabetically ordered) belong to Piotr Bartczak, Tereza Busjahn, Antti-Pekka Elomaa, Yuliya Fetyukova, Antti Huotarinen, Jani Koskinen, Juha Lång, Tomi Leppänen, Najlah Gali, David Gil de Gómez Pérez, Pavel Orlov, and colleagues in the UEF OpenSpace and KYS Neurocenter, where all the magic happened. Finally, I am very grateful for the support of my friends and family, who cheered me on during my work at UEF.

This research would not have been possible without the support of the Cross Border University (2012), an ECSE doctoral position at the University of Eastern Finland during 2013-2016, the Academy of Finland, FIRST project, grant no. 305199 (2017-2018), and mobility grants.

Joensuu, March 22, 2019 Hana Vrzakova

LIST OF PUBLICATIONS

The thesis consists of the present review of the author's work in the field of eye-tracking and the following selection of the author's publications. The order of the manuscripts is chronological and covers the development of machine learning methods for interaction inference from eye movements (P1), an extension of extracted gaze-derived features (P2) and multimodal features (P5), and properties of data sequencing (P3). The datasets included in this work spanned problem solving in 8Puzzles (P1, P2, P3), source code review (P5), and daily web-browsing (P4). Manuscript P6 provides an overview of the eye-tracking domain and of the studies employing machine learning methods to infer interaction aspects and cognitive processes from eye movements.

P1 Bednarik, R., Vrzakova, H., and Hradis, M.: What do you want to do next: a novel approach for intent prediction in gaze-based interaction. In Proceedings of the Symposium on Eye Tracking Research and Applications, pp. 83–90, ACM, 2012.

P2 Vrzakova, H. and Bednarik, R.: Fast and Comprehensive Extension to Intention Prediction from Gaze. In Proceedings of the 2nd IUI Workshop on Interacting with Smart Objects, pp. 34–39, 2013.

P3 Vrzakova, H. and Bednarik, R.: Quiet Eye Affects Action Detection from Gaze more than Context Length. In Proceedings of User Modeling, Adaptation and Personalization, pp. 277–288, Springer, 2015.

P4 Nguyen, D., Vrzakova, H., and Bednarik, R.: WTP: web-tracking plugin for real-time automatic AOI annotations. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pp. 1696–1705, ACM, 2016.

P5 Vrzakova, H., Begel, A., Mehtätalo, L., and Bednarik, R.: Affect Recognition in Code Review: An in-situ biometric study of reviewer's affect. In revision at Journal of Systems and Software, pp. 1–10, Elsevier, 2018.

P6 Vrzakova, H. and Bednarik, R.: Machine Learning in Eye-Tracking Studies for User State and Characteristics Modeling: A Semi-Systematic Review. In revision at Journal of User Modeling and User-Adapted Interaction, pp. 1–28, Springer, 2018.

AUTHOR’S CONTRIBUTION

The publications selected for this dissertation are original research articles on interaction inference from eye movements. The contributions of the authors are described in detail below:

The idea of machine learning methods for eye-tracking analysis (P1) was proposed by Dr. Roman Bednarik and Dr. Michal Hradis. Dr. Hradis provided guidance in building the machine learning framework. The author implemented the data processing methods and ran the experiments using the machine learning framework. Dr. Bednarik and the author co-wrote the final manuscript.

Articles P2 and P3 originated from joint discussions with Dr. Roman Bednarik. The author was the main contributor, implemented the methodology extensions, ran the experiments, and prepared the manuscripts. Dr. Roman Bednarik contributed to the introduction and discussion parts of the manuscripts. In P5, the machine learning pipeline from P1 was scripted and extended to multimodal signal processing in the context of software engineering. Dr. Andrew B. Begel guided the experiment design and embedded the sensing of multimodal signals into CodeReview, Prof. Lauri Mehtätalo contributed to the statistical evaluation of the experiment outcomes, and Dr. Roman Bednarik contributed to the discussion on machine learning.

Article P4 emerged from Duc Nguyen's MSc thesis, in which he implemented the WTP plugin for real-time AOI annotation in web-based studies of interaction. I supervised the design of the user study and authored the Introduction, Background, and Case Study sections of the article.

Article P6 presents a domain overview and originated from joint discussions with Dr. Roman Bednarik. The author collected the source articles and annotated the papers according to the agreed annotation scheme. Dr. Bednarik and the author co-wrote the final manuscript.

TABLE OF CONTENTS

LIST OF FIGURES xiii
LIST OF TABLES xv

1 Introduction 1
  1.1 Research questions ... 2
  1.2 Research process ... 2
  1.3 Method of Thesis ... 2
  1.4 Organization of Thesis ... 3

2 Background 5
  2.1 Principles of eye-tracking ... 5
  2.2 Links between cognition, attention and action ... 8
  2.3 Prediction of interaction from eye movements ... 10
    2.3.1 Predictions of internal user's states ... 11
    2.3.2 Predictions of external user's activity and actions ... 12
  2.4 Summary ... 15

3 Interaction inference from eye movements using machine learning: Framework and design choices 17
  3.1 Preprocessing of eye-tracking data ... 17
    3.1.1 Data sequencing ... 17
    3.1.2 Sequence length ... 19
    3.1.3 Sequence timing and synchronization ... 19
  3.2 Feature engineering and selection ... 20
    3.2.1 Features derived from fixations and saccades ... 21
    3.2.2 Features derived from pupillary responses ... 22
    3.2.3 Feature selection ... 25
  3.3 Pipeline for supervised machine learning for interaction inference from gaze ... 31
    3.3.1 Machine learning challenges and remedies ... 36
    3.3.2 Experience with machine learning tools ... 40
  3.4 Datasets for interaction inference from gaze ... 41
    3.4.1 Dataset I. Problem solving with 8Puzzles ... 41
    3.4.2 Dataset II. Affect in source code review ... 44
    3.4.3 Dataset III. Daily clicking ... 46
    3.4.4 Discussion of datasets ... 47
  3.5 Conclusion ... 48

4 Summary of contributions 49

5 Discussion 59
  5.1 Discussion of results ... 59
  5.2 General discussion ... 62
  5.3 Limitations and future research ... 65
    5.3.1 Predetermined decisions in ML pipeline ... 65
    5.3.2 Person-dependent classifiers and individual differences ... 66
    5.3.3 Offline processing ... 66
    5.3.4 Benchmarked datasets ... 66

6 Conclusion 69

BIBLIOGRAPHY 71

LIST OF FIGURES

1    Extraordinary Observer by Enkel Dika (2019). Reprinted with the author's kind permission. ... xvii
3.1  Fixation-based segmentation. The data stream is segmented into sequences with one-fixation overlap. The first sequence consists of all data points from fixations t1, t2, and t3; the second sequence consists of data from fixations t2, t3, and t4, etc. Adopted from P1. ... 18
3.2  Overview of segmentation setups tested in P1, P2, and P3. Fixations in red present the fixation during the action click; fixations in orange and blue present fixations classified in the positive class (action-related) and in the negative class (free viewing), respectively. ... 20
3.3  Scheme of fixation and saccade measures. Figure extended from P1. ... 21
3.4  Averaged pupillary responses around the action event (left), adopted from Richer and Beatty (1985) with kind permission to reprint. Mean pupil dilations around the click (light red) and during free viewing (dark blue) from P1. ... 23
3.5  Steps employed in pupil processing. ... 24
3.6  Raw pupillary responses of one participant before the action (left) and the aligned and normalized signal (right). The data segments were extracted from three fixations around the action event from the gaze-augmented dataset (P1). The red vertical line (t0) represents the action click. ... 24
3.7  Scheme of building blocks and steps in the machine learning pipeline. ... 32
3.8  Example of the Receiver Operating Characteristic (ROC) curve (red). The dotted line presents the performance received by chance; the blue line illustrates a decreasing threshold in the classification. ... 34
3.9  Example of a linear classifier splitting two classes with a decision boundary with a soft margin. ... 35
3.10 Example of a classifier with non-linearly separable classes (right) and their separability in a new feature space using a kernel trick. ... 35
3.11 RapidMiner: setting up the parameters of grid search. For each parameter, a user defines minimal and maximal boundaries, a step, and a scale; the grid search generates an appropriate set of parameters, systematically tested in each search trial. ... 40
3.12 User's scanpath in 8Puzzles gameplay. The task is to re-shuffle tiles to match a target configuration given in the lower left corner. Adopted from P3. ... 43
3.13 Interface for source code review. Adopted from P5. ... 44
3.14 User's scanpath (blue) and mouse trajectory (red) in a Google search for presidential candidates 2016. Adopted from P4. ... 46
3.15 Task difficulty vs. ecological validity in datasets. ... 48
4.1  Histograms of average percentage change of pupil size (APCPS) on the left and mean fixation duration (MFD) on the right. Adopted from P1. ... 50
4.2  Increase and decrease of fixation-based (left) and saccade-based (right) features in intention segments compared to free-viewing segments. Adopted from P2. ... 51
4.3  Average pupillary responses of the positive (intention to act) and negative (free viewing) classes. The segments were extracted two fixations prior to and one fixation after the mouse click. Pupillary responses were normalized using PCPS with the mean of each sequence (left) and with the mean of the dataset (right). Adopted from P1. ... 52
4.4  Overview of methods in the Data Preprocessing tool. ... 56
4.5  Overview of methods in the Machine Learning tool. ... 57
5.1  Examples of implicit, explicit, and, newly, semi-explicit intentions in gaze interaction. ... 64

LIST OF TABLES

1.1 Research questions discussed in the included publications. ... 3
2.1 Example studies inferring internal and external user states from gaze. ... 11
3.1 Examples of feature selection methods. ... 27
3.2 Overview of features derived from eye movements and multimodal signals employed in this work. ... 28
3.3 Challenges in eye-tracking user studies, data handling, and machine learning experiments and suggested remedies. ... 38
3.4 Overview of datasets collected in this thesis. ... 42
4.1 Overview of classification performance achieved with gaze-augmented (GA), mouse-only (MO) interaction and datasets based on fixation (F), saccade (S), and pupil dilation (P) features. ... 49
4.2 The highest achieved prediction performance for each data transformation. ... 51
4.3 Differences between highest and lowest achieved prediction performances linked to data transformations. ... 51
4.4 Summary of prediction performances achieved in included publications. ... 53
4.5 Overview of input and output parameters in the PandasEye. ... 54

Figure 1: Extraordinary Observer by Enkel Dika (2019). Reprinted with author’s kind permission.

1 Introduction

With recent advances in eye-tracking sensors and the availability of recording instruments, observing interaction and eye movements has never been easier. Gaining insights from the coupling between gaze and interaction in a user interface (interaction inference), however, is persistently challenging, and methods to determine human reasoning from eye movements are still lacking (Jacob and Karn, 2003). One reason is that our human memory simply fails to provide sufficient details about past events when we are explicitly asked to recall these interactions. Missing ground truth related to human reasoning creates a gap between observations of a participant's behavior and the corresponding but covert reasoning. The inference of interaction aspects and a participant's cognitive processes as indicated by behavioral signals thus resembles an inverse problem in which the behavioral signals, i.e. the gaze data streams recorded during interaction, are used to estimate the underlying intentions and reasoning.

Interaction inference from eye movements presents a quantitative analysis that computationally describes the link between interaction, eye movements, and the participant's cognitive processes. In previous studies, researchers have inferred various aspects of interaction, such as interaction goals and quality, from eye movements. Dilated pupils, for example, have been observed to increase with arousal and affection (Hess and Polt, 1960), and changes in pupil size were associated with lying (Bradley and Janisse, 1979), though this was later disregarded by Janisse (1974). Further research has continued to link eye movements to a variety of cognitive processes during interaction. Recent advances in eye-tracking technology have allowed the unobtrusive measurement of gaze direction in various interactive settings, including human-computer interaction (Gegenfurtner et al., 2014, Rayner, 2009, Sharafi et al., 2015), human-robot interaction (Ruhland et al., 2015), and face-to-face dyadic and multiparty social interaction (Gatica-Perez, 2009).

When searching for patterns in recorded gaze data streams, one must compensate for current eye-tracking shortcomings such as voluminous, noisy data. Due to the thousands of gaze samples and the high variance in human physiology and behavior, standard statistical tests are to be applied with caution. Previous research has demonstrated that standard statistical tests applied to the entire eye-tracking data recorded during an experiment are likely to deliver statistically significant results, though with a low effect size (Bednarik, 2007). Analytical methods require intelligent data processing to understand the link between interaction, eye movements, and user's states.

In this work, we explored machine learning methods for interaction inference from eye movements and studied the extent to which one can infer aspects of interaction from a participant's gaze behavior. Our main motivation was to develop computational methods that would help in the effective modeling of differences between interactive actions and non-interactive events (free viewing). Recognition of these two states is essential (Mulvey and Heubner, 2012) for enhancing both visual attentive interfaces (Raiha et al., 2011) and gaze-sensing ubiquitous applications (Majaranta and Bulling, 2014).

1.1 RESEARCH QUESTIONS

In the main hypothesis of this work, I investigate the extent to which it is possible to infer interaction and interaction characteristics from eye movements. The main hypothesis is divided into the following research questions:

RQ1 What are suitable event representations using eye movements for interaction inference?

a) Which eye movement features best describe the intentionality in interaction?

b) What is the impact of interaction modality on eye movement features?

RQ2 How do data processing methods affect interaction inference?

a) How do data segment size and timing impact interaction inference?

b) How does pupil normalization impact interaction inference?

RQ3 How do properties of the classification process influence prediction performance?

RQ4 What are practices and tools for automatic inference of user states from eye movements?

RQ5 How should a machine learning tool be devised to enable interaction inference from eye movements given the best practices learned in this thesis?

1.2 RESEARCH PROCESS

In this work, we aimed to investigate the extent to which one can prevent the Midas Touch effect (Jacob, 1991) in gaze-aware interfaces using interaction inference from eye movements. With this goal in mind, we sliced interaction-related eye-tracking data into short episodes of eye movements (fixation sequences) and then split them into two classes: action and free viewing. As we aimed for automatic computational methods, we developed computational routines for data preprocessing, feature engineering, and classification. Preliminary results raised several questions, mainly related to how methodological settings impact recognition rates. Consequently, we experimented with the segmentation of eye-tracking data and studied the impact of the size and timing of segments on recognition rates. The developed methodology was transferred and extended in the secondary study of code reviews, where we aimed to predict the reviewer's affective states when commenting on source code. The advances in methodology offer a framework for analyzing interaction inference from voluminous eye-tracking data streams, even beyond the originally intended scope of work. The developed methods are described in paper P1, and the methodological improvements are investigated in papers P2 and P3, as summarized in Table 1.1.
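To make the slicing-and-labeling step concrete, the following Python sketch segments a fixation stream into overlapping three-fixation sequences and labels each one as action or free viewing. This is only an illustrative sketch under assumed inputs (a list of fixations with onsets and offsets, and a list of click timestamps); the field names, the window size, and the function name are not taken from the thesis' PandasEye implementation.

    # Illustrative sketch: slice a fixation stream into overlapping windows and
    # label each window by whether a mouse click occurred inside it.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Fixation:
        start_ms: float   # fixation onset
        end_ms: float     # fixation offset
        x: float          # fixation centroid (pixels)
        y: float

    def label_sequences(fixations: List[Fixation], click_times_ms: List[float],
                        window: int = 3) -> List[dict]:
        """Slide a window of consecutive fixations and label each sequence."""
        sequences = []
        for i in range(len(fixations) - window + 1):
            seq = fixations[i:i + window]
            t0, t1 = seq[0].start_ms, seq[-1].end_ms
            has_click = any(t0 <= c <= t1 for c in click_times_ms)
            sequences.append({"fixations": seq,
                              "label": "action" if has_click else "free_viewing"})
        return sequences

The resulting labeled sequences can then be passed to feature extraction and classification, mirroring the action vs. free-viewing split described above.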

1.3 METHOD OF THESIS

We employed multiple research methods to answer the aforementioned research questions. In three empirical eye-tracking studies, we recorded and analyzed eye movements during problem-solving, code review, and common web-browsing activities.

Table 1.1: Research questions discussed in the included publications.

  Research question                                                          Articles
  1 What are suitable event representations using eye movements for
    interaction inference?                                                   P1, P2, P3
  2 What are the effects of data processing on interaction inference?        P1, P2, P3
  3 How do properties of the classification process influence prediction
    performance?                                                             P1, P2
  4 What are practices and tools for automatic inference of user states
    from eye movements?                                                      P4, P5, P6
  5 How should a machine learning tool be devised to enable interaction
    inference from eye movements given the best practices learned in
    this thesis?                                                             This thesis

All experiments were originally designed for different purposes. The first experiment was initially targeted to study differences in interaction modalities and their impact on problem-solving. In the second experiment, biometric sensors were employed to reveal affective states in source code review. The last experiment was designed to reveal differences in eye movements with respect to various common activities during web-browsing. In each study, we developed a novel methodology to answer the research questions.

Part of this work employed a constructive research paradigm as we iterated between the methodology development and the evaluation. In the implementation and construction phases, we experimented with machine learning methods and representations of eye movements in a feature space. The improved settings were the construct, and the evaluation of machine learning performance took the form of analytical reflections. In each methodological iteration, the improved methodology was evaluated, and the best solutions were propagated to the next research iteration.

1.4 ORGANIZATION OF THESIS

The thesis is organized into the following chapters. Chapter 2 summarizes previous and current research on interaction inference from eye movements using machine learning methods. Chapter 3 details the methodology developed during the research. The main results and challenges are described in Chapter 4 and evaluated in Chapter 5. Appendices contain the research papers included in this work.

2 Background

Machine learning offers a suitable methodology for processing noisy and voluminous data. While high variance and large datasets can lead to spurious results, machine learning methods can benefit from voluminous data and create better statistical models (Langley, 1988). Also for this reason, research on human behavior and cognitive processes, such as eye tracking, has employed machine learning as a means of analysis. The studies included in this chapter are purposively selected examples.

This chapter covers the principles of the video-based eye-tracking used in this work and a selection of studies linking gaze to user states and aspects of interaction (interaction inference from gaze). The selected studies provide an overview of topics and methods and introduce the context for the following chapters. In P6, we conducted a semi-systematic literature review of the eye-tracking domain and investigated the topic in greater detail.

2.1 PRINCIPLES OF EYE-TRACKING

The human visual system is complex. When a beam of light enters the eye, it is bent by the lens, travels through the vitreous body, and lands on the fovea, a small region on the retina. Rods and cones, the retinal cells sensitive to dim achromatic and bright chromatic light, respectively, then encode the information in the light beam and translate it into electrochemical signals that are transmitted through the optic nerve and visual pathways to the corresponding brain regions. In the visual cortex, the neural signals are decoded back into perception. This is only the perception side; the oculomotor part of vision is complex as well (Duchowski, 2007, Chap. 2).

Contrary to our experience of perception, the eyes are not still. They are restless, constantly shifted by eye muscles. Three pairs of extraocular muscles are responsible for moving the eye in six directions and enable it to direct its focus at a selected object. Various eye movements - fixations, saccades, smooth pursuits, vergence, and vestibular and physiological nystagmus - "refresh" the retina and ensure that the image remains projected on the retina. The neural control system and extraocular muscles compensate for the tremble, resulting in a steady and sharp perception of the image. When the image stays fixed at one position on the retina, the perceived object begins to fade away after several seconds.

Direct observation of eye movements is tedious work; in the early years of eye-tracking, enthusiastic researchers equipped with camera film and a ruler spent their time measuring spatial differences in each film frame. As technology has advanced, capturing eye movements has become easier. At the time of writing this thesis, most commercial systems work on the principle of video-based tracking and pupil-corneal reflections. Compared to pure video-based tracking, where the eye camera captures an image of the eye, detects the pupil in the frame, and infers the gaze direction, reflection-based eye-trackers employ infra-red illumination to detect additional reference points in the form of corneal reflections (Hansen and Ji, 2010). As an advantage, the use of corneal reflections allows the system to compensate for minor head movements, since the glint remains fixed while the eyeball rotates (Duchowski, 2007, Chap. 5).

When processing the video frames, the eye-tracking system computes differences in both the position of the pupil and the corneal reflections, maps the relationship between the pupil and the corneal reflection to the ranges recorded during calibration, and estimates a gaze vector on the stimulus (Duchowski, 2007). The obtained gaze points, in the form of horizontal and vertical coordinates on the stimulus (either in pixels or in visual angle), present the raw data output that is further processed and filtered to obtain fixations and saccades.
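As a rough illustration of this mapping step (not the procedure of any particular commercial tracker), one can fit a second-order polynomial from pupil-minus-glint vectors to screen coordinates using the points collected during calibration. The function names and the choice of polynomial terms below are assumptions made for the sketch.

    # Sketch: least-squares fit of a second-order polynomial mapping from
    # pupil-glint difference vectors (vx, vy) to screen coordinates (sx, sy).
    import numpy as np

    def design_matrix(v: np.ndarray) -> np.ndarray:
        vx, vy = v[:, 0], v[:, 1]
        return np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx**2, vy**2])

    def fit_calibration(v_calib: np.ndarray, screen_calib: np.ndarray) -> np.ndarray:
        """Fit polynomial coefficients (one column per screen axis)."""
        coeffs, *_ = np.linalg.lstsq(design_matrix(v_calib), screen_calib, rcond=None)
        return coeffs

    def estimate_gaze(v: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
        """Map new pupil-glint vectors to estimated on-screen gaze points."""
        return design_matrix(v) @ coeffs

For a nine-point calibration, v_calib would be a 9x2 array of pupil-glint vectors and screen_calib the corresponding 9x2 array of known target positions.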

Fixations

Fixations are the eye movements responsible for image stabilization on the retina. About 90% of viewing time is spent fixating (Duchowski, 2007, Chap. 2). Even though the image seems to be stabilized, the eye continues to perform three micro movements – tremor, drift, and microsaccades – within a range of about 5° around a fixated object. To estimate the fixation center and duration, fixation detection algorithms cluster the raw gaze points. Fixation filtering belongs to a longstanding branch of eye-tracking research; fixation filters are designed and parametrized with respect to the eye-tracker sampling rate, data quality, and task characteristics, to name a few. Consequently, a typical range of fixation duration can fluctuate between 150 and 600 ms (Holmqvist et al., 2011, chap. 11).
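To illustrate the clustering idea behind fixation detection, the sketch below implements a simple dispersion-threshold (I-DT-style) filter. The thresholds and the pixel-based coordinates are illustrative assumptions, not the settings used in the included studies.

    # Sketch of a dispersion-threshold fixation filter: grow a window of samples
    # while their spatial dispersion stays small, then accept it as a fixation
    # if it lasts long enough. t_ms, x, y are 1-D numpy arrays of equal length.
    import numpy as np

    def idt_fixations(t_ms, x, y, max_dispersion_px=35.0, min_duration_ms=100.0):
        """Return (start_ms, end_ms, centroid_x, centroid_y) tuples."""
        fixations, i, n = [], 0, len(t_ms)
        while i < n:
            j = i
            while j + 1 < n and ((x[i:j + 2].max() - x[i:j + 2].min()) +
                                 (y[i:j + 2].max() - y[i:j + 2].min())) <= max_dispersion_px:
                j += 1
            if t_ms[j] - t_ms[i] >= min_duration_ms:
                fixations.append((t_ms[i], t_ms[j],
                                  float(np.mean(x[i:j + 1])), float(np.mean(y[i:j + 1]))))
                i = j + 1
            else:
                i += 1
        return fixations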

Fixations have been perceived as a unit of cognitive processing. In their fundamental work on eye span in reading, Just and Carpenter (1980) observed that individuals were cognitively processing fixated words with minimal lag, suggesting that cognitive processing happens during fixations. Influenced by the Eye-Mind hypothesis, further studies investigated gaze behavior in reading (Rayner, 2009), medical expertise (Gegenfurtner et al., 2011), daily activities (Land, 2006), and sports (Vickers, 2007), to name a few. Fixations have been of great importance also in gaze interaction and eye-typing methods (Majaranta and Räihä, 2002). Although the Eye-Mind hypothesis has limitations (Anderson et al., 2004), in this work, we employ fixations as a unit of analysis and study spatial-temporal latent features of fixations (further in Section 3.1).

Saccades

Saccades, first described by Javal in 1879, are fast movements that orient the eye to a new location in the visual field according to attentional changes (Land, 2006). During a saccade, the eye is essentially blind, as few external changes are projected on the retina. Presumably, the saccadic "blindness" is due to pre- and post-saccadic suppression, when the eye prepares for a physical movement and for the stabilization of the image on the retina (Volkmann, 1986).

Two possible approaches – exclusion and filtering – have been applied in saccade detection. In exclusion, saccades are detected as the movements between two fixations and thus are filtered by excluding the data points linked to blinks, fixations, and noise. In filtering, saccades are computationally filtered using event detection algorithms based on the velocities and accelerations between raw gaze points, with thresholds varying in the ranges of 30–100°/s (velocity) and 4000–8000°/s² (acceleration), respectively. To reliably detect saccades, the recommended sampling rates of eye-trackers have been set to 300 Hz to fulfill the Nyquist-Shannon sampling theorem (Holmqvist et al., 2011, chap. 2).
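A minimal sketch of the velocity-based filtering idea follows: every sample whose point-to-point velocity exceeds a threshold is marked as saccadic. The 30°/s default (the low end of the range quoted above) and the assumption that coordinates are already expressed in degrees of visual angle are illustrative choices.

    # Sketch: velocity-threshold saccade detection on gaze samples in degrees.
    import numpy as np

    def saccade_mask(t_ms: np.ndarray, x_deg: np.ndarray, y_deg: np.ndarray,
                     velocity_threshold_deg_s: float = 30.0) -> np.ndarray:
        """Boolean mask marking samples that belong to saccades."""
        dt_s = np.diff(t_ms) / 1000.0
        velocity = np.hypot(np.diff(x_deg), np.diff(y_deg)) / dt_s   # deg/s
        mask = np.zeros(len(t_ms), dtype=bool)
        mask[1:] = velocity > velocity_threshold_deg_s
        return mask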

As the fastest measurable movements in the human body, saccades have predominantly been of interest in studies of reading, where saccade amplitude, orientation, and duration have been linked to various reading stimuli and personal characteristics, including expertise, dyslexia, text difficulty, and reading aloud (for a review, see Rayner (1998)). In his initial study of scene perception, Yarbus (1967) visualized and investigated the differences in saccade direction and dispersion in relation to an individual's cognitive tasks. In this work, we depart from the detailed inspection of saccade amplitude and orientation and focus on the latent characteristics of eye movements that we encode into gaze features, similarly to research practice in biometrics (Rigas and Komogortsev, 2017).

Pupillary responses

Pupillary responses are probably the most studied group of eye movements. Although prior research has examined changes in pupil diameter with respect to cognition, the primary purpose of pupillary responses is to accommodate the human eye to external light conditions. Generated by two opposing muscles in the iris (the sphincter and the dilator pupillae), pupil contractions and dilations regulate incoming light from surrounding illumination and adjust the depth of field by changing the curvature of the eye lens (Beatty and Lucero-Wagoner, 2000). The pupillary light reflex and the accommodation reflex, however, are not the only reflexes observable in pupil dilations. Internal cognitive processing and other minor reflexes, such as the psychosensory (Kahneman, 1973) and task-evoked (Beatty and Lucero-Wagoner, 2000) reflexes, also impact pupil dilations.

The reflexes triggered by internal processing are, however, relatively small compared to the reflexes caused by light and depth accommodation. Additionally, the range of pupil dilations has been linked to an individual's age and the use of medications. Since the absolute size of pupil dilation (measured in millimeters or pixels) is prone to these artifacts and is unreliable in between-subject comparisons, it is recommended to normalize the raw pupillary responses and compare them against a baseline (Beatty and Lucero-Wagoner, 2000). Although this recommendation is well accepted, the methods for establishing the baseline and normalizing the pupil diameter have differed across studies. Mean subtraction, Z-score, and percentage change in pupil size (PCPS) are common choices and have been compared in our initial work, P1.
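To make the three options concrete, the sketch below normalizes a segment of pupil diameters by mean subtraction, Z-score, or PCPS. Using the segment mean as the default baseline is an assumption of this example; as noted above, P1 also compared normalization against the mean of the whole dataset.

    # Sketch of common pupil-baseline normalizations.
    import numpy as np

    def normalize_pupil(diameters: np.ndarray, method: str = "pcps",
                        baseline=None) -> np.ndarray:
        base = np.mean(diameters) if baseline is None else baseline
        if method == "mean_subtraction":
            return diameters - base
        if method == "zscore":
            return (diameters - base) / np.std(diameters)
        if method == "pcps":   # percentage change in pupil size
            return 100.0 * (diameters - base) / base
        raise ValueError(f"unknown method: {method}")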

In a range of cognitive processes, pupillary responses have mainly been linked to task workload and an individual's affective states (Wang et al., 2013). In their initial study, Hess and Polt (1964) observed that the participant's pupil dilates twice as much in harder multiplication tasks than in simpler ones. The correlation between task difficulty and pupil diameter was also observed in reading complex sentences (Just and Carpenter, 1993), evaluating complex source code (Fritz et al., 2014), aiming in challenging laparoscopic conditions (Jiang et al., 2014), and even in pressing a simple button with more effort (Richer and Beatty, 1985).

Changes in pupil size have also been indicative in studies of task boundaries, signaling a switch between two subtasks (Iqbal et al., 2004) and task properties (Jiang et al., 2014). Although it remains unclear whether the changes were caused by the type of subtask or the level of subtask difficulty, pupillary responses have been prominent in the inference of cognitive processing.

In this work, we experimented with various features of pupil dilations to infer intentionality of interaction, as described in Section 3.2.

2.2 LINKS BETWEEN COGNITION, ATTENTION AND ACTION

Since the beginning of the eye-tracking field, the link between cognition, attention, and eye movements has been in the spotlight. "Even though it is difficult to identify the specific properties of eye movement trace that reveal cognition, there is a clear information in those scanpaths that can nonetheless be used to infer the cognitive processes." (Spivey and Dale, 2011).

Cognition, attention and eye-movements

Sensory information available through human perception is enormous (10^8–10^9 bits), as estimated by Itti (2000) and Koch et al. (2006). Unsurprisingly, humans have only a limited information processing capacity to make sense of the input (Marteniuk, 1976, Chap. 3). When it is impossible to process the complete sensory input in parallel, attention serves as a serial bottleneck to limit the input sensory stream and to filter the information that passes through. Two competing theories have dominated the field of psychology, arguing whether attention is mainly early-selective (driven by low-level stimulus features) or late-selective (oriented by internal human goals) (Anderson, 2005, Chap. 3).

Theories of visual attention have followed this division (Duchowski, 2007, Chap. 1). In bottom-up models, visual attention has been hypothesized to be guided mainly by the physical appearance and low-level characteristics of the stimulus, such as contrast, colors, and motion in the visual field (Broadbent, 2013, Helmholtz and Southall, 1925), and these are processed in parallel. Bottom-up visual processing has been particularly compelling in the development of probabilistic models aiming to predict the distribution of visual attention over the observed stimulus (as reviewed by Borji and Itti (2013)).

In top-down models, which originated in studies of art perception (Buswell, 1935, Yarbus, 1967), visual attention has also been observed to be task-driven and oriented by an individual's high-level goals and intentions (Gibson, 1941, James, 1981). In top-down processing, eye movements sequentially visit foveal regions of interest (Noton and Stark, 1971) and assemble the information necessary for the current task. This thesis operates with the assumption that various aspects of interaction result from cognitive plans and intentions.

The link between eye movements, attention, and cognition is, however, far more complicated. Though correlated, eye movements and attention are not identical. When attending to multiple road signs, for example, a driver's attention was observed to be directed to several signs simultaneously, though the eye movements were slightly delayed (Salvucci, 2000). Furthermore, gaze can be directed to a stimulus without attention being paid to it (see studies of mind-wandering in education (Bixler and D'Mello, 2016)) and, with some conscious effort, attention can be paid to a stimulus without an accompanying foveal direction (Ballard et al., 1995, Posner, 1980). The role of covert attention is also unclear, as current eye-trackers allow tracking only of foveal vision. The important contributions of parafoveal and peripheral vision remain untraceable and are accessible only using systems that dynamically mask fixated areas, as in the case of code comprehension (e.g., Orlov and Bednarik, 2017).

The link between eye movements and cognition is even more challenging to decode. In early studies and in well-controlled laboratory research, eye movements were perceived as a result of cognitive processing. Fixations and saccades revealed the reading process (Dodge and Cline, 1901) and cognitive tasks in scene perception (Yarbus, 1967). Pupillary responses were observed to increase with arousal (Hess and Polt, 1960) and workload (Hess and Polt, 1964). However, the link between cognition and eye movements is not exclusively one-way (Spivey and Dale, 2011).

The opposite direction of the link, in which eye movements influence cognition, has been supported in fixation and saccade frequencies. A human eye can fixate on a new object before the previously fixated object has been cognitively processed (Land, 2006). Gold and Shadlen (2000) hypothesized that eye movements accumulate the necessary perceptual information before a decisive action can be completed. Spivey and Dale (2011) characterized eye movements as coexisting with cognition, both revealing and influencing ongoing cognitive processing. Despite the evidence for the bidirectional link between gaze and cognitive processes, it remains unclear how dynamic and temporally fluctuating the link is. This aspect has also been of great importance in the studies of eye-hand coordination and interaction with the outer world that I explore next.

Link between eye movements and interaction

In interaction with the outer world, the role of eye movements has resonated with bottom-up and top-down attention models as well as with the fluctuating relationship between cognition and eye movements (Spivey and Dale, 2011). When interacting with objects, eye movements have been observed either to support ongoing interaction and to be reactive to the stimulus (Ballard et al., 1992), or to predict upcoming interaction and to proactively attend to future regions of interest (Flanagan and Johansson, 2003). Although detailed studies are rare, the following evidence has been collected in studies of aiming tasks (Vickers, 2007) and daily activities (Land, 2006).

In their initial studies, Ballard et al. investigated eye-hand coordination and timing in object transfer tasks in which colored blocks were to be moved between source and model areas on the screen. They noticed two distinct types of gaze behavior: the eye movements were either continuously informing the user about block shape and color (reactive), or proactively directing the hand movements towards the goal area (Ballard et al., 1992) and preceding the upcoming action (Ballard et al., 1995).

Although one could complete this type of task just by focusing one's gaze on the central part of the screen, reactive eye movements enable the fast and economical execution of eye-hand coordination (Ballard et al., 1995, Land et al., 1999). In the daily task of tea making, Land et al. (1999) classified reactive fixations as directing (accompanying the hand movement towards the object), guiding (simultaneously coordinating and distributing attention across multiple objects), and checking (evaluating the final state of the object). Additionally, in well-rehearsed tasks, eye movements did not dwell on the final location. Instead, the eyes moved to a new destination 0.5 seconds before the object manipulation was completed. Land et al. (1999) explained this behavior with an information buffer in which the information necessary for completion was at the disposal of the motor system.

Proactive eye movements and gaze before the action have been of interest in aiming tasks. In the form of either locating fixations, establishing the position of the next object (Land et al., 1999), or look-ahead saccades, preceding and anticipating the next action (Pelz and Canosa, 2001), researchers observed that gaze indeed precedes the action. In studies of quiet eye fixations (Vickers, 1996), the last fixation prior to the action was in question. Williams et al. (2002) found that when aiming in archery, the quiet eye fixation is statistically longer than the preceding fixations. The effect was also correlated with expertise. Similar evidence was gathered in other sports, as summarized by Vickers (2007).

Indeed, eye movements play numerous roles in interaction with the outer world, and this makes the inference of cognitive processes tremendously difficult. In eye-hand interaction with objects, eye movements both guide and predict the actions (Ballard et al., 1992). In social interaction, gaze both provides cognition with information about the conversation and signals the partner's attentiveness (Oertel et al., 2016) and conversation turns (Jokinen et al., 2013). Observations of cognition from the latent relationship between eye movements and interaction are rarely easy (Bednarik, 2007, Jacob and Karn, 2003).

The studies reviewed in this section relied on carefully isolated contexts and well-controlled experimental tasks without the noise of daily interaction. These present both advantages and limitations to interpretation, and one should be cautious in generalizing these findings to complex situations. Interference in the form of task interruptions, along with unpredictable interaction with other objects and individuals, will inevitably increase the variance in gaze behavior. The next section will discuss how the methods employed in the studies embraced the high variance in eye-tracking data and sought spatial-temporal gaze patterns using machine learning algorithms.

2.3 PREDICTION OF INTERACTION FROM EYE MOVEMENTS

Interaction inference from eye movements examines how eye movements follow and predict interaction events, how gaze corresponds to various aspects of interaction, and how interaction impacts the user's eye movements and cognitive states. Since eye-trackers are better suited for computer-based studies, a wide range of experiments has involved human-computer interaction. In research on social interaction, head-mounted eye-trackers were used to accommodate dynamic head movements during a group conversation (Vrzakova et al., 2016). This section presents purposively selected studies to illustrate the domain in which machine learning methods are employed in interaction inference from eye movements. Since the studies vary in experimental task, context, and methods, we split them into two groups according to the manifestation of the observed interaction.

The first group of studies inspects internal user states that are observable only indirectly, using psycho-physiological sensors (such as an eye-tracker, EDA, or EEG) and questionnaires. Eye-tracking in these studies serves as a tool to access the user's visual attention and to infer mental states, for example, in decision-making in shopping situations. The second group of studies investigates eye movements manifested during human activity or user actions, here noted as external user states. The aim of these studies is to establish how eye movements reflect ongoing activities and how to inform pervasive and context-aware applications and interfaces.

Table 2.1: Example studies inferring internal and external user states from gaze.

  Topic         Inference              References
  Attention     Attentiveness          Yonetani et al. (2012)
                Disengagement          Ishii et al. (2013)
                Divided attention      Rodrigue et al. (2015)
                Mind wandering         Bixler and D'Mello (2016)
  User states   Text quality           Biedert et al. (2012a)
                Workload               Fritz et al. (2014), Wang et al. (2013)
                Learner modeling       Steichen et al. (2014), Toker and Conati (2014), Lallé et al. (2015)
                Familiarity            Kasprowski (2014)
                Curiosity              Hoppe et al. (2015)
  Activity      Office routines        Bulling et al. (2009)
                Implicit relevance     Vrochidis et al. (2011), Sugano et al. (2014)
                Text skimming          Biedert et al. (2012b)
                Implicit interaction   Kandemir and Kaski (2012)
                Daily activities       Steil and Bulling (2015)
                Interactive action     This Thesis

2.3.1 Predictions of internal user's states

Attentiveness and engagement detection

Following the Eye-Mind hypothesis, in which fixations determine cognitive processing during reading, research on inference from eye movements has diversified and aimed to classify various user states. An eye-tracker is a convenient tool for studying visual attention, for which researchers have analyzed eye movements both as a predictor of attentiveness and as a predictor of its absence. Disengagement, which can negatively impact the ongoing interaction and the user's performance, has been investigated as mind wandering in reading (Bixler and D'Mello, 2014b), a lack of engagement in group conversation (Ishii et al., 2013), or divided attention when the user is multitasking or facing external distractions (Rodrigue et al., 2015, Yonetani et al., 2012), respectively. Independently of the context, a proactive agent or an intelligent system should be capable of sensing the attentional fluctuations from the user's behavior and proactively probing the user to return their attention back to the ongoing interaction (Bixler and D'Mello, 2014b).

Although the studies have employed machine learning to predict user disengagement, both the methodologies and the recognition performances have varied. In detecting mind wandering during the reading of academic text, Bixler and D'Mello (2014b) searched for the best classifier out of 20 and achieved the best results with Bayesian classifiers within a page (accuracy = 59%) and at the end of the page (accuracy = 72%). When detecting decreased attention while watching TV commercials, Yonetani et al. (2012) used multi-mode saliency-dynamics modeling and distinguished with 80% accuracy whether a user reached states of low or high attentiveness. In predictions of a listener's disengagement in a conversation, Ishii et al. (2013) employed a Decision Tree and achieved an F1-measure of 0.7. To detect divided attention in reading, Rodrigue et al. (2015) extracted gaze-derived and EEG-based features related to attentional states and classified them using Support Vector Machines (SVMs). The gaze-based features predicted divided attention with about 70% average accuracy and reached more than 95% accuracy when combining EEG and eye-tracking features.

User state estimation

Following the potential of gaze-sensitive applications, prior research has analyzed a diverse range of user states, such as curiosity (Hoppe et al., 2015), image familiarity (Kasprowski, 2014), perceived text quality (Biedert et al., 2012a), cognitive workload (Fritz et al., 2014, Wang et al., 2013), and learner's states (Lallé et al., 2015, Steichen et al., 2014, Toker and Conati, 2014). As in studies of attentiveness, prediction methods and achieved results have varied.

To infer a user's cognitive workload during arithmetic, for example, Wang et al. (2013) applied a Boosting algorithm to Haar-like features derived from pupillary responses and achieved a recognition rate greater than 79%. Machine learning and feature extraction were employed to improve the stability of workload recognition against changing surrounding illumination. Likewise, Fritz et al. (2014) assessed a programmer's task workload in daily work; Naïve Bayes recognized workload levels from multimodal features derived from gaze, EDA, and EEG with 84% precision. The general motivation here was to detect increased task workload and proactively stop the programmer before an unnecessary mistake is introduced into the code.

In the context of interactive learning environments and information visualizations, Conati et al. carried out a series of experiments on proactive teaching assistance and learning process adaptation according to a predicted student's personal traits (Steichen et al., 2014), cognitive workload (Toker and Conati, 2014), or learning curve (Lallé et al., 2015). In these studies, the choice of machine learning algorithm was part of the optimization process. To automate the search for the best machine learning algorithm, WEKA (Hall et al., 2009) was employed to select the optimal candidate classifier from a set of predefined algorithms. Similarly to the previous sections, the selected classifiers differed greatly across studies: in Steichen et al. (2014) and Toker and Conati (2014), the selected classifier was Logistic Regression, which resulted in accuracy = 70% (visualization type) and accuracy = 64% (learning curve), respectively. In the Lallé et al. (2015) study, Random Forest tuned with 50 trees achieved an overall accuracy of 77% for expertise prediction and an overall accuracy of 75% for the prediction of learning speed.

2.3.2 Predictions of external user’s activity and actions

The previous section described the user's internal states as hidden to the external observer; the ground truth about ongoing cognitive processes is assessed using questionnaires and self-reports and is linked with eye-tracking data. In this section, user processes are studied in the context of an ongoing physical activity (e.g., walking, reading, web-browsing) and the resulting action (e.g., hitting a ball, clicking on a link), which is directly observable without the immediate need of an eye-tracker. Although the user's activity and actions are apparent, the internal preparation linked to the ongoing activity or the resulting action is concealed. Interaction inference in the following studies aims to access the user's eye movements during an activity and link them to various aspects and goals of the activity.

Activity recognition

Recognizing a user's activity from gaze has emerged from a line of studies conducted by Bulling et al. In their initial study, Bulling et al. (2009) sought patterns underlying the current activity. Using a wearable EOG-based eye-tracker and a trained SVM classifier, they examined five typical office activities, either at a desk (copying a text, reading a paper, taking notes) or on a computer screen (web-browsing and watching a video). The trained classifier distinguished the activities using fixation, saccade, blink, and wordbook-based features with 76% precision and 70% recall.

In a later study, Steil and Bulling (2015) extended the topic to long-term activity recognition from gaze and compared supervised and unsupervised methods of detection. In this case, they aimed to distinguish eight daily activities (outdoor walking, commuting, group conversation, eating, focused work, paper reading, work on a computer, watching a video) using an unsupervised LDA-based topic model and compared it against Naïve Bayes and SVM as supervised methods. On average, the best performance rates were received for activities requiring focused attention (reading: F1 = 74.75%, focused work: 70%, and work at the computer: 64.18%), while the other activities received an F1-score of 50%. Interestingly, the performance rates were sensitive not only to the type of activity but also to personal characteristics. For a single combination of activity and participant, the classifier could reach a high recognition rate (watching media: F1 = 93.83%), while another combination could deliver a low performance (social interaction: F1 = 7.58%).

Action prediction

Action prediction could be perceived as a subset of activity recognition. On the one hand, the result of the cognitive processing (the action) is visible without the need for additional sensors. On the other hand, the user's preparation and reasoning often remain hidden to a direct observer. Studies on immediate action prediction are rare and are specific to the predicted task and context. Generally, the studies are motivated by the hypothesis that the user's eye movements betray the intentions to act (Land et al., 1999, Mulvey and Heubner, 2012) and therefore could be employed to inform a user interface about the user's future actions.

A classical motivation stems from studies of gaze interaction and the Midas Touch effect. Early gaze-aware interfaces relied on threshold-based methods in which a user was required to reproduce a predefined gaze behavior to activate a button (e.g., dwell-time-based button activation). Due to the high variance in gaze behavior, however, these methods failed, since they could not differentiate correctly between non-interactive focused gazing (reading, for instance) and intentional interactive gazing. Jacob (1991) coined the term Midas Touch effect for the unintentional firing of actions in the user interface. Machine learning methods in this context could offer a powerful solution. Action prediction could learn to distinguish the gaze patterns related to intentions to act from other gaze behaviors related to other activities in the user interface.

Action prediction has rarely been explored out of context and has been an inherent part of studies of object relevance (Kandemir, 2010, Kandemir and Kaski, 2012), decision making (Sugano et al., 2014), familiarity (Kasprowski, 2014), and visual search (Vrochidis et al., 2011). In the study of relevance, for example, Kandemir (2010) aimed to predict the user's judgment about projected information boxes in a virtual lab tour. Logistic regression trained on a set of fixation- and saccade-based features distinguished object relevance with accuracy over 80%. Later, in a study of relevance in a simulated art gallery, Kandemir and Kaski (2012) extended the methodology and applied a Gaussian process to a set of features to recognize which painting was about to be selected as relevant. Although the methods were primarily targeted at estimating the relevance of visual stimuli, the task could be translated to action prediction, since the relevant paintings were selected by a manual click and the predictions were based on eye movements related to this event. Kandemir and Kaski (2012) proposed the concept of implicit gaze interaction, wherein the participant is not required to learn any explicit gaze patterns. Instead, a machine learning classifier would be trained to recognize gaze patterns from unrestricted gaze behavior with respect to the ongoing interaction.

In the context of aesthetics, Sugano et al. (2014) aimed to predict which image is about to be selected according to the user's personal preferences. From generated pairs of photographs selected from Flickr, participants were instructed to choose, within a 10-second time window, which picture they found more pleasing. Related gaze-based features and image descriptors were employed to characterize the result of decision-making. Interestingly, gaze-based features were better predictors of user action (Decision Tree, accuracy = 73%) than image-based descriptors and meta-information obtained from Flickr. To predict the outcomes of memory recall and decision-making, Kasprowski (2014) analyzed eye movements when users were asked whether a person displayed on the screen was familiar to them. Using WEKA and an ensemble of weak classifiers, the overall classifier performed with accuracy = 70%. Classification performance varied with personal characteristics, however. One participant received recognition rates of 88% accuracy (AUC = 0.92), while another's recognition rate was in line with the overall performance (accuracy = 73%, AUC = 0.7).

Implicit action prediction from unrestricted eye movements can serve not only as reinforced gaze interaction but also as a source for recommendation systems. As media databases become over-saturated with resources, users are overwhelmed and thus reluctant to provide explicit feedback. To overcome this challenge, Vrochidis et al. (2011) foresaw a gaze-based recommender system that would predict and recommend relevant videos to be selected. Applying an SVM-C classifier with an RBF kernel to gaze and behavioral features, the classifier could recognize users' past video selections and recommend upcoming videos with more than 85% accuracy on average.

In the previous sections, the studies were categorized into two groups (internal user states and external user activity) according to the human ability to observe the ongoing phenomena without the need of an eye-tracker. This categorization, however, should not be perceived as strictly binary. Predictions of human behavior and cognition always lie on the borderline between internal and external user states. For instance, recognition of fast skimming and slow reading behavior (Biedert et al., 2012b) intuitively belongs to the category of activity recognition; however, external observation of the reading activity is rarely possible without an eye-tracker. Similarly, the prediction of relevant paintings in the gallery closely follows the tradition of action prediction. However, the same preparatory eye movements could be linked to the decision making that preceded the resulting action. Thus, these predictions should belong to the category of internal user states.

Independent of the categorization, methodological recommendations do not converge. Although the eye-tracking studies using machine learning have aimed to infer different states of attention, interaction, and cognition, they have shared little similarity in data preparation and processing, and have varied in the chosen machine learning algorithms. The best practices and benchmarks are therefore hard to establish (P6). This thesis shares the same motivation for, and suffers from the same challenges of, interaction inference. In later chapters, we systematically investigate methodological aspects of the machine learning pipeline employed in interaction inference from eye movements.

2.4 SUMMARY

This section has attempted to provide a brief summary of the literature that supports the relationship between cognition and visual attention and the link between eye movements and interaction. In this work, we have adopted the Eye-Mind hypothesis and the supporting assumption that eye movements are an informative predictor of attention and of an upcoming action. The findings reviewed here suggest a pertinent link between cognition, eye movements, and interaction. In the following chapters, we investigate the extent to which machine-learning-based inference reveals aspects of interaction and the intention to act.

The research to date has been designed to determine whether eye movements reveal a link to human cognition. The generalizability of the majority of published research on this issue is challenging to assess, as the gathered studies have varied in experimental tasks, methods of data preprocessing and machine learning, and the range of performance achieved. The following chapter describes a systematic approach to unify the practice.


3 Interaction inference from eye movements using machine learning: Framework and design choices

Many decisions must be made when using machine learning to analyze eye-tracking data. Some are connected to the experiment hypothesis, while others are tightly linked to the employed machine learning algorithm. In this chapter, I summarize the decisions and methodological steps needed for interaction inference from eye movements, based partially on my own experience and partially on practice in other domains. Predictions in eye-tracking follow principles similar to those in speech recognition (for an overview of practices in speech recognition, see Kinnunen and Li (2010)), while the gaze-based features were adopted from psychological research on eye movements (Jacob and Karn, 2003). When I started this work in 2012, similar studies were rare; further studies started emerging after 2012. While machine learning methods have developed greatly since then, a standard rule of thumb for inferential problems in eye-tracking is still absent and is hard to infer from previous studies.

In the included articles, we systematically manipulated several methodological steps in the machine learning process and searched for the combination of data preprocessing (Section 3.1), feature engineering (Section 3.2), and machine learning parameters (Section 3.3) best suited to interaction inference. This chapter and the enclosed papers aim to fully disclose the methodology, as we believe that only in-depth reports allow for research repeatability and the advancement of the field.
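To give a rough sense of the size of this design space, the sketch below enumerates one hypothetical grid of choices along these three axes; the concrete values are illustrative placeholders rather than the exact configurations evaluated in the included papers.

# A hypothetical grid of design choices; the values are placeholders,
# not the exact configurations used in P1-P3.
from itertools import product

window_sizes = [2, 3, 5, 9]      # sequence length in fixations (Section 3.1)
shifts       = [0, 1, 2, 3]      # fixations between sequence end and the action (Section 3.1)
feature_sets = ["fixation", "saccade", "pupil", "combined"]  # Section 3.2
svm_costs    = [0.1, 1, 10, 100] # example classifier hyperparameter (Section 3.3)

grid = list(product(window_sizes, shifts, feature_sets, svm_costs))
print(f"{len(grid)} pipeline configurations to evaluate, e.g., with cross-validation")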

3.1 PREPROCESSING OF EYE-TRACKING DATA

When analyzing eye movements in complex and unconstrained interaction, the ground truth about cognitive processes is often limited. Contrary to well-controlled experiments, where the tasks are designed to induce a particular cognitive process (Salvucci and Anderson, 1998), the temporal boundaries of cognition, its duration, and its manifestation in oculomotor activity become harder to infer in out-of-lab interaction. Similar to the studies introduced in Chapter 2, we hypothesize that cognitive processing is observable from eye movements. In the eye-tracking analysis and interaction inference, we systematically experiment with these characteristics using data manipulations, as has been done in this thesis.

During our work, we made three contributions related to the segmentation of eye-tracking data. We developed a method for data sequencing based on fixations, manipulated the sequence lengths, and experimented with sequence timing in relation to the observed interaction. Computational modeling of cognition through eye movements and machine learning offers tools for extending our understanding of cognitive processing and its link to eye movements.

3.1.1 Data sequencing

When analyzing continuous data streams, researchers have adopted practices from speech recognition and segmented eye-tracking data into preset time windows. The stream of eye-tracking data is split into segments of a specific size, usually expressed in seconds. For example, in their study of mind wandering, Bixler and D'Mello (2014b) systematically experimented with segment sizes of four, eight, and 12 seconds prior to an interaction event to find the optimal settings. Though simple, temporal segmentation suffers from half-cut fixations and saccades at the segment boundaries, and one has to decide whether to include them in or exclude them from the studied segment.

Figure 3.1: Fixation-based segmentation. The data stream is segmented into sequences with a one-fixation overlap. The first sequence consists of all data points from fixations t1, t2, and t3; the second sequence consists of data from fixations t2, t3, and t4, etc. Adopted from P1.

In our pioneering work (P1), we aimed to compensate for half-cut eye movements and adopted the idea of Klingner (2010), where the unit of analysis is a window of consecutive fixations rather than a window of fixed time length. Our hypothesis was that the cognitive processes reflected in eye movements are delimited by fixations and would not change within a single fixation. In our segmentation process, the eye-tracking data were segmented so that each sequence consisted of all gaze points within n consecutive fixations and n-1 saccades, as illustrated in Figure 3.1.

The fixation-based approach has several advantages over temporal segmentation. First, we do not have to apply rules for cut-in-half fixations and saccades, since each sequence consists of completed eye movements. In commercial eye-tracking systems, data often come out prefiltered into fixations and saccades. Consequently, the segments are easier to process in feature extraction (Section 3.2), since the data are already cleaned of partially split eye movements. This approach also allows for a more precise selection of fixations, which we used later in our experiments with sequence length and timing. Additionally, fixation-based segmentation reflects personal differences (e.g., age (Rayner et al., 2006), sex (Bargary et al., 2017b), or culture (Rayner et al., 2007)) that result in varying fixation durations.

An important concern related to data segmentation and windowing is when exactly the sequence should start and how long it should be. In our case of action prediction, we hypothesized that the sequence can encapsulate preparatory eye movements related to interaction, and thus that the sequence characteristics (features) would reflect this assumption in every sequence closer to the action. As we aimed to mimic real-time classification for action prediction, we implemented segmentation with a one-fixation overlap, as illustrated in Figure 3.1.
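The sketch below illustrates this fixation-based sequencing under simplified assumptions: raw gaze points are stored in a list, and each fixation is reduced to a (start, end) pair of sample indices. The function name and data layout are hypothetical, not the implementation used in P1.

# A minimal sketch of fixation-based sequencing, sliding forward one
# fixation at a time. Each fixation is a hypothetical (start, end) pair of
# indices into the list of raw gaze samples.
def fixation_sequences(samples, fixations, n=3):
    """Yield the gaze samples covered by n consecutive fixations and the
    n-1 saccades between them, advancing by one fixation per sequence."""
    for i in range(len(fixations) - n + 1):
        start, _ = fixations[i]
        _, end = fixations[i + n - 1]
        yield samples[start:end + 1]

# Example with dummy data: five fixations yield three sequences when n = 3.
samples = list(range(100))
fixations = [(0, 9), (15, 24), (30, 44), (50, 59), (70, 84)]
print(len(list(fixation_sequences(samples, fixations, n=3))))  # -> 3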


3.1.2 Sequence length

When the unit of analysis is set, a consequent question concerns the size of the sequence. Since sequences consisting of a constant number of fixations do not implicitly answer this question, we experimented with the sequence length, which defines the size of the segmented window in number of fixations.

Due to the lack of ground truth about the link between eye movements and interaction intentions, we experimented with different setups. Our hypothesis was that if the segmented window were too long, performance would decrease due to additional cognitive processes reflected in the eye movements. On the other hand, if the segmented window were too short, there would not be enough information captured in the eye movements and the classification rates would drop.

In Paper P1, we set the segment length to three fixations, which approximately reflected the duration of the pupillary response related to a physical click (Richer and Beatty, 1985). Later, in P3, we extended the experiments to window sizes of two, five, and nine fixations, since we aimed to investigate whether the additional information encoded in the fixations would increase prediction performance. Figure 3.2 summarizes our experiments with segment length and timing in connection to each publication.

3.1.3 Sequence timing and synchronization

In P1, we hypothesized that intentions to interact should be reflected before the action click. In the following setups, we experimented with how far prior to the action we could infer the upcoming action. The opposing hypothesis was based on prior research on movement-related pupillary responses (Richer and Beatty, 1985), suggesting that eye movements reflect intentions with an observable delay of up to 1.5 seconds and that interaction can therefore be inferred postmortem.

In our first step, we segmented the data streams so that the timing was tight to the action click; the sequence consisted of data points ranging from one fixation prior to the action to one fixation immediately after it. The motivation for this timing was to contrast eye movements during free viewing with eye movements close to the action (P1).

In the next step, we studied how interaction is linked to timing both before and after the action (P2). We isolated sequences strictly before the action and strictly after the action. The former condition was suitable for action prediction and followed the hypothesis that eye movements are proactive (Flanagan and Johansson, 2003); the latter condition was meant for activity recognition scenarios, similar to Bulling and Roggen (2011).

In P3, we were interested in how far prior to the event it is possible to estimate an upcoming action. Since the classifier needs computational time to make a decision and to inform the interface, and the interface in turn needs time to invoke an appropriate action, we aimed to buy the classifier time and experimented with timing further from the action click. The final setup consisted of sequences shifted one, two, and three fixations prior to the action click. Figure 3.2 demonstrates the fixations and their relation to the event. In sum, we systematically experimented with the sequence length and timing in relation to the interactive action to see how different settings impact prediction performance.
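A minimal sketch of how such shifted sequences might be cut is given below. It assumes a hypothetical representation in which the action click is tied to the index of the fixation during which it occurred; the function name, data layout, and values are illustrative, not those used in P3.

# A minimal sketch of selecting an action-related sequence of n fixations
# that ends `shift` fixations before the fixation containing the click.
# Fixations are hypothetical (start, end) pairs of gaze-sample indices.
def sequence_before_action(samples, fixations, click_fix, n=3, shift=1):
    """Return the gaze samples of n consecutive fixations ending `shift`
    fixations before the action click, or None if not enough data."""
    last = click_fix - shift      # index of the last fixation in the sequence
    first = last - n + 1          # index of the first fixation in the sequence
    if first < 0:
        return None
    start, _ = fixations[first]
    _, end = fixations[last]
    return samples[start:end + 1]

# Example with dummy data: five fixations, click during fixation index 4.
samples = list(range(100))
fixations = [(0, 9), (15, 24), (30, 44), (50, 59), (70, 84)]
for shift in (1, 2, 3):
    seq = sequence_before_action(samples, fixations, click_fix=4, n=2, shift=shift)
    print(shift, None if seq is None else len(seq))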


Figure 3.2: Overview of the segmentation setups tested in P1, P2, and P3. Fixations in red represent the fixation during the action click; fixations in orange and blue represent fixations classified in the positive class (action-related) and in the negative class (free viewing), respectively.

3.2 FEATURE ENGINEERING AND SELECTION

In an ideal world of zero data latencies and benchmarked labels of cognitive processes, this section would not be needed at all. The opposite is true, however, and feature engineering from eye movements is a crucial part of every eye-tracking study.

In our case, eye movements in the context of interaction should signal differences at a fine level of data granularity: intentions to act would preferably be recognized from a single fixation, and relevant objects would be tagged from a single gaze.

However, isolated eye movements are rarely sufficiently descriptive. More often, they are biased by noise introduced by the sampling frequency, changing task conditions, level of illumination, and personal characteristics. Generally, a large variance in the data (a typical property of biometric signals) makes analysis based on a single fixation difficult (Jacob and Karn, 2003). Thus, fixation characteristics linked to a phenomenon in one domain may not necessarily reflect the same phenomenon in another domain.

For example, thresholds for fixation duration established in the reading of natural text differed fundamentally from those observed when reading source code (Busjahn et al., 2011).

For this reason, the data analysis works over data segments (as described in Section 3.1.1) encoded into gaze-derived metrics (gaze features). Computed over data segments (the unit of analysis), gaze features have the ability to capture nuances in short-term gaze behavior that are hidden in long-term averages. The fundamental question is how to capture meaningful information in a noisy gaze signal and which features contribute well to interaction inference. This section covers feature engineering, that is, constructing gaze features from data segments, and the feature selection methods applied in eye-tracking studies.
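As an illustration of this step, the sketch below computes a handful of commonly reported descriptive features over a single segment; the particular features, units, and names are illustrative and do not enumerate the full feature sets used in the included papers.

# A minimal sketch of feature extraction over one segment. The segment is
# assumed to provide lists of fixation durations (ms), saccade amplitudes
# (deg), and pupil sizes (mm); this feature selection is illustrative only.
from statistics import mean, pstdev

def segment_features(fix_durations, sacc_amplitudes, pupil_sizes):
    """Compute simple descriptive gaze features for a single segment."""
    return {
        "fix_count": len(fix_durations),
        "fix_dur_mean": mean(fix_durations),
        "fix_dur_sd": pstdev(fix_durations),
        "sacc_amp_mean": mean(sacc_amplitudes),
        "sacc_amp_max": max(sacc_amplitudes),
        "pupil_mean": mean(pupil_sizes),
        "pupil_sd": pstdev(pupil_sizes),
    }

# Example with dummy values for a three-fixation segment.
features = segment_features(
    fix_durations=[220, 185, 305],
    sacc_amplitudes=[2.4, 5.1],
    pupil_sizes=[3.2, 3.3, 3.1, 3.4],
)
print(features)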
