
Gamified crowdsourcing model for studying cognitive biases in speaker comparison

Sandip Ghimire

Master’s Thesis

Faculty of Science and Forestry, School of Computing

November 2021


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing

Master’s Thesis

Student: Sandip Ghimire

Title of the Thesis: Gamified crowdsourcing model for studying cognitive biases in speaker comparison

Supervisor: Professor Tomi Kinnunen

November 2021

Abstract:

Speaker recognition by listeners is interesting in its own right and has implications for fields such as biometric user authentication and forensics, where critical decisions may be made on the basis of voice similarity as judged by human perception.

However, it is known that human decisions are subject to various cognitive biases.

The author addresses some well-known cognitive biases, including the framing and overconfidence effects, in a speaker comparison task based on human voice perception.

To this end, the author applied the concept of gamification to design a framework for empirical data collection. It allows participants to play a game in which they listen to pairs of speech samples and decide whether the speakers are the same or different. To determine the difficulty level of the game, a speaker similarity score generated by an automatic speaker recognition system is used. The experiment was conducted with 3 disjoint panels, each consisting of 10 participants. Each panel was given a different set of instructions, but an otherwise identical set-up.

The results indicate that 80% of the participants were overconfident in their decisions on hard tasks, whereas 60% were underconfident on easy tasks. Furthermore, the results suggest that the framing of the instructions can substantially affect the listeners' decision choices, consequently shifting the miss and false alarm rates. The outcome of this experiment has potential implications for fields where human speaker recognition is studied or applied.

Keywords: Speaker recognition, voice perception, cognitive bias, gamification, crowdsourcing


Foreword

I am grateful to the University of Eastern Finland for providing me the opportunity to study here. I appreciate the effort and dedication of all the teachers and faculty members in providing us with a quality education.

I want to express my deepest gratitude to my supervisor, Professor Tomi Kinnunen, for his guidance throughout this research work. I would like to extend my special thanks to Ville Vestman for producing the speaker similarity scores used in our empirical framework. I would also like to thank Rosa Gonzalez and Radu Mariescu-Istodor for their valuable suggestions and advice during the framework development.

Last but not least, I want to thank my family for their everlasting support, encouragement and blessings.


List of Abbreviations

ACM   Association for Computing Machinery
UEF   University of Eastern Finland
ASV   Automatic Speaker Verification
DNN   Deep Neural Network
DL    Deep Learning
MFCC  Mel-Frequency Cepstral Coefficient
PLDA  Probabilistic Linear Discriminant Analysis
DCT   Discrete Cosine Transform
MDA   Mechanics, Dynamics, Aesthetics


Contents

1 Introduction
  1.1 Related works
  1.2 Summary of the proposed approach
  1.3 Hypothesis
2 Background
  2.1 Speaker recognition system
  2.2 Gamification of perceptual tests
3 Heuristics and cognitive biases in judgement and decision-making
  3.1 Overconfidence Effect
  3.2 Hard-Easy Effect
  3.3 Framing Effect
4 Gamified model
  4.1 Game controls
  4.2 Difficulty level
  4.3 Storage and Architecture
  4.4 Game data
5 Experiment and Results
  5.1 Experimental setup
    5.1.1 Game Logic
    5.1.2 The three different framing effects
    5.1.3 Measures of performance
  5.2 Outline of the experiments
  5.3 Results
    5.3.1 Overconfidence Effect
    5.3.2 Framing Effect
  5.4 Distribution of data
  5.5 Discussion
6 Conclusion
References
A Appendix 1
B Appendix 2


1 Introduction

Speaker recognition refers to the process of identifying a person based on his/her voice characteristics [48]. This process plays a crucial role in fields such as authentication, surveillance and forensics. Speaker recognition can also be combined with other authentication techniques, such as face recognition, to provide secure authentication in a number of commercial applications. In most applications, this process is automated using machines and computer algorithms. Automatic speaker recognition systems refer to machines equipped with algorithms for the identification and verification of speakers [57]. This is achieved by analyzing voice characteristics and comparing them with the samples of hypothesized speakers in a speaker database, using signal processing and machine learning techniques. Though these automatic systems perform the task effectively on a commercial level, some domains, such as forensics, still rely on human judgement. Forensics uses speaker recognition techniques to compare voice samples of a subject against those of a claimed person, alongside other evidence such as telephone conversations or recordings [57]. In particular, forensic voice comparison requires critical listening, frequency/waveform analysis and voice biometric testing by audio forensic experts. In such scenarios, where human perception is involved in comparing voices, a number of cognitive factors may affect the decision-making process. Such applications of speaker recognition could benefit from considering these cognitive factors during the voice comparison process.

In the real world, speech perception by human subjects is a complex phenomenon impacted by a number of cognitive biases. Our brain continuously processes incoming sensory information and categorizes it based on experience and memory built over time. Human perception thus involves balancing sensory stimuli against stored representations in memory. Biased perception occurs whenever the mental speculation inaccurately matches the incoming stimuli [53]. A cognitive bias refers to a systematic, unconscious error in thinking that occurs while processing and interpreting information and affects the decision-making and judgement process [18][22]. Hence, the decision is directed by the mental speculation or the stored representation in memory, which might not be rational in some cases.

Cognitive biases are thoroughly studied in psychology and economics to understand human decision making. However, research on cognitive biases in speaker comparison appears more limited. Though some studies have investigated bias in human perception of familiar voices [53], studies investigating human perception of unfamiliar voices are scarce. Studies have suggested that the neurophysiological mechanisms for recognizing familiar and unknown speakers differ from each other [27]. The author aims to bridge part of this gap by empirically studying how cognitive bias affects speaker comparison by listeners. In this thesis, the author focuses on cognitive bias for unknown speakers, studied under two types of effects in speech perception: 1) the overconfidence effect and 2) the framing effect.

In addition to pursuing the relatively new direction of studying cognitive bias in speaker recognition for unknown speakers, the author has implemented a novel approach to data collection for the speaker comparison task in the area of cognitive bias (i.e. framing and overconfidence) research: gamification. Gamification refers to the concept of using game designs and game elements in non-gaming contexts in order to increase user engagement and motivation to perform a given task [9]. Though the concept of gamification emerged by the late 20th century, the term 'gamification' itself was claimed to be coined in 2002 by Nick Pelling, who designed a game-like user interface for commercial electronic devices [46]. Over the years, the concept of gamification has been implemented in research, crowdsourcing, data collection and many other contexts. However, in the context of studying cognitive bias in speaker recognition with human subjects, this approach is quite new and has a lot of potential, because the game can easily be designed around the research questions. Gamified environments have been predicted to encourage and motivate goal-directed behaviour, and empirically, the majority of studies show positive effects of gamification on motivation [52][16][5]. The author implemented a gamified model as an empirical data collection method and used it to study two types of cognitive bias in voice perception for a speaker recognition task.

1.1 Related works

This research work combines three different fields, as shown in Figure 1. A systematic review of the related work on these topics is presented below.


Figure 1: The three-fold aspects of this research work.

In typical studies on cognitive biases such as overconfidence, participants are asked to solve a number of binary choice quizzes, i.e. quizzes with two possible answers. For each answer, participants are then asked to indicate how confident they are, in the form of a percentage or probability. For example, a quiz question like 'Is Peru greater than Mexico in size?' has two options to choose from, 'Yes' and 'No'. The participants choose one option and indicate their confidence in the choice, from 0% sure to 100% sure. The overconfidence effect is said to occur when the confidence ratings exceed the percentage of correct responses [33]. For example, if a participant got 6 answers correct out of 10 questions but the average confidence rating in his/her answers is 80%, then he/she is overconfident. An early and remarkable study in this direction is [33]. In that study, the participants were presented with drawings and asked to decide whether the artist was Asian or European. Along with their response, the participants were asked to indicate the probability of their answer being correct. The outcome of the experiment [33] indicated that the participants got the right answer only 53% of the time, while their average stated confidence for the correct answers was nearly 68%. This indicates a discrepancy between actual performance and confidence: the participants were overconfident, believing they performed better than they actually did.

In another experiment, the authors of [33] asked participants to answer 150 general knowledge questions and indicate the probability of each answer being correct. In the analysis, the questions were divided into hard and easy categories. For easy questions, a discrepancy was found between the number of correct answers and the stated probability of being right: the participants demonstrated underconfidence on the easy task, indicating a 60% probability of being correct when 75% got the answers right [33]. This variation of confidence (overconfidence and underconfidence) caused by the difficulty of the task was termed the hard-easy effect.

Another well-known cognitive bias in psychology and economics is the framing effect, which the author studies in the context of speaker comparison by listeners. In 1981, Tversky and Kahneman studied how phrasing the same information in different ways influenced the responses to a hypothetical situation [60], which became the basis for later studies of framing effects. In their study, the participants were asked to choose between two options for the treatment of 600 people affected by a fatal disease. In the first option, 400 people would die. If the participants chose the second option, there was a 66% chance that everyone would die and a 33% chance that no one would die. The participants were presented with these options in two different framing scenarios: a negative framing (describing how many would die) and a positive framing (describing how many would survive). It was observed that 72% of participants selected the first option when it was framed positively, and only 22% selected the same option under negative framing of the same scenario.

In contrast to the traditional methodology of framing studies, a recent work [14] adapted the relatively new crowdsourcing approach for data collection to study the miss trade-off in perceptual speaker comparison. That work inspired the study of framing effects on voice perception in a gamified framework here. The experiment in [14] was conducted through Amazon's Mechanical Turk (AMT) crowdsourcing platform to study whether framing could alter the listeners' decisions so as to provoke more accept or reject decisions. There were four scenarios, which the authors called 'neutral', 'forensic', 'user-convenient' … Each panel was presented with the same set of trials, each involving speaker comparison in a pair of speech samples. The results indicated that the task framing could influence the listeners' decisions.

A study somewhat close to this work in its gamification approach is the NameThatLanguage (NTL) game [6]. It was developed in 2018 to study judgements of the language spoken in short audio clips from telephone conversations and broadcast audio. Its purpose is to contribute a resulting corpus with potential use in language recognition and confusability research. In the game, players listen to 10-second audio clips and indicate which language they believe is spoken. The player can play and pause audio clips using buttons. Once the audio clip plays to the end, the player is asked to respond with a decision and press next to continue. The player receives points for a correct answer and loses one of 3 'lives' for each wrong answer. As of March 22, 2021 the game had presented 720339 HITs to players, who returned usable responses 86% of the time (621420). The number of HITs is the number of requests made to the web server to serve a specific page. A response is judged unusable by an algorithm if it lacks a specific decision on the language spoken, for example when a new game is started or the player logged out before submitting a guess. It was found that, using simple aggregation, players' responses identify spoken languages with high accuracy and signal problematic clips when they do not converge on a single language [6].

1.2 Summary of the proposed approach

In this thesis, the author addresses the hard-easy effect on the overall confidence of the participants and the effect of framing on decision making in speaker comparison by human subjects. To study these cognitive biases, the author has designed a gamified speaker comparison framework for conducting online experiments. It is a game in which the participants listen to two utterances at a time and decide whether the voices are from the same or different speakers (forced binary choice). In addition, the subjects are asked to indicate their confidence (in percent) in their choice. Importantly, even though the game itself is played by humans, an automatic speaker recognition system is used to pre-determine the difficulty level of the task. Using the developed game framework, the author then studies whether the participants are underconfident or overconfident in their decisions, and how the difficulty level affects their confidence. Beyond that, the goal is also to study the effects of framing the game instructions differently on the decision choices of the participants.


1.3 Hypothesis

The author puts forward two hypotheses regarding cognitive bias in a human speaker recognition task:

H1: People are more likely to overestimate their performance on hard tasks and underestimate it on easy tasks. The author hypothesizes that, when making decisions in a speaker comparison task, people generally show overconfidence and overestimate their capacity to recognize the speaker correctly when the task is very difficult, whereas they underestimate their capacity for an easy speaker comparison task.

H2: The framing effect influences the decision making of participants. The author hypothesizes that the decision choices of the listeners can be influenced by framing the instructions differently in a speaker comparison task. The game instructions have the potential to steer the listeners' decisions in a certain direction. For example, the instructions can be framed so as to provoke the listener to choose one of the binary options more often than the other.

2 Background

2.1 Speaker recognition system

An automatic speaker recognition system extracts information, or features, from speech recordings in order to identify and distinguish speakers. Typically, such a system consists of two major stages: enrollment and recognition.

At the enrollment stage, features are extracted from a provided speech sample of a known person to form a unique speaker model [38]. At the recognition stage, the utterance from an unknown speaker is compared against the models in the system database to produce a similarity score. Finally, the decision module makes a binary (same-different) decision based on the similarity score [24].

Automatic speaker recognition systems have been widely used in security, forensics, biometrics and authentication [17]. Due to this wide range of applications, a lot of research has been done and many new methods have been developed over the decades.

Automatic speaker recognition methods continuously improve and evolve using state-of-the-art technologies. A number of models have been applied, making the best use of machine learning, which forms the foundation of speaker recognition technologies. Machine learning refers to algorithms that build models based on sample data, known as 'training data'; these models are then used to make automatic predictions or decisions without the need for explicit programming [25]. Among many different models, probabilistic linear discriminant analysis (PLDA) provided remarkable performance for several years [4]. Recently, with the rise of deep learning (DL), many deep-learning-based speaker recognition methods have been proposed [56][29][62]. With the powerful feature extraction capabilities of deep neural networks (DNNs), deep-learning-based systems have boosted speaker recognition performance to a new level [4][37][40].

The speaker recognition system adopted for this work uses deep neural networks for utterance-level feature extraction, based on a so-called x-vector [54] neural network system. First, frame-level acoustic features are extracted: any given speech file, whether enrollment or test, is typically divided into 20-25 ms long frames with 10-15 ms overlap across consecutive frames. From each frame of audio, one feature vector is extracted. The particular system used in this work uses a total of 60 mel-frequency cepstral coefficients (MFCCs) as features [8].

Finally, the neural network takes the MFCCs as input and produces speaker probabilities as output. A fixed-size vector extracted from the layer preceding the output layer, known as a speaker embedding, serves as the utterance-level feature. The speaker embeddings of the enrollment and test utterances are then compared with each other to produce a real-valued similarity score. For this part, so-called probabilistic linear discriminant analysis is used [49].
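To make the front-end concrete, the sketch below extracts frame-level MFCCs along the lines described above. The use of librosa and the exact parameter values (16 kHz audio, 25 ms frames, 10 ms shift) are illustrative assumptions; the thesis does not name its feature extraction toolkit.

```python
import librosa

def extract_mfccs(path, sr=16000, frame_ms=25, shift_ms=10, n_mfcc=60):
    # Minimal sketch of the frame-level front-end described above.
    # librosa and these parameter values are assumptions for illustration.
    y, sr = librosa.load(path, sr=sr)
    mfccs = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * frame_ms / 1000),       # 25 ms analysis frame
        hop_length=int(sr * shift_ms / 1000),  # 10 ms shift -> 15 ms overlap
    )
    return mfccs.T  # one 60-dimensional feature vector per frame
```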


Figure 2: High-level architecture of an automatic speaker recognition system. Adapted from [41].

2.2 Gamification of perceptual tests

Gamification [9] refers to the application of game-inspired design, or game elements, in a non-gaming context. For instance, concepts such as points, scoring and difficulty levels can be used to measure the achievement of the user. In contrast to traditional 'passive' tasks, gamification enables goal-oriented behaviour and has the potential to substantially increase user engagement and motivation for performing the task [52][16][5]. Gamification has been studied in a number of contexts such as crowdsourcing, data collection, health, marketing and social networks.

Crowdsourcing is one possible context in which gamification can be integrated effectively for empirical data collection. Crowdsourcing refers to an outsourcing approach that issues tasks as an open call to a large group of people [20]. Knowledge seekers, those who want to collect data, post the requirements of their task on a platform where problem solvers participate to complete it [34]. Problem solvers are rewarded if they fulfill all the requirements of the task. In traditional methods of data collection, such as laboratory-based experiments, the number of participants was usually limited and it was challenging to include participants from a demographically and culturally diverse population. With crowdsourcing methods such as web-based experiments, not only can a large number of diverse participants be included, but the cost of equipment and space can also be saved. A carefully designed gamified model can be integrated into web-based experiments to gain the benefits of both worlds [7].

Design and aesthetics play an important role when implementing a gamified model. The game design can have a huge impact on the motivation and overall experience of the player. A typical and well-established framework for game design is MDA (Mechanics, Dynamics, Aesthetics) [21]. It breaks a game down into these three distinct components: mechanics describes the components of the game, dynamics describes the run-time behavior, and aesthetics describes the emotional responses evoked in players [21].

These fundamental components encompass game design elements such as rewards, stories, feedback and progress, which are the basic building blocks of any game. The game design elements may vary depending on the purpose and domain in which the game design is used. In Table 1, the author summarizes some applications of gamification in different domains and their respective game design elements. Three unrelated domains were selected to compare their approaches to game design elements. The common element across all domains is a points or rewards system, arguably the most fundamental element of a gamified design. Other elements vary based on the individual purpose and needs of the game. In the empirical framework used in this work, the author has used some of these game design elements, namely points, levels and stories.

1. Points are the basic elements of game design and gamified environments. They are the reward provided for successfully completing a task or activity in a gamified environment [64]. Points can also be used to give feedback or to numerically represent the player's progress in the game [52]. Points serve as a measurable performance indicator, instantly showing how the user performs in the game.

2. Levels symbolize the completion of certain steps in the game and represent achievement in a cumulative way. Levels also show how close the player is to the goal, which may help to encourage or motivate the players, and they may represent a virtual achievement that boosts the player's confidence.

3. Stories give the narrative context and instructions for conveying information and directions to the user. They can be communicated through simple instructions or the storylines of contemporary role-playing [23]. Narrative contexts can be designed to simulate a real-world context or to relay the idea of non-gaming scenarios in a gaming context, which might excite players and motivate them during boring tasks [50].

It is important to take the player's motivation into consideration when designing a gamified model. Motivations can be divided into two groups: intrinsic and extrinsic [51]. Intrinsic motivation is internal and reflects genuine interest in, or enjoyment of, an action for its own sake, reflecting the natural human ability to learn and assimilate. Extrinsic motivation is driven mostly by factors external to the agent, seeking some kind of reward, e.g. money, praise or a certificate [43]. The reward system in the proposed framework, as well as progress in terms of completed levels, is expected to induce extrinsic motivation in the player. Intrinsic motivation depends on various factors such as the player's interest or emotional state; however, the instant symbolic and audio feedback about the player's performance is expected to contribute to it.


Table 1: Summary of some gamification approaches in various domains and their game design elements

Study | Domain | Purpose | Game elements
Hassan et al. (2019) [19] | E-learning | Provide a learning experience based on individual students' learning style. | Points, badges, feedback, challenges
Amriani et al. (2013) [3] | E-learning | Study the impact of gamification on an e-learning environment. | Points, badge, leaderboard, title, completion track
Ahtinen et al. (2013) [1] | Health and wellness | Usefulness of a mobile application for stress management and mental wellness. | Reward (virtual rose), progress
Kuramoto et al. (2013) [26] | Health and wellness | Motivate standing commuters in crowded public transportation in Japan. | Points, level, avatar
Cieri et al. (2021) [6] | Data collection (research) | Study judgements for language recognition and confusability. | Points, lives, level, feedback
This work | Data collection (research) | Study cognitive bias in voice perception for human speaker recognition. | Points, level, stories, feedback


3 Heuristics and cognitive biases in judgement and decision-making

A heuristic is a mental shortcut that allows people to solve problems and make judgements quickly and efficiently, for example an educated guess or a rule of thumb [22]. Even though heuristics are helpful in many cases, they can lead to cognitive biases. Given limited time and information, the human mind makes decisions subject to cognitive limitations, and people sometimes use heuristics as a lazy approach to reduce the mental effort of making a choice or a decision [2].

In the 1950s, cognitive psychologist and economist Herbert Simon introduced the concept of heuristics. He suggested that, as much as people try to make rational choices, human judgements are subject to cognitive limitations. Rational decisions would require evaluating all the alternatives for their possible benefits and costs. However, people are bound by many factors, such as limited time, information, intelligence and experience, which influence the rational decision-making process. In the 1970s, psychologists Amos Tversky and Daniel Kahneman added their research on cognitive biases to this field, proposing that these cognitive biases influence the judgement and decision-making process.

In the following three sub-sections, the author discusses cognitive biases relevant to the present work.

3.1 Overconfidence Effect

Overconfidence is defined as the cognitive bias in which a person's subjective confidence in his/her own ability exceeds their objective (actual) performance [44]. In psychological research, it can be considered a miscalibration, relying on the idea that people tend to overestimate the precision of their knowledge. Overconfidence can also be attributed to other factors such as overplacement or overprecision [39]. Overplacement, also known as the 'better-than-average' effect, occurs when people believe that they are better than others. Overprecision is defined as excessive certainty and faith that one knows the truth [39]. However, most research on overconfidence is based on overestimation, and the author focuses on overestimation in this work.


It has been assumed that overestimation is caused by wishful thinking, especially when people have optimistic forecasts for their future: people tend to overestimate the likelihood of outcomes they wish or desire to be true [39].

For example, if a student who took a 10-item quiz believes that he answered 5 questions correctly, when in fact he answered only 3 correctly, then he has overestimated his score.

Overconfidence carries important consequences. For instance, overconfidence could lead a student to make poor study decisions, which might ultimately impede his/her learning [10]. Furthermore, one person's overconfidence can cause substantial consequences for other people. For example, people rely on doctors' and lawyers' advice to make important health and financial decisions [11], which may have major consequences in a person's life if the doctor or lawyer is overconfident about their job-related knowledge and skills [35, 58]. Biased or false beliefs may also lead to bad decisions and consequences. For instance, overconfident athletes, students and contestants may fail to prepare enough for a test or exam and therefore perform worse than they otherwise would have [61]. People who believe that they are strong and invulnerable can take actions that may have a major impact on their life [36].

Overconfidence can be measured quantitatively through an empirical setup. Generally, overconfidence is assessed through a calibration measure defined as the difference between the mean confidence rating and the mean accuracy (percentage of correct answers):

$$\text{Bias score} = E[X_i] - x_i$$

where $E[X_i]$ is individual $i$'s estimation of his/her expected performance in a particular test, and $x_i$ is the measure of his/her actual accuracy [47]. For example, suppose a person estimated the chance of his/her answer being correct as 60% four times and as 80% six times out of 10 trials, and got 8 answers correct. Then his/her actual accuracy is 0.8, and the mean probability of his/her estimation is

$$\frac{1}{10}\left[(0.6 \times 4) + (0.8 \times 6)\right] = 0.72.$$

Now, the bias score is calculated as

$$0.72 - 0.8 = -0.08,$$


which shows that he/she is underconfident. A bias score > 0 means the respondent is overconfident, whereas a bias score < 0 means underconfident.
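As a concrete illustration of this calibration measure, the following minimal Python sketch computes the bias score from per-trial confidences and correctness flags (the function name and data layout are assumptions for illustration):

```python
def bias_score(confidences, correct):
    # Bias score = mean stated confidence minus proportion correct.
    # > 0 indicates overconfidence, < 0 underconfidence.
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy

# Worked example from the text: 0.6 stated four times, 0.8 six times, 8/10 correct.
print(round(bias_score([0.6] * 4 + [0.8] * 6, [1] * 8 + [0] * 2), 2))  # -0.08
```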

Alternatively, the overall tendency of a judge to be over- or underconfident can be measured by

$$\text{Over/underconfidence} = \frac{1}{N}\sum_{t=1}^{T} n_t\,(r_t - c_t)$$

where $N$ is the total number of responses, $n_t$ is the number of times the probability response $r_t$ was used, $c_t$ is the proportion correct among the items assigned probability $r_t$, and $T$ is the total number of probability response categories used [33].

For example, suppose there are 10 possible probability categories (0.1, 0.2, …, 1.0) from which to select the expected probability of a correct answer. As in the previous example, if a person chooses 0.6 four times and 0.8 six times, and gets 8 answers correct in total (3 correct where he/she chose probability 0.6 and 5 correct where he/she chose probability 0.8), the over/underconfidence is

$$\frac{1}{10}\left[4\left(0.6 - \frac{3}{4}\right) + 6\left(0.8 - \frac{5}{6}\right)\right] = -0.08,$$

which is equivalent to the result of the previous expression (the eight unused probability categories contribute zero terms).

A graph that displays the proportion correct for each probability response is called a calibration curve [33]. Figure 3 shows example calibration curves, where curve A reflects underconfidence and curve C reflects overconfidence. In the former case, all points show a higher correct proportion than the corresponding probability response, while the reverse holds for curve C. For example, at the first point of curve A the probability response is 50% while the actual proportion correct is 60%. Curve B displays the perfectly calibrated case, where the probability response equals the correct proportion.


Figure 3: Exemplar calibration curves. Adapted from [33]

3.2 Hard-Easy Effect

The effect of task difficulty on overconfidence was studied primarily by Lichtenstein and Fischhoff [33] and termed the hard-easy effect. The key finding is that people exhibit more overconfidence for hard sets of questions than for easy ones. This phenomenon is also referred to as the discriminability effect or difficulty effect in other studies [13][15][12].

There is not much evidence on the cause of the hard-easy effect. However, it has been assumed that overconfidence in hard tasks could be associated with confirmation bias, a cognitive bias in which people lean towards their pre-existing beliefs while ignoring contrary information [42]. It produces systematic errors in decision making when people focus on and strengthen their beliefs in a certain direction, seeking only supporting information to conclude their decision. People are unlikely to think of reasons why their answer might be wrong; instead, they focus on why it might be right [55]. On the other hand, the lack of confidence on easy questions could be related to another cognitive bias known as bikeshedding: the tendency of people to spend more time thinking about trivial or easy tasks while spending less time on big decisions [45]. By analysing and thinking more about an easy task, people are more likely to conclude that their decision could be wrong and, consequently, be less confident on the easy task.

It has been observed that experience and knowledge can help in estimating a difficult task better. However, Lichtenstein found that knowledge only reduced the overconfidence error on hard tasks, while knowledgeable people were still underconfident on easy tasks [33].

3.3 Framing Effect

Earlier, the so-called expected utility theory was presented by scholars who demonstrated that people estimate the potential consequences when making decisions in uncertain and potentially risky situations. The theory was dominant in most settings where risky decisions were taken until 1944, when the author in [63] put forward a new model on the basis of this theory, predicting that individuals tend to take the decisions that equate to their utmost expectation.

However, in reality most individuals do not show fully rational behavior because of limited experience, knowledge and other factors. People are unlikely to take perfectly rational decisions under conditions such as lack of knowledge and unforeseen risks. Therefore, observed decision-making could not support expected utility theory, which stimulated researchers to study the decision-making process further.

The so-called framing effect is the cognitive bias whereby the decision-making process is affected by how information is presented [60]: the same information can be presented in different ways to influence an individual's choices. An example of the framing effect is the so-called 'Asian disease problem' designed by the authors in [59]. In the study, the subjects were asked to select between two alternative programs to combat a disease. In one program, 200 people would survive, whereas in the other program there was a one-third probability that all 600 people would survive and a two-thirds probability that no one would survive.

In the alternative framing, the same decision problem is described using loss rather than survival vocabulary. For the same Asian disease problem, if the first program is adopted, 400 people would die; if the second program is adopted, there is a two-thirds probability that all 600 people would die and a one-third probability that no one would die.

The majority of participants (72%) were inclined towards the certain outcome when the framing was 'positive', emphasizing the lives saved, whereas most of the second group (78%) favoured the risky option when the same choices were framed 'negatively', emphasizing the lives lost. Nonetheless, both problems are effectively identical in terms of the actual numbers of survivals and deaths; the only difference is that the outcome of the first problem was described in terms of lives saved and that of the second in terms of lives lost.

People relate possible outcomes to gains and losses. They tend to be more risk-seeking when choosing an option that is seen as a loss, and risk-averse in the case of a gain. In terms of framing, when the choices are framed positively, people perceive them as possible gains and tend to be more risk-averse. Alternatively, when the same options are framed negatively, people are more likely to consider them losses and show more risk-seeking tendencies [32].

In 1998, the authors in [31] proposed that the framing effect could be further divided into several subcategories based on their operational definitions and underlying processes. The three major distinguishable types are attribute framing, goal framing and risky choice framing. Generally, attribute framing has a key attribute labelled in positive rather than negative terms, whereas risky choice framing is associated with the willingness to take a risk depending on how the potential outcome is framed. In this thesis, the author employs the concept of goal-framing effects. Goal-framing effects occur when a message induces a different appeal depending on how it stresses the consequences of achieving a particular goal. For example, a message stressing the positive consequences of performing a certain task has a different appeal to an end user than a message stressing the negative consequences of not performing it [30].


4 Gamified model

The author has developed a game-like environment to conduct experiments and test the hypotheses stated in section 1.3. The environment is an application that can be accessed through a web browser. It provides a user interface for controlling several aspects of an experiment, such as playing audio clips and giving confidence feedback (see Figure 4). The tool is implemented with HTML, CSS, JavaScript and jQuery for the front-end interface, and Python with Django for the back-end. MySQL is used for database management. To make the game playable on multiple devices, including mobile phones and tablets, the author adopted a responsive design that adapts to different screen sizes.

As shown in Figure 4, two speech samples at a time are presented to the user to listen to. Participants can play and pause either sample as many times as they wish before confirming their binary decision (SAME or DIFFERENT speakers). Before confirming their decision, the participant must also indicate how confident they are in their choice, either by moving a slider or by using an input box; both are quantized to integer percentages, i.e. 100 possible confidence values ranging from totally unconfident (1%) to totally confident (100%). When both the decision and the confidence level are submitted, the 'next' button is enabled for the next trial.


Figure 4: Game interface on a mobile device

4.1 Game controls

Users with administrator permissions (Admin) can change the game configuration through the game settings. Once logged in, a settings panel becomes available to the administrator, through which the game environment can be configured and controlled.

Game instructions: Game instructions can be provided or edited through the settings panel by the Admin. A new instruction is updated immediately for the current session while the game is being played. Along with instructions, images can also be added or removed through the same interface. The instructions and multiple choice options are combined together under a setting name. For example, in Figure 5, the setting name 'Exp1' is saved with the binary choice options 'same' and 'different', together with the game instruction and image. From a drop-down list, the Admin can select the setting to be used for a specific game play. The Admin can create multiple such combinations so that they can easily be switched as needed while conducting experiments.

Figure 5: Game settings control panel, accessible only to an authorized Admin

Multiple choice options: The Admin can switch between binary and multiple choice options. For multiple choice options, a score value should be provided by the Admin along with a description of each option. The score value is the points awarded to the player for each choice, conditioned on the ground truth of the trial. For example, in Figure 5, when the player chooses the option 'same', he/she is awarded 10 points if the ground truth is same, whereas 10 points are deducted if the ground truth is different for the given pair of voice samples.
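A minimal sketch of this scoring rule follows; the ±10 point values are taken from the example configuration in Figure 5, and the function itself is illustrative rather than the framework's actual code:

```python
def trial_points(choice, ground_truth, award=10):
    # +award when the chosen option matches the trial's ground truth,
    # -award otherwise (values from the example setting in Figure 5).
    return award if choice == ground_truth else -award

print(trial_points('same', 'same'))       # 10
print(trial_points('same', 'different'))  # -10
```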

Audio Files:

The author uses audio clips from the VoxCeleb1 corpus¹ [40]. The audio files are loaded from a remote server at runtime: instead of relying on an online streaming service such as YouTube, the author has stored the collection of audio files on a dedicated server. The Admin can specify the list of audio pairs to be played for a specific experiment setting by uploading a trial list file. This list contains the paths of the audio files in plain .txt format, as shown in Figure 7.

Figure 6: Options to upload audio list and score file

Figure 7: Sample list of audio files to upload as files.txt

¹ https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html


An important design feature of the game is that the trials (speaker comparisons) get progressively more difficult as the player proceeds from one level to the next. In order to determine the difficulty level, we need to know how similar or different a pair of audio samples sounds. To this end, a score is calculated using an Automatic Speaker Verification (ASV) system [28]. The pairwise ASV score file is uploaded along with the audio files.

15.34445    2.369956    1.9103882   1.1902475   1.517458
2.369956    15.604827   -1.3685863  0.13076347  3.7607312
1.9103882   -1.3685863  16.573633   2.833692    0.26296276
1.1902475   0.13076347  2.833692    12.426578   1.9774017
1.5174584   3.76073076  0.26296276  1.9774017   12.589277

Figure 8: Speaker similarity scores generated by the ASV system [28]

In practice, the score file is a two-dimensional array (matrix), as shown in Figure 8. It is obtained by comparing each audio file in the list against all the other audio files, i.e. a matrix of shape $N \times N$ where $N$ is the total number of audio files in the list. While saving the file, a new matrix file of the same shape is generated for the ground truth. After uploading, the file is referenced by the specific configuration setting when running the experiment. Each configuration can have its own score file and audio file list.

There is an option to play the audio list sequentially or randomly. If the sequential option is selected, the audio files are played in the order in which they appear in the list. If the randomized option is selected, audio pairs are randomly selected based on difficulty level, where the pairs belonging to a specific difficulty level are determined from the provided scores.

4.2 Difficulty level

When the game is loaded for the first time, or after uploading an audio file list or score file, all the speech files in the audio file list are mapped to their ASV scores and ground truth. Based on this mapping, a new file is generated and cached on the server. For calculating the difficulty levels, a lower triangular matrix is extracted from the scores, excluding the diagonal elements.

0           0           0           0          0
2.369956    0           0           0          0
1.9103882   -1.3685863  0           0          0
1.1902475   0.13076347  2.833692    0          0
1.5174584   3.76073076  0.26296276  1.9774017  0

Figure 9: Lower triangular matrix extracted from the audio scores

As the score matrix contains the score of each audio file compared with every other file, the value at position $[i, j]$ is the same as the value at position $[j, i]$; both are the comparison between audio files $i$ and $j$. To avoid duplicate records for each score, a lower triangular matrix is extracted. The diagonal elements are excluded, as they represent the comparison of an audio file with itself.

Finally, the extracted scores are sorted while preserving their position (index) information. Since a higher ASV score indicates that the speakers are more likely the same, high scores fall into the easy category for pairs whose ground truth is 'same'. As shown in Figure 10, the highest-scoring target (same-speaker) pairs fall into level 1 (easy), and as the difficulty level increases, audio pairs with lower scores are selected. Conversely, for non-target (different-speaker) pairs, those with the lowest scores fall into the easy category, and as the difficulty level increases, pairs with higher scores are selected. Thus, the audio pairs at each difficulty level are chosen based on their score and ground truth (see Appendix A for pseudocode).
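The following NumPy sketch mirrors this level assignment (see Appendix A for the thesis' own pseudocode); the function and variable names are assumptions for illustration:

```python
import numpy as np

def trials_by_difficulty(scores, same_speaker, n_levels=3):
    # scores: N x N ASV similarity matrix; same_speaker: N x N boolean
    # ground-truth matrix. Sketch of the bucketing described above.
    i, j = np.tril_indices_from(scores, k=-1)  # lower triangle, no diagonal
    pairs = list(zip(scores[i, j], same_speaker[i, j], i, j))
    levels = {lv: [] for lv in range(1, n_levels + 1)}
    for is_target in (True, False):
        # Target pairs: high score = easy; non-target pairs: low score = easy.
        subset = sorted((p for p in pairs if bool(p[1]) == is_target),
                        key=lambda p: p[0], reverse=is_target)
        chunk = max(1, len(subset) // n_levels)
        for lv in range(1, n_levels + 1):
            for _, _, a, b in subset[(lv - 1) * chunk : lv * chunk]:
                levels[lv].append((int(a), int(b)))
    return levels
```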


Figure 10: Typical setup in the proposed framework, showing the progression of difficulty level for target and non-target trials on the basis of ASV score. As a high ASV score indicates that the speakers are more likely the same, trials with high scores are used in level 1 (easy) for target speakers, and vice versa for non-target speakers.

4.3 Storage and Architecture

All the uploaded files, images and cache files are stored in a media directory on the server. These files are served via the application server, where the back-end logic runs. Database tables also reference image files in the media directory. Static files such as JavaScript and CSS are accessed through the back-end and served by the application server.³ An Nginx server is used as a proxy server connecting user interactions with the back-end logic.

³ https://www.voicecomp.net/


Figure 11: Application architecture

4.4 Game data

The game data is stored in a MySQL database using several relational tables. The data can be downloaded by a logged-in Admin through the settings panel, as shown in Figure 12. The game data can be downloaded separately for each game configuration. For example, an administrator can create a separate configuration with different instructions, multiple choice options and speech files; each configuration is associated with a unique configuration name. After conducting an experiment, the game data corresponding to a specific configuration can be downloaded separately.

Table 2 summarizes the basic relational tables used for storing the game data.


Figure 12: Example of downloaded game information using control panel

Table 2: Database tables used to store game information

Table name | Description
Batch | A record is created each time a game starts. It stores the worker information and the setting for the trial.
Worker | Stores the player information. A unique name is automatically assigned to each player. Alternatively, an alias can be assigned for the player's actual name, which need not, however, be unique.
Setting | The Admin can create a setting under which each trial is conducted. A setting contains the options/choices and instructions for the user.
Choices set | Stores the details of the individual choices, including the points for each choice if chosen by the user.
Instructions | Stores the game instruction presented to the player, along with a reference to a possible image used with the instruction.
Experiment | Stores the audio file information, user response and batch information for each trial.


5 Experiment and Results

5.1 Experimental setup

5.1.1 Game Logic

In order to conduct the experiment in the form of an online game, some parameters, such as the total number of trials, the number of levels and the progression rules, need to be set carefully beforehand. Taking inspiration from other gamified data collection studies mentioned in section 1.1, and taking the player's motivation into account, the number of trials and difficulty levels were chosen so that the player does not feel too exhausted while playing. We selected 10 audio pairs randomly from each of 3 difficulty levels, from easy (level 1) to hard (level 3), based on their speaker similarity scores. The game is thus divided into 3 levels, each consisting of 10 trials. The player needs to get more than 50% of the responses correct in a level in order to progress to the next one, under the assumption that the player has a 50% chance of getting each answer correct since the choices are binary. If the individual scores less than 50%, the same level is repeated and the score is reset to the beginning of that level, as illustrated in Figure 13.

Figure 13: Level progression in the empirical framework (three levels of N = 10 trials each; more than 50% correct advances the player, less than 50% repeats the level)
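A minimal sketch of this progression rule (assuming, as the text suggests, that strictly more than 50% correct is required to advance):

```python
def next_level(level, n_correct, n_trials=10, max_level=3):
    # Advance only when strictly more than 50% of the level's trials were
    # answered correctly; otherwise the same level is repeated.
    return min(level + 1, max_level) if n_correct > n_trials / 2 else level

print(next_level(1, 6))  # 2: 6/10 correct, the player advances
print(next_level(1, 5))  # 1: exactly 50% is not enough, the level repeats
```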

5.1.2 The three different framing effects

The instructions of the game can alter the decision choices of the participants if they are framed in a certain way, as discussed in section 3.3. In order to study this, three disjoint framing scenarios were prepared: one neutral, and two intended to bias the player's decision towards each of the binary choices, respectively. The results of the two biased scenarios are then compared with the neutral scenario to observe whether the framing has directed the players' decisions in the intended direction.

1. Neutral: In neutral framing, we intend the listeners to be as unbiased as possible towards either one of the two binary choices. Figure 14 shows the instruction for the neutral framing setup.

Figure 14: Neutral framing

2. Biased for 'acceptance': In this case, the aim is to purposefully bias the decisions of the listeners towards more acceptances, i.e. 'same speaker' decisions. It is intended to mimic convenience applications that reduce the number of false rejections (misses) in real-world scenarios. With reference to section 3.3, the instruction is negatively framed regarding the effect of rejecting a genuine customer, as shown in Figure 15. As can be seen, the instructions now contain a 'story', and to reinforce the idea of 'loss' the last, most important sentence is further highlighted in a red font.

Figure 15: Framing biased for ’acceptance’

3. Biased for 'rejection': As opposed to the previous framing, the intention here is to encourage the listeners to make more rejections, i.e. 'different speaker' decisions. It is intended to mimic security applications where keeping the false acceptance (false alarm) rate low is important. As shown in Figure 16, here too the instruction is negatively framed, regarding the consequences of accepting a different speaker.


Figure 16: Framing biased for ’rejection’

5.1.3 Measures of performance

1. Framing effect:

The author measures the total number of target responses (i.e. 'same speaker') and non-target responses (i.e. 'different speaker') in each framing scenario. Their proportion is calculated by dividing the number of target/non-target responses by the respective total number of target/non-target trials. Finally, the calculated proportion is normalized for comparison, as shown in the following expression:

$$\text{target/non-target response proportion} = \frac{\text{target/non-target responses}}{\text{total target/non-target trials}} \times \frac{1}{\text{total trials}}$$

The author also compares the total numbers of misses and false alarms in each framing.

A miss (false rejection) occurs when the target, i.e. valid, speaker is rejected. It is measured as the number of times participants responded 'different speaker' when the audio pair actually came from the same speaker. A false alarm (false acceptance) occurs when an impostor speaker is accepted as the target speaker. It is measured as the number of times participants responded 'same speaker' when the audio pair actually came from different speakers. The total numbers of misses and false alarms are also affected by the framing effect if the total numbers of target and non-target responses are altered.
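These counts can be computed directly from the per-trial decisions; a minimal sketch (the label strings and data layout are assumptions):

```python
def miss_false_alarm(decisions, ground_truth):
    # decisions / ground_truth: parallel lists of 'same'/'different' labels.
    miss = sum(d == 'different' and g == 'same'          # target rejected
               for d, g in zip(decisions, ground_truth))
    fa = sum(d == 'same' and g == 'different'            # non-target accepted
             for d, g in zip(decisions, ground_truth))
    return miss, fa
```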

2. Overconfidence effect:

The author measures the confidence of the participant in each of their responses. Based on their confidence, the bias score is calculated as

$$\text{Bias score} = \text{Average confidence} - \text{Correct proportion}.$$

The overconfidence of each participant was determined by the bias score. As discussed in section 3.1, a positive bias score represents overconfidence, a negative bias score represents underconfidence, and a bias score equal to zero indicates an accurately calibrated person. In the analyses, the author compares the bias scores of the participants at each difficulty level.

5.2 Outline of the experiments

Since the experiment required a controlled framing setup for each panel and was not integrated into a large crowdsourcing platform, a small number of 30 participants was recruited through social media. The author conducted three disjoint experimental panels, each consisting of 10 participants. The speech files and trials were the same in all panels, while the framing (instructions) was altered between the panels: the first panel received neutral framing, the second received framing biased for acceptance and the third received framing biased for rejection.

The participants were informed about the game rules, the total number of questions in each level and the highest level to complete. The author also explained the scoring rule and the effect of confidence on positive and negative scoring to the participants.

Participants were given some time to ask any questions they might have. Once they confirmed that they understood the game rules and were ready, the URL for the game was provided. Each participant could take part in the experiment only once, but could repeat each level multiple times until reaching the highest level. In the case of failed levels and repeated trials, only the data from the first attempt was used for analysis, since the experience from the first attempt might affect the decision choices in a second attempt.

Before starting the game, participants were asked for some demographic data that might affect voice perception. The following questions were used to gather this data.

(All questions are free-text fields.)

1. Have you ever been involved in a voice comparison task before? If yes, how would you rate your skill: basic, mid or advanced (expert)?

2. Which spoken languages can you understand?


3. Have you been diagnosed with a hearing problem, or do you use a hearing aid?

4. Are you using headphones/earphones or loudspeakers?

5. Are you in a silent environment, or is there some background noise?

6. Which device are you using to play the game: laptop, mobile or any other?

Finally, the participant's explicit consent was requested before proceeding to the game (see Appendix B).

5.3 Results

Based on the data collected through the experiment, the author analyzes the effect of difficulty level on the participants' confidence and the effect of framing on their decision choices. To analyze the hard-easy effect, the bias score is compared across difficulty levels to observe how difficulty impacts confidence; for example, a higher number of positive bias scores at a certain level indicates that the participants are more overconfident at that difficulty level. For the framing effect, the proportions of target and non-target responses are compared across the three framing scenarios; for example, a higher target response proportion in a certain framing scenario than in the others suggests that the participants are biased towards acceptance in that scenario.

5.3.1 Overconfidence Effect

The data from the first set of experiments (neutral framing) was used to analyze the confidence of participants at each difficulty level. In the other framing scenarios the participants are already biased, which might affect their confidence ratings, so those data are not included in this analysis. Table 3 summarizes the data from the 10 participants in the first set of experiments.

In Table 3, we can see that out of 10 participants, 6 are underconfident at level 1, 4 are underconfident at level 2, whereas only one is underconfident at level 3. On the other hand, 4 are overconfident at level 1, 5 at level 2 and 8 at level 3. These findings suggest that overconfidence increases as the difficulty level increases; likewise, the largest number of underconfident participants was found at level 1.

Table 3: Bias score of each participant at all three difficulty levels

Worker  Level  Average confidence  Correct proportion  Bias score
1       1      0.818               0.9                 -0.082
        2      0.79                0.8                 -0.01
        3      0.9                 0.6                  0.3
2       1      0.748               0.7                  0.048
        2      0.717               0.4                  0.317
        3      0.545               0.2                  0.345
3       1      0.888               0.9                 -0.012
        2      0.95                0.8                  0.15
        3      0.949               1                   -0.051
4       1      0.799               0.6                  0.199
        2      0.786               0.6                  0.186
        3      0.806               0.7                  0.106
5       1      0.89                0.9                 -0.01
        2      1                   0.6                  0.4
        3      1                   0.5                  0.5
6       1      0.874               0.9                 -0.026
        2      0.985               0.5                  0.485
        3      0.981               0.4                  0.581
7       1      0.92                0.7                  0.22
        2      0.7                 0.7                  0
        3      0.7                 0.6                  0.1
8       1      0.875               0.9                 -0.025
        2      0.7                 0.8                 -0.1
        3      0.7                 0.7                  0
9       1      0.902               0.6                  0.302
        2      0.709               0.9                 -0.191
        3      0.765               0.4                  0.365
10      1      0.885               0.9                 -0.015
        2      0.75                0.8                 -0.05
        3      0.767               0.6                  0.167


In Figure 17, the black line represents perfectly calibrated cases. All points above the black line represent underconfidence, whereas points below it represent overconfidence. We plotted the average confidence response of the 10 participants against their correct proportion at each level. Most of the points at level 3 lie below the ideal calibration line, and their number decreases as we move to level 2 and level 1.

Figure 17: Confidence response shown by the participants in different levels
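A calibration plot of this kind takes only a few lines of matplotlib. The sketch below uses the values of the first three participants from Table 3 as example data; the orientation follows the description above, so points above the diagonal indicate underconfidence.

```python
# Sketch of a calibration plot in the style of Figure 17.
import matplotlib.pyplot as plt

# (average confidence, correct proportion) pairs; example values taken
# from the first three participants in Table 3.
levels = {
    "Level 1": [(0.818, 0.9), (0.748, 0.7), (0.888, 0.9)],
    "Level 2": [(0.790, 0.8), (0.717, 0.4), (0.950, 0.8)],
    "Level 3": [(0.900, 0.6), (0.545, 0.2), (0.949, 1.0)],
}

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], color="black", label="Perfect calibration")
for name, points in levels.items():
    confidence, correct = zip(*points)
    ax.scatter(confidence, correct, label=name)
ax.set_xlabel("Average confidence")
ax.set_ylabel("Correct proportion")
ax.legend()
plt.show()
```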

5.3.2 Framing Effect

In order to study the framing effect in three scenarios, i.e. neutral framing, framing biased for acceptance and framing biased for rejection, data from the three sets of experiments were taken; these are summarized in Tables 4 through 6.


Table 4: Neutral framing

Worker  Target response proportion  Non-target response proportion  N(miss)  N(fa)
N-1     0.53                        0.47                            4        3
N-2     0.40                        0.60                            11       6
N-3     0.46                        0.54                            3        0
N-4     0.60                        0.40                            5        6
N-5     0.43                        0.57                            7        3
N-6     0.62                        0.38                            4        8
N-7     0.43                        0.57                            7        3
N-8     0.43                        0.57                            5        1
N-9     0.53                        0.47                            6        5
N-10    0.53                        0.47                            4        3

Table 5: Framing biased for acceptance

Worker  Target response proportion  Non-target response proportion  N(miss)  N(fa)
A-1     0.63                        0.37                            2        4
A-2     0.63                        0.37                            5        7
A-3     0.60                        0.40                            6        7
A-4     0.70                        0.30                            3        7
A-5     0.67                        0.33                            2        5
A-6     0.70                        0.30                            1        5
A-7     0.75                        0.25                            3        8
A-8     0.63                        0.37                            2        4
A-9     0.63                        0.37                            3        5
A-10    0.63                        0.37                            5        7


Table 6: Framing biased for rejection

Worker  Target response proportion  Non-target response proportion  N(miss)  N(fa)
R-1     0.33                        0.67                            10       3
R-2     0.36                        0.64                            9        3
R-3     0.36                        0.64                            8        2
R-4     0.60                        0.40                            1        2
R-5     0.43                        0.57                            7        3
R-6     0.40                        0.60                            6        1
R-7     0.46                        0.54                            7        4
R-8     0.46                        0.54                            6        3
R-9     0.70                        0.30                            0        4
R-10    0.70                        0.30                            1        5

In Figure 18, for neutral framing, we see that the average target response is 49.6% whereas the average non-target response is 50.4%. In the next set of experiments, as the framing is altered to bias towards acceptance, we see a significant increase in the target response (same speaker) to 65.6% and a decrease in the average non-target response (different speaker) to 34.4%. In the experiment with framing biased towards rejection, we see a decrease in the average target response to 48% and an increase in the average non-target response to 52% when compared to the neutral framing scenario.

These findings suggest that the framing of instructions affects the decision choices of participants, as expected. The data indicate that framing biased for acceptance favours more target responses, whereas framing biased for rejection favours more non-target responses.


Figure 18: Biased response shown by participants in different framing scenarios

In Figure 19, we can see the effects of framing bias on the total numbers of misses and false alarms as well. In the experiment conducted with neutral framing, the average false alarm rate is 12.66% and the average miss rate is 18.66%. In the experiment conducted with framing biased for acceptance, the false alarm rate increases significantly to 19.66%, whereas the miss rate decreases to 10.66%; however, there is no significant change in the total number of correct answers. Similarly, for the experiment with framing biased for rejection, the miss rate increases to 18.33% and the false alarm rate decreases to 10% compared with the second experiment. These findings indicate that framing affects the total numbers of misses and false alarms: framing biased for acceptance favours more false alarms than misses, and framing biased for rejection favours more misses than false alarms.


Figure 19: Effect of framing bias on number of miss and false alarm
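As a sketch of how the per-listener statistics in Tables 4 through 6 and the rates in Figures 18 and 19 can be derived from raw decisions, the Python fragment below counts target responses, misses and false alarms. The record fields are illustrative assumptions, not the actual data format of the framework.

```python
# Minimal sketch of the per-listener framing statistics (assumed record layout).

def framing_stats(trials):
    """Return target-response proportion, miss count and false-alarm count.
    A miss is a same-speaker (target) trial judged as 'different speaker';
    a false alarm is a different-speaker trial judged as 'same speaker'."""
    n = len(trials)
    target_responses = sum(t["decision"] == "target" for t in trials)
    misses = sum(t["truth"] == "target" and t["decision"] == "nontarget"
                 for t in trials)
    false_alarms = sum(t["truth"] == "nontarget" and t["decision"] == "target"
                       for t in trials)
    return target_responses / n, misses, false_alarms

# Illustrative trials: ground truth vs. the listener's decision.
trials = [
    {"truth": "target",    "decision": "target"},
    {"truth": "target",    "decision": "nontarget"},   # a miss
    {"truth": "nontarget", "decision": "target"},      # a false alarm
]
proportion, n_miss, n_fa = framing_stats(trials)
print(f"Target response proportion: {proportion:.2f}, "
      f"N(miss) = {n_miss}, N(fa) = {n_fa}")
```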

In the neutral framing scenario, an increase in difficulty level inversely impacts the number of correct answers, as seen in Figure 20. At level 1, the average rate of correct answers is 80%. As the difficulty level progresses to 2 and 3, the average rate of correct answers decreases to 69% and 57%, respectively. Similarly, the average numbers of misses and false alarms increase with the level, as shown in Figure 20. This indicates that the difficulty level, adjusted using the automatic speaker recognition system, is in line with the human-perceived difficulty of the voice comparison. In the other framing scenarios, the decisions of the listeners are altered by the framing effect, so we can see some discrepancies in the total numbers of correct answers, misses and false alarms across levels.


Figure 20: Rate of correct response, miss and false alarm on different levels in different framing scenarios; (a) neutral framing, (b) framing biased for acceptance and (c) framing biased for rejection

5.4 Distribution of data

In the experiment, we have binary choices for each trial, and each trial has two possible outcomes. Assuming that each player has the same probability of success in each trial, the collected data can be interpreted as a binomial distribution. Let us consider a correct choice as success (with probability p) and a wrong choice as failure (with probability q = 1 − p), and the total number of trials as n. For each level, we have 10 trials (i.e. n = 10). We obtained the maximum likelihood estimate for p using the formula for the binomial distribution:

\hat{p} = \text{mean}/n,

where mean is the average number of correct answers in the 10 trials across the 10 participants. In this way, p (the probability of a correct answer) is calculated at each level from the observed data. Using p for each level, we obtained the binomial probability mass function for each number of correct answers (see Table 7).

\Pr(X = k) = \binom{n}{k} p^k q^{n-k}, \qquad k = 0, 1, \ldots, 10.
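The estimate and the probability mass function above can be reproduced with the standard library alone. The sketch below uses the Level 1 estimate p̂ = 0.8 as an example and prints values corresponding to the first column of Table 7; function and variable names are illustrative.

```python
# Minimal sketch of the binomial model: ML estimate of p and the pmf.
from math import comb

N_TRIALS = 10  # trials per level (n)

def p_hat(successes_per_participant):
    """Maximum likelihood estimate: mean success count divided by n."""
    mean = sum(successes_per_participant) / len(successes_per_participant)
    return mean / N_TRIALS

def binom_pmf(k, n, p):
    """Pr(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example with the Level 1 estimate from Table 7 (p-hat = 0.8):
p = 0.8
for k in range(N_TRIALS + 1):
    print(f"Pr(X = {k:2d}) = {binom_pmf(k, N_TRIALS, p):.4g}")
```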

Table 7: Binomial probability mass function for the number of correct answers at each level (neutral framing scenario)

                  Pr[X = k]
k     Level 1 (p̂ = 0.8)   Level 2 (p̂ = 0.69)   Level 3 (p̂ = 0.57)
0     1.024×10⁻⁷           0.0000                0.0002
1     4.096×10⁻⁶           0.0001                0.0028
2     7.372×10⁻⁵           0.0018                0.017
3     0.0007               0.0108                0.0604
4     0.0055               0.0422                0.1401
5     0.0264               0.1128                0.2229
6     0.088                0.2092                0.2462
7     0.2013               0.2662                0.1865
8     0.3019               0.2221                0.0927
9     0.2684               0.1099                0.0273
10    0.1073               0.0244                0.0036

In Figure 21, we see histogram plots of the binomial distribution using the p estimated from the observed data at each level. In Figure 22, we can see the distribution of the empirical data at each level. The distribution of the empirical data is not close to the binomial distribution; however, as the difficulty level rises, the empirical distribution tends to get closer to the binomial one, as the gap between the probabilities of success and failure narrows.

Figure 21: Binomial distribution for n = 10 and p based on the success rate at each level, taken from the neutral framing scenario

Figure 22: Frequency of success in the empirical data observed in 10 participants at each difficulty level. Data are from the neutral framing scenario.


5.5 Discussion

The results from the first set of experiments indicate that a high number of listeners were overconfident at level 3 (difficulty level 'hard') and most of them were underconfident at level 1 (difficulty level 'easy'). The rate of increase in positive bias scores was therefore observed to be directly proportional to the difficulty of the task. This finding is consistent with the previous studies in [33], mentioned in Section 1.1, where most participants showed overconfidence on difficult quiz questions and underconfidence on easy ones, even though the setups and context were different. These findings suggest that the hard-easy effect is also prevalent in the context of speaker comparison, as with other tasks that involve decision making. They support the hypothesis that, in the context of speaker comparison, people tend to overestimate their performance on hard tasks and underestimate it on easy tasks. This work, therefore, indicates that the hard-easy effect holds true for the speaker comparison task in the gamified model.

With regard to the experiment on the framing effect, the result of the experiment conducted with framing biased for acceptance indicates that the speakers were accepted in most of the trials (65.6%), which is significantly greater than the acceptance rate in the neutral framing scenario (49.6%). In the experiment with framing biased for rejection, the result indicates that the speakers were rejected more often (52%), which is slightly greater than the rejection rate in the neutral framing scenario (50.4%). Hence, the findings support the presence of the framing effect in the speaker comparison context in the gamified model.

Furthermore, the numbers of misses and false alarms follow the same pattern: the number of misses is higher than that of false alarms under framing biased for rejection, and vice versa in the case of framing biased for acceptance. These results further strengthen the earlier findings that the framing effect has a noticeable and substantial impact on a person's decision making [60][14]. As observed in the results, the outcomes of these experiments are in line with the earlier studies on framing effects in [60] and [14], referred to in Section 1.1. These results support the hypothesis that the framing of instructions can alter the decision choices of participants in speaker comparison.


6 Conclusion

In this thesis, the author studied the potential effect of framing on speaker comparison and the effect of difficulty level on the confidence of the listeners. The author applied the concept of gamification to develop the empirical framework for conducting the experiment. The framework enabled easy configuration of multiple game environments to perform multiple sessions of the experiment online. The outcome of the experiment suggests that the framing of instructions can alter the decision choices of the listeners and can bias them towards choosing one option over the other. The results further indicate that more listeners were overconfident in recognizing speaker similarity at the hard level and underconfident at the easy level. These findings support that biases such as the framing effect and the hard-easy effect also hold true for speaker comparison.

The gamified framework used in the experiment was easy to configure and control remotely because of the web-based approach in its design. However, there is room for further improvement in the data collection approach and the framework itself. One limitation of this experiment was the small amount of data, since the framework was not integrated with any crowdsourcing platform and the experiments were conducted under individual supervision. There is the possibility to automate this process with pre-defined rules for setting up gaming sessions and by integrating the framework with a crowdsourcing or social networking platform, so that a large number of workers can participate without the explicit need for individual supervision. Furthermore, the framework could be improved by incorporating interactive graphic animations, virtual rewards and added functionality to enhance the user experience and motivate the player.

Overall, the gamified approach was found effective for conducting experiments on speaker comparison and collecting binary response data. Using this approach, there are possibilities for further experiments in this direction, such as the impact of visuals on decision making, multiple-choice experiments and time-constrained decision making. Research on other cognitive biases in speech perception, voice spoofing and phishing constitutes other relevant future work in this direction, making the best use of gamification and crowdsourcing.
