
On the Predictive Power of Acoustic Features in the Automatic Assessment of Dysarthria

Nasim Mahdinazhad Sardhaei
Master's Thesis
Linguistic Sciences: Linguistics & Language Technology

School of Humanities

Philosophical Faculty

University of Eastern Finland

May 2020


ITÄ-SUOMEN YLIOPISTO – UNIVERSITY OF EASTERN FINLAND

Faculty: Philosophical Faculty
Unit: School of Humanities
Author: Nasim Mahdinazhad Sardhaei
Name of the Thesis: On the predictive power of acoustic features in the automatic assessment of dysarthria
Major: Linguistics and Language Technology
Description: Master's thesis
Date: 10.05.2020
Pages: 55 + Appendix

Abstract

Machine-learning approaches have paved the way for examining diverse methods for the automatic assessment of both the symptoms and the severity of speech impairments. Acoustics-based methods adopting machine-learning approaches automatically extract thousands of acoustic features, and models are learned from large datasets, resulting in an automatic speech classification system. Despite their several advantages, one of the most crucial drawbacks of such multi-parameter models is that they do not convey any direct information on the most pertinent acoustic features. To this end, the focus of the current study lies on dimension reduction of the feature space and the selection of a subset of acoustic parameters with efficient predictive power from the very large number of acoustic features derived automatically from dysarthric speech. The corpus used in this research was the Nemours Database of Dysarthric Speech, consisting of the speech data of 11 American male speakers with varying degrees of dysarthria, each producing 74 short nonsense sentences. The score from the Frenchay Dysarthria Assessment (Enderby, 1983) for each participant is included in the database.

The feature set of the 2012 Interspeech Speaker Trait Challenge (Schuller et al., 2012) from the openSMILE toolkit (Eyben et al., 2010) was used for the extraction of acoustic features, containing a total of 6125 features derived from 64 energy-, spectrum- and voicing-related "low-level descriptors" (LLD). A preselection of the features was carried out at the very first stage of the process, in which the acoustic features were categorized into separate groups according to the statistical functionals applied to the LLDs. Then, the one or two features with the lowest Akaike information criterion (AIC) were retained from every group. Stepwise selection produced a predictive model of the Frenchay assessment scores with the smallest AIC of 2811.19 and a combination of 238 surviving variables. The results of the multiple regression analysis revealed that the linear model is significant (F(238, 501) = 183.5, p < 0.001) with R² = 0.9833.

Although the parameters of the reduced set efficiently predicted the severity scores of dysarthria, relying exclusively on individual variables will not generate a highly reliable automatic assessment. Thus, to obtain a set of acoustic features with full predictive power, it is also essential to inspect the interactions between acoustic parameters.

Key words

dysarthria, acoustic parameter, dimension reduction, regression analysis


ITÄ-SUOMEN YLIOPISTO – UNIVERSITY OF EASTERN FINLAND

Tiedekunta – Faculty: Philosophical Faculty
Osasto – School: School of Humanities
Tekijä – Author: Nasim Mahdinazhad Sardhaei
Työn nimi – Title: On the predictive power of acoustic features in the automatic assessment of dysarthria
Pääaine – Main subject: Linguistics and Language Technology
Työn laji – Level: Master's thesis
Päivämäärä – Date: 10.05.2020
Sivumäärä – Number of pages: 55 + Appendix

Tiivistelmä – Abstract

Machine learning methods have made it possible to examine a variety of methods for the automatic assessment of the symptoms and severity of speech disorders. Acoustics-based methods that use machine learning approaches automatically measure thousands of acoustic parameters, and the modelling takes place in an automatic speech classification system based on large datasets.

Despite their many advantages, a central drawback of such multi-parameter models is the lack of explicit information about which acoustic features are the most essential. Therefore, the focus of this study is on reducing the dimensionality of the feature space and on finding a set of acoustic parameters, derived automatically from dysarthric speech, with sufficient predictive power. The study used the Nemours Database of Dysarthric Speech, which consists of 74 speech samples from eleven American male speakers diagnosed with dysarthria of varying severity. Each speaker's Frenchay Dysarthria Assessment (Enderby, 1983) score is included in the database.

For the measurement of the acoustic parameters, the feature set of the openSMILE 2012 Interspeech Speaker Trait Challenge (Schuller et al., 2012) was used, which contains 6125 features computed from 64 intensity-, spectrum- and phonation-related descriptors ("low-level descriptor", LLD).

First, the features were divided into categories according to which statistical functionals had been applied to the LLDs. Next, from each group, the one or two features with the lowest Akaike information criterion (AIC) were selected. Stepwise selection then produced a predictive model of the Frenchay assessment scores with 238 selected variables and an AIC of 2811.19. The results of the multiple regression analysis showed the linear model to be significant: F(238, 501) = 183.5, p < 0.001; R² was 0.9833.

Although the minimized feature set reliably predicted the severity of dysarthria, individual variables alone do not guarantee a generally applicable automatic classification. It is also essential to study the interactions between acoustic parameters.

Avainsanat – Keywords:

dysarthria, acoustic parameter, dimension reduction, regression analysis


Table of Contents

1. Introduction ... 1

1.1 Speech Assessment in Motor Disorders ... 1

1.2 An Introduction to Dysarthria ... 2

1.3 Classification of Dysarthria Subgroups ... 3

1.4 The Neuroanatomical Correlates of Dysarthria ... 4

1.4.1 Speech Production ... 4

1.4.2 Speech production as a Motor Skill ... 5

1.4.3 Central vs. Peripheral Nervous System ... 6

1.5 Evaluation of Dysarthric Speech ... 7

1.5.1 Frenchay Dysarthria Assessment (FDA-FDA2) ... 8

1.5.2 Acoustic Analysis of Dysarthria ... 10

1.5.3 Speech Processing ... 12

1.5.3.1 Feature-Space Extraction of Speech Signals ... 12

1.5.3.2 Speech Classification System ... 14

1.6 Research Questions and Hypotheses ... 16

2. Method ... 18

2.1 Introduction ... 18

2.2 The Corpus of the Study ... 18

2.3 Data Gathering ... 20

2.4 Data Analysis ... 23

2.4.1 Dataset description ... 23

2.4.2 Data Pre-processing ... 24

2.4.2.1 Simple Regression analysis ... 24

2.4.2.2 Preselection of the Features ... 25

2.4.2.3.1 Assumption Check ... 27

2.4.2.4 Wrapper Method ... 30

2.4.2.5 Multiple Regression Analysis ... 31

3. Results ... 32

3.1 Preselection of the Features ... 32

3.1.1 Procedure A (Within Groups) ... 32


3.1.2 Procedure B (Within the Dataset) ... 35

3.2 Wrapper Method ... 38

3.2.1 Stepwise Regression on the Features Pre-selected by Procedure A ... 38

3.2.2 Stepwise Regression on the Features Pre-selected by Procedure B ... 44

4. Discussion ... 50

4.1 Contribution to the Automatic Assessment of pathological speech ... 51

4.2 Limitations of the Study ... 52

4.3 Future Research... 53

4.4 Conclusion ... 54

References ... 55

Acknowledgments ... 71

Appendix ... 72

List of Tables

1.1 The classification of dysarthria based on the lesion sites ... 4

1.2 FDA-2 best-fit descriptors ... 9

2.1 Overall score of Frenchay Dysarthria Assessment for each subject ... 19

2.2 64 provided low-level descriptors ‘LLD’ ... 21

2.3 Applied functionals ... 21

2.4 A sample of the dataset used ... 23

2.5 Ranked AICs for the grouped functionals applied to the feature ... 26

3.1 Preselected acoustic parameters based on the smallest AICs within groups ... 34

3.2 Preselected acoustic parameters based on the smallest AICs within the dataset ... 37

3.3 Acoustic variables comprising the 1st chunk ... 38

3.4 Classification of the main features involved in the optimal model ... 43

3.5 Classification of the main feature functions involved in the optimal model ... 43

3.6 Classification of the main features involved in the optimal model ... 48

3.7 Classification of the main feature functions involved in the optimal model ... 48


List of Figures

1.1 Completed FDA-2 rating form ... 9

2.1 Residual plot of the model Frenchayscore ~ pcm_RMSenergy_sma_lpgain ... 27

2.2 Histogram of the residuals of the model Frenchayscore ~ pcm_RMSenergy_sma_lpgain ... 28

2.3 Q-Q plot of the residuals for the model Frenchayscore ~ pcm_RMSenergy_sma_lpgain ... 28

2.4 Residual plot of the model Frenchayscore ~ log(pcm_RMSenergy_sma_lpgain) ... 29

2.5 Histogram of the residuals of the model Frenchayscore ~ log(pcm_RMSenergy_sma_lpgain) ... 29

2.6 Q-Q plot of the residuals for the model Frenchayscore ~ log(pcm_RMSenergy_sma_lpgain) ... 30

3.1 Scatter plot of the model lm (Frenchayscore ~ pcm_zcr_sma_de_percentile1.0) ... 32

3.2 Scatter plot of the model lm (Frenchayscore ~ audSpec_Rfilt_sma.0._iqr1.2) ... 33

3.3 Scatter plot of the model lm (Frenchayscore ~ audSpec_Rfilt_sma.20._quartile1) ... 33

3.4 Scatter plot of the model lm (Frenchayscore ~ pcm_fftMag_spectralRollOff90.0_sma_falltime) ... 33

3.5 Scatter plot of the model lm (Frenchayscore ~ pcm_fftmag_mfcc_sma_de.1._stddev) ... 35

3.6 Scatter plot of the model lm (Frenchayscore ~ audSpec_Rfilt_sma.5._amean) ... 36

3.7 Scatter plot of the model lm (Frenchayscore ~ audspec_lengthL1norm_sma_de_iqr2.3) ... 36

3.8 Subject-specific boxplots of the feature variable ‘pcm_fftMag_mfcc_sma_de.1._rqmean’ ... 39

3.9 Subject-specific boxplots of the feature variable ‘pcm_fftMag_mfcc_sma.11._peakMeanAbs’ ... 40

3.10 Subject-specific boxplots of the feature variable ‘audspec_lengthL1norm_sma_upleveltime25’... 41

3.11 Subject-specific boxplots of the feature variable ‘pcm_fftMag_mfcc_sma.3._peakMeanAbs’ ... 42

3.12 Subject-specific boxplots of the feature variable ‘shimmerLocal_sma_de_stddev’ ... 42

3.13 Subject-specific boxplots of the feature variable ‘audSpec_Rfilt_sma_de.22._iqr1.2’ ... 45

3.14 Subject-specific boxplots of the feature variable ‘audSpecRasta_lengthL1norm_sma_quartile2’ ... 46

3.15 Subject-specific boxplots of the feature variable ‘pcm_fftMag_mfcc_sma_de.1._pctlrange0.1’ ... 47

3.16 Subject-specific boxplots of the feature variable ‘audSpec_Rfilt_sma.2._quartile3’ ... 47


1. Introduction

1.1 Speech Assessment in Motor Disorders

Speech evaluation in motor disorders can play a prominent role in obtaining valuable information on disturbances to the neuromuscular mechanisms affecting speech production by providing objective measures. Because it is non-invasive, because state-of-the-art instrumentation for viewing disordered speech signals is accessible, and because it can provide quantitative data for describing the speech subsystems, the analysis of speech is of great importance in clinical assessment (Kent et al., 1999). Although speech analysis is not the only effective measurement tool in the assessment of speech disorders, it can be an informative complement to other instrumental measurements used for diagnosing a disease, monitoring its progression, or tracking the effects of intervention. Many studies have also verified the efficiency of speech assessment as a promising tool providing prognostic markers to discriminate healthy from pathological patterns (Roark et al., 2011; Meilán et al., 2014; Ash & Grossman, 2015; Poole et al., 2017). The analysis of different linguistic levels of speech organization, such as the phonetic, phonological, lexico-semantic, and morpho-syntactic levels, can yield knowledge of the variables with predictive power for identifying the deficits associated with these diseases (Boschi et al., 2017).

From another perspective, advances in computer science have added a new dimension to the processing of impaired speech, enabling broad cues of speech disorders to be captured. More specifically, machine-learning approaches have paved the way for researchers to examine and explore diverse methods for the automatic recognition of both the symptoms and the severity of speech impairments. The most recent acoustics-based studies adopting machine-learning approaches to assess speech disorders automatically extract a large number of acoustic features, and the models are learned from large datasets, resulting in an automatic speech classification system (Bayestehtashk et al., 2015; Werner, 2018). Such automatic acoustic analyses for classification can have many advantages: they can not only provide more objective, precise, and robust assessments in clinical practice, but also save time and money and open the door to remote patient rehabilitation (Paja & Falk, 2012).


However, the most crucial disadvantage of such multi-parameter models is that they do not deliver direct information on the most relevant acoustic parameters (Werner, 2018).

To this end, the current thesis is an attempt to study the power of acoustic analysis to predict severity in patients affected by dysarthria, a motor speech disorder. Inspired by the work of Werner (2018), the focus of this study lies on dimension reduction of the feature space and the selection of a subset of acoustic parameters with efficient predictive power from the very large number of acoustic features derived automatically from dysarthric speech.

In the following sections, dysarthria, its subcategories, and the deficits of those components of the nervous system that lead to this speech disorder will be briefly described. Then, different approaches to the evaluation of dysarthric speech, including acoustic and physiological assessments, will be delineated. The final topic considered in this chapter is signal processing and its applications to pathological speech. The introduction closes by explaining the existing gap in speech processing research and presenting the research questions of this study.

1.2 An Introduction to Dysarthria

The production of well-inflected and clearly pronounced speech requires precise control over the articulators, and any disturbance to the neuromuscular mechanisms responsible for manipulating the articulators will lead to acoustically distorted speech that is difficult for listeners to discern (Enderby, 1980). Dysarthria is an umbrella term referring to a collection of neurogenic speech disorders characterized by dysfunctions in the rate, strength, amplitude, steadiness, tone, and accuracy of movements occurring in the respiratory, phonatory, resonatory, articulatory, or prosodic aspects of speech production (Duffy, 2013). The key components of human speech production which may reflect disturbances caused by dysarthria are defined by Darley et al. (1975) as follows:

"Respiration provides the raw material for speech...In phonation the breath stream sets into vibration the adducted vocal folds of the larynx...This breath stream with periodic and aperiodic components must be shaped and modified through two additional processes. Resonance is the selective amplification of the vocal tone; the pharynx, oral cavity, and nasal cavity serve as resonators that reinforce certain components of the tone...Ultimately the breath stream is shaped into phonemes (articulation) through impedances produced by the various articulators: the tongue, the teeth, and the lips...The term prosody comprises all the variations in time, pitch, loudness that accomplish emphasis, lend interest to speech, and characterize individual and dialectal modes of expression" (pp. 3–5).

The abnormalities arising from dysarthria occur due to neurological damage to the central or peripheral nervous system. Such damage disturbs the transfer of information from the nervous system to the muscles involved in speech production, disrupting the motor components that play a role in speech production and resulting in the distortion of speech.

The term "dysarthria" primarily referred to articulatory disorders. Peacher (1950) suggested the term "dysarthrophonia" to identify neuromuscular-based impairments which involve phonation and articulation deviations. Later, Grewel (1957) applied the term "dysarthropneumo-phonia" to refer to speech disorders with respiratory problems. The execution of speech comprises highly related processes in which the involved body parts are complexly interconnected. The interconnection of the various speech processes is so high that it is more useful to view the speech mechanism as a fundamentally uniform system: when the system faces an impairment at any point, the function of more than one part is probably affected (Ward et al., 2003). Thus, in neurogenic disorders of communication, a mixture of dysfunctions is more common than an isolated impairment of a single function. Hence, Darley et al. (1975) proposed considering dysarthria an inclusive term for a group of motor speech impairments of phonation, respiration, articulation, resonance, and prosody which have a neurologic basis and occur due to malfunction in the muscular control of the speech mechanism resulting from damage to any of the basic motor processes involved in the execution of speech.

1.3 Classification of Dysarthria Subgroups

As mentioned earlier, dysarthria is an inclusive term with clinically diverse presentations. Knowledge of the area of the lesion, i.e. which part of the central or peripheral nervous system is affected, as well as identification of the distinctive behaviors that result from the damage, can help clinicians recognize the subtypes of the disorder (Carmichael, 2007). Many early pioneering studies, such as Froeschels (1943), Peacher (1950), Grewel (1957), Brain (1962), and Luchsinger and Arnold (1965), have proposed different classifications of dysarthria. However, the six-fold taxonomy of dysarthria subtypes proposed by Darley et al. (1975), based on the results of the Mayo Clinic Study and encompassing flaccid, spastic, ataxic, hypokinetic, hyperkinetic, and mixed dysarthria (Table 1.1), can be viewed as the most well-known and widely used classification system. The classification is unitary and neurologically based, specifying the neuromuscular conditions, and it includes all patterns of the dysarthrias. Later, Enderby (1983) simplified Darley's classification of dysarthria subtypes and proposed a taxonomy of five major categories: extrapyramidal hypokinetic, spastic, ataxic, flaccid, and mixed dysarthria.

Table 1.1
The classification of dysarthria based on the lesion sites (Darley et al., 1975, as cited in Murdoch, 1998, p. 2)

Dysarthria type                 Lesion site
Flaccid                         Lower motor neurones
Spastic                         Upper motor neurones
Ataxic                          Cerebellum and/or its connections
Hypokinetic                     Basal ganglia and associated brainstem nuclei
Hyperkinetic                    Basal ganglia and associated brainstem nuclei
Mixed flaccid-spastic           Both upper and lower motor neurones (e.g. amyotrophic lateral sclerosis)
Mixed ataxic-flaccid-spastic    Cerebellum/cerebellar connections, upper and lower motor neurones (e.g. Wilson's disease)

1.4 The Neuroanatomical Correlates of Dysarthria

1.4.1 Speech Production

Spoken language production is a highly complex activity engaging the integration of linguistic, motor-sensory, and cognitive systems (Geranmayeh et al., 2014). In the literature, different traditions, including linguistics, psycholinguistics, motor control, neuropsychology, and cognitive neuroscience, have studied and formulated models of speech production using distinct approaches (Hickok, 2014). Each of these traditions has targeted speech production from its own unique perspective, dealing with specific levels of speech, and shows only partial convergence with the ideas of the other traditions. One of the major separations in traditional speech production models is the division between higher-level psycholinguistic frameworks and lower-level motor control processes (e.g., Levelt, 1989; Guenther, 2006). While linguistic-cognitive models of speech production restrict themselves to the conceptualization and processing of the traditional linguistic domains such as the semantic, lexical, and phonological levels (e.g., Dell, 1986; Fromkin, 1971; Garrett, 1975), the motor control models attempt to uncover the intricacy of the sensorimotor aspects of speech and account for the neural and mechanical motor control of articulatory structures (Fairbanks, 1954; Guenther et al., 1998; Houde & Jordan, 1998). However, recent studies have revealed similarities between the levels on which these traditions focus and have proposed integrated, more inclusive approaches depicting the speech production processes uniformly (Hickok, 2012; Hickok, 2014).

1.4.2 Speech production as a Motor Skill

Dysarthria, as a neuro-motor disorder, impairs only the neuromuscular mechanisms responsible for controlling the articulators; it does not affect the speech planning units or other psycholinguistic processes involved in the abstract formulation of meaningful, syntactically correct speech (Rudzicz, 2013). Thus, the focus of this section will be only on the neuro-motor mechanisms underlying speech production and on those components of the motor units whose damage is associated with the occurrence of dysarthria.

The speech mechanism requires the precise coordination of various subsystems, including the orofacial (tongue, lip), velopharyngeal, laryngeal, and respiratory systems (Gentil, 1990). It is estimated that approximately 100 muscles of the speech motor units, such as the laryngeal, orofacial, and respiratory muscles, are precisely coordinated in the human motor speech process (Simonyan & Horwitz, 2011).

A better understanding of such a high and complex degree of coordination cannot be obtained without considering the role of the peripheral and central nervous systems in the speech process. In other words, speech production can be viewed as a finely tuned and highly complex motor skill in which certain structures of the nervous system play a vital role. These structures are involved in controlling the coordinated contraction of a large number of speech mechanism muscles, including the muscles of the lips, tongue, jaw, soft palate, pharynx, and larynx, as well as the muscles of respiration, by nerve impulses arising from the cerebral cortex and passing to the muscles along motor pathways. From this perspective, the nervous system consists of a series of higher and lower levels of motor control in which the former regulates the latter.


The lower level of motor control comprises the lower motor neurones, which form the pathway for the transmission of nerve impulses from the central nervous system to control the contraction of the skeletal muscles. The lower motor neurones arise from either the brainstem or the anterior horns of the spinal cord and run in the cranial or spinal nerves; lesions in these two areas are treated as lower motor neurone lesions and disturb the transmission of nerve impulses from the central system to the skeletal muscle fibres. The main symptoms of lower motor neurone lesions observable in dysarthria are the loss of voluntary control of muscles, weak and flaccid muscles, reduced muscle reflexes, and fasciculation.

The highest motor control level contains the motor areas of the cerebral cortex, which are in charge of voluntary muscle activity and dominate the lower motor neurones arising from the brainstem or spinal cord through direct or indirect descending motor pathways. These pathways are constituted of upper motor neurones, and lesions to these neurones, primarily in the cerebral cortex, internal capsule, or brainstem, may lead to dysarthria. Clinical symptoms of such lesions include spastic paralysis of the muscles, lack of muscle atrophy, and hyperactive muscle stretch reflexes.

Considering the neuroanatomy of motor control, various neuropathological causes can lead to the speech deviations attributed to dysarthria. Motor speech activities are integrated through five levels of the central nervous system, namely the cerebral cortex, the basal ganglia of the cerebrum, the cerebellum, the brainstem, and the spinal cord, and lesions at any of these levels can result in dysarthria. Likewise, any damage to the peripheral nervous system involved in controlling the muscles of the speech mechanism, as well as any disorder of the speech mechanism muscles themselves, may lead to dysarthria. In addition, dysarthria can be a consequence of disorders preventing the normal transmission of nerve impulses at the level of the neuromuscular junction (Murdoch, 1998).

1.4.3 Central vs. Peripheral Nervous System

The nervous system is a very complex structure divided into two major interacting subsystems: the central nervous system (CNS) and the peripheral nervous system (PNS). The central nervous system consists of the brain and the spinal cord. The peripheral nervous system is an extensive network of nerves arising from the base of the brain and the spinal cord, connecting the CNS to the muscles and sensory structures. The main function of the PNS is to transfer nerve impulses to and from the central nervous system via afferent and efferent nerve fibres. While the afferent or sensory nerve fibres convey nerve impulses arising from the sensory receptors to the central nervous system, the efferent or motor nerve fibres carry nerve impulses arising from the CNS to effector organs such as muscles and glands. The peripheral nervous system consists of three principal interactive subdivisions: (A) twelve pairs of cranial nerves, five of which (V, VII, IX, X, and XII) contribute saliently to the speech production process, so that disruption in any of them can manifest as dysarthria; (B) thirty-one pairs of spinal nerves, which transfer motor, sensory, and autonomic signals between the spinal cord and the body, with dysarthria associated with injuries to any of these nerves or to the spinal cord reflecting respiratory dysfunction and its downstream effects on phonation, articulation, and prosody (Hoit et al., 2017); and (C) the peripheral portions of the autonomic nervous system, which are not involved as a cause of dysarthria.

1.5 Evaluation of Dysarthric Speech

One of the prominent challenges in the early diagnosis of dysarthria is the heterogeneity of speech characteristics across patients affected by dysarthria. In addition to variability in individual speaking styles (Vick et al., 2012), the variation in the underlying neuropathology of dysarthria affecting different speech production subsystems (Allison & Hustad, 2018) contributes to this heterogeneity. Such variation arises from the severity of the disturbance, the location of the neurological damage resulting in various malfunctions, and the extent of the damage, which may affect a single subsystem or simultaneously involve many speech production components, including respiration, phonation, resonance, articulation, and prosody.

Thus, when the goal is to provide a comprehensive profile of dysarthric speech, the disorder’s manifestations across the key dimensions underlying speech production need to be taken into account (Allison & Hustad, 2018).

From another perspective, descriptions of the speech characteristics associated with dysarthria depend on the analysis procedure used to assess the disorder (Forest et al., 1989) and the target research question. Among many possibilities, two general approaches to the assessment of dysarthria can be defined. In the neurologic, disease-based approach, the focus lies on the pathophysiological processes that contribute to speech disorders. Such an approach principally investigates the physiological functions of the articulatory muscles in a clinical setting through qualitative measures provided by a speech pathologist. A very well-known assessment of this type is the Frenchay Dysarthria Assessment (Enderby, 1980; Enderby & Palmer, 2008), which will be described in detail in the next section. The alternative is the (neuro)linguistic-based approach, which emphasizes the investigation of linguistic processes to uncover facts about the underlying neurological dysfunctions affecting speech production. This type of approach provides quantitative measures by means of instrumental analyses, including acoustic analysis, to delineate the patterns that distinguish dysarthric speech from healthy speech.

1.5.1 Frenchay Dysarthria Assessment (FDA-FDA2)

The Frenchay Dysarthria Assessment (FDA), one of the most widely used dysarthria diagnostic evaluation procedures, was first introduced by Enderby (1983), and its second edition (FDA-2) was later published by Enderby and Palmer in 2008. FDA-2 is a profile-type rating scale used by speech pathologists to subjectively assess the performance of patients affected by dysarthria on a series of exercises related to speech functions. As a diagnostic tool, FDA-2 incorporates customized measurement systems to allow estimation of the stage of disorder development (i.e. mild, moderate, or severe) and identification of the specific dysarthria subcategory the patient manifests (Carmichael, 2007). The current version of the FDA consists of seven assessment categories, each composed of a series of individual testing items, as follows:

Reflexes: ratings for cough, swallow, and dribble/drool
Respiration: ratings at rest and in speech
Lips: ratings at rest, spread, seal, alternate, and in speech
Palate: ratings for fluids, maintenance, and in speech
Laryngeal: ratings for time, pitch, volume, and in speech
Tongue: ratings at rest, protrusion, elevation, lateral, alternate, and in speech
Intelligibility: ratings for words, sentences, and conversation

The rating scale of FDA-2 has five best-fit descriptors (Table 1.2), ranging from 'a' to 'e', which provide general information on the type and severity level of the patient's disorder and help the pathologist assess the patient's performance.


Table 1.2
FDA-2 best-fit descriptors (Enderby & Palmer, 2008, p. 6)

a    Normal for age
b    Mild abnormality noticeable to skilled observer
c    Abnormality obvious but can perform task/movements with reasonable approximation
d    Some production of task but poor in quality, unable to sustain, inaccurate or extremely labored
e    Unable to undertake task/movement/sound

Assigning a score to a specific sub-task requires blackening the column up to the appropriate point corresponding to the grade (Fig. 1.1). If the patient's performance on an item falls between two scores, the evaluator can assign an interval grade by using the in-between line. Thus, each individual column of the test has nine discernible points corresponding to nine grades. Four of these points are interval grades ('b+', 'c+', 'd+' and 'e+'), which denote a performance better than the grade below but not sufficiently competent to merit the grade above.

Fig. 1.1. Completed FDA-2 rating form (Enderby & Palmer, 2008, p.7)

Although the FDA scale includes a wide range of aspects for evaluating dysarthria, such as reflexes, respiration, lip movement, palate movement, laryngeal capacity, tongue posture/movement, and intelligibility, it suffers from psychometric weakness because of its reliance on subjective evaluation techniques: there can be inconsistency and disagreement among the pathologists or experts using the assessment regarding the diagnosis or the classification of the disorder subtype (inter-rater variability). The ambiguity and inaccuracy of subjective descriptions of the dysarthria condition can also be apparent in the intelligibility section of the test, as several factors may affect the comprehensibility of the speaker's speech to the listener, such as familiarity with the speaker's accent through previous exposure or shared cultural experiences (Bodt et al., 2002; Nuffelen et al., 2009). In addition to the inter-rater variability reported in previous research (Carmichael & Green, 2003), discrepancies in intra-rater evaluation, i.e. the same listener scoring the same data differently after re-assessing the dataset, can also affect the reliability of the intelligibility assessment in the FDA test. Another source of inaccuracy can lie in the symptom assessment procedure of the FDA. The test has a section titled "influencing factors", where the pathologist or evaluating expert is supposed to note any noticeable factors in the patient, such as the patient's emotional state or mood during the FDA interview, his/her willingness to co-operate, and level of motivation; however, the effects of these factors on the scores of the individual tasks, as well as on the overall diagnostic score, are explained only vaguely in the guidelines (Carmichael, 2007). The lack of explicit and detailed guidelines can have a substantial skewing impact on the diagnosis.

1.5.2 Acoustic Analysis of Dysarthria

Early acoustics-based studies (e.g., Lehiste, 1965; Hanley & Peters, 1971; Kent et al., 1999; Rosen et al., 2006) endeavored to characterize the correlates of dysarthric speech through analyses of speech signals. Because acoustic assessments are sensitive to variations between normal and disordered speech (Higgins & Hodge, 2002; Hustad et al., 2010; Lee, Hustad, & Weismer, 2014) and provide objective, quantitative measures at a fine-grained level, they have the potential to support early diagnosis of the disorder by giving an account of the features that correlate with dysarthria.

Previous studies have reported some general acoustic consequences of the disturbances caused by dysarthria, such as reduced intonation and F0 variability. Findings on fundamental frequency (F0) are inconsistent across studies: while some have reported lower F0 in the patient group compared with the healthy control group (Metter & Hanson, 1986; Dogan et al., 2007; Konstantopoulos et al., 2010; Yamout et al., 2013), others have found it higher (Feijo et al., 2004). Canter (1963, 1965) also noticed a decreased F0 range during syllable production and paragraph reading. F0 jitter is a characteristic of pitch in which the pitch varies randomly over consecutive periods. The harshness of the voice in patients with dysarthria, due to the time-varying characteristics of the vocal tract and vocal folds, can be the cause of increased jitter values in dysarthric speech (Thoppil et al., 2017). Teager and Teager (1990) associated increased F0 jitter with harsh speech. Another study, conducted by Ackermann and Ziegler (as cited in Murdoch, 1998), also found that F0 jitter was above the normal range in subjects with dysarthria; they further observed increased pitch levels in dysarthric subjects as well as fluctuations in the pitch contour among patients with ataxic dysarthria. Formant analyses have shown increased variability of first-formant values during vowel prolongations in comparison to the healthy group (Goberman & Coelho, 2002; Zwirner & Barnes, 1992) and a reduced F1-F2 vowel space (Weismer et al., 2001). According to some studies (Condor et al., 1989; Flint et al., 1992), F1 and F2 transition rates were flatter in speakers with dysarthria.

As dysarthria can involve deviations across many speech dimensions, the identification of candidate acoustic features that can quantitatively measure the deviations across those different dimensions is essential. In the literature, many studies on dysarthria have focused on characterizing the acoustic consequences of physiological speech subsystem perturbations on speech signals. While some of these investigations have focused on one or two of the speech dimensions of phonation, resonance, articulation, respiration, prosody, and fluency (Feijó et al., 2004; Konstantopoulos et al., 2010; Galaz et al., 2016; Kuoa & Tjaden, 2016; Lam & Tjaden, 2016; Pennington et al., 2018; Gomez et al., 2017; Whitfield et al., 2018; Ramos et al., 2020; Konstantopoulos & Karangioules, 2019), others have conducted multidimensional research to grasp a more comprehensive picture of the acoustic speech profile of patients with dysarthria (Allison & Hustad, 2018; Mucha et al., 2017; Cuartero et al., 2018; Mekyska et al., 2018).

More recently, computer science has started to play a critical role in the processing of speech. Many programs with various functionalities have been introduced that facilitate the analysis of the properties and the manipulation of digital speech data. The computerized Multi-Dimensional Voice Program constructed by Kay Elemetrics Corp. (1999) and the open-source program Praat developed by Boersma and Weenink (latest version 2017) are among the programs that provide many measures for speech evaluation. Aside from the analyses examining different speech dimensions in dysarthria, a few of which were mentioned above, a group of recent acoustics-based studies following speech processing approaches have proposed the development of more enhanced automatic procedures for the objective analysis of pathological speech by taking advantage of computer science and collaboration with related fields. The common aim of all these technology-based, interdisciplinary approaches has been to replace traditional acoustics-based analysis of speech with more enhanced automatic systems targeted at different purposes, ranging from the automatic extraction of acoustic parameters from speech signals to the automatic classification of signals extracted from speech and the automatic recognition of the voice/speech of patients suffering from the disorder. Such automated systems can not only decrease the amount of work required for the analysis of incoming audio signals (Uebler, 2006) and the consumption of time and resources, but also increase the accuracy of the results.

1.5.3 Speech Processing

Speech processing is the study of speech signals and of the methods used to process them. As the analysis of speech signals requires some form of digitization, speech processing can be regarded as an intersection of digital signal processing and natural language processing (Abhang et al., 2016). Speech processing utilizes various techniques for the development of systems such as automatic speech recognition (ASR), text-to-speech synthesis, dialog systems, and information extraction. The domain of speech processing is very vast and covers a wide range of technologies and applications; accordingly, research in this area has been growing continuously. With the advancement of scientific instruments, speech processing has also made great strides in cross-disciplinary research. In this regard, speech pathology has gained the special attention of speech scientists. By taking advantage of novel machine learning approaches and data science, speech processing research has been enabled to scrutinize speech disorders and their underlying patterns. In the following sections, two speech processing technologies that have frequently been examined in the area of dysarthria research are explicated.

1.5.3.1 Feature-Space Extraction of Speech Signals

Feature extraction involves analyzing the speech signal and extracting from it, without user control, the essential vectors needed for further use, such as power, pitch, and vocal tract configuration. Specific audio analysis tasks, such as the analysis of paralinguistics in speech or Music Information Retrieval (MIR), require the application of feature extraction techniques. Feature extraction is also a prominent process in the design of speech recognition (SR) systems, as the extracted features form the underpinning for the precise recognition of speech (Jolad & Khanai, 2018). Automatic speech recognition systems enable a computer to recognize human voice commands by converting the input signal into text through algorithms implemented in the computer. This technology can have several different applications, including clinical use for pathological speech. Although specific characteristics of dysarthric speech, such as slow speech rate, disfluency, inconsistency, involuntary sounds, and voice quality, may technically affect the performance of different types of ASR systems (Rosen & Yampolsky, 2000), several research studies have proposed diverse methods for the development of robust recognition systems for dysarthric speech (España-Bonet & Fonollosa, 2016; Bhat et al., 2016; Kim et al., 2018; Nidhyananthan et al., 2018; Mulfari et al., 2018) or explored ways of reducing the errors and enhancing the performance of ASRs for dysarthria (Moore et al., 2018; Bajpai et al., 2018). Depending on the speech recognition task, different feature extraction techniques can be exploited to extract the individual characteristics embedded in an utterance. Generally, feature extraction techniques can be categorized into temporal analysis, in which the speech waveform itself is used, and spectral analysis, which uses a spectral representation of the speech signal (Kesarkar & Rao, 2003). Some significant techniques of each type are mentioned here:

Spectral analysis techniques: critical band filter bank analysis, cepstral analysis, mel cepstrum analysis, linear predictive coding (LPC) analysis, and perceptual linear prediction (PLP).

Temporal analysis techniques: power estimation, fundamental frequency estimation, and cepstrum-based pitch determination.
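To make the distinction concrete, the following minimal sketch in base R computes two temporal measures, frame-wise RMS energy (power estimation) and zero-crossing rate, together with one spectral measure, the spectral centroid. The frame and hop lengths are illustrative assumptions (25 ms and 10 ms at 16 kHz), and the sketch illustrates the taxonomy above rather than reimplementing any particular toolkit.

  # x: a sampled waveform as a numeric vector (e.g. 16 kHz speech)
  frame_features <- function(x, frame_len = 400, hop = 160) {
    starts <- seq(1, length(x) - frame_len + 1, by = hop)
    t(sapply(starts, function(s) {
      frame <- x[s:(s + frame_len - 1)]
      rms <- sqrt(mean(frame^2))                                   # temporal: RMS energy
      zcr <- sum(abs(diff(sign(frame)))) / (2 * (frame_len - 1))   # temporal: zero-crossing rate
      mag <- abs(fft(frame))[1:(frame_len / 2)]                    # magnitude spectrum of the frame
      centroid <- sum(seq_along(mag) * mag) / sum(mag)             # spectral: centroid (in bins)
      c(rms = rms, zcr = zcr, centroid = centroid)
    }))
  }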

Most state-of-the-art feature extraction utilities are designed for use in specific domains. Furthermore, they are either libraries or targeted at off-line data processing and thus do not provide an accessible and flexible feature extractor (Eyben et al., 2010). In the field of speech research, some of the relevant feature extraction tools are the Hidden Markov Model Toolkit (Young et al., 2006), the Praat software (Boersma & Weenink, 2017), the Auditory Toolbox, a Matlab toolbox (Fernandez, 2004), and the Tracter framework (Garner et al., 2009). However, only a few of these tools are freely available. The openSMILE toolkit developed by Eyben et al. (2010) is a novel feature extractor for real-time, incremental processing which is distributed under an open-source license. It integrates feature extraction algorithms from the two domains of speech processing and Music Information Retrieval and is capable of extracting a very large number of audio low-level descriptors (LLD), such as CHROMA and CENS features, loudness, mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies. Various filters and statistical functionals can be applied to the LLDs by openSMILE. Another property of this toolkit is that it supports the extraction of the baseline acoustic feature sets of the INTERSPEECH 2009 Emotion Challenge, the INTERSPEECH 2010 Paralinguistic Challenge, the INTERSPEECH 2011 Speaker State Challenge, the INTERSPEECH 2012 Speaker Trait Challenge, the INTERSPEECH 2013 ComParE Challenge, the MediaEval 2012 TUM feature set for violent scenes detection, and other recent INTERSPEECH Challenge parameter sets.

1.5.3.2 Speech Classification System

Speech classification technology involves the automatic classification of input audio signals and the prioritization of the most relevant signals by deploying statistical measurements and algorithms on labelled training data to train an acoustic model and, subsequently, a classifier. The development of a speech classification system requires specific steps, which can generally be defined as preprocessing and defining the different classes of speech, selecting a speech feature extraction technique, modeling the speech classification, and measuring the classification performance of the system (Madan & Gupta, 2014). The classifiers are trained for the target application using the training data extracted for that specific task. According to the classes of the training material, the available classifiers serve speech detection, speech enhancement, language identification, speaker identification, topic spotting, and word spotting (Uebler, 2006). Recently, classification methods have mainly centered on statistical approaches (Sturim et al., 2007) such as Gaussian Mixture Modelling (GMM), Support Vector Machines (SVM), Hidden Markov Modelling (HMM), artificial neural networks, and normalization techniques.
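The following minimal sketch in R illustrates these stages for a binary dysarthric/healthy decision. It assumes a hypothetical data frame feats holding one row of precomputed acoustic features per utterance and a factor column label; the SVM (via the e1071 package) merely stands in for any of the statistical classifiers listed above.

  library(e1071)                                         # a common R interface to LIBSVM

  # feats: hypothetical features, one row per utterance; label in {"dysarthric", "healthy"}
  set.seed(1)
  idx   <- sample(nrow(feats), 0.8 * nrow(feats))        # simple train/test split
  model <- svm(label ~ ., data = feats[idx, ], kernel = "radial")
  pred  <- predict(model, feats[-idx, ])                 # classify held-out utterances
  mean(pred == feats$label[-idx])                        # classification accuracy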

Among the works proposing diverse methods to establish new classification systems, some have also attempted to develop speech classification systems for dysarthric speech and have obtained satisfactory results. Narendra and Alku (2018) confirmed the discriminative power of glottal parameters in combination with openSMILE-based acoustic features in a speech classification system developed by training support vector machines (SVMs) on features extracted from speech utterances with labels indicating dysarthric/healthy. Their experiments demonstrated that the glottal parameters yielded good classification accuracies of approximately 70%. Farhadipour et al. (2018) proposed a dysarthric speaker identification system using artificial neural networks, presenting a feature-extraction method based on deep belief networks; the highest accuracy achieved with the proposed system was 97.3%. In another detection study, Ijitona et al. (2017) reported that using centroid formants as an extended speech feature for detecting different severity levels of dysarthria against healthy speech gave an accuracy of 75.6%; they generated the rules for their classifier using neural networks, one of the most frequently applied machine learning techniques for classification systems. In a similar study, Prakash (2020) attempted to differentiate dysarthric and non-dysarthric audio signals using a convolutional neural network and obtained a best dysarthria detection accuracy of 68%.

As mentioned earlier, the development of an efficient dysarthric speech classification system is a very precise process with specific stages. At the feature extraction stage, the related studies have focused on extracting a wide range of acoustic features that can capture the existing variability in pathological speech patterns or sources (Falk et al., 2012; Eyben et al., 2013). These works have assessed the efficiency of several features, including spectral parameters such as mel-frequency cepstral coefficients and formants; voice quality measures such as jitter, shimmer, and the harmonics-to-noise ratio; prosodic features such as fundamental frequency, pitch contour, and RMS energy; phonological features (Dibazar et al., 2009; Liss et al., 2010; Paja & Falk, 2012; Falk et al., 2012; Eyben et al., 2013; Kim et al., 2015); and glottal parameters in combination with other acoustic parameters for developing cross-database models to detect dysarthria (Gillespie et al., 2017; Narendra & Alku, 2018).

The common approach at the feature extraction stage is to automatically extract a large number of measurement features and subsequently train a classifier using the extracted feature set. However, based on the literature, this typical methodology is not always advantageous, and increasing the number of features can paradoxically have negative consequences. It has frequently been noticed that including a large number of measurement features can lead to imperfect performance (Brunzell & Eriksson, 2000; Wang & Paliwal, 2001). According to Chiou and Chen (2013), too many features can also reduce the performance of the system and increase computing time.


A reason for the degraded performance of the classifier can be that increasing the feature dimensionality increases the amount of training data needed to train the system models, while the size of the training data is limited in practice. As a result, inadequate training data may cause the trained system models to lose their generalization properties (Wang, 2003). Another drawback of this standard feature extraction approach is that it prevents the possibility of gaining explicit knowledge of the most relevant acoustic parameters. As the dimensionality of the feature vectors increases, distinguishing relevant from irrelevant features becomes more challenging. In consequence, some irrelevant features containing less useful information may introduce errors into the classification system and corrupt the performance of the classification algorithms (Brunzell & Eriksson, 2000). As not all features contribute positively to the performance of a classifier, it becomes very important to remove this source of error from the system. A direct solution to the above-mentioned issues caused by high feature dimensionality is to reduce the dimensionality of the feature vectors: to discover those features that make the most significant contribution to the classification system and to define a minimal subset of parameters that still delivers the most essential knowledge required by the system.

1.6 Research Questions and Hypotheses

In the literature, several efforts in different research areas, including data mining, speech recognition, and machine learning, and in many fields of expertise such as electrical engineering, artificial intelligence, computer science, and other technology-based disciplines, have attempted to reduce the feature space and isolate the best subset of features by taking advantage of machine learning techniques and advanced statistical procedures, including principal component analysis (PCA), linear discriminant analysis (LDA), and others. However, this machine learning problem has been less of a focus in humanities research such as linguistics and, more specifically, phonetics. The scarcity of humanities research on feature dimension reduction may be associated with more limited background knowledge of computational methods and complex machine learning algorithms compared with the engineering fields. Furthermore, to the best of the researcher's knowledge, dimensionality reduction for the specific case of pathological speech and of feature sets extracted from dysarthric speech has not been previously addressed or explored.


Considering research on dysarthria from another perspective, as mentioned earlier, previous studies have frequently discussed the probable discrepancies in the inter- and intra-rater reliability of the current Frenchay severity assessment for dysarthria. The psychometric weakness of this subjective severity assessment (FDA) necessitates the formulation of more objective measures that can provide more accurate and consistent descriptors of severity by means of quantitative measurement. With the goal of contributing to filling these gaps, the present study aims at scrutinizing the strength of acoustic analysis for the objective assessment of dysarthria. The study will make an effort to reduce the feature space of multi-parameter models and search for an optimal subset of acoustic parameters that maintains efficient predictive power to automatically detect the severity of dysarthria by means of linear statistics. Following the objectives of this research, the alternative hypothesis of the current thesis is that a manageable subset containing the most relevant acoustic features can be isolated by means of two related statistical methods from the very large number of extracted acoustic features, and that both resulting subsets are capable of predicting the severity scores of patients suffering from dysarthria. This study will perform an analysis following a sequence of steps to test this hypothesis and probe the answers to the following questions:

1. Do both statistical methods applied in this study have the power to reduce the dimensionality of the feature space and produce meaningful subsets containing the most essential predictive acoustic features?

2. Are the generated subsets of acoustic parameters capable of predicting the Frenchay severity scores obtained from the patients with dysarthria?

The findings of this thesis can not only give insight into the efficiency of acoustic analysis in giving a precise account of the most relevant correlates of the disorder, but also provide practical results on an optimal acoustic feature set that can be used in future work on dysarthria classification systems.


2. Method

2.1 Introduction

The present study endeavored to provide a dimension reduction of the feature space in dysarthric speech. To this end, the study automatically derived a large number of acoustic parameters from samples of dysarthric speech and then selected a reasonably limited subset of acoustic parameters with efficient power to predict the severity of the disorder.

This chapter presents the methodology utilized to perform the analysis phase of the research. First, the chapter provides essential information on the corpus used in this study. Then, the instruments applied in the analyses are introduced. Finally, the procedures for scrutinizing the data and the statistical methods carried out at each step of the study are described in detail.

2.2 The Corpus of the Study

According to the literature, there are three main English databases of dysarthric speech: the Nemours Database of Dysarthric Speech, the Torgo Database of Dysarthric Articulation, and the Dysarthric Speech Database for Universal Access Research. In this research, the Nemours Database of Dysarthric Speech, which can be obtained under a licensing agreement, was used to probe the acoustic features.

The Nemours corpus includes the speech data of 11 American male participants: 1 healthy speaker as a control and 10 patients affected by varying degrees of dysarthria caused by cerebral palsy (CP) or head trauma (HT). Of the seven speakers with CP, three had CP with spastic quadriplegia, two had athetoid CP, and two had a mixture of spastic and athetoid CP with quadriplegia. The three remaining subjects suffered from head trauma (Polikoff & Bunnell, 1999; James et al., 1996; Kadi & Selouani, as cited in Patil et al., 2018).

In the corpus, each speaker is identified by a two-letter code such as SC, FB, or RK, which is also used as a directory name to segregate the participants into separate directories. Similarly, the waveform files are named with the two-letter subject code plus a sequence number, such as SC1.wav and SC2.wav, and can be found in the directory ./speech/sent/SC/wav/. All the directories include the waveforms of the parallel productions of one non-dysarthric male speaker, labelled with the code jp, such as jpbb1.wav and jpbb2.wav. Additionally, the score from the Frenchay Dysarthria Assessment (Enderby, 1983) for each participant is included in the database, as illustrated in Table 2.1.

Table 2.1
Overall score of the Frenchay Dysarthria Assessment for each subject.

Subject   Frenchay Score
BB        175
BK        46
BV        107
JF        110
KS        70
LL        164
MH        200
RK        122
RL        123
SC        88

Based on the Frenchay Dysarthria Assessment scores, the participants can be divided into three subgroups: the first, subjects FB, BB, MH, and LL, suffering from mild dysarthria; the second, with moderate dysarthria, including subjects RK, RL, and JF; and the third, with severe dysarthria, involving subjects KS, SC, BV, and BK.

In the first section of the database, each speaker produces 74 short nonsense sentences (74 waveform files) with a restricted vocabulary containing 4 to 6 words. The target words (X, Y, and Z) were selected under constraints providing closed-set phonetic contrasts, such as place, manner, and voicing contrasts, within the associated set of 4 to 6 words, so that all of the target words within a set differ in a single phoneme (Menéndez-Pidal et al., 1992-1996). The second section contains two commonly used passages of connected speech, the "Grandfather" paragraph and the "Rainbow" paragraph, produced by the same speakers.

Following the work of Werner (2018), in order to keep the input homogeneous and to avoid inessential unexpected factors, this study used only the total of 1660 sound files from the sentence section of the Nemours corpus. The structure of each nonsense sentence in the Nemours database is "The X is Y-ing the Z". An example is presented here:

“The bash is pairing the bath”
(X = bash, Y = pairing, Z = bath)


The X and Z tokens in the sentences are randomly selected from a set of 74 monosyllabic nouns, and the Y is selected from a set of 37 bisyllabic verbs. This yields 37 sentences, from which another 37 sentences are produced by swapping the X and Z tokens of the original set, as in the example below and the generation sketch that follows it. Thus, over the complete set of 74 sentences, each noun and each verb occurs twice for each talker (Menéndez-Pidal et al. 1996).

“The bad is sleeping the bin”

“The bin is sleeping the bad”
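For concreteness, the following is a minimal sketch of this generation scheme. The noun and verb lists (nouns, 74 monosyllabic items; verbs, 37 bisyllabic items), the random pairing, and the naive "-ing" suffixation are illustrative assumptions; the actual word lists and pairings come from Menéndez-Pidal et al. (1996).

import random

# A sketch of the Nemours sentence-generation scheme, assuming `nouns`
# (74 monosyllabic nouns) and `verbs` (37 bisyllabic verb forms) are given.
def build_sentences(nouns, verbs, seed=0):
    rng = random.Random(seed)
    xs = rng.sample(nouns, 37)                  # 37 nouns used as X ...
    zs = [n for n in nouns if n not in xs]      # ... the other 37 used as Z
    base = [f"The {x} is {v}ing the {z}"        # naive suffixation; real verbs
            for x, v, z in zip(xs, verbs, zs)]  # may need spelling changes
    swapped = [f"The {z} is {v}ing the {x}"     # the X/Z swap yields the
               for x, v, z in zip(xs, verbs, zs)]  # second half of the set
    return base + swapped                       # 74 sentences per talker

Each noun thus occurs once as X and once as Z, and each verb occurs in both halves, matching the description above.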

Based on the information provided in the database, the recording sessions were conducted by a speech pathologist in a sound-dampened room. Pulse Code Modulation (PCM) was used for speech coding, and the speech waveforms were sampled at 16 kHz with 16-bit resolution after low-pass filtering at a nominal 7500 Hz cutoff frequency with a 90 dB/octave filter. The whole sentence dataset is segmented at the phonemic level with a Hidden Markov Model labeler. Furthermore, the Resource Interchange File Format (RIFF) was used for all audio files.
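The stated file format can be verified directly from the waveform files; below is a minimal sketch using the Python standard library (the file path is illustrative).

import wave

# Verify the stated format: 16 kHz sampling rate, 16-bit PCM, RIFF container.
with wave.open("speech/sent/SC/wav/SC1.wav", "rb") as w:
    assert w.getframerate() == 16000   # 16 kHz
    assert w.getsampwidth() == 2       # 2 bytes per sample = 16 bit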

2.3 Data Gathering

For the extraction of acoustic features from each participant in the corpus, the openSMILE toolkit (Eyben et al. 2013) was used, an open-source feature extractor for audio-signal processing and machine-learning applications. openSMILE ships with example configuration files for commonly used feature sets, including the baseline acoustic feature sets of the 2009-2013 INTERSPEECH challenges on affect and paralinguistics. In this research, the feature set of the 2012 Interspeech Speaker Trait Challenge (Schuller et al. 2012) was used, containing a total of 6125 features derived from 64 energy-, spectrum- and voicing-related “low-level descriptors” (LLD). The names of the low-level descriptors are shown in Table 2.2, and the set of functionals applied to each of these 64 LLDs is given in detail in Table 2.3.

The suffix -sma appended to the names of some of the low-level descriptors indicates that the LLD was smoothed by a moving-average filter with window length 3. The additional suffix -de appended after -sma indicates that the feature is a first-order delta coefficient (differential) of the smoothed low-level descriptor (Schuller et al. 2012), as sketched below.
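The following is a minimal sketch of the two transformations, with a simple first-order difference as a stand-in for openSMILE's regression-based delta computation, which may differ in detail.

import numpy as np

def sma(x, win=3):
    # moving-average smoothing with window length 3, as in the -sma suffix
    return np.convolve(x, np.ones(win) / win, mode="same")

def delta(x):
    # first-order differential of the smoothed contour, as in the -de suffix
    return np.diff(x, prepend=x[0])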

Table 2.2

The 64 low-level descriptors (LLD) (Schuller et al. 2012).

4 energy related LLD:
  Sum of auditory spectrum (loudness)
  Sum of RASTA-style filtered auditory spectrum
  RMS energy
  Zero-crossing rate

54 spectral LLD:
  RASTA-style auditory spectrum, bands 1–26 (0–8 kHz)
  MFCC 1–14
  Spectral energy 250–650 Hz, 1 k–4 kHz
  Spectral roll-off point 0.25, 0.50, 0.75, 0.90
  Spectral flux, entropy, variance, skewness, kurtosis, slope, psychoacoustic sharpness, harmonicity

6 voicing related LLD:
  F0 by SHS + Viterbi smoothing, probability of voicing
  Logarithmic HNR, jitter (local, delta), shimmer (local)

All the low-level descriptors (LLDs) of the 2012 Interspeech Speaker Trait Challenge are saved to an ARFF file by default, but by making minor modifications in config/shared/standard_data_output.conf.inc, the default output was converted to the CSV file format for the feature summaries.
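A batch extraction over the corpus can be scripted around the SMILExtract binary. The sketch below assumes the configuration file name of the openSMILE 2.x distribution (config/IS12_speaker_trait.conf) and the command-line options exposed by the shared output configuration; the exact names should be checked against the installed version.

import subprocess
from pathlib import Path

CONFIG = "config/IS12_speaker_trait.conf"      # assumed IS12 config name

# Run SMILExtract once per sentence waveform, appending one CSV row per file.
for wav in sorted(Path("speech/sent").glob("*/wav/*.wav")):
    subprocess.run(
        ["SMILExtract",
         "-C", CONFIG,
         "-I", str(wav),                       # input waveform
         "-csvoutput", "features.csv",         # summary output as CSV
         "-instname", wav.stem],               # label rows with the file name
        check=True)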

Table 2.3

Applied functionals (Schuller et al. 2012).

Functionals applied to LLD / ∆ LLD:
  quartiles 1–3, 3 inter-quartile ranges
  1% percentile (≈ min), 99% percentile (≈ max)
  position of min / max
  percentile range 1%–99%
  arithmetic mean¹, root quadratic mean
  contour centroid, flatness
  standard deviation, skewness, kurtosis
  rel. duration LLD is above / below 25 / 50 / 75 / 90% range
  rel. duration LLD is rising / falling
  rel. duration LLD has positive / negative curvature²
  gain of linear prediction (LP), LP coefficients 1–5
  mean, max, min, std. dev. of segment length³

Functionals applied to LLD only:
  mean of peak distances
  standard deviation of peak distances
  mean value of peaks
  mean value of peaks – arithmetic mean
  mean / std. dev. of rising / falling slopes
  mean / std. dev. of inter maxima distances
  amplitude mean of maxima / minima
  amplitude range of maxima
  linear regression slope, offset, quadratic error
  quadratic regression a, b, offset, quadratic error
  percentage of non-zero frames⁴

Notes: ¹ arithmetic mean of LLD / positive ∆ LLD. ² only applied to voice-related LLD. ³ not applied to voice-related LLD except F0. ⁴ only applied to F0.

The meaning of some of the functionals applied to the LLDs, as described in the openSMILE documentation (Eyben et al. 2013), is presented here (a small computational sketch follows the list):

range: max-min

maxPos: The absolute position of the maximum value (in frames).

minPos: The absolute position of the minimum value (in frames).

amean: The arithmetic mean of the contour.

linregc1: The slope (m) of a linear approximation of the contour.

linregc2: The offset (t) of a linear approximation of the contour.

linregerrA: The linear error computed as the difference of the linear approximation and the actual contour.

linregerrQ: The quadratic error computed as the difference of the linear approximation and the actual contour.

stddev: The standard deviation of the values in the contour.

skewness: The skewness (3rd order moment).

kurtosis: The kurtosis (4th order moment).

quartile1: The first quartile (25% percentile).

quartile2: The second quartile (50% percentile, i.e. the median).

quartile3: The third quartile (75% percentile).

iqr1-2: The inter-quartile range: quartile2-quartile1.

iqr2-3: The inter-quartile range: quartile3-quartile2.
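To make these definitions concrete, the following is a minimal sketch (not the openSMILE implementation; in particular, the normalization of the error terms may differ) of computing several of these functionals over a single LLD contour.

import numpy as np

def functionals(x):
    # a few of the Table 2.3 functionals for one LLD contour x[0..N-1]
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    m, t = np.polyfit(n, x, 1)             # linear approximation x ~ m*n + t
    approx = m * n + t
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return {
        "amean": x.mean(),
        "stddev": x.std(),
        "maxPos": int(x.argmax()),         # position of the maximum, in frames
        "minPos": int(x.argmin()),
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr1-2": q2 - q1,                 # inter-quartile ranges
        "iqr2-3": q3 - q2,
        "linregc1": m,                     # slope of the linear approximation
        "linregc2": t,                     # offset
        "linregerrA": np.abs(x - approx).mean(),    # linear error
        "linregerrQ": np.mean((x - approx) ** 2),   # quadratic error
    }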
