

II. Literature review

3. Automatic speech recognition and speech intelligibility

Automatic speech recognition (ASR) refers to systems that use various methodologies and technologies to recognize speech, or in other words to translate spoken language into text by computer. The translation process, however, is quite complex and involves several steps. First, the acoustic speech signal is recorded using a microphone. The signal is then processed by the computer and further adjusted (i.e. filtered, denoised, resampled to the required frequency and cut into slices). The resulting signal slices can then be analyzed as separate entities. Depending on the method, the program extracts features from those slices and compares them to a source-language library, essentially finding a matching segment from that library so that the analyzed slice can be recorded as a certain sound. Some existing systems, such as the system described below, can also return several alternative hypotheses together with a confidence score for the segment in question. These segments are then combined, or concatenated, into words, and words may be concatenated into sentences. Modern ASR systems also allow the use of different language models, mathematical analysis methods and domain-specific vocabularies that increase accuracy.
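
To make the slicing and feature-extraction step described above more concrete, the following is a minimal illustrative sketch in Python (not taken from any of the systems discussed here), assuming the librosa library; the file name and framing parameters are arbitrary example values.

import librosa

# Load the recording and resample it to a rate commonly used by ASR front ends.
signal, sr = librosa.load("recording.wav", sr=16000)

# Cut the signal into short overlapping slices (25 ms frames every 10 ms)
# and extract one MFCC feature vector per slice.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(mfcc.shape)  # (13, number_of_slices): one feature vector per analyzed slice

Each column of this matrix corresponds to one of the "slices" mentioned above; a recognizer then matches these feature vectors against its acoustic and language models.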

Even though a thorough description including all the mathematical and computational details is beyond the scope of this research, a simple metaphor by Kitzing et al. (2009) may help to understand the inner workings of such systems. They describe this process as a jigsaw puzzle or a memory game, where “a picture can only be entirely right or wrong”, but “in a puzzle the picture is built up by many pieces, some of which sometimes may be ‘almost right’”. Applied to ASR systems, this comparison provides some insight into why such systems are not always able to produce a 100% correct result.

Modern advances in ASR have made this technology widely available and easy to set up, which has made it an option for transcribing audio input from various sources. In some ways, it may yield better results than manual assessment, since it is not affected by subjectivity, emotion, fatigue, or accidental lapses of concentration (Neri et al., 2010).

For these reasons, general-purpose off-the-shelf ASR systems such as Google ASR4 (gASR) are becoming progressively more popular (see Meeker, 2017; Këpuska & Bohouta, 2017). Such systems are very useful tools for assessing non-specific speech, since they are trained on large datasets, span several different domains, and are constantly improved.
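
As an illustration of how such an off-the-shelf system is typically accessed, the following is a sketch under the assumption of the google-cloud-speech Python client; the file name, sample rate and language code are placeholder values, not settings used in this study.

from google.cloud import speech

client = speech.SpeechClient()

# Read one recording and send it to the service for transcription.
with open("speaker_recording.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=3,  # request several hypotheses, as mentioned above
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    for alternative in result.alternatives:
        # Each alternative comes with a confidence score for the prediction.
        print(alternative.transcript, alternative.confidence)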

The main problem with ASR systems, however, is that while they are quite good at recognizing native English speech, recognition accuracy for non-native speakers, and sometimes even for speakers of different English dialects, can be quite poor, mostly due to the heterogeneity of non-native speech (Park & Culnan, 2019). Google reports the error rate of gASR as 5% for native speech and up to 30% for non-native speech (Tejedor-García et al., 2021). This measure is commonly reported as the word error rate, or WER.

According to Ali and Renals (2021), word error rate (WER) is commonly used to ‘evaluate the performance of a large vocabulary continuous speech recognition (LVCSR) system’. The sequence of words ‘hypothesized by the ASR system is aligned with a reference transcription, and the number of errors is computed as the sum of substitutions (S), insertions (I), and deletions (D)’. If there are N words in the reference transcription in total, then the WER is calculated using the following formula:

WER = (I + D + S) / N × 100

(Ali et al., 2016)
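
The counts S, I and D are obtained from a word-level alignment between the hypothesis and the reference. As a minimal illustrative sketch (not code from the cited works), WER can be computed with a standard Levenshtein alignment over words:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            insertion = d[i][j - 1] + 1
            deletion = d[i - 1][j] + 1
            d[i][j] = min(substitution, insertion, deletion)
    return d[len(ref)][len(hyp)] / len(ref) * 100

# One deleted word out of six reference words gives a WER of about 16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))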

For WER to be reliable, however, ‘at least two hours of data’ are needed ‘for a standard LVCSR system’ (Ali and Renals, 2021). The common problem with this procedure is that the data in question has to be manually transcribed, at least at the sentence level. In this case, however, due to the nature of the experiment, all of the reference sentences are already available and paired with voice recordings comprising about 1.5–2 hours of speech for each speaker.

4 https://cloud.google.com/speech-to-text, accessed on 19 October 2021

There are also studies (Park et al., 2008; Ali & Renals, 2018) that have used WER or its analogues to quantify speech production accuracy. Even though some authors (He et al., 2011; Favre et al., 2013) suggest that WER is not a good metric, they discuss different applications, such as training a speech recognizer for a speech translation task (He et al., 2011) or “human subjects’ success in finding decisions” (Favre et al., 2013). Since the goal of this study is to relate recognition accuracy to speakers’ language level and production accuracy, and to find out whether there is a general relation between speech intelligibility and recognition accuracy, WER is expected to serve as a good measure of how much a Russian accent affects the intelligibility of L2 English.
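
To illustrate how such a comparison could be carried out, the following sketch relates per-speaker WER to intelligibility ratings; the speaker codes and all numbers are hypothetical placeholders, not results of this study.

from scipy.stats import pearsonr

# Hypothetical placeholder values, not data from this study.
wer_by_speaker = {"S01": 12.5, "S02": 27.0, "S03": 8.3, "S04": 19.4}   # % WER from ASR output
intelligibility = {"S01": 4.2, "S02": 3.1, "S03": 4.7, "S04": 3.8}     # mean listener ratings

speakers = sorted(wer_by_speaker)
r, p = pearsonr([wer_by_speaker[s] for s in speakers],
                [intelligibility[s] for s in speakers])
print(f"Pearson r = {r:.2f}, p = {p:.3f}")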

Lane (1962) discusses intelligibility and accent, noting that non-native speakers of English may have problems making themselves understood, especially in non-ideal listening conditions.

Flege (1995) expands on this, listing several other possible problems that non-native speakers may encounter, among them the misperception of prosodic features and even frustration in listeners, which may prompt those listeners to adjust the speech they direct at L2 speakers.

As far as this study is concerned, even though many studies discuss the benefits of ASR systems (Gerosa et al., 2009; Raju et al., 2020; Krishna et al., 2019; Mirghafori, Fosler & Morgan, 1996) and their use in various situations, there are no studies that specifically examine an ASR model’s accuracy in recognizing the accented English speech of Russian L2 speakers and compare it with data from manual annotation and error analysis.